An Iterative Scheme for Leverage-based Approximate Aggregation

11/06/2017 ∙ by Shanshan Han, et al. ∙ Harbin Institute of Technology

Currently, the data explosion poses great challenges to approximate aggregation in terms of both efficiency and accuracy. To address this problem, we propose a novel approach that calculates aggregation answers with high accuracy using only a small share of the data. We introduce leverages to reflect the individual differences of samples from the statistical perspective. Two kinds of estimators, the leverage-based estimator and the sketch estimator (a "rough picture" of the aggregation answer), are bound by constraint relations and iteratively improved according to the actual data conditions until their difference falls below a threshold. Due to the iteration mechanism and the leverages, our approach achieves high accuracy. Moreover, several features, including the fact that sampled data need not be recorded and that the approach easily extends to various execution modes (such as the online mode), make it well suited to big data. Experiments show that our approach performs very well: compared with uniform sampling, it achieves high-quality answers with only 1/3 of the sample size.


I Introduction

The development of intelligent devices and the growth of informatization have brought about an unprecedented data explosion, which poses great challenges to data aggregation. When dealing with big data, it is usually impractical to compute an accurate answer by a full scan of the data sets due to the high computation cost, while approximate aggregation is more economical. Meanwhile, users today often expect high-quality answers but do not want to wait long, and they would like the data analysis system to be flexible and easy to extend. In light of this situation, an efficient, high-precision, and flexible approximate aggregation approach is in great demand.

To effectively execute approximate aggregation on big data and balance accuracy and efficiency, researchers proposed bi-level sampling [1], which considers the local variance of the data when generating the sampling rate. However, it does not consider the individual differences in the data, even though data with different features contribute differently to the aggregation answers. For example, in SUM aggregation, some data (outliers) are extremely large but have very low probabilities of being sampled. Once they are sampled, their extremely high values significantly affect the aggregation answers. Under this condition, these data should not be handled identically with the others, and neglecting their individual differences produces a loss of accuracy.

To solve this problem, researchers introduced leverages to reflect the different influences of data on the global answers [2]. The leverage of a data point is calculated using its value as well as all the other data. To reflect the individual differences in the data, a biased sampling process is performed in [2], and for each data point, a biased sampling probability is generated from its leverage. This technique provides an unbiased estimate of the accurate value. It also considers the individual differences in the data, thus leading to high accuracy. However, several drawbacks make it unsuitable for big data. Most importantly, this technique requires recording all the data, since the leverage of a data point is calculated from its difference relative to all the other data. As a result, all the data are involved in calculating the leverages, which costs much computation time when dealing with big data.

A solution to this problem is to draw a uniform random sample from the data set, calculate "leverage-based" probabilities to re-weight the samples in the same way as the biased sampling probabilities, and use the samples and the leverage-based probabilities to generate the final answer. Over this sample set, the expectation of the average calculated by accumulating the products of each sample's leverage-based probability and its value is an unbiased estimate of the accurate average of the samples [2]. Since the distribution of the sample is considered the same as the distribution of the whole data set [3], the accurate average of the sample is considered the same as the accurate average of the data set, so the average calculated with the leverage-based probabilities and the samples is an unbiased estimate of the accurate average of the data set.
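As a minimal Python illustration of this re-weighting pipeline (the leverage weights below are placeholders chosen only for the sketch, not the leverages of [2] or of Section IV):

import random

def leverage_weighted_avg(samples, lam=0.1):
    # Blend placeholder leverage weights with the uniform probability 1/m,
    # then accumulate probability * value to form the estimate.
    m = len(samples)
    mean = sum(samples) / m
    raw = [abs(x - mean) + 1e-9 for x in samples]   # placeholder leverage scores
    total = sum(raw)
    lev = [s / total for s in raw]                  # normalized so they sum to 1
    probs = [lam * l + (1 - lam) / m for l in lev]  # leverage-based probabilities
    return sum(p * x for p, x in zip(probs, samples))

population = [random.gauss(100, 20) for _ in range(100_000)]
sample = random.sample(population, 1_000)
print(leverage_weighted_avg(sample))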

However, when dealing with big data, calculating the leverages of the samples requires recording all the samples, which decreases efficiency. A solution is to define the leverage of each sample from the current and previous samples while sampling. We can maintain a few variables that record the "general conditions" (e.g., the average or median) of the previous samples, instead of all the samples, to calculate the leverage of the current sample, which requires much less storage space. However, this approach is sensitive to the sampling sequence, and samples with the same value may receive different leverages. For example, suppose the individual difference of a sample is defined as its value divided by the sum of the values of the current sample and all the previous samples (at this time, only the average of the previous samples needs to be recorded). If the value 10 is drawn first and then drawn again, its two occurrences receive leverages 1 and 0.5, while in a different sampling sequence the occurrences of 10 receive other leverages. The same samples drawn in different sampling sequences may thus produce different leverages and different aggregation answers, leading to poor robustness.
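The order sensitivity can be seen directly with the running-sum definition above; the sequences used here are illustrative, since the original example values were not preserved:

def running_sum_leverages(sequence):
    # Leverage of each sample = its value divided by the running sum of all
    # values seen so far, including itself.
    total, levs = 0.0, []
    for x in sequence:
        total += x
        levs.append(x / total)
    return levs

print(running_sum_leverages([10, 10, 5]))  # the two 10s get leverages 1.0 and 0.5
print(running_sum_leverages([5, 10, 10]))  # the same 10s now get 0.667 and 0.4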

Another solution is to calculate the leverages off-line to accelerate the online processing. For example, similar to [21] [4], we could refer to previous query results or compute summary synopses in advance. However, the off-line processing may also be impractical, since it is usually too expensive for big data given constrained time and resources. Additionally, such approaches may be less flexible when dealing with queries on new data sets.

Some other drawbacks also make the previous approaches less efficient when dealing with big data. The degree of the leverage effects, i.e., how much influence the leverages have on the aggregation answers, is fixed in [2]. However, to obtain better results, the actual conditions of the data should be considered when determining whether the leverage effects should be "strong" or "weak". When a "weak" leverage effect is enough, applying a "strong" leverage effect leads to poor answers, and vice versa. Thus, the fixed degree of the leverage effects in [2] brings about a loss of accuracy to some degree. Besides, leverages are calculated in a single way in [2], while data with different features should be assigned different leverages due to their different contributions to the aggregation answers. Moreover, to reflect the individual differences in the data, biased sampling is adopted in [2], which is much more difficult to implement than uniform sampling and thus may decrease efficiency when dealing with big data.

Contributions. In this paper, we propose a novel leverage-based approximate aggregation approach that overcomes the stated limitations and efficiently computes aggregation answers with a precision assurance. To overcome the limitation of uniform sampling while inheriting its advantages, we draw uniform samples and use leverages to generate probabilities that re-weight the samples to reflect their individual differences. To overcome the limitation of the traditional simple leverages, we divide the data into regions according to their features and contributions, and assign different leverages to them. To increase the accuracy, we introduce an iteration scheme that improves two constrained estimators and intelligently determines the degree of the leverage effects according to the actual conditions. An objective function is constructed, which makes our approach insensitive to the sampling sequence and removes the need to record samples. Our main contributions are summarized as follows:

  1. A novel methodology for a high-precision estimate is proposed, which involves generating two estimators using different methods to iteratively process constrained modulations according to the actual conditions of the data.

  2. A sophisticated leverage strategy which considers the nature of data is proposed, in which the data are divided into regions and appropriately handled.

  3. An objective function is constructed with the leverages and the samples, which avoids the sensitivity of the sampling sequences as well as storing the samples.

  4. We conducted experiments comparing our approach with uniform sampling, and the results show that it achieves high-quality answers with only 1/3 of the sample size.

  5. To the best of our knowledge, iterative leveraging is applied to data management for the first time.

In this paper, we focus on AVG aggregation, as it is one of the most common aggregation operations. Meanwhile, the answer of SUM aggregation can easily be obtained by multiplying the average by the number of data points, which can be obtained from the metadata or computed from the data size. Other aggregation functions, such as extreme value aggregation, will be studied in detail in the future.

Organization. We overview our approach in Section II, and introduce the preprocessing calculation in Section III. We introduce a sophisticated leverage strategy in Section IV, and propose different modulation strategies for the iteration scheme according to the actual conditions of the data in Section V. The core algorithm is proposed in Section VI, and extensions of our approach are discussed in Section VII. We present the experimental results in Section VIII, survey related work in Section IX, and finally conclude the whole paper in Section X.

II Overview

In this paper, we propose a novel methodology of obtaining high-precision estimates. Based on the methodology, we developed a system to process AVG aggregation queries.

II-A Methodology

We generate two estimators using different estimation methods and evaluate their bias, i.e., the relations between the accurate value and the estimators. We then modulate these estimators toward the accurate value according to the bias conditions to obtain proper answers.

We evaluate the deviations of the estimators according to the actual conditions of the data. Although the accurate value is unavailable, we can tell from the distribution of the sampled data whether an estimator is larger or smaller than the accurate answer, and which estimator is closer to it (discussed in detail in Section V). Based on such relations, the two estimators are iteratively "modulated" toward the accurate value, and the one with more deviation is modulated more in each iteration. When the estimators are approximately equal to each other, they have arrived at the accurate answer, and a high-precision answer is obtained.

Fig. 1: Modulations of two conditions

There are two cases about the relations between the accurate value and estimators, as shown in Fig. 1. One is that the accurate value is between the two estimators. The other is that the two estimators are on the same side of the accurate value. In the first case, the larger of the two estimators is decreased, and the smaller one is increased. In the second case, the estimators are modulated in the same direction. According to the deviations of the estimators, we tell which estimator is farther from the accurate value, then set different step lengths to obtain a high-precision, unbiased estimate.

Theorem 1.

Consider two estimators E1 and E2 whose deviations from the accurate value are d and d + δ (d ≥ 0, δ ≥ 0). If the modulation step lengths of E1 and E2 are γs and s (0 < γ ≤ 1, s > 0), respectively, an unbiased answer can be obtained when γ = d/(d + δ).

Proof.

We consider the first case in Fig. 1 and denote the accurate value as v, so that E1 = v − d and E2 = v + d + δ. Suppose that after k rounds of modulation both estimators reach v, i.e., E1 + kγs − v = 0 and E2 − ks − v = 0. Then kγs = d and ks = d + δ, leading to γ = d/(d + δ). The proof of the second case is similar. ∎
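A small Python sketch of this modulation rule; it uses the accurate value v (which is unknown in practice) only to make the step-ratio condition of Theorem 1 concrete, and the 0.25 shrink factor per round is arbitrary:

def modulate(e1, e2, v, rounds=30):
    # Move both estimators toward v; the step ratio equals the deviation ratio,
    # so both deviations shrink by the same factor and vanish together.
    for _ in range(rounds):
        d1, d2 = v - e1, e2 - v
        if abs(d1) + abs(d2) < 1e-12:
            break
        base = 0.25 * max(abs(d1), abs(d2))
        ratio = min(abs(d1), abs(d2)) / max(abs(d1), abs(d2))
        s1 = base * (ratio if abs(d1) < abs(d2) else 1.0)
        s2 = base * (ratio if abs(d2) < abs(d1) else 1.0)
        e1 += s1 if d1 > 0 else -s1
        e2 -= s2 if d2 > 0 else -s2
    return e1, e2

print(modulate(e1=95.0, e2=108.0, v=100.0))   # both converge to ~100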

Fig. 2: System flow chart

II-B System Architecture

According to the methodology above, we adopt two estimators: the sketch estimator (sketch) and the leverage-based estimator (l-estimator). The sketch estimator, initially generated with a relaxed precision requirement, describes a "rough picture" of the aggregation answer. The leverage-based estimator is calculated with the samples and the leverages, where the individual differences of the samples are considered.

We establish a system to process AVG aggregation using the sketch estimator and the l-estimator. Queries are of this form: SELECT AVG(column) FROM database WHERE desired precision, where desired precision is the precision requirement indicated by the users. The flow chart of our system is shown in Fig. 2.

When faced with big data, centralized storage is impractical. Thus, without loss of generality, we assume the data are stored on multiple machines, i.e., in blocks. In this setting, it is effective to compute on each block and then gather the partial results to generate the final answer. Considering this, we divide the main functions into three modules: Pre-estimation, Calculation, and Summarization. The Pre-estimation module calculates the parameters for later computation. The Calculation module processes iterations to obtain partial answers on each block. The Summarization module collects the partial answers to generate the final aggregation answers. We now overview these modules.

Pre-estimation module. This module calculates the sampling rate and the sketch estimator for later computations. To satisfy the desired precision indicated by the users, we calculate a sampling rate to draw samples in the blocks. The sketch estimator is then generated with a relaxed precision as an overall picture of the final answer, which is to be later modulated to increase accuracy in blocks in the Calculation module. Details of this module will be discussed in Section III.

Symbol Meaning
Required precision, indicated by the users.
Sampling rate.
Accurate average aggregation answer.
The value of the l-estimator.
The leverage degree.
The leverage allocating parameter.
The step length factor.
The objective function for the iterations.
The data size.
The set of S samples.
The set of L samples.
The normalization factor for the S and L data.
The iteration threshold.
The deviation degree of the sketch estimator, defined as |S|/|L|.
Data boundary parameters.
|S|, |L| The number of data in the S or L region.
The initial value of the sketch estimator.
The modulation step lengths for the l-estimator and the sketch estimator.
TABLE I: A summary of the main notations

Calculation module. The Calculation module mainly processes core computations on the blocks. A data division criteria (data boundaries) is established according to the distribution feature to divide data into different regions, thus samples with different features can be treated differently. In each block, samples are drawn according to the sampling rate. Based on the data boundaries and the sketch estimator, partial answers are iteratively computed on the blocks.

In each block, once the samples are picked, they fall into specific regions according to the data boundaries. Only samples in certain regions, which are distinctive enough to represent the whole distribution, are further considered, since an approximate distribution can be determined from these samples. Using these samples, the leverage-based estimator is generated to reflect the individual differences in the samples, and the l-estimator and the sketch estimator are iteratively modulated to generate high-precision answers.

We discuss the measures for the individual differences of the samples in Section IV, where data boundaries and leverages are explained in detail. We then illustrate the modulation strategies for the l-estimator and the sketch estimator in Section V, and finally describe the core algorithms that compute the proper partial answers in the blocks in Section VI.

Summarization module. This module collects the partial answers to generate the final answer. Since these partial answers represent the average conditions in the blocks, and blocks with more data contribute more to the aggregation, the weight of each partial answer is set proportional to its block size. The final answer is thus calculated by summing, over all blocks, the partial answer multiplied by the block size divided by the total data size.
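A minimal sketch of this summarization step (the block sizes and partial answers are illustrative):

def combine_partial_answers(partials):
    # partials: list of (block_size, partial_avg) pairs, one per block.
    # The final AVG weights each partial answer by its block size over the total size.
    n = sum(size for size, _ in partials)
    return sum(size / n * avg for size, avg in partials)

print(combine_partial_answers([(1_000_000, 99.7), (500_000, 100.4), (250_000, 100.1)]))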

The main notations in this paper are summarized in Table I. In this paper, we assume that the blocks provide unbiased samples for their local data. For ease of illustration, we assume data in the blocks are independent and identically distributed (i.i.d.), and extend our approach to the non-i.i.d. distributions in Section VII-C.

We mainly discuss normal distributions, since normal distributions are the most consistent with actual situations [5]. We also provide extensive discussion showing that our approach can be adapted to other distributions in Section VII-B, and experimentally evaluate its performance under some extreme conditions, such as the uniform distribution and the exponential distribution, in Section VIII-E. Actually, many models assume that data are normally distributed, such as linear regression [6], which assumes that the errors are normally distributed, and non-normal distributions can even be transformed to normal distributions [7].

III Pre-estimation

Pre-estimation module calculates the sampling rate and the sketch estimator for later computation.

III-A Sampling Rate

To satisfy the desired precision indicated by the users, the system calculates a sampling rate based on which the blocks draw samples.

For different aggregation tasks, the desired precision indicated by the users differs, and a proper sampling rate should be calculated accordingly. We assume that the corresponding sample size is m. To calculate m, we introduce the confidence interval [8], which provides a precision assurance that the accurate answer lies within it.

Definition 1.

Define X = {x₁, x₂, …, x_m} as a sample set generated from a normal distribution, and let x̄ be the average of the samples. For confidence 1 − α, the confidence interval of the mean is [x̄ − z_{α/2}·σ/√m, x̄ + z_{α/2}·σ/√m], where σ is the standard deviation and z_{α/2} is a parameter determined by α.

According to Neyman's principle [3], the confidence is specified in advance. In our problem, the confidence interval is determined by the aggregation answer and the desired precision ε, since we would like the accurate answer to lie in the interval [x̄ − ε, x̄ + ε], which is the confidence interval in our problem. According to Definition 1, the length of the confidence interval is 2ε, where ε = z_{α/2}·σ/√m. The required sample size is then obtained as m = (z_{α/2}·σ/ε)², and the sampling rate is computed as

r = m/n = z²_{α/2}·σ̂²/(ε²·n),     (1)

where n is the number of data points and σ̂ is the overall estimated standard deviation. We assume n is known (actually, n can easily be obtained from the metadata or computed from the data size). To estimate σ̂, a small pilot sample is drawn, with a sample size indicated by the system in advance; the samples are uniformly and randomly picked from each block with a sample size proportional to the size of each block.
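A sketch of this pre-estimation step, assuming an absolute precision ε and a pilot sample of arbitrary size (the normal quantile comes from the Python standard library):

import random
from statistics import NormalDist, stdev

def sampling_rate(n, epsilon, confidence, pilot):
    # Half-width of the confidence interval: z * sigma / sqrt(m) <= epsilon
    # => m = (z * sigma / epsilon) ** 2, and rate = m / n.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)    # z_{alpha/2}
    sigma_hat = stdev(pilot)                          # estimated standard deviation
    m = (z * sigma_hat / epsilon) ** 2
    return min(1.0, m / n)

data = [random.gauss(100, 20) for _ in range(1_000_000)]
pilot = random.sample(data, 200)
print(sampling_rate(len(data), epsilon=0.5, confidence=0.95, pilot=pilot))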

III-B Sketch Estimator

The sketch estimator is generated with a pilot sample as an overall picture of the final answer. It is used to determine the data boundaries and is modulated to increase the precision to obtain the proper aggregation answers in the blocks later in the Calculation module.

Denote the sketch estimator as sketch and its initial value as sketch₀. Note that an arbitrary sample size does not provide any definite precision assurance. If sketch₀ were calculated with an arbitrary sample size, the later modulation of the sketch estimator would bring uncertainty and precision loss to the final answers. To ensure accuracy, sketch₀ is generated using a relaxed precision, i.e., the desired precision multiplied by a relaxed precision parameter determined by the system. In this condition, sketch₀ is provided with a correspondingly relaxed confidence interval. We generate sketch₀ in the same way as the pilot estimate above, where uniform samples are picked from each block with the sample size proportional to the block size. Similarly, the sampling rate for sketch₀ is obtained according to Eq. (1).

In this way, sketch₀ is obtained with a relaxed precision assurance. It is then modulated in each block to increase the accuracy later in the Calculation module.

IV Bias of Samples

We consider the bias and individual differences of the samples to increase the accuracy. To save cost, samples are uniformly drawn. However, data act differently in aggregation, and regarding them uniformly brings a loss of accuracy. Thus, a re-weight processing is required. In this section, we introduce how our approach reflects individual differences in the samples from the statistical perspective. A sophisticated leverage strategy is introduced in Section IV-A, which considers the nature of data and divides the data into regions then handles them differently. We then illustrate how to use leverages to generate probabilities to reflect the individual differences in the samples in Section IV-B.

IV-A Leverage Strategy

To overcome the limitations of the traditional leverages, inspired by [20], we propose a sophisticated leverage strategy that considers the nature of the data. We divide the data into regions according to the data division criterion (i.e., the data boundaries), choose the regions that are distinctive enough to represent the whole distribution, and assign various leverages to reflect the individual differences in the samples.

IV-A1 Data Boundaries

To handle data with different features , we use data boundaries to distinguish the data.

Most existing approximate aggregation approaches handle samples identically, regardless of the differences among them [9] [10]. However, data with different features contribute differently to the global answers in AVG and SUM aggregation, and neglecting the differences brings a loss of accuracy. For example, in normal distributions, some data are large and can be easily picked, which significantly contributes to the global answers. Some data are even much larger than most of the other data but have very low probabilities of being picked; they can be regarded as large outliers in AVG aggregation. However, once the large outliers are picked, their extremely large values significantly affect the aggregation answers.

To treat data with different features differently, we consider the nature of the data and divide the data into regions based on their values and positions in the normal distribution, referring to the "3σ rule" [11]. Since the data outside the outer boundaries account for a limited proportion (about 4.6% [11]) and are too far away from the middle axis, and thus contribute little to the AVG aggregation, we regard them as outliers and do not consider the boundaries at ±3σ. Therefore, the remaining boundaries divide the data distribution into 5 regions. We use sketch₀ and the standard deviation calculated in the Pre-estimation module to define the data boundaries. To control the percentages of data in these regions, we set the data boundary parameters α and β (α < β) to adjust the inner and outer boundaries. In this way, the proportion of data involved in the computation is controlled.

The data boundaries are shown in Fig. 4, where the data are divided into the following 5 regions (a small classification sketch follows the region definitions).

(1) Too small (TS). Data below the lower outer boundary are defined as "too small data". Such data have extremely low values and can hardly be sampled due to their extremely low probabilities; they can thus be treated as a kind of outlier in AVG aggregation, and their effects can be nearly neglected.

(2) Small (S). Data between the lower outer boundary and the lower inner boundary are defined as "small data". Such data account for a high proportion and have lower values than most of the others.

(3) Normal (N). Data between the two inner boundaries are defined as "normal data". These data are symmetrical around the middle axis of the distribution and have higher probabilities of being sampled than data in the other regions.

(4) Large (L). Data between the upper inner boundary and the upper outer boundary are defined as "large data". Such data have higher values than most of the others and account for a high proportion, thus contributing significantly to AVG aggregation.

(5) Too large (TL). Data above the upper outer boundary are defined as "too large data"; they have extremely high values but can hardly be sampled due to their extremely low probabilities. Thus, in AVG aggregation they can also be regarded as a kind of outlier. However, different from the TS data, once such data are sampled, they might significantly influence the aggregation answers due to their extremely high values. Thus, in AVG aggregation, such influence should be eliminated or properly handled.
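A sketch of the resulting classification; whether the boundary parameters scale the standard deviation (as assumed here) or act as direct offsets is an implementation choice:

def classify(x, sketch0, sigma, alpha=1.0, beta=2.0):
    # Boundaries at sketch0 +/- alpha*sigma (inner) and sketch0 +/- beta*sigma (outer).
    if x < sketch0 - beta * sigma:
        return "TS"   # too small: low-value outlier, discarded
    if x < sketch0 - alpha * sigma:
        return "S"    # small: kept and assigned a leverage
    if x <= sketch0 + alpha * sigma:
        return "N"    # normal: not used in the leverage computation
    if x <= sketch0 + beta * sigma:
        return "L"    # large: kept and assigned a leverage
    return "TL"       # too large: high-value outlier, discarded

print([classify(x, sketch0=100.0, sigma=20.0) for x in (30, 75, 100, 130, 175)])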

IV-A2 Leverage Assignment

Due to the different contributions of the data in AVG aggregation, we use different leverages to reflect the differences. We use the data in S and L to represent the distribution and directly discard the other data, since the S and L data contribute much to the AVG aggregation, and the shape of the distribution can even be approximately predicted from the S and L regions. As shown in Fig. 4, S and L are symmetric in the distribution, approaching the middle axis from the left and right sides, respectively, and they both account for high proportions. Meanwhile, the parameters of the distribution (the center and the spread) are reflected in the shapes of the S and L regions, and the other regions can even be approximately inferred from S and L, as shown by the dotted line in Fig. 4.

Although only some of the samples are used in the computation, the precision loss is not that significant, since the chosen samples can effectively represent the whole distributions. Meanwhile, proper leverages are assigned to the S and L data to reflect the individual differences, thus the accuracy is even increased.

Fig. 3: Data division
Fig. 4: Data contributions

Due to the different contributions of the samples in different regions, we assign different leverages to the S and L data, giving values farther from the middle axis greater leverages. The reason is that, although they have lower probabilities, they contribute more to the shapes of the normal distribution when considering its formation. As shown in Fig. 4, values farther from the middle axis better reflect whether a normal distribution is "short and fat" or "tall and thin", and such information describes the distribution. Data farther from the middle axis provide more information about the shape of the normal distribution; thus, larger leverages are assigned to them, and smaller leverages are assigned to the closer ones.

Considering a sample set, for each sample we introduce its deviation factor to calculate its leverage. Commonly, the z-score is used to determine whether data are outliers [13] [14] [15]; inspired by [2], we use it to calculate the leverages. For positive values¹, the deviation factor is positively correlated with the value. For the S and L data, we assign larger leverages to the samples farther from the middle axis: for an S sample, the leverage score decreases as its deviation factor increases (so that smaller values, which lie farther from the middle axis, receive larger scores); for an L sample, the leverage score increases with its deviation factor. (¹For ease of discussion, we assume that all the data are positive. For aggregation with negative data, we translate the distribution along the x axis to make all the data positive, process the computation, and then shift the answer back to generate the final answer.)

In our approach, to save computation time, only the S and L data are involved in the calculation; to increase the precision, the samples are assigned different leverages based on their different contributions to the aggregation.

IV-A3 Leverage Normalization

Even though we assign different leverages to the S and L data to reflect their individual differences, such leverage scores do not satisfy the constraints inherited from the probability calculation, so we cannot use them directly in the probability generation; a normalization of the leverages based on these constraints is required.

The following theorem describes Constraint 1, which is inherited from the probability generation.

Theorem 2.

The sum of the leverages equals 1.

Proof.

For each sample x_i in the sample set of size m, its probability is of the form (details are discussed in Section IV-B) p_i = λ·lev_i + (1 − λ)/m, where lev_i is the leverage and 1/m is the uniform sampling probability. Accumulating the probabilities of all the samples, Σ_i p_i = 1, that is, Σ_i (λ·lev_i + (1 − λ)/m) = λ·Σ_i lev_i + (1 − λ) = 1, leading to Σ_i lev_i = 1. ∎

However, Theorem 2 alone does not determine the concrete leverage sums of the S and L data, since their ratio is not fixed. Thus, we propose Constraint 2.

Constraint 2: The leverage sum of the samples in a specified region is proportional to the number of samples in it.

Fig. 5: Deviation of sketch₀. Solid lines: real distributions; dotted lines: estimated distributions; shaded areas: numbers of data in S and L.

We establish this constraint according to the following consideration. The data boundaries are established using sketch₀, the initial value of the sketch estimator, and the deviation of sketch₀ leads to a difference between the numbers of data in S and L. From Fig. 5, we observe that the accurate average value is closer to the region with more data. Thus, a larger sum of leverages is desirable for the region with more data, and we set it proportional to the number of samples in that region. As a result, the leverage-based estimator will be pulled toward the region with more data and thus much closer to the accurate answer.

According to these discussions, the leverage normalization is performed as follows:

  • Step 1: Leverage sum calculation. Get the sum of the leverage scores of the S and L data.

  • Step 2: Theoretical sum calculation. Calculate the theoretical sum of leverages for the S and L data based on the two constraints we proposed.

  • Step 3: Normalization factor calculation. Divide the sum of the leverage scores by the theoretical sum of the leverages to calculate the normalization factors for S and L.

  • Step 4: Leverage normalization. For each S and L sample, divide its leverage score by the corresponding normalization factor.

With the normalized leverages, the probabilities are generated to reflect the individual differences of the samples.
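A sketch of the four normalization steps, assuming the raw leverage scores of the S and L samples are already available and the leverage allocating parameter is left at 1 (no deviation correction):

def normalize_leverages(raw_s, raw_l):
    # Constraint 1: all normalized leverages sum to 1.
    # Constraint 2: each region's share is proportional to its sample count.
    m = len(raw_s) + len(raw_l)
    target_s, target_l = len(raw_s) / m, len(raw_l) / m   # Step 2: theoretical sums
    fac_s = sum(raw_s) / target_s                         # Steps 1 and 3: factors
    fac_l = sum(raw_l) / target_l
    return [r / fac_s for r in raw_s], [r / fac_l for r in raw_l]   # Step 4

lev_s, lev_l = normalize_leverages([0.8, 0.6], [1.2])
print(lev_s, lev_l, sum(lev_s) + sum(lev_l))   # the total is 1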

IV-A4 Sensitivity of sketch₀

Based on the previous discussions, we note that sketch₀ is important for the aggregation answers, since the data boundaries are established with sketch₀, which directly influences the classification of the samples. A bad sketch₀ may lead to a large difference between |S| and |L| and thus a large difference between the allocated sums of the leverages for the S and L samples. In this condition, the leverage effects of the region with more samples are too strong, leading to an over-modulation of the leverage effects on the aggregation answers.

A severe deviation of sketch₀ may happen due to unbalanced sampling. Meanwhile, a pilot sample is drawn to calculate sketch₀ in the Pre-estimation module, where a single sampled value may have a significant influence on sketch₀. In this condition, if an outlier is picked, a significant deviation of sketch₀ may occur.

To overcome the limitation of the sensitivity of sketch₀, we introduce the deviation degree, defined as |S|/|L|, to evaluate the deviation of sketch₀, and introduce the leverage allocating parameter to balance such a deviation by controlling the allocated sums of the leverages of the S and L samples. A deviation degree within a system-specified range around 1 indicates approximately no deviation of sketch₀, while a value outside the range indicates a deviation.

When an obvious deviation exists, i.e., the deviation degree falls outside the system-specified range, the leverage effect is too strong. To weaken it, we use the leverage allocating parameter to control the allocated sums of the leverages of S and L in the leverage normalization. We determine the parameter according to the actual conditions. Generally, it is set to 1. When an obvious deviation of sketch₀ occurs, we adjust the parameter so that the allocated sum of the leverages of the region with more samples is decreased. The adjustment must be kept moderate: the leverage tuning is subtle, and too strong an adjustment allocates a large sum of leverages to the region with fewer data, which decreases the accuracy. Actually, due to the confidence assurance of sketch₀, the difference between |S| and |L| is limited, so only a limited adjustment is needed in practice, chosen according to the deviation conditions of sketch₀. In this way, we shrink the leverage effects of the region with more data to balance the overly strong leverage effects. Based on this mechanism, our approach can detect and reduce obvious deviations of sketch₀, and the limitation of the sensitivity of sketch₀ is overcome.

An effective leverage strategy should understand the nature of data, divide the data into regions, and handle them differently. In this paper, we propose a sophisticated leverage strategy. Various leverages are assigned to samples to reflect their individual differences. As a result, high-quality answers can be obtained with a small sample.

IV-B Probability Generation

In our approach, samples are uniformly picked. Inspired by the SLEV algorithm [2], we use leverages to re-weight samples to reflect their individual differences. In this subsection, we discuss how to generate re-weighted probabilities with the normalized leverages and the uniform sampling probabilities.

For sample x_i in the sample set {x₁, …, x_m}, let lev_i denote its leverage, and let u_i denote the uniform sampling probability (i.e., u_i = 1/m for all i). The re-weighted probability of x_i is of the form

p_i = λ·lev_i + (1 − λ)·u_i,     (2)

where λ is the leverage degree, which indicates the intensity of the leverage effect. The aggregation answer is then obtained by accumulating the products of the probabilities and the values, Σ_i p_i·x_i.
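A sketch of Eq. (2) with hypothetical normalized leverages; with λ = 0 the estimate falls back to the plain uniform average of the retained samples, and a larger λ strengthens the leverage effect:

def leverage_based_estimate(samples, leverages, lam):
    # samples: the retained S and L samples; leverages: normalized leverages
    # summing to 1; lam: the leverage degree.
    m = len(samples)
    probs = [lam * lev + (1 - lam) / m for lev in leverages]
    return sum(p * x for p, x in zip(probs, samples))

samples = [40.0, 55.0, 120.0]
leverages = [0.40, 0.27, 0.33]                                # hypothetical values
print(leverage_based_estimate(samples, leverages, lam=0.0))   # uniform average
print(leverage_based_estimate(samples, leverages, lam=0.3))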

Here we illustrate the leverage effects and the process of leverage-based aggregation by the following example.

Example 1. Consider a data set {1, 2, 2, 3, 4, 4, 5, 5, 6, 6, 7, 8, 9, 10, 15} and a sample set {2, 3, 4, 5, 6, 7, 8, 15} randomly drawn from the data set. The accurate average of all the data is 5.8. We now generate the traditional (uniform) and the leverage-based aggregation answers.

(1) Traditional. The aggregation answer is generated by equally weighting the samples, and we get an answer of 6.25. The value 15 participates in the computation, which skews the result due to its extremely large value.

(2) Leverage-based. Suppose sketch₀ is 6.2, α = 1, and β = 3. Thus, the range of the S data is [3.2, 5.2), and the range of the L data is (7.2, 9.2]. According to our leverage strategy, only 4, 5, and 8 participate in the computation, where 4 and 5 fall in the S region and 8 is in the L region. The calculation process is recorded in Table II. To generate the leverage-based probabilities of the samples, we first calculate the original leverages (OriLev in Table II), then calculate the normalization factors (Fac) for the S and L data. After that, we obtain the normalized leverages (NorLev). Finally, the probabilities (Prob) are generated with the leverages, the leverage degree λ, and the uniform sampling probabilities. By accumulating the products of the values and the probabilities, we obtain an aggregation answer of 5.67. Due to the leverage effects, this answer is much closer to the accurate average of 5.8.

Region Val OriLev Fac NorLev Prob
S 4 89/105
S 5
L 8
TABLE II: Intermediate results of the l-estimator in Example 1

In this paper, we introduce leverages to reflect the individual differences in the samples and overcome the limitation of uniform sampling. The leverage effect is controlled through the modulation of the leverage degree λ, which is crucial to the quality of the answers. Using a fixed λ means there is no ability to modulate the leverage effects, and a bad λ leads to low accuracy. For example, if the aggregation answer calculated with the uniform sampling probabilities is already very close to the accurate answer, only a slight leverage effect is required, since only a little modulation is needed; in that case, a large λ introduces inaccuracy into the answer.

The quality of λ has a great influence on the aggregation answers, and the difficulty is that the actual conditions of the samples must be considered when deciding λ.

V Modulations

A proper λ is crucial to the quality of the aggregation answers, so an intelligent mechanism is in great demand to determine it. In our approach, modulations are processed according to the actual conditions of the samples to compute a good λ.

As already discussed, λ depends on the actual conditions and can hardly be computed directly. To obtain a proper λ that achieves a high-quality leverage-based aggregation answer, we adopt the methodology proposed in Section II-A. We adopt the leverage-based estimator (l-estimator) and the sketch estimator, and modulate them in the direction of the accurate value v to gradually increase the precision. As discussed in Section II-A, the deviations of the estimators are evaluated, and the estimator with more deviation from v is modulated more in each iteration. When they are approximately equal to each other, they have both approximately arrived at the accurate value v, and a proper answer, as well as a good λ, is obtained. To process the iterative modulations, we construct an objective function with the leverages and the samples, which avoids calculating the leverages while sampling and does not require recording the samples, making our approach insensitive to the sampling sequence.

In this section, we discuss the objective function, illustrate how to evaluate the deviations of the l-estimator and the sketch estimator, generate the modulation strategies according to the actual conditions of the data, and finally discuss how to determine the modulation step lengths.

V-A Function Construction

We construct an objective function as the difference between the l-estimator and the sketch estimator. According to Section II-A, the optimization goal is for the function value to approach 0 as the estimators are modulated toward the accurate value.

We denote the value of the l-estimator by l and set the initial value of the sketch estimator to sketch₀, which is calculated in the Pre-estimation module. We generate the l-estimator with the samples, the leverages, λ, and the uniform sampling probabilities, and generate the sketch estimator by modulating sketch₀. To evaluate the deviation between these two estimators, we construct an objective function by subtracting the sketch estimator from the l-estimator. From the discussions in Section IV-A, only the S and L data are involved in the computation, while the other data are directly discarded. We use the S and L samples to generate the leverage-based answer, and the following theorem holds.

Theorem 3 (The Leverage-based Answer).

Denote the set of the S samples as X_S and the set of the L samples as X_L. The leverage-based answer l(λ) is computed as a function of the leverage degree λ, whose coefficients are determined by the counts, sums, square sums, and cube sums of the S and L samples.

We denote the difference between the l-estimator and the sketch estimator by D. According to Theorem 3, there exists

D(λ) = l(λ) − sketch.     (3)

Initially, λ is set to 0. From Theorem 3 we know that the initial value of the l-estimator is l(0), which is the aggregation answer calculated from the S and L samples with the uniform sampling probabilities alone, without leverages.

Note that the parameters in l(λ) are computed from sketch₀ and the estimated standard deviation, as well as the counts, sums, square sums, and cube sums of the S and L samples, and these variables can be accumulated while sampling. This means that no storage space for the samples is needed. Meanwhile, the leverages are not calculated while sampling, which makes our approach insensitive to the sampling sequence.

According to D(λ), the l-estimator and the sketch estimator are iteratively modulated toward the accurate value, and the precision gradually increases.

V-B Deviation Evaluation

We now discuss how to evaluate the deviations of l(0) and sketch₀ (the initial values of the two estimators). We obtain two indicators for further processing. One is the relation among l(0), sketch₀, and the accurate value v, which reveals the modulation directions of the two estimators. The other is the estimators' deviation conditions relative to v, i.e., which estimator is farther from v. Based on these indicators, the modulations are performed on the l-estimator and the sketch estimator.

We first evaluate the deviation between sketch₀ and v from the relation between |S| and |L|, then evaluate the deviation between l(0) and sketch₀ through the initialization of the objective function D, and finally infer the relation among l(0), sketch₀, and v from the results of the former two steps.

(1) The relation between sketch₀ and v. The relation between |S| and |L| is evaluated to obtain the relation between sketch₀ and v, since the deviation of sketch₀ leads to a difference between the numbers of data in S and L, as shown in Fig. 5.

The S and L regions are defined by the data boundaries generated with sketch₀. Under ideal conditions, when sketch₀ is accurate, |S| = |L|, due to the symmetry of the S and L regions in the distribution. In practice, however, sketch₀ deviates from v, leading to |S| ≠ |L|. Thus, from the relation between |S| and |L| we can evaluate the deviation between sketch₀ and v: |S| < |L| indicates sketch₀ < v (as shown in the left of Fig. 5), where the sketch estimator should be increased; |S| > |L| indicates sketch₀ > v (as shown in the right of Fig. 5), where the sketch estimator should be decreased.

(2) The relation between l(0) and sketch₀. To determine the modulation direction of the l-estimator, we evaluate the difference between l(0) and sketch₀ by initializing D. The initial value of λ is 0, and λ is modulated through the iterations. We denote the initial function value by D₀. According to Eq. (3), D₀ = l(0) − sketch₀, which reveals the relation between l(0) and sketch₀: D₀ > 0 indicates l(0) > sketch₀; otherwise, l(0) ≤ sketch₀.

(3) The relation among l(0), sketch₀, and v. We have now obtained the relation between sketch₀ and v and the relation between l(0) and sketch₀. Combining them yields the relation among l(0), sketch₀, and v, which reveals the modulation directions of the two estimators as well as the relation between their modulation step lengths.

V-C Modulation Strategies

We have discussed the deviation evaluation of the estimators and obtained the relations among l(0), sketch₀, and v. We now illustrate how to use these relations to generate different modulation strategies according to the different conditions of the samples.

The modulation strategies include the modulation directions for the l-estimator and the sketch estimator (increase or decrease) and the relation between the modulation step lengths (which estimator is modulated more in each iteration). Suppose the modulation step lengths of the l-estimator and the sketch estimator are step_l and step_s, respectively, which indicate how much each estimator changes in a round of modulation. For ease of discussion, the step lengths are set positive: when an estimator is to be increased, its step length is added to it; otherwise, it is subtracted.

Different conditions of the samples lead to different relations among l(0), sketch₀, and v, and different modulation strategies are required for the l-estimator and the sketch estimator. Note that although we suppose each block provides unbiased samples of its local data, in practice unbalanced sampling happens with small probability. To fit this scenario, we consider unbalanced sampling and generate the corresponding modulation strategies. We obtain the relations among l(0), sketch₀, and v according to the method proposed in the last subsection, and then derive the relation between step_l and step_s based on the optimization goal of the iteration (D → 0). We now discuss the different cases and the corresponding modulation strategies.

Case 1: , : ; .

Since both l(0) and sketch₀ are smaller than v, both estimators should be increased, and the optimization goal D → 0 determines the relation between step_l and step_s. In this case, unbalanced sampling has happened: the relation between |S| and |L| and the relation between l(0) and sketch₀ point to opposite sides of v, as shown in the right of Fig. 5, which is a contradiction indicating an unbalanced sample. This seldom happens. Both the l-estimator and the sketch estimator are increased, with a larger step for the estimator with more deviation to balance the bias.

Case 2: , : ; +.

One estimator is increased and the other is decreased, and the optimization goal D → 0 constrains the sum of the step lengths. In this case, the relation between l(0) and v is unknown, and the modulation direction of the l-estimator cannot be directly determined. However, the uniform sampling probabilities cannot reflect the individual differences in the samples and work poorly compared with the leverage-based probabilities. Meanwhile, unbalanced sampling has not occurred, so a negative λ is not needed to balance a sampling bias. Therefore, we slightly increase λ for better answers.

Case 3: , : ; .

In this case both the l-estimator and the sketch estimator are increased, and the optimization goal D → 0 determines the relation between step_l and step_s. The explanation is similar to Case 2.

Case 4: , : ; .

Both the l-estimator and the sketch estimator should be decreased, and the optimization goal D → 0 determines the relation between step_l and step_s. Similar to Case 1, unbalanced sampling has occurred. When the estimators are decreased, the one with more deviation is decreased more, and λ may be negative to balance the unbalanced sampling.

Case 5: : return .

In this case, |S| and |L| are approximately balanced, indicating that sketch₀ works well as the data division criterion, for it is very close to v. No further processing is needed; we simply return sketch₀ as the aggregation answer.

In our approach, different modulation strategies are generated according to the actual conditions, and both the l-estimator and the sketch estimator are modulated toward v to increase the precision.

V-D Step Lengths

Based on the modulation strategies for the different conditions above, step_l and step_s can be determined. Since long step lengths may cause the iteration to miss proper answers, while short step lengths result in slow convergence, analogously to the gradient descent method [12] we develop a self-tuning mechanism for the step lengths to ensure both accuracy and convergence speed.

Input: j: the block id; r: the sampling rate; the data boundary information;
Output: j: the block id; arrS: counter, sum, squareSum, cubeSum of the S samples; arrL: counter, sum, squareSum, cubeSum of the L samples;
1:  Initialize arrS and arrL;
2:  Compute the sample size m of the block from r;
3:  for i = 1 to m do
4:     Draw a sample x;
5:     Classify x; // x is classified according to the data boundaries
6:     if x belongs to S then
7:        updateParams(arrS, x);
8:     end if
9:     if x belongs to L then
10:       updateParams(arrL, x);
11:    end if
12:    Drop x;
13: end for

updateParams(arr, x)

1:  arr.counter ← arr.counter + 1;
2:  arr.sum ← arr.sum + x;
3:  arr.squareSum ← arr.squareSum + x²;
4:  arr.cubeSum ← arr.cubeSum + x³;
Algorithm 1 Phase 1: Sampling

We determine the step lengths according to D and set a convergence speed ρ, where D reduces to ρ·D after an iteration. In this paper, we set ρ to 0.5, which means D is halved after each iteration. According to the optimization goal of D, we generate the relation among D, ρ, step_l, and step_s. To enforce the relation between step_l and step_s generated above, we introduce a step length factor γ (0 < γ < 1): the smaller of step_l and step_s is set to the larger one multiplied by γ. In this way, we determine the step lengths of the current iteration using D and the estimators from the last iteration.

For example, initially, D₀ = l(0) − sketch₀. In the first iteration, the modulated estimators must satisfy D₁ = ρ·D₀, and the step lengths must satisfy the ratio constraint (the smaller equals γ times the larger). We obtain step_l and step_s from these two equations and update the l-estimator, the sketch estimator, and D. The second iteration is then processed with the new values, and so on.

Determination of γ. The deviations of the l-estimator and the sketch estimator are evaluated to determine γ. As discussed above, a severe deviation of sketch₀ leads to a large difference between |S| and |L| as well as strong leverage effects on the l-estimator. According to Theorem 1, when determining γ, we should consider the severe deviation of sketch₀ and the strong leverage effects, and adopt a different γ based on the actual conditions. However, we have introduced the leverage allocating parameter to shrink the severe deviation of sketch₀ and the leverage effects in Section IV-A4. Since that parameter is determined according to the actual conditions of the samples, a fixed γ is sufficient.

Terminal condition. D is halved after each iteration, hence it steadily approaches 0 at a high rate of convergence. We introduce a threshold T for D: when |D| is no greater than T, the iteration halts.
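A sketch of the iteration under these rules, assuming the simplest case in which the two estimators move toward each other and the l-estimator takes the longer step:

def iterate_to_convergence(l0, sketch0, rho=0.5, gamma=0.6, threshold=1e-3):
    # Shrink the gap D = l - sketch by the factor rho per round; the shorter
    # step is gamma times the longer one, which is assigned to the l-estimator.
    l, sketch = l0, sketch0
    D = l - sketch
    while abs(D) > threshold:
        shrink = (1 - rho) * abs(D)      # how much the gap must close this round
        step_l = shrink / (1 + gamma)    # longer step
        step_s = gamma * step_l          # shorter step
        direction = -1 if D > 0 else 1
        l += direction * step_l
        sketch -= direction * step_s
        D = l - sketch
    return l, sketch

print(iterate_to_convergence(l0=98.4, sketch0=101.0))   # both settle near 100.0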

VI Core Algorithm

We introduce the core algorithm of our approach based on the leverage mechanism and the iterative modulation scheme. The algorithm runs in each computing block to compute the partial answer with an evolving λ, and finally returns a proper aggregation answer for the current block. Two phases are included: the sampling phase and the iteration phase.

Input: arrS: counter, sum, squareSum, cubeSum of the S samples; arrL: counter, sum, squareSum, cubeSum of the L samples; sketch₀: the initial value of the sketch estimator; T: the iteration threshold; ρ: the convergence rate;
Output: j: the block id; ans: the aggregation answer;
1:  if arrS.counter ≈ arrL.counter then
2:     ans ← sketch₀;
3:     return ans;
4:  end if
5:  Construct the objective function D(λ);
6:  Determine the modulation strategies;
7:  λ ← 0, sketch ← sketch₀, D ← D(0);
8:  while |D| > T do
9:     Calculate step_l and step_s;
10:    Update the l-estimator, sketch, and D;
11: end while
12: ans ← l(λ);
Algorithm 2 Phase 2: Iteration

VI-A Phase 1: Sampling

In the sampling phase, samples are picked and assigned to the regions they fall in. Two arrays, arrS and arrL, are set up to record the information of the S and L samples, including the counter, sum, square sum, and cube sum. Once a sample is picked, if it falls in the S or L region, the corresponding array is updated; otherwise, the sample is directly discarded, since it does not participate in the computation. The pseudocode is shown in Algorithm 1.

The two arrays arrS and arrL are initialized to record the information of the S and L samples (Line 1), and the required sample size in this block is computed (Line 2). Next, samples are drawn and classified (Lines 3-13). Once a sample is drawn, it is classified according to the data boundaries (Lines 4-5). If it is an S or L sample, the corresponding array (arrS or arrL) is updated, where the algorithm adds 1, x, x², and x³ to the counter, sum, square sum, and cube sum, respectively (Lines 6-11). The sample is then dropped (Line 12).

Complexity analysis. According to the previous discussions, this phase requires O(m) time, where m is the sample size.

Instead of recording all the samples, the information of the samples is accumulated in arrS and arrL, which are used to compute the coefficients of the objective function according to Theorem 3 in the next phase.
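A Python rendering of this phase; the block is simulated with an in-memory list and the classification boundaries are arbitrary stand-ins for the real data boundaries:

import random

def sample_block(block, rate, classify):
    # Phase 1: draw a uniform sample and accumulate only the running statistics
    # (counter, sum, squareSum, cubeSum) of the S and L samples; nothing is stored.
    stats = {"S": [0, 0.0, 0.0, 0.0], "L": [0, 0.0, 0.0, 0.0]}
    for x in random.sample(block, int(rate * len(block))):
        region = classify(x)
        if region in stats:
            s = stats[region]
            s[0] += 1
            s[1] += x
            s[2] += x * x
            s[3] += x * x * x
        # samples outside S and L are simply dropped
    return stats

block = [random.gauss(100, 20) for _ in range(100_000)]
cls = lambda x: "S" if 60 <= x < 80 else ("L" if 120 < x <= 140 else "other")
print(sample_block(block, rate=0.01, classify=cls))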

VI-B Phase 2: Iteration

In the iteration phase, modulations are processed iteratively, and a proper aggregation answer is obtained. The pseudo code is shown in Algorithm 2.

Initially, the algorithm evaluates whether |S| is approximately equal to |L|; if so, sketch₀ is directly returned as a proper aggregation answer of the current block (Lines 1-4), for it is very close to v; otherwise, the algorithm continues. The objective function D(λ) is constructed (Line 5), and the modulation strategies for the l-estimator and the sketch estimator are determined (Line 6). After initialization (Line 7), the iteration proceeds (Lines 8-11): in each iteration, D decreases by the factor ρ, based on which the step lengths step_l and step_s are calculated for the current iteration (Line 9); then the parameters are updated for the next iteration (Line 10). When the function value of D falls below the threshold T, a good λ has been obtained, and the aggregation answer of the current block is computed with this λ (Line 12).

Upper bound on iterations. As discussed in Section V-D, in each iteration D is decreased by the factor ρ, and when |D| is no more than the threshold T the iteration halts. Suppose the number of iterations is k; then |D_k| = ρ^k·|D₀| and |D_k| ≤ T, thus k = ⌈log_ρ(T/|D₀|)⌉.

Convergence. As discussed in Section V-D, the modulation step lengths step_l and step_s are calculated based on the modulation objective (D → 0) and the relation among l(0), sketch₀, and v. Meanwhile, the difference between the l-estimator and the sketch estimator decreases by the convergence factor ρ (ρ < 1) in each iteration, indicating that D converges to 0.

Complexity analysis. According to the previous discussions, the iteration phase is processed in O(k) time, where k is the number of iterations bounded above.

We use arrS and arrL to construct the objective function for the iterations, which not only requires no storage space for the sampled data but also makes our approach insensitive to the sampling sequence. Due to the iteration scheme, λ is intelligently determined according to the actual conditions, leading to high accuracy and efficiency.

VII Extensions

Our approach can be extended to fit more scenarios.

VII-A Extension to Online Aggregation

Our approach can be extended to the online mode to support further computation after the current computation is accomplished. In each computing block, arrS and arrL are stored to record the counter, sum, square sum, and cube sum of the S and L samples, respectively, instead of storing all the samples, and further computations can proceed from them. After the current round of computation, if users would like to continue to obtain an answer with higher precision, the system can continue based on the data boundaries, arrS, and arrL, since the information of the previous samples is recorded in these arrays. For the S and L samples of the new round, the same updates are applied to the counter, sum, square sum, and cube sum in arrS and arrL. Based on the updated arrays, the system processes the iterations to achieve a higher precision.
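A sketch of the online continuation: the stored statistics of the previous rounds are simply merged with those of a new round (the concrete numbers are illustrative):

def merge_stats(old, new):
    # Fold a new round's (counter, sum, squareSum, cubeSum) into the stored ones,
    # so the iterations can resume without revisiting earlier samples.
    return {region: [a + b for a, b in zip(old[region], new[region])]
            for region in old}

round1 = {"S": [3, 210.0, 14_750.0, 1_039_500.0], "L": [2, 250.0, 31_300.0, 3_925_000.0]}
round2 = {"S": [1, 72.0, 5_184.0, 373_248.0],     "L": [1, 130.0, 16_900.0, 2_197_000.0]}
print(merge_stats(round1, round2))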

VII-B Extension to Other Distributions

Our approach is proposed based on normal distributions, since actual data are usually subjected to: 1) normal distributions, 2) similar normal distributions, or 3) distributions generated by superimposing several normal distributions.

Our approach can also handle non-normal distributions due to the leverages, the iteration scheme, and the precision assurance of sketch₀. In our approach, sketch₀ is generated as a "rough picture" of the aggregation value with a relaxed precision, which provides a constraint on the result: due to its confidence assurance, the final answer cannot be far away from it. Meanwhile, leverages are assigned to the samples to reflect their individual differences, which overcomes the flaws of uniform sampling and increases the precision of the l-estimator. In the iterations, the l-estimator and the sketch estimator are gradually modulated to increase the accuracy.

In Section VIII-E we test the validity of our approach on some other distributions, such as the exponential distribution. In our approach, we use |S| and |L| to evaluate the deviation of sketch₀. When handling some extreme distributions (although we rarely process AVG aggregation on such distributions in practice), the density changes so quickly that |S| and |L| vary dramatically, and a loss of accuracy may occur due to this high rate of change.

To solve this problem, we can utilize the confidence interval of sketch₀ to generate a modulation boundary for the estimators. The confidence interval provides an assurance that the accurate answer lies in it, which also implies that the answer can hardly fall outside this range. However, when a severe difference between |S| and |L| occurs, the computed aggregation answer will fall outside the interval due to the overly strong leverage effects. This feature can thus be used to test whether the density has a high increasing (or decreasing) rate. Moreover, we can evaluate how far the aggregation answer exceeds the interval to estimate that rate, and then choose a different leverage allocating parameter to optimize the leverage effects.

The evaluation of such extreme conditions and a more detailed definition of the leverage allocating parameter will be studied in the future.

VII-C Extension to Non-i.i.d. Data

In this paper, we suppose that the data in the blocks are i.i.d. We now consider the local variance of the blocks and propose ideas for AVG aggregation on non-i.i.d. blocks. The improvements mainly come from the following two aspects.

Different sampling rates. For aggregation on non-i.i.d. distributions, to balance accuracy and efficiency, we consider the local variance of the blocks and apply different sampling rates to them. Inspired by [1], we give blocks whose data show much dispersion (or variability) a larger sample size, to obtain enough information to describe their data distributions. Considering that such dispersion is reflected by the standard deviation of a block, we use it to compute leverages that reflect the local variance of the blocks, and blocks with a higher standard deviation are assigned higher sampling rates.

For each block, its sampling rate is computed from its leverage, the overall sampling rate, the data size of the block, and the total data size. Similarly to the leverages in [2], we set the leverage of a block proportional to its standard deviation. Since such leverages are directly used in computing the sampling rates, they are adjusted so that no sampling rate becomes 0, and the sampling rate of each block is computed accordingly. To calculate the standard deviation of a block, a small sample is drawn randomly and uniformly from it in the Pre-estimation module; meanwhile, the samples from the blocks are collected to generate the overall sampling rate.
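One plausible instantiation of this idea (the exact block-leverage formula is not fully specified above, so the blend with a uniform share and the budget scaling here are assumptions):

def block_sampling_rates(block_sizes, block_sigmas, overall_rate, smooth=0.5):
    # Leverage of a block: a uniform share blended with a share proportional to
    # its estimated sigma, so no block gets rate 0; rates are then scaled so the
    # expected total sample size equals overall_rate * total data size.
    k = len(block_sizes)
    total_sigma = sum(block_sigmas)
    lev = [smooth / k + (1 - smooth) * s / total_sigma for s in block_sigmas]
    budget = overall_rate * sum(block_sizes)
    scale = budget / sum(l * n for l, n in zip(lev, block_sizes))
    return [min(1.0, l * scale) for l in lev]

print(block_sampling_rates([500_000, 300_000, 200_000], [10.0, 25.0, 40.0], overall_rate=0.01))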

Different data boundaries. Since identical data boundaries work poorly for different distributions, for non-i.i.d. data a pilot sample is drawn in each block in the Pre-estimation module to calculate local estimates and generate different data boundaries. Similar iterations are then processed to compute the proper answers in these blocks.

VII-D Extension to Other Aggregation Functions

In this paper, we focus on AVG aggregation, and the SUM aggregation answer can easily be obtained by multiplying the average by the data size. The work on extreme value aggregation (MAX and MIN) is still in progress; we now give a brief introduction.

We use a similar framework, and the main differences include 1) the recorded information (only the extreme value is recorded in each block), and 2) the sampling rate, where leverages are used to generate different sampling rates according to the local variance and the general conditions of blocks.

As discussed in Section VII-C, the sampling rates are generated based on the local variance: blocks that exhibit a higher should be sampled more than blocks with a lower . Meanwhile, considering the particularity of extreme value aggregation, the general conditions of the blocks should also be considered, since data in some blocks may be generally higher or lower than in others. Take MAX aggregation as an example: the MAX value is more likely to be in blocks with generally higher values and less likely to be in blocks with lower values.

Under this condition, considering only the local variance is insufficient when generating the sampling rates. Thus, we will develop a leverage-based sampling rate that considers both the local variance and the general conditions of the blocks. The local variance is reflected by , and the general condition of a block can be described using its average or median. Blocks with generally higher values are assigned larger leverages in the sampling rate, while blocks with generally lower values are assigned smaller leverages.
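A rough sketch of such a weighting for MAX aggregation, assuming (purely for illustration) that the dispersion and the general level of each block's pilot sample are normalized and mixed with a hypothetical parameter alpha:

import statistics

def max_sampling_weights(pilot_samples, alpha=0.5):
    # Dispersion (standard deviation) and general level (median) of the
    # pilot sample drawn from each block.
    stds = [statistics.pstdev(p) for p in pilot_samples]
    meds = [statistics.median(p) for p in pilot_samples]

    def normalize(xs):
        lo, hi = min(xs), max(xs)
        return [0.5 if hi == lo else (x - lo) / (hi - lo) for x in xs]

    # Blocks with higher variability and generally higher values receive
    # larger weights; the mixing rule itself is an assumption.
    return [alpha * s + (1 - alpha) * m
            for s, m in zip(normalize(stds), normalize(meds))]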

VII-E Extension to Distributed Systems

In some scenarios, big data are distributed over multiple machines, e.g., on HDFS. Our approach can be easily extended to distributed aggregation due to its architectural features. Meanwhile, it is also convenient for dealing with big data, since there is no requirement to store the samples.

Distributed aggregation could be implemented by performing sample-based aggregation on each machine and then collecting the partial results. Consider, for example, a transnational corporation whose massive data are stored distributedly in its subsidiaries all over the world, which creates the need to handle big data across these subsidiaries. When processing an aggregation, according to our approach, the computations are performed in each subsidiary, and the center node then collects the partial results to generate the final answer.
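A minimal sketch of this scheme (the leverage-based estimation inside each machine is abstracted here into a plain sampled average for brevity):

import random

def local_aggregate(block, sampling_rate):
    # Runs on each machine (e.g., each subsidiary): sample locally and
    # return only a small partial result, so no raw samples need to be
    # stored or shipped to the center node.
    k = max(1, int(len(block) * sampling_rate))
    sample = random.sample(block, k)
    return sum(sample) / k, len(block)  # (partial average, block size)

def global_aggregate(partials):
    # Runs on the center node: combine the partial averages, weighted by
    # the data sizes of the machines, into the final AVG answer.
    total = sum(n for _, n in partials)
    return sum(avg * n for avg, n in partials) / total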

VII-F Extension to Time Constraints

In some applications, users set time constraints for the computation, such as in [4][16]. Our system could accomplish aggregation under a time constraint with small adjustments. According to the workload, the relationship between the sample size and the run time can be obtained, based on which our system calculates the sample size affordable within the time constraint. The system then generates the precision assurance, i.e., the confidence interval, to ensure the accuracy.
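A small sketch of this adjustment, assuming a linear time model fitted from the workload and a normal-approximation confidence interval as the precision assurance:

from statistics import NormalDist

def affordable_sample_size(time_budget_ms, ms_per_sample, overhead_ms=0.0):
    # Largest sample size that fits the time budget under a linear cost
    # model measured from the workload.
    return max(0, int((time_budget_ms - overhead_ms) / ms_per_sample))

def precision_assurance(sample_std, sample_size, confidence=0.95):
    # Half-width of the normal-approximation confidence interval that can
    # be promised for the affordable sample size.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sample_std / sample_size ** 0.5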

VIII Experimental Evaluation

We conducted extensive experiments to evaluate the performance of our approach (an iterative scheme for leverage-based approximate aggregation, ISLA for short). We first compared ISLA with the uniform sampling method to evaluate the effects of the leverages; due to the leverages, our approach can achieve high-quality answers with a small sample size. We then tested the impact of the parameters on the performance of our approach. Next, we compared ISLA with the measure-biased technique proposed in sample+seek [17], the state-of-the-art approach. Finally, we evaluated the performance of ISLA on other distributions as well as on real data.

We compared the approximate aggregation answers with the accurate answer. However, when dealing with big data, it is impractical to compute accurate answers. Therefore, we used synthetic data generated with a predetermined average as the ground truth. Unless otherwise specified, the data are generated from normal distributions using two parameters, the mean and the standard deviation, with default values of 100 and 20, respectively, so the accurate averages are 100.

Platform. Our experiments were performed on a Windows PC with a 2.60GHz CPU and 4GB of RAM.

Parameters. The parameters and their default values are as follows: data size =, block number =10, desired precision =0.1, confidence =0.95, step length factor =0.8, data boundary factors =0.5 and =2.0, and the leverage allocating parameter . Normally, =1. When the deviation of exists, is generated with . When , =5. When , =10.

Unless otherwise specified, the sampling rate is determined according to the precision , the confidence , and the estimated standard deviation . Meanwhile, the data are evenly divided into parts to process the computations.
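Eq. (1) is not repeated in this section; purely as an illustration of the dependence described above (sample size inversely related to the precision and increasing with the confidence and the estimated standard deviation), a standard normal-approximation form would look like the following sketch, which is not necessarily the paper's exact Eq. (1).

from statistics import NormalDist

def required_sample_size(precision, confidence, est_std):
    # Normal-approximation sample size n = (z * sigma / e)^2, where e is
    # the desired precision (half-width) and z the two-sided quantile of
    # the given confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return int((z * est_std / precision) ** 2) + 1

# With the default parameters (precision 0.1, confidence 0.95, std 20),
# this gives roughly 1.5 * 10^5 samples.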

Data set 1 2 3 4 5
ISLA 100.247 100.108 100.206 99.9992 100.238
US 99.6321 100.172 99.7991 99.726 100.138
TABLE III: The effects of leverages. Desired precision: 0.5

VIII-A The Effects of Leverages

To test the effects of the leverages, we compared our approach with uniform sampling (US in Table III). For ease of observation, we set the desired precision to 0.5, based on which the sampling rate is generated. To intuitively show the leverage effects, we set the sampling rate of US to the required sampling rate , and reduced the sampling rate to 1/3 of the required rate for ISLA. We generated 5 data sets for the experiments (Data sets 1-5 in Table III), and the results are shown in Table III.

Table III shows that although ISLA uses fewer samples than the uniform sampling experiments, all the aggregation answers meet the precision requirement. Moreover, most of the time, the quality of the answers calculated by ISLA is even better. This is because our approach introduces leverages to reflect the individual differences among the samples. Due to the leverage effects, our approach achieves high-quality answers with only 1/3 of the required sample size.

Fig. 6: Impacts of parameters: (a) varying precision, (b) varying confidence, (c) varying number of blocks, (d) varying data boundaries. Five data sets are used; each line stands for a run.

VIII-B Impacts of Parameters

We tested the impact of the data size, the required precision, the confidence, the number of blocks, and the data boundaries.

Varying Data Size. We tested the impact of the data size on the aggregation answers using five data sizes. For each data size, experiments were performed on 10 data sets to compute an average answer.

The results returned for the five data sizes are 99.9927, 99.9999, 100.0119, 100.0035, and 100.0004, respectively. The answers are similar, indicating that the data size has hardly any influence on the aggregation answers. In fact, according to Section III-A, the sample size is related only to the desired precision, the confidence, and the estimated standard deviation, which explains why the data size barely affects the answers.

Varying Precision. We tested the changing trends of the aggregation answers as the desired precision changes. We varied the precision from 0.025 to 0.2. The experimental results are shown in Fig. 6(a): as the precision value increases, the aggregation results show a trend of divergence. This indicates that when the precision requirement is relaxed, the accuracy decreases, since the sampling rate is inversely proportional to the desired precision according to Eq. (1), and a looser precision requirement leads to a smaller sampling rate and thus lower accuracy.

Dataset 1 2 3 4 5 6 7 8 9 10 Average
ISLA 100.003 100.003 100.058 100.064 99.9831 99.9824 99.995 100.039 100.076 100.092 100.0296
MV 104.049 103.96 104.003 103.991 103.958 104.04 103.989 103.997 104.066 103.983 104.0036
MVB 100.558 100.472 100.523 100.485 100.471 100.541 100.511 100.51 100.598 100.481 100.515
TABLE IV: Comparisons of accuracy. Desired precision: 0.1
Partial 1 2 3 4 5 6 7 8 9 10 Average
ISLA 99.9253 99.9702 99.9208 100.065 100.036 99.9432 100.008 100.193 99.9573 100.016 100.003
MV 104.067 103.949 104.082 104.082 103.987 104.028 103.931 104.117 104.006 104.238 104.049
MVB 100.54 100.499 100.541 100.608 100.496 100.502 100.481 100.654 100.554 100.707 100.558
TABLE V: Comparisons of modulation abilities. Desired precision: 0.1

Varying Confidence. We tested the impact of the confidence using confidences of 0.8, 0.9, 0.95, 0.98, and 0.99. The experimental results are shown in Fig. 6(b): as the confidence increases, the aggregation answers contract around the accurate value of 100. This indicates that a higher confidence leads to a better aggregation answer, since the sampling rate increases (according to Eq. (1)), which produces a more accurate aggregation answer.

Varying Number of Blocks. We tested the impact of the number of blocks on aggregation answers. We generated 5 data sets, varied the number of blocks from 6 to 24, and recorded the aggregation answer of each data set. The results are in Fig. 6(c), which shows that the number of blocks has hardly any influence on the answers. Due to the use of iterations and leverages, high-precision answers are computed according to the actual conditions in each block, leading to the high accuracy of the final aggregation answer.

Varying Data Boundaries. We tested the impact of the data boundaries. As discussed in Section IV-A1, values outside of account for very limited proportions. Meanwhile, when processing AVG aggregation, they are far from the average of the distribution and thus contribute little to the aggregation answers. Therefore, we regard such data as outliers in the AVG aggregation and set =2. Here we test the impact of . We generated 5 data sets, varied from 0.25 to 1.5, and recorded the aggregation answer of each data set. The results are shown in Fig. 6(d).

Fig. 6(d) shows that when is 0.5 or 0.75, ISLA works well. In this case, the S and L data contain the most characteristic parts of the normal distributions, so based on such a , the S and L data can well predict the distributions. When =0.25, compared with the former setting, more samples are defined as S or L and involved in the computation. However, the results are worse, since the leverages are assigned to more data, leading to stronger leverage effects, which slightly decreases the accuracy. When becomes large, e.g., 1.25 or 1.5, the aggregation answers show a trend of divergence, indicating low accuracy. In this case, is much closer to , and the S and L data cannot predict the distributions well, since they contain only limited features of the distributions. Besides, fewer samples are used in the computation, which also decreases the accuracy. In conclusion, we suggest setting to 0.5 or 0.75.

VIII-C Comparisons

We compared ISLA with the measure-biased technique in the sample+seek framework [17]. The measure-biased technique processes SUM aggregation with off-line samples, where each data item is picked with a probability proportional to its value:

p_i = v_i / Σ_j v_j,    (4)

where v_i denotes the value of data item i.

Considering that AVG can be computed by dividing SUM by COUNT, where larger values contribute more to the SUM aggregation answers, we use Eq. (4) to re-weight the samples in the AVG aggregation. We also combine the measure-biased technique with the data division criteria in this paper, and propose another kind of probability:

  1. Probabilities on values. Probabilities are directly computed with Eq. (4), proportional to values.

  2. Probabilities on values and boundaries. Probabilities are generated based on the values and data boundaries.

For the second kind of probability, the data are divided into regions according to the data boundaries. Similar to the leverages in Section IV-A, the sum of the probabilities in a specified region is proportional to the number of samples in it. Meanwhile, for samples within a region, their probabilities are proportional to their values. For example, assume 5 samples are picked with a sum of 100, and two samples, 30 and 35, fall in region L. For the sample 30, its first kind of probability is 30/100 = 0.3, while its second kind of probability is computed as (2/5) × (30/(30+35)) ≈ 0.18.
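The two probability assignments can be sketched as follows; the five sample values are hypothetical, chosen only to match the worked example above (they sum to 100, with 30 and 35 falling in region L):

def prob_on_values(value, all_values):
    # First kind (MV): probability proportional to the value itself.
    return value / sum(all_values)

def prob_on_values_and_boundaries(value, region_values, all_values):
    # Second kind (MVB): the probability mass of a region is proportional
    # to the number of samples in it, and within the region each sample's
    # share is proportional to its value.
    region_mass = len(region_values) / len(all_values)
    return region_mass * value / sum(region_values)

samples = [10, 12, 13, 30, 35]   # hypothetical values, sum to 100
region_l = [30, 35]              # samples falling in region L
print(prob_on_values(30, samples))                           # 30/100 = 0.3
print(prob_on_values_and_boundaries(30, region_l, samples))  # (2/5)*(30/65) ~ 0.185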

In the experiments, we compared ISLA with two measure-biased approaches, the measure-biased approach with probabilities on values (MV), and the measure-biased approach with probabilities on values and boundaries (MVB), to evaluate the accuracy, the modulation effects, and the efficiency of our approach.

Accuracy. We compared the accuracy of ISLA, MV, and MVB and generated 10 data sets (Datasets 1-10 in Table IV) to run the algorithms. The experimental results are shown in Table IV.

The average results returned by ISLA, MV, and MVB are 100.0296, 104.0036, and 100.515, respectively. Only the answers calculated by ISLA satisfy the desired precision of 0.1. Meanwhile, the detailed answers in Table IV indicate that ISLA returns the most robust and high-quality answers compared with MV and MVB.

Modulation abilities. We compared the modulation abilities of ISLA, MV, and MVB to evaluate whether ISLA could properly modulate the sketch estimator in the direction of . We chose the first set of experiments (Dataset 1) in Table IV and recorded the partial answers (Partial 1-10 in Table V) to study the modulation process in each block and verify whether ISLA returns better partial results than MV and MVB. We recorded , which is , and compared it with the partial results to see whether can be properly modulated in each block. The final answers returned by ISLA, MV, and MVB are 100.003, 104.049, and 100.558, respectively. The experimental results are recorded in Table V.

Table V shows that the partial results returned by ISLA, with an average of 100.003, are much better, indicating the good modulation abilities of ISLA. The partial results returned by MV and MVB are about 104 and 100.5, respectively, which are both outside of the confidence interval, leading to poorer answers.

Efficiency. We compared the efficiency of ISLA, MV, and MVB, measured by run time. Each algorithm was run 50 times to obtain an average run time. The exact computation, which costs 442,032 ms, is also included as a comparison. The average times required to run ISLA, MV, and MVB once are 113.12 ms, 113.6 ms, and 113.36 ms, respectively, all of which save much time compared with the exact computation. The average run times are nearly the same, indicating that these approaches are all efficient, while the results returned by ISLA achieve a much higher precision.

VIII-D Experiments on Non-i.i.d. Distributions

In Section VII-C we extended our approach to process AVG aggregation on non-i.i.d. distributions, and we now test its performance. We generated 5 blocks from different normal distributions with the same data size in each block. The accurate answer is 100, calculated by averaging the accurate averages of the blocks. The desired precision is set to 0.5. We conducted the experiments 5 times, and the aggregation answers are 99.8538, 100.066, 100.194, 100.321, and 99.8333, respectively. All the results satisfy the desired precision, indicating that our approach performs well on non-i.i.d. distributions.

VIII-E Other Distributions

We experimentally show that our method is also suitable for other kinds of distributions. Similar to the comparison experiments above, we compare ISLA with MV and MVB. Unless otherwise specified, the parameters are set to their default values.

Exponential Distributions.

We designed our approach based on the symmetry of normal distributions, and we want to test its performance on asymmetric distributions. Thus, we chose the exponential distribution with the probability density function

f(x; λ) = λ e^(-λx) for x ≥ 0, and 0 otherwise.