I Introduction
Recent advances in digital technology have led to a proliferation of large-scale data sets, with examples ranging from climate and social networking data to smartphone and health data. Inferential statistical analysis of such data sets is crucial in order to quantify the statistical correctness of parameter estimates and to test hypotheses. However, the volume of the data has grown to an extent that can no longer be handled effectively by traditional statistical analysis and inferential methods. Processing and storage of massive data sets become possible through parallel and distributed architectures, but performing statistical inference on such platforms requires fundamental changes in statistical methodology. Even computing a single point estimate of a parameter of interest from the entire massive data set can be prohibitively expensive. In addition, assigning estimates of uncertainty (error bars, confidence intervals, etc.) to the point estimates is not computationally feasible using conventional statistical inference methodology such as the bootstrap [3].

The bootstrap method is known as a consistent method of assigning estimates of uncertainty (e.g., standard deviations, confidence intervals) to statistical estimates [3, 4], and it is commonly applied in the field of signal processing [5, 6]. However, for at least two obvious reasons, the method is computationally impractical for the analysis of modern high-volume and high-dimensional data sets. First, each bootstrap sample is the same size as the original big data set (with roughly 63% of the data points typically appearing at least once in each sample), leading to processing and storage problems even on advanced computing systems. Second, (re)computing the value of the estimator for each massive bootstrapped data set is not feasible even for estimators of moderate computational complexity. Variants such as subsampling [7] and the m out of n bootstrap [8] were proposed to reduce the computational cost of the bootstrap by computing the point estimates on smaller subsamples of the original data set. Implementation of such methods is even more problematic in practice, as the output is sensitive to the size of the subsamples; in addition, extra analytical effort is needed in order to rescale the output to the original sample size.

The bag of little bootstraps (BLB) [1]
modifies the conventional bootstrap to make it applicable to massive data sets. In the BLB method, the massive data set is randomly subdivided into disjoint subsets (so-called subsample modules or bags). This allows the massive data set to be stored in a distributed fashion; moreover, the subsample modules can be processed in parallel on distributed computing architectures. The BLB samples are constructed by assigning random weights drawn from a multinomial distribution to the data points of a disjoint subsample. Although BLB alleviates the problem of handling and processing massive bootstrap samples, (re)computation of the estimates for a large number of bootstrap samples remains prohibitively expensive. Thus, on the one hand, BLB is impractical for many commonly used modern estimators, which typically have high complexity and often require solving demanding optimization problems numerically. On the other hand, using the primitive least squares (LS) estimator in the original BLB scheme does not provide a statistically robust bootstrap procedure, as the LS estimator is known to be very sensitive to outliers.
In this paper we address the problem of bootstrapping massive data sets by introducing a low-complexity and robust bootstrap method. The new method possesses the same scalability properties as the BLB scheme but with significantly lower computational complexity. The low complexity is achieved by utilizing, for each subset, a fast fixed-point estimation technique stemming from the Fast and Robust Bootstrap (FRB) method [9, 10, 2], which avoids (re)computation of the fixed-point equations for each bootstrap sample via a smart approximation. Although the FRB method has lower complexity than the conventional bootstrap, the original FRB is incompatible with distributed processing and storage platforms and is not suitable for bootstrap analysis of massive data sets. Our proposed bootstrap method is scalable and compatible with distributed computing architectures and storage systems, is robust to outliers, and consistently provides accurate results at a much faster rate than the original BLB method. We note that some preliminary results of the proposed approach were presented in the conference paper [11].
The paper is organized as follows. In Section II, the BLB and FRB methods are reviewed. The new bootstrap scheme (BLFRB) is proposed in Section III, followed by an implementation of the method for the MM-estimator of regression [12]. In Section IV, consistency and statistical robustness of the new method are established. Section V provides simulation studies and an example of using the new method for the analysis of a real-world big data set. Section VI concludes the paper.
II Related bootstrap methods
In this section, we briefly describe the ideas of the BLB [1] and FRB [2] methods. The pros and cons of both methods are discussed as well.
II-A Bag of Little Bootstraps
Let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ be a $d$-dimensional observed data set of size $n$. The volume and dimensionality of the data may be so high that it cannot be processed or stored in a single node. Consider $\hat{\boldsymbol{\theta}}$ as an estimator of a parameter of interest $\boldsymbol{\theta}$ based on $\mathbf{X}$. Computation of an estimate of uncertainty (e.g., confidence intervals, standard deviation) for $\hat{\boldsymbol{\theta}}$ is of great interest, since for large data sets confidence intervals are often more informative than plain point estimates.

The bag of little bootstraps (BLB) [1] is a scalable bootstrap scheme that draws $s$ disjoint subsamples $\mathbf{X}_1, \ldots, \mathbf{X}_s$ (which form "bags" or "modules") of smaller size $b = n^{\gamma}$, $\gamma \in [0.5, 1]$, by randomly resampling without replacement from the columns of $\mathbf{X}$. For example, with $\gamma = 0.6$ and $n = 10^6$, each bag contains only $b \approx 3981$ data points. For each subsample module, $r$ bootstrap samples $\mathbf{X}_{k,1}^{*}, \ldots, \mathbf{X}_{k,r}^{*}$ are generated by assigning a random weight vector drawn from $\mathrm{Multinomial}(n, (1/b)\mathbf{1}_b)$ to the data points of the subsample, where the weights sum to $n$. The desired estimate of uncertainty $\hat{\xi}_k$ is computed based on the population of bootstrap replicas within each subsample module, and the final estimate is obtained by averaging the $\hat{\xi}_k$'s over the $s$ modules. In the BLB scheme each bootstrap sample contains at most $b$ distinct data points. Thus the BLB approach produces the bootstrap replicas with reduced effort in comparison to the conventional bootstrap [3].
Furthermore, the computation for each subsample can be done in parallel by different computing nodes. Nevertheless, (re)computing the value of the estimator for each bootstrap sample, for example thousands of times, is still computationally impractical even for estimators of moderate complexity. This includes a wide range of modern estimators that are solutions to optimization problems, such as maximum likelihood methods or highly robust estimators of linear regression. The BLB method was originally introduced with the primitive LS estimator. Such a combination does not provide a statistically robust bootstrap procedure, as the LS estimator is known to be very sensitive to outliers. Later, in Section IV of this paper, we show that even one outlying data point is sufficient to break down the BLB results.
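The resampling pattern just described is easy to sketch in code. The following is a minimal illustration, not the reference implementation of [1]: the function names, the generic weighted estimator `estimator(X, y, w)`, and the default choice $\gamma = 0.7$ are our own assumptions.

```python
import numpy as np

def blb_uncertainty(X, y, estimator, s=10, gamma=0.7, r=100, seed=None):
    """Average a per-bag bootstrap uncertainty estimate (here: SD) over s bags."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    b = int(n ** gamma)                              # bag size b = n^gamma << n
    idx = rng.permutation(n)[: s * b].reshape(s, b)  # s disjoint bags
    xi = []
    for rows in idx:
        Xk, yk = X[rows], y[rows]
        # each weight vector ~ Multinomial(n, 1/b): at most b distinct points,
        # but the weights sum to n, emulating a full-size bootstrap sample
        reps = np.array([estimator(Xk, yk, rng.multinomial(n, np.full(b, 1.0 / b)))
                         for _ in range(r)])
        xi.append(reps.std(axis=0, ddof=1))          # per-bag uncertainty
    return np.mean(xi, axis=0)                       # average over the bags
```

The inner loop is exactly where the cost concentrates: `estimator` is re-evaluated $s \times r$ times, which motivates the fast fixed-point replicas introduced below.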
II-B Fast and Robust Bootstrap

The fast and robust bootstrap (FRB) method [2, 9, 10] is computationally efficient and robust to outliers in comparison with the conventional bootstrap. It is applicable to estimators $\hat{\boldsymbol{\theta}}$ that can be expressed as a solution to a system of smooth fixed-point (FP) equations:
$\hat{\boldsymbol{\theta}} = Q(\hat{\boldsymbol{\theta}}; \mathbf{X})$   (1)
where $Q(\cdot\,; \mathbf{X}): \mathbb{R}^q \to \mathbb{R}^q$ depends on the sample $\mathbf{X}$. The bootstrap replicated estimator $\hat{\boldsymbol{\theta}}^{*}$ then solves
$\hat{\boldsymbol{\theta}}^{*} = Q(\hat{\boldsymbol{\theta}}^{*}; \mathbf{X}^{*})$   (2)
where the function $Q$ is the same as in (1) but now depends on the bootstrap sample $\mathbf{X}^{*}$. Then, instead of computing $\hat{\boldsymbol{\theta}}^{*}$ from (2), we compute
$\hat{\boldsymbol{\theta}}^{1*} = Q(\hat{\boldsymbol{\theta}}; \mathbf{X}^{*})$   (3)
where $\hat{\boldsymbol{\theta}}^{1*}$ denotes a one-step approximation of $\hat{\boldsymbol{\theta}}^{*}$ in (2) with initial value $\hat{\boldsymbol{\theta}}$, based on the bootstrap sample $\mathbf{X}^{*}$. In the conventional bootstrap, one uses the distribution of $\hat{\boldsymbol{\theta}}^{*}$ to estimate the sampling distribution of $\hat{\boldsymbol{\theta}}$. Since the distribution of the one-step estimator $\hat{\boldsymbol{\theta}}^{1*}$ does not accurately reflect the sampling variability of $\hat{\boldsymbol{\theta}}$, but typically underestimates it, a linear correction needs to be applied as follows:
$\hat{\boldsymbol{\theta}}^{R*} = \hat{\boldsymbol{\theta}} + \big[\mathbf{I} - \nabla Q(\hat{\boldsymbol{\theta}}; \mathbf{X})\big]^{-1}\big(\hat{\boldsymbol{\theta}}^{1*} - \hat{\boldsymbol{\theta}}\big)$   (4)
where $\nabla Q(\cdot\,; \mathbf{X})$ is the matrix of partial derivatives of $Q$ w.r.t. $\hat{\boldsymbol{\theta}}$. Then, under sufficient regularity conditions, $\hat{\boldsymbol{\theta}}^{R*}$ estimates the limiting distribution of $\hat{\boldsymbol{\theta}}^{*}$. In most applications, $\hat{\boldsymbol{\theta}}^{R*}$ is not only significantly faster to compute than $\hat{\boldsymbol{\theta}}^{*}$, but also numerically more stable and statistically more robust. However, the original FRB is neither scalable nor compatible with distributed storage and processing systems; hence, it is not suited for bootstrap analysis of massive data sets. The method has been applied to many complex fixed-point estimators such as the FastICA estimator [13], PCA, and highly robust estimators of linear regression [2].
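In code, the FRB correction (3)-(4) amounts to a single linear solve per replicate. Below is a minimal sketch under stated assumptions: `Q(theta, X, w)` and `dQ(theta, X)` are hypothetical user-supplied callables for the fixed-point map and its Jacobian; they are not part of the original paper.

```python
import numpy as np

def frb_replicate(theta_hat, Q, dQ, X, w):
    """One FRB replicate of a fixed-point estimator, eqs. (3)-(4)."""
    theta_1 = Q(theta_hat, X, w)                      # one-step estimate, eq. (3)
    M = np.eye(theta_hat.size) - dQ(theta_hat, X)     # I - grad Q at theta_hat
    # linear correction, eq. (4): inflates the one-step deviation so it
    # reflects the sampling variability of the fully iterated estimator
    return theta_hat + np.linalg.solve(M, theta_1 - theta_hat)
```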
III Fast and Robust Bootstrap for Big Data
In this section we propose a new bootstrap method that combines the desirable properties of the BLB and FRB methods. The method can be applied to any estimator representable by smooth FP equations. The developed Bag of Little Fast and Robust Bootstraps (BLFRB) method is suitable for big data analysis because of its scalability and low computational complexity. Recall that the main computational burden of the BLB scheme lies in recomputing the estimating equation (2) for each bootstrap sample. This computational complexity can be drastically reduced by computing the FRB replications as in (4) instead, which can be done locally within each bag. Let $\hat{\boldsymbol{\theta}}_k$ be a solution to equation (1) for subsample $\mathbf{X}_k$:
$\hat{\boldsymbol{\theta}}_k = Q(\hat{\boldsymbol{\theta}}_k; \mathbf{X}_k)$   (5)
Let $\mathbf{X}_{k,j}^{*}$ be a bootstrap sample of size $n$ randomly resampled with replacement from the disjoint subset $\mathbf{X}_k$ of size $b$, or equivalently generated by assigning a random weight vector drawn from $\mathrm{Multinomial}(n, (1/b)\mathbf{1}_b)$ to the data points of $\mathbf{X}_k$. The FRB replication of $\hat{\boldsymbol{\theta}}_k$ can be obtained by
$\hat{\boldsymbol{\theta}}_k^{R*} = \hat{\boldsymbol{\theta}}_k + \big[\mathbf{I} - \nabla Q(\hat{\boldsymbol{\theta}}_k; \mathbf{X}_k)\big]^{-1}\big(\hat{\boldsymbol{\theta}}_k^{1*} - \hat{\boldsymbol{\theta}}_k\big)$   (6)
where $\hat{\boldsymbol{\theta}}_k^{1*} = Q(\hat{\boldsymbol{\theta}}_k; \mathbf{X}_{k,j}^{*})$ is the one-step estimator and $\nabla Q(\cdot\,; \mathbf{X}_k)$ is the matrix of partial derivatives w.r.t. $\hat{\boldsymbol{\theta}}_k$. The proposed BLFRB procedure is given in detail in Algorithm 1. The steps of the algorithm are illustrated in Fig. 1, where $\mathbf{X}_k$, $k = 1, \ldots, s$, denotes the disjoint subsamples and $\mathbf{X}_{k,j}^{*}$ corresponds to the $j$th bootstrap sample generated from the distinct subsample $\mathbf{X}_k$. Note that the terms $\hat{\boldsymbol{\theta}}_k$ and $[\mathbf{I} - \nabla Q(\hat{\boldsymbol{\theta}}_k; \mathbf{X}_k)]^{-1}$ are computed only once for each bag.
While the BLFRB procedure inherits the scalability of BLB, it is radically faster to compute, since each replication can be computed in closed form from a small number of distinct data points. The low complexity of the BLFRB scheme allows fast and scalable computation of confidence intervals for commonly used modern fixed-point estimators such as the FastICA estimator [13], PCA, and highly robust estimators of linear regression [2].
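Putting the pieces together, the work done in one BLFRB bag can be sketched as follows, under the same assumptions as the previous snippet (`solve_fp`, `Q`, and `dQ` are hypothetical callables). The point of the structure is visible in the code: the fixed-point solve and the matrix inversion happen once per bag, while each of the $r$ replicates costs only one evaluation of $Q$ and a matrix-vector product.

```python
import numpy as np

def blfrb_bag(Xk, solve_fp, Q, dQ, n, r, rng):
    """FRB replicates for one BLFRB bag (steps 2-3 of Algorithm 1, sketched)."""
    b = Xk.shape[0]
    theta_k = solve_fp(Xk)                            # eq. (5), once per bag
    M = np.eye(theta_k.size) - dQ(theta_k, Xk)
    M_inv = np.linalg.inv(M)                          # factored once, reused r times
    reps = np.empty((r, theta_k.size))
    for j in range(r):
        w = rng.multinomial(n, np.full(b, 1.0 / b))   # BLB weights, sum to n
        theta_1 = Q(theta_k, Xk, w)                   # one-step estimate
        reps[j] = theta_k + M_inv @ (theta_1 - theta_k)   # eq. (6), closed form
    return reps
```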
III-A BLFRB for MM-estimator of linear regression
Here we present a practical example formulation of the method, where the proposed BLFRB method is used for linear regression. In order to construct a statistically robust bootstrap method, an MM-estimator that lends itself to fixed-point estimation equations is employed for the bootstrap replicas. Let $(y_i, \mathbf{x}_i^{\top})$, $i = 1, \ldots, n$, be a sample of independent random vectors that follow the linear model:
$y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \sigma e_i, \quad i = 1, \ldots, n$   (7)
where $\boldsymbol{\beta} \in \mathbb{R}^p$ is the unknown parameter vector. The noise terms $e_i$ are i.i.d. random variables from a symmetric distribution with unit scale.
Highly robust MM-estimators [12] are based on two loss functions, $\rho_1$ and $\rho_2$, which determine the breakdown point and the efficiency of the estimator, respectively. The $\rho_1$ and $\rho_2$ functions are symmetric and twice continuously differentiable with $\rho(0) = 0$, strictly increasing on $[0, c]$ and constant on $[c, \infty)$ for some constant $c$. The MM-estimate $\hat{\boldsymbol{\beta}}$ satisfies

$\sum_{i=1}^{n} \rho_2'\!\left(\frac{y_i - \mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}}}{\hat{\sigma}}\right)\mathbf{x}_i = \mathbf{0}$   (8)
where $\hat{\sigma}$ is an S-estimate [14] of scale. Consider the M-estimate of scale $\hat{\sigma}(\boldsymbol{\beta})$ defined as a solution to
$\frac{1}{n}\sum_{i=1}^{n} \rho_1\!\left(\frac{y_i - \mathbf{x}_i^{\top}\boldsymbol{\beta}}{\hat{\sigma}(\boldsymbol{\beta})}\right) = \delta$   (9)
where $\delta$ is a constant. Let $\hat{\boldsymbol{\beta}}_S$ be the argument that minimizes $\hat{\sigma}(\boldsymbol{\beta})$; then $\hat{\sigma} = \hat{\sigma}(\hat{\boldsymbol{\beta}}_S)$.
We employ Tukey's loss function

$\rho_c(u) = \begin{cases} \dfrac{c^2}{6}\Big[1 - \big(1 - (u/c)^2\big)^3\Big], & |u| \le c \\[4pt] \dfrac{c^2}{6}, & |u| > c, \end{cases}$

which is widely used for the $\rho$ functions of the MM-estimator, where the subscript $c$ represents different tunings of the function. For instance, an MM-estimator with 95% efficiency and 50% breakdown point for Gaussian errors is obtained by tuning $c$ to $c_1 = 1.547$ for $\rho_1$ and $c_2 = 4.685$ for $\rho_2$, respectively (see [15, p. 142, Table 19]). In this paper, (8) is computed using the iterative algorithm proposed in [16]. The initial values of the iteration are obtained from (9), which in turn is computed using the Fast-S algorithm [17].
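For concreteness, Tukey's loss and its derivative can be coded directly; this is a standard rendering of the biweight function (the function names are ours), reused by the sketch after (10)-(11) below.

```python
import numpy as np

def rho_tukey(u, c):
    """Tukey's biweight loss: smooth inside [-c, c], constant c^2/6 outside."""
    v = np.clip(u / c, -1.0, 1.0)          # clipping yields the constant part
    return (c**2 / 6.0) * (1.0 - (1.0 - v**2) ** 3)

def psi_tukey(u, c):
    """Derivative of rho_tukey: u * (1 - (u/c)^2)^2 inside [-c, c], 0 outside."""
    return np.where(np.abs(u) <= c, u * (1.0 - (u / c) ** 2) ** 2, 0.0)
```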
In order to apply the BLFRB method to the MM-estimator, (8) and (9) need to be presented in the form of FP equations that scale with the number of distinct data points in the data. The corresponding scalable one-step MM-estimates $\hat{\boldsymbol{\beta}}^{1*}$ and $\hat{\sigma}^{1*}$ are obtained by modifying [2, eqs. 17 and 18] as follows. Let $\mathbf{X}_{k,j}^{*}$ denote a BLB bootstrap sample based on subsample $\mathbf{X}_k$ with weight vector $(n_1, \ldots, n_b)$; then
$\hat{\boldsymbol{\beta}}^{1*} = \Big(\sum_{i=1}^{b} n_i\,\omega_i\,\mathbf{x}_i\mathbf{x}_i^{\top}\Big)^{-1} \sum_{i=1}^{b} n_i\,\omega_i\,\mathbf{x}_i y_i$   (10)

$\hat{\sigma}^{1*} = \frac{\hat{\sigma}}{n\,\delta}\sum_{i=1}^{b} n_i\,\rho_1\!\left(\frac{y_i - \mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}}_S}{\hat{\sigma}}\right)$   (11)

where $\omega_i = \rho_2'\big(r_i/\hat{\sigma}\big)\big/\big(r_i/\hat{\sigma}\big)$ with $r_i = y_i - \mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}}$.
The BLFRB replications of $\hat{\boldsymbol{\beta}}$ are obtained by applying the FRB linear correction, as in [2, eq. 20], to the one-step estimators of (10) and (11).
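A hedged sketch of the one-step estimates (10)-(11) follows, reusing `rho_tukey` and `psi_tukey` from the earlier snippet. The weight form mirrors our reconstruction of [2, eqs. 17-18]; treat the exact constants as assumptions rather than the paper's verbatim formulas.

```python
import numpy as np

def one_step_mm(Xk, yk, beta_hat, beta_S, sigma_hat, n_w, c1, c2, delta, n):
    """One-step MM replicate for one BLB bootstrap sample (weights n_w sum to n)."""
    r = yk - Xk @ beta_hat                    # residuals of the MM-estimate
    rS = yk - Xk @ beta_S                     # residuals of the initial S-estimate
    t = r / sigma_hat
    t_safe = np.where(t == 0.0, 1.0, t)
    omega = np.where(t == 0.0, 1.0, psi_tukey(t_safe, c2) / t_safe)  # IRLS weights
    ww = n_w * omega                          # multinomial x robustness weights
    beta_1 = np.linalg.solve(Xk.T @ (ww[:, None] * Xk),
                             Xk.T @ (ww * yk))                        # eq. (10)
    sigma_1 = sigma_hat / (n * delta) * np.sum(n_w * rho_tukey(rS / sigma_hat, c1))  # eq. (11)
    return beta_1, sigma_1
```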
IV Statistical properties
Next we establish the asymptotic convergence and statistical robustness of the proposed BLFRB method.
IV-A Statistical Convergence
We show that the asymptotic distribution of the BLFRB replicas in each bag is the same as that of the conventional bootstrap. Let $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ be a set of observed data, the outcome of i.i.d. random variables from an unknown distribution $P$. The empirical distribution (measure) formed by $\mathbf{X}$ is denoted by $\mathbb{P}_n = \frac{1}{n}\sum_{i=1}^{n}\delta_{\mathbf{x}_i}$, a linear combination of the Dirac measures at the observations. Let $\mathbb{P}_{n,b}$ and $\mathbb{P}_{n,b}^{*}$ denote the empirical distributions formed by subsample $\mathbf{X}_k$ and bootstrap sample $\mathbf{X}_{k,j}^{*}$, respectively. We also use functional representations of the estimator, e.g., $\hat{\boldsymbol{\theta}} = \boldsymbol{\theta}(\mathbb{P}_n)$ and $\boldsymbol{\theta} = \boldsymbol{\theta}(P)$. The notation $\approx$ denotes that both sides have the same limiting distribution.
Theorem IV.1
Consider $\boldsymbol{\theta}(\mathbb{P}_n)$, $\boldsymbol{\theta}(\mathbb{P}_{n,b})$ and $\boldsymbol{\theta}(\mathbb{P}_{n,b}^{*})$ as maps from a Donsker class $\mathcal{F}$ to $\mathbb{R}^q$ such that $\boldsymbol{\theta}(\mathbb{P})$ is measurable for every $\mathbb{P}$. Let $\boldsymbol{\theta}(\cdot)$ be Hadamard differentiable at $P$ tangentially to some subspace, and let $\hat{\boldsymbol{\theta}}$ be a solution to a system of smooth FP equations. Then, as $n, b \to \infty$,
$\sqrt{n}\,\big(\hat{\boldsymbol{\theta}}_k^{R*} - \hat{\boldsymbol{\theta}}_k\big) \;\approx\; \sqrt{n}\,\big(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\big)$   (12)
See the proof in the Appendix.
IV-B Statistical robustness
Consider the linear model (7) and let $\hat{\boldsymbol{\beta}}$ be an estimator of the parameter vector $\boldsymbol{\beta}$ based on $\mathbf{X}$. Let $q_t^{(j)}$ denote the $t$th upper quantile of $\hat{\beta}_j^{*}$, where $\hat{\beta}_j^{*}$ is the $j$th element of the bootstrap replica $\hat{\boldsymbol{\beta}}^{*}$, $j = 1, \ldots, p$; in other words, $P(\hat{\beta}_j^{*} > q_t^{(j)}) = t$. Here we study the robustness properties of the BLB and BLFRB estimates of $q_t^{(j)}$. We focus only on the robustness properties of one bag, as it is easy to see that the end results of both methods break down if even one bag produces a corrupted estimate. Let $\hat{q}_t^{(j)}$ denote the BLB or BLFRB estimate of $q_t^{(j)}$ based on a random subsample $\mathbf{X}_k$ of size $b$ drawn from a big data set $\mathbf{X}$. Following [18], we define the upper breakdown point of $\hat{q}_t^{(j)}$ as the minimum proportion of asymmetric outlier contamination in subsample $\mathbf{X}_k$ that can drive $\hat{q}_t^{(j)}$ over any finite bound.
Theorem IV.2
In the original BLB setting with the LS estimator, only one outlying data point in a subsample is sufficient to drive $\hat{q}_t^{(j)}$ over any finite bound and, hence, to ruin the end result of the whole scheme. See the proof in the Appendix.
Let $\mathbf{X}$ be an observed data set following the linear model (7). Assume that the explanatory variables are in general position [15, p. 117]. Let $\hat{\boldsymbol{\beta}}$ be an MM-estimate of $\boldsymbol{\beta}$ based on $\mathbf{X}$. According to [10, Theorem 2], the FRB estimate of the $t$th quantile of $\hat{\beta}_j^{*}$ remains bounded as long as $\hat{\boldsymbol{\beta}}$ in equation (1) is a reliable estimate of $\boldsymbol{\beta}$ and a sufficiently large proportion of the bootstrap samples contain at least $p$ good (i.e., non-outlying) data points. This means that in FRB, higher quantiles are more robust than lower ones. Here we show that in a BLFRB bag the former condition guarantees the latter.
Theorem IV.3
Let $\mathbf{X}_k$ be a subsample of size $b$ randomly drawn from $\mathbf{X}$ following the linear model (7). Assume that the explanatory variables are in general position. Let $\hat{\boldsymbol{\beta}}_k$ be an MM-estimator of $\boldsymbol{\beta}$ based on $\mathbf{X}_k$, and let $\delta_b^{*}$ be the finite sample breakdown point of $\hat{\boldsymbol{\beta}}_k$. Then, in the BLFRB bag formed by $\mathbf{X}_k$, all the estimated quantiles $\hat{q}_t^{(j)}$ have the same breakdown point, equal to $\delta_b^{*}$. See the proof in the Appendix.
Theorem IV.3 implies that in the BLFRB setting, lower quantiles are as robust as higher ones, with breakdown point equal to $\delta_b^{*}$, which can be set close to 0.5. This provides the maximum possible statistical robustness for the quantile estimates. In the proof we show that if $\hat{\boldsymbol{\beta}}_k$ is a reliable MM-estimate of $\boldsymbol{\beta}$, then all the bootstrap samples of size $n$ drawn from $\mathbf{X}_k$ are constrained to have at least $p$ good data points.
Table 1 illustrates the upper breakdown points of the BLFRB estimates of quantiles for various dimensions of data and different subsample sizes. The MM-regression estimator is tuned to a 50% breakdown point and 95% efficiency at the central model. The results reveal that BLFRB is significantly more robust than the original BLB with the LS estimator. Another important outcome of the table is that, when choosing the size of subsamples $b$, the dimension $p$ of the data should be taken into account; for example, for a data set of size $n = 50000$ with $p = 200$, setting $b = n^{0.6}$ yields an upper breakdown point of only 0.197.
Table 1: Upper breakdown points of the BLFRB quantile estimates for different data dimensions p, data set sizes n, and subsample sizes b = n^γ.

 p   |    n    | b = n^0.6 | b = n^0.7 | b = n^0.8
 50  |   50000 |   0.425   |   0.475   |   0.491
 50  |  200000 |   0.467   |   0.490   |   0.497
 50  | 1000000 |   0.488   |   0.497   |   0.499
 100 |   50000 |   0.349   |   0.449   |   0.483
 100 |  200000 |   0.434   |   0.481   |   0.494
 100 | 1000000 |   0.475   |   0.494   |   0.498
 200 |   50000 |   0.197   |   0.398   |   0.465
 200 |  200000 |   0.368   |   0.461   |   0.488
 200 | 1000000 |   0.450   |   0.487   |   0.497
V Numerical Examples
In this section the performance of the BLFRB method is assessed by simulation studies. We also perform the simulations with the original BLB method for comparison purposes.
V-A Simulation studies
We generate a simulated data set of size $n$ following the linear model $y_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \sigma e_i$, $i = 1, \ldots, n$, where the explanatory variables $\mathbf{x}_i$ are generated from a $p$-variate normal distribution, $\boldsymbol{\beta}$ is the $p$-dimensional parameter vector, the noise terms $e_i$ are i.i.d. from the standard normal distribution, and $\sigma$ is the noise scale. The MM-estimator in the BLFRB scheme is tuned to have 95% efficiency and 50% breakdown point. The original BLB scheme in [1] uses the LS estimator for computing the bootstrap estimates of $\boldsymbol{\beta}$.
Here, we first verify the result of Theorem IV.1 in simulation by comparing the distribution of the left-hand side of (12) with that of the right-hand side. Given the above settings, the right-hand side of (12) is asymptotically normal [12, Theorem 4.1]. We form the distribution of the left-hand side by drawing a random subsample of size $b$ and performing steps 2 and 3 of the BLFRB procedure (i.e., Algorithm 1) using $r$ bootstrap samples. Fig. 2 shows the true distribution along with the obtained empirical distributions for the two elements of $\boldsymbol{\beta}$ with the best and the worst outcomes. The result of averaging all the empirical distributions is illustrated in Fig. 3, along with the true distribution. Note that the results conform with Theorem IV.1.
Next, we compare the performance of the BLB and BLFRB methods. We compute the bootstrap estimate of the standard deviation (SD) of $\hat{\boldsymbol{\beta}}$ by the two methods. In other words, the estimate of uncertainty in step 4 of the procedure (see Fig. 1) for bag $k$ is

$\hat{\xi}_k^{(j)} = \Big(\frac{1}{r-1}\sum_{i=1}^{r}\big(\hat{\beta}_{k,i}^{*(j)} - \bar{\beta}_k^{*(j)}\big)^2\Big)^{1/2},$

where $\hat{\beta}_{k,i}^{*(j)}$ denotes the $j$th element of the $i$th bootstrap replica in bag $k$ and $\bar{\beta}_k^{*(j)} = \frac{1}{r}\sum_{i=1}^{r}\hat{\beta}_{k,i}^{*(j)}$. Step 5 of the procedure for the $j$th element of $\hat{\boldsymbol{\beta}}$ averages over the $s$ bags:

$\hat{\xi}^{(j)} = \frac{1}{s}\sum_{k=1}^{s}\hat{\xi}_k^{(j)}.$

The performance of BLB and BLFRB is assessed by computing a relative error defined as

$\epsilon = \frac{\big|\hat{\xi}^{(j)} - \xi_o\big|}{\xi_o},$
where $\xi_o$ is (an approximation of) the average standard deviation of the elements of $\hat{\boldsymbol{\beta}}$ based on the asymptotic covariance matrix [12], the asymptotic efficiency being 0.95 for the MM-estimator and 1 for the LS estimator. The bootstrap setup is as follows: the number of disjoint subsamples is $s$, the size of each subsample is $b = n^{\gamma}$, and the maximum number of bootstrap samples in each subsample module is $r_{\max}$. We start from a small number of bootstrap samples and continually add new sets of bootstrap samples (while $r \le r_{\max}$) to the subsample modules. The convergence of the relative errors w.r.t. the number of bootstrap samples is illustrated in Fig. 4. Note that when the data is not contaminated by outliers, both methods perform similarly in terms of achieving lower relative errors for higher numbers of bootstrap samples.
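In code, steps 4 and 5 and the relative error above reduce to a few lines. A sketch, assuming `reps[k]` holds the $(r \times p)$ array of BLFRB replicates of bag $k$ and `xi_o` is the reference value defined above:

```python
import numpy as np

def blfrb_sd_and_error(reps, xi_o):
    """reps: list of (r x p) replicate arrays, one per bag; xi_o: reference SD."""
    xi_bags = np.stack([R.std(axis=0, ddof=1) for R in reps])  # step 4, per bag
    xi = xi_bags.mean(axis=0)                                  # step 5, average
    return xi, np.mean(np.abs(xi - xi_o) / xi_o)               # relative error
```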
We study the robustness properties of the methods using the above settings. According to Theorem IV.2, only one outlying data point is sufficient to drive the BLB estimates of $q_t^{(j)}$ over any finite bound. To introduce such an outlier, we randomly choose one of the original data points and multiply it by a large number. This contamination scenario resembles the misplacement of a decimal point in real-world data sets. The lack of robustness of the BLB method is illustrated in Fig. 5(a).
According to Table 1, for the settings of our example, the upper breakdown point of the BLFRB quantile estimates is close to 0.5. Let us assess the statistical robustness of the BLFRB scheme by severely contaminating the original data points of the first bag: we multiply a substantial proportion of the data points by a large number. As shown in Fig. 5(b), BLFRB still performs in a highly robust manner despite such a proportion of outlying data points.
Now, let us make an intuitive comparison between the computational complexity of the BLB and BLFRB methods by using the MM-estimator in both. We use an identical computing system to compute the bootstrap standard deviation (SD) of $\hat{\boldsymbol{\beta}}$ by the two methods. The computed relative errors and the cumulative processing time are stored after each iteration (i.e., after adding a new set of bootstrap samples to the bags). Fig. 6 reports the relative errors w.r.t. the required cumulative processing time after each iteration of the algorithms. BLFRB is remarkably faster, since the method avoids solving the estimating equations for each bootstrap sample.
V-B Real world data
Finally, we use the BLFRB method for bootstrap analysis of a real large data set. We consider the simplified version of the Million Song Dataset (MSD) [19], available on the UCI Machine Learning Repository [20]. The data set contains $n = 515345$ music tracks, where $y_i$ represents the release year of the $i$th song (ranging from 1922 to 2011) and $\mathbf{x}_i$ is a vector of $p = 90$ different audio features of each song. The used features are the average and non-redundant covariance values of the timbre vectors of the song. Linear regression can be used to predict the release year of a song based on its audio features. We use the BLFRB method to conduct a fast, robust and scalable bootstrap test on the regression coefficients. In other words, considering the linear model (7), we use BLFRB for testing the hypotheses $H_0: \beta_j = 0$ vs. $H_1: \beta_j \ne 0$, where $\beta_j$ denotes the $j$th element of $\boldsymbol{\beta}$. The BLFRB test of level $\alpha$ rejects the null hypothesis if the computed $(1-\alpha)$ confidence interval does not contain zero. Here we run the BLFRB hypothesis test with the following bootstrap setup: $s$ disjoint subsamples of size $b = n^{\gamma}$ each, with $r$ bootstrap samples in each subsample module. Among all the 90 features, the null hypothesis is accepted for only 6 features. Fig. 7 shows the computed confidence intervals of the features; in order to provide a closer view, we show only the results for features 30 to 80. These results can be exploited to reduce the dimension of the data by excluding the ineffective variables from the regression analysis.
Using the BLFRB method with the highly robust MMestimator, we ensure that the computational process is done in a reasonable time frame and the results are not affected by possible outliers in the data. Such desirable properties are not offered by the other methods considered in our comparisons.
VI Conclusion
In this paper, a new robust, scalable and low-complexity bootstrap method is introduced with the aim of finding parameter estimates and confidence measures for very large-scale data sets. The statistical properties of the method, including convergence and robustness, are established analytically. While the proposed BLFRB method is fully scalable and compatible with distributed computing systems, it is remarkably faster and significantly more robust than the original BLB method [1].
Appendix A
Proof of Theorem IV.1: Given that $\mathcal{F}$ is a Donsker class, as $n \to \infty$:

$\sqrt{n}\,(\mathbb{P}_n - P) \rightsquigarrow \mathbb{G}_P$   (13)

where $\mathbb{G}_P$ is the $P$-Brownian bridge process and the notation $\rightsquigarrow$ denotes convergence in distribution. According to [21, Theorem 3.6.3], as $n, b \to \infty$:

$\sqrt{n}\,(\mathbb{P}_{n,b}^{*} - \mathbb{P}_{n,b}) \rightsquigarrow \mathbb{G}_P$   (14)

Thus, $\sqrt{n}\,(\mathbb{P}_n - P)$ and $\sqrt{n}\,(\mathbb{P}_{n,b}^{*} - \mathbb{P}_{n,b})$ converge in distribution to the same limit. The functional delta method for the bootstrap [22, Theorem 23.9], in conjunction with [23, Lemma 1], implies that for every Hadamard-differentiable function $\boldsymbol{\theta}(\cdot)$:

$\sqrt{n}\,\big(\boldsymbol{\theta}(\mathbb{P}_n) - \boldsymbol{\theta}(P)\big) \rightsquigarrow \boldsymbol{\theta}_P'(\mathbb{G}_P)$   (15)

and, conditionally on $\mathbf{X}_k$,

$\sqrt{n}\,\big(\boldsymbol{\theta}(\mathbb{P}_{n,b}^{*}) - \boldsymbol{\theta}(\mathbb{P}_{n,b})\big) \rightsquigarrow \boldsymbol{\theta}_P'(\mathbb{G}_P)$   (16)

where $\boldsymbol{\theta}_P'$ is the Hadamard derivative of $\boldsymbol{\theta}(\cdot)$ at $P$. Thus, conditionally on $\mathbf{X}_k$,

$\sqrt{n}\,\big(\hat{\boldsymbol{\theta}}_k^{*} - \hat{\boldsymbol{\theta}}_k\big) \;\approx\; \sqrt{n}\,\big(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\big)$   (17)

According to [2, equations 5 and 6], and given that $\hat{\boldsymbol{\theta}}$ can be expressed as a solution to a system of smooth FP equations:

$\sqrt{n}\,\big(\hat{\boldsymbol{\theta}}_k^{R*} - \hat{\boldsymbol{\theta}}_k\big) \;\approx\; \sqrt{n}\,\big(\hat{\boldsymbol{\theta}}_k^{*} - \hat{\boldsymbol{\theta}}_k\big)$   (18)

and hence, combining (17) and (18),

$\sqrt{n}\,\big(\hat{\boldsymbol{\theta}}_k^{R*} - \hat{\boldsymbol{\theta}}_k\big) \;\approx\; \sqrt{n}\,\big(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\big)$   (19)

which concludes the proof.
Lemma 1

Let $\mathbf{X}_k$ be a subset of size $b$ randomly resampled without replacement from a big data set $\mathbf{X}$ of size $n$. Let $\mathbf{X}_{k,j}^{*}$ be a bootstrap sample of size $n$ randomly resampled with replacement from $\mathbf{X}_k$ (or, equivalently, formed by assigning a random weight vector from $\mathrm{Multinomial}(n, (1/b)\mathbf{1}_b)$ to the columns of $\mathbf{X}_k$). Then every data point of $\mathbf{X}_k$ appears in $\mathbf{X}_{k,j}^{*}$ with probability tending to one as $n \to \infty$.

Proof of Lemma 1: Consider an arbitrary data point $\mathbf{x}_i \in \mathbf{X}_k$. The probability that $\mathbf{x}_i$ does not occur in a bootstrap sample of size $n$ is

$P\big(\mathbf{x}_i \notin \mathbf{X}_{k,j}^{*}\big) = \Big(1 - \frac{1}{b}\Big)^{n} \le e^{-n/b},$

which is vanishingly small for $b = n^{\gamma}$ with $\gamma < 1$.
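A quick numerical check of this bound (the values of $n$ and $\gamma$ below are illustrative, not taken from the paper):

```python
for n, gamma in [(50_000, 0.6), (1_000_000, 0.7)]:
    b = int(n ** gamma)
    p_miss = (1.0 - 1.0 / b) ** n        # ~ exp(-n/b)
    print(f"n={n}, b={b}, P(point missed) = {p_miss:.3e}")
```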
Proof of Theorem IV.2: Let $\mathbf{x}_o$ be an outlying data point in subsample $\mathbf{X}_k$. According to Lemma 1, with probability tending to one, all bootstrap samples drawn from that subsample are contaminated by $\mathbf{x}_o$. This is sufficient to break all the LS replicas of the estimator in that bag and, consequently, to ruin the end result of the whole scheme.
Proof of Theorem IV.3: According to [10, Theorem 2], the FRB estimate of $q_t^{(j)}$ remains bounded as long as:

1) $\hat{\boldsymbol{\beta}}_k$ in equation (1) is a reliable estimate of $\boldsymbol{\beta}$, and

2) a sufficiently large proportion of the bootstrap samples contain at least $p$ "good" (i.e., non-outlying) data points.

The first condition implies that, in a BLFRB bag, if $\hat{\boldsymbol{\beta}}_k$ is a corrupted estimate, then all bootstrap estimates will break down as well. In the rest of the proof we show that if the percentage of outliers in $\mathbf{X}_k$ is such that $\hat{\boldsymbol{\beta}}_k$ is still a reliable estimate of $\boldsymbol{\beta}$, then all the bootstrap samples drawn from $\mathbf{X}_k$ contain at least $p$ good (non-outlying) data points. This suffices for all the quantile estimates $\hat{q}_t^{(j)}$ to remain bounded.

Let $\hat{\boldsymbol{\beta}}_k$ be an MM-estimate of $\boldsymbol{\beta}$, with the initial scale of the MM-estimator obtained by a high breakdown point S-estimator. The finite sample breakdown point of the S-estimator for a subsample of size $b$ is

$\delta_b^{*} = \frac{\lfloor b/2 \rfloor - p + 2}{b}$

[15, Theorem 8]. Given that $\hat{\boldsymbol{\beta}}_k$ is a reliable estimate, the initial S-estimate of the scale parameter is not broken. This implies that there exist at least $b(1 - \delta_b^{*})$ good data points in general position in $\mathbf{X}_k$. It is easy to see that $b(1 - \delta_b^{*}) \ge p$. Applying Lemma 1 to each of the good points shows that the probability of drawing a bootstrap sample of size $n$ with fewer than $p$ good data points goes to zero for large $n$, which is the case for big data sets.
References
 [1] A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan, “A scalable bootstrap for massive data,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 76, no. 4, pp. 795–816, 2014.
 [2] M. Salibián-Barrera, S. Van Aelst, and G. Willems, “Fast and robust bootstrap,” Statistical Methods and Applications, vol. 17, no. 1, pp. 41–71, 2008.
 [3] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. CRC Press, 1994.
 [4] A. C. Davison and D. V. Hinkley, Bootstrap Methods and Their Application, vol. 1. Cambridge University Press, 1997.
 [5] A. Zoubir and D. Iskander, Bootstrap Techniques for Signal Processing. Cambridge University Press, 2004.
 [6] A. Zoubir and B. Boashash, “The bootstrap and its application in signal processing,” IEEE Signal Processing Magazine, vol. 15, pp. 56–76, Jan. 1998.
 [7] D. N. Politis, J. P. Romano, and M. Wolf, Subsampling. Springer, 1999.
 [8] P. J. Bickel, F. Götze, and W. R. van Zwet, Resampling fewer than n observations: gains, losses, and remedies for losses. Springer, 2012.
 [9] M. Salibián-Barrera, Contributions to the Theory of Robust Inference. PhD thesis, University of British Columbia, 2000.
 [10] M. Salibián-Barrera and R. H. Zamar, “Bootstrapping robust estimates of regression,” The Annals of Statistics, vol. 30, no. 2, pp. 556–582, 2002.
 [11] S. Basiri, E. Ollila, and V. Koivunen, “Fast and robust bootstrap in analysing large multivariate datasets,” in Asilomar Conference on Signals, Systems and Computers, pp. 8–13, IEEE, 2014.
 [12] V. J. Yohai, “High breakdownpoint and high efficiency robust estimates for regression,” The Annals of Statistics, pp. 642–656, 1987.
 [13] S. Basiri, E. Ollila, and V. Koivunen, “Fast and robust bootstrap method for testing hypotheses in the ICA model,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 6–10, IEEE, 2014.
 [14] P. Rousseeuw and V. Yohai, “Robust regression by means of S-estimators,” in Robust and Nonlinear Time Series Analysis, pp. 256–272, Springer, 1984.

 [15] P. J. Rousseeuw and A. M. Leroy, Robust Regression and Outlier Detection, vol. 589. John Wiley & Sons, 2005.
 [16] C. Croux, G. Dhaene, and D. Hoorelbeke, “Robust standard errors for robust estimators,” CES Discussion Paper Series (DPS) 03.16, pp. 1–20, 2004.
 [17] M. Salibián-Barrera and V. J. Yohai, “A fast algorithm for S-regression estimates,” Journal of Computational and Graphical Statistics, vol. 15, no. 2, 2006.
 [18] K. Singh et al., “Breakdown theory for bootstrap quantiles,” The Annals of Statistics, vol. 26, no. 5, pp. 1719–1732, 1998.
 [19] T. BertinMahieux, D. P. Ellis, B. Whitman, and P. Lamere, “The million song dataset,” in Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.
 [20] M. Lichman, “UCI machine learning repository,” 2013.
 [21] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 1996.
 [22] A. W. van der Vaart, Asymptotic Statistics, vol. 3. Cambridge University Press, 1998.
 [23] A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan, “A scalable bootstrap for massive data supplementary material,” 2013.