‘Big Data’ presents unique challenges at every stage, from retrieving and storing the data all the way to analysing it. Technological breakthroughs make the generation and collection of huge volumes of data possible in many fields such as genetics, genomics, health care, customer service and informatics, to name a few. Among the various challenges presented by the abundance of data, analysis of the data is a well-recognized hurdle. While the explosion of information allows us to know more about the underlying process, appropriate methods or algorithms are essential to make correct inferences or to reveal hidden patterns.
Recent advances in technology and in methods targeted at Big Data analytics provide ample capacity for storing the data along with the power of parallel computing. Much effort has been dedicated to extracting information from Big Data in an efficient manner. From a practical standpoint, however, concern remains about the validity of results obtained from the analysis of Big Data. As attested by many recent articles, in most cases inference based on such data is unreliable. For example, in high dimensions conventional classification methods can be no better than random guessing Fan2008 . Understanding the output of Big Data analytics, rather than fixating on its technical aspects, is the most important issue Fan2014 ; Fan2015 ; NewIdeas , because the future decision making process depends only on this output. The aim of this article is to put forward a framework that establishes the acceptability of what is learned from Big Data. This framework also fits the paradigm of parallel computing and at the same time provides a robust statistical basis for practical application.
The classical statistical theory of data analysis has its roots in the axioms of probability theory. With the growing complexity of Big Data, statistical theory needs to be revisited statBigData , mainly because of violations of probabilistic independence or exchangeability. The statistics community has raised concerns about how the sound and carefully developed theory can help build a structure around it. In this article we exploit an algorithmic architecture used in practice to tackle Big Data and suggest an appropriate mathematical ground for the analysis of such an architecture.
We propose a partition and repetition approach in a general framework for the statistical analysis of Big Data. This approach expands the horizon of standard statistical methods and opens new avenues for novel methods that encompass and tackle the challenges arising from the specific characteristics of Big Data. Within this general framework, we prove consistency and accuracy of the analytic results thus obtained. We explain the theory through various examples that are commonly required in data analysis across many fields. We hope that such a framework will help in the further development of Big Data analytics.
2 The divide and conquer algorithm
An abundance of digital information is one way to explain what we understand today as ‘Big Data’. There are two aspects to the story. Firstly, human intuition suggests that the accuracy of the answer to a question increases as we have more and more information. This intuition works backward: we start with a question, try to comprehend what data we might need to answer it, and then realize that the relevant information exists somewhere in digitized format. The catch is that this retrospective thought process assumes that the skill by which human intelligence finds the answer from the data is transferable to mechanical and algorithmic computing. Secondly, with a huge volume of data we can find a question of interest from the data itself and then obtain the answer to that question. But the inherent complexity of the available data makes this task difficult. This whole process is advertised as Big Data analytics.
The principal characteristics of Big Data are its volume, velocity, variety and complexity Characteristics . Each of them presents unique challenges at the technical level of dealing with the data. At the hardware level we have reached a saturation point on the achievable clock speed of a single processor. Instead, growth in computing capacity is attained by increasing the number of threaded cores. Moreover, while storage capacity is fairly cheap and scalable, RAM is not. Recognizing this hardware restriction, state-of-the-art platforms for Big Data analytics (Hadoop, Amazon EC2) have adopted a partitioning-based method.
However, in view of advancements in computing systems, including storage and processing, new data analytic tools are required that adapt to these technologies petcu . Building such statistical tools and algorithms for monitoring and analysis is needed to achieve success in Big Data analytics. Hence standard statistical methods should be revisited, modified, and validated in the light of scalability to extremely large scale data applications reed .
Fisher et al. BigDataInteracting have identified the standard workflow of data analysis as 1) acquiring data, 2) choosing an architecture, 3) shaping the data to the architecture, 4) writing and editing the code, and 5) reflecting and iterating on the results. The initial struggle is to adopt a suitable architecture for the data and to map the collected data to that architecture. In this article we do not focus on this part of the analytics job; rather, the focus is on the later part, analysing the data. To address the problem of a huge volume of data, the approach is to partition it into small portions that are manageable in RAM, process these portions in parallel, and finally combine the processed information to produce the final output. This idea of partitioning has been used, although in a subtle way, in other areas of research, e.g., data mining buehrer ; calders and MCMC wang . An extra benefit of this divide and conquer method is that such an algorithm adapts easily to the velocity of Big Data. Velocity contributes new partitions, which are analysed and whose inference is then combined with the earlier output velocity . Issues relating to variety and complexity are taken into account by the statistical methods and algorithms used in the analysis (see the discussion in Section 4 and references therein).
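The partition, process and combine workflow, together with the way velocity contributes new partitions, can be sketched as follows. This is a minimal illustration under our own assumptions, not an implementation taken from the cited works: the `Summary` record (count, minimum, maximum) and the function names `partition`, `process`, `combine` and `analyse` are hypothetical, the data are assumed non-empty, and a thread pool stands in for whatever parallel back end is actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Per-partition result (hypothetical): small enough to keep in memory."""
    count: int
    minimum: float
    maximum: float


def partition(data, chunk_size):
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process(chunk):
    """Analyse one chunk on its own."""
    return Summary(len(chunk), min(chunk), max(chunk))


def combine(a, b):
    """Merge two summaries; the operation is associative and commutative."""
    return Summary(a.count + b.count,
                   min(a.minimum, b.minimum),
                   max(a.maximum, b.maximum))


def analyse(data, chunk_size, workers=4):
    """Partition, process the parts in parallel, then combine the results."""
    chunks = partition(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        summaries = list(pool.map(process, chunks))
    result = summaries[0]
    for s in summaries[1:]:
        result = combine(result, s)
    return result


data = [5.0, 3.0, 9.0, 1.0, 7.0, 2.0, 8.0]
current = analyse(data, chunk_size=3)

# Velocity: a newly arrived batch is treated as one more partition and
# merged into the earlier output without revisiting the old data.
new_batch = [11.0, 0.5]
updated = combine(current, process(new_batch))
```

Because `combine` only needs the earlier summary and the summary of the new batch, the old partitions never have to be reloaded when new data arrive.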
3 The framework
3.1 Sample space structure
The classical theory of statistical analysis is a well-developed area with sound theories. To establish a framework for Big Data analytics we would naturally like to fall back on those works. To begin with, we consider a sample space where is the space of realized values of the data and is the sigma field associated with the sample space. We denote by the set of probability measures on . Also let be the set of probability measures with finite support. An observed data set can be identified with a probability measure on , with a support having finite cardinality, defined as follows,
for any and is the -th data point. To build a theory around it we require a suitable metric on the space . For example, if is a Polish space then with the Prokhorov metric () we can put weak convergence on .
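For concreteness, the identification of a data set with a finitely supported probability measure is the usual empirical measure. The display below is a sketch in notation assumed for illustration only ($\mathcal{X}$ for the space of realized values, $\mathcal{A}$ for its sigma field, $x_i$ for the $i$-th data point, $n$ for the number of data points and $P_n$ for the resulting measure); these symbols may differ from those of the original text.

```latex
P_n(A) \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{x_i \in A\},
\qquad A \in \mathcal{A}.
```

Under this reading, $P_n$ puts mass $1/n$ on each observed point (with multiplicity), so its support has at most $n$ elements.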
Up to this point we have not considered any aspect of Big Data per se. Our aim is to build the ideology of Big Data analytics on this sample space structure. Identifying the realized data with an empirical measure on some sample space gives a broader ground to work on. In a Big Data setup, we hardly have any control over the generation of data. Thus, unlike classical statistical theory, where we mostly want to build better experimental designs in order to apply statistical methods, be they standard or novel, here we want to construct an algorithm that works with the data generation process as it is. This difference in approach is subtle but central to how the two ideologies differ.
3.2 The problem approach
The main goal of Big Data analytics is to extract information from the data, which is equivalent to getting information from an element in . So we assume that a satisfactory data collection and mapping architecture exists. To develop a full framework, we introduce some definitions about the functionality of data analysis. This is necessary to avoid the cumbersome details and technicalities of any particular scenario.
Extracted information of a data analysis can be viewed as an element in the result space (). A problem approach () is a function from to . Based on this formulation of problem approach we can consider two classes of problem approaches as follows.
Inference Problem: If the problem approach can be extended to a strictly larger subset of than , then such a problem or problem approach is called an inference problem.
Mining Problem: If the problem approach can only be defined on , then such a problem or problem approach is called a mining problem.
The usual examples of these two classes of problems are as follows. Parametric estimation and testing problems fall under the class of inference problems, where the subset of under consideration is
along with the parameter models. Clustering and outlier detection problems, on the other hand, fall under the class of mining problems. In later sections, we shall discuss both these classes of problem approaches and their solutions in more detail.
A technical assumption we need is that a problem approach is viable if the map
is a continuous map, where and are appropriate metrics on the respective spaces. A viable problem approach () then ensures that the problem is consistent in the number of samples as well as data points. This means that a slight change in the data generation process () should not create a substantial difference in the result ().
3.3 Big Data Algorithm
We now discuss various components of our proposed algorithmic structure of Big Data analytics.
A naturally accepted strategy for analysing a huge volume of data is to consider small parts of the data at a time. Our framework formalizes this method of partitioning the data as a functional,
such that is related to by,
where denotes the support set of and denotes the empty set.
For convenience we write for each . For a fixed data (or ) we would be given a problem approach . Then the divide and conquer strategy would choose a partitioning functional .
But to reduce the error in the result due to partitioning, the strategy is to repeat times the partitioning; denote them by . We call this type of algorithm the partition-repetition algorithm. We now formalize this partition-repetition algorithm in a convenient manner.
Let be the set of all partitioning functionals . A -field can be defined as the smallest -field on such that the functions on to are measurable for any choice of , where
Then the strategy of analysing data of unmanageable size by partition-repetition algorithm can be understood as a probability measure on the measurable space . More precisely would be viewed as a random sample from the probability measure space . For simplicity of notation let us denote by the map,
for . Then a single random sample
from the probability distribution provides us results , which are elements from . With a random sample from the distribution, the set of results we get using the problem approach is
The next critical part of the algorithm is combining the results obtained above in order to arrive at a final result. Let be the combining map that takes all the results from the collection and gives the final result. The triplet can be called a solution to a Big Data problem.
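The three ingredients of a solution (a problem approach, a random partitioning and a combining map) can be sketched in code as follows. This is only an illustration under stated assumptions: the names `draw_partition`, `is_valid_partition` and `run_solution` are hypothetical, the partitioning measure is taken to be a uniformly random balanced split, and we assume that a valid partition means non-empty parts that together use every data point exactly once.

```python
import random
from collections import Counter


def draw_partition(data, n_parts, rng):
    """Draw one random partitioning: shuffle the indices, deal them into parts."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    return [[data[i] for i in idx[k::n_parts]] for k in range(n_parts)]


def is_valid_partition(data, parts):
    """Assumed constraint: non-empty parts that cover the data exactly once."""
    pooled = Counter(x for part in parts for x in part)
    return all(len(part) > 0 for part in parts) and pooled == Counter(data)


def run_solution(data, approach, combine, n_parts, n_reps, seed=0):
    """Apply `approach` to every part of every repetition, then combine."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_reps):
        parts = draw_partition(data, n_parts, rng)
        results.append([approach(part) for part in parts])
    return combine(results)


data = list(range(1, 21))
parts = draw_partition(data, n_parts=4, rng=random.Random(1))

# Toy problem approach: the maximum. Combining by taking the maximum of
# all partial results recovers the full-data answer exactly.
largest = run_solution(
    data,
    approach=max,
    combine=lambda results: max(max(r) for r in results),
    n_parts=4,
    n_reps=3,
)
```

Keeping `approach` and `combine` as arguments mirrors the idea that the same partition-repetition machinery serves many different problems.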
Now it remains to understand the viability of the solution. We have put a stable condition of continuity in equation (1) on problem approach as a viable problem approach. Proper behaviour of the pair would ensure an accurate solution to the problem for .
We focus on the case where works in two stages. In the first stage works on each partition () to collect the results
This -tuple is combined by . For a fixed data when is a measurable map, the randomness of makes the collection an independently and identically distributed (i.i.d.) sample on the measure space . This formulation of the solution provides an opportunity to use rich statistical theory in data analytics.
In the general case, the result space can be quite complicated (we shall give concrete examples in a later section). Rather than dealing with the space itself, it is better to work with real numbers. This is achieved by an evaluation function for some fixed belonging to the set of natural numbers. Then, viability of the choice of can be understood using the evaluation function of the result space . For a given data and a problem approach , we call a partitioning probability measure viable under the first stage combining operator if,
This condition means that the probability measure and the combining method are compatible with each other for the problem . If we perform infinitely many repetitions of our partition-repetition based algorithm, the combining method will perform equivalently to what we would have obtained had we been able to apply on the data .
The second stage of combining method operates on the collection of first stage result by combining to get the solution
Now the viability of is based on the comparison of with (say). Here we present the soundness of the algorithm of partitioning and combining through the following theorem.
Theorem 1. For a Big Data solution , if is a viable partitioning method under combining method (i.e., equation (4) is satisfied) and convergence in is equivalent to that of in , then there exists a second stage combining method , such that almost surely in .
[Proof.] Define on ( times) as follows,
Let us use the notations , and . Since
is an i.i.d. sample, by strong law of large numbers as equation (4) holds, for all with ,
Now using the fact that and definition of , the above holds with replaced by . Since convergence in is equivalent to that in , rest of the argument follows as by assumption convergence in is equivalent to that in .
The theorem above deals with the volume aspect of Big Data. It says that even if the data are too large to be processed in practice, we can adopt the partition-repetition approach to get a good solution. It has also not escaped our attention that the number of combining rules may be more than two, but the final convergence of results then requires some more assumptions and stronger theorems in a dependent setup.
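A small numerical sketch may help to see the two-stage combining at work. Everything here is an assumption made for illustration: the problem approach is the median, the first stage averages the part medians within one repetition, the second stage averages over repetitions (in the spirit of the law of large numbers argument in the proof), and the evaluation function is taken to be the identity since the results are already real numbers. The data are symmetric about 501, so the expected first stage result equals the full-data median.

```python
import random
import statistics


def first_stage(data, n_parts, rng):
    """One repetition: random balanced partition, median per part, then average."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    parts = [shuffled[k::n_parts] for k in range(n_parts)]
    return statistics.fmean(statistics.median(p) for p in parts)


def second_stage(first_stage_results):
    """Combine the repetitions by averaging their first stage results."""
    return statistics.fmean(first_stage_results)


rng = random.Random(42)
data = [float(x) for x in range(1, 1002)]   # full-data median is 501
target = statistics.median(data)

repetitions = [first_stage(data, n_parts=10, rng=rng) for _ in range(200)]
final = second_stage(repetitions)
error = abs(final - target)
```

With more repetitions the averaged result settles closer to the target, which is the behaviour the theorem describes; the sketch does not by itself verify the viability condition for other problems.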
Next we also need to answer a question that is more classically statistical in nature. If the velocity of the data provides us with more and more information of a specific form, is the partition-repetition algorithm able to extract that information? The following theorem tells us that, if that is the case, we are able to choose a partitioning measure and a sequence of combining methods that give the final result.
Theorem 2. Let and . Suppose the problem approach is viable on its domain and . If conditions of Theorem 1 hold for the sequence of solutions , then there exists a sequence of integers and a such that, for , is absolutely continuous with respect to with
as almost surely in .
[Proof.] Define . Let us denote,
almost surely in . Choosing as rationals, result follows from Cantor’s diagonal argument.
Both these results are existential in nature rather than instructive for practice. Although a little abstract in their formulation, these theorems form the basis of the methods that would be applied in practice. The study of combining methods is not new to statistics. This framework reinforces the importance of various combining methods, along with partitioning methods, in the light of Big Data analytics.
The power of this kind of theory is that we do not put any hard and fast regularity condition on the data or the data generation process. Theorem 2 only requires that the data collected eventually amounts to some specific information.
4 Illustrative Examples
An analyst’s job and a statistician’s work differ in a crucial way. An analyst is more concerned with how to extract information from the available data. This work is referred to as number crunching. A statistician is concerned with the quality of the extracted information, sometimes taking for granted the effort of extracting it. In a Big Data scenario, where the importance of the analyst’s job comes more into the limelight, a statistician could provide support by accepting some compromise on their ideology. In this section we illustrate the formulation developed above through some standard data analytic problems.
We first consider a few problems where the solution can be calculated without any error from a partitioning-based algorithm. Here we specify by subscript the size of the data. In these examples it is enough to consider to be some degenerate probability distribution of convenience and we only require a single sample () from it.
Calculating sample mean
Here can be any distribution that partitions the data into manageable balanced pieces. Then for the combining method shall be,
A little tweak in these definitions allows us to calculate many other descriptive statistics like weighted means, dispersion measures and also some robust measures for central tendency.
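The sample mean example can be sketched in a few lines. We assume here that the combining method is the usual size-weighted average of the part means, which reproduces the full-data mean exactly; the function names `part_summary` and `combine_means` are hypothetical.

```python
import math


def part_summary(part):
    """Problem approach on one part: return (size, mean)."""
    return len(part), sum(part) / len(part)


def combine_means(summaries):
    """Size-weighted average of the part means."""
    total_n = sum(n for n, _ in summaries)
    return sum(n * m for n, m in summaries) / total_n


data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
parts = [data[0:3], data[3:5], data[5:7]]     # deliberately unequal sizes

summaries = [part_summary(p) for p in parts]
combined = combine_means(summaries)
full_mean = sum(data) / len(data)
```

Carrying the part size alongside the part mean is what makes the combination exact when the parts are not perfectly balanced; carrying further sums (for example, of squares) in the same way gives dispersion measures.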
To get a Big Data solution to the sorting problem we can define a partitioning as a degenerate distribution such that it divides the data into parts based on a sequence as,
The choice of the sequence should be such that the individual parts are of manageable sizes. With providing us with a sorted array, the combining stage should simply concatenate the ordered parts, i.e.,
Similar solutions of the above type are obvious for problems like searching, calculating extreme statistics, constructing a histogram, etc. Most of the time these simple problems are only intermediate steps towards more challenging problems of data analytics.
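The sorting solution described above can be sketched as follows. The cut points play the role of the sequence that defines the parts; their values, and the names `range_partition` and `sort_by_parts`, are assumptions made for this illustration.

```python
import bisect


def range_partition(data, cut_points):
    """Send each value to the bucket whose range contains it.

    cut_points must be sorted; bucket k holds the values x with
    cut_points[k - 1] <= x < cut_points[k].
    """
    buckets = [[] for _ in range(len(cut_points) + 1)]
    for x in data:
        buckets[bisect.bisect_right(cut_points, x)].append(x)
    return buckets


def sort_by_parts(data, cut_points):
    """Sort each bucket on its own, then concatenate the sorted buckets."""
    sorted_parts = [sorted(b) for b in range_partition(data, cut_points)]
    return [x for part in sorted_parts for x in part]


data = [7, 3, 9, 1, 5, 8, 2, 6, 4]
buckets = range_partition(data, cut_points=[3, 6])
result = sort_by_parts(data, cut_points=[3, 6])
```

Since every value in one bucket is smaller than every value in the next, concatenation is the only combining step needed; the cut points only affect how balanced the buckets are.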
Some solutions to more standard problems of Big Data analytics are discussed in brief below. First few examples are inference problems while the later ones are mining problems. We assume that the data are cleaned and dressed for the purpose at hand. We avoid discussing the technical aspects of implementing these algorithms in practice, though in a few examples we shall provide references to available literature that has more focus on detailed analysis of the algorithms.
The problems of modeling (nonparametric, parametric, time series or even Bayesian) fall under the class of inference problems. Based on the requirements of the solution (e.g., unbiasedness, minimum variance, consistency) there would be different Big Data solutions to the problem approach. Often it suffices to consider as a random partitioning measure of the data, although for spatial and/or temporal data a more clever partitioning measure would be required to satisfy a viability condition like equation (4).
Let us consider the problem of finding the maximum likelihood estimate (MLE) of a parameter based on some algorithm (say, the EM algorithm, Newton-Raphson or Fisher’s scoring). The scenario is that we have a statistical model in mind where the number of parameters is fixed. Partitioning the data then simply breaks the objective function (the log-likelihood function) into parts. Consequently, an intuitive choice of the combining method is to take whichever of the results from the partitions maximizes the whole objective function. Although this method does not guarantee the MLE for the full data, in practice we are rarely concerned with theoretical properties such as efficiency; the estimate found by this method is acceptable.
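The combining rule just described, picking the partition estimate that maximizes the whole log-likelihood, can be sketched as below. The exponential model is chosen purely for brevity (its MLE has a closed form, so no iterative algorithm is needed here), and the names `exp_mle`, `exp_loglik` and `combine_by_full_likelihood` are hypothetical. The whole objective is evaluated as a sum over parts, so the full data never need to sit in memory at once.

```python
import math


def exp_mle(part):
    """MLE of the rate of an exponential model on one part."""
    return len(part) / sum(part)


def exp_loglik(rate, part):
    """Exponential log-likelihood contribution of one part."""
    return len(part) * math.log(rate) - rate * sum(part)


def combine_by_full_likelihood(parts, fit, loglik):
    """Return the part estimate with the largest total log-likelihood."""
    candidates = [fit(p) for p in parts]

    def full_loglik(theta):
        return sum(loglik(theta, p) for p in parts)

    return max(candidates, key=full_loglik)


parts = [[0.5, 1.5, 1.0], [2.0, 2.0, 2.0], [0.9, 1.1, 1.3]]
chosen = combine_by_full_likelihood(parts, exp_mle, exp_loglik)

# For comparison only: the full-data MLE, available in closed form here.
full_mle = sum(len(p) for p in parts) / sum(sum(p) for p in parts)
```

In this toy run the chosen estimate is the one from the third part, which differs from the full-data MLE; this is the loss of exactness that the text accepts in exchange for scalability.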
Consider a test function that provides a p-value for testing against . Then, based on a random partitioning of the data into balanced parts, a conservative combining algorithm combinePvals can be used for the corresponding solution.
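One conservative way to combine p-values from the parts is a Bonferroni-corrected minimum, in the spirit of minimum p-value methods. This is an assumption made for illustration and is not claimed to be the combining formula of the original text. The one-sided z-test with known standard deviation and the names `z_test_pvalue` and `combine_pvalues_conservative` are likewise hypothetical.

```python
import math


def z_test_pvalue(part, mu0=0.0, sigma=1.0):
    """One-sided p-value for H0: mean <= mu0 against H1: mean > mu0,
    assuming a known standard deviation sigma."""
    n = len(part)
    z = (sum(part) / n - mu0) / (sigma / math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2.0))   # upper tail of N(0, 1)


def combine_pvalues_conservative(pvalues):
    """Bonferroni-corrected minimum p-value.

    Valid whatever the dependence between the parts, at the price of
    being conservative.
    """
    return min(1.0, len(pvalues) * min(pvalues))


parts = [[0.3, 1.1, 0.8, 1.4], [0.9, 0.2, 1.6, 0.7], [1.2, 0.4, 0.9, 1.5]]
pvalues = [z_test_pvalue(p) for p in parts]
combined = combine_pvalues_conservative(pvalues)
```

Rejecting when `combined` falls below the nominal level keeps the type I error at or below that level, because each part p-value is compared with the level divided by the number of parts.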
The context in which the variable selection problem has been addressed in recent literature is sometimes too idealistic for the Big Data paradigm, although there are some promising methods. The data generation process is assumed to provide information on a set of response variables and a fixed set of regressors. We might be interested in a subset of these variables that have an effect on the responses. The quality of the selected variables can be assessed by the proportion of variables wrongly selected. In a situation where the assumption of a homoscedastic uncorrelated linear model is valid, Barber and Candes knockoff proposed a method to select variables with control on the proportion of falsely discovered variables. This method is no doubt computationally heavy. The partition-repetition philosophy can be used to adapt this algorithm to achieve the same goal in the current context.
If the data generation process is well controlled, the above inference problems and solutions make sense. Some recent works in the area of regression Battey ; splitConquer focus on divide and conquer methods. Unfortunately, spurious correlations, noisy data, etc. are very common in the Big Data setting. In that case these naive solutions can badly misrepresent the truth. Data mining problems are more relevant in such a scenario. In a mining problem we are interested in the data itself without having to make any modeling assumption. Possible Big Data solutions to a few mining problems are discussed below.
An elaborate and critical discussion of the clustering problem in view of Big Data analytics can be found in a recent article by the authors combineCluster . In brief, the combining method would identify the unique clusters from the set based on a decision function that tells us to combine two results when they seem to form a single data cloud. The second stage is to make stable clusters based on some measure from the sets of clusterings .
Based on a random partitioning measure and a problem approach that separates the outliers () and the data () section, (i.e., ), the combining method would check the structure of the outliers from the individual parts and obtain the outliers of the whole data. The method should check whether outliers from one part belong to the data section of some other part, and also whether outliers from all the parts together form some data section. The second stage of combining would then pick out the stable outliers from all repetitions.
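A one-dimensional sketch of this outlier solution is given below. It rests on several assumptions that are not in the text: the problem approach flags points farther than `c` times the median absolute deviation from the part median, the data section of a part is summarized by the range of its unflagged points, data values are distinct, and stable outliers are those flagged in more than half of the repetitions. The check of whether the pooled outliers themselves form a data section is omitted for brevity. All function names are hypothetical.

```python
import random
import statistics
from collections import Counter


def split_part(part, c=3.0):
    """Problem approach on one part: separate (data section, outliers)."""
    med = statistics.median(part)
    mad = statistics.median(abs(x - med) for x in part)
    inliers = [x for x in part if abs(x - med) <= c * mad]
    outliers = [x for x in part if abs(x - med) > c * mad]
    return inliers, outliers


def first_stage(parts, c=3.0):
    """Keep a part's outlier only if it lies outside the data range
    of every other part."""
    splits = [split_part(p, c) for p in parts]
    ranges = [(min(inl), max(inl)) for inl, _ in splits]
    kept = set()
    for i, (_, outliers) in enumerate(splits):
        for x in outliers:
            inside_other = any(lo <= x <= hi
                               for j, (lo, hi) in enumerate(ranges) if j != i)
            if not inside_other:
                kept.add(x)
    return kept


def second_stage(outlier_sets):
    """Stable outliers: kept in more than half of the repetitions."""
    votes = Counter(x for s in outlier_sets for x in s)
    return {x for x, v in votes.items() if v > len(outlier_sets) / 2}


rng = random.Random(7)
data = [float(x) for x in range(10, 30)] + [100.0, -50.0]
repetitions = []
for _ in range(25):
    shuffled = data[:]
    rng.shuffle(shuffled)
    repetitions.append(first_stage([shuffled[0::2], shuffled[1::2]]))
stable = second_stage(repetitions)
```

The voting step is what removes points that look unusual only because of an unlucky partition, while gross outliers such as `100.0` and `-50.0` survive every repetition.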
Ramaswamy et al. combineOutlier discuss another Big Data solution to this mining problem, using a different partitioning method based on clustering the data, and van Stein et al. vanStein propose a local subspace-based solution to the outlier detection problem, which applies a combining strategy using global neighbourhoods. These methods can be viewed as special cases of our proposed framework.
First we consider the k-NN classifier, where finds the k nearest neighbours of a test data point () as,
For any partitioning , the problem is exactly solvable in a single repetition with a combining operator that picks the data points nearest to among the points. Subsequently the classifier is constructed by a second algorithm that simply checks which class has the maximum number of representatives among these data points.
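The exactness of the partitioned nearest-neighbour search can be seen in a short sketch. For brevity we assume one-dimensional features with absolute distance and data points stored as `(feature, label)` pairs; the names `local_neighbours` and `knn_classify` are hypothetical. Each part reports its own k nearest points, the combining step keeps the k nearest among these candidates, and a majority vote gives the class.

```python
import heapq
from collections import Counter


def local_neighbours(part, x, k):
    """Problem approach on one part: its k points nearest to x."""
    return heapq.nsmallest(k, part, key=lambda p: abs(p[0] - x))


def knn_classify(parts, x, k):
    """Combine the local candidates, keep the k nearest, then vote."""
    candidates = [p for part in parts for p in local_neighbours(part, x, k)]
    nearest = heapq.nsmallest(k, candidates, key=lambda p: abs(p[0] - x))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest


data = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (6.0, "b"),
        (7.0, "b"), (8.0, "b"), (2.5, "b")]
parts = [data[0:3], data[3:5], data[5:7]]

label, nearest = knn_classify(parts, x=2.4, k=3)

# Reference answer computed on the undivided data.
brute_force = heapq.nsmallest(3, data, key=lambda p: abs(p[0] - 2.4))
```

Any global nearest neighbour is necessarily among the k nearest points of its own part, which is why a single repetition of any partitioning suffices here.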
Data is the lubricant that keeps the machinery of statistics running. It is no longer a topic of debate that the way data is generated and collected in modern times is drastically different from what statisticians are used to dealing with. Statistics should adapt to this change and thereby assist the growing mass of data analytic work.
The main contribution of this article is to suggest a basis of statistical theory for present-day data analytic work. In composing the theory we have tried to stay true to the practical nature of a data science job. This formulation proposes a divide and conquer algorithm (either a partition-repetition or a subsampling method). More importantly, it respects the fact that more often than not we have no control over the data generation process. We have also tried to encompass all possible data analytic problems. A range of such data analytic problems is discussed from the perspective of our formulation.
Successful use of statistical theory in data analysis requires understanding the field of ‘Big Data’. Rather than insisting on developing methods and elaborate theories based on idealistic assumptions, we have kept their applicability in mind. Our proposed framework encompasses statistical analyses of the majority of problems in view of the complex characteristics of Big Data, and can be extended further while keeping its compatibility with modern advances in the computational world.
- (1) R.F. Barber, E. Candes, Controlling the false discovery rate via knockoffs, The Annals of Statistics 43(5) (2015) 2055–2085. doi:10.1214/15-AOS1337.
- (2) H. Battey, J. Fan, H. Liu, J. Lu, Z. Zhu, Distributed estimation and inference with statistical guarantees, arXiv preprint, 2015. arXiv:1509.05457v1.
- (3) G. Buehrer, R.L. de Oliveira, D. Fuhry, S. Parthasarathy, Towards a parameter-free and parallel itemset mining algorithm in linearithmic time, 2015 IEEE 31st International Conference on Data Engineering, Seoul (2015) 1071–1082. doi:10.1109/ICDE.2015.7113357.
- (4) T. Calders, C. Garboni, B. Goethals, Efficient pattern mining of uncertain data with sampling, in: M.J. Zaki, J.X. Yu, B. Ravindran, V. Pudi, (Eds.) PAKDD 2010, Part I. LNCS (LNAI), 6118, Springer, Heidelberg, 2010, pp. 480–487. doi:10.1007/978-3-642-13657-3_51.
- (5) X. Chen, M. Xie, A split-and-conquer approach for analysis of extraordinarily large data, Statistica Sinica 24 (2014) 1655–1684. doi:10.5705/ss.2013.088.
- (6) M. Davidian, Aren’t we data science, Amstat News, (2013) 433-435.
- (7) J. Fan, Y. Fan, High dimensional classification using features annealed independence rules, The Annals of Statistics 36 (2008) 2605–2637. doi:10.1214/07-AOS504.
- (8) J. Fan, F. Han, H. Liu, Challenges of big data analysis, National Science Review 1(2) (2014) 293–314. doi:10.1093/nsr/nwt032.
- (9) J. Fan, Q.M. Shao, W.X. Zhou, Are discoveries spurious? Distributions of maximum spurious correlations and their applications, arXiv preprint, 2015. arXiv:1502.04237.
- (10) D. Fisher, R. DeLine, M. Czerwinski, S. Drucker, Interactions with big data analytics, Interactions 19(3) (2012) 50–59. doi:10.1145/2168931.2168943.
- (11) L. Hall, N. Chawla, K.W. Bowyer, Combining decision trees learned in parallel, in: Working Notes of the KDD-97 Workshop on Distributed Data Mining, 1998, pp. 10–15.
- (12) J.P. Huang, Big data need physical ideas and methods, arXiv preprint, 2014. arXiv:1412.6848.
- (13) A. Katal, M. Wazid, R.H. Goudar, Big data: Issues, challenges, tools and good practices, in: 2013 Sixth International Conference on Contemporary Computing (IC3), 2013, pp. 404–409. doi:10.1109/IC3.2013.6612229.
- (14) B. Karmakar, I. Mukhopadhyay, An efficient partition – repetition approach in clustering of big data, in: S. Pyne, B.L.S. Prakasa Rao, S.B. Rao (Eds.), Big data analytics: Methods and applications, Springer India, New Delhi, 2016, pp. 75–93. doi:10.1007/978-81-322-3628-3_5.
- (15) A. Kleiner, A. Talwalkar, P. Sarkar, M.I. Jordan, A scalable bootstrap for massive data, J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 (2014) 795–816. doi:10.1111/rssb.12050.
- (16) D. Petcu, et al., On processing extreme data, Scalable Computing: Practice and Experience 16(4) (2015) 467–489.
- (17) S. Ramaswamy, R. Rastogi, K. Shim, Efficient algorithms for mining outliers from large data sets, SIGMOD Rec. 29(2) (2000) 427–438. doi:10.1145/335191.335437.
- (18) D.A. Reed, J. Dongarra, Exascale computing and big data, Communications of the ACM 58(7) (2015) 56–68. doi:10.1145/2699414.
- (19) E.D. Schifano, J. Wu, C. Wang, J. Yan, M.H. Chen, Online updating of statistical inference in the Big Data setting, arXiv preprint, 2015. arXiv:1505.06354.
- (20) L.H.C. Tippett, The methods of statistics, Williams & Norgate, London, 1931.
- (21) B. van Stein, M. van Leeuwen, B. Bäck, Local subspace-based outlier detection using global neighbourhoods, arXiv preprint, 2016. arXiv:1611.00183v1.
- (22) X. Wang, F. Guo, K.A. Heller, D.B. Dunson, Parallelizing MCMC with random partition trees, Advances in Neural Information Processing Systems, (2015) 451–459.
- (23) T. Zhao, G. Cheng, H. Liu, A partially linear framework for massive heterogeneous data, The Annals of Statistics 44(4) (2016) 1400–1437. doi:10.1214/15-AOS1410.