Analytics on large collections of data has been a topic of vast interest in recent years. Although data analysis has always been central to the data management community, the prevalence of machine learning and statistical systems and packages has contributed to this interest. As a result, several recent lines of research across communities aim to engineer popular machine learning techniques, at both the algorithmic and the systems level, to scale to large data collections [2, 13, 21, 16].
Data analytics tasks, however, are rarely run in isolation. Typically, an analysis workload consists of applying an algorithm (e.g., a machine learning algorithm or statistical operation) to a large data set, building a model, and subsequently refining the operation based on the results of previous steps. For example, consider building a model (e.g., a regression) on a data set produced during the first two weeks of a month (e.g., sales data as it relates to various traffic parameters and promotional activities on a web site). Based on the results of the operation (e.g., regression parameters, error, etc.), one decides to run an additional regression on the data set representing the entire month. Alternatively, during a data exploration task, one creates a model for a year's worth of data collected for a service, only to decide to drill down and build a model for the second month of the year, which seems to present an anomaly for the given model fit.
It is evident that analysis tasks are typically part of a larger analysis workload. Moreover, exploratory tasks may involve extending or refining previously completed tasks. This behavior reveals dependencies among the steps of an analysis workload, and such dependencies expose opportunities for work sharing across tasks. For example, one may be able to reuse the model for the first two weeks of the month instead of building the model for the entire month from scratch. Such reuse could be achieved by incrementally updating the current model with additional data. Alternatively, if the model for the subsequent two weeks of the month is available, the desired model for the month could be built by combining the two models as opposed to building it from scratch. Such an option is advantageous because the models are already built and one simply derives a new one without accessing possibly large collections of data. In a similar fashion, we may be able to reuse the model built for a month to derive the model for the first two weeks of the month by removing the last two weeks' worth of data from the model, instead of building the desired model from scratch.
These examples reveal two basic observations that we explore further in this paper. First, analysis workloads consisting of multiple modelling tasks are amenable to work sharing across tasks. In particular, one may be able to reuse models previously built on a data set in order to derive new models on demand. Second, incremental updates (inserting or deleting data) are operations that may help derive a new model from an existing one. It is natural to expect that some models enable work sharing more easily than others. Some models, for example, may allow us to derive a new model by "extending" (with new data) or "shrinking" (removing data) the current model and still obtain exactly the model we would have derived by building it from scratch from the base data. Other models allow this only approximately. At the same time, from a performance standpoint it may not always be beneficial to derive a new model by adding data to or deleting data from an existing one. We expect that in some cases reusing an existing model is beneficial (the model can be built much faster), while in other cases building the model from scratch is the faster option.
Currently, widely used systems for data analysis tasks (e.g., R) do not take advantage of such dependencies and inherent relationships across the operations of a data analytics workload. An analyst has to be aware of work sharing and optimization opportunities and express them explicitly in code, which is not an ideal solution.
In this paper we initiate a study to explore these possibilities. We introduce model materialization and incremental model reuse as first-class citizens in the execution of an analytical workload. By model materialization we mean that a model can be stored after it is built so that it can be considered when generating other models. Since storing a model requires space, we incur a storage cost, but we aim to offset this cost with increased performance in executing subsequent operations. By incremental model reuse we mean that, when deciding how to build a model required by an analyst, we consider previously built models as candidates from which to generate it. Thus, we decide whether to reuse existing models and/or adjust them incrementally, or to build the model from scratch. The decision is typically based on performance, and we aim to make the choice that builds the model fastest. Towards this goal we adopt a cost model that aids in this decision, and we develop optimization frameworks that decide which models to use and which actions to take, with the objective of producing the resulting model at the smallest cost.
More specifically in this paper we make the following contributions:
We introduce model materialization and incremental model reuse as frameworks to be considered during the execution of an analysis workload.
Using linear regression and Naive Bayes as examples, we demonstrate how these common models can be cast in our framework. More specifically, we establish that incremental model reuse and model materialization offer large performance benefits while guaranteeing that models are constructed without loss of accuracy.
We introduce an algorithm that, given a collection of materialized linear regression or Naive Bayes models, chooses the best models to reuse and the suitable operations to modify them, deriving the desired target model with minimal cost.
Using logistic regression as an example, we demonstrate that incremental model reuse and model materialization offer large performance benefits while guaranteeing that models are constructed with a quantifiable loss in accuracy.
We introduce an algorithm that, given a collection of logistic regression models, chooses the best models to reuse and the suitable operations to modify them, deriving the desired target model with minimal cost.
We present the results of an extensive performance comparison demonstrating the performance benefits of our approach under varying parameters of interest.
This paper is organized as follows: Section 2 presents introductory material and basic notation. Section 3 demonstrates incremental manipulation of linear regression and Naive Bayes models, followed by Section 4, which treats the case of logistic regression models. Section 5 introduces our optimization framework, followed by Section 6, which details an empirical evaluation of the proposal. Section 7 discusses related work and Section 8 concludes the paper.
2.1 Linear Regression
Linear regression models the relationship between a scalar dependent variable and one or more independent variables. Consider a data set of $n$ records; each record consists of a $d$-dimensional feature vector of independent variables denoted by $x$ and a target dependent variable $y$. Generally, a linear regression takes the following form:
$$y = w^T x + \epsilon$$
where $w$ is the weight vector to be estimated and $\epsilon$ is an error term. Usually, the weight parameters are learned by minimizing the sum of squared errors. An $L_2$-regularization term is added to avoid over-fitting of the model. The solution thus obtained has a closed form and is represented as:
$$w = (X^T X + \lambda I)^{-1} X^T Y$$
where $X$ is the $n \times d$ matrix of the input vectors, $Y$ is the $n \times 1$ matrix of the target values and $\lambda$ is the regularization parameter.
2.2 Naive Bayes Classifier
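Naive Bayes classifiers are simple probabilistic models that assume conditional independence between every pair of features given the class label. Albeit simple, Naive Bayes models perform very well in many classification problems. Given a class variable $y$ and a set of predictor variables $x_1, \dots, x_d$, Bayes' theorem states that
$$P(y \mid x_1, \dots, x_d) = \frac{P(y)\, P(x_1, \dots, x_d \mid y)}{P(x_1, \dots, x_d)}$$
Under the naive assumption, and given that $P(x_1, \dots, x_d)$ is constant for a particular training set, we can conclude that
$$P(y \mid x_1, \dots, x_d) \propto P(y) \prod_{i=1}^{d} P(x_i \mid y)$$
$P(y)$ and $P(x_i \mid y)$ can be calculated from training data by maximum likelihood estimation. The class probability $P(y = c) = n_c / n$ is simply the relative frequency of class $c$ in the training set, where $n_c$ is the number of training examples which have class $c$ and $n$ is the total number of training examples.
Different variants of the classifier are obtained depending upon the choice of distribution for the conditional density $P(x_i \mid y)$. For continuous features a Gaussian density can be used:
$$P(x_i \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{ic}^2}} \exp\left(-\frac{(x_i - \mu_{ic})^2}{2\sigma_{ic}^2}\right)$$
where $\mu_{ic}$ is the mean of feature $i$ in samples with class label $c$ and $\sigma_{ic}^2$ is its variance. This is often referred to as Gaussian Naive Bayes. In the case of categorical features the multinomial distribution is a preferred choice for the conditional density. The distribution is parametrized by vectors $\theta_c = (\theta_{c1}, \dots, \theta_{cd})$ for each class $c$, where $d$ is the dimension of the feature vector and $\theta_{ci}$ is the probability of feature $i$ appearing in a sample belonging to class $c$. $\theta_{ci}$ can be calculated by a smoothed version of maximum likelihood estimation:
$$\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha d}$$
where $N_{ci}$ is the sum of the values of feature $i$ over the training samples of class $c$, $N_c = \sum_{i=1}^{d} N_{ci}$, $\alpha \ge 0$ is the smoothing parameter, and $n$ is the total number of points in the training set. These counters are computed for each class in the training data.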
2.3 Logistic Regression
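Logistic regression is a linear classifier belonging to the family of Generalized Linear Models. Let $y$ denote a class variable and $x$ represent a feature vector; then logistic regression can be formally represented as an optimization problem that minimizes a loss function to identify the model parameters $w$. The loss function has the following form:
$$F(w) = \sum_{i=1}^{n} \ell(w; x_i, y_i) + R(w)$$
A very common choice for the function $\ell$ in logistic regression, with labels $y_i \in \{0, 1\}$, is the cross entropy loss function:
$$\ell(w; x_i, y_i) = -y_i \log \sigma(w^T x_i) - (1 - y_i) \log\left(1 - \sigma(w^T x_i)\right)$$
together with a regularization function such as $R(w) = \lambda \lVert w \rVert_2^2$. Here $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic function.
The Stochastic Gradient Descent (SGD) algorithm is used to optimize the loss function to determine the model parameters. SGD initializes the model parameter to some $w_0$ and then updates the parameter as
$$w_{t+1} = w_t - \eta \nabla F_i(w_t)$$
where $\eta$ is the learning rate and $\nabla F_i(w_t)$ is the gradient of the convex loss function computed using only the $i$-th sample. On large data sets SGD can often reach a good solution in a single pass over the data, although convergence within one pass is not guaranteed in general.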
3 An Incremental Approach
We now demonstrate how model materialization and incremental model reuse can be supported for each type of model we consider. We discuss how one can combine two models built on different data sets to produce a new model on the union of the data sets. We also discuss how an existing model can be manipulated (by adding or removing data) to produce a new one. Formally, let $M_1$ be a model on data set $D_1$ and $M_2$ the model on data set $D_2$. We assume that the data sets $D_1$ and $D_2$ have the same properties. We discuss two of the machine learning models described in the previous section, linear regression and Naive Bayes.
3.1 Model Materialization
A typical machine learning model is characterized by its parameters. In order to support incremental updates to a given model extra information has to be maintained depending on the model. We show that while materializing a model we can also materialize extra information that would be sufficient in supporting incremental updates. This information varies across different types of models as discussed further in this section.
3.1.1 Linear Regression
Let $D$ be a data set of $n$ points and let $M$ represent a machine learning model built on this data set.
The parameters of a linear regression are given by the closed-form solution of Section 2.1, $w = (X^T X + \lambda I)^{-1} X^T Y$. The equation can be considered as a combination of two terms, $A = X^T X$ and $B = X^T Y$. Simplifying the terms,
$$A_{jk} = \sum_{i=1}^{n} x_{ij} x_{ik}, \qquad B_j = \sum_{i=1}^{n} x_{ij} y_i$$
where $A$ is a $d \times d$ matrix and each term is the sum of products of two features of the feature vector over the training samples. $B$ is a $d \times 1$ matrix where each term is the sum of products of a feature and the target value. We maintain the matrices $A$ and $B$ along with the model parameters while building a model. Thus we end up maintaining $d^2 + d$ extra values. It is important to note that the amount of extra information we have to maintain is independent of the number of training samples ($n$). Given both components $A$ and $B$, we can compute the model parameters at any point using the closed-form solution. Later on we will show how we can support incremental updates to a linear regression model utilizing this information.
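The materialization step above can be illustrated with a minimal sketch. It assumes the two stored statistics are $X^T X$ and $X^T Y$ as described; the function names `sufficient_stats` and `solve_ridge`, the toy data, and the use of plain Gaussian elimination are illustrative choices, not part of the paper's implementation.

```python
def sufficient_stats(X, y):
    """Return (A, b) with A = X^T X (d x d) and b = X^T y (length d)."""
    d = len(X[0])
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for row, target in zip(X, y):
        for j in range(d):
            b[j] += row[j] * target
            for k in range(d):
                A[j][k] += row[j] * row[k]
    return A, b


def solve_ridge(A, b, lam=0.0):
    """Solve (A + lam * I) w = b by Gaussian elimination with partial pivoting."""
    d = len(b)
    # Augmented matrix [A + lam*I | b]; working on a copy keeps the stored
    # statistics intact so they can be reused later.
    M = [[A[i][j] + (lam if i == j else 0.0) for j in range(d)] + [b[i]]
         for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, d):
            factor = M[r][col] / M[col][col]
            for c in range(col, d + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * d
    for i in reversed(range(d)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, d))
        w[i] = (M[i][d] - s) / M[i][i]
    return w


# Toy data generated by y = 1 + 2*x; the first column is a constant bias feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
A, b = sufficient_stats(X, y)   # d*d + d numbers, regardless of len(X)
w = solve_ridge(A, b)           # recovers weights close to [1.0, 2.0]
```

Only `A` and `b` need to be stored with the model; the weights can be recomputed from them at any time without touching the training data.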
3.1.2 Naive Bayes
As discussed in Section 2.2, Gaussian Naive Bayes is parametrized by the following variables: the class prior probabilities $P(y = c)$, and the parameters $\mu_{ic}$ and $\sigma_{ic}^2$ describing the conditional density. These parameters can be computed as shown below:
$$P(y = c) = \frac{n_c}{n}, \qquad \mu_{ic} = \frac{S_{ic}}{n_c}, \qquad \sigma_{ic}^2 = \frac{Q_{ic}}{n_c} - \mu_{ic}^2$$
We maintain $n_c$ for each class $c$ in the data set, which is the number of samples belonging to that class. In order to calculate $\mu_{ic}$ we maintain the sum of feature $i$ over the samples in class $c$, represented by $S_{ic}$. Similarly, for $\sigma_{ic}^2$ we maintain the sum of squares of the values of feature $i$ in class $c$, represented by $Q_{ic}$. From these statistics we can calculate all the parameters of the model. Assuming we have $k$ classes in total and $d$ features, we need to maintain $k(2d + 1)$ values. This is again independent of the number of training examples ($n$).
The multinomial Naive Bayes model has the same class prior probabilities $P(y = c)$. In addition we have to maintain $\theta_{ci}$, the probability of feature $i$ appearing in a sample of class $c$, for which we also need to store $N_{ci}$, the sum of the values of feature $i$ over the samples of class $c$, and $N_c = \sum_i N_{ci}$. For the multinomial model we therefore maintain $k(d + 2)$ values: $n_c$, $N_c$ and the $d$ counters $N_{ci}$ for each class.
3.2 Incremental Model Updates
In this section we demonstrate how incremental changes (data additions or deletions) can be supported by the two models considered. Formally, let $M$ be a model built on data set $D$ consisting of $n$ points. We demonstrate the incremental changes by adding a point $(x_{n+1}, y_{n+1})$ to the data set $D$, where $x_{n+1}$ is a $d$-dimensional feature vector and $d$ is the dimension of the data. We wish to find the parameters of the new model $M'$ for the data set $D' = D \cup \{(x_{n+1}, y_{n+1})\}$ of size $n + 1$.
3.2.1 Linear Regression
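For the linear regression model we have already computed the matrices $A = X^T X$ and $B = X^T Y$ on data set $D$. We calculate the corresponding matrices $A'$ and $B'$ on the data set extended with a new point $(x_{n+1}, y_{n+1})$ by updating $A$ and $B$ to reflect the new point. The equations below show how to update the matrices $A$ and $B$:
$$A'_{jk} = A_{jk} + x_{n+1,j}\, x_{n+1,k}, \qquad B'_j = B_j + x_{n+1,j}\, y_{n+1}$$
Deletions are handled similarly, by subtracting the contribution of the deleted point instead of adding it. Larger collections of points can be added or deleted in the same fashion. Other statistics computed while building regression models, such as the ANOVA table and AIC, which explain the goodness of fit of the model, can also be incrementally maintained in a similar fashion. Details have been omitted for brevity.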
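As a rough illustration of the update rule, the sketch below adds or removes one point from the stored statistics. The function name `update_stats` and the `sign` parameter are hypothetical; the sketch assumes the statistics are kept as plain nested lists.

```python
def update_stats(A, b, x, y, sign=1):
    """Add (sign=1) or remove (sign=-1) one point (x, y), in place.

    A holds the sums of pairwise feature products and b holds the sums of
    feature-target products over the points currently in the model.
    """
    d = len(x)
    for j in range(d):
        b[j] += sign * x[j] * y
        for k in range(d):
            A[j][k] += sign * x[j] * x[k]
    return A, b


# Statistics of the two points ([1, 1], 1) and ([1, 2], 3).
A = [[2.0, 3.0], [3.0, 5.0]]
b = [4.0, 7.0]

update_stats(A, b, [1.0, 3.0], 5.0)            # insert a third point
after_insert = ([row[:] for row in A], b[:])

update_stats(A, b, [1.0, 3.0], 5.0, sign=-1)   # delete it again
after_delete = ([row[:] for row in A], b[:])
```

Each update touches $d^2 + d$ numbers and never reads the points already absorbed into the model, which is what makes the approach independent of the data set size.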
3.2.2 Naive Bayes Classifier
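For the Naive Bayes model we have computed, for each class $c$, the count $n_c$, the feature sums $S_{ic}$ and the sums of squares $Q_{ic}$ on $D$. When a new point $(x_{n+1}, y_{n+1})$ with class $c = y_{n+1}$ is added, we update these statistics according to the equations below:
$$n'_c = n_c + 1, \qquad S'_{ic} = S_{ic} + x_{n+1,i}, \qquad Q'_{ic} = Q_{ic} + x_{n+1,i}^2$$
Given the updated statistics, we can compute the parameters of the updated model. Similar observations hold for deleting data as well as for operating on collections of points.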
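The same idea can be sketched for Gaussian Naive Bayes. The class name `GaussianNBStats` and its methods are hypothetical; the sketch assumes the maintained statistics are the per-class count, the per-feature sum and the per-feature sum of squares, and that the variance is the maximum likelihood estimate (dividing by the class count).

```python
from collections import defaultdict


class GaussianNBStats:
    """Per-class counts, feature sums and sums of squares for Gaussian Naive Bayes."""

    def __init__(self, d):
        self.count = defaultdict(int)                 # samples per class
        self.total = defaultdict(lambda: [0.0] * d)   # sum of each feature per class
        self.sq = defaultdict(lambda: [0.0] * d)      # sum of squares per class

    def update(self, x, c, sign=1):
        """Add (sign=1) or delete (sign=-1) one sample x with class label c."""
        self.count[c] += sign
        for i, v in enumerate(x):
            self.total[c][i] += sign * v
            self.sq[c][i] += sign * v * v

    def params(self, c):
        """Return (prior, means, variances) for class c from the stored statistics."""
        n_c = self.count[c]
        n = sum(self.count.values())
        means = [s / n_c for s in self.total[c]]
        variances = [q / n_c - m * m for q, m in zip(self.sq[c], means)]
        return n_c / n, means, variances


stats = GaussianNBStats(d=1)
for x, c in [([1.0], "a"), ([3.0], "a"), ([10.0], "b")]:
    stats.update(x, c)
before = stats.params("a")

stats.update([5.0], "a")             # insert one sample into class "a"
grown = stats.params("a")

stats.update([5.0], "a", sign=-1)    # delete it again
after = stats.params("a")
```

Computing the variance as the mean of squares minus the squared mean is convenient for incremental maintenance, but it can lose precision when the variance is small relative to the mean; a production implementation might prefer a numerically stabler update.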
3.3 Combining Models
Let $D$ be the underlying data set of $n$ points. Assume that points in $D$ are associated with a unique identifier, namely a point is represented as $(id, y, x)$, where $id$ is the identifier, $y$ the dependent (class) variable and $x$ the feature vector as before. To simplify notation for the remainder of the paper, we assume, without loss of generality, that the unique identifier imposes a natural ordering on $D$. For example, $id$ could be a time-stamp associated with the point (indicating the time it was generated). Casting our framework for the case where the points of the underlying data set do not have a unique ordering is possible; it requires, however, a different methodology, and we defer its description to future work. For brevity, we use the same symbol to denote both a model and the data set (a subset of $D$) on which the model is built. A sequence of data point identifiers determines a model descriptor $[l, u]$, which is a range of points in $D$. Let $D_1$ and $D_2$ be data sets represented by the model descriptors $[l_1, u_1]$ and $[l_2, u_2]$. Our aim is to compute the model on $D_1 \cup D_2$.
We discuss the linear regression case. Naive Bayes models are handled similarly, so we omit the description for brevity. Let $M_1$ and $M_2$ be two linear regression models built on data sets $D_1$ and $D_2$. For each model we maintain the associated matrices $A = X^T X$ and $B = X^T Y$ (written $A^{(1)}, B^{(1)}$ and $A^{(2)}, B^{(2)}$) along with the model descriptor signifying the data set on which it was calculated. Computing the regression model on $D_1 \cup D_2$ involves considering two cases.
Case 1: The two data sets do not have any points in common, i.e., $D_1 \cap D_2 = \emptyset$; this case can be easily identified by comparing the model descriptors of the two data sets. A specific entry in the matrix $A^{(1)}$ for model $M_1$ has the form $\sum_{i \in D_1} x_{ij} x_{ik}$, where $j$ and $k$ are any two features. Thus, the corresponding matrix on data set $D_1 \cup D_2$ can be computed as
$$A^{(1 \cup 2)}_{jk} = A^{(1)}_{jk} + A^{(2)}_{jk}$$
which is essentially adding the corresponding elements of the matrices of the two models directly.
Case 2: The two data sets have points in common, i.e., $D_1 \cap D_2 \neq \emptyset$; in this case the points common to both data sets can be determined from the corresponding model descriptors. If we directly add the two models, the common points will be accounted for twice. Thus, we need to make sure that points represented in both models are accounted for only once in the final model. We compute the matrix on data set $D_1 \cup D_2$ as follows:
$$A^{(1 \cup 2)}_{jk} = A^{(1)}_{jk} + A^{(2)}_{jk} - \sum_{i \in D_1 \cap D_2} x_{ij} x_{ik}$$
The matrix $B$ for $D_1 \cup D_2$ can be computed in a similar fashion. Notice that in this case we need to retrieve the points in $D_1 \cap D_2$ from $D$. This incurs an I/O cost that needs to be accounted for (see Section 5).
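The two cases can be sketched as follows. This is an illustration under stated assumptions: identifiers are consecutive integers, a model is a tuple `(lo, hi, A, b)` with an inclusive identifier range, and the base data is an in-memory dictionary standing in for the disk-resident data set. All names (`stats`, `combine`, `materialize`) are hypothetical.

```python
def stats(points, d):
    """Return A = sum of x x^T and b = sum of x*y over (x, y) pairs."""
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for x, y in points:
        for j in range(d):
            b[j] += x[j] * y
            for k in range(d):
                A[j][k] += x[j] * x[k]
    return A, b


def materialize(lo, hi, data):
    """Build the statistics for the inclusive id range [lo, hi] from base data."""
    d = len(data[lo][0])
    A, b = stats([data[i] for i in range(lo, hi + 1)], d)
    return (lo, hi, A, b)


def combine(m1, m2, data):
    """Combine two models whose id ranges overlap or are adjacent."""
    lo1, hi1, A1, b1 = m1
    lo2, hi2, A2, b2 = m2
    lo, hi = max(lo1, lo2), min(hi1, hi2)    # candidate overlap range
    if lo > hi + 1:
        raise ValueError("ranges leave a gap; the union is not a single range")
    d = len(b1)
    A = [[A1[j][k] + A2[j][k] for k in range(d)] for j in range(d)]
    b = [b1[j] + b2[j] for j in range(d)]
    if lo <= hi:
        # Case 2: shared points were counted twice, so fetch them from the
        # base data and subtract their contribution once.
        Ao, bo = stats([data[i] for i in range(lo, hi + 1)], d)
        A = [[A[j][k] - Ao[j][k] for k in range(d)] for j in range(d)]
        b = [b[j] - bo[j] for j in range(d)]
    return (min(lo1, lo2), max(hi1, hi2), A, b)


# Base data: id -> (feature vector, target).
data = {i: ([1.0, float(i)], 2.0 * i + 1.0) for i in range(10)}
overlapping = combine(materialize(0, 5, data), materialize(4, 9, data), data)
disjoint = combine(materialize(0, 4, data), materialize(5, 9, data), data)
from_scratch = materialize(0, 9, data)
```

In the overlapping case only the two shared points (ids 4 and 5) are read from the base data, whereas building from scratch reads all ten.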
4 Incremental Logistic Regression Models
Stochastic Gradient Descent (SGD) is a popular optimization framework for estimating the parameters of a logistic regression model. SGD is a sequential algorithm that updates the weight parameters at each iteration until convergence. Because it is inherently sequential, SGD is difficult to scale to large data sets. Recognizing the importance of analytical tasks on massive data sets, recent work has established methodologies to scale SGD to realistic data sets [16, 21]. We adopt such methodologies and extend them to fit our framework.
A generic loss function for the logistic regression model is given in Section 2.3. SGD is applied to identify the model parameters which minimize the loss function. We describe a variant of the SGD algorithm called the Mixture Weight Method. Let us consider a sample $S$ of $n$ points formed by $p$ sub-samples $S_1, \dots, S_p$ of $m$ points each, drawn i.i.d., so that $n = pm$. Algorithm 1 outlines the steps for executing the Mixture Weight Method. Notice that the outer loop of the algorithm can be executed in parallel, and as a result the approach can easily utilize multiple processors if required.
Here $F_{S_k}$ is the optimization function for sub-sample $S_k$ and $T$ is the number of iterations required to converge. Thus, Algorithm 1 computes the model parameters on subsets of the data and then averages the parameters across all the subsets to compute the parameters for the complete data set. Algorithm 1 has been shown to have good convergence properties; under certain assumptions, a relationship can be established between the averaged parameters and the parameters computed by executing SGD on the entire data set.
We extend this idea in our framework as well. Let $D$ be an underlying data set of size $n$, where a point is represented as $(id, y, x)$, with $id$ the identifier, $y$ the dependent (class) variable and $x$ the feature vector as before.
A request to create a logistic regression model on data set $Q$ (the query set) is represented by a range $[l, u]$ of identifier values over $D$ such that $l \le u$. The query data set is segmented into smaller chunks of equal size $m$, with the obvious assumption that $m \le |Q|$. This results in $p = |Q| / m$ chunks of equal size. These chunks are created in increasing order of ID values. Chunk $C_j$ is given by the range
$$[\, l + (j - 1) m,\; l + j m - 1 \,]$$
for $j = 1, \dots, p$. Assuming that the logistic regression models for each chunk are available, they are combined in the spirit of Algorithm 1 to produce the model for $Q$. If none of the chunks is available, a request to build the model for $Q$ can utilize the base data to build the logistic regression model. At the same time, the chunks are generated for $Q$, a logistic regression model is built for each of them, and the results are materialized in order to benefit future model creation requests.
Any request to build a logistic regression model for a data set $Q$ first tests whether $Q$ contains any of the chunks for which a model has already been materialized. If it does, we can readily utilize the parameters of those chunks and save computation time. Any parts of $Q$ that are not currently "covered" by existing chunks have to be computed from the base data set. Thus, we retrieve the parts of $Q$ for which we do not have a model, generate chunks of the chosen chunk size and compute the model parameters for them. Finally, we average the parameters from all chunks to compute the model. Algorithm 2 presents our overall approach.
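The chunk-and-average procedure can be sketched as below. This is a simplified illustration and not a transcription of Algorithm 2: it assumes consecutive integer identifiers, labels in {0, 1}, a single SGD pass with a fixed learning rate, and it reuses a cached chunk only when its range coincides exactly with a chunk of the query. The names `sgd_logistic`, `build_model` and `cache`, and the hyperparameter values, are hypothetical.

```python
import math


def sgd_logistic(points, lr=0.1, lam=0.01):
    """One SGD pass over (x, y) pairs with y in {0, 1}; returns the weights."""
    d = len(points[0][0])
    w = [0.0] * d
    for x, y in points:
        z = sum(wj * xj for wj, xj in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        # Per-sample gradient of the cross-entropy loss plus an L2 penalty.
        w = [wj - lr * ((p - y) * xj + lam * wj) for wj, xj in zip(w, x)]
    return w


def build_model(lo, hi, chunk, data, cache):
    """Average per-chunk weights over the inclusive id range [lo, hi]."""
    weights, pos = [], lo
    while pos <= hi:
        end = min(pos + chunk - 1, hi)
        key = (pos, end)
        if key not in cache:
            # Not covered by a materialized chunk: train from base data
            # and materialize the result for future requests.
            cache[key] = sgd_logistic([data[i] for i in range(pos, end + 1)])
        weights.append(cache[key])
        pos = end + 1
    d = len(weights[0])
    return [sum(w[j] for w in weights) / len(weights) for j in range(d)]


# Toy base data: id -> ([bias, feature], label).
data = {i: ([1.0, (i % 7) - 3.0], 1 if (i % 7) > 3 else 0) for i in range(60)}
cache = {}
w_first = build_model(0, 39, 10, data, cache)    # trains four chunks
chunks_after_first = len(cache)
w_second = build_model(20, 59, 10, data, cache)  # reuses two, trains two
chunks_after_second = len(cache)
```

The second request reads only ids 40 to 59 from the base data; the parameters for ids 20 to 39 come from the cache. Only additions of chunks are possible here, since a chunk's contribution cannot be subtracted from an averaged model in the way it can for the linear regression statistics.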
Let denote the mixture of weight vector obtained by applying Algorithm 2 on a model query and be the weight vector computed by applying SGD on . Then, for any , with probability at least , the following inequality holds:
where is the bound for the norm of feature vectors, is the regularization constant, is the number of chunks of created in step 10 of Algorithm 2, is the size of each chunk and represents the probability with which this inequality holds. The proof of 1 follows the methodology presented in  and is available in the full version of the paper .
Note that, in contrast to the discussion of Section 3.2, for logistic regression models this framework supports adding points to an existing model but not deleting them. Thus we can construct new models only by adding points to existing models (combining existing chunks). This is inherent to the nature of the approximation used for logistic regression. As a result, the space of options to consider when creating a new model includes additions of points to an existing model, but not deletions.
5 Optimization Considerations
Given a collection of materialized models over a data set , it is evident that a request to create a new model can readily utilize existing models. We seek to understand the trade offs involved while building the new model . Several options are available including building by manipulating data from or utilizing materialized models directly and/or suitably adjusting them using data from .
Consider Figure 0(a). It depicts data set and four materialized models (). A request to build model is faced with numerous options. Using the materialized models to generate model , Equations 3, 4 and 5 show different ways in which this can be achieved
Equation 3 represents an execution strategy which will fetch models and combine them, then remove all points in the range of and (this constitutes incrementally updating, removing these points, from the combined model). This step consists of accessing and retrieving all points between and . In equation 4 instead of retrieving from , we compute that operation by manipulating (subtracting) models and . If the model allows (e.g., linear regression) we can subtract from and compute the model for directly. Similarly, Equation 5 represents another execution strategy which involves retrieving along with data points between and and manipulating them (incrementally updating, adding and removing points) to complete the model construction. Other choices are also possible including retrieving all points between from and computing the model directly from base data. In order to be able to quantify the merits of each choice, as is typical in cost based query optimization  we need to a) assess all possible choices efficiently and b) quantify the cost of each option in order to determine the least cost way to build the model.
The specifics of the cost model are orthogonal to our approach. The cost of using a materialized model depends on the type of model and on how its descriptor and parameters are stored, which may or may not involve disk access. In addition, retrieving data from $D$ typically involves disk access. The only requirement we impose on the cost model is that it be monotonic: all else being equal, retrieving a certain number of data points from disk should cost at least as much as retrieving fewer points. For the remainder of the paper we assume a cost model that is monotonic. To facilitate notation, the cost of using a materialized model is denoted as $C_M$ and the cost of retrieving $k$ data points from disk is denoted as $C_D(k)$.
Let be a collection of materialized models on data set . For a model , let be a model descriptor on which a new model has to be computed. and in this case express a range of data points on . We wish to identify the minimum cost collection of materialized models and/or data points from that would be used to construct the model for , .
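Definition. Let $[l_q, u_q]$ represent the model descriptor of a model $M_q$ which we wish to construct, and let $\mathcal{M}$ be the set of available materialized models. Then the set of relevant models $\mathcal{R}$ for $M_q$ is defined as follows:
If for a materialized model $M_i \in \mathcal{M}$ the descriptor of $M_i$ intersects $[l_q, u_q]$, then $M_i \in \mathcal{R}$.
If there exists $M_j \in \mathcal{R}$ such that the descriptor of a materialized model $M_i \in \mathcal{M}$ intersects the descriptor of $M_j$, then $M_i \in \mathcal{R}$.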
Intuitively the models in are relevant models because they either contain common data points with the ones of interest to and/or they are models that can be manipulated (by combinations of models or incremental updates of models) to produce models that assist in computing . As we can see in Figure 0(a) materialized models , contain data points common with while and can be manipulated along with to produce models relevant to the computation of . While computing , only relevant models will be part of .
The set of relevant models is important since it accurately reflects the set of models to be considered during the computation of . Instead of assessing all relevant models every time a new request for a model arises, we pre-process the collection of all materialized models to facilitate the derivation of for a given . Thus given we pre-process it to facilitate the computation of relevant models. Algorithm 3 presents the overall approach. The basic idea is to pre-process and create enhanced descriptors that are the union of multiple model descriptors. Such enhanced descriptors can facilitate quick search for relevant models.
Maintaining makes it easier to compute the set . When the descriptor of a model is provided, we compare it against the . If a descriptor intersects any of the descriptors in all the materialized models mapped to that descriptor become part of .
Algorithm 3 will produce the set of all models that should be considered in deriving model . Using the descriptors in we create a complete undirected graph where each node corresponds to the or values of the model descriptions in . As for our running example the set contains models to . Thus we add the and values of the descriptors of these materialized models. As we can see in figure 0(b) it contains to as nodes. An edge corresponds to the cost of building a model for the data set specified by the two nodes adjacent to . If materialized model exists for the data descriptor specified by the nodes adjacent to the edge then the cost of the edge is the cost of using model . If a model does not exist for that data set the cost of that edge is determined by the number of points in the range. In our example the solid edges in our graph represent the materialized models to . For all the other edges the cost is given by , where is the number of points in the interval represented by the edge. Given and values represent the source and destination respectively. These are shown as grey nodes in Figure 0(b).
Every path from source node to destination represents an execution strategy to construct model . Figure 0(c) illustrates how to convert a path on the graph to a set of operations that compute the model. Consider a path on the graph represented by the following sequence of nodes . We fetch four materialized models and for the edges and respectively. The edge does not correspond to any materialized model , thus cost of that edge is equivalent to fetching the corresponding data points from disk. The decision whether to manipulate an existing model by adding or removing data points from it is decided by the nodes of the edge. If we traverse the edge from to and then we remove points from the model otherwise we add data points. In our example edge (as indicated in Figure 0(a)) and that constitutes removing points. The total cost of a query path is given by
where is cost of each edge and is cost of merging two materialized models. The cost depends on the type of model under consideration. For example for linear regression the cost is outlined in section 3.3. It involves (after retrieving the model parameters) a simple manipulation of corresponding model representations. It is expected that the cost of merging two materialized models is much less than the cost of fetching models or the cost of fetching data points from the disk . Depending on how the model descriptors and model parameters are stored, retrieving them may not require any disk access. For example in the case of a linear regression model, the model descriptors would be just a range of values and the model parameters would be as outlined in Section 3.1.1.
It is evident that, by construction, the problem of identifying the minimum cost to construct the model is equivalent to identifying the shortest path from a single source in a graph with non-negative edge weights. Dijkstra's algorithm can be used to identify the optimal solution in $O(|E| + |V| \log |V|)$ time when implemented with a Fibonacci heap, where $|E|$ is the number of edges and $|V|$ is the number of vertices in the graph.
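The shortest-path formulation can be sketched as follows. The sketch makes several simplifying assumptions that are not from the paper: a fixed cost per materialized model, a data cost linear in the number of points, no merge cost, and a binary heap (Python's `heapq`) instead of a Fibonacci heap. The names `cheapest_plan`, `describe`, `model_cost` and `point_cost` are hypothetical.

```python
import heapq


def cheapest_plan(query, models, model_cost=1.0, point_cost=0.01):
    """Return (cost, path) of the cheapest way to cover the query range.

    Nodes are the range endpoints; an edge is cheap if a materialized model
    spans exactly those endpoints, otherwise it costs a disk read per point.
    """
    lo, hi = query
    nodes = sorted({lo, hi} | {e for m in models for e in m})
    mset = {tuple(sorted(m)) for m in models}

    def cost(a, b):
        key = (min(a, b), max(a, b))
        return model_cost if key in mset else point_cost * abs(a - b)

    dist = {v: float("inf") for v in nodes}
    prev = {}
    dist[lo] = 0.0
    heap = [(0.0, lo)]
    while heap:
        d_u, u = heapq.heappop(heap)
        if d_u > dist[u]:
            continue                      # stale heap entry
        for v in nodes:                   # complete graph over the endpoints
            if v == u:
                continue
            nd = d_u + cost(u, v)
            if nd < dist[v]:
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path = [hi]
    while path[-1] != lo:
        path.append(prev[path[-1]])
    return dist[hi], path[::-1]


def describe(path, models):
    """Translate a path into (action, source, lo, hi) steps."""
    mset = {tuple(sorted(m)) for m in models}
    steps = []
    for a, b in zip(path, path[1:]):
        source = "model" if (min(a, b), max(a, b)) in mset else "data"
        action = "add" if b > a else "remove"   # backward edges remove points
        steps.append((action, source, min(a, b), max(a, b)))
    return steps


models = [(0, 1000), (900, 2000)]
total, path = cheapest_plan((0, 2000), models)
plan = describe(path, models)
```

In this example the cheapest plan uses both materialized models and removes the 100 doubly counted points by reading them from the base data, rather than reading all 2000 points. For models that support only additions, the inner loop could be restricted to nodes `v > u` to obtain the directed variant.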
We presented the entire solution for models that support both addition and removal of points to derive new models, as is the case for linear regression and Naive Bayes. For logistic regression, removal of points is not supported by the method we use to approximate the regression. In this case we have to modify the algorithm slightly to enable optimization of logistic regression models as well. The changes are as follows:
During identification of the set of relevant models, we include only models whose descriptors are fully contained in the descriptor of the requested model.
The graph constructed contains only directed edges from a node $a$ to a node $b$ such that $a < b$.
These two changes enable Algorithm 4 to operate on logistic regression models and yield the least-cost options to construct such models as well.
In this section we present a detailed performance comparison of our approach against alternative approaches. We utilize materialized models to save processing costs while building new models for an incoming model construction query, as described in Section 5. The natural alternative is not to materialize models, but instead to build the new model directly from the raw data. We compare our approach against this baseline. Our aim in these experiments is three-fold: (a) highlight the factors that affect the performance of our materialization framework and the associated trade-offs, (b) detail the impact of our optimization framework in terms of its overheads and benefits, and (c) analyze the accuracy of the logistic regression materialization framework. Note that for linear regression and Naive Bayes, the models we construct are exactly the same as those constructed by the baseline, so there are no accuracy trade-offs in these cases.
Data. We test our framework utilizing synthetically generated data. Two different data sets are generated, one for the regression problem and one for the classification problem. The choice of synthetic data allows us to change various parameters during experimentation. In addition, the experiments focus on performance while scaling the size of the model; performance does not depend on the quality of the data but is governed by its size and type. The data is generated using publicly available synthesizers. Random noise and interdependencies among features are added while synthesizing the data to simulate real-world scenarios. In this section we present results using data sets of up to 5 million points with 10 features per point. We also tested all algorithms with larger synthetically generated data sets, and the trends observed were nearly the same. In addition, we utilized popular real data sets from the UCI Machine Learning Repository in our experiments, and in all cases the results are consistent with those presented herein for synthetic data sets.
Experimental Setup. All our experiments were prototyped on top of MySQL(version 5.5.44) in a single node RDBMS setting. The model materialization framework code has been written in Python. The experiments were carried out on a PC running Linux Kernel Version 3.13.0-43-generic. The machine has a 3.40GHz Intel Core i7-3770 CPU with 16 GB of main memory.
Our framework is naturally parametrized by the size of the materialized models and the size of the incoming model construction query. Another important parameter, implicit in our discussion, is the amount of data covered by the materialized models. Materialized models can be spread uniformly across the data set or may be concentrated on a few data points. To quantify the coverage we compute the number of unique data points covered by the materialized models and express it as a percentage of the total size of the data set. Formally, let $\mathcal{M}$ be the collection of models materialized at a given stage in the framework, and let $D_m \subseteq D$ denote the data points covered by model $m \in \mathcal{M}$. For the data set $D$, coverage is defined as follows:

$$\mathrm{coverage}(\mathcal{M}, D) = \frac{\left| \bigcup_{m \in \mathcal{M}} D_m \right|}{|D|} \times 100\%$$
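As an illustration of the coverage measure, the following is a minimal sketch assuming that each materialized model covers a contiguous, half-open index range over the ordered data set. The function name `coverage` and the range representation are assumptions made for this example, not part of the framework described here. Overlapping ranges are counted once, matching the "unique data points" wording above.

```python
def coverage(model_ranges, dataset_size):
    """Percentage of the data set covered by at least one materialized model.

    model_ranges: iterable of half-open (start, end) index ranges (assumed
    representation of the data covered by each materialized model).
    """
    covered = 0
    current_end = 0  # right-most index already counted
    for start, end in sorted(model_ranges):
        start = max(start, current_end)  # skip the part already counted
        if end > start:
            covered += end - start
            current_end = end
    return 100.0 * covered / dataset_size


# Three models over a data set of 200 points; the first two overlap.
example = coverage([(0, 50), (25, 75), (90, 100)], 200)
```

Here the union of the three ranges covers 85 of the 200 points, so the overlap between the first two models is not double counted.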
These parameters are varied across our experiments to understand their impact on performance gain. Let $q$ be a model construction query. Our optimization framework identifies the optimal way to build the model for $q$. Let the overall time taken by our framework to build the model be $T_{f}$ (including the optimization and model construction time), and let the time taken by the baseline be $T_{b}$. Then the performance gain is calculated as follows:

$$\mathrm{gain} = \frac{T_{b}}{T_{f}}$$
In all experiments we report expected numbers. A query set containing one thousand queries is generated for each experiment. The query size is chosen from a uniform or normal distribution, as explained in the individual sections. These queries can represent a range of data points positioned anywhere across the underlying data. Similarly, the materialized model size is chosen from a uniform distribution, a normal distribution, or is set to a fixed size. We create a set of materialized models on the data set with a given coverage, as required by the experimental setting. The models are materialized before executing the query set.
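The workload generation described above can be sketched as follows. This is an illustrative sketch only: the helper `make_ranges`, the seed, and the half-open range representation are assumptions for the example, and the sketch does not enforce a target coverage for the materialized models. The size distribution U(50K,100K) is one of the query size distributions used later in this section.

```python
import random


def make_ranges(count, dataset_size, draw_size, rng):
    """Draw `count` contiguous ranges placed uniformly at random over the data.

    draw_size: callable taking the random generator and returning a size,
    e.g. a uniform or normal draw (hypothetical helper signature).
    """
    ranges = []
    for _ in range(count):
        size = int(draw_size(rng))
        size = max(1, min(size, dataset_size))  # keep the range inside the data
        start = rng.randint(0, dataset_size - size)
        ranges.append((start, start + size))
    return ranges


rng = random.Random(7)  # fixed seed so the workload is repeatable
queries = make_ranges(1000, 5_000_000, lambda r: r.uniform(50_000, 100_000), rng)
```

A normally distributed size can be obtained the same way by passing a callable based on `r.gauss(mu, sigma)`, with `mu` and `sigma` chosen by the experimenter.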
6.1 Analyzing Performance
We assess the overall performance gain attained by our approach as compared to the baseline. Experiments were run for all three machine learning models: Linear Regression, Naive Bayes and Logistic Regression. The sizes of the queries and of the materialized models are chosen from the same normal distribution. The x-axis depicts the percentage of data covered by materialized models. We execute the queries in the query set and report the performance gain. Figures 1(a) and 1(b) show that we achieve a performance gain of 2x as the coverage reaches 90%. An increase in coverage implies a higher probability of identifying relevant models for the query; thus the expected performance gain improves as the coverage increases. The performance gain for Logistic Regression is shown in Figure 1(c). The maximum performance gain achieved for Logistic Regression is 1.8x, which is slightly lower than for the other two models. This can be explained by the fact that for Logistic Regression our framework supports only incremental updates to materialized models (section 4). This eliminates certain execution strategies which would have been faster in the presence of decremental updates.
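Table 1: Space occupied by the materialized linear regression models for each value of coverage (columns: Coverage, Model Sizes (MB)).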
The previous experiment demonstrates that utilizing materialized models can have a profound effect on performance when constructing new models. However, materializing a model comes at a cost, namely that of storing the model descriptors as well as the model details (e.g., regression parameters and meta-data in the case of linear regression, as defined in section 3). Table 1 depicts the space occupied by the materialized linear regression models for each value of coverage. The size of each materialized model is fixed at 5K points. The base data set is 350MB in size and contains 5M points with 10 features. As the table shows, the storage overhead imposed by the materialized models is around 1.2% of the original data. Similar trends hold for the other models of interest in our study. The minor storage overhead is clearly outweighed by the performance benefits.
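To see why a materialized regression model can be small relative to the data it summarizes, and why combining or subtracting models can reproduce the baseline model exactly, consider the following minimal sketch. It assumes a single-feature least squares model stored as sufficient statistics; the class `LinRegStats` and its fields are hypothetical and are not the model descriptors defined in section 3.

```python
from dataclasses import dataclass


@dataclass
class LinRegStats:
    """Sufficient statistics for least squares with one feature (assumed layout)."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    @classmethod
    def from_points(cls, points):
        stats = cls()
        for x, y in points:
            stats.n += 1
            stats.sx += x
            stats.sy += y
            stats.sxx += x * x
            stats.sxy += x * y
        return stats

    def __add__(self, other):
        # Incremental update: combine two models over disjoint data.
        return LinRegStats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                           self.sxx + other.sxx, self.sxy + other.sxy)

    def __sub__(self, other):
        # Decremental update: remove the data of a model contained in this one.
        return LinRegStats(self.n - other.n, self.sx - other.sx, self.sy - other.sy,
                           self.sxx - other.sxx, self.sxy - other.sxy)

    def fit(self):
        # Closed-form slope and intercept from the stored sums.
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept


points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
first = LinRegStats.from_points(points[:5])
second = LinRegStats.from_points(points[5:])
whole = LinRegStats.from_points(points)
slope, intercept = (first + second).fit()
```

The stored state is five numbers regardless of how many points the model covers, and `first + second` carries the same statistics as a model built from all ten points.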
6.2 Materialized Model Size and Performance Gain
The size of materialized models is an important parameter in our framework. With the next set of experiments we wish to understand the impact of the size of materialized models on performance. Two test query sets, S1 and S2, of size 50K and 100K points respectively are used, as shown in Figure 3. On the x-axis we represent different materialized model sets, each with a fixed model size, with coverage fixed at 50%. The size of the materialized models is varied from 5K points to 70K points, as shown in Figures 2(a) and 2(b). We present results for Naive Bayes (which supports both incremental and decremental updates) and Logistic Regression (which supports only incremental updates); similar trends hold for linear regression as well. Figures 2(a) and 2(b) present results for Naive Bayes and Logistic Regression respectively. We observe that for a fixed query size and fixed coverage there is an optimum size of materialized models which results in maximum performance gain. For S1, we achieve the maximum performance gain at a materialized model size of 20K for Naive Bayes and at 10K for Logistic Regression. As the size of the query increases, the optimal materialized model size also increases. As shown in the graphs, the query set S2 has its maximum at 30K and 20K for Naive Bayes and Logistic Regression respectively, which is larger than the maximum for S1. The exact position of the maximum on the graph depends on the size of the specific query (or query workload, for multiple queries) for a given cost model.
6.3 Materialized Model and Query Size
We conducted experiments to quantify performance while scaling to larger input queries and materialized model sizes. The model chosen for these experiments was Naive Bayes, although linear regression shows the same trends. Figure 4 shows the four sizes of materialized models under consideration, M1 to M4. M1 represents materialized models with their size chosen from the uniform distribution U(25K,50K); that is, M1 is the scenario in which all the materialized models have a size uniformly distributed between 25K and 50K. Similarly M2, M3 and M4 follow the uniform distributions U(75K,100K), U(150K,200K) and U(250K,500K) respectively. Figure 3(a) shows the time taken to execute queries of small sizes, drawn from U(50K,100K). As depicted in the graph, for M1 and M2 the time taken to execute the model queries decreases linearly as coverage increases. However for M3 and M4, which correspond to considerably larger materialized model sizes, the performance improvement becomes significant only after 70% coverage. As coverage increases there is a higher probability of finding two materialized models which can be subtracted in order to create a smaller model. Figure 3(c) shows a similar trend for small queries on a real world data set from the UCI machine learning repository representing physical activity data of 3M points, consisting of 31 attributes and 13 classes. The main trends are the same as for the synthetic data set, as is the case in all of our experiments. Figure 3(b) is the graph for larger query sizes, drawn from U(500K,750K). Since the query size is much larger, we can observe that in all four cases materialized models are utilized to generate the model for the input query. For M1, small models are combined to generate the models for larger data sets, while for M4 a large materialized model which has the maximum overlap with the incoming model construction query is manipulated to generate the new model.
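The idea of subtracting two materialized models to obtain a smaller one can be illustrated with a count-based Naive Bayes model over categorical features. This is a minimal sketch under that assumption; the class `NBCounts` and its methods are hypothetical names introduced for the example and are not the framework's actual model representation.

```python
from collections import Counter


class NBCounts:
    """Naive Bayes stored as raw counts (assumed categorical features)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = Counter()  # keys: (label, feature_index, value)

    @classmethod
    def from_rows(cls, rows):
        model = cls()
        for features, label in rows:
            model.class_counts[label] += 1
            for index, value in enumerate(features):
                model.feature_counts[(label, index, value)] += 1
        return model

    def merged(self, other):
        # Incremental update: counts over disjoint data simply add up.
        out = NBCounts()
        out.class_counts = self.class_counts + other.class_counts
        out.feature_counts = self.feature_counts + other.feature_counts
        return out

    def minus(self, other):
        # Decremental update: `other` must cover a subset of this model's data.
        out = NBCounts()
        out.class_counts = self.class_counts - other.class_counts
        out.feature_counts = self.feature_counts - other.feature_counts
        return out


rows = [((i % 3, (i * 7) % 4), i % 2) for i in range(20)]
big = NBCounts.from_rows(rows[:20])      # materialized model over rows 0..19
small = NBCounts.from_rows(rows[:10])    # materialized model over rows 0..9
target = NBCounts.from_rows(rows[10:])   # model a query over rows 10..19 needs
derived = big.minus(small)               # obtained without touching the raw rows
```

In this toy setting `derived` holds the same counts as `target`, which is the situation exploited when two large materialized models overlap a small query.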
It is evident that the relationship between the query size and the materialized model size is important in our setting. When the query workload is much smaller than the materialized models (or, conversely, much larger), employing our framework does not result in large performance benefits. However, enabling our framework in these cases does not impose an overhead either.
6.4 Optimization and I/O Time
As mentioned in section 5, the cost of merging models is considerably smaller than the disk access time. We measure the time taken by the three major components of our framework, namely optimizer time, disk access time (including fetching materialized models and/or fetching data points directly) and model combination time. The optimizer time refers to the time taken to run algorithm 4. The time spent fetching any information from MySQL is referred to as I/O time. The remaining time, which cannot be attributed to the above cases, is the time taken to merge the models. Experiments were run on a test set of a thousand queries. The size of the model to be generated is chosen from a normal distribution.
The expected time for each component is reported in Figure 5. As can be observed, the majority of the time to create models is spent fetching data from disk. Model combination time is fairly constant and is much smaller than disk time. Optimizer time is insignificant for small coverage and only becomes visible (but still negligible) on the graph when coverage is close to 80% and above. As coverage increases the number of possible execution plans becomes considerably larger, thus the optimizer takes much longer to build the graph and determine the shortest paths in the graph. The graph reveals that the overhead of running the optimization is minimal. Since the potential benefits of considering materialized models are significant, it is evident that if one chooses to materialize models, the performance overhead of the optimizer is negligible. Thus, running the optimizer, even if the decision is to employ the baseline, imposes a minimal penalty on query performance. In the graph the baseline is represented by the x-axis value at zero percent coverage. It can be seen that disk time reduces from 250 ms to 110 ms, while the optimizer time and model combination time are roughly 10 ms. Thus, when the coverage is low, the overhead of the optimizer is so small that even when no materialized model can be utilized and the model has to be constructed from the baseline, the impact of the optimizer on overall performance is immaterial, as evident in Figure 5. At high coverage, the chances of utilizing materialized models are much higher. In that case, the small overhead of the optimizer is clearly compensated by the large savings in model construction time.
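The text above describes the optimizer as building a graph and determining shortest paths in it. The following is a simplified sketch of that idea and should not be read as algorithm 4 itself. It assumes that queries and materialized models are half-open ranges over ordered data, that scanning raw data costs a fixed amount per point, and that using a materialized model costs a constant; the function `cheapest_plan`, the step labels and the cost parameters are all hypothetical. A forward edge adds a range to the model under construction and a backward edge removes one, so any path from the query start to the query end yields exactly the query range.

```python
import heapq


def cheapest_plan(query, models, scan_cost_per_point, model_cost, allow_subtract=True):
    """Return (cost, steps) for building the model over `query` (assumed cost model)."""
    qs, qe = query
    points = sorted({qs, qe, *(p for m in models for p in m)})
    edges = {p: [] for p in points}
    # Raw-data edges between neighbouring boundaries.
    for a, b in zip(points, points[1:]):
        cost = scan_cost_per_point * (b - a)
        edges[a].append((b, cost, ("scan", a, b)))
        if allow_subtract:
            edges[b].append((a, cost, ("remove_scanned", a, b)))
    # Materialized-model edges.
    for s, e in models:
        edges[s].append((e, model_cost, ("add_model", s, e)))
        if allow_subtract:
            edges[e].append((s, model_cost, ("subtract_model", s, e)))
    # Dijkstra from the query start to the query end.
    best = {qs: 0.0}
    prev = {}
    heap = [(0.0, qs)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == qe:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, cost, step in edges[node]:
            candidate = dist + cost
            if candidate < best.get(nxt, float("inf")):
                best[nxt] = candidate
                prev[nxt] = (node, step)
                heapq.heappush(heap, (candidate, nxt))
    steps = []
    node = qe
    while node != qs:
        node, step = prev[node]
        steps.append(step)
    steps.reverse()
    return best[qe], steps


models = [(0, 20), (0, 10)]
cost_dec, plan_dec = cheapest_plan((10, 20), models, 1.0, 2.0)
cost_inc, plan_inc = cheapest_plan((10, 20), models, 1.0, 2.0, allow_subtract=False)
```

With decremental updates allowed, the cheapest plan subtracts the model over (0, 10) from the model over (0, 20). With `allow_subtract=False` the only option is to scan the raw range, which mirrors why a model that supports only incremental updates loses some execution strategies.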
In this section we analyze the accuracy of our framework for the logistic regression models presented in section 4. We quantify the accuracy of the overall approach.
Synthetically generated classification data with 10 features and 2 classes were used to run the test experiments. Similar trends hold when the number of classes increases, so we omit these experiments for brevity. We ran experiments on a test set S of a thousand queries. For each of these queries the model was built both using our framework and by applying SGD. We compare the accuracy of the two models on training data by computing their difference. Let $A_{F}$ refer to the accuracy of the model built by our framework and $A_{SGD}$ refer to the accuracy of the SGD algorithm; the accuracy difference can be represented as $\Delta = A_{SGD} - A_{F}$. Various statistics are reported on this difference. Figures 5(a) and 6(a) present the average of the accuracy difference between the model constructed by our approach and the model constructed by SGD directly. The x-axis represents queries in increasing order of size. The graphs show negative average values, which means that on average the model generated by our framework outperforms the model developed by SGD on training data. Also, as the query size increases the expected performance of our model improves. Figures 5(b) and 6(b) present the average difference in accuracy for the cases where $\Delta > 0$. It can be seen that the average positive difference lies within 0.5%. It is evident that the overall approach is highly accurate. Across the materialized model sizes, we observe that larger materialized models yield better accuracy than smaller ones. Finally, Figures 5(c) and 6(c) present the maximum difference across various query sizes. The graphs show that as the query size increases, the maximum difference between the model computed by our framework and that computed by SGD decreases. It is visible from the graph that . The last set of graphs presents the trade-off between accuracy and the corresponding performance gains achieved by our framework. As Figures 5(d) and 6(d) suggest, we experience a performance gain of 1.5x while we compromise accuracy by 3% in the worst case.
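The three statistics reported on the accuracy difference (average, average over the positive cases, and maximum) can be computed as in the sketch below. The sign convention is an assumption inferred from the discussion above: the difference is taken as SGD accuracy minus framework accuracy, so that negative values mean the framework's model is the more accurate one. The function name and the sample accuracies are illustrative only and are not measurements from these experiments.

```python
def accuracy_difference_stats(acc_framework, acc_sgd):
    """Summarize per-query accuracy differences (SGD minus framework, assumed sign)."""
    diffs = [sgd - ours for ours, sgd in zip(acc_framework, acc_sgd)]
    positive = [d for d in diffs if d > 0]  # queries where SGD does better
    return {
        "mean": sum(diffs) / len(diffs),
        "mean_positive": sum(positive) / len(positive) if positive else 0.0,
        "max": max(diffs),
    }


# Made-up accuracies for three queries, purely to exercise the function.
stats = accuracy_difference_stats([0.90, 0.80, 0.85], [0.88, 0.81, 0.85])
```

A negative `mean` together with a small `mean_positive` corresponds to the pattern described above, where the framework is better on average and only slightly worse in the cases where SGD wins.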
Similar results were observed on real world data sets, including the publicly available PAMAP2 data set. Since they are consistent with what has been presented, these results are omitted for brevity.
7 Related Work
There has been an ever increasing interest in integrating statistical and machine learning capabilities into data management systems. Several efforts have been made in academia and industry to address this demand. Major database vendors now support analytical capabilities on top of their database engines: IBM’s SystemML, Oracle’s ORE, SAP HANA. However the integration is loose and does not support notions of model persistence or incremental computations. In the open source community one can observe similar trends, with MADlib library support for Postgres. Other data platforms like Spark and Hadoop also support machine learning libraries as an external layer on top of their data processing systems, with MLLib and Mahout respectively. Such approaches either utilize an existing data management platform and deploy its extensions to provide analytics capabilities, or represent systems that can execute machine learning and statistical packages. See  for a general overview of systems support for machine learning and statistical operations. Haloop and Dryad are examples of systems that utilize a form of persistence in their operations to improve the execution of a graph data flow. Although related in spirit, the approach and goal of these systems is to improve the performance of specific iterative graph data flow computations; they do not address the case of synthesizing a new model by extending and/or combining past models, which is central to our approach.
Recent work has focused on pushing machine learning primitives inside a relational database engine. Our work is intended as a middle layer between the data processing engine and the analytical computing language layer. We maintain awareness of previous computations by collecting them, and use materialized models to build new models for the data. Our goal is to exploit the natural work sharing opportunities that exist in a typical data analysis workload.
In this paper we presented an approach that treats model materialization and incremental model reuse as first class citizens while processing data analytics workloads. Utilizing popular machine learning models, we demonstrated their incremental aspects and detailed an optimization methodology that determines the best way (in terms of performance) to build a given new model. We demonstrated that our approach can achieve significant performance savings for new model construction while imposing only modest storage overheads.
The work opens several avenues for future work. First, there is a plethora of other models that are important and can be considered in conjunction with our framework. Studying their incremental aspects and embedding them into the same optimization framework is an interesting direction for future work. Incremental model reuse for analytics is an important direction of research that blends nicely with the way current data management systems build integrations with existing analytical packages. Our framework can be easily injected between the analytical package and the RDBMS, where it can recognize and handle all opportunities for improved performance. We are currently building such a system based on the ideas presented herein, on which we will report soon.
Finally, our focus in this paper has been on the case in which a total ordering exists in the underlying data set. An interesting case is when such an ordering does not exist. In that case the model descriptors will be different, as will the associated optimizations. Our entire framework can be extended to this case as well, and we will be reporting on such extensions in our future work.
- Apache Mahout. http://mahout.apache.org/.
- Apache Spark MLLib. http://spark.apache.org/.
- Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets.html.
- Oracle R Enterprise. https://docs.oracle.com/cd/E36939_01/doc.13/e36761.pdf.
- Processing Analytical Workloads Incrementally. http://www.cs.toronto.edu/~priyank/incremental_analytics.pdf.
- SAP HANA and R. http://help.sap.com/hana/sap_hana_r_integration_guide_en.pdf.
- The R Project for Statistical Computing. https://www.r-project.org/.
- C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. Proc. VLDB Endow., 3(1-2):285–296, Sep. 2010.
- S. Chaudhuri. An Overview of Query Optimization in Relational Systems. PODS, 1998.
- T. Condie, P. Mineiro, N. Polyzotis, and M. Weimer. Machine learning on big data (SIGMOD tutorial). In SIGMOD Conference, pages 939–942, 2013.
- A. Ghoting et al. SystemML: Declarative Machine Learning on MapReduce. ICDE, 2009.
- J. M. Hellerstein, C. Re, F. Schoppmann, and D. Z. Wang. The MADlib Analytics Library, 2012.
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys ’07, pages 59–72, New York, NY, USA, 2007. ACM.
- A. Kumar, J. Naughton, and J. M. Patel. Learning Generalized Linear Models Over Normalized Data. SIGMOD, 2014.
- G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models. Advances in Neural Information Processing Systems, 2009.
- K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- C. Zhang, A. Kumar, and C. Re. Materialization Optimizations for Feature Selection Workloads. SIGMOD, 2015.
- H. Zhang. The Optimality of Naive Bayes. American Association for Artificial Intelligence, 2004.
- M. A. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems, 15(5):795–825, 2010.