F-IVM: Learning over Fast-Evolving Relational Data

06/01/2020 · by Milos Nikolic, et al.

F-IVM is a system for real-time analytics, such as machine learning applications, over training datasets defined by queries over fast-evolving relational databases. We will demonstrate F-IVM for three such applications: model selection, Chow-Liu trees, and ridge linear regression.


1 Learning Under Updates

F-IVM (https://github.com/fdbresearch/FIVM) is a system for real-time analytics over fast-evolving relational databases [5].

F-IVM innovates on two fronts. First, F-IVM puts forward a novel incremental maintenance mechanism for batches of aggregates over arbitrary project-join queries. It constructs a tree of views, with the input relations as leaves, the query as root, and each view defined by the join of its children, possibly followed by projecting away attributes. For updates to a relation, it maintains the views along the path from that relation's leaf to the tree root using delta processing and view materialization.

Second, F-IVM captures the data-intensive computation of many applications using application-specific rings, which define sum and product operations over data values. The ring of integers suffices to treat updates uniformly for sum-product aggregates over joins: negative/positive tuple multiplicities denote deletes/inserts [3]. More complex applications call for richer rings or even composition of rings. F-IVM introduces the degree-m matrix ring to maintain the gradients for linear regression models. Moreover, it uses the same view tree to maintain factorized conjunctive query evaluation, matrix chain multiplication, and linear regression, with the only computational change captured by the ring.
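
To make the multiplicity encoding concrete, here is a small Python sketch (illustrative only, not F-IVM code, with hypothetical data): relations map tuples to integer multiplicities, deletes carry negative multiplicities, and applying an update simply adds the multiplicities of the delta.

    # Relations as dicts from tuples to integer multiplicities (the ring of integers).
    from collections import defaultdict

    def apply_update(relation, delta):
        """Add the multiplicities of the delta; tuples with multiplicity 0 vanish."""
        out = defaultdict(int, relation)
        for tup, mult in delta.items():
            out[tup] += mult
            if out[tup] == 0:
                del out[tup]
        return dict(out)

    R = {("a1", "b1"): 1, ("a1", "b2"): 2}         # tuple -> multiplicity
    delta = {("a1", "b2"): -1, ("a2", "b3"): +1}   # one delete, one insert
    print(apply_update(R, delta))                  # {('a1','b1'): 1, ('a1','b2'): 1, ('a2','b3'): 1}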

F-IVM differs from existing online learning algorithms in at least two ways. (1) Whereas the latter only consider inserts [4], F-IVM also considers deletes. (2) F-IVM avoids the materialization of the training dataset defined by a feature extraction query over multi-relational data. It casts the data-intensive computation of the learning task as ring computation inside the views and pushes it past the joins and down the view tree of the query. This comes with great performance benefits: F-IVM can maintain model gradients over a join faster than maintaining the join, since the latter may be much larger and have many repeating values. It is also competitive against state-of-the-art incremental view maintenance systems: Experiments showed several orders of magnitude performance speedup over DBToaster [3], with an average throughput of 10K updates per second for batches of up to thousands of aggregates over joins of five relations on one thread of a standard Azure machine [5].

We will demonstrate F-IVM’s unique ability to maintain the pairwise mutual information (MI) and the covariance (COVAR) matrices over categorical and continuous attributes. These matrices represent the data-intensive computation of common machine learning applications. MI is used for model selection, for Chow-Liu trees (optimal tree-shaped Bayesian networks), as a cost function in learning decision trees, and in determining the similarity of clusterings of a dataset. COVAR is used for ridge linear regression [6], forward/backward model selection, polynomial regression, and factorization machines. For this demonstration, we will use model selection, ridge linear regression, and Chow-Liu trees. Our web-based user interface uses F-IVM to maintain these applications under updates to the Retailer [6] and Favorita [2] databases.

Figure 1: Maintaining SUM(g(B)g(C)g(D)) over the join in four scenarios: tuple multiplicities using the ring of integers (#); COVAR matrix using the degree-3 matrix ring (continuous B, C, D; also categorical C and continuous B, D); MI matrix using the degree-3 matrix ring (categorical B, C, D). The symmetric matrix values are omitted.

2 F-IVM By Example

Consider the following query over relations R(A, B) and S(A, C, D):

        Q = SELECT SUM(g_B(B) * g_C(C) * g_D(D)) FROM R NATURAL JOIN S

The aggregates are from a ring (D, +, *, 0, 1). The SUM operator uses the addition + from D. The attribute functions g_B, g_C, and g_D map attribute values to elements in D.

F-IVM exploits the distributivity of multiplication over the SUM operator to push the aggregate past the join and later combine the partial aggregates to produce the query result. For instance, the view V_S computes such partial sums over S:

        V_S = SELECT A, SUM(g_C(C) * g_D(D)) FROM S GROUP BY A

In the view V_S, we treat the A-values as keys and the aggregate values as payloads. Similarly, we can compute partial sums over R as the view V_R. These views are joined as depicted by the view tree in Figure 1, which is akin to a query plan.

Figure 1 gives a toy database consisting of relations R and S, each mapping tuples to their multiplicity. We next demonstrate how computing Q over this database with different rings can support our application scenarios.

Count Aggregate. We start with the simple count aggregate SUM(1). The aggregate values are from Z, and + and * are the arithmetic operations over Z. The attribute functions g_B, g_C, and g_D map all values to 1. Figure 1 shows the contents of V_R and V_S on the toy database under this ring (the payload column #). To compute Q, we multiply the payloads of matching A-values from V_R and V_S and then sum up these products. The result of Q is a relation mapping the empty tuple to the total number of tuples in the join of R and S.
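
To make the evaluation concrete, here is a minimal Python sketch of the count-aggregate view tree over a small hypothetical instance of R(A, B) and S(A, C, D); the data is illustrative, not the Figure 1 instance.

    # Count aggregate SUM(1) over the view tree: payloads live in the ring of integers.
    from collections import defaultdict

    R = {("a1", "b1"): 1, ("a1", "b2"): 1, ("a2", "b1"): 2}   # R(A, B) -> multiplicity
    S = {("a1", "c1", "d1"): 1, ("a2", "c1", "d2"): 1}        # S(A, C, D) -> multiplicity

    def V_R(R):
        """V_R = SELECT A, SUM(1 * mult) FROM R GROUP BY A (g_B maps every value to 1)."""
        view = defaultdict(int)
        for (a, b), mult in R.items():
            view[a] += 1 * mult
        return dict(view)

    def V_S(S):
        """V_S = SELECT A, SUM(1 * 1 * mult) FROM S GROUP BY A."""
        view = defaultdict(int)
        for (a, c, d), mult in S.items():
            view[a] += 1 * mult
        return dict(view)

    def Q(vr, vs):
        """Join V_R and V_S on A, multiply payloads, and sum up the products."""
        return sum(vr[a] * vs[a] for a in vr.keys() & vs.keys())

    print(Q(V_R(R), V_S(S)))   # 4 = number of tuples in the join of R and S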

Linear Regression. Consider the problem of learning a linear function f with parameters θ_0, θ_B, and θ_C that predicts the label D given the features B and C, where the training dataset is the natural join of our relations R and S:

        f(B, C) = θ_0 + θ_B * B + θ_C * C

We assume that B, C, and D have continuous domains; we consider the case when C is a categorical attribute later on.

We can learn f using batch gradient descent. This method iteratively updates the model parameters in the direction of the gradient to decrease the squared error loss and eventually converge to the optimal value. The gradient of the square loss objective function requires the computation of three types of aggregates: the count aggregate SUM(1), the linear aggregates SUM(B), SUM(C), and SUM(D), and the quadratic aggregates SUM(BB), SUM(BC), SUM(CC), SUM(BD), and SUM(CD). These aggregates suffice to capture the correlation between the features B and C and the label D [6].
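
As a rough sketch of how these aggregates feed the learning task, the following Python snippet assembles the gradient of a ridge squared-error objective from hypothetical aggregate values (taken from an imaginary 4-row training dataset); in F-IVM these values come from the maintained COVAR payload rather than from a scan of the training dataset.

    # Assembling the ridge linear regression gradient from the batch of aggregates.
    import numpy as np

    agg = {"1": 4.0, "B": 10.0, "C": 6.0, "D": 14.0,                    # count and linear
           "BB": 30.0, "BC": 17.0, "CC": 10.0, "BD": 40.0, "CD": 23.0}  # quadratic

    # Model f(B, C) = t0 + tB*B + tC*C predicting label D, squared-error loss.
    Sigma = np.array([[agg["1"], agg["B"],  agg["C"]],
                      [agg["B"], agg["BB"], agg["BC"]],
                      [agg["C"], agg["BC"], agg["CC"]]])
    c = np.array([agg["D"], agg["BD"], agg["CD"]])

    theta, lam, lr = np.zeros(3), 0.1, 0.01
    for _ in range(5000):                        # batch gradient descent
        grad = Sigma @ theta - c + lam * theta   # gradient built only from aggregates
        theta -= lr * grad

    print(theta)   # approaches the ridge solution of (Sigma + lam*I) theta = c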

F-IVM can compute all these aggregates using the query Q and the same evaluation strategy from Figure 1! The only needed adjustment is the replacement of the SQL SUM and * operators with appropriate new sum and product operators.

We treat this batch of aggregates as one compound aggregate (c, s, Q), where c is a scalar, s is a vector with one sum of values per attribute, and Q is a matrix of sums of products of values for any two attributes. This is the COVAR matrix. The compound aggregate can be pushed past the join similarly to the count aggregate discussed before. The payloads of keys carry these aggregates as values, see the payload column COVAR in Figure 1. The compound aggregates are from the degree-3 matrix ring, whose zero is (0, 0_3, 0_{3x3}) and whose one is (1, 0_3, 0_{3x3}), and where for a = (c_a, s_a, Q_a) and b = (c_b, s_b, Q_b):

        a + b = (c_a + c_b, s_a + s_b, Q_a + Q_b)
        a * b = (c_a c_b, c_b s_a + c_a s_b, c_b Q_a + c_a Q_b + s_a s_b^T + s_b s_a^T)
We use attribute names to index elements in s and Q; for instance, Q_BC denotes the entry of Q for the attribute pair (B, C). For each X-value x, where X ∈ {B, C, D}, the attribute function is g_X(x) = (1, s, Q), where s is a vector with all zeros except s_X = x, and Q is a matrix with all zeros except Q_XX = x².

In Figure 1, each payload in V_R represents the mapped B-value of the corresponding tuple; each payload in V_S represents the sum of products of the mapped (C, D)-pairs with the same A-value; the remaining payloads are computed similarly. The result of Q maps the empty tuple to the payload (c, s, Q), yielding the count, the vector of aggregates SUM(X), and the matrix of aggregates SUM(XY), for X, Y ∈ {B, C, D}, over the join of R and S. Our approach significantly shares the computation across the aggregates: The scalar aggregates are used to scale up the linear and quadratic ones, while the linear aggregates are used to compute the quadratic ones.
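
A minimal Python sketch of this degree-3 matrix ring (using NumPy, with attribute order B, C, D; illustrative, not F-IVM code) follows: lifting tuples via the attribute functions, multiplying within a tuple, and adding across tuples yields the count, linear, and quadratic aggregates.

    # Degree-3 matrix ring over triples (c, s, Q): c a scalar, s a 3-vector, Q a 3x3 matrix.
    import numpy as np

    IDX = {"B": 0, "C": 1, "D": 2}

    def add(a, b):
        return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

    def mul(a, b):
        ca, sa, Qa = a
        cb, sb, Qb = b
        return (ca * cb,
                cb * sa + ca * sb,
                cb * Qa + ca * Qb + np.outer(sa, sb) + np.outer(sb, sa))

    def g(attr, x):
        """Attribute function: lift a continuous value x of attribute attr into the ring."""
        s = np.zeros(3); s[IDX[attr]] = x
        Q = np.zeros((3, 3)); Q[IDX[attr], IDX[attr]] = x * x
        return (1.0, s, Q)

    # Two hypothetical tuples (B, C, D): multiply within each tuple, add across tuples.
    t1 = mul(mul(g("B", 2), g("C", 3)), g("D", 5))
    t2 = mul(mul(g("B", 1), g("C", 4)), g("D", 2))
    c, s, Q = add(t1, t2)
    print(c)   # 2.0  (count)
    print(s)   # [3. 7. 7.]  (SUM(B), SUM(C), SUM(D))
    print(Q)   # pairwise sums of products, e.g. Q[0,1] = 2*3 + 1*4 = 10 = SUM(BC)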

Linear Regression with Categorical Attributes. Real-world datasets contain a mix of continuous and categorical attributes. The latter take on values from predefined sets of possible values (categories). It is common practice to one-hot encode categorical attributes as indicator vectors.

The COVAR matrix from above accounts for the interactions Q_XY = SUM(XY) of attributes with a continuous domain. Assume now that attribute C is categorical and attributes B and D remain continuous. The interaction Q_BC then captures the aggregates SUM(B) per C-value:

        Q_BC = SELECT C, SUM(B) FROM R NATURAL JOIN S GROUP BY C

Using the group-by clause ensures a compact representation of one-hot encoded C-values and that Q_BC considers only the C-values that exist in the join result.

We unify the representation of aggregates for continuous and categorical attributes by composing the degree-m matrix ring with the ring over relations [5] as follows: we use relations as values in c, s, and Q instead of scalars; we use union and join instead of scalar addition and multiplication; we use the empty relation as zero. The operations + and * over triples remain unchanged.

The attribute function g_X now depends on whether X is continuous or categorical. For any X-value x, g_X(x) = (c, s, Q), where c maps the empty tuple to 1, s is a vector with all entries the empty relation except s_X, which is {() → x} if X is continuous and {x → 1} otherwise, and Q is a matrix with all entries the empty relation except Q_XX, which is {() → x²} if X is continuous and {x → 1} otherwise.

Figure 1 shows the contents of V_R, V_S, and Q when using the generalized degree-3 matrix ring with relational values (see the payload column COVAR with C as categorical). The payload in Q captures the count aggregate SUM(1), the vector of aggregates s_B = SUM(B), s_C = SUM(1) grouped by C, and s_D = SUM(D), and the matrix of aggregates including Q_BC = SUM(B) grouped by C and Q_BD = SUM(BD). The computation follows the same pattern as with the count aggregate and linear regression with continuous attributes. The only difference is due to the ring used for payloads.
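
The following Python sketch illustrates the idea of relational payload entries in a simplified form (not F-IVM's actual representation): each entry of s and Q is a dict from a key tuple to a number, the empty tuple plays the role of a scalar key for continuous attributes, and union/join over these dicts replace scalar addition/multiplication.

    # Entries with relational values: dicts from key tuples to numbers.
    def union(r1, r2):
        """Union of two relations: add the values of matching keys."""
        out = dict(r1)
        for k, v in r2.items():
            out[k] = out.get(k, 0) + v
            if out[k] == 0:
                del out[k]
        return out

    def join(r1, r2):
        """Join of two relations over disjoint key attributes: concatenate keys, multiply values."""
        return {k1 + k2: v1 * v2 for k1, v1 in r1.items() for k2, v2 in r2.items()}

    s_B = {(): 4.5}         # continuous B-value 4.5 lifted as {() -> 4.5}
    s_C = {("red",): 1}     # categorical C-value "red" lifted as the unary relation {red -> 1}

    # The interaction "SUM(B) grouped by C" arises from joining the two lifted entries:
    print(join(s_B, s_C))                                            # {('red',): 4.5}
    print(union({("red",): 4.5}, {("blue",): 2.0, ("red",): 1.5}))   # {('red',): 6.0, ('blue',): 2.0}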

Mutual Information (MI). The MI of two discrete random variables X and Y is defined as:

        I(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) )

where p(x, y) is the joint probability mass function of X and Y, and p(x) and p(y) are the probability mass functions of X and Y, respectively. In our database setting, we can capture the joint distribution of two categorical attributes X and Y and the two marginal distributions using four count aggregates: N = SUM(1), N_X = SUM(1) grouped by X, N_Y = SUM(1) grouped by Y, and N_XY = SUM(1) grouped by (X, Y). The MI of X and Y is then computed as:

        I(X; Y) = Σ_x Σ_y (N_XY(x, y) / N) log( (N · N_XY(x, y)) / (N_X(x) · N_Y(y)) )

We compute the MI for all pairs of categorical attributes. The aggregates N, N_X, N_Y, and N_XY are exactly those computed for the COVAR matrix over categorical attributes. We can thus assemble the aggregates N_X into a vector and the aggregates N_XY into a matrix, and share their computation as in the linear regression case. When computing the MI for continuous attributes, we first discretize their values into bins of finite size and then follow the same steps as with computing the MI for categorical attributes.
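
A small Python sketch of this computation from the four count aggregates (the counts below are hypothetical) follows.

    # MI of two categorical attributes X and Y from the maintained count aggregates.
    import math

    N = 8                                             # SUM(1)
    N_X = {"x1": 5, "x2": 3}                          # SUM(1) GROUP BY X
    N_Y = {"y1": 4, "y2": 4}                          # SUM(1) GROUP BY Y
    N_XY = {("x1", "y1"): 3, ("x1", "y2"): 2,         # SUM(1) GROUP BY X, Y
            ("x2", "y1"): 1, ("x2", "y2"): 2}

    def mutual_information(N, N_X, N_Y, N_XY):
        mi = 0.0
        for (x, y), n_xy in N_XY.items():
            p_xy = n_xy / N
            p_x, p_y = N_X[x] / N, N_Y[y] / N
            mi += p_xy * math.log(p_xy / (p_x * p_y))
        return mi

    print(mutual_information(N, N_X, N_Y, N_XY))      # ~0.03 nats for these counts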

The views V_R, V_S, and Q from Figure 1 capture the aggregates N, N_X, and N_XY of categorical attributes using the degree-3 matrix ring with relational values (the last payload column MI). The payload in Q consists of the count aggregate N, the vector of SUM(1) aggregates grouped by X, and the matrix of SUM(1) aggregates grouped by (X, Y), for X, Y ∈ {B, C, D}. As in the previous examples, the computation over keys remains the same.

The MI of two attributes quantifies their degree of correlation [4]: A value close to 0 means they are almost independent, while a large value means they are highly correlated. It can identify attributes that predict (are highly correlated with) a given label attribute and can thus be used for model selection [4]. It can also be used for learning the structure of Bayesian networks. The Chow-Liu algorithm [1] constructs an optimal tree-shaped Bayesian network with one node for each attribute. It proceeds in rounds and in each round it adds an edge between two nodes such that their pairwise MI is maximal among all pairs of attributes not chosen yet.
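
The following Python sketch illustrates the Chow-Liu construction as a maximum spanning tree over hypothetical pairwise MI values; it is an illustration of the algorithm, not F-IVM's implementation.

    # Chow-Liu construction: greedily add the highest-MI edge that does not close a cycle.
    def chow_liu(attrs, pairwise_mi):
        parent = {a: a for a in attrs}            # union-find to detect cycles
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        tree = []
        for (a, b), _mi in sorted(pairwise_mi.items(), key=lambda kv: -kv[1]):
            ra, rb = find(a), find(b)
            if ra != rb:                          # adding (a, b) keeps the graph acyclic
                parent[ra] = rb
                tree.append((a, b))
        return tree

    # Hypothetical pairwise MI values over three attributes:
    mi = {("B", "C"): 0.40, ("B", "D"): 0.10, ("C", "D"): 0.25}
    print(chow_liu(["B", "C", "D"], mi))          # [('B', 'C'), ('C', 'D')]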

(a) Model Selection
(b) Regression
(c) Chow-Liu tree
(d) Maintenance Strategy
Figure 2: F-IVM’s web user interface: (a) model selection using the pairwise mutual information of attributes with a given label; (b) the ridge linear regression model learned with the selected features and label; (c) the mutual information matrix and the Chow-Liu tree; (d) the F-IVM view tree and M3 code for the views.

Incremental Maintenance. Figure 1 shows the leaf-to-root path taken to maintain the query result under updates to R. The delta view δV_R captures the change in V_R:

        δV_R = SELECT A, SUM(g_B(B)) FROM δR GROUP BY A

The delta view δV_R further joins with V_S to compute the delta δQ of the query result.

The update may consist of both inserts and deletes, which are encoded as keys with positive and negative payloads, respectively. In our examples, a negative payload is −1 for the count aggregate and (−1, 0, 0) for the compound aggregate with relational values, where 0 denotes the vector or matrix whose entries are the empty relation.
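
The following Python sketch walks through one such delta step for the count aggregate on hypothetical view contents: the update to R is turned into a delta view, joined with the maintained V_S, and folded into V_R and the query result.

    # Delta propagation for the count aggregate under an update to R.
    from collections import defaultdict

    V_S = {"a1": 1, "a2": 1}                    # maintained view over S (A -> payload)
    V_R = {"a1": 2, "a2": 2}                    # maintained view over R
    Q = sum(V_R[a] * V_S[a] for a in V_R.keys() & V_S.keys())

    delta_R = {("a1", "b3"): +1, ("a2", "b1"): -2}   # inserts/deletes as signed multiplicities

    delta_V_R = defaultdict(int)                # deltaV_R = SELECT A, SUM(mult) FROM deltaR GROUP BY A
    for (a, b), mult in delta_R.items():
        delta_V_R[a] += mult                    # g_B maps every B-value to 1

    delta_Q = sum(m * V_S[a] for a, m in delta_V_R.items() if a in V_S)

    for a, m in delta_V_R.items():              # maintain V_R and the query result
        V_R[a] = V_R.get(a, 0) + m
    Q += delta_Q
    print(V_R, Q)                               # {'a1': 3, 'a2': 0} 3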

3 User Interaction

Figure 2 depicts snapshots of F-IVM’s user interface.

In the Input tab (not shown), the user chooses the database and gives the query defining the initial training dataset. A sequence of updates is prepared for each database. The MI and COVAR matrices and the applications built on top of them are first computed over this initial training dataset.

The Model Selection tab allows the user to specify a label attribute for a predictive model and an MI threshold. It then depicts the list of all attributes ranked by their pairwise MI with the label. Only the attributes above the threshold are selected as features of the model. F-IVM processes one bulk of 10K updates before pausing for one second. Users can then observe how attributes relevant to predicting the label become irrelevant, or vice versa.

The Regression tab allows users to inspect the ridge linear regression model with the features and label chosen in the previous tab. F-IVM updates the COVAR matrix after each bulk of updates. Then, a batch gradient descent solver resumes the convergence of the model parameters using gradients that are made of the previous parameter values and the new COVAR matrix [5]. A dark green or white cell in this matrix is a 2D or respectively 0D tensor representing the interaction between two categorical or respectively two continuous attributes. A light green cell is a 1D tensor that represents the interaction between a categorical and a continuous attribute. The same color coding is used for the model vector. A tensor is shown when clicking its cell.

The Chow-Liu Tree tab depicts the MI matrix for all pairs of attributes in the training dataset and the Chow-Liu tree constructed using this matrix. After each bulk of 10K updates, F-IVM updates the matrix and the tree and then pauses.

The Maintenance Strategy tab depicts the F-IVM view tree for the input query. For each view, it shows the code for this view in the M3 intermediate representation language [3]. We use the DBToaster backend [3] to compile M3 code into efficient C++ code before running it on the updates.

References

  • [1] C. Chow and C. Liu (1968) Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14(3), pp. 462–467.
  • [2] Corporación Favorita (2017) Corporación Favorita Grocery Sales Forecasting: Can you accurately predict sales for a large grocery chain?. https://www.kaggle.com/c/favorita-grocery-sales-forecasting/
  • [3] C. Koch et al. (2014) DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views. VLDB J. 23(2), pp. 253–278.
  • [4] K. P. Murphy (2013) Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA.
  • [5] M. Nikolic and D. Olteanu (2018) Incremental View Maintenance with Triple Lock Factorization Benefits. In SIGMOD, pp. 365–380.
  • [6] M. Schleich, D. Olteanu, and R. Ciucanu (2016) Learning Linear Regression Models over Factorized Joins. In SIGMOD, pp. 3–18.