Relational Boosted Regression Trees

07/25/2021
by Sonia Cromp, et al.
University of Pittsburgh

Many tasks use data housed in relational databases to train boosted regression tree models. In this paper, we give a relational adaptation of the greedy algorithm for training boosted regression trees. For the subproblem of calculating the sum of squared residuals of the dataset, which dominates the runtime of the boosting algorithm, we provide a (1 + ϵ)-approximation using the tensor sketch technique. Employing this approximation within the relational boosted regression trees algorithm leads to learning similar model parameters, but with asymptotically better runtime.


1 Introduction

Relational databases are a common solution for storing large datasets, due to their economical space usage, which limits the repetition of duplicate values. A dataset whose features are represented by columns is stored in tables $T_1, \dots, T_m$ that each contain a subset of the features. The design matrix $J$ with all features is not stored and must be reconstructed by joining the tables, i.e. $J = T_1 \bowtie T_2 \bowtie \dots \bowtie T_m$. This process is both time- and space-intensive, but it is a necessary step for machine learning algorithms that expect $J$ as input. When training boosted regression trees, the traditional approach is to compute $J$ and then train the trees on the data in the design matrix. Relational algorithms [moseley2020relational, abo2021relational] sidestep the need to calculate $J$. As such, altering the boosted regression trees training algorithm to rely on relational algorithms, rather than on the materialized design matrix, can yield significant time and space complexity improvements.

Boosted trees build on the classic tree model, known as a decision tree when the label is categorical and as a regression tree when the label is continuous. A decision/regression tree is a rooted binary tree in which each internal node has a condition and each leaf has a prediction. To find the prediction for a given point, we recursively check the condition at each internal node, starting from the root. If the point meets the condition, we recurse on the right branch; otherwise we go to the left branch. The boosted regression tree algorithm greedily trains multiple regression trees, each referred to as a weak regressor. The first tree is trained to predict the datapoints' labels, and the subsequent trees predict the datapoints' residuals. For a datapoint $x$ with label $y_x$ and previously trained weak regressors $h_1, \dots, h_t$, the residual equals the datapoint's label minus the predictions of all previously trained regression trees:
$$r_x = y_x - \sum_{i=1}^{t} h_i(x).$$
The prediction of the entire boosted regression model for a datapoint $x$ equals the sum of all weak regressors' predictions, i.e. $\sum_{i=1}^{t} h_i(x)$.
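The following minimal sketch (plain Python with numpy on a small in-memory dataset, with an illustrative Node structure and a pluggable weak-learner routine; it is not the relational algorithm of this paper) shows the two mechanics just described: traversing a tree's conditions to obtain a prediction, and fitting each new weak regressor to the residuals left by its predecessors.

    import numpy as np

    class Node:
        """A regression-tree node: a leaf holding a prediction, or an internal
        node holding a splitting criterion x[feature] >= threshold."""
        def __init__(self, prediction=None, feature=None, threshold=None,
                     left=None, right=None):
            self.prediction, self.feature, self.threshold = prediction, feature, threshold
            self.left, self.right = left, right

    def predict_tree(node, x):
        # Points meeting the condition go to the right child, the rest to the left.
        if node.prediction is not None:
            return node.prediction
        child = node.right if x[node.feature] >= node.threshold else node.left
        return predict_tree(child, x)

    def predict_ensemble(trees, x):
        # The boosted model's prediction is the sum of the weak regressors' predictions.
        return sum(predict_tree(t, x) for t in trees)

    def fit_boosted(X, y, fit_weak_regressor, num_trees):
        # The first tree fits the labels; each later tree fits the current residuals.
        trees, residuals = [], y.astype(float).copy()
        for _ in range(num_trees):
            tree = fit_weak_regressor(X, residuals)
            trees.append(tree)
            residuals -= np.array([predict_tree(tree, x) for x in X])
        return trees

Here fit_weak_regressor stands for any routine that trains one regression tree on the given targets, such as the greedy procedure described next.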

Individual regression trees are commonly trained using a greedy algorithm. Training a node $v$ of a regression tree consists of finding a splitting criterion of the form $x_f \geq s$ for some feature $f$ of the dataset and some threshold $s$ over the values of $f$. Datapoints at $v$ that fulfill this criterion are assigned to the right child and all other datapoints are assigned to the left child of $v$; the algorithm then recursively trains each subtree independently. To find this splitting criterion, each possible splitting threshold for each feature of the dataset is evaluated, and the candidate splitting criterion yielding the lowest loss is selected [loh2014fifty]. When making predictions with a regression tree, the predicted value for all examples in a leaf $\ell$ equals the average of the labels of the examples in $\ell$: if $S_\ell$ is the set of examples in leaf $\ell$ and the label of point $x$ is $y_x$, then the prediction for each of these examples is $\frac{1}{|S_\ell|}\sum_{x \in S_\ell} y_x$. Our goal is to provide a relational adaptation of this greedy algorithm for training regression trees, and then to apply this adaptation to boosted regression trees. Section 2.1 provides an algorithm to train a single regression tree relationally using the inside-out algorithm, a relational algorithm for evaluating SumProd queries [abo2016faq].
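For reference, a minimal non-relational version of this greedy procedure, operating directly on a materialized design matrix, might look as follows; it exhaustively scores every (feature, threshold) candidate and predicts the mean label in each leaf. Function names and the dictionary-based tree representation are only illustrative.

    import numpy as np

    def sse(y):
        # Sum of squared errors when predicting the mean label of y.
        return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0

    def best_split(X, y):
        # Score every criterion x[f] >= s and return the one with the lowest
        # total loss over the two children.
        best_f, best_s, best_loss = None, None, float("inf")
        for f in range(X.shape[1]):
            for s in np.unique(X[:, f]):
                right = X[:, f] >= s
                if right.all() or not right.any():
                    continue        # both children must receive at least one point
                loss = sse(y[~right]) + sse(y[right])
                if loss < best_loss:
                    best_f, best_s, best_loss = f, float(s), loss
        return best_f, best_s, best_loss

    def fit_tree(X, y, depth):
        # Greedy top-down training; each leaf predicts the average label of the
        # examples routed to it.
        f, s, _ = best_split(X, y) if depth > 0 else (None, None, None)
        if f is None:
            return {"prediction": float(y.mean())}
        right = X[:, f] >= s
        return {"feature": f, "threshold": s,
                "left": fit_tree(X[~right], y[~right], depth - 1),
                "right": fit_tree(X[right], y[right], depth - 1)}

This baseline needs the full design matrix in memory; the relational algorithm of Section 2.1 obtains the same split statistics through SumProd queries instead.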

In the relational setting, the input to a weak regressor consists of all previously trained weak regressors plus the tables $T_1, \dots, T_m$, each containing two or more features represented by columns. One of these columns is the label $y$. The goal of the relational boosted regression tree algorithm is to train a set of boosted regression trees that predicts $y$ on the basis of the other columns of $J$.

While training boosted regression trees on the design matrix can be done using well-known algorithms, no algorithm has previously been proposed for training regression trees relationally. We introduce an algorithm in Section 2.2 for training boosted regression trees, built on the algorithm for training individual regression trees given in Section 2.1. However, the runtime of this algorithm is dominated by the calculation of the sum of the dataset's squared residuals while determining splitting criteria for the non-initial weak regressors. As such, the methods used to train a single regression tree do not translate well to training boosted regression trees. To address this issue, we introduce in Section 3 an algorithm that uses sketching to approximate the sum of squared residuals. Sketching is a powerful dimensionality reduction tool for linear regression and matrix multiplication [rajesh2021indatabase, woodruff2014sketching]; for more details on sketching, see Section 1.1.2.

For acyclic joins, using SumProd queries as explained in Section 1.1.1 and the inside-out algorithm [abo2016faq], the sum of squared residuals can be calculated exactly: if $t$ weak regressors with $L$ leaves each have already been trained, there are $m$ input tables with a total of $d$ features, and the design matrix contains $n$ rows, the exact calculation uses $O(mt^2L^2)$ SumProd queries, i.e. a number of queries quadratic in the total number of leaves of the already-trained regressors. However, using the sketching technique along with the SumProd query algorithm, an approximation of the sum of squared residuals can be calculated with a number of SumProd queries that is only linear in the total number of leaves. The sketch size is a constant set according to the desired accuracy guarantee; with probability at least $1-\delta$, this algorithm gives a $(1+\epsilon)$-approximation of the sum of squared residuals. This sketching algorithm can then be plugged into the boosted regression trees algorithm. The time complexity of the boosted trees algorithm using sketching is asymptotically better for calculating the sum of squared residuals than the standard exact relational algorithm when the number of leaves is very big, which signifies that the trees are deep. Heuristically, this algorithm gives results similar to the standard algorithm.

1.1 Preliminaries

1.1.1 FAQs and Relational Algorithms

A SumProd query consists of:

  • A collection of tables $T_1, \dots, T_m$ in which each column has an associated feature. Because all data is numerical in the regression tree setting, assume that all features are numerical. Let $F$ be the collection of all features and $d = |F|$ be the number of features. The design matrix is $J = T_1 \bowtie \dots \bowtie T_m$, the natural join of the tables. Let $N$ denote the number of rows in the largest input table and $n$ denote the number of rows in $J$.

  • A function $f_j : \mathbb{R} \to S$ for each feature $j \in F$, for some base set $S$. We generally assume each $f_j$ is easy to compute.

  • Binary operations $\oplus$ and $\otimes$ such that $(S, \oplus, \otimes)$ forms a commutative semiring. Most importantly, this means that $\otimes$ distributes over $\oplus$.

Evaluating the query results in
$$\bigoplus_{x \in J} \bigotimes_{j \in F} f_j(x_j),$$
where $x$ is a row in $J$ and $x_j$ is the value of feature $j$ in row $x$. It is also possible for a SumProd query to group by a table $T_i$. Doing so corresponds to calculating, simultaneously for every row $r \in T_i$, the term
$$\bigoplus_{x \in J:\ x \text{ agrees with } r \text{ on the columns of } T_i} \ \ \bigotimes_{j \in F} f_j(x_j).$$
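To make the definition concrete, the sketch below evaluates a SumProd query by brute force: it materializes the natural join and folds the per-feature values with the semiring operations. A relational algorithm such as inside-out avoids materializing the join; this naive version, with made-up table and column names, only illustrates the semantics.

    from functools import reduce
    from itertools import product

    def natural_join(tables):
        # Join a list of tables (each a list of dict rows) on their shared columns.
        def join2(a, b):
            shared = set(a[0]) & set(b[0])
            return [{**ra, **rb} for ra, rb in product(a, b)
                    if all(ra[c] == rb[c] for c in shared)]
        return reduce(join2, tables)

    def sum_prod(tables, f, plus, times, plus_identity):
        # Evaluate  (+)_{rows x of J}  (x)_{features j}  f[j](x[j])  over the join J.
        result = plus_identity
        for row in natural_join(tables):
            result = plus(result, reduce(times, (f[j](row[j]) for j in f)))
        return result

    # Two tiny tables sharing the column "id".  With the counting semiring
    # (every f_j constant 1, + as plus, * as times) the query counts the rows
    # of the join; replacing f_y with the identity sums the labels instead.
    T1 = [{"id": 1, "a": 10.0}, {"id": 2, "a": 20.0}]
    T2 = [{"id": 1, "y": 1.5}, {"id": 1, "y": 2.5}, {"id": 2, "y": 0.5}]
    ones = {"id": lambda v: 1, "a": lambda v: 1, "y": lambda v: 1}
    count = sum_prod([T1, T2], ones, lambda u, v: u + v, lambda u, v: u * v, 0)
    labels = dict(ones, y=lambda v: v)
    label_sum = sum_prod([T1, T2], labels, lambda u, v: u + v, lambda u, v: u * v, 0)
    print(count, label_sum)   # 3 4.5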

Lemma 1.1 ([abo2016faq]).

Any SumProd query can be computed in time $O(md^2N^{\mathrm{fhtw}}\log N)$, where fhtw is the fractional hypertree width of the join query. For an acyclic join, $\mathrm{fhtw} = 1$, so the running time is $O(md^2N\log N)$, which is polynomial.

You may find the definition of acyclic joins and fractional hypertree width in Appendix A.3.

1.1.2 Tensor Sketch

Tensor sketch [avron2014subspace] is an approximation technique widely used for linear regression with polynomial kernels and for linear regression on matrices that are the result of Kronecker products. Given vectors $v_1, \dots, v_q$, where the $i$-th vector is $n_i$-dimensional, and their Kronecker product $v_1 \otimes \dots \otimes v_q$, we define 2-wise independent hash functions $g_1, \dots, g_q$ such that $g_i : [n_i] \to \{0, \dots, B-1\}$. In addition, we define 2-wise independent hash functions $\sigma_1, \dots, \sigma_q$ such that $\sigma_i : [n_i] \to \{+1, -1\}$. Let $\Pi$ be a random matrix with $\prod_{i=1}^{q} n_i$ rows and $B$ columns, where each row has exactly one non-zero entry. The index of the non-zero entry in the row indexed by $(j_1, \dots, j_q)$ is $\big(\sum_{i=1}^{q} g_i(j_i)\big) \bmod B$, and the sign of this entry is $\prod_{i=1}^{q} \sigma_i(j_i)$.

Let $u = v_1 \otimes \dots \otimes v_q$, where $\otimes$ is the Kronecker product, which forms the products of all combinations of the coordinates of the input vectors. Next, let $z_u$ equal the result of applying the tensor sketch algorithm to the vectors $v_1, \dots, v_q$, as described in [avron2014subspace], for some random selection of the hash functions $g_i$ and $\sigma_i$.

Theorem 1.2 ([avron2014subspace]).

The result $z_u$ of the tensor sketch algorithm is the same as $\Pi^{\top}(v_1 \otimes \dots \otimes v_q)$. Moreover, the matrix $\Pi$ satisfies the Approximate Matrix Product property: let $u$ and $v$ be vectors with $\prod_{i=1}^{q} n_i$ rows. For a sufficiently large sketch size $B$, depending on $\epsilon$ and $\delta$, we have
$$\Pr\Big[\,\big|u^{\top}\Pi\,\Pi^{\top}v - u^{\top}v\big| > \epsilon\,\|u\|_2\,\|v\|_2\,\Big] \le \delta.$$

As such, the sketched term $(\Pi^{\top}u)^{\top}(\Pi^{\top}v)$ may be calculated more efficiently and used in place of the exact term $u^{\top}v$. The tensor sketch algorithm calculates $\Pi^{\top}u$ and $\Pi^{\top}v$ efficiently when $u$ and $v$ are Kronecker products.
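A small numpy illustration of this construction, under simplifying assumptions (two input vectors, plain random tables standing in for 2-wise independent hash families): each vector is count-sketched with its own bucket and sign hashes, the sketches are combined by circular convolution so that $\Pi$ is never materialized, and the sketched inner product approximates the exact one.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_hashes(dims, B):
        # A bucket hash g_i : [n_i] -> {0..B-1} and a sign hash sigma_i : [n_i] -> {-1,+1}
        # per input vector.
        return ([rng.integers(0, B, size=n) for n in dims],
                [rng.choice([-1, 1], size=n) for n in dims])

    def tensor_sketch(vectors, buckets, signs, B):
        # Sketch v_1 (x) v_2 (x) ... without materializing the Kronecker product:
        # count-sketch each v_i, then combine the sketches by circular convolution.
        fft_prod = np.ones(B, dtype=complex)
        for v, g, sg in zip(vectors, buckets, signs):
            cs = np.zeros(B)
            np.add.at(cs, g, sg * v)        # count sketch of v_i
            fft_prod *= np.fft.fft(cs)      # convolution theorem
        return np.real(np.fft.ifft(fft_prod))

    dims, B = [40, 30], 2048
    g, sg = make_hashes(dims, B)
    u1, u2 = rng.normal(size=dims[0]), rng.normal(size=dims[1])
    v1, v2 = u1 + 0.1 * rng.normal(size=dims[0]), u2 + 0.1 * rng.normal(size=dims[1])

    exact = np.dot(np.kron(u1, u2), np.kron(v1, v2))
    approx = np.dot(tensor_sketch([u1, u2], g, sg, B),
                    tensor_sketch([v1, v2], g, sg, B))
    print(exact, approx)   # the sketched inner product is typically within a few percent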

1.2 Related Works

A few machine learning models have already been adapted to the relational setting. Linear regression and factorization machines are implemented more efficiently by [rendle2013scaling] by exploiting repeating patterns in the design matrix. [kumar2015learning, schleich2016learning, acdc2018] further improve relational linear regression and factorization machines for specific scenarios. The relational algorithms for linear regression, singular value decomposition, factorization machines, and other models are unified by [abo2018database]. The support vector machine training algorithms given by [yang2020towards, abo2021relational] adapt this model to the relational setting. [cheng2019nonlinear] create a relational algorithm for Independent Gaussian Mixture Models, which is experimentally shown to be faster than computing the design matrix. Relational algorithms for $k$-means clustering are provided in [moseley2020relational].

Section 2 contains the algorithm to train a single regression tree on relational data and the exact algorithm for boosted regression trees on relational data, whose runtime per tree is cubic with respect to the number of leaves. Section 3 describes the approximation algorithm, which uses tensor sketching to achieve a runtime per tree that is quadratic with respect to the number of leaves.

2 Exact Algorithm

2.1 Training a Single Regression Tree (Algorithm 1)

For a particular node $v$ in a regression tree, let $S_v$ be the set of rows in the design matrix that satisfy all the constraints between node $v$ and the root of the regression tree. For every node $v$, the algorithm constructs a criterion of the form $x_f \geq s$, where $s$ is a threshold over the values of feature $f$ in the dataset. All points satisfying the constraint belong to the right child and all points not satisfying it belong to the left child. This means that the training process consists of finding a threshold $s$ and a column (feature) $f$ for every internal node of the regression tree. Starting from the root, we build the regression tree from top to bottom in breadth-first order. To choose the splitting criterion at node $v$, the following process is repeated to calculate the mean squared error for each possible feature $f$ of each table and for each possible threshold $s$ for feature $f$:

For every table $T_i$, the algorithm performs three SumProd queries grouped by $T_i$, explained in detail shortly. For each row $r$ of $T_i$, these queries aggregate over the subset of rows of the design matrix that satisfy all splitting criteria between the root node and $v$ and that have the same values as $r$ in the columns they share with $T_i$. The first query yields $\mathrm{CNT}(r)$, the number of such rows of $S_v$; the second yields $\mathrm{SUM}(r)$, the sum of the labels of these rows; and the third yields $\mathrm{SQ}(r)$, the sum of their labels squared.
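In spirit, these grouped queries produce, for every value combination appearing in the chosen table, a count, a label sum, and a label sum of squares over the matching rows of $S_v$. A naive in-memory stand-in (column names are made up; a real relational implementation would obtain the same numbers from SumProd queries without materializing $S_v$):

    from collections import defaultdict

    def grouped_stats(S_v, table_columns, label="y"):
        # For each combination of values of `table_columns` (the columns of one
        # input table), aggregate the rows of S_v matching it and return
        # (count, sum of labels, sum of squared labels).
        stats = defaultdict(lambda: [0, 0.0, 0.0])
        for row in S_v:
            s = stats[tuple(row[c] for c in table_columns)]
            s[0] += 1
            s[1] += row[label]
            s[2] += row[label] ** 2
        return dict(stats)

    # S_v: the design-matrix rows reaching node v (here listed explicitly).
    S_v = [{"id": 1, "a": 10.0, "y": 1.5},
           {"id": 1, "a": 10.0, "y": 2.5},
           {"id": 2, "a": 20.0, "y": 0.5}]
    print(grouped_stats(S_v, table_columns=("a",)))
    # {(10.0,): [2, 4.0, 8.5], (20.0,): [1, 0.5, 0.25]}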

For each feature $f$ in a table $T_i$, the possible threshold values for a splitting criterion on this feature are the values in the domain of column $f$. Let $D_f$ be the domain of column $f$, which can be read off table $T_i$. For each possible splitting threshold $s \in D_f$, the number of points that would satisfy the constraint between $v$ and its left child is
$$C_{\mathrm{L}}(s) = \sum_{r \in T_i:\ r_f < s} \mathrm{CNT}(r).$$
Note that the terms $\mathrm{CNT}(r)$ for the different rows $r$ are precomputed by the SumProd queries grouped by table $T_i$. Similarly, the sum of the labels of these points is
$$Y_{\mathrm{L}}(s) = \sum_{r \in T_i:\ r_f < s} \mathrm{SUM}(r),$$
and the sum of the labels squared is
$$Q_{\mathrm{L}}(s) = \sum_{r \in T_i:\ r_f < s} \mathrm{SQ}(r).$$
These expressions can each be determined from the SumProd queries that were already performed. All remaining points of $S_v$ satisfy the constraint between $v$ and the right child. Their number, the sum of their labels, and the sum of their labels squared can be calculated similarly: $C_{\mathrm{R}}(s) = |S_v| - C_{\mathrm{L}}(s)$ for the number, $Y_{\mathrm{R}}(s) = Y_v - Y_{\mathrm{L}}(s)$ for the sum of labels, and $Q_{\mathrm{R}}(s) = Q_v - Q_{\mathrm{L}}(s)$ for the sum of labels squared, where $|S_v|$, $Y_v$, and $Q_v$ are the corresponding totals over all of $S_v$. For each child, the predicted label for all points of $S_v$ satisfying the constraint between $v$ and that child is the average label, which is $Y_{\mathrm{L}}(s)/C_{\mathrm{L}}(s)$ for the left child and $Y_{\mathrm{R}}(s)/C_{\mathrm{R}}(s)$ for the right child.

To evaluate the mean squared error for the threshold $s$, the algorithm calculates the mean squared error at each of the two children. For the left child, the MSE is calculated as
$$\mathrm{MSE}_{\mathrm{L}}(s) = \frac{1}{C_{\mathrm{L}}(s)}\left(Q_{\mathrm{L}}(s) - \frac{Y_{\mathrm{L}}(s)^2}{C_{\mathrm{L}}(s)}\right),$$
and similarly for the right child, $\mathrm{MSE}_{\mathrm{R}}(s) = \frac{1}{C_{\mathrm{R}}(s)}\big(Q_{\mathrm{R}}(s) - \frac{Y_{\mathrm{R}}(s)^2}{C_{\mathrm{R}}(s)}\big)$. The MSE at $v$ given this splitting threshold is the sum of the MSEs at its children, i.e.
$$\mathrm{MSE}_v(s) = \mathrm{MSE}_{\mathrm{L}}(s) + \mathrm{MSE}_{\mathrm{R}}(s).$$
The algorithm calculates $\mathrm{MSE}_v(s)$ for each threshold $s$ to find the best threshold for each feature $f$, then selects the feature and threshold that yield the lowest MSE. Finally, the two new nodes are added as children of $v$ in the regression tree.
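One possible way to carry out this per-threshold evaluation from the grouped aggregates is with running prefix sums over the sorted feature values; each child's MSE uses the identity $\sum_x (y_x - \bar{y})^2 = \sum_x y_x^2 - (\sum_x y_x)^2 / C$. The sketch below (names are illustrative) consumes the output of the previous snippet for a single feature.

    def best_threshold(value_stats):
        # value_stats: feature value -> (count, sum of labels, sum of squared labels)
        # over the rows of S_v with that value.  Scores the criterion x_f >= s for
        # every candidate s and returns (best threshold, MSE_left + MSE_right).
        def mse(c, sm, sq):
            return (sq - sm * sm / c) / c if c > 0 else 0.0

        values = sorted(value_stats)
        total = [sum(value_stats[u][k] for u in values) for k in range(3)]

        best_s, best_score = None, float("inf")
        left = [0, 0.0, 0.0]                      # running count / sum / sum of squares
        for prev, s in zip(values, values[1:]):   # thresholds leaving both children non-empty
            left = [left[k] + value_stats[prev][k] for k in range(3)]
            right = [total[k] - left[k] for k in range(3)]
            score = mse(*left) + mse(*right)
            if score < best_score:
                best_s, best_score = s, score
        return best_s, best_score

    stats = {10.0: (2, 4.0, 8.5), 20.0: (1, 0.5, 0.25)}   # aggregates for one feature
    print(best_threshold(stats))   # (20.0, 0.25): split on x >= 20.0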

Theorem 2.1.

Algorithm 1 determines the splitting threshold for one node of a regression tree using $3m$ SumProd queries.

Proof.

When selecting a splitting criterion for some node $v$ of the regression tree, it is necessary to find the number of points, the sum of the labels, and the sum of the labels squared for the rows in $S_v$. These three pieces of information are organized by the rows' values on each of the features, which is achieved by grouping by each of the $m$ input tables. As such, each of the three pieces of information requires $m$ SumProd queries, meaning $3m$ SumProd queries are needed to find the splitting threshold for one node $v$. ∎

Corollary 2.2.

Algorithm 1 trains a single regression tree with $L$ leaves over $m$ tables using $O(mL)$ SumProd queries.

Proof.

Because a tree with $L$ leaves has $L-1$ internal nodes and finding the threshold for one node requires $3m$ SumProd queries, Algorithm 1 requires $3m(L-1) = O(mL)$ SumProd queries. ∎

Corollary 2.3.

Given an acyclic join query, Algorithm 1 trains a single regression tree with $L$ leaves over $m$ tables in time $O(m^2 d^2 L N \log N)$.

Proof.

Algorithm 1 requires $O(mL)$ SumProd queries, each of which runs in time $O(md^2 N \log N)$ for an acyclic join by Lemma 1.1. Thus, the runtime of Algorithm 1 is $O(m^2 d^2 L N \log N)$. ∎

2.2 Exact Algorithm for Boosted Trees (Algorithm 2)

The first weak regressor, $h_1$, is trained using the same method as detailed in Algorithm 1. Assume we have trained regression trees $h_1, \dots, h_t$. The method to construct the $(t+1)$-st regression tree is similar to Algorithm 1, except that instead of the points' labels, the tree must predict the residuals
$$r = y - \sum_{i=1}^{t} \hat{y}^{(i)},$$
where $y$ is the vector of labels and $\hat{y}^{(i)}$ is the vector of the $i$-th weak regressor's predictions for the set of points in the design matrix $J$.

As in Algorithm 1, for every node $v$, the algorithm constructs a criterion of the form $x_f \geq s$, where $s$ is a threshold; all points satisfying the constraint belong to the right child and all points not satisfying it belong to the left child. Starting from the root, we build the regression tree from top to bottom in breadth-first order.

What differs from Algorithm 1 is the need to calculate, for each table $T_i$ and each row $r \in T_i$, the sum of the residuals and the sum of the squared residuals of the points assigned to node $v$, grouped by table $T_i$. To calculate the sum of residuals, note that the predicted value at each leaf of each previously built regression tree has already been calculated and can be assumed to be available without performing a SumProd query. For each leaf $\ell$ of each previously built weak regressor $h_i$, let $S_{i,\ell}$ be the set of rows of the design matrix that satisfy all the constraints between leaf $\ell$ and the root of $h_i$, and consider the intersection $S_v \cap S_{i,\ell}$. Calculate a SumProd query, grouped by $T_i$ and restricted to the rows of $S_v \cap S_{i,\ell}$, in which the function is 1 for every table except one designated table, for which it equals the predicted value $p_{i,\ell}$ of leaf $\ell$. The result of this query equals the sum of the predicted values of $h_i$ over the points that are present in both $S_v$ and $S_{i,\ell}$. Summing the results of this query over all leaves $\ell$ of $h_i$ yields the sum of the predicted values of $h_i$ over all rows of $S_v$. Then, calculate the sum of the labels of the points in $S_v \cap S_{i,\ell}$ by the analogous query whose designated function is the label, and obtain the sum of the residuals of all points in $S_v$ as
$$\sum_{x \in S_v} r_x \;=\; \sum_{x \in S_v} y_x \;-\; \sum_{i=1}^{t}\ \sum_{\ell \in \mathrm{leaves}(h_i)} p_{i,\ell}\,\big|S_v \cap S_{i,\ell}\big|.$$
Because all of these queries are grouped by $T_i$, the same computation simultaneously yields the sum of residuals for the rows of $S_v$ matching each row $r \in T_i$.

Next, let $h_i(x)$ denote the prediction of weak regressor $h_i$ for a row $x$. Obtaining the sum of the squared residuals over $S_v$ requires finding the values of the four terms in the expansion

$$\sum_{x \in S_v} r_x^2 \;=\; \sum_{x \in S_v} y_x^2 \;-\; 2\sum_{x \in S_v} y_x \sum_{i=1}^{t} h_i(x) \;+\; \sum_{x \in S_v} \sum_{i=1}^{t} h_i(x)^2 \;+\; \sum_{x \in S_v} \sum_{i \neq i'} h_i(x)\,h_{i'}(x). \qquad (1)$$

The first term may be calculated as described in Algorithm 1. The third term, the sum of the squared predictions, may be calculated similarly to the sum of the predictions: calculate, for each leaf $\ell$ of $h_i$, the SumProd query whose function is 1 for every table except the designated table, for which it equals the squared predicted value $p_{i,\ell}^2$; summing this query over all leaves of $h_i$ yields the sum of the squared predicted values of $h_i$ over all rows of $S_v$. The second term is assembled from quantities already calculated in order to find the sum of residuals: for each leaf $\ell$ of $h_i$, its contribution $\sum_{x \in S_v \cap S_{i,\ell}} y_x\,h_i(x)$ is merely the product of $p_{i,\ell}$ and the sum of the labels of the points of $S_v$ that fall in $S_{i,\ell}$.

However, the fourth term requires the products of the predictions of every non-matching pair of weak regressors. For any row $x$ of $S_v$ and any two regressors $h_i$ and $h_{i'}$, $x$ can lie in $S_{i,\ell}$ for any leaf $\ell$ of $h_i$ and in $S_{i',\ell'}$ for any leaf $\ell'$ of $h_{i'}$. As such, a separate query must be performed for each pair of leaves $(\ell, \ell')$ and for every pair of regressors $h_i$ and $h_{i'}$: the query is restricted to the rows of $S_v \cap S_{i,\ell} \cap S_{i',\ell'}$, and its function equals the product $p_{i,\ell}\,p_{i',\ell'}$ of the predicted values of the two leaves for the designated table and is 1 for all other tables. Summing this query over all $L^2$ pairs of leaves gives $\sum_{x \in S_v} h_i(x)\,h_{i'}(x)$ for one pair of weak regressors $h_i$ and $h_{i'}$. This term is computed once for each of the $O(t^2)$ non-identical pairs of weak regressors.
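For reference, a worked expansion in the notation above shows why one counting query per pair of leaves suffices: within $S_v \cap S_{i,\ell} \cap S_{i',\ell'}$ the two predictions are the constants $p_{i,\ell}$ and $p_{i',\ell'}$, so
$$\sum_{x \in S_v} h_i(x)\,h_{i'}(x)
\;=\; \sum_{\ell \in \mathrm{leaves}(h_i)}\ \sum_{\ell' \in \mathrm{leaves}(h_{i'})}\ \sum_{x \in S_v \cap S_{i,\ell} \cap S_{i',\ell'}} p_{i,\ell}\,p_{i',\ell'}
\;=\; \sum_{\ell,\,\ell'} p_{i,\ell}\,p_{i',\ell'}\;\big|S_v \cap S_{i,\ell} \cap S_{i',\ell'}\big|.$$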

Let $r_x$ be the residual of a row $x$. After also gathering the counts using the same queries as in Algorithm 1, the predicted value for the left child is the average residual of the points assigned to it, and likewise for the right child. Then, for a candidate threshold $s$, let $C_{\mathrm{L}}(s)$, $R_{\mathrm{L}}(s)$, and $R^{2}_{\mathrm{L}}(s)$ denote the number of points, the sum of the residuals, and the sum of the squared residuals of the points assigned to the left child, and analogously for the right child. The MSE for the left child is
$$\mathrm{MSE}_{\mathrm{L}}(s) = \frac{1}{C_{\mathrm{L}}(s)}\left(R^{2}_{\mathrm{L}}(s) - \frac{R_{\mathrm{L}}(s)^2}{C_{\mathrm{L}}(s)}\right),$$
and similarly for the right child, $\mathrm{MSE}_{\mathrm{R}}(s)$. The MSE at $v$ with splitting threshold $s$ is the sum of the MSEs of its two children, i.e.
$$\mathrm{MSE}_v(s) = \mathrm{MSE}_{\mathrm{L}}(s) + \mathrm{MSE}_{\mathrm{R}}(s).$$
Calculate $\mathrm{MSE}_v(s)$ for each threshold $s$ to find the best threshold for each feature $f$, then select the feature that yields the lowest MSE. Finally, add the two new nodes as children of $v$ in the regression tree.

Theorem 2.4.

Algorithm 2 calculates the sum of the squared residuals using $O(mt^2L^2)$ SumProd queries.

Proof.

When $t$ weak regressors with $L$ leaves each have already been constructed and the splitting criterion is being selected for some node $v$ of the $(t+1)$-st weak regressor, the sum of the squared residuals uses $O(t^2L^2)$ queries per grouping table, because the fourth term of Equation 1 iterates over all pairs of leaves of two weak regressors, for all pairs of weak regressors. This process is repeated to group by each of the $m$ input tables, so the sum of the squared residuals is calculated with $O(mt^2L^2)$ SumProd queries. ∎

Theorem 2.5.

Algorithm 2 calculates the sum of the squared residuals in time $O(m^2 d^2 t^2 L^2 N^{\mathrm{fhtw}} \log N)$.

Proof.

Algorithm 2 requires $O(mt^2L^2)$ SumProd queries to find the sum of the squared residuals, as shown in Theorem 2.4. The runtime of a SumProd query is $O(md^2 N^{\mathrm{fhtw}} \log N)$, so the time to find the sum of the squared residuals is $O(m^2 d^2 t^2 L^2 N^{\mathrm{fhtw}} \log N)$. ∎

Theorem 2.6.

Algorithm 2 determines the splitting threshold for one node of the $(t+1)$-st regression tree using $O(mt^2L^2)$ SumProd queries.

Proof.

When $t$ weak regressors have already been constructed and the splitting criterion is being selected for some node $v$ of the $(t+1)$-st weak regressor, the following items must be gathered for each child of $v$, each feature $f$ of each table, and each possible splitting threshold $s$ for feature $f$: the number of points, the sum of the residuals, and the sum of the squared residuals. Grouping by just one input table, the number of points can be gathered with one SumProd query and the sum of the residuals requires $O(tL)$ SumProd queries. However, the sum of the squared residuals uses $O(t^2L^2)$ queries, as shown in Theorem 2.4. This process is repeated to group by each of the $m$ input tables. As such, Algorithm 2 requires $O(mt^2L^2)$ SumProd queries to find the splitting threshold for one node $v$. ∎

Theorem 2.7.

Algorithm 2 can train the $(t+1)$-st regression tree with $L$ leaves using the greedy algorithm with $O(mt^2L^3)$ SumProd queries.

Proof.

The process of Theorem 2.6 is repeated at each of the $L-1$ non-terminal nodes of a tree with $L$ leaves. As such, Algorithm 2 requires $O(mt^2L^2) \cdot (L-1) = O(mt^2L^3)$ SumProd queries. ∎

Corollary 2.8.

The runtime of Algorithm 2 for training the $(t+1)$-st regression tree is $O(m^2 d^2 t^2 L^3 N^{\mathrm{fhtw}} \log N)$.

Proof.

Algorithm 2 requires $O(mt^2L^3)$ SumProd queries, each of which runs in time $O(md^2 N^{\mathrm{fhtw}} \log N)$. Hence, the runtime of Algorithm 2 is $O(m^2 d^2 t^2 L^3 N^{\mathrm{fhtw}} \log N)$. ∎

3 Approximation Algorithm

Assume that $t$ weak regressors $h_1, \dots, h_t$ have already been trained and that node $v$ of the $(t+1)$-st regressor is being evaluated to determine its splitting criterion. The time complexity of the boosted algorithm can be improved by approximating the term that dominates the runtime of the exact boosted algorithm, namely the sum of the squared residuals given in Equation 1. Let the $x$-th element of a vector $u$ be denoted by $u_x$. The term can also be viewed as the squared L2 norm of a sum of vectors, i.e.
$$\sum_{x \in S_v} r_x^2 \;=\; \Big\|\, y^{(v)} - \sum_{i=1}^{t} \hat{y}^{(i)} \Big\|_2^2.$$
At index $x$, the vector $y^{(v)}$ contains the label of the $x$-th point and $\hat{y}^{(i)}$ contains the $i$-th regression tree's prediction for the $x$-th point of $S_v$; each of these vectors has zeros at the indices corresponding to points in $J$ but not in $S_v$. To approximate the norm, the vectors are each sketched. Because sketching is a linear operator, the sum of the sketches is the sketch of $y^{(v)} - \sum_{i=1}^{t} \hat{y}^{(i)}$.

For one weak regressor $h_i$, the vector $\hat{y}^{(i)}$ can be sketched as a much shorter vector. For simplicity, consider the sketch for just one fixed point $x$. The vector $\hat{y}^{(i)}$ is the sum of one vector per leaf: the vector $\hat{y}^{(i,\ell)}$ represents the predictions of the regressor's leaf $\ell$ for the rows of the design matrix that are assigned both to $\ell$ and to node $v$. This set of rows is $S_v \cap S_{i,\ell}$. Because each point of $S_v$ lies in exactly one of the sets $S_{i,\ell}$, all but one of these vectors consist entirely of zeros when considering just the fixed point $x$. If $x$ is in $S_{i,\ell}$, then, restricted to the fixed point $x$, the vector $\hat{y}^{(i,\ell)}$ has a single non-zero element: it sits at the index corresponding to $x$ in the design matrix and equals the prediction $p_{i,\ell}$ of leaf $\ell$ of tree $h_i$.

Restricted to one point $x$, the vector $\hat{y}^{(i,\ell)}$ is the Kronecker product of $m$ vectors, one for each of the tables, as explained in the following. The sketching algorithm is applied to these per-table vectors and is used to approximate the vector representing the predictions of one leaf $\ell$ of one regressor $h_i$.

For each feature represented in the columns of the tables $T_1, \dots, T_m$, assign the feature to exactly one of the tables containing it. Let $F_i$ be the set of features assigned to table $T_i$ and let $D_i$ be the domain of $F_i$, i.e. the set of possible value combinations of those features. Let $\pi_i(x)$ be the projection of $x$ onto $F_i$, which is just the features of $x$ that are in $F_i$. Let $\mathrm{idx}_i(x)$ be the index of $\pi_i(x)$ in $D_i$, assuming some fixed ordering of $D_i$, and let $e_{\mathrm{idx}_i(x)}$ be a vector of $|D_i|$ elements with 1 at index $\mathrm{idx}_i(x)$ and 0 elsewhere. Let $p_{i,\ell}$ be the predicted value at leaf $\ell$ of regressor $h_i$. Then
$$u^{(x)} \;=\; p_{i,\ell}\; e_{\mathrm{idx}_1(x)} \otimes e_{\mathrm{idx}_2(x)} \otimes \dots \otimes e_{\mathrm{idx}_m(x)}$$
represents the prediction of leaf $\ell$ for $x$: this vector has the prediction for $x$ at the index corresponding to $x$ in the product domain $D_1 \times \dots \times D_m$ and zeros at all other indices. Calculating and summing the vectors $u^{(x)}$ for all $x \in S_v \cap S_{i,\ell}$, and then over all leaves $\ell$ of $h_i$, yields $\hat{y}^{(i)}$, the vector of predictions of weak regressor $h_i$ for all points in $S_v$.

Sketching a vector $u$ is equivalent to calculating the product $\Pi^{\top} u$ for the sketching matrix $\Pi$. Since $\Pi$ is very sparse and has special structure, we never perform this multiplication explicitly; however, the result is the same as the product. Because the sketch is a linear map, the sketch of a sum of vectors equals the sum of the sketches of those vectors. Therefore, to sketch $\hat{y}^{(i)}$, we can sum the sketches of the vectors $u^{(x)}$, which, as we show shortly, can be computed using SumProd queries.
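A quick numpy check of this linearity, with an explicit (dense, purely for illustration) count-sketch matrix $\Pi$: sketching a sum of vectors gives the same result as summing the individual sketches.

    import numpy as np

    rng = np.random.default_rng(1)
    n, B = 1000, 64

    # An explicit count-sketch matrix: each row has a single +/-1 entry.
    Pi = np.zeros((n, B))
    Pi[np.arange(n), rng.integers(0, B, size=n)] = rng.choice([-1, 1], size=n)

    vectors = rng.normal(size=(5, n))            # five vectors to be summed
    sketch_of_sum = vectors.sum(axis=0) @ Pi     # sketch(v1 + ... + v5)
    sum_of_sketches = (vectors @ Pi).sum(axis=0)
    print(np.allclose(sketch_of_sum, sum_of_sketches))   # True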

The sketch is applied to the term $u^{(x)} = p_{i,\ell}\, e_{\mathrm{idx}_1(x)} \otimes \dots \otimes e_{\mathrm{idx}_m(x)}$ for a fixed $x$ to produce a much smaller vector of length $B$. Define 2-wise independent hash functions $g_1, \dots, g_m$ such that $g_i : D_i \to \{0, \dots, B-1\}$. Also define 2-wise independent hash functions $\sigma_1, \dots, \sigma_m$ such that $\sigma_i : D_i \to \{+1, -1\}$. The integer $B$ is chosen according to the desired accuracy, success probability, and runtime. For each table $T_i$, encode the one-hot vector $e_{\mathrm{idx}_i(x)}$ as the polynomial
$$q_i(z) \;=\; \sigma_i(\pi_i(x))\; z^{\,g_i(\pi_i(x))},$$
where $z$ is a formal variable; the degree of a term represents the index of its coefficient in a vector. The tensor-sketched vector of the prediction for one point $x$ of $S_v \cap S_{i,\ell}$ for regressor $h_i$ may then be calculated as
$$p_{i,\ell}\;\, q_1(z)\, q_2(z) \cdots q_m(z) \bmod \big(z^{B} - 1\big),$$
read off as a coefficient vector of length $B$. Summing these sketches over all $x \in S_v \cap S_{i,\ell}$, and then over all leaves $\ell$, yields the sketch of $\hat{y}^{(i)}$. Note that this summation can be computed using a SumProd query grouped by one of the tables, with polynomial addition as $\oplus$ and polynomial multiplication modulo $z^{B} - 1$ as $\otimes$.
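The following check, under the assumptions above (two tables, small made-up hash tables), confirms that multiplying the per-table polynomials modulo $z^{B} - 1$ and scaling by the leaf prediction reproduces the count sketch of the Kronecker product $p_{i,\ell}\, e_{j_1} \otimes e_{j_2}$.

    import numpy as np

    B = 16
    rng = np.random.default_rng(2)
    dom_sizes = [5, 7]                                         # |D_1| and |D_2| for two tables
    g = [rng.integers(0, B, size=d) for d in dom_sizes]        # bucket hashes g_i
    sigma = [rng.choice([-1, 1], size=d) for d in dom_sizes]   # sign hashes sigma_i

    def poly_for(i, j):
        # Coefficient vector (length B) of the polynomial sigma_i(j) * z^{g_i(j)}
        # encoding the one-hot vector e_j of table i.
        p = np.zeros(B)
        p[g[i][j]] = sigma[i][j]
        return p

    def mul_mod(p, q):
        # Polynomial multiplication modulo z^B - 1, i.e. circular convolution.
        return np.real(np.fft.ifft(np.fft.fft(p) * np.fft.fft(q)))

    j1, j2, prediction = 3, 4, 2.5                             # point x's indices and p_{i,l}
    sketch_via_polys = prediction * mul_mod(poly_for(0, j1), poly_for(1, j2))

    # Reference: count-sketch the Kronecker product directly, using the combined
    # bucket (g_1(j1) + g_2(j2)) mod B and the combined sign sigma_1(j1)*sigma_2(j2).
    direct = np.zeros(B)
    direct[(g[0][j1] + g[1][j2]) % B] = prediction * sigma[0][j1] * sigma[1][j2]
    print(np.allclose(sketch_via_polys, direct))               # True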

Meanwhile, the label vector $y^{(v)}$ can be expressed by a similar SumProd query. For a fixed point $x \in S_v$, its contribution is the Kronecker product $w_1 \otimes w_2 \otimes \dots \otimes w_m$, where $w_i$ corresponds to the $i$-th table and has the same number of elements as $e_{\mathrm{idx}_i(x)}$. The elements of $w_i$ are all zero except at the index corresponding to $\pi_i(x)$ in $D_i$: for the table containing the label column, the element at this index equals the label $y_x$, and for every other table it equals 1. The sketch of $y^{(v)}$ can therefore be found by applying the sketching technique to $w_1 \otimes \dots \otimes w_m$ for each $x \in S_v$, using the same 2-wise independent hash functions $g_1, \dots, g_m$ and the same sign functions $\sigma_1, \dots, \sigma_m$. Encoding each $w_i$ as the polynomial whose only term has coefficient equal to the non-zero entry of $w_i$ times $\sigma_i(\pi_i(x))$ and degree $g_i(\pi_i(x))$, compute the product of these polynomials, where the multiplication is polynomial multiplication modulo $z^{B} - 1$. Summing the resulting coefficient vectors over all $x \in S_v$ yields the tensor sketch of $y^{(v)}$.

This way, the sketched terms