The popularity of the XGBoost(xgboost) algorithm is mainly due to its ability to build effective GBDT models on large and complex data sets. The techniques that allow it to build models efficiently include distributed tree building, optimized data structures, and out-of-core computation. Of these, the most important is building a decision tree in a distributed manner. The main bottleneck in distributed tree building is determining the optimal value at which a node should be split. For the split finding step, a greedy algorithm iterates over all feature values, computing the gain of each to find the best split point at a particular node. This is repeated for every node of the decision tree, which makes it inefficient. To alleviate this, approximate split finding algorithms speed up the search by enumerating only a limited set of data points for each feature. These algorithms summarize all feature values into a smaller set and then find the best split point within the summarized set. Such summarization was previously applied where decision trees were built on big datasets or on streaming data(spies; spdt). These techniques have since been adopted by more powerful GBDT methods like XGBoost(xgboost), CATBoost(catboost), and LightGBM(Ke2017LightGBMAH). These boosting algorithms build models from many decision trees; hence, any efficiency gained in building a single decision tree is magnified across the ensemble. These GBDT algorithms therefore deal with big data by building the individual decision trees of the ensemble efficiently. For instance, the authors of the original XGBoost paper(xgboost) introduce a weighted quantile building step that is reported to improve the accuracy of split finding while also constructing the decision trees efficiently.
The paper claims this feature is a novel contribution of XGBoost compared to the rest of its contemporaries. In XGBoost and similar algorithms, the data is summarized by sophisticated quantile building procedures that ensure the data summary is faithful to the original data. In this paper, we show theoretically and empirically that constructing sparse datasets with such sophisticated quantile building procedures offers no advantage in the accuracy of the resulting ensembles over a simple random selection of points as candidate splits for a decision tree. These results should help practitioners build simple and scalable algorithms for constructing decision trees on big data.
In Section 2, the paper first examines previous work on efficiently building decision trees and the use of some of these techniques in GBDT methods. We then provide a precise mathematical framework for the problem at hand and compare the efficacy of different quantization algorithms in Section 3. This is followed by Section 3.1 where we show that the sophisticated quantization algorithm is equivalent to a random selection of split candidates. Section 4 provides an empirical comparison of XGBoost using simple random sampling for data summarization with the original version of XGBoost, which uses sophisticated quantile building steps.
2. Related work
The decision tree is a useful ML technique that recursively partitions the input space and assigns specific class labels to different partitions. The partitions are constructed by first choosing the feature along which a partition, or equivalently a split in the decision tree, is to be created. This is followed by choosing the value of the feature on which to split. The split finding step needs to scan through the entire set of data points to determine the optimal feature and the appropriate value to split on. The split is chosen such that, by splitting on a particular node of the tree, the objective function defined over the data is maximised. Typical objective functions used in decision trees are based on Gini impurity and entropy. The split point is chosen so that splitting at that value gives the maximum increase of the objective function compared to all other candidate splits. The split finding procedure is inefficient, especially when dealing with big data, since every split requires a scan over the entire data. Many algorithms try to overcome this inefficiency by considering a subset of the data instead of the whole dataset. These algorithms fall into two categories based on the way they construct the subsets.
Data faithful: This class of algorithms typically tries to build the subset of data such that the subset mimics the actual data as closely as possible. Hence we call these algorithms data faithful. They include SS in CLOUDS(clouds), SPDT(spdt), XGBoost(xgboost), CATBoost(catboost), and LightGBM(Ke2017LightGBMAH). We include boosting methods like XGBoost, CATBoost, and LightGBM alongside decision tree building algorithms like CLOUDS and SPDT because the boosting algorithms build an ensemble of decision trees and use these approximate methods to build the trees efficiently. Of these, the SS variant of CLOUDS and XGBoost use quantile approximation to build a subset over the training data. CATBoost offers the user a choice of several quantization methods. LightGBM uses a histogram-based approach to construct the bins. In a quantile approximation, the idea is to construct bins over the data such that roughly the same number of training points falls into each bucket. Usually, separate bins are constructed for the individual features of the data. Once the bins are constructed, a representative data point is chosen for each bin, serving as a split candidate.
Objective faithful: In this class of algorithms, the focus is to choose candidate splits so that they best represent the objective function optimized by the decision tree optimizer. However, it is challenging to represent the objective function because we do not have any knowledge about the objective. An algorithm like SSE of (clouds) builds a subset of the training data in multiple rounds. The subset is constructed through a heuristic which tries to make the subset of data points faithful to the objective function and tries to include the optimal data point in the subset. SPIES(spies) is another such algorithm. The idea is to choose the split points such that there is a good chance that the best split point is included in the subset, hence ending up with a decision tree that is equivalent or close to the tree that would have been built if all of the data points had been considered.
From the definitions, it is clear that the data faithful algorithms do not consider the objective function while building the subset and focus entirely on approximating the input space. In doing so, these algorithms can end up choosing sub-optimal subsets as candidate splits: representing the input space accurately does not guarantee capturing the splits that can potentially improve the objective. This situation is illustrated in Fig.(1). On the other hand, objective faithful algorithms spend substantial effort choosing the approximating subset and hence are computationally inefficient. It is also not clear how to find the optimal splits over a completely unknown objective without actually evaluating the function over the data points. In the following section, we quantify the expected error made by these algorithms in choosing the candidate splits while constructing the approximating subset and compare it with the simple approach of building the subset by sampling points uniformly at random.
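The contrast between an exact scan and a data faithful candidate set can be illustrated with a small sketch. This is only a toy illustration, not the procedure of any of the cited systems: the function names (`gini`, `split_gain`, `best_split`, `equal_frequency_candidates`), the Gini-based gain, the equal-frequency binning rule, and the toy data are all assumptions made for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(values, labels, threshold):
    """Impurity reduction obtained by splitting at `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

def best_split(values, labels, candidates):
    """Evaluate only the given candidate thresholds and return the best one."""
    return max(candidates, key=lambda t: split_gain(values, labels, t))

def equal_frequency_candidates(values, num_bins):
    """Upper edge of each bin except the last, with equal counts per bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(b * n) // num_bins - 1] for b in range(1, num_bins)]

# Toy data: one feature with values 1..12; the class changes after value 5.
values = [float(v) for v in range(1, 13)]
labels = [0] * 5 + [1] * 7

# Exact greedy scan: every distinct value except the largest is a candidate.
exact = best_split(values, labels, sorted(set(values))[:-1])

# Data faithful summary: 3 equal-frequency bins give 2 candidate splits.
approx_candidates = equal_frequency_candidates(values, 3)
approx = best_split(values, labels, approx_candidates)

print(exact, approx_candidates, approx)
```

On this toy data the bins describe the feature distribution exactly (four points each), yet the candidate set does not contain the best threshold, because the bins were built without looking at the labels.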
3. Mathematical framework for choosing the subset
We now provide a precise mathematical definition of the subset and the role it plays. Let $D$ be the set of data points over which we would like to learn a function, using a decision tree $T$. Without loss of generality, we assume . We define a ranking on $D$ over an objective function $f$, where $f$ is a tree score function that provides a greedy measure of the quality of the split and is usually defined on the partition of the points. The objective is to find $d^* \in D$ such that $f(d^*)$ is the maximum amongst all the ordered partitions defined on $D$. Conventional algorithms to build decision trees iterate through $D$ to find the maximum value of $f(D_L, D_R)$, where $D_L = \{x \in D : x \le d\}$ and $D_R = \{x \in D : x > d\}$ for a candidate split $d \in D$. For ease of notation, we express $f(D_L, D_R)$ as $f(d)$. The linear scan algorithm is inefficient when $|D|$ is large. An efficient alternative is to choose a subset of $D$ on which to evaluate $f$. Let us denote the subset chosen by the approximation algorithm by $D'$, such that $D' \subset D$, and the element in $D$ that maximises $f$ by $d^*$. We use $D'$ as the set of candidate splits. Evaluating the splits over the smaller set $D'$ rather than $D$ is more efficient. The efficacy of the approximation algorithm is measured by its ability to include $d^*$ in $D'$. A less drastic measure is the rank error of $D'$ with respect to $D$, obtained by finding the element in $D'$ that has the maximum value of $f$, say, $\hat{d}$. The rank of $\hat{d}$ in $D$ gives the error induced by the approximation algorithm. Let us denote the rank error by $r$, which takes a value between $0$ and $|D| - |D'|$. Hence, if $D'$ contains $d^*$, then $r = 0$; if $D'$ does not contain $d^*$ but contains the next best element, then $r = 1$; and so on. We can formally define $r$ as
$$r = \left| \{ x \in D : f(x) > f(\hat{d}) \} \right|,$$
where $|A|$ is the cardinality of a set $A$.
There are usually two approaches employed by various algorithms when constructing the candidate subset $D'$ from the data $D$, as described in Section 2. In the data faithful approach, $D'$ is chosen to be faithful to $D$ through some appropriate definition of faithfulness. In the objective faithful approach, $D'$ is constructed to be faithful to the objective $f$, wherein the attempt is to include the best possible data point in $D'$. It must be noted that the rank error is always defined over a ranking of the data points over $f$. We can now quantify the expected rank error if we were to construct $D'$ using uniform random sampling and use this as the baseline against which to compare more sophisticated ways of choosing $D'$.
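The rank error described above can be computed directly once the objective has been evaluated on every point. The sketch below is a minimal illustration under assumptions of ours: the name `rank_error`, the representation of the objective as a list of precomputed scores, and the toy values are not from the paper.

```python
def rank_error(scores, subset_indices):
    """Count the points in the full data whose score beats the best score
    found inside the candidate subset (0 means the optimum was included)."""
    best_in_subset = max(scores[i] for i in subset_indices)
    return sum(1 for s in scores if s > best_in_subset)

# Hypothetical objective values for five data points.
scores = [0.1, 0.9, 0.4, 0.7, 0.3]

print(rank_error(scores, [1, 2]))  # subset contains the optimum (0.9)
print(rank_error(scores, [3, 4]))  # best in subset is 0.7, the next best
print(rank_error(scores, [0, 4]))  # the two worst points: the worst case
```

With five points and a subset of size two, the worst case is three, which matches the stated range of the rank error (from zero to the data size minus the subset size).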
3.1. Random selection of bins
We start by examining a simple approximation scheme where the candidate subset $D'$ is constructed by selecting $k$ elements uniformly at random from the data $D$. We begin with a theorem that provides a measure of the expected rank error of a set chosen uniformly at random from $D$.
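Theorem 1.
The expected rank error over all possible subsets $D'$ of $D$ of cardinality $k$, with $n = |D|$, is $\frac{n-k}{k+1}$, when the subsets are chosen uniformly at random.
The probability that the optimal element is included in $D'$ is denoted by $P_0$ and is given by
$$P_0 = \frac{\binom{n-1}{k-1}}{\binom{n}{k}}.$$
Similarly, the probability that the optimal element is not chosen in $D'$ but the next best is chosen, denoted by $P_1$, is given by
$$P_1 = \frac{\binom{n-2}{k-1}}{\binom{n}{k}}.$$
Hence the expected rank error is given by
$$\mathbb{E}[r] = \sum_{i=0}^{n-k} i \, P_i = \sum_{i=0}^{n-k} i \, \frac{\binom{n-1-i}{k-1}}{\binom{n}{k}}.$$
We can simplify this to $\frac{n-k}{k+1}$ (proof in Appendix Section 6.2). ∎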
Hence from the theorem, we find that if we were to choose the subset of candidate split points from the data in a random fashion, then the expected error of the rank of the best element in the chosen subset would be inversely related to the size of the subset chosen. For ease of analysis, we normalise the expected error by dividing it by the worst possible error, which is $n - k$ for data of size $n$ and a subset of size $k$, to get the normalised error
$$\frac{\mathbb{E}[r]}{n-k} = \frac{1}{k+1}.$$
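The expectation can be checked by brute force on small instances. The sketch below assumes data of size n with distinct objective values, so each point can be identified with its rank (0 for the best point), and a subset of size k; under the rank error definition of Section 3 the expectation works out to (n - k)/(k + 1). The function name `expected_rank_error` and the choice of test sizes are ours.

```python
from fractions import Fraction
from itertools import combinations

def expected_rank_error(n, k):
    """Exact average rank error over all size-k subsets of n ranked points.

    Points are identified by their rank (0 = best), so the rank error of a
    subset is simply the smallest rank it contains.
    """
    total = 0
    count = 0
    for subset in combinations(range(n), k):
        total += min(subset)
        count += 1
    return Fraction(total, count)

for n, k in [(10, 3), (12, 5)]:
    exact = expected_rank_error(n, k)
    closed_form = Fraction(n - k, k + 1)
    normalised = exact / (n - k)  # divide by the worst possible error
    print(n, k, exact, closed_form, normalised)
```

Exact rational arithmetic is used so that the enumeration and the closed form can be compared for equality rather than up to a floating-point tolerance.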
3.2. Deterministic selection of bins
There are other possible ways of choosing the candidate subset $D'$ using some of the popular methods of quantile computation. These algorithms provide a good approximation of the entire data efficiently. In this discussion, however, our focus is on building decision trees with these approximate sketches, which belong to the category of data faithful approaches. The argument we make here is that an approximation of the big data is not enough to ensure that a good decision tree is built. Instead, we need an approximation of the data that is faithful to the objective $f$, so that the candidate split points obtained from the data approximation have a good probability of including the best split candidates. XGBoost uses a modification of the GK summary(gk), and CATBoost uses a fixed number of bins with fixed ranges to quantize the data.
We first consider a simplified variant of the approximation strategy used in the popular XGBoost algorithm. XGBoost originally uses a weighted version of the GK summary to construct the approximate data. We consider the GK summary algorithm without weights associated with the data points. Bins are constructed in such a way that the rank of a point can be approximated with an error in the rank of less than $\epsilon$. Hence, we expect to have about $1/\epsilon$ bins, which means that $k \approx 1/\epsilon$, where $k$ is the size of $D'$. In (xgboost), the authors prove that such an allocation of bins and the consequent selection of $D'$ results in a normalized error of less than $\epsilon$ for a quantile query. In its simplest implementation, the quantile sketch is used to query for the rank of a data point in $D$. In our case, however, we need to find the element of $D$ that maximises $f$, which we cannot find using the quantile sketch unless $f$ is monotonic over $D$. Hence, the algorithm is equivalent to a random selection of points, since there is no correlation between the bins constructed out of $D$ and $f$. In Fig.(2), we compare the random selection of $D'$ with the deterministic selection of bins. In the experiment, we sample uniformly at random and first select $D'$ using random selection as described in Section 3.1. We then approximate the data set by binning it into $k$ buckets, and we measure the expected rank error by averaging over many such runs. The expected error is measured over different bin sizes and is plotted in Fig.(2). The plots in the figure support our claim that the random selection of $D'$ and the deterministic selection of $D'$ using the algorithm in (xgboost) are not significantly different in terms of their rank errors. In XGBoost, the quantile sketch is modified to use weighted quantiles. The weights are designed to weigh points based on their contribution to the error, yet they still do not correlate with the objective used to decide on the split; hence the method remains data faithful rather than objective faithful.
We have illustrated the limitation of the data faithful approach using the GK summary of XGBoost, but this analysis applies to all other algorithms belonging to the data faithful class.
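A small simulation conveys the flavour of this comparison. It is a sketch under strong simplifying assumptions and does not reproduce the experiment behind Fig.(2): the GK summary is replaced by plain equal-frequency bins over the sorted data with the middle point of each bin as its representative, and the objective is modelled as i.i.d. uniform noise, i.e. completely uncorrelated with the feature order. All function names and parameter values are ours.

```python
import random

def rank_error(scores, subset):
    """Number of points scoring strictly higher than the best point in subset."""
    best = max(scores[i] for i in subset)
    return sum(1 for s in scores if s > best)

def random_candidates(n, k, rng):
    """Uniform random selection of k candidate indices."""
    return rng.sample(range(n), k)

def bin_candidates(n, k):
    """Deterministic stand-in for a quantile summary: the data is assumed
    sorted by feature value, cut into k equal-frequency bins, and the middle
    point of each bin is used as its representative."""
    return [(2 * b + 1) * n // (2 * k) for b in range(k)]

def mean_rank_errors(n, k, runs, seed=0):
    rng = random.Random(seed)
    rand_total = 0
    bin_total = 0
    for _ in range(runs):
        # Objective values drawn independently of the feature order.
        scores = [rng.random() for _ in range(n)]
        rand_total += rank_error(scores, random_candidates(n, k, rng))
        bin_total += rank_error(scores, bin_candidates(n, k))
    return rand_total / runs, bin_total / runs

rand_err, bin_err = mean_rank_errors(n=200, k=9, runs=2000)
print(rand_err, bin_err, (200 - 9) / (9 + 1))
```

Under these assumptions both averages should land close to (n - k)/(k + 1), since a fixed set of bin representatives carries no more information about an uncorrelated objective than a random set of the same size.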
In our experiments, we compare the accuracy achieved by XGBoost using a simple random sampling of points with that of XGBoost using the weighted quantile building algorithm. Random sampling is incorporated in XGBoost by performing local sampling during data reading and global sampling during split point proposal. Since building the subset primarily affects the split point chosen and can result in a non-greedy decision tree being built, as a first step we compare the quality of a single decision tree (1 tree XGBoost) built by the two methods. The accuracies (for the classification tasks) and mean absolute percentage errors (MAPE, for the time series regression tasks) are compared on the datasets(noniiddata)(uci) described in Table 1, and the results are presented in Table 2. For both methods, we used default settings for the parameters and distributed the computation over workers. The number of bins used for the experiments is set based on the training dataset size. The evaluations using different bins have been reported as an average over runs. Using a similar setup we also built an ensemble of decision trees ( for regression) using XGBoost.
From the results on different datasets shown in Table 2, it is clear that a simple random selection of candidate split points for building decision trees achieves the same levels of accuracy as a sophisticated quantile building algorithm, thus validating our claim. We can see in Table 2 that the time taken by random sampling is substantially less than that of quantile approximation, as would be expected. We also observe that random sampling is able to handle non-iid data and performs on par with the quantile algorithm. We also measured the variance of accuracies/errors across runs and observed that the variance is, so we can consider random sampling to give stable boosting results. By using a simpler method, we are able to improve the computational efficiency of the algorithm, which is important when processing big data.
Table 1. Datasets used in the experiments.
|Dataset|Features|Training size|Test size|
|---|---|---|---|
|Wiretap(Mirsky2018KitsuneAE) (WT - Class)|115|200000|50000|
|Mirai(Mirsky2018KitsuneAE) (MI - Class)|115|563137|100000|
|SUSY(higgs) (SU - Class)|18|4500000|500000|
|Hepmass(hepmass) (HM - Class)|28|7000000|3500000|
|Higgs(higgs) (HI - Class)|28|10500000|500000|
|PJM East(noniiddata) (PJM - Reg)|10|110000|35366|
|Dominion Vir(noniiddata) (DOM - Reg)|10|84750|31439|
5. Conclusion and future work
In this paper, we critically examine the methods used to build decision trees using a quantile sketch algorithm to approximate the data. We proved that data faithful methods, which approximate the data distribution without linking it to the objective function, do not provide any improvement over a random selection of candidate split points. Using this strong result, we argue and show empirically that a random selection of candidate split points provides the same level of accuracy as a sophisticated algorithm for finding split points. A random selection of points is also more time-efficient than the other methods, which yields gains when running boosting algorithms. In terms of software complexity, implementing a distributed decision tree based on random selection is much simpler than implementing sophisticated quantile building algorithms.
In the future, we would like to develop objective faithful algorithms that are as computationally efficient as the data faithful algorithms. We would also like to modify other GBDT methods like CATBoost and LightGBM to use random sampling.
6.1. Incorporating Random Sampling in XGBoost
This section describes how we replaced the quantile algorithm with the random sampling algorithm in XGBoost. During the first step of data reading, each node in a distributed setting randomly samples from the local data it reads. If bins are given as input for each feature, the sample size read per feature by each node will be . After the data is read, boosting starts. For each iteration, we need to propose candidate split points from which the best split point will be found. This proposal can be done by weighted quantile sketch or by random sampling. All nodes are required to have the same set of candidate split points. So, after each node proposes its local candidate split points (for random sampling, the local proposal is done during data reading), an AllReduce(Chen2015RABITA) operation is called (AllReduce reduces and then broadcasts). For random sampling, AllReduce combines the samples and then samples from the combined set again to ensure that the sample size for a feature is at most . After the samples are broadcast, the remaining steps are the same as in XGBoost. The pseudo code of the distributed XGBoost algorithm is given in Algorithm 1.
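The two sampling stages can be mimicked in a single process as follows. This is only a sketch of the idea and not the XGBoost or Rabit implementation: no real AllReduce is called, the workers are plain Python lists, and the function names as well as the parameters `local_size` and `max_candidates` are hypothetical stand-ins for the per-node and global sample sizes.

```python
import random

def local_sample(local_values, local_size, rng):
    """Sampling done by one worker while reading one feature of its local data."""
    if len(local_values) <= local_size:
        return list(local_values)
    return rng.sample(local_values, local_size)

def combine_and_resample(per_worker_samples, max_candidates, rng):
    """In-process stand-in for the reduce step: merge the workers' samples,
    then sample again so that at most max_candidates values remain.
    The returned list is what every worker would receive after the broadcast."""
    merged = [v for sample in per_worker_samples for v in sample]
    if len(merged) > max_candidates:
        merged = rng.sample(merged, max_candidates)
    return sorted(merged)

rng = random.Random(0)

# Four hypothetical workers, each holding 100 values of the same feature.
workers = [[float(i) for i in range(w * 100, (w + 1) * 100)] for w in range(4)]

# Local proposal during data reading, then the global proposal.
samples = [local_sample(data, 8, rng) for data in workers]
candidates = combine_and_resample(samples, 8, rng)
print(candidates)
```

Because every worker ends up with the same sorted list, split evaluation can proceed exactly as it would with candidates produced by a quantile sketch.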
6.2. Proof of Theorem 1
We start with Eq.(5),
There are two parts of this equation which will be simplified separately. The first part is:
For simplifying the first part, we would be using the following identity repeatedly:
We expand Eq.(11),
Repeating the process over the terms we get,
We can now simplify the second part of the equation,
For this too, we will start by expanding it into a series:
Let , and using the property
which can be regrouped as,
Simplifying using Eq.(12) recursively,
Simplifying Eq.(20) in terms of Z, we get,
Simplifying using Eq.(12) recursively,
Using Eq.(12) to simplify,
Replace and Z from Eq.(22),
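For reference, the closed form can also be reached in a few lines. The derivation below is a compact sketch that uses the tail-sum formula for the expectation together with the hockey-stick identity (itself a consequence of applying Pascal's rule repeatedly); it is not the term-by-term expansion followed in this appendix. The notation is assumed: $n$ is the size of the data, $k$ is the size of the randomly chosen subset, and $r$ is the rank error, equal to zero when the best point is selected.

```latex
% Assumed notation: n = size of the data, k = size of the random subset,
% r = rank error (0 when the best point is selected).
% The event r > i occurs exactly when none of the i+1 best points is selected,
% i.e. all k selected points come from the remaining n-1-i points.
\begin{align*}
\Pr(r > i) &= \frac{\binom{n-1-i}{k}}{\binom{n}{k}},
  \qquad i = 0, 1, \dots, n-k-1, \\
\mathbb{E}[r] &= \sum_{i=0}^{n-k-1} \Pr(r > i)
  = \frac{1}{\binom{n}{k}} \sum_{m=k}^{n-1} \binom{m}{k}
  = \frac{\binom{n}{k+1}}{\binom{n}{k}}
  = \frac{n-k}{k+1}.
\end{align*}
% The middle step substitutes m = n-1-i and uses the hockey-stick identity
% \sum_{m=k}^{n-1} \binom{m}{k} = \binom{n}{k+1}.
```

Dividing by the worst possible error $n - k$ gives the normalised error $1/(k+1)$.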