1 Introduction
Random forest (RF) models (Breiman, 2001) are a powerful tool in machine learning (ML), used in many applications such as bioinformatics (Boulesteix et al., 2012), climate change modeling (Prasad et al., 2006), and credit card fraud detection (Bhattacharyya et al., 2011). Their widespread usage stems from a number of advantageous properties. RF models are amenable to a high degree of parallelism, typically tend to generalize well, natively support both numeric and categorical data, and allow interpretability of the results. An RF model is an ensemble model that uses decision trees as the base learners. Designing a scalable and fast tree-building algorithm is key to reducing the training time of RF models. In particular, for large datasets and a large number of trees, achieving good performance requires designing the tree-building algorithm with the characteristics of the underlying system in mind.
In this paper we evaluate the performance of different tree-building algorithms, namely breadth-first search (BFS) and depth-first search (DFS). We investigate the trade-offs and identify the bottlenecks of both approaches. Next, we propose a novel hybrid BFS-DFS algorithm, which can dynamically switch modes, and demonstrate that it performs better than both BFS and DFS and, further, is more robust in the presence of workloads with different characteristics. Moreover, we identify system-level bottlenecks at training time, and we alleviate them by (i) optimizing the memory access pattern to be CPU cache friendly, (ii) employing explicit prefetching, and (iii) reducing the dynamic memory allocation overhead. The proposed RF implementation provides an improvement in training time of up to , and on average, compared to state-of-the-art RF solvers (sklearn, H2O, and xgboost), averaged over datasets, RF parameters, and two different hardware systems.
2 Background
Random Forest. An RF model is an ensemble learning method which uses a decision tree as the base learner. Training such a model consists of training a collection of N independent decision trees. In a decision tree, each node represents a test on a feature, each branch the outcome of the test, and each leaf node (terminal node) a class label (for classification tasks) or a continuous value (for regression tasks). For a standalone decision tree model, the tree is trained using all the examples and features present in the dataset. In contrast, in an RF model, each decision tree is trained using a bootstrap sample of the training examples. In addition, when splitting each node in the tree, a different random subset of the features is used. In order to identify the best split, one must identify the feature and feature value whose split optimizes a predefined metric. For classification tasks, common choices for this metric include the Gini score as well as the binary entropy. For regression tasks, it is common to use either the mean squared error or the mean absolute error.
Data presorting. Searching for the best split consumes the majority of the training time, and is thus the first part of the algorithm one should try to optimize. A common solution is to presort the training matrix for each feature (Mehta et al., 1996; Guillame-Bert and Teytaud, 2018; Shafer et al., 1996). While this approach vastly reduces the complexity of finding the best split at each node, it introduces a one-off overhead: the time required to sort the matrix. Whether this overhead can be amortized depends strongly on the tree depth as well as on the candidate features sampled at each split. If the tree is grown to the point that all of the features have been sampled at least once, then sorting the matrix once in the beginning is more efficient than sorting at each node. To analyze this behaviour, we note that the question of how many features have been considered, for a given number of nodes in a tree, is equivalent to a variant of the well-known coupon collector's problem from probability theory
(Ferrante and Saltalamacchia, 2014). An exact expression for the probability that all features have been used, and thus that the cost of presorting the matrix is amortized, can be found in Stadje (1990). To give an example: for a dataset with features, assuming features are sampled at each node, it only takes a tree depth of before the probability that all features have been considered reaches . Moreover, if the presorted matrix can be used across trees in a forest, its sorting cost is further amortized. To this end, in our implementation we maintain a single read-only version of this sorted matrix in shared memory, used across all trees for the duration of the training.

Different approaches to tree-building. The performance of training an RF model also strongly depends on the exact manner in which the tree is built, i.e., the order in which the nodes are traversed. One well-known approach is the so-called depth-first search (DFS) algorithm (Cormen et al., 2009). In DFS, after a node has been split, the tree-building algorithm recursively traverses the tree through the left child. Once a terminal node has been reached, it traverses up one level and recursively explores the right child. A DFS-based RF implementation is offered by the widely-used machine learning framework sklearn (Pedregosa et al., 2011). An alternative approach is to construct the tree level by level using an algorithm known as breadth-first search (BFS) (Mehta et al., 1996). BFS is offered by software packages such as xgboost (Chen and Guestrin, 2016) and has recently been shown to work well when building trees on large datasets in a distributed setting (Guillame-Bert and Teytaud, 2018). In what follows, we will compare BFS and DFS from a system-level perspective, in the context of training an RF model.
3 Breadth-first, Depth-next tree-building algorithm
In this section we analyze the memory access patterns of BFS and DFS, and propose a novel hybrid algorithm designed to deliver the best of both worlds.
Memory access patterns. In the common case where the dataset does not fit in the CPU cache, accessing the data matrix in a cache-efficient manner is important to achieve high performance. The notion of an active example is critical to the analysis that follows. We define an active example, at a given moment in the tree-building algorithm, to be any training example that is not associated with a terminal node. A key insight is that at each tree depth level, most of the matrix (assuming most examples are still active) is accessed exactly once to compute the best split across the nodes of that depth. A BFS tree-building algorithm, operating across all nodes at the same depth at each step, is thus well suited to accessing the data matrix in a cache-efficient manner. DFS, however, is inherently less suited to exploit this property because it repeatedly moves down and up the tree depth as it builds the tree.
To illustrate the different memory access patterns of BFS and DFS, a tree with 5 nodes and a sorted matrix for a toy dataset with 9 examples and 2 features are shown in Fig. 1. Each item in the sorted matrix contains the example value and the example index (only the indices are shown for simplicity). The expected memory access patterns for each step of the DFS and BFS algorithms are depicted below the sorted matrix; each row shows an algorithm step; green depicts a memory location accessed in this step; orange depicts a skipped memory location. The example split illustrated results in leaf nodes C (examples 2,5,6), D (3,4,8), and E (1,7). A DFS variant will start at node A, then proceed to nodes B, D, E, and C. This process results in a large number of skipped memory accesses, as shown in the figure. In fact, DFS will quickly (w.r.t. tree depth) degrade to almost random accesses to the data matrix. On the other hand, a BFS variant will start at node A, then proceed to nodes B, C, D, and E. This, together with an additional optimization to compute all splits at each depth in one sequential access of the sorted matrix (see the paragraph below on optimizations), results in a very cache-efficient memory access pattern.
However, as the depth of the tree increases and the number of active examples dramatically reduces, BFS no longer maintains a benefit over DFS: both incur effectively random accesses to the sorted matrix and exhibit very little reuse of cache lines brought to the CPU. In fact, once very few active examples remain, we expect DFS to be more efficient than BFS, especially if the active part of the sorted matrix (e.g., examples 1,3,7,8,4 at node B in Fig. 1) fits in the CPU cache and is copied in a packed form to each tree node. DFS is guaranteed to only work with this active set of examples while expanding the tree from said node (e.g., starting at node B, discovering nodes D and E in Fig. 1), thus exhibiting very good cache behavior.
Breadth-first, Depth-next algorithm. Based on the above analysis, BFS is more cache-efficient at the first tree levels, when most examples are still active, while DFS performs better towards the deepest levels, when most examples are inactive. Another argument for starting with a BFS approach is better cache reuse across trees in an RF, assuming trees are built in parallel: at low tree depths, where most examples are still active, each tree will read the sorted matrix sequentially from shared memory, and overlapping accesses across tree builders are very likely. On the other hand, starting with a DFS approach would only have that benefit at the root node, after which each tree builder would quickly approach a random memory access pattern to the sorted matrix, resulting in dramatically reduced shared cache reuse across builders. Based on the above, we have designed a Breadth-first, Depth-next tree-building algorithm: we start with a BFS approach and at each BFS step we monitor the number of active examples; when the number of active examples is so small that we no longer expect BFS to be beneficial, we switch to a DFS approach; each node at the tree frontier then proceeds with a DFS search over its own set of active examples. The switching point is chosen such that all the active data structures fit into the CPU cache size available to each tree builder. This hybrid algorithm is presented in Alg. 1, and has been implemented in C++ in the Snap Machine Learning framework (Dünner et al., 2018). To the best of our knowledge, this breadth-first, depth-next technique has not been applied in the context of RF before; the closest we could find is a hybrid BFS-DFS algorithm applied to the treewidth problem (Zhou and Hansen, 2009). We implement multi-threading at the tree level: each tree is trained in parallel on a different CPU core using OpenMP (Dagum and Menon, 1998) directives. We also perform the sorting of the data matrix in a multi-threaded fashion during initialization.
Further optimizations. During the BFS phase of the algorithm, we perform two main modifications: (i) the subset of features randomly selected is the same for each node at a particular depth (similar to Chen and Guestrin (2016)), and (ii) instead of building the tree in a node-to-example manner we do the opposite: at each tree level we sequentially walk the sorted matrix for all chosen features, maintaining an example-to-node mapping; by the end of this sequential scan, the splits for all nodes have been computed. With the accesses to the sorted matrix being sequential, profiling showed significant time spent accessing the example-to-node mapping, due to random accesses to it during the BFS. We alleviated this performance issue by prefetching the subsequent example-to-node mappings (the indices of which are readily available in the subsequent entries of the sorted matrix). The next performance issue that shows up in profiling is the memory accesses to the example label. For binary classification problems, we exploit the fact that one bit is enough to hold the label information, and we pack that bit into the sorted matrix's example-id field (using C bit fields), effectively stealing one bit from the id without increasing the memory size of the matrix. Applying all the above optimizations results in a performance profile dominated by vectorized floating-point instructions.
For the DFS phase of the algorithm, for each node we maintain a packed version of the part of the sorted matrix corresponding to the node's active examples. At each split, we copy the relevant part of the parent's sorted matrix to the child that received the smaller number of examples, then shrink the parent's sorted matrix and reuse it for the other child. This optimization halves the memory allocations (and deallocations) needed at each DFS step compared to a straightforward implementation that allocates two new sub-matrices per split, copies the data over from the parent to the children, and then frees the parent's matrix.
4 Evaluation
In this section, we study the performance of our optimized RF implementation within the Snap ML framework in single-server environments. For the remainder of the section we will refer to our implementation as SnapRF.
Datasets. We use three binary classification datasets to evaluate the performance of SnapRF: Susy (Baldi et al., 2014) (5M examples, 18 features), Higgs (Baldi et al., 2014) (11M examples, 28 features), and Epsilon (Epsilon, 2008) (400K examples, 2000 features). In all cases, of the examples were used to construct a training set, and to form a test set.
Infrastructure. For the following evaluation we used two multi-socket systems. Firstly, a server with two 10-core Intel Xeon E5-2630 v4 CPUs, 125 GiB RAM, running a 4.4 Linux kernel (Ubuntu 16.04). Secondly, a server with two 20-core IBM POWER9 CPUs, 1 TiB RAM, running a 4.15 Linux kernel (Ubuntu 16.04). We disabled simultaneous multi-threading and fixed the CPU frequency to the maximum supported (2.2 GHz for x86, and 3.8 GHz for P9).
Switching between BFS and DFS. First, we evaluate the performance of SnapRF for BFS-only, DFS-only, Breadth-first, Depth-next switching at fixed thresholds, and our automated cache-based switching mechanism. As explained in Sec. 3, for the latter, the switch from BFS to DFS occurs when all of the data structures corresponding to the active examples belonging to a node fit into the CPU cache. For the fixed variants, the threshold is expressed as a percentage of the number of training examples. If the fraction of active training examples in a given node is less than the specified threshold, the construction of the subtree originating from that node is performed using DFS. The higher the threshold, the earlier the tree-building algorithm switches to DFS.
In Figure 2 we show the training time as a function of the number of trees in the ensemble, assuming unbounded tree depth, for different BFS-DFS thresholds, and for the automated heuristic (bfs-dfs-auto), on the x86 system. We observe that for the Higgs dataset, BFS-only is the slowest choice, and the automated heuristic provides the best performance. For Susy, DFS-only is the best choice, closely followed by our automated heuristic. On the other hand, for the Epsilon dataset, we find that BFS-only actually performs best, with the performance of the automated heuristic again being very close. Compared to the other datasets, Epsilon has a very large number of features, and BFS benefits from the optimization of using the same subset of features for all nodes at each depth (see Sec. 3), overshadowing any DFS benefits. Based on these results, we conclude that our automated switching heuristic is a robust choice that should provide good performance across a range of different datasets. In all of the following results we will use this heuristic.

Performance relative to baselines. We now evaluate the performance of SnapRF relative to the RF implementations offered in three widely-used ML frameworks: sklearn (Pedregosa et al., 2011), H2O (H2O.ai team, 2015), and xgboost (Chen and Guestrin, 2016). (In all experiments we used sklearn version , H2O version , and xgboost version .) We evaluate the performance for (a) ensembles of 10 and 100 trees, (b) unbounded trees and trees grown to a maximum depth of 20, and (c) on the x86 and P9 systems.
In all experiments and for all frameworks, we sample features when splitting each node, where is the number of features in the dataset. While SnapRF and sklearn both train each tree in the ensemble using a bootstrap sample of the training examples (i.e., sampling with replacement), xgboost and H2O train each tree by sampling without replacement from the set of training examples. This difference is related to the fact that in the latter two frameworks, the underlying tree implementation is designed to also support boosting (Friedman, 2001), for which sampling without replacement is common. In order to compare the different frameworks as fairly as possible, we set the subsampling ratio in xgboost and H2O to , roughly corresponding to the probability of any given training example being included in a bootstrap sample when the dataset is large. Note that since they use subsampling rather than bootstrap sampling, both xgboost and H2O train each tree using a smaller number of training examples, which should in principle allow them to run faster. Since in this paper we are concerned with exact tree-building (i.e., not using histogram-based techniques), we set the tree_method parameter in xgboost to exact. Furthermore, we set the boosting-specific parameters min_child_weight and lambda to zero, effectively disabling them. In H2O, it is not possible to build trees using an exact method; therefore we can only compare with the histogram-based method that is used by default. Again, we note that by using histograms, one significantly reduces the complexity of searching for the optimal split, and thus H2O should have a performance advantage in this regard. In all training times reported, we do not include the time required to load the data from disk, nor do we count the time required to import the data into any framework-specific data structure (e.g., H2OFrame for H2O and DMatrix for xgboost).
In Figure 3, we show the relative speedup achieved by SnapRF over the other frameworks for 10 trees, for bounded and unbounded tree depths, and for the two systems under study. On the P9 system, averaging across datasets and tree depths, SnapRF shows an average speedup of x, x, and x over sklearn, H2O, and xgboost, respectively. On the x86 system, SnapRF achieves an average speedup of x, x, and x over sklearn, H2O, and xgboost, respectively.
To see how the implementations scale to larger ensembles, in Figure 4 we show the relative speedup achieved when using 100 trees. The generalization accuracy achieved by all schemes is provided in Table 1. On the P9 system, again averaging across datasets and tree depths, SnapRF shows an average speedup of x, x, and x over sklearn, H2O, and xgboost, respectively. On the x86 system, SnapRF achieves an average speedup of x, x, and x over sklearn, H2O, and xgboost, respectively. Thus, we find that the speedup increases significantly when using a larger ensemble.
Table 1: Generalization accuracy for each framework, dataset, and maximum tree depth.

             max_depth = 20                    max_depth = None
framework  snapRF  sklearn  H2O    xgbRF  |  snapRF  sklearn  H2O    xgbRF
susy       0.801   0.801    0.801  0.801  |  0.8     0.799    0.799  0.8
higgs      0.737   0.737    0.736  0.738  |  0.751   0.75     0.751  0.751
epsilon    0.764   0.765    0.764  0.768  |  0.758   0.759    0.761  0.766
5 Conclusion
We have designed a novel, hybrid BFS-DFS tree-building algorithm that is optimized for training RF models on modern multi-core CPU systems. By dynamically exploiting different trade-offs at runtime, this hybrid algorithm is robust and able to outperform both BFS and DFS. Moreover, we have performed a set of system-level optimizations that improve the memory access behavior of the algorithm and reduce its memory heap allocations. The proposed hybrid BFS-DFS tree-building algorithm and optimizations have been implemented in the training routine of the RF model within the Snap Machine Learning framework. When compared against RF solvers from state-of-the-art ML frameworks (sklearn, H2O, and xgboost), across different CPU architectures and RF configurations, it provides a speedup in training time of up to and of on average.
References

[1] Baldi et al. (2014). Searching for exotic particles in high-energy physics with deep learning. Nature Communications 5, pp. 4308.
[2] Bhattacharyya et al. (2011). Data mining for credit card fraud: a comparative study. Decision Support Systems 50(3), pp. 602–613.
[3] Boulesteix et al. (2012). Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(6), pp. 493–507.
[4] Breiman (2001). Random forests. Machine Learning 45(1), pp. 5–32.
[5] Chen and Guestrin (2016). XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), New York, NY, USA, pp. 785–794.
[6] Cormen et al. (2009). Introduction to Algorithms. MIT Press.
[7] Dagum and Menon (1998). OpenMP: an industry-standard API for shared-memory programming. Computing in Science & Engineering (1), pp. 46–55.
[8] Dünner et al. (2018). Snap ML: a hierarchical framework for machine learning. In Advances in Neural Information Processing Systems 31, pp. 252–262.
[9] Epsilon (2008). Pascal large-scale learning challenge.
[10] Ferrante and Saltalamacchia (2014). The coupon collector's problem. Materials matemàtics, pp. 1–35.
[11] Friedman (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
[12] Guillame-Bert and Teytaud (2018). Exact distributed training: random forest with billions of examples. CoRR abs/1804.06755.
[13] H2O.ai team (2015). H2O: Python interface for H2O.
[14] Mehta et al. (1996). SLIQ: a fast scalable classifier for data mining. In International Conference on Extending Database Technology, pp. 18–32.
[15] Pedregosa et al. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
[16] Prasad et al. (2006). Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9(2), pp. 181–199.
[17] Shafer et al. (1996). SPRINT: a scalable parallel classifier for data mining. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB '96), San Francisco, CA, USA, pp. 544–555.
[18] Stadje (1990). The collector's problem with group drawings. Advances in Applied Probability 22(4), pp. 866–882.
[19] Zhou and Hansen (2009). Combining breadth-first and depth-first strategies in searching for treewidth. pp. 640–645.