Block-distributed Gradient Boosted Trees

04/23/2019
by Theodore Vasiloudis, et al.

The Gradient Boosted Tree (GBT) algorithm is one of the most popular machine learning algorithms used in production, for tasks that include Click-Through Rate (CTR) prediction and learning-to-rank. To deal with the massive datasets available today, many distributed GBT methods have been proposed. However, they all assume a row-distributed dataset, which addresses scalability only with respect to the number of data points, not the number of features, and increases the communication cost for high-dimensional data. In order to allow for scalability across both the data point and feature dimensions, and to reduce communication cost, we propose block-distributed GBTs. We achieve communication efficiency by making full use of the data sparsity and adapting the Quickscorer algorithm to the block-distributed setting. We evaluate our approach using datasets with millions of features, and demonstrate that we are able to achieve a reduction in communication cost of multiple orders of magnitude for sparse data, with no loss in accuracy, while providing a more scalable design. As a result, we are able to reduce the training time for high-dimensional data, and allow more cost-effective scale-out without the need for expensive network communication.
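The block-distributed layout described in the abstract partitions the training data along both the data-point (row) and feature (column) dimensions, so that each worker holds one block of a sparse matrix rather than a full set of rows. The sketch below is a minimal illustration of such a partitioning; the function and parameter names (block_partition, num_row_blocks, num_col_blocks) are illustrative assumptions and not the paper's implementation.

```python
# Minimal sketch of block-distributing a sparse dataset: split the matrix
# along both rows (data points) and columns (features) into a grid of blocks,
# one per worker. Illustrative only; not the paper's API.
import numpy as np
from scipy import sparse

def block_partition(X, num_row_blocks, num_col_blocks):
    """Split a sparse matrix X into a grid of CSR sub-blocks."""
    X = sparse.csr_matrix(X)
    row_bounds = np.linspace(0, X.shape[0], num_row_blocks + 1, dtype=int)
    col_bounds = np.linspace(0, X.shape[1], num_col_blocks + 1, dtype=int)
    blocks = {}
    for i in range(num_row_blocks):
        for j in range(num_col_blocks):
            # Block (i, j) holds row block i and feature block j.
            blocks[(i, j)] = X[row_bounds[i]:row_bounds[i + 1],
                               col_bounds[j]:col_bounds[j + 1]]
    return blocks

# Example: a small sparse matrix split into a 2x2 grid of blocks.
X = sparse.random(6, 8, density=0.2, format="csr", random_state=0)
for key, block in block_partition(X, 2, 2).items():
    print(key, block.shape, block.nnz)
```

Because each block stays sparse, only the non-zero entries of a block need to be stored and communicated, which is the property the paper exploits to reduce communication cost for high-dimensional sparse data.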


