An Experimental Evaluation of Large Scale GBDT Systems

07/03/2019
by Fangcheng Fu, et al.

Gradient boosting decision tree (GBDT) is a widely used machine learning algorithm in both data analytics competitions and real-world industrial applications. Further, driven by the rapid increase in data volume, efforts have been made to train GBDT in a distributed setting to support large-scale workloads. However, we find it surprising that the existing systems manage the training dataset in different ways, yet none of them has studied the impact of data management. To that end, this paper studies the pros and cons of different data management methods with respect to the performance of distributed GBDT. We first introduce a quadrant categorization of data management policies based on data partitioning and data storage. We then conduct an in-depth systematic analysis and summarize the scenarios in which each quadrant is advantageous. Based on the analysis, we further propose a novel distributed GBDT system named Vero, which adopts the unexplored combination of vertical partitioning and row-store and suits many large-scale cases. To validate our analysis empirically, we implement the different quadrants in the same code base and compare them under extensive workloads, and finally compare Vero with other state-of-the-art systems over a wide range of datasets. Our theoretical and experimental results provide a guideline for choosing a proper data management policy for a given workload.
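
The quadrant categorization mentioned above (horizontal or vertical partitioning, combined with row-store or column-store) can be illustrated with a small sketch. This is not code from the paper or from Vero: the function names, the round-robin assignment of rows and features to workers, and the toy matrix are all assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]  # row-store: one inner list per instance


def horizontal_partition(data: Matrix, num_workers: int) -> List[Matrix]:
    """Split by rows: each worker holds some instances with all features.

    Round-robin assignment is an illustrative assumption.
    """
    return [data[w::num_workers] for w in range(num_workers)]


def vertical_partition(data: Matrix, num_workers: int) -> List[Matrix]:
    """Split by columns: each worker holds all instances but some features.

    Round-robin assignment is an illustrative assumption.
    """
    return [[row[w::num_workers] for row in data] for w in range(num_workers)]


def to_column_store(shard: Matrix) -> Matrix:
    """Transpose a row-store shard so each inner list holds one feature."""
    return [list(col) for col in zip(*shard)]


# Toy dataset: 4 instances (rows) x 4 features (columns).
data: Matrix = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]

# The four quadrants, each as a list of per-worker shards (2 workers).
quadrants: Dict[Tuple[str, str], List[Matrix]] = {
    ("horizontal", "row"): horizontal_partition(data, 2),
    ("horizontal", "column"): [to_column_store(s) for s in horizontal_partition(data, 2)],
    ("vertical", "row"): vertical_partition(data, 2),
    ("vertical", "column"): [to_column_store(s) for s in vertical_partition(data, 2)],
}

if __name__ == "__main__":
    for (partitioning, storage), shards in quadrants.items():
        print(partitioning, storage, shards)
```

In this sketch, the `("vertical", "row")` entry corresponds to the combination the abstract attributes to Vero: every worker sees all instances, restricted to its own subset of features, and keeps them instance by instance.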


