Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees

10/20/2022
by Moacir Antonelli Ponti, et al.

Real-world datasets contain incorrectly labeled instances that hamper model performance and, in particular, the ability to generalize out of distribution. In addition, each example may contribute differently to learning. This motivates studies that seek a better understanding of how individual data instances contribute to a model's metrics. In this paper we propose a method based on metrics computed from the training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which decision tree ensembles are still the state of the art in terms of performance. We show results on detecting noisy labels in order to remove them, improving models' metrics on synthetic and real datasets, as well as on a production dataset. Our methods achieved the best results overall when compared with confident learning and heuristics.
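The abstract does not spell out which training-dynamics metrics are computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that, after each boosting iteration, we can read the class probabilities the partial ensemble assigns to every training example, and it summarizes them with three illustrative per-example statistics (confidence, variability, correctness) commonly used in training-dynamics analyses. The function names `training_dynamics` and `flag_suspects`, and the 0.5 threshold, are hypothetical.

```python
from statistics import mean, pstdev


def training_dynamics(staged_probs, labels):
    """Summarize how each training example behaves across boosting iterations.

    staged_probs[t][i][c] is the probability of class c for example i
    after boosting iteration t (assumed to be available from the GBDT).
    labels[i] is the given, possibly noisy, class index of example i.
    """
    metrics = []
    for i, y in enumerate(labels):
        # Probability assigned to the given label at each iteration.
        p_label = [stage[i][y] for stage in staged_probs]
        # Whether the given label is the predicted (argmax) class per iteration.
        hits = [
            max(range(len(stage[i])), key=stage[i].__getitem__) == y
            for stage in staged_probs
        ]
        metrics.append(
            {
                "confidence": mean(p_label),      # average support for the label
                "variability": pstdev(p_label),   # how much that support fluctuates
                "correctness": sum(hits) / len(hits),
            }
        )
    return metrics


def flag_suspects(metrics, threshold=0.5):
    """Return indices whose mean confidence falls below a hypothetical threshold."""
    return [i for i, m in enumerate(metrics) if m["confidence"] < threshold]
```

With scikit-learn, one possible source for `staged_probs` is `GradientBoostingClassifier.staged_predict_proba(X_train)`, which yields class probabilities after each boosting stage; that tooling choice is an assumption here, not something stated in the abstract. Examples flagged by `flag_suspects` would be candidates for inspection or removal before retraining.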


research
11/18/2017

Tree-Structured Boosting: Connections Between Gradient Boosted Stumps and Full Decision Trees

Additive models, such as produced by gradient boosting, and full interac...
research
03/08/2017

Structural Data Recognition with Graph Model Boosting

This paper presents a novel method for structural data recognition using...
research
09/10/2019

GBDT-MO: Gradient Boosted Decision Trees for Multiple Outputs

Gradient boosted decision trees (GBDTs) are widely used in machine learn...
research
10/29/2019

Minimal Variance Sampling in Stochastic Gradient Boosting

Stochastic Gradient Boosting (SGB) is a widely used approach to regulari...
research
04/17/2018

MetaBags: Bagged Meta-Decision Trees for Regression

Ensembles are popular methods for solving practical supervised learning ...
research
10/13/2016

Bank Card Usage Prediction Exploiting Geolocation Information

We describe the solution of team ISMLL for the ECML-PKDD 2016 Discovery ...
research
02/08/2023

Decision trees compensate for model misspecification

The best-performing models in ML are not interpretable. If we can explai...
