Scalable Feature Selection for (Multitask) Gradient Boosted Trees

09/05/2021
by Cuize Han, et al.

Gradient Boosted Decision Trees (GBDTs) are widely used for building ranking and relevance models in search and recommendation. Considerations such as latency and interpretability dictate the use of as few features as possible to train these models. Feature selection in GBDT models typically involves either heuristically ranking the features by importance and selecting the top few, or performing a full backward feature elimination routine. Previously proposed on-the-fly feature selection methods scale suboptimally with the number of features, which can be daunting in high-dimensional settings. We develop a scalable forward feature selection variant for GBDT, via a novel group testing procedure that works well in high dimensions and enjoys favorable theoretical performance and computational guarantees. We show via extensive experiments on both public and proprietary datasets that the proposed method offers significant speedups in training time while remaining competitive with existing GBDT methods in terms of model performance metrics. We also extend the method to the multitask setting, allowing the practitioner to select features common across tasks as well as task-specific features.
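To give a feel for why group testing can help in high dimensions, here is a minimal sketch of forward selection driven by generic adaptive binary splitting. It is not the procedure from the paper, whose details are not given in this abstract. The names `find_one_useful`, `forward_select`, and `toy_oracle` are hypothetical, and the sketch assumes a yes/no oracle that can tell whether a group of features contains at least one useful feature.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical oracle type: True if the group holds at least one useful feature.
GroupTest = Callable[[Sequence[int]], bool]


def find_one_useful(features: Sequence[int], group_is_useful: GroupTest) -> Optional[int]:
    """Binary splitting: return one useful feature from the group, or None."""
    if not features or not group_is_useful(features):
        return None
    group = list(features)
    while len(group) > 1:
        half = group[: len(group) // 2]
        # Keep the first half if it tests positive; otherwise the signal
        # must lie in the second half, so no extra test is needed for it.
        group = half if group_is_useful(half) else group[len(half):]
    return group[0]


def forward_select(n_features: int, budget: int, group_is_useful: GroupTest) -> List[int]:
    """Greedily add features one at a time, up to `budget` features."""
    selected: List[int] = []
    remaining = list(range(n_features))
    while len(selected) < budget:
        pick = find_one_useful(remaining, group_is_useful)
        if pick is None:  # no useful feature left among the remaining ones
            break
        selected.append(pick)
        remaining.remove(pick)
    return selected


# Toy setup (an assumption for illustration): features 3, 17 and 42 are useful.
USEFUL = {3, 17, 42}
calls = 0


def toy_oracle(group: Sequence[int]) -> bool:
    global calls
    calls += 1  # count group tests to compare against testing every feature
    return any(f in USEFUL for f in group)


chosen = forward_select(64, budget=5, group_is_useful=toy_oracle)
print(sorted(chosen), "group tests used:", calls)
```

In this toy setting each selected feature costs roughly log2(n) group tests rather than one test per candidate feature. In a real GBDT the hard part is building a cheap and reliable group-level test, which this sketch simply assumes.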
