Why do tree-based models still outperform deep learning on tabular data?

07/18/2022
by Leo Grinsztajn, et al.

While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains with clear characteristics of tabular data, and a benchmarking methodology that accounts for both fitting models and finding good hyperparameters. Results show that tree-based models remain state-of-the-art on medium-sized data (∼10K samples), even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and neural networks (NNs). This leads to a series of challenges that should guide researchers aiming to build tabular-specific NNs: (1) be robust to uninformative features, (2) preserve the orientation of the data, and (3) be able to easily learn irregular functions. To stimulate research on tabular architectures, we contribute a standard benchmark and raw data for baselines: every point of a 20,000 compute-hour hyperparameter search for each learner.
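The first two challenges in the abstract can be probed with simple dataset perturbations: appending pure-noise columns, and applying a random orthogonal transform to the feature space. The following is a minimal sketch assuming only NumPy. The helper names `add_uninformative_features` and `random_rotation` are hypothetical, the toy data is synthetic, and this is not the authors' benchmark code.

```python
import numpy as np


def add_uninformative_features(X, n_extra, seed=0):
    """Append n_extra columns of Gaussian noise that are independent of X."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((X.shape[0], n_extra))
    return np.hstack([X, noise])


def random_rotation(X, seed=0):
    """Apply a random orthogonal transform to the feature space.

    The QR decomposition of a Gaussian matrix yields an orthogonal Q
    (a rotation, possibly combined with a reflection).
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return X @ Q, Q


# Synthetic stand-in for a tabular dataset (assumption, not real data).
rng = np.random.default_rng(42)
X = rng.standard_normal((200, 5))

X_noisy = add_uninformative_features(X, n_extra=3)  # original + 3 noise columns
X_rot, Q = random_rotation(X)                       # same points, mixed axes
```

One would then fit a tree-based model and a neural network on `X`, `X_noisy`, and `X_rot` with the same targets and compare how much each model's test score changes under each perturbation.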
