When Do Neural Nets Outperform Boosted Trees on Tabular Data?

05/04/2023
by Duncan McElfresh, et al.

Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and ask, 'does it matter?' We conduct the largest tabular data analysis to date, by comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than selecting the best algorithm. Next, we analyze 965 metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well. For example, we find that GBDTs are much better than NNs at handling skewed feature distributions, heavy-tailed feature distributions, and other forms of dataset irregularities. Our insights act as a guide for practitioners to decide whether or not they need to run a neural net to reach top performance on their dataset. Our codebase and all raw results are available at https://github.com/naszilla/tabzilla.
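
To make the "skewed" and "heavy-tailed" feature distributions mentioned above concrete, here is a minimal sketch that computes two common summary statistics per feature: skewness and excess kurtosis. It is an illustration only and not the paper's metafeature extraction pipeline; the function names (`skewness`, `excess_kurtosis`, `irregularity_metafeatures`) and the synthetic columns are hypothetical, chosen for this example.

```python
import random


def _central_moment(values, order):
    """Population central moment of the given order."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def skewness(values):
    """Third standardized moment; 0 for a symmetric sample, > 0 for a long right tail."""
    m2 = _central_moment(values, 2)
    if m2 == 0:
        return 0.0  # constant column: no asymmetry to measure
    return _central_moment(values, 3) / m2 ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; large values suggest heavy tails."""
    m2 = _central_moment(values, 2)
    if m2 == 0:
        return 0.0  # constant column: treat as having no tails
    return _central_moment(values, 4) / m2 ** 2 - 3.0


def irregularity_metafeatures(columns):
    """Map each column name to its skewness and excess kurtosis.

    `columns` is a dict of column name -> list of numeric values
    (a hypothetical input layout for this sketch).
    """
    return {
        name: {
            "skewness": skewness(col),
            "excess_kurtosis": excess_kurtosis(col),
        }
        for name, col in columns.items()
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
    columns = {
        "roughly_symmetric": [rng.gauss(0.0, 1.0) for _ in range(5000)],
        "right_skewed": [rng.lognormvariate(0.0, 1.0) for _ in range(5000)],
    }
    for name, stats in irregularity_metafeatures(columns).items():
        print(name, stats)
```

On data like this, the log-normal column should show clearly larger skewness and excess kurtosis than the Gaussian one. Under the abstract's finding, features of the former kind are where GBDTs tend to hold up better than NNs.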

