Benchmarking Multimodal AutoML for Tabular Data with Text Fields

11/04/2021
by Xingjian Shi, et al.

We consider the use of automated supervised learning systems for data tables that contain not only numeric/categorical columns but also one or more text fields. We assemble 18 multimodal data tables, each of which contains some text fields and stems from a real business application. Our publicly available benchmark enables researchers to comprehensively evaluate their own methods for supervised learning with numeric, categorical, and text features. To ensure that any single modeling strategy that performs well over all 18 datasets will serve as a practical foundation for multimodal text/tabular AutoML, the datasets in our benchmark vary greatly in sample size, problem type (a mix of classification and regression tasks), number of features (with the number of text columns ranging from 1 to 28 between datasets), and how the predictive signal is decomposed between text and numeric/categorical features (and predictive interactions thereof). Over this benchmark, we evaluate various straightforward pipelines to model such data, including standard two-stage approaches in which NLP is used to featurize the text so that AutoML for tabular data can then be applied. When fit to the raw text/tabular data and compared against human data science teams, the fully automated methodology that performed best on our benchmark (stack ensembling a multimodal Transformer with various tree models) also ranks 1st in two MachineHack prediction competitions and 2nd (out of 2380 teams) in Kaggle's Mercari Price Suggestion Challenge.
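The two-stage approach mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the hashed bag-of-words featurizer stands in for the NLP stage, a 1-nearest-neighbour rule stands in for tabular AutoML, and the column names, bucket count, and toy rows are all hypothetical. It only shows how text is first turned into fixed-length numeric features and then handed, together with the numeric and categorical columns, to an ordinary tabular learner.

```python
import zlib
from typing import Dict, List, Sequence

N_TEXT_BUCKETS = 16  # hypothetical length of the hashed text vector


def featurize_text(text: str, n_buckets: int = N_TEXT_BUCKETS) -> List[float]:
    """Stage 1: map free text to a fixed-length hashed bag-of-words vector."""
    vec = [0.0] * n_buckets
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash() on str
        vec[zlib.crc32(token.encode("utf-8")) % n_buckets] += 1.0
    return vec


def featurize_row(
    row: Dict[str, object],
    num_cols: Sequence[str],
    cat_levels: Dict[str, Sequence[str]],
    text_cols: Sequence[str],
) -> List[float]:
    """Concatenate numeric, one-hot categorical, and featurized text columns."""
    features: List[float] = [float(row[col]) for col in num_cols]
    for col, levels in cat_levels.items():
        features.extend(1.0 if row[col] == level else 0.0 for level in levels)
    for col in text_cols:
        features.extend(featurize_text(str(row[col])))
    return features


def predict_1nn(train_X: List[List[float]], train_y: List[str], x: List[float]) -> str:
    """Stage 2 stand-in: a 1-nearest-neighbour rule in place of tabular AutoML."""
    def sq_dist(a: List[float], b: List[float]) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))

    nearest = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[nearest]


# Toy, made-up rows: one numeric, one categorical, and one text column.
NUM_COLS = ["price"]
CAT_LEVELS = {"condition": ["new", "used"]}
TEXT_COLS = ["description"]
TRAIN_ROWS = [
    {"price": 120.0, "condition": "new", "description": "sealed box fast shipping"},
    {"price": 15.0, "condition": "used", "description": "scratched screen missing charger"},
    {"price": 80.0, "condition": "used", "description": "lightly used works fine"},
]
TRAIN_LABELS = ["sold", "unsold", "sold"]

X_train = [featurize_row(r, NUM_COLS, CAT_LEVELS, TEXT_COLS) for r in TRAIN_ROWS]

new_row = {"price": 20.0, "condition": "used", "description": "cracked screen no charger"}
prediction = predict_1nn(
    X_train, TRAIN_LABELS, featurize_row(new_row, NUM_COLS, CAT_LEVELS, TEXT_COLS)
)
```

In a realistic setting the featurizer would be something stronger (for example n-gram counts or pretrained embeddings) and the second stage would be an AutoML system; the point of the sketch is only the interface between the two stages, where each row becomes a single flat numeric vector.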


