OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering

07/08/2022
by Zhengbao Jiang, et al.

The information in tables can be an important complement to text, making table-based question answering (QA) systems of great value. The intrinsic complexity of handling tables often adds an extra burden to both model design and data annotation. In this paper, we aim to develop a simple table-based QA model with minimal annotation effort. Motivated by the fact that table-based QA requires both alignment between questions and tables and the ability to perform complicated reasoning over multiple table elements, we propose an omnivorous pretraining approach that consumes both natural and synthetic data to endow models with these respective abilities. Specifically, given freely available tables, we leverage retrieval to pair them with relevant natural sentences for mask-based pretraining, and synthesize NL questions by converting SQL sampled from tables for pretraining with a QA loss. We perform extensive experiments in both few-shot and full settings, and the results clearly demonstrate the superiority of our model OmniTab, with the best multitasking approach achieving an absolute gain of 16.2%, also establishing a new state-of-the-art on WikiTableQuestions. Detailed ablations and analyses reveal different characteristics of natural and synthetic data, shedding light on future directions in omnivorous pretraining. Code, pretraining data, and pretrained models are available at https://github.com/jzbjyb/OmniTab.
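To make the two data sources concrete, the following is a minimal sketch (not the authors' code; all function and field names are hypothetical) of how the two kinds of pretraining examples described above could be constructed: a retrieved natural sentence paired with its table and partially masked, and a synthetic NL question converted from sampled SQL for QA-loss pretraining. Table linearization here follows one common convention and may differ from the paper's exact format.

```python
# Hedged sketch of omnivorous pretraining data construction.
# Natural data: (retrieved sentence, table) pairs with masked tokens.
# Synthetic data: (NL question from SQL, table) -> answer for a QA loss.
import random


def linearize_table(header, rows):
    """Flatten a table into a token sequence (one common convention)."""
    parts = ["col: " + " | ".join(header)]
    for row in rows:
        parts.append("row: " + " | ".join(row))
    return " ".join(parts)


def make_natural_example(sentence, header, rows, mask_rate=0.15, seed=0):
    """Natural example: mask some sentence tokens, concatenate the table."""
    rng = random.Random(seed)
    tokens = sentence.split()
    masked = [t if rng.random() > mask_rate else "<mask>" for t in tokens]
    return " ".join(masked) + " </s> " + linearize_table(header, rows)


def make_synthetic_example(sql, nl_question, answer, header, rows):
    """Synthetic example: NL question (converted from SQL) plus table,
    trained to produce the answer executed from the sampled SQL."""
    src = nl_question + " </s> " + linearize_table(header, rows)
    return {"input": src, "target": answer, "sql": sql}
```

In this sketch, the natural examples would drive a mask-infilling objective that teaches question-table alignment, while the synthetic examples would drive a QA objective that teaches multi-step reasoning, matching the division of labor the abstract describes.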

