CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

10/14/2021
by Patrick Huber, et al.

With the rise of large-scale pre-trained language models, open-domain question answering (ODQA) has become an important research topic in NLP. Building on the popular approach of pre-training followed by fine-tuning, we posit that an additional in-domain pre-training stage using a large-scale, natural, and diverse question-answering (QA) dataset can be beneficial for ODQA. Consequently, in this paper we propose a novel QA dataset based on the Common Crawl project. Using the readily available schema.org annotation, we extract around 130 million multilingual question-answer pairs, including about 60 million English data points. With this previously unseen number of natural QA pairs, we pre-train popular language models to show the potential of large-scale in-domain pre-training for the task of question answering. In our experiments, we find that pre-training question-answering models on our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low-resource, and fine-tuned settings across multiple tasks, models, and benchmarks.
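
To make the extraction step concrete, here is a minimal sketch of how question-answer pairs might be pulled from a web page carrying schema.org annotation. It is not the authors' pipeline: it assumes the annotation is expressed as HTML microdata (`itemtype`/`itemprop` attributes) with well-formed tag nesting, and the names `SchemaQAParser`, `extract_qa_pairs`, and `SAMPLE_HTML` are hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped so nesting depth stays correct.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SchemaQAParser(HTMLParser):
    """Collects (question, answer) pairs from schema.org microdata.

    Assumption: tags are properly nested; malformed HTML is not handled.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pairs = []        # extracted (question, answer) tuples
        self._stack = []       # itemtype (or None) of every open element
        self._question = None  # most recent question text seen
        self._capture = None   # (kind, stack depth) while collecting text
        self._buf = []

    def _scope(self):
        # Short type name of the nearest enclosing item, e.g. "Question".
        for itemtype in reversed(self._stack):
            if itemtype:
                return itemtype.rstrip("/").rsplit("/", 1)[-1]
        return None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        prop = attr.get("itemprop")
        scope = self._scope()  # scope enclosing this element
        self._stack.append(attr.get("itemtype"))
        if self._capture is None:
            if prop == "name" and scope == "Question":
                self._capture = ("question", len(self._stack))
                self._buf = []
            elif prop == "text" and scope == "Answer":
                self._capture = ("answer", len(self._stack))
                self._buf = []

    def handle_data(self, data):
        if self._capture is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        if self._capture is not None and len(self._stack) == self._capture[1]:
            # The captured element is closing: normalize whitespace and store.
            text = " ".join("".join(self._buf).split())
            if self._capture[0] == "question":
                self._question = text
            elif self._question:
                self.pairs.append((self._question, text))
            self._capture = None
        self._stack.pop()


def extract_qa_pairs(html):
    parser = SchemaQAParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Hypothetical page fragment, used only to exercise the sketch.
SAMPLE_HTML = """
<div itemscope itemtype="https://schema.org/Question">
  <h2 itemprop="name">What is Common Crawl?</h2>
  <div itemprop="acceptedAnswer" itemscope itemtype="https://schema.org/Answer">
    <p itemprop="text">An open <b>web crawl</b> corpus.</p>
  </div>
</div>
"""
```

Calling `extract_qa_pairs(SAMPLE_HTML)` pairs each answer's `text` with the `name` of the most recent question. A production extractor would also need to handle JSON-LD and RDFa annotations, multiple answers per question, and malformed markup, none of which this sketch attempts.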

11/02/2019

How to Pre-Train Your Model? Comparison of Different Pre-Training Models for Biomedical Question Answering

Using deep learning models on small scale datasets would result in overf...
02/10/2020

REALM: Retrieval-Augmented Language Model Pre-Training

Language model pre-training has been shown to capture a surprising amoun...
09/04/2021

FewshotQA: A simple framework for few-shot learning of question answering tasks using pre-trained text-to-text models

The task of learning from only a few examples (called a few-shot setting...
05/25/2022

Intermediate Training on Question Answering Datasets Improves Generative Data Augmentation

Manually annotating datasets requires domain experts to read through man...
06/15/2021

Question Answering Infused Pre-training of General-Purpose Contextualized Representations

This paper proposes a pre-training objective based on question answering...
05/23/2022

StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models

Knowledge and language understanding of models evaluated through questio...
05/16/2022

Heroes, Villains, and Victims, and GPT-3: Automated Extraction of Character Roles Without Training Data

This paper shows how to use large-scale pre-trained language models to e...