CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training

10/14/2021
by Patrick Huber, et al.

With the rise of large-scale pre-trained language models, open-domain question answering (ODQA) has become an important research topic in NLP. Building on the popular pre-train-then-fine-tune approach, we posit that an additional in-domain pre-training stage using a large-scale, natural, and diverse question-answering (QA) dataset can be beneficial for ODQA. Consequently, we propose a novel QA dataset based on the Common Crawl project. Using the readily available schema.org annotation, we extract around 130 million multilingual question-answer pairs, including about 60 million English data points. With this previously unseen number of natural QA pairs, we pre-train popular language models to show the potential of large-scale in-domain pre-training for the task of question answering. In our experiments, we find that pre-training question-answering models on our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low-resource, and fine-tuned settings across multiple tasks, models, and benchmarks.
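To make the extraction step concrete, the following is a minimal sketch of pulling question-answer pairs out of a page that carries schema.org annotation. It is not the authors' pipeline. It assumes the annotation appears as HTML microdata (`itemtype` and `itemprop` attributes), that the question text sits in a `name` property and each answer in a `text` property, and that the markup is well-formed. The class `QAExtractor`, the function `extract_qa_pairs`, and the sample page are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}


class QAExtractor(HTMLParser):
    """Collects schema.org Question/Answer microdata (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.pairs = []       # finished {"question": str, "answers": [str]} items
        self._stack = []      # one role (or None) per currently open element
        self._current = None  # the Question item being read, if any
        self._buf = None      # text pieces of the property being captured

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        a = dict(attrs)
        itemtype = a.get("itemtype") or ""
        itemprop = a.get("itemprop") or ""
        role = None
        if itemtype.endswith("schema.org/Question"):
            self._current = {"question": "", "answers": []}
            role = "question"
        elif itemtype.endswith("schema.org/Answer") and self._current is not None:
            role = "answer"
        elif itemprop == "name" and self._current is not None and "answer" not in self._stack:
            role, self._buf = "q_name", []
        elif itemprop == "text" and "answer" in self._stack:
            role, self._buf = "a_text", []
        self._stack.append(role)

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        role = self._stack.pop()
        if role in ("q_name", "a_text"):
            # Join the captured pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buf).split())
            self._buf = None
            if role == "q_name":
                self._current["question"] = text
            else:
                self._current["answers"].append(text)
        elif role == "question":
            # Keep only questions that have text and at least one answer.
            if self._current["question"] and self._current["answers"]:
                self.pairs.append(self._current)
            self._current = None


def extract_qa_pairs(html):
    parser = QAExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


# Hypothetical sample page, written for this sketch.
SAMPLE_PAGE = """
<div itemscope itemtype="https://schema.org/Question">
  <h1 itemprop="name">What is Common Crawl?</h1>
  <div itemprop="acceptedAnswer" itemscope itemtype="https://schema.org/Answer">
    <div itemprop="text"><p>An open repository of <b>web crawl</b> data.</p></div>
  </div>
</div>
"""

if __name__ == "__main__":
    for pair in extract_qa_pairs(SAMPLE_PAGE):
        print(pair["question"], "->", pair["answers"])
```

Real crawl data is far messier than this sample (unclosed tags, nested items, JSON-LD instead of microdata), so a production extractor would need a more tolerant HTML parser and additional filtering.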


