WikiOmnia: generative QA corpus on the whole Russian Wikipedia

04/17/2022
by Dina Pisarevskaya, et al.

General-domain question answering (QA) has largely developed its methodology around the Stanford Question Answering Dataset (SQuAD) as the main benchmark. However, compiling factual questions requires time- and labour-consuming annotation, which limits the potential size of the training data. We present WikiOmnia, a new publicly available dataset of QA pairs with the corresponding Russian Wikipedia article summary sections, built with a fully automated generative pipeline. The dataset covers every available article in the Russian-language Wikipedia. The WikiOmnia pipeline is open source and has also been tested for creating SQuAD-formatted QA on other domains, such as news texts, fiction, and social media. The resulting dataset has two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).
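
For readers unfamiliar with the SQuAD format mentioned above, the following is a minimal sketch of how generated QA pairs for one paragraph could be wrapped in a SQuAD v1.1-style record. The helper name `to_squad_record`, the id scheme, and the sample text are hypothetical and are not taken from the WikiOmnia pipeline. The sketch also assumes that each answer is an exact substring of the paragraph and skips any pair where it is not; the authors' actual verification rules are not described here and may differ.

```python
import json


def to_squad_record(title, context, qa_pairs):
    """Wrap (question, answer) pairs for one paragraph in a SQuAD v1.1-style dict.

    Hypothetical helper for illustration only, not part of WikiOmnia.
    """
    qas = []
    for i, (question, answer) in enumerate(qa_pairs):
        # SQuAD stores the character offset of the answer inside the context.
        start = context.find(answer)
        if start == -1:
            # Assumption: keep only answers that appear verbatim in the paragraph.
            continue
        qas.append({
            "id": f"{title}-{i}",  # illustrative id scheme
            "question": question,
            "answers": [{"text": answer, "answer_start": start}],
        })
    return {"title": title, "paragraphs": [{"context": context, "qas": qas}]}


# Made-up sample paragraph and QA pairs, for illustration only.
context = "Moscow is the capital of Russia."
pairs = [
    ("What is the capital of Russia?", "Moscow"),
    ("How many people live there?", "12 million"),  # not in the text, so skipped
]
record = to_squad_record("Moscow", context, pairs)
print(json.dumps({"version": "1.1", "data": [record]}, ensure_ascii=False, indent=2))
```

Dropping pairs whose answer is not found in the paragraph keeps every `answer_start` offset valid, at the cost of discarding answers that are correct but paraphrased.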


