Automatically Ranked Russian Paraphrase Corpus for Text Generation

06/17/2020
by   Vadim Gudkov, et al.

This article describes the automatic development and ranking of a large corpus for Russian paraphrase generation, the first corpus of its type in Russian computational linguistics. Existing manually annotated paraphrase datasets for Russian are limited to the small ParaPhraser corpus and ParaPlag, which are suitable for a range of NLP tasks such as paraphrase and plagiarism detection and sentence similarity and relatedness estimation. Because of their size, these datasets can hardly be used in end-to-end text generation solutions, while paraphrase generation requires a large amount of training data. To address this problem, we collect, rank, and evaluate a new publicly available headline paraphrase corpus (ParaPhraser Plus), and then perform text generation experiments with manual evaluation on automatically ranked corpora using the Universal Transformer architecture.


