Open Subtitles Paraphrase Corpus for Six Languages

09/17/2018
by Mathias Creutz, et al.

This paper accompanies the release of Opusparcus, a new paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The corpus consists of paraphrases, that is, pairs of sentences in the same language that mean approximately the same thing. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows. The informal, colloquial register of subtitles makes such data an interesting language resource, for instance for computer-assisted language learning. For each target language, the Opusparcus data are partitioned into three types of data sets: training, development, and test sets. The training sets are large, consisting of millions of sentence pairs, and were compiled automatically with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been checked manually; each set contains approximately 1000 sentence pairs that two annotators have verified to be acceptable paraphrases.

research
09/15/2021

The ELITR ECA Corpus

We present the ELITR ECA corpus, a multilingual corpus derived from publ...
research
07/10/2019

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

We present an approach based on multilingual sentence embeddings to auto...
research
09/21/2018

Paraphrase Detection on Noisy Subtitles in Six Languages

We perform automatic paraphrase detection on subtitle data from the Opus...
research
11/10/2019

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB

We show that margin-based bitext mining in a multilingual sentence space...
research
04/11/2018

Generating Multilingual Parallel Corpus Using Subtitles

Neural Machine Translation with its significant results, still has a gre...
research
07/08/2018

A Deep Generative Model of Vowel Formant Typology

What makes some types of languages more probable than others? For instan...
research
10/21/2022

A Dataset for Plain Language Adaptation of Biomedical Abstracts

Though exponentially growing health-related literature has been made ava...
