Learning To Split and Rephrase From Wikipedia Edit History

by Jan A. Botha, et al.

Split and rephrase is the task of breaking down a sentence into shorter ones that together convey the same meaning. We extract a rich new dataset for this task by mining Wikipedia's edit history: WikiSplit contains one million naturally occurring sentence rewrites, providing sixty times more distinct split examples and a ninety times larger vocabulary than the WebSplit corpus introduced by Narayan et al. (2017) as a benchmark for this task. Incorporating WikiSplit as training data produces a model with qualitatively better predictions that score 32 BLEU points above the prior best result on the WebSplit benchmark.




