Multilingual Unsupervised Sentence Simplification

05/01/2020
by Louis Martin, et al.

Progress in Sentence Simplification has been hindered by the lack of supervised data, particularly in languages other than English. Previous work has aligned sentences from original and simplified corpora such as English Wikipedia and Simple English Wikipedia, but this limits corpus size, domain, and language. In this work, we propose using unsupervised mining techniques to automatically create training corpora for simplification in multiple languages from raw Common Crawl web data. When coupled with a controllable generation mechanism that can flexibly adjust attributes such as length and lexical complexity, these mined paraphrase corpora can be used to train simplification systems in any language. We further incorporate multilingual unsupervised pretraining methods to create even stronger models and show that by training on mined data rather than supervised corpora, we outperform the previous best results. We evaluate our approach on English, French, and Spanish simplification benchmarks and reach state-of-the-art performance with a totally unsupervised approach. We will release our models and code to mine the data in any language included in Common Crawl.
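The controllable generation mechanism described above conditions a sequence-to-sequence model on discrete attribute tokens prepended to the input. A minimal sketch of that preprocessing idea, assuming an ACCESS-style character-length-ratio token (the token name `<NbChars_…>` and the bucketing step are illustrative assumptions, not the paper's exact implementation):

```python
# Sketch of controllable-generation preprocessing: prepend a discrete
# control token encoding the target/source length ratio, so a seq2seq
# model learns to condition output length on it. At inference time, the
# user sets the token to request shorter (simpler) output.
# Token naming and bucketing are illustrative assumptions.

def length_ratio_token(source: str, target: str, step: float = 0.05) -> str:
    """Bucket the target/source character-length ratio into a control token."""
    ratio = len(target) / max(len(source), 1)
    bucketed = round(ratio / step) * step
    return f"<NbChars_{bucketed:.2f}>"

def add_control_token(source: str, target: str) -> str:
    """Training-time preprocessing: prefix the source with its control token."""
    return f"{length_ratio_token(source, target)} {source}"

src = "The committee reached a unanimous decision after lengthy deliberation."
tgt = "Everyone on the committee agreed."
print(add_control_token(src, tgt))
```

At training time the token reflects the observed pair; at inference time, fixing it to a low value (e.g. `<NbChars_0.80>`) asks the model for compressed, simpler output, and analogous tokens can control lexical complexity.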


