Multilingual Unsupervised Sentence Simplification

05/01/2020 ∙ by Louis Martin, et al. ∙ Facebook Inria

Progress in Sentence Simplification has been hindered by the lack of supervised data, particularly in languages other than English. Previous work has aligned sentences from original and simplified corpora such as English Wikipedia and Simple English Wikipedia, but this limits corpus size, domain, and language. In this work, we propose using unsupervised mining techniques to automatically create training corpora for simplification in multiple languages from raw Common Crawl web data. When coupled with a controllable generation mechanism that can flexibly adjust attributes such as length and lexical complexity, these mined paraphrase corpora can be used to train simplification systems in any language. We further incorporate multilingual unsupervised pretraining methods to create even stronger models and show that by training on mined data rather than supervised corpora, we outperform the previous best results. We evaluate our approach on English, French, and Spanish simplification benchmarks and reach state-of-the-art performance with a totally unsupervised approach. We will release our models and code to mine the data in any language included in Common Crawl.




1 Introduction

Text Simplification is the task of reducing the complexity of the vocabulary and sentence structure of text while retaining its original meaning, with the goal of improving readability and understanding. Simplification has a variety of important societal applications, for example increasing accessibility for those with cognitive disabilities such as aphasia Carroll et al. (1998), dyslexia Rello et al. (2013), and autism Evans et al. (2014), or for non-native speakers Paetzold and Specia (2016) and children with reading difficulties Gala et al. (2020). Research has mostly focused on English simplification, where corpora with source texts and their associated simplified texts exist and can be automatically aligned, such as English Wikipedia and Simple English Wikipedia or the Newsela corpus Xu et al. (2015). However, such data is difficult to find in languages other than English, and the need for manually simplified corpora limits the incorporation of large quantities of raw text data.

Figure 1: Sentence Simplification Models for Any Language using Unsupervised Data. Sentences from the web are used to create a large scale index that allows mining millions of paraphrases. Subsequently, we finetune pretrained models augmented with controllable mechanisms on the paraphrase corpora to achieve sentence simplification models in any language.

In this work, we focus on leveraging unsupervised data to train sentence simplification systems in multiple languages. As sentence simplification is a special case of paraphrasing, we present a general technique to create training corpora in any language by mining pairs of paraphrased sequences from the web. Using controllable text generation mechanisms proposed in Martin et al. (2020), we then train simplification models by controlling attributes such as length, lexical complexity, and syntactic complexity. Mining paraphrases as a superset of sentence simplifications proves better than mining for simplifications directly. Datasets mined for simplifications would necessitate several heuristics on the type of simplifications that should be mined; we show that a model able to adjust during training on a dataset mined with fewer assumptions can reach better experimental results.

These automatically created corpora benefit from large quantities of data in various languages available online. We apply this technique to mine English, French, and Spanish paraphrase training data. On English, we show that the quality of the models trained with mined data is similar to that of models trained with automatically aligned English Wikipedia and Simple English Wikipedia.

Subsequently, we use multilingual pretraining with BART and mBART Lewis et al. (2019); Liu et al. (2020) to further incorporate unsupervised data in our sentence simplification models. This unsupervised pretraining leverages large quantities of data freely available on the web to create high quality generative models. By using pretrained models and finetuning on our mined paraphrase datasets, we are able to achieve state of the art results with no supervised data in multiple languages. In summary, we make the following contributions:

  • We propose a mining procedure to create large paraphrase corpora in multiple languages, to train sentence simplification systems.

  • We incorporate recent advances in generative pretraining and controllable generation to achieve or match the state of the art in English with 42.65 SARI on ASSET and 40.85 SARI on TurkCorpus datasets with a completely unsupervised approach. We show that we can further improve by more than 1 SARI point by incorporating supervised data.

  • We reach state of the art in multiple languages using only mined data, alleviating the need for language-specific supervised corpora.

2 Related work

Sentence Simplification in multiple languages

Data-driven methods have been the predominant approach in English Sentence Simplification in recent years, requiring large supervised training corpora of complex-simple aligned sentences Wubben et al. (2012); Xu et al. (2016); Zhang and Lapata (2017); Zhao et al. (2018); Martin et al. (2020). Methods have relied on using the Wikipedia edit history Botha et al. (2018), or more notably on English Wikipedia and Simple English Wikipedia with automatic alignment of sentences from similar articles for the creation of such corpora Zhu et al. (2010); Coster and Kauchak (2011); Woodsend and Lapata (2011); Kauchak (2013); Zhang and Lapata (2017). Using Wikipedia and automatic alignments has been shown to have various flaws compared to professional simplifications from news articles such as in the Newsela dataset Xu et al. (2015). There are fewer professional simplifications, and these often come with restrictive licenses that hinder widespread usage and reproducibility.

Multiple efforts have explored simplification in other languages such as Brazilian Portuguese Aluísio et al. (2008), Spanish Saggion et al. (2015); Štajner et al. (2015), Italian Brunato et al. (2015); Tonelli et al. (2016), Japanese Goto et al. (2015); Kajiwara and Komachi (2018); Katsuta and Yamamoto (2019), and French Gala et al. (2020), but these lack a large supervised training corpus such as what is available in English. (More detail can be found in the survey by Alva-Manchego et al. (2020b).) In this work, we show that a method trained on large corpora automatically mined from raw text in any language can reach state of the art results in all languages.

Unsupervised Simplification

When parallel data is not available or not used, sentence simplification systems rely on unsupervised simplification techniques, often based on techniques from machine translation. The prevailing approach is to split a raw monolingual corpus into two disjoint sets of complex and simple sentences with readability metrics to train unsupervised models. Kajiwara and Komachi (2018) train statistical machine translation models on unsupervised alignment, based on word embeddings, of English Wikipedia. Other methods remove the alignment process and train with the two disjoint sets with auto-encoders Surya et al. (2019); Zhao et al. (2020), unsupervised statistical machine translation for Japanese Katsuta and Yamamoto (2019), and back-translation in Spanish and Italian Aprosio et al. (2019). The performance of such unsupervised methods is, however, often below that of their supervised counterparts. In our work we remove the need for separating the raw monolingual corpus into two disjoint sets and instead mine paraphrases directly, and train models with state-of-the-art performance.


Previous work on the unsupervised creation of corpora from noisy text has focused on mining parallel data for machine translation systems. Works have leveraged document retrieval Munteanu and Marcu (2005), language models Buck and Koehn (2016); Koehn et al. (2018, 2019), and embedding space alignment Artetxe and Schwenk (2019) to create large corpora Tiedemann (2012); Schwenk et al. (2019a, b). We focus on paraphrasing for sentence simplifications, which presents new challenges for unsupervised mining. Unlike machine translation, where the same sentence should be identified in two languages, we develop a method to identify varied simplifications of sentences. Mining in translation has leveraged heuristics such as similar length, but paraphrases, and most specifically simplifications, have a wide array of surface forms, including multiple sentences, different vocabulary usage, and removal of content from more complex sentences.

Previous work in unsupervised paraphrasing has learned to align sentences from various corpora Barzilay and Lee (2003) with a variety of different objective functions Liu et al. (2019). Techniques for large-scale mining have been used to create large paraphrase corpora Wieting and Gimpel (2018), but this has not been applied to multiple languages or the task of sentence simplification.

3 Method

We describe how to create sentence simplification models in multiple languages without supervised data. We mine a large quantity of paraphrases from the Common Crawl using libraries such as LASER Artetxe et al. (2018) and faiss Johnson et al. (2019). Then, we show how we leverage controllable generation mechanisms and unsupervised pretraining in addition to unsupervised data mining to train high quality simplification models.

Query For insulation, it uses foam-injected polyurethane which helps ensure the quality of the ice produced by the machine. It comes with an easy to clean air filter.
Mined It has polyurethane for insulation which is foam-injected. This helps to maintain the quality of the ice it produces. The unit has an easy to clean air filter.
Query Here are some useful tips and tricks to identify and manage your stress.
Mined Here are some tips and remedies you can follow to manage and control your anxiety.
Query As cancer cells break apart, their contents are released into the blood.
Mined When brain cells die, their contents are partially spilled back into the blood in the form of debris.
Query The trail is ideal for taking a short hike with small children or a longer, more rugged overnight trip.
Mined It is the ideal location for a short stroll, a nature walk or a longer walk.
Query Thank you for joining us, and please check out the site.
Mined Thank you for calling us. Please check the website.
Table 1: Examples of Mined Paraphrases. Paraphrases, although sometimes not preserving the entire meaning, display various rewriting operations, such as lexical substitution, compression or sentence splitting.
| Corpus | Type | # Sequence Pairs | # Avg. Tokens per Sequence |
| WikiLarge | Supervised | 296,402 | 21.7 (orig.), 16.0 (simp.) |
| English | Mined | 1,194,945 | 22.3 |
| French | Mined | 1,360,422 | 18.7 |
| Spanish | Mined | 996,609 | 22.8 |
Table 2: Statistics of mined paraphrase training corpora compared to standard supervised WikiLarge.

3.1 Mining Paraphrases in Multiple Languages

Challenges in Mining Simplifications

Progress in creating simplification models in multiple languages has been hindered by lack of supervised training data, namely pairs of complex sentences matched with their simplified forms. Existing models often use automatically aligned sentences from Wikipedia and Simple English Wikipedia, but this approach is not scalable and does not exist for other languages. Further, simplifications exist in many possible forms that depend on the target audience they are aimed at: a simple sequence is not always only shorter, but could be split into multiple sentences, use less complex vocabulary, include less detail, and so on. These simplification guidelines are not uniquely defined, and even if they could be used to heuristically mine corpora, rules for different languages might wildly differ and prevent large-scale mining in multiple languages. We confirm experimentally in Section 5.3 that mining simplifications does not achieve better results than the more general mining of paraphrases.

Mining Paraphrases as Generalized Simplifications

Thus, in this work, we focus on creating strong simplification models from paraphrase data. Paraphrases can be seen as a strict generalization of simplifications. To create models that can simplify after being trained on paraphrases, we leverage advancements in controllable text generation. Control models have been used in a variety of settings to adjust length Fan et al. (2018), bias Dinan et al. (2019), and style See et al. (2019). We use the controllable sentence simplification model ACCESS Martin et al. (2020), which learns to control length, amount of paraphrasing, lexical complexity and syntactic complexity, to train on mined paraphrase corpora and dynamically create simplifications at inference time. We then scale our unsupervised paraphrase mining approach to create automatically aligned paraphrase corpora in different languages.

Sequences Extraction

Sentence Simplification consists of multiple rewriting operations, some of which span multiple sentences (e.g. sentence splitting or sentence fusion). To allow for these types of operations to be represented in our data, we extract sequences of multiple sentences from raw documents. As we detail in Section 4.1, these sequences are further filtered to remove noisy text, for example text with too much punctuation or low language model probability. In the following, these series of multiple consecutive sentences are termed sequences.

We extract these sequences from CCNet Wenzek et al. (2019). CCNet is an extraction of Common Crawl, an open source snapshot of the web, that has been split into different languages using fasttext language identification Joulin et al. (2017) and various language modeling filtering techniques to identify high quality, clean sentences. For English and French, we extract 1 billion sequences from CCNet. For Spanish we extract 650 million sequences, the maximum for this language in CCNet after filtering out noisy text.

Creating a Sequence Index Using Embeddings

To automatically mine our paraphrase corpora, we first compute n-dimensional sentence embeddings for each of our sequences using the LASER toolkit Artetxe et al. (2018). LASER is a multilingual sentence embedding model that is trained to map sentences of similar meaning to the same location in the embedding space. We use the faiss library Johnson et al. (2019) to create an index with all these sentence embeddings. faiss indexes are a type of data structure that can store a large number of vectors and provide a fast and efficient interface for searching nearest neighbors within the index.

Mining Paraphrases

For each language, after the billion-scale index is created, we use the same 1 billion sequences as queries to identify potential paraphrases in the index. Each sequence is queried against the index and returns a set of top-k nearest neighbor sequences according to the semantic LASER embedding space using L2 distance (the original LASER implementation uses cosine similarity instead of L2 distance). These nearest neighbors are candidate paraphrases of the query sentence. We apply additional filters to remove poor quality alignments where the sequences are not paraphrases, for example when they are almost identical, when they are contained in one another, or when they were extracted from two consecutive and overlapping sliding windows of the same original document. Table 1 displays some examples of the resulting mined paraphrases in English while Table 2 reports statistics of the mined corpora in English, French and Spanish.
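The query step can be illustrated with a small brute-force search. This is only a sketch: it uses plain Python lists instead of a faiss index, toy 2-dimensional vectors instead of LASER embeddings, and the function names `l2_distance` and `top_k_neighbors` are hypothetical.

```python
import heapq
import math


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def top_k_neighbors(query_id, embeddings, k=8):
    """Return the k nearest (distance, index) pairs for one query.

    The query itself is skipped, since every sequence is both a query
    and an entry of the index.
    """
    query = embeddings[query_id]
    candidates = (
        (l2_distance(query, emb), idx)
        for idx, emb in enumerate(embeddings)
        if idx != query_id
    )
    return heapq.nsmallest(k, candidates)


# Toy 2-d "embeddings"; real sentence embeddings are much larger.
toy_embeddings = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.0, 0.2)]
neighbors = top_k_neighbors(0, toy_embeddings, k=2)
```

A brute-force scan like this is quadratic in the number of sequences, which is why an approximate index is needed at the billion scale described above.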

3.2 Simplifying with ACCESS

We train our models on paraphrase data to produce simplifications by leveraging advancements in controllable text generation, specifically applying the ACCESS control mechanism Martin et al. (2020).

Training with Control Tokens

At training time, the model is provided with control tokens that give oracle information on the target sequence, such as the amount of compression of the target sequence relative to the source sequence. For example, when the target sequence is 80% of the size of the original sequence, we provide the NumChars_80% control token. This encourages the model to rely on the oracle control tokens. At inference time it can then control the generation by selecting a given target control token value.
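As a rough illustration of how such an oracle token could be built, the sketch below computes the character-length ratio between target and source and prepends the matching token to the source. The 5% bucket size, the exact token spelling, and the function names are assumptions of this example rather than details given in the text.

```python
def length_ratio_token(source, target, bucket=0.05):
    """Build a NumChars control token from the target/source length ratio.

    The ratio is rounded to the nearest bucket so that the set of control
    tokens stays small (the bucket size is an assumption of this sketch).
    """
    ratio = len(target) / len(source)
    steps = round(ratio / bucket)
    percent = round(steps * bucket * 100)
    return f"NumChars_{percent}%"


def add_control_token(source, target):
    """Prepend the oracle control token to the source for training."""
    return f"{length_ratio_token(source, target)} {source}"


source = "It is particularly famous for the cultivation of kiwifruit."
target = "It is famous for growing the kiwifruit."
training_input = add_control_token(source, target)
```

At inference time no target is available, so the token would be chosen directly (for example `NumChars_80%`) instead of being computed from a reference.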

Choosing Control Values at Inference

In our experiments, we want the generations to best reflect the simplification typologies of each dataset. To this end, we select the control values that achieve the best SARI on the validation set using hyperparameter search, and keep those values fixed for all sentences in the test set. We repeat this process for each evaluation dataset.
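The selection procedure can be sketched as a search over candidate control values. The text does not specify the search algorithm, so the exhaustive grid below is an assumption, and `validation_score` is a hypothetical callable standing in for "simplify the validation set with these values and compute SARI".

```python
from itertools import product


def search_control_values(candidates, validation_score):
    """Return the control-value setting with the highest validation score.

    candidates: dict mapping a control name to the list of values to try.
    validation_score: callable taking a dict of control values and
        returning a score (hypothetical stand-in for validation SARI).
    """
    names = list(candidates)
    best_setting, best_score = None, float("-inf")
    for values in product(*(candidates[name] for name in names)):
        setting = dict(zip(names, values))
        score = validation_score(setting)
        if score > best_score:
            best_setting, best_score = setting, score
    return best_setting, best_score


# Toy scoring function used only to make the example runnable.
def toy_score(setting):
    return -abs(setting["length"] - 0.8) - abs(setting["lexical"] - 0.75)


grid = {"length": [0.6, 0.8, 1.0], "lexical": [0.5, 0.75, 1.0]}
best_setting, best_score = search_control_values(grid, toy_score)
```

The selected setting would then be kept fixed for every sentence of the corresponding test set.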

Figure 2: Density of several text features in WikiLarge and our mined data. The WordRank ratio is a measure of lexical complexity reduction Martin et al. (2020). Replace-only Levenshtein similarity only considers replace operations in the traditional Levenshtein similarity and assigns 0 weights to insertions and deletions.

3.3 Leveraging Unsupervised Pretraining

Unsupervised pretraining has recently demonstrated large improvements on generative tasks by using sequence-to-sequence models as denoising auto-encoders on large quantities of raw data Lewis et al. (2019); Liu et al. (2020), trained with noising functions such as span-based masking Fan et al. (2019a); Joshi et al. (2020) or shuffling sentence order. We leverage these pretrained models to further extend our unsupervised approach to text simplification. For English, we finetune the pretrained generative model BART Lewis et al. (2019) on our newly created mined training corpora. BART is a pretrained sequence-to-sequence model that can be seen as a generalization of other recent pretrained models such as BERT Devlin et al. (2018) and GPT2 Radford et al.

For non-English languages, we use its multilingual generalization mBART Liu et al. (2020). mBART was pretrained using the BART objective on 25 languages and showed large performance gains for low-resource machine translation.

4 Experimental Setting

We assess the performance of our approach on three languages: English, French, and Spanish.

4.1 Mining Details

Sequence Extraction

We only consider documents from the head split in CCNet, which represents the third of the data with the best perplexity according to a language model. We extract sequences from raw documents using the NLTK sentence tokenizer Bird and Loper (2004) and generate all possible sequences of adjacent sentences with lengths between 10 and 300 characters. We further filter these sequences to remove those with low probability according to a 3-gram Kneser-Ney language model trained with kenlm Heafield (2011). We also filter sequences with too much punctuation.
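The enumeration of adjacent-sentence sequences can be sketched as follows. To keep the example self-contained, a naive regular-expression splitter stands in for the NLTK sentence tokenizer, and the language model and punctuation filters are left out; the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (a stand-in for the NLTK tokenizer)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def extract_sequences(text, min_chars=10, max_chars=300):
    """Generate all runs of adjacent sentences within the length bounds."""
    sentences = split_sentences(text)
    sequences = []
    for start in range(len(sentences)):
        for end in range(start + 1, len(sentences) + 1):
            sequence = " ".join(sentences[start:end])
            if len(sequence) > max_chars:
                break  # longer runs from this start only get longer
            if len(sequence) >= min_chars:
                sequences.append(sequence)
    return sequences


document = "It has an air filter. It is easy to clean. Ice stays fresh."
sequences = extract_sequences(document)
```

For this three-sentence document, every run of one, two or three adjacent sentences falls within the bounds, so six sequences are produced.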

Paraphrase Mining

We compute LASER embeddings of dimension 1024 and reduce them to 512 dimensions with PCA followed by a random rotation. We further compress them using 8-bit scalar quantization. The compressed embeddings are then stored in a faiss inverted file index with 32,768 cells (nprobe=16). These embeddings are used to mine pairs of paraphrases. We return the top-8 nearest neighbors, and keep those with L2 distance lower than 0.05 and relative distance compared to other top-8 nearest neighbors lower than 0.6.
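The two thresholds can be illustrated with a small filter over the neighbors returned for one query. The text does not define the relative distance precisely; this sketch assumes it is a neighbor's distance divided by the mean distance of the other returned neighbors, which may differ from the definition actually used. The function name is hypothetical.

```python
def filter_neighbors(neighbors, max_distance=0.05, max_relative=0.6):
    """Keep neighbors that are both close and clearly closer than the rest.

    neighbors: list of (distance, index) pairs for one query (e.g. top-8).
    Relative distance is ASSUMED here to be the neighbor's distance divided
    by the mean distance of the other returned neighbors.
    """
    kept = []
    for position, (distance, index) in enumerate(neighbors):
        others = [d for i, (d, _) in enumerate(neighbors) if i != position]
        if not others:
            continue  # no relative distance with a single neighbor
        mean_others = sum(others) / len(others)
        relative = distance / mean_others if mean_others > 0 else float("inf")
        if distance < max_distance and relative < max_relative:
            kept.append((distance, index))
    return kept


# Four candidates that all pass the absolute threshold; only the first is
# markedly closer to the query than the others.
candidates = [(0.01, 17), (0.04, 3), (0.045, 8), (0.048, 5)]
kept = filter_neighbors(candidates)
```

Under this assumed definition, the relative threshold discards neighbors that are close in absolute terms but not distinguishable from the rest of the neighborhood.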

Paraphrases Filtering

The resulting paraphrases are filtered to remove almost identical paraphrases by enforcing a case-insensitive character-level Levenshtein distance Levenshtein (1966) greater than or equal to 20%. We remove paraphrases that come from the same document to avoid aligning sequences that overlapped each other in the original text. We also remove paraphrases where one of the sequences is contained in the other. We further filter out any sequence that is present in our evaluation datasets.
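The pair-level filters can be sketched as below. Normalizing the edit distance by the longer sequence length, the document identifiers, and the function names are assumptions of this example; the filter against evaluation datasets is omitted.

```python
def levenshtein(a, b):
    """Character-level edit distance (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def keep_pair(query, mined, query_doc, mined_doc, min_distance=0.2):
    """Apply the same-document, containment and near-duplicate filters."""
    q, m = query.lower(), mined.lower()
    if query_doc == mined_doc:
        return False  # same document: possibly overlapping windows
    if q in m or m in q:
        return False  # one sequence is contained in the other
    # Normalizing by the longer length is an assumption of this sketch.
    normalized = levenshtein(q, m) / max(len(q), len(m))
    return normalized >= min_distance


kept = keep_pair(
    "Thank you for joining us, and please check out the site.",
    "Thank you for calling us. Please check the website.",
    query_doc="doc-a",
    mined_doc="doc-b",
)
```

Lowercasing both sequences before comparison is what makes the distance case-insensitive.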

4.2 Training details

We implement our models with fairseq Ott et al. (2019). All our models are Transformers Vaswani et al. (2017) based on BART Large, keeping the optimization procedure and hyperparameters fixed Lewis et al. (2019). We either randomly initialize weights for the standard sequence-to-sequence experiments or initialize with pretrained BART for the BART experiments. For controllable generation, we use the open-source ACCESS implementation Martin et al. (2018). We use the same control parameters as the original paper, namely length, Levenshtein similarity, lexical complexity, and syntactic complexity. (We modify the Levenshtein similarity parameter to only consider replace operations, by assigning a 0 weight to insertions and deletions. This change helps decorrelate the Levenshtein similarity control token from the length control token and produced better results in preliminary experiments.)

In all our experiments, we report scores averaged over 5 random seeds evaluated on the test set with 95% confidence intervals.
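One common way to obtain such an interval from five runs is a Student's t interval on the mean. The text does not say how the intervals were computed, so the method below is an assumption, and the scores are made-up values rather than results from the paper.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom
# (5 seeds - 1).
T_CRITICAL_DF4 = 2.776


def mean_with_ci(scores):
    """Return (mean, half-width of the 95% confidence interval) for 5 seeds."""
    if len(scores) != 5:
        raise ValueError("T_CRITICAL_DF4 is only valid for exactly 5 scores")
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, T_CRITICAL_DF4 * standard_error


# Hypothetical SARI scores from 5 seeds (not results from the paper).
mean, half_width = mean_with_ci([41.0, 42.0, 43.0, 42.0, 42.0])
```

The result would be reported as the mean plus or minus the half-width.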

4.3 Baselines

In addition to that of previous work, we report results for several basic baselines, which are useful for languages where previous work does not exist.

Identity baseline

The original sequence is the simplification.

Truncation baseline

The original sequence is truncated by keeping the first 80% of words. It proves to be a strong baseline in practice, as measured by standard text simplification metrics.
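A minimal sketch of this baseline, assuming whitespace tokenization and rounding down the number of kept words (neither detail is specified in the text):

```python
def truncate_baseline(sentence, keep_ratio=0.8):
    """Keep the first keep_ratio of the whitespace-separated words."""
    words = sentence.split()
    # Rounding down (but keeping at least one word) is an assumption.
    n_keep = max(1, int(len(words) * keep_ratio))
    return " ".join(words[:n_keep])


truncated = truncate_baseline("a b c d e f g h i j")
```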

Pivot baseline

We use machine translation to provide a baseline for languages for which no simplification corpora are available. The complex non-English sentence is translated to English, simplified with our best English simplification system, and then translated back into the source language. For French and Spanish translation, we use ccmatrix Schwenk et al. (2019b) to train Transformer models with LayerDrop Fan et al. (2019b). We use the BART+ACCESS model trained on Mined + WikiLarge as the English simplification model. While pivoting creates potential errors from inaccurate or unnatural translations, recent improvements of neural translation systems on high resource languages nevertheless make this a strong baseline.


We report reference scores for English because multiple references are available for each original sentence. We compute these scores in a leave-one-out scenario where each reference is evaluated against all the others and then scores are averaged over all references. (To avoid creating a discrepancy in terms of number of references between this setting where we leave one reference out and when we evaluate the models with all references, we compensate by duplicating one of the other references at random so that the total number of references is unchanged.)
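The leave-one-out procedure, including the duplicated reference, can be sketched as follows. The `metric` argument is a hypothetical callable standing in for SARI, BLEU or FKGL, and the fixed random seed is an assumption added for reproducibility.

```python
import random


def leave_one_out_score(source, references, metric, seed=0):
    """Average the metric of each reference scored against the others.

    metric(source, prediction, references) -> float is a hypothetical
    callable standing in for a simplification metric.
    """
    rng = random.Random(seed)
    scores = []
    for i, held_out in enumerate(references):
        others = references[:i] + references[i + 1:]
        # Duplicate one remaining reference so the count is unchanged.
        padded = others + [rng.choice(others)]
        scores.append(metric(source, held_out, padded))
    return sum(scores) / len(scores)


# Toy metric: 1.0 if the held-out reference matches any other reference.
def toy_metric(source, prediction, references):
    return 1.0 if prediction in references else 0.0


score = leave_one_out_score("src", ["a", "a", "b", "c"], toy_metric)
```

In the toy example, the two identical references each find a match while the other two do not, so the averaged score is 0.5.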

Data ASSET TurkCorpus
PBMT-R Wubben et al. (2012) PWKP (Wikipedia)
UNTS Surya et al. (2019) Unsup. data
Dress-LS Zhang and Lapata (2017) WikiLarge
DMASS-DCSS Zhao et al. (2018) WikiLarge
ACCESS Martin et al. (2020) WikiLarge
Identity Baseline
Truncate Baseline
Seq2Seq WikiLarge
Seq2Seq Mined
BART+ACCESS WikiLarge + Mined
Table 3: Unsupervised Sentence Simplification for English. We display SARI, BLEU and FKGL on TurkCorpus and ASSET English evaluation datasets.
Original It is particularly famous for the cultivation of kiwifruit.
Simplified It is famous for growing the kiwifruit.
Original History Landsberg prison, which is in the town’s western outskirts, was completed in 1910.
Simplified The Landsberg prison, which is near the town, was built in 1910.
Original In 2004, Roy was selected as the greatest goaltender in NHL history by a panel of 41 writers, coupled with a simultaneous fan poll.
Simplified In 2004, Roy was chosen as the greatest goaltender in NHL history by a group of 41 writers.
Original The name ”hornet” is used for this and related species primarily because of their habit of making aerial nests (similar to the true hornets) rather than subterranean nests.
Simplified The name ”hornet” is used for this and related species because they make nests in the air (like the true hornets) rather than in the ground.
Original Nocturnes is an orchestral composition in three movements by the French composer Claude Debussy.
Simplified Nocturnes is a piece of music for orchestra by the French composer Claude Debussy.
Original This book by itself is out of print having been published along with nine short stories in the collection The Worthing Saga (1990).
Simplified This book by itself is out of print. It was published along with nine short stories in 1990.
Table 4: Examples of Generated Simplifications. We show simplifications generated by our best unsupervised model: BART+ACCESS trained on mined data only. Bold highlights differences between original and simplified.

4.4 Evaluation Metrics


Sentence simplification is commonly evaluated with SARI Xu et al. (2016), which compares model generated simplifications with the source sequence and gold references. It averages F1 scores for addition, keep, and deletion operations. We compute SARI with the EASSE simplification evaluation suite Alva-Manchego et al. (2019). (We use the latest version of SARI implemented in EASSE, which fixes bugs and inconsistencies from the traditional implementation of SARI. As a consequence, we also recompute scores from previous systems that we compare to. We do so by using the system predictions provided by the respective authors, and available in EASSE.)


We report BLEU scores for completeness, but these should be carefully interpreted. They have been found to correlate poorly with human judgments of simplicity Sulem et al. (2018). Furthermore, the identity baseline achieves very high BLEU scores on some datasets (e.g. 92.81 on ASSET or 99.36 on TurkCorpus), which underlines the weaknesses of this metric.


Finally, we report readability scores using the Flesch-Kincaid Grade Level (FKGL) Kincaid et al. (1975), a linear combination of sentence lengths and word lengths.
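For reference, the standard FKGL formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies it with a rough vowel-group syllable counter and regular-expression tokenization, both of which are simplifying assumptions; evaluation toolkits may count syllables and sentences differently.

```python
import re


def count_syllables(word):
    """Rough syllable count: number of vowel groups (English heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl_from_counts(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * n_words / n_sentences + 11.8 * n_syllables / n_words - 15.59


def fkgl(text):
    """Approximate FKGL of a non-empty English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(word) for word in words)
    return fkgl_from_counts(len(words), len(sentences), syllables)


original = fkgl("It is particularly famous for the cultivation of kiwifruit.")
simplified = fkgl("It is famous for growing the kiwifruit.")
```

Lower values indicate text that is estimated to be easier to read, so the simplified sentence scores lower than the original under this approximation.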

Data ALECTOR (French) SIMPLEXT Corpus (Spanish)
Identity Baseline
Truncate Baseline
Pivot Baseline
Seq2Seq Mined
Table 5: Unsupervised Text Simplification in French and Spanish. We display SARI, BLEU and FKGL on the French evaluation dataset ALECTOR and the Spanish evaluation dataset SIMPLEXT Corpus.

4.5 Evaluation Data


To evaluate our models in English, we use ASSET Alva-Manchego et al. (2020a) and TurkCorpus Xu et al. (2016). TurkCorpus (sometimes known as the WikiLarge evaluation set) and ASSET were created using the same 2000 valid and 359 test source sentences. TurkCorpus contains 8 reference simplifications per source sentence and ASSET contains 10 references per source. TurkCorpus's simplifications mostly consist of small lexical paraphrases. ASSET is a generalization of TurkCorpus with a more varied set of rewriting operations. ASSET simplifications are also considered simpler than simplifications in TurkCorpus by human judges.


For French, we use the ALECTOR dataset Gala et al. (2020) for evaluation. ALECTOR is a collection of literary (tales, stories) and scientific (documentary) texts along with their manual document-level simplified versions. These documents were extracted from material available to French primary school pupils.

Most of these documents were simplified line by line, each line consisting of a few sentences. For each original document, we align each line that contains fewer than 6 sentences with the closest line in its simple counterpart, using the LASER embedding space. The resulting alignments are split into validation and test sets by randomly sampling documents for validation (450 samples) and using the rest for test (416 samples).


For Spanish we use the SIMPLEXT Corpus from Saggion et al. (2015). The SIMPLEXT Corpus is a set of 200 news articles that were manually simplified by trained experts for people with learning disabilities. We split the dataset by documents into a valid (460 samples) and test set (449 samples). SIMPLEXT Corpus contains simplifications with a high editing ratio, where the original sentence is strongly rephrased and compressed compared to other evaluation datasets.

5 Results

We now assess the quality of our mined data and the improvements brought by unsupervised pretraining for simplifying English, French and Spanish.

5.1 English Simplification

We compare models trained on our mined corpus of 1.2 million English paraphrase pairs (see Table 2) with models trained with the standard simplification dataset WikiLarge. We also compare to other state-of-the-art supervised models Wubben et al. (2012); Zhang and Lapata (2017); Zhao et al. (2018); Martin et al. (2020) and the only previously published unsupervised model Surya et al. (2019).

Using Seq2Seq Models on Mined Data

When comparing a Transformer sequence-to-sequence model (Seq2Seq) trained on WikiLarge with one trained on the mined corpus, we find that the model trained on the mined data performs better by 5.32 SARI on ASSET and 2.25 SARI on TurkCorpus (see Table 3). It is surprising that a model trained solely on a paraphrase corpus can achieve such good results on simplification benchmarks. Multiple works have shown that simplification models suffer from not making enough modifications to the source sentence and that forcing models to rewrite the input is beneficial Wubben et al. (2012); Martin et al. (2020). We speculate that this might explain why our sequence-to-sequence model trained on paraphrases achieves these results. Furthermore, when generating a paraphrase, our model might assign higher probabilities to frequent words, which might naturally perform lexical simplifications to some extent.

Adding BART and ACCESS

We use mined data to finetune BART and add the simplification-based generative control from ACCESS. When trained on the mined English data and WikiLarge, BART+ACCESS performs the best in terms of SARI on ASSET (44.15). On TurkCorpus the best results are achieved by training BART+ACCESS on WikiLarge (42.62 SARI).

Using only unsupervised data, we are able to achieve a +2.52 SARI improvement over the previous state of the art on ASSET. (It should be noted that the previous state of the art by Martin et al. (2020) was achieved before the publication of the ASSET dataset. ACCESS, the authors' controllable sentence simplification system, might have performed better were it finetuned for the ASSET dataset.) We achieve similar results on TurkCorpus with 40.85 SARI with no supervision (average over 5 random seeds) versus the previous state of the art of 41.38 SARI (best seed selected on the validation set by the authors).

Examples of Simplifications

Various examples from our totally unsupervised simplification system are shown in Table 4. Examining the simplifications, we see reduced sentence length, sentence splitting of a complex sentence into multiple shorter sentences, and the use of simpler vocabulary. For example, the word cultivation is changed into growing and aerial nests is simplified into nests in the air. Additional simplifications include the removal of less important content and removal of content within parentheses.

5.2 French and Spanish Simplification

Our unsupervised approach to text simplification can be applied to any language provided enough data can be mined. As for English, we first create a corpus of paraphrases composed of 1.4 million sentences in French and 1.0 million sentences in Spanish (see Table 2). We evaluate the quality of our mined corpus in Table 5. Unlike for English, where supervised training data has been created using Simple English Wikipedia, no such datasets exist for French or Spanish simplification. We compare to several baselines, namely the identity, truncation and pivot baselines.

Using Seq2Seq Models on Mined data

Compared to our baselines, training a Transformer sequence-to-sequence model on our mined data achieves stronger results in French and stronger results in Spanish except for the pivot baseline.

Adding mBART and ACCESS

To incorporate multilingual pretraining, we use mBART, which was pretrained on 25 languages, whereas BART was pretrained on English only. Similar to what we observed in English, we achieve the best results by combining mBART, ACCESS, and training on mined data. This combination outperforms our strongest baseline by +8.25 SARI in French but seems to lag behind it in Spanish.

As shown in the English results in Table 3, mBART also suffers a small loss in performance of 1.54 SARI compared to its monolingual English counterpart BART, probably because it handles 25 languages instead of one. A monolingual version of BART trained for French or Spanish might therefore perform even better.

Evaluation Metrics Weaknesses in Spanish

For Spanish, the pivot baseline achieves the highest SARI, but a very low BLEU of 0.94. Qualitative examination of the pivot simplifications showed that the generated Spanish simplifications had very few words in common with the gold standard references (hence the low BLEU), although these predictions were correct simplifications of the source.

The SIMPLEXT Corpus contains highly rewritten and compressed simplifications (more compressed on average than those of the evaluation datasets in other languages). Moreover, the SARI metric reserves 33% of the score for rewarding “correct” word deletions, which might affect most of the source words in this dataset. As a confirmation, a dummy system that deletes all words from the source (i.e. returns an empty simplification) achieves a SARI score of 33.33 and a BLEU of 0. This finding calls into question the use of SARI in settings with high rates of compression and rewriting.
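
The 33.33 figure follows from simple arithmetic if SARI is taken as the unweighted mean of its three component scores (addition, keeping, deletion). The sketch below assumes that an empty output earns no credit for additions or kept words and full credit for deletions, which is what the reported score implies; the function name is illustrative and this is not a full SARI implementation.

```python
def sari_from_components(add_score: float, keep_score: float, delete_score: float) -> float:
    """Average the three SARI component scores (each on a 0-100 scale)."""
    return (add_score + keep_score + delete_score) / 3.0


# Assumed component scores for a system that returns an empty simplification:
# nothing added, nothing kept, every deletion counted as correct.
empty_output_sari = sari_from_components(add_score=0.0, keep_score=0.0, delete_score=100.0)
print(f"{empty_output_sari:.2f}")  # 33.33
```

When references delete most of the source, a third of the metric can thus be earned without producing any text at all.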

5.3 Ablations

Mining Simplifications vs. Paraphrases

Table 6: Mining Simplifications vs. Paraphrases. Comparison of models trained on our mined paraphrase corpora (Seq2Seq Para.) or on our mined simplification corpora (Seq2Seq Simpl.), evaluated on ASSET (English).

In this work, we mined paraphrases to train simplification models. We also considered directly mining simplifications using simplification heuristics.

To mine a simplification dataset for comparison, we followed the same procedure of querying 1 billion sequences on an index of 1 billion sequences. We then kept only pairs that contained sentence splits, were compressed by reducing sequence length, or had simpler vocabulary (based on word frequencies). We also removed the paraphrase constraint that required the two sentences of a pair to be sufficiently different (Levenshtein distance greater than 20%). We tuned these heuristics to optimize SARI scores on the validation set. The resulting dataset is composed of 2.7 million simplification pairs.
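
The filtering heuristics described above could look roughly like the sketch below. It is an illustration under several assumptions, not the tuned filter used for the experiments: sentence splits are detected by counting sentence-ending punctuation, compression is measured in characters, vocabulary simplicity is the mean frequency rank of the words (lower rank means more frequent), and the Levenshtein distance is normalized by the longer string. The `word_rank` table and `unknown_rank` default are hypothetical inputs.

```python
import re
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance computed with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def is_different_enough(source: str, target: str, threshold: float = 0.2) -> bool:
    """Paraphrase constraint: normalized distance above the threshold (normalization assumed)."""
    longest = max(len(source), len(target), 1)
    return levenshtein(source, target) / longest > threshold


def count_sentences(text: str) -> int:
    """Count non-empty segments separated by sentence-ending punctuation."""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


def mean_rank(text: str, word_rank: Dict[str, int], unknown_rank: int = 10**6) -> float:
    """Average frequency rank of the words; unseen words get `unknown_rank`."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return float(unknown_rank)
    return sum(word_rank.get(w, unknown_rank) for w in words) / len(words)


def looks_like_simplification(source: str, target: str, word_rank: Dict[str, int]) -> bool:
    """Keep a pair if it splits sentences, compresses, or uses simpler vocabulary."""
    has_split = count_sentences(target) > count_sentences(source)
    is_compressed = len(target) < len(source)
    has_simpler_vocab = mean_rank(target, word_rank) < mean_rank(source, word_rank)
    return has_split or is_compressed or has_simpler_vocab
```

Because the three conditions are combined with a logical or, a pair passes as soon as any one form of simplification is detected.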

In Table 6, we show that sequence-to-sequence models trained on paraphrases achieve better performance. A similar trend exists with BART and ACCESS, justifying the simpler approach of mining paraphrases instead of simplifications.

How Much Mined Data Do You Need?

Our proposed method leverages the large quantity of sentences in many languages available on the web to mine millions of paraphrases. We investigate the importance of a scalable mining approach that can create million-sized training corpora for sentence simplification. In Table 7, we analyze the performance of training our best model on English, a combination of BART and ACCESS, on different amounts of mined data. SARI improves drastically as the number of mined pairs increases, indicating that efficient mining at scale is critical to achieve state-of-the-art performance. Unlike human-created training sets, unsupervised mining with LASER and faiss allows for even larger datasets, and enables this for multiple languages.

Table 7: Importance of Large Scale Mining. We finetune BART+ACCESS on varying amounts of mined training data, up to the full 1.2M, and evaluate on ASSET (English).

Improvements from Pretraining and Controllable Simplification

Table 8: Relative Importance of Pretraining and ACCESS. We display SARI, BLEU and FKGL scores on the ASSET evaluation dataset of models trained on our mined paraphrase data.

Original: They are culturally akin to the coastal peoples of Papua New Guinea.
ACCESS: They’re culturally much like the Papua New Guinea coastal peoples.
BART+ACCESS: They are closely related to coastal people of Papua New Guinea

Original: Orton and his wife welcomed Alanna Marie Orton on July 12, 2008.
ACCESS: Orton and his wife had been called Alanna Marie Orton on July 12.
BART+ACCESS: Orton and his wife gave birth to Alanna Marie Orton on July 12, 2008.

Original: He settled in London, devoting himself chiefly to practical teaching.
ACCESS: He set up in London and made himself mainly for teaching.
BART+ACCESS: He settled in London and devoted himself to teaching.

Table 9: Influence of BART on Simplifications. We display some examples of generations that illustrate how BART improves the fluency and meaning preservation of generated simplifications.

Unlike previous approaches to text simplification, we use pretraining to train our simplification systems. In a qualitative examination, we found that the main improvement from pretraining is increased fluency and meaning preservation in the output generations. For example, in Table 9, the model trained with ACCESS alone replaced culturally akin with culturally much like, whereas BART+ACCESS simplified it into the more fluent closely related.

While our simplification models trained on mined data see several million sentences, pretraining methods are typically trained on billions of sentences. We combine pretrained models with controllable simplification, which enhances the performance of the simplification system in addition to allowing it to adapt to different types of simplified text given the needs of the end audience.
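
One way to picture the controllable mechanism is as special tokens prepended to the source sentence, each encoding a target attribute value such as a length ratio. The sketch below is a simplified illustration: the token format, the control names (`NbChars`, `LevSim`) and the 0.05 bucket width are assumptions modeled on ACCESS-style control tokens, not a specification of the system used here.

```python
from typing import Dict


def bucket(value: float, width: float = 0.05) -> float:
    """Round a ratio to the nearest multiple of `width` (bucket width is assumed)."""
    return round(value / width) * width


def char_length_ratio(source: str, target: str) -> float:
    """Target length divided by source length, in characters."""
    return len(target) / max(len(source), 1)


def add_control_tokens(source: str, controls: Dict[str, float]) -> str:
    """Prepend one token per control, e.g. '<NbChars_0.80>' (format is illustrative)."""
    tokens = [f"<{name}_{bucket(value):.2f}>" for name, value in controls.items()]
    return " ".join(tokens + [source])


# At training time the value would be measured on each (source, target) pair;
# at inference time it is chosen to suit the intended audience.
example = add_control_tokens("He settled in London.", {"NbChars": 0.8, "LevSim": 0.75})
print(example)  # <NbChars_0.80> <LevSim_0.75> He settled in London.
```

Changing the requested values at inference time, for instance a smaller length ratio, is what would let a single trained model produce different kinds of simplifications.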

Table 8 highlights that while both the BART pretrained model and the ACCESS controllable simplification mechanism bring some improvement over standard sequence-to-sequence in terms of SARI, they work best in combination and boost the performance by +4.62 SARI.

6 Conclusion

We propose an unsupervised approach to text simplification using controllable generation mechanisms and pretraining in combination with large scale mining of paraphrases from the web. This approach is language agnostic and achieves state-of-the-art results, even above previously published results of supervised systems, on three languages: English, French, and Spanish. In future work, we plan to investigate how to scale this approach to more languages and types of simplifications.


This work was partly supported by Benoît Sagot’s chair in the PRAIRIE institute, funded by the French national agency ANR as part of the “Investissements d’avenir” programme under the reference ANR-19-P3IA-0001.