Text simplification reduces the lexical and structural complexity of a sentence in order to increase its intelligibility. It benefits individuals with low language skills and has many applications in education and journalism. A simplified version of a text is also easier to process for downstream tasks such as parsing, semantic role labeling, and information extraction.
These systems rely on large corpora of paired complex and simplified sentences, which severely restricts their use in other languages and their adaptation to downstream tasks in different domains. It is therefore essential to explore unsupervised or semi-supervised learning paradigms that can work effectively with unpaired data.
In this work, we adopt the back-translation framework to perform unsupervised and semi-supervised text simplification. Back-translation converts the unsupervised task into a supervised one by generating sentence pairs on the fly. It has been used successfully in unsupervised neural machine translation (NMT) [2, 15], semantic parsing, and natural language understanding. The denoising autoencoder (DAE) plays an essential part in back-translation models: it performs language modeling and helps the system learn useful structures and features from monolingual data. In NMT, the two translation directions are equivalent and the denoising autoencoders are symmetric, meaning that both languages use the same types of noise (mainly word dropout and shuffling). However, if we treat the sets of simple and complex sentences as two different languages, the two translation directions are asymmetric: translating from simple to complex is an expansion process that requires generating extra content, whereas the inverse direction requires distilling information. Moreover, text simplification is a monolingual translation task. Inputs and outputs are quite similar, which makes it harder to capture the features that distinguish complex sentences from simple ones. As a result, symmetric denoising autoencoders may not be very helpful for modeling sentences of differing complexity, which makes it non-trivial to generate appropriate parallel data.
To tackle this problem, we propose asymmetric denoising autoencoders for sentences of different complexity. We analyze the effect of the denoising type on simplification performance and show that separate denoising methods help the decoders generate suitable sentences at each complexity level. In addition, we define several criteria for evaluating the generated sentences and optimize them with policy gradient, as a further way to improve the quality of the generated sentences. Our approach relies on two unpaired corpora, one of which is statistically simpler than the other. In summary, our contributions are:
We adopt the back-translation framework to utilize large amounts of unpaired sentences for text simplification.
We propose asymmetric denoising autoencoders for sentences with different complexity and analyze the corresponding effects.
We develop methods to evaluate both simple and complex sentences derived from back-translation and use reinforcement learning algorithms to promote the quality of the back-translated sentences.
Xu et al. (2016) achieved state-of-the-art performance by leveraging paraphrase rules extracted from bilingual texts. Recently, neural network models have been widely used in simplification systems. Nisioi et al. (2017) first applied the Seq2Seq architecture to text simplification. Several extensions of this architecture have also been proposed, such as augmented memory and multi-task learning. Furthermore, Zhang and Lapata (2017) proposed DRESS, a Seq2Seq model trained in a reinforcement learning framework in which sentences with high fluency, simplicity, and adequacy are rewarded during training. Zhao et al. (2018) used a Transformer integrated with external knowledge and achieved state-of-the-art performance in automatic evaluation. Kriz et al. (2019) proposed a complexity-weighted loss and a reranking system to improve the simplicity of the output sentences. All of the systems above require large amounts of parallel data.
For unsupervised simplification, several systems perform only lexical simplification [18, 20], replacing complicated words with simpler synonyms and ignoring other operations such as reordering and rephrasing. Surya et al. (2018) proposed an unsupervised method for neural models, using adversarial training to enforce a similar attention distribution between complex and simple sentences. They also tried back-translation with standard denoising techniques but did not obtain good results. We believe it is inappropriate to apply the back-translation framework mechanically to the simplification task, so in this work we make several improvements and achieve promising results.
The architecture of our simplification system is illustrated in Figure 1. The system consists of a shared encoder $E$ and a pair of independent decoders: $D_s$ for simple sentences and $D_c$ for complex sentences. Denote the corresponding sentence spaces by $S$ and $C$. The encoder and decoders are first pre-trained as asymmetric denoising autoencoders (see below) on the separate corpora. Next, the model goes through an iterative process. At each iteration, a simple sentence $x \in S$ is translated into a relatively complicated one, $\hat{C}(x)$, via the current $E$ and $D_c$. Similarly, a complex sentence $y \in C$ is translated into a relatively simple version, $\hat{S}(y)$, via $E$ and $D_s$. The pairs $(\hat{C}(x), x)$ and $(\hat{S}(y), y)$ are automatically generated parallel sentences that can be used to train the model in a supervised manner with cross-entropy loss. During supervised training, the current model can also be regarded as a pair of translation policies. Let $\tilde{S}(y)$ and $\tilde{C}(x)$ denote the simple and complex sentences sampled from the current policies. The corresponding rewards $R_s$ and $R_c$ are calculated according to their quality. The model parameters are updated with both the cross-entropy loss and policy gradient.
In the back-translation framework, the shared encoder aims to represent both simple and complex sentences in the same latent space, and the decoders need to decode this representation into sentences of the corresponding type. We update the model by minimizing the cross-entropy loss:
$$\mathcal{L}_{ce} = \mathbb{E}_{x \sim S}\left[-\log P_{c \to s}\big(x \mid \hat{C}(x)\big)\right] + \mathbb{E}_{y \sim C}\left[-\log P_{s \to c}\big(y \mid \hat{S}(y)\big)\right]$$
where $P_{c \to s}$ and $P_{s \to c}$ represent the translation models from complex to simple and vice versa. The updated model tends to generate better synthetic sentence pairs for the next training round. Through such iterations, the model and the back-translation process reinforce each other and eventually lead to good performance.
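The pair-generation step of one back-translation round can be sketched as follows. This is only an illustration of the data flow: the function name `make_synthetic_pairs` and the two toy translators are hypothetical stand-ins for the current model, not part of the described system.

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]
Pair = Tuple[str, str]  # (synthetic source, original sentence used as target)


def make_synthetic_pairs(
    simple_batch: List[str],
    complex_batch: List[str],
    simple_to_complex: Translator,
    complex_to_simple: Translator,
) -> Tuple[List[Pair], List[Pair]]:
    """Build the two synthetic parallel sets for one back-translation round."""
    # A simple sentence is translated to a synthetic complex source; the
    # original simple sentence is the target for the complex-to-simple model.
    c2s_pairs = [(simple_to_complex(x), x) for x in simple_batch]
    # A complex sentence is translated to a synthetic simple source; the
    # original complex sentence is the target for the simple-to-complex model.
    s2c_pairs = [(complex_to_simple(y), y) for y in complex_batch]
    return c2s_pairs, s2c_pairs


# Toy stand-ins for the current model (hypothetical, for illustration only).
def toy_s2c(sentence: str) -> str:
    return sentence + " indeed"


def toy_c2s(sentence: str) -> str:
    return " ".join(sentence.split()[:3])


c2s_pairs, s2c_pairs = make_synthetic_pairs(
    ["the cat sat"], ["the feline sat upon the mat"], toy_s2c, toy_c2s
)
```

In a real system the two translators would be the shared encoder combined with the respective decoder, and both pair lists would feed the supervised cross-entropy update before the next round regenerates them.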
Lample et al. (2018) showed that denoising strategies such as word dropout and shuffling have a critical impact on unsupervised NMT systems. We argue that these symmetric noises may not be very effective for the simplification task. In this section, we therefore describe our asymmetric noises for the simple and complex corpora.
Noise for Simple Sentences
Sentences with low complexity tend to have simple words and structures. We introduce three types of noise to help the model capture these characteristics.
Substitution: We replace relatively simple words with more advanced expressions under the guidance of Simple PPDB. Simple PPDB is a subset of the Paraphrase Database (PPDB) adapted for the simplification task. It contains 4.5 million pairs of complex and simplified phrases. Each pair constitutes a simplification rule and has a score indicating its confidence.
Table 1 shows several examples, where advanced expressions such as “fatigued” and “wary” can be simplified to “tired”. Here, however, we apply these rules in the reverse direction: if “tired” appears in the sentence, it is replaced by one of the candidates above with a fixed probability, which is set to 0.9 in our experiments. Rules with scores lower than 0.5 are discarded, and for each word we keep only the five phrases with the highest confidence scores as candidates. During substitution, a substitute expression is randomly sampled from the candidates and replaces the original phrase.
|Score||Complex phrase||Simple phrase|
|0.95516||completely exhaust||tired|
Substitution helps the model learn the word distribution of the simplified sentences. To some extent, it also simulates the lexical simplification process, which encourages the decoder to generate simpler words from the shared latent space.
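A minimal sketch of the reverse substitution noise is given below, assuming rules are available as (score, complex phrase, simple phrase) triples. The function names are hypothetical, and apart from the single rule quoted in the table above, the example rules and their scores are made up for illustration. The sketch matches single tokens only; matching multi-word simple phrases would need an extra lookup step.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Rule = Tuple[float, str, str]  # (confidence score, complex phrase, simple phrase)


def build_reverse_candidates(
    rules: List[Rule], min_score: float = 0.5, top_k: int = 5
) -> Dict[str, List[str]]:
    """Map each simple word to its top-k complex paraphrases by confidence."""
    by_simple = defaultdict(list)
    for score, complex_phrase, simple_phrase in rules:
        if score >= min_score:  # rules scoring lower than 0.5 are discarded
            by_simple[simple_phrase].append((score, complex_phrase))
    return {
        simple: [phrase for _, phrase in sorted(cands, reverse=True)[:top_k]]
        for simple, cands in by_simple.items()
    }


def substitute(
    tokens: List[str],
    candidates: Dict[str, List[str]],
    p: float = 0.9,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Replace simple words by a randomly sampled complex candidate with probability p."""
    rng = rng or random.Random()
    out: List[str] = []
    for tok in tokens:
        if tok in candidates and rng.random() < p:
            out.extend(rng.choice(candidates[tok]).split())
        else:
            out.append(tok)
    return out


# Only the first rule comes from the table above; the other two are invented.
rules = [
    (0.95516, "completely exhaust", "tired"),
    (0.80, "fatigued", "tired"),
    (0.30, "wary", "tired"),
]
candidates = build_reverse_candidates(rules)
noised = substitute("Their voices sound tired".split(), candidates)
```

For complex sentences the same routine could be reused with the mapping built in the forward direction (complex phrase as key, simple phrase as candidate).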
Additive: Additive noise inserts additional words into the input sentences. Fevry and Phang (2018) used an autoencoder with additive noise to perform sentence compression and generate imperfect but valid sentence summaries. Additive noise forces the model to subsample words from the corrupted input and generate a reasonable sentence, which helps the model capture the main trunk of a sentence in the simplification task.
For each original input, we randomly select another sentence from the training set and sample a subsequence from it without replacement. We then insert the subsequence into the original input. Instead of sampling independent words, we sample bigrams from the additional sentence. The subsequence length depends on the length of the original input: in our experiments, the sampled sequence serving as noise accounts for 25%-35% of the whole noised sentence.
Shuffle: Word shuffling is a common noising method in denoising autoencoders and has been shown to help the model learn useful structure in sentences. To distribute the additive words evenly over the noised simple sentence, we concatenate the original sentence with the additive subsequence and completely shuffle the bigrams, keeping each word pair together.
An example of the noising process on a simple sentence is given in Table 2.
|Original||Their voices sound tired|
|+ substitution||Their voices sound exhausted|
|+ additive & shuffle||sound exhausted he knows Their voices|
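The additive and shuffle steps for simple sentences could be implemented along the following lines. This is a sketch under assumptions: sentences are split into non-overlapping word pairs (consistent with the example in Table 2), and the function names and the default `noise_ratio` of 0.3 are illustrative choices within the 25%-35% range stated above.

```python
import random
from typing import List, Optional


def to_bigrams(tokens: List[str]) -> List[List[str]]:
    """Split tokens into consecutive non-overlapping pairs (the last may be a single word)."""
    return [tokens[i:i + 2] for i in range(0, len(tokens), 2)]


def additive_shuffle(
    original: List[str],
    other: List[str],
    noise_ratio: float = 0.3,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Add bigrams sampled from another sentence, then shuffle all bigrams."""
    rng = rng or random.Random()
    # Number of noise tokens n such that n / (len(original) + n) is close to noise_ratio.
    n_noise = round(noise_ratio * len(original) / (1.0 - noise_ratio))
    other_bigrams = to_bigrams(other)
    n_bigrams = min(len(other_bigrams), max(1, n_noise // 2))
    noise = rng.sample(other_bigrams, n_bigrams)  # sampling without replacement
    units = to_bigrams(original) + noise
    rng.shuffle(units)  # complete shuffle at the bigram level
    return [tok for unit in units for tok in unit]


noised = additive_shuffle(
    "Their voices sound exhausted".split(),
    "he knows the answer".split(),
    rng=random.Random(0),
)
```

With a four-word input, one bigram of noise is added, so the noise makes up two of six tokens (about 33%), and every original word pair stays adjacent in the output.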
Noise for Complex Sentences
Substitution is also performed for complex sentences. Here, we apply the rules in Simple PPDB in their normal direction, rewriting complicated words into simpler versions. The rest of the process is the same as the substitution method for simple sentences. In addition, we apply two other noising methods.
Drop: Word dropping discards several words from the sentence. During reconstruction, the decoder has to recover the removed words from the context. Translation from simple to complex usually involves sentence expansion, which requires the decoder to generate extra words and phrases. Word dropping aligns the autoencoding task more closely with sentence expansion and improves the quality of the generated sentences.
Since words with lower frequency usually carry more semantic information, we delete only “frequent words”, each with a fixed dropping probability. We define a “frequent word” as a word with more than 100 occurrences in the entire corpus. A similar approach has also been used in unsupervised language generation. The dropping probability is a hyperparameter fixed in our experiments.
Shuffle: Unlike the complete shuffle used for simple sentences, we shuffle complex sentences only slightly. Complex sentences have no additive noise, and as sentences get longer and more complex, it becomes hard for the decoder to reconstruct them from completely shuffled inputs. Similar to Lample et al. (2018), we limit the maximum distance between a shuffled word and its original position.
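The two complex-sentence noises can be sketched as below. The dropping probability `p_drop=0.5` and the distance limit `max_dist=3` are placeholder values chosen for the example, not values taken from the experiments, and the word counts are invented. The local shuffle uses the common trick of adding a random offset to each position and sorting, which keeps every word within `max_dist` of where it started.

```python
import random
from collections import Counter
from typing import List, Mapping, Optional


def drop_frequent_words(
    tokens: List[str],
    counts: Mapping[str, int],
    p_drop: float,
    min_count: int = 100,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Drop only words seen more than min_count times, each with probability p_drop."""
    rng = rng or random.Random()
    kept = [
        t for t in tokens
        if not (counts.get(t, 0) > min_count and rng.random() < p_drop)
    ]
    return kept if kept else list(tokens)  # never return an empty sentence


def local_shuffle(
    tokens: List[str], max_dist: int = 3, rng: Optional[random.Random] = None
) -> List[str]:
    """Slightly shuffle tokens: sort positions after adding noise in [0, max_dist)."""
    rng = rng or random.Random()
    keys = [i + rng.random() * max_dist for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=lambda i: keys[i])
    return [tokens[i] for i in order]


# Invented corpus counts for illustration.
counts = Counter({"the": 500, "of": 300, "was": 250, "treaty": 40, "ratified": 7})
sentence = "the treaty was ratified".split()
noised = local_shuffle(drop_frequent_words(sentence, counts, p_drop=0.5))
```

Because rare words are never dropped, the content-bearing words survive the noise and the decoder mainly has to regenerate function words and restore local order.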
Reward in Back-Translation
To further improve the training process and generate more appropriate sentences for the following iterations, we propose three ranking metrics as rewards and directly optimize them through policy gradient:
Fluency: The fluency of a sentence is measured by language models. We train two LSTM language models, one for each type of sentence, on the corresponding data. For a sentence $x$, the fluency reward $r_f$ is derived from its perplexity and scaled to $[0, 1]$.
Relevance: The relevance score $r_r$ indicates how well the meaning is preserved during translation. For the inputs and the sampled sentences, we generate sentence vectors by taking a weighted average of word embeddings and calculate their cosine similarity.
Complexity: The complexity reward $r_c$ is derived from the Flesch–Kincaid Grade Level index (FKGL). FKGL refers to the grade level that must be reached to understand a specific text. Typically, the FKGL score is positively correlated with sentence complexity. We normalize the score with the mean and variance calculated from the training data. For complex sentences, $r_c$ is equal to the normalized FKGL, while for simple sentences, $r_c = 1 - \mathrm{FKGL}$, because the model is encouraged to generate sentences with low complexity.
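One possible way to compute the three reward components is sketched below. Several choices here are assumptions rather than details given in the text: fluency is taken as inverse perplexity (the exponential of the mean token log-probability), the embedding weights follow the smooth-inverse-frequency form $a / (a + p(w))$, syllables are counted with a rough vowel-group heuristic, and the normalized FKGL is squashed into $[0, 1]$ with a sigmoid of the z-score. All function names are hypothetical.

```python
import math
import re
from typing import Dict, List, Sequence


def fluency_reward(token_logprobs: Sequence[float]) -> float:
    """Inverse perplexity, exp(mean log-probability), which lies in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def sentence_vector(
    tokens: List[str],
    emb: Dict[str, List[float]],
    word_prob: Dict[str, float],
    a: float = 1e-3,
) -> List[float]:
    """Weighted average of word embeddings; frequent words get smaller weights."""
    dim = len(next(iter(emb.values())))
    vec = [0.0] * dim
    known = [t for t in tokens if t in emb]
    for t in known:
        weight = a / (a + word_prob.get(t, 0.0))
        for d in range(dim):
            vec[d] += weight * emb[t][d]
    return [v / max(len(known), 1) for v in vec]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def count_syllables(word: str) -> int:
    """Rough heuristic: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fkgl(tokens: List[str]) -> float:
    """Flesch-Kincaid Grade Level of a single sentence."""
    words = [t for t in tokens if any(c.isalpha() for c in t)]
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    # One sentence, so words-per-sentence is simply len(words).
    return 0.39 * len(words) + 11.8 * syllables / len(words) - 15.59


def complexity_reward(score: float, mean: float, std: float, simple: bool) -> float:
    """Squash the z-scored FKGL into [0, 1]; invert it for simple sentences."""
    normalized = 1.0 / (1.0 + math.exp(-(score - mean) / std))
    return 1.0 - normalized if simple else normalized
```

The relevance reward would then be `cosine(sentence_vector(source, ...), sentence_vector(sampled, ...))`, with the language-model log-probabilities and the FKGL statistics supplied by the trained language models and the training data.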
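Regard $P_{c \to s}$ and $P_{s \to c}$, the translation models from complex to simple and from simple to complex, as translation policies. For a complex sentence $y$ and a simple sentence $x$, let $\tilde{S}(y)$ and $\tilde{C}(x)$ denote the simple and complex sentences obtained by sampling from the current policies. The total reward for a sampled sentence $z$ is calculated as:
$$R(z) = H\big(r_f(z),\, r_r(z),\, r_c(z)\big), \qquad R_s = R\big(\tilde{S}(y)\big), \qquad R_c = R\big(\tilde{C}(x)\big)$$
where $H$ is the harmonic mean and $r_f$, $r_r$, $r_c$ are the fluency, relevance, and complexity rewards. Compared with the arithmetic mean, the harmonic mean optimizes these metrics more evenly. To reduce variance, the sentences obtained by greedy decoding, $\hat{S}(y)$ and $\hat{C}(x)$, are used as baselines during training:
$$\bar{R}_s = R\big(\tilde{S}(y)\big) - R\big(\hat{S}(y)\big), \qquad \bar{R}_c = R\big(\tilde{C}(x)\big) - R\big(\hat{C}(x)\big)$$
The loss function is the sum of the negative expected rewards for the sampled sentences $\tilde{S}(y)$ and $\tilde{C}(x)$:
$$\mathcal{L}_{pg} = -\mathbb{E}_{\tilde{S} \sim P_{c \to s}}\big[\bar{R}_s\big] - \mathbb{E}_{\tilde{C} \sim P_{s \to c}}\big[\bar{R}_c\big]$$
To optimize this objective, we estimate the gradient with the REINFORCE algorithm:
$$\nabla \mathcal{L}_{pg} \approx -\bar{R}_s \nabla \log P_{c \to s}\big(\tilde{S}(y) \mid y\big) - \bar{R}_c \nabla \log P_{s \to c}\big(\tilde{C}(x) \mid x\big)$$
The final loss is a weighted sum of the cross-entropy loss and the policy gradient loss:
$$\mathcal{L} = (1 - \gamma)\,\mathcal{L}_{ce} + \gamma\,\mathcal{L}_{pg}$$
where $\gamma$ is the parameter that balances the two losses. The complete training process is described in Algorithm 1.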
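The reward combination and loss weighting described above can be illustrated with plain numbers. This is a sketch: the function names are hypothetical, the form `(1 - gamma) * ce + gamma * pg` is one reading of "weighted sum" that agrees with the balance parameter starting at zero, and the per-sample surrogate loss is the standard one whose gradient is the REINFORCE estimate with a greedy-decoding baseline.

```python
from typing import Sequence


def harmonic_mean(values: Sequence[float]) -> float:
    """Harmonic mean; returns 0 if any component is non-positive."""
    if any(v <= 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def total_reward(fluency: float, relevance: float, complexity: float) -> float:
    """Combine the three reward components with the harmonic mean."""
    return harmonic_mean([fluency, relevance, complexity])


def policy_gradient_loss(
    sample_logprob: float, sampled_reward: float, greedy_reward: float
) -> float:
    """Surrogate loss for one sample: -(R(sampled) - R(greedy)) * log P(sampled)."""
    advantage = sampled_reward - greedy_reward
    return -advantage * sample_logprob


def combined_loss(ce_loss: float, pg_loss: float, gamma: float) -> float:
    """Weighted sum of the cross-entropy and policy gradient losses."""
    return (1.0 - gamma) * ce_loss + gamma * pg_loss


# One weak component drags the harmonic mean down far more than the arithmetic mean.
balanced = total_reward(0.6, 0.6, 0.6)
lopsided = total_reward(0.9, 0.9, 0.1)
```

Here `lopsided` is about 0.25 although the arithmetic mean of its components is about 0.63, which is the sense in which the harmonic mean treats the three metrics more evenly: a sentence cannot score well by excelling on two metrics while failing the third.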
We use the UNTS dataset to train our unsupervised model. The UNTS dataset is extracted from the English Wikipedia dump. It uses automatic metrics (mainly Flesch Reading Ease) to measure text complexity and divide the sentences into a complex part and a simple part. It contains 2M non-parallel sentences.
For semi-supervised training and evaluation, we also use two parallel datasets: WikiLarge and Newsela. WikiLarge comprises 359 test sentences, 2,000 development sentences, and 300k training sentences. Each source sentence in the test set has 8 simplified references. Newsela is a corpus extracted from news articles and simplified by professional editors; it is considered to be of higher quality and harder than WikiLarge. Following the settings of Zhang and Lapata (2017), we discard sentence pairs with adjacent complexity levels. The first 1,070 articles are used for training, the next 30 articles for development, and the rest for testing.
Our model is built upon the Transformer. Both the encoder and the decoders have 3 layers with 8 attention heads. To reduce the vocabulary size and limit the frequency of unknown words, we split words into sub-word units with byte-pair encoding (BPE). The sub-word embeddings are 512-dimensional vectors pre-trained on the entire data with FastText. During training, we use the Adam optimizer with the first momentum coefficient set to 0.5 and a batch size of 16. For reinforcement training, we dynamically adjust the balance parameter $\gamma$. At the start of training, $\gamma$ is set to zero, which helps the model converge rapidly and shrinks the search space. As training progresses, $\gamma$ is gradually increased and finally converges to 0.9. We use the sigmoid function to implement this schedule.
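The exact form of the schedule is not given, so the following is only one way a sigmoid could take the balance parameter from zero to 0.9. The rescaling `2 * sigmoid - 1`, the `steepness` value, and the function name are assumptions made for this sketch.

```python
import math


def gamma_schedule(
    step: int, total_steps: int, gamma_max: float = 0.9, steepness: float = 10.0
) -> float:
    """Sigmoid-shaped schedule that starts at exactly 0 and approaches gamma_max."""
    progress = step / total_steps
    sigmoid = 1.0 / (1.0 + math.exp(-steepness * progress))  # in [0.5, 1)
    # Rescale [0.5, 1) to [0, 1) so that training starts with pure cross-entropy.
    return gamma_max * (2.0 * sigmoid - 1.0)


schedule = [gamma_schedule(s, 1000) for s in range(0, 1001, 100)]
```

With these settings the weight on the policy gradient loss rises quickly in the first third of training and is within 0.001 of 0.9 by the end.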
The system is trained in both unsupervised and semi-supervised manners. We pre-train the asymmetric denoising autoencoders for 200,000 steps with a learning rate of 1e-4. After that, we add back-translation training with a learning rate of 5e-5. For semi-supervised training, we randomly select 10% of the data from the corresponding parallel corpus, and the model is trained by alternating between denoising autoencoding, back-translation, and parallel sentences.
Metrics and Model Selection
Following previous studies, we use corpus-level SARI as our main metric. SARI measures whether a system output correctly keeps, deletes, and adds content relative to the complex sentence. It calculates the n-gram overlap for these three aspects between system outputs and reference sentences. SARI is the arithmetic mean of the F1-scores of the three rewrite operations. (For corpus-level SARI, the original script provided by Xu et al. (2016) supports only the 8-reference WikiLarge dataset, a fact we confirmed with the author. In our experiments, we therefore use the original script for the WikiLarge corpus and our own script for the 1-reference Newsela corpus. Several previous works misused the original script on the 1-reference dataset, which may lead to very low scores.) We also use BLEU as an auxiliary metric. Although BLEU is reported to correlate negatively with simplicity, it often correlates positively with grammaticality and adequacy, which helps us evaluate different systems more comprehensively.
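As a quick check of the "arithmetic mean of three operation scores" description, the snippet below recomputes SARI from its components. The numbers are taken from the ablation row labelled original (drop & shuffle) reported later in this paper; which column corresponds to which operation is not assumed here, and the function name is illustrative.

```python
def sari_from_components(score_a: float, score_b: float, score_c: float) -> float:
    """SARI as the arithmetic mean of its three operation scores."""
    return (score_a + score_b + score_c) / 3.0


# Component scores from the two halves of the ablation row.
first = sari_from_components(38.60, 59.28, 3.01)
second = sari_from_components(70.74, 34.97, 2.03)
```

Rounded to two decimals, these give 33.63 and 35.91, matching the SARI values listed in the same row.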
For model selection, we mainly use SARI to tune our model. However, SARI rewards deletion, so outputs that differ greatly from the source may obtain a good SARI score even when they are ungrammatical or irrelevant. To tackle this problem, we introduce a BLEU score threshold, similar to Vu et al. (2018): epochs with a BLEU score lower than the threshold are ignored. We set the threshold to 20 on the Newsela dataset and 70 on the WikiLarge dataset.
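The selection rule amounts to a filtered argmax, as in this small sketch. The function name and the development-set scores are invented for illustration.

```python
from typing import List, Optional, Tuple

EpochScore = Tuple[int, float, float]  # (epoch, SARI, BLEU) on the development set


def select_epoch(scores: List[EpochScore], bleu_threshold: float) -> Optional[int]:
    """Return the epoch with the best SARI among those meeting the BLEU threshold."""
    eligible = [s for s in scores if s[2] >= bleu_threshold]
    if not eligible:
        return None
    return max(eligible, key=lambda s: s[1])[0]


# Invented scores: epoch 2 has the best SARI but its BLEU is below the threshold.
dev_scores = [(1, 30.0, 25.0), (2, 38.0, 12.0), (3, 34.0, 22.0)]
best = select_epoch(dev_scores, bleu_threshold=20.0)
```

In this example epoch 3 is selected, since epoch 2 is ignored despite its higher SARI.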
Comparing Systems and Model Variants
We compare our system with several baselines. For unsupervised models, we consider UNTS, a neural encoder-decoder model based on adversarial training, and a rule-based lexical simplification system called LIGHT-LS. Multiple supervised systems are also used as baselines, including Hybrid, PBMT-R, and DRESS (the system outputs of PBMT-R, Hybrid, and DRESS are publicly available). We also train a Seq2Seq model based on the vanilla Transformer.
Using our approach, we propose three variants for the experiments: (1) a basic back-translation-based unsupervised TS model (BTTS); (2) a model integrated with reinforcement learning (BTTSRL); (3) semi-supervised models, with limited supervision using 10% of the labelled data (BTTS+10%) and with full supervision using all labelled data (BTTS+full).
In this section, we present comparison results from both standard automatic evaluation and human evaluation. We also analyze the effects of different noise types in back-translation with an ablation study.
[Table 3: automatic evaluation results on Newsela and WikiLarge. Columns: SARI, components of SARI, BLEU.]
We report the results in Table 3. Among unsupervised systems, our model outperforms the previous unsupervised baselines on both datasets. Compared with LIGHT-LS and UNTS, our model achieves large improvements (+9.28 and +3.68 SARI) on the Newsela dataset. On the WikiLarge dataset, our model still outperforms LIGHT-LS and obtains SARI results similar to UNTS, but achieves a higher BLEU score. This means that our model generates more fluent outputs that are more relevant to the source sentences. Furthermore, our unsupervised models perform close to the state-of-the-art supervised systems. The results also show that reinforcement training helps unsupervised systems: it brings a 0.62 SARI improvement on Newsela and 0.08 on WikiLarge.
In addition, the results of the semi-supervised systems show that our model benefits greatly from small amounts of parallel data. A model trained with 10% of the parallel sentences performs competitively with state-of-the-art supervised systems on both datasets. With more parallel data, all metrics improve further on the Newsela corpus. The semi-supervised model trained with the full set of parallel sentences significantly outperforms state-of-the-art TS models such as DRESS-LS (+1.03 SARI). On the WikiLarge dataset, the BLEU score improves by 3.7 points with the full parallel data, but we observe no improvement in SARI. This may be because the simplified sentences in WikiLarge are often too close to the source sentences, or not simpler at all. This defect may encourage the system to copy directly from the source sentences, which causes a decline in the SARI score.
Both the unsupervised and the semi-supervised models achieve larger improvements on the Newsela dataset, showing that by leveraging large amounts of unpaired data, our models can learn simplification better on harder and smaller datasets.
We use Student's t-test to perform significance tests.
|original(drop & shuffle)||33.63||38.60||59.28||3.01||35.91||70.74||34.97||2.03|
Due to the limitations of automatic metrics, we also conduct human evaluation on the two datasets. We randomly select 200 sentences generated by our systems and the baselines as test samples. Similar to previous work, we ask native English speakers to evaluate the fluency, adequacy, and simplicity of the test samples via Amazon Mechanical Turk. The three aspects are rated on a 5-point Likert scale. We use our semi-supervised model for the human evaluation. The results are shown in Table 4.
On the Newsela dataset, our model obtains results comparable to DRESS and substantially outperforms Hybrid and the fully supervised sequence-to-sequence model. Although the sequence-to-sequence model obtains promising SARI scores (see Table 3), it performs worst on adequacy and rather poorly on fluency. This also indicates that SARI correlates only weakly with judgments of fluency and adequacy. We obtain similar results on the WikiLarge dataset, where our model achieves the highest adequacy score.
We perform an ablation study to analyze the effects of the denoising type on simplification performance. We test three types of noise:
(a) Original noise in machine translation, including word dropout and shuffling (denoted as original).
(b) Original noise plus additive noise on simple sentences.
(c) Substitution noise introduced on top of (b), which is our proposed noise type above.
Note that denoising autoencoders with different noise types may converge at different rates. For a fair comparison, we pre-train these autoencoders for different numbers of steps until they reach a similar training loss. In our experiments, we pre-train for 20,000 steps with noise type (a), 50,000 steps with noise type (b), and 200,000 steps with noise type (c). Figure 3 shows how SARI on the development set changes with the back-translation epoch in semi-supervised training. The model with only word dropout and shuffling remains at low scores throughout training, while our proposed model improves significantly.
Furthermore, we analyze the SARI score in detail. Table 5 shows the SARI score and its components under different types of noise. Additive noise on simple sentences significantly improves the delete and add operations. Substitution has a similar effect and brings a further improvement. The model with original noise tends to copy directly from the source sentence, resulting in a relatively higher F-score for the keep operation but much lower scores on the other aspects.
In this paper, we adopt the back-translation architecture to perform unsupervised and semi-supervised text simplification. We propose a novel asymmetric denoising autoencoder that models the simple and complex corpora separately, which helps the system learn structures and features from sentences of different complexity. The ablation study demonstrates that our proposed noise types significantly improve system performance compared with the basic denoising method. We also integrate reinforcement learning and achieve better SARI scores with unsupervised models. Automatic evaluation and human judgment show that, with limited supervision, our model performs competitively with multiple fully supervised systems. We also find that the automatic metrics do not correlate well with human evaluation. We plan to investigate better methods in future work.
This work has been supported by the National Key Research and Development Program of China (Grant No.2017YFB1002102) and the China NSFC projects (No.61573241). We thank the anonymous reviewers for their thoughtful comments and efforts towards improving this manuscript.
- [1] (2017) A simple but tough-to-beat baseline for sentence embeddings. In ICLR.
- [2] (2018) Unsupervised neural machine translation. In ICLR.
- [3] (2017) Enriching word vectors with subword information. TACL 5, pp. 135–146.
- [4] (2019) Semantic parsing with dual learning. In ACL.
- [5] (1996) Motivations and methods for text simplification. In COLING, pp. 1041–1044.
- [6] (2010) Text simplification for children. In SIGIR, pp. 19–26.
- [7] Unsupervised natural language generation with denoising autoencoders. In EMNLP, pp. 3922–3929.
- [8] (2013) PPDB: the paraphrase database. In NAACL, pp. 758–764.
- [9] (2015) Simplifying lexical simplification: do we need simplified corpora?. In ACL, pp. 63–68.
- [10] (2018) Dynamic multi-level multi-task learning for sentence simplification. In COLING, pp. 462–476.
- [11] BioSimplify: an open source sentence simplification engine to improve recall in automatic biomedical information extraction. In AMIA Annual Symposium Proceedings, pp. 351.
- [12] (1975) Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel.
- [13] (2015) Adam: A method for stochastic optimization. In ICLR.
- [14] (2018) Unsupervised machine translation using monolingual corpora only. In ICLR.
- [15] (2018) Phrase-based & neural unsupervised machine translation. In EMNLP, pp. 5039–5049.
- [16] (2010) Recurrent neural network based language model. In INTERSPEECH.
- [17] (2014) Hybrid simplification using deep semantics and machine translation. In ACL, pp. 435–445.
- [18] (2016) Unsupervised sentence simplification using deep semantics. In INLG, pp. 111–120.
- [19] (2017) Exploring neural text simplification models. In ACL, Vancouver, Canada, pp. 85–91.
- [20] (2016) Unsupervised lexical simplification for non-native speakers. In AAAI.
- [21] (2016) Simple PPDB: A paraphrase database for simplification. In ACL.
- [22] (2016) Improving neural machine translation models with monolingual data. In ACL.
- [23] (2016) Neural machine translation of rare words with subword units. In ACL.
- [24] (2018) BLEU is not suitable for the evaluation of text simplification. In EMNLP.
- [25] (2018) Unsupervised neural text simplification. CoRR.
- [26] (2017) Attention is all you need. In NeurIPS, pp. 5998–6008.
- [27] (2008) Extracting and composing robust features with denoising autoencoders. In ICML, ACM International Conference Proceeding Series, pp. 1096–1103.
- [28] (2018) Sentence simplification with memory-augmented neural networks. In NAACL, New Orleans, Louisiana, pp. 79–85.
- [29] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256.
- [30] (2011) Learning to simplify sentences with quasi-synchronous grammar and integer programming. In EMNLP, pp. 409–420.
- [31] (2012) Sentence simplification by monolingual machine translation. In ACL, pp. 1015–1024.
- [32] (2015) Problems in current text simplification research: new data can help. TACL 3, pp. 283–297.
- [33] (2016) Optimizing statistical machine translation for text simplification. TACL 4, pp. 401–415.
- [34] (2017) Sentence simplification with deep reinforcement learning. In EMNLP, Copenhagen, Denmark, pp. 584–594.
- [35] (2019) Data augmentation with atomic templates for spoken language understanding. In Proceedings of EMNLP-IJCNLP, pp. 3628–3634.