An Experimental Study of LSTM Encoder-Decoder Model for Text Simplification

by Tong Wang, et al.
University of Massachusetts-Boston

Text simplification (TS) aims to reduce the lexical and structural complexity of a text while retaining its semantic meaning. Current automatic TS techniques are limited to either lexical-level substitutions or large sets of manually defined rules. Since deep neural networks are powerful models that have achieved excellent performance on many difficult tasks, in this paper we propose to use the Long Short-Term Memory (LSTM) Encoder-Decoder model for sentence-level TS, which makes minimal assumptions about word sequence. Our preliminary experiments show that the model is able to learn operation rules such as reversing, sorting, and replacing from sequence pairs, which suggests that the model may potentially discover and apply rules such as modifying sentence structure, substituting words, and removing words for TS.




1 Introduction

Text Simplification (TS) aims to simplify the lexical, grammatical, or structural complexity of text while retaining its semantic meaning. It can help various groups of people, including children, non-native speakers, and people with cognitive disabilities, to understand text better. The field of automatic text simplification has been researched for decades. It is generally divided into three categories: lexical simplification (LS), rule-based, and machine translation (MT) [1].

LS is mainly used to simplify text by substituting infrequent and difficult words with frequent and easier words. However, challenges exist for the LS approach. First, a great number of transformation rules are required for reasonable coverage; second, different transformation rules must be applied depending on the specific context; third, the syntax and semantic meaning of the sentence are hard to retain. Rule-based approaches use hand-crafted rules for lexical and syntactic simplification, for example, substituting difficult words from a predefined vocabulary. However, such approaches require substantial human involvement to manually define these rules, and it is impossible to enumerate all possible simplification rules. The MT-based approach regards original English and simplified English as two different languages, so TS becomes the task of translating ordinary English into simplified English. Neural Machine Translation (NMT) is a newly proposed deep learning approach that achieves very impressive results [2, 3, 4]. Unlike traditional phrase-based MT systems, which operate on small components separately, NMT systems attempt to build a single large neural network in which every component is tuned based on the training sentence pairs.

NMT models are a type of Encoder-Decoder model, which represents the input sequence as a vector and then decodes that vector into an output sequence. In this paper, we propose to apply the Long Short-Term Memory (LSTM) [5] Encoder-Decoder to the TS task. We show that the LSTM Encoder-Decoder model is able to learn operation rules such as reversing, sorting, and replacing from sequence pairs; these are analogous to simplification rules that change sentence structure, substitute words, and remove words, so the model is potentially able to learn simplification rules. Our experiments show that the trained model achieves high accuracy on reversal, sorting, and sequence replacement, and that the word embeddings it learns are close to the words' real meanings.

2 Related Work

Automatic TS is a complicated natural language processing (NLP) task; it consists of lexical and syntactic simplification levels. Usually, hand-crafted, supervised, and unsupervised methods based on resources like English Wikipedia (EW) and Simple English Wikipedia (SEW) [6] are utilized for extracting simplification rules. It is very easy to mix up the automatic TS task and the automatic summarization task [7, 8]. TS is different from text summarization, whose focus is to reduce the length and redundant content of a text.

At the lexical level, [9] proposed a lexical simplification system that requires only a large corpus of regular text to obtain word embeddings and identify words similar to a given complex word. [10] proposed an unsupervised method for learning pairs of complex and simpler synonyms, together with a context-aware method for substituting one for the other. At the sentence level, [11] proposed a sentence simplification model using tree transformation based on Statistical Machine Translation (SMT). [12] presented a data-driven model based on a quasi-synchronous grammar, a formalism that can naturally capture structural mismatches and complex rewrite operations.

A limitation of the aforementioned methods is that they require syntactic parsing or hand-crafted rules to simplify sentences. Compared with traditional machine learning [13, 14] and data mining techniques [15, 16, 17], deep learning has been shown to produce state-of-the-art results on various difficult tasks, aided by the development of big data platforms [18, 19]. The RNN Encoder-Decoder is a very popular deep neural network model that performs exceptionally well on the machine translation task [2, 4, 3]. [1] proposed preliminary work using an RNN Encoder-Decoder model for the text simplification task, which is similar to the model proposed in this paper.

3 The Model

In this section, we first briefly introduce the basic idea of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM), then describe the LSTM Encoder-Decoder model.

3.1 Recurrent Neural Network and Long Short-Term Memory

An RNN is a class of neural network in which internal units may form a directed cycle, allowing the network to maintain a state summarizing the history of previous inputs. This structure makes RNNs naturally suited to variable-length inputs such as sentences. For sequence data $x = (x_1, \ldots, x_T)$, at each time step $t$ the hidden state $h_t$ of the RNN is updated via

$h_t = f(h_{t-1}, x_t)$,

where $f$ is the activation function. However, basic RNN models are difficult to optimize because their gradients vanish over long sequences. LSTM is very good at learning long-range dependencies through its internal memory cells. Like an RNN, an LSTM updates its hidden state sequentially, but the updates depend heavily on memory cells controlled by three kinds of gates: the forget gate decides how much remembered information to forget, the update gate decides how to update remembered information, and the output gate decides how much of the remembered information to output.
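In standard notation [5], the gate and cell updates described above can be written as:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(update/input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

where $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication.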


Recent works have proposed many modified LSTM models, such as the gated recurrent unit (GRU) [3]. However, [20] showed that none of the LSTM variants improves significantly upon the standard architecture. In this paper, we therefore use the standard LSTM structure in our model.

Figure 1: LSTM Encoder-Decoder Model

3.2 LSTM Encoder-Decoder Model

Given a source sentence $x = (x_1, \ldots, x_m)$ and a target (simplified) sentence $y = (y_1, \ldots, y_n)$, where the words of $x$ and $y$ are in the same vocabulary and $m$ and $n$ are the lengths of the two sentences, our goal is to build a neural network that models the conditional probability $p(y \mid x)$, then train the model to maximize this probability.
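Following standard sequence-to-sequence formulations [4], this conditional probability factorizes autoregressively, with the encoder summarizing the source sentence into a fixed-length vector $v$:

```latex
p(y \mid x) = \prod_{t=1}^{n} p\left(y_t \mid v, \, y_1, \ldots, y_{t-1}\right)
```

Each factor is produced by the decoder's softmax layer over the vocabulary.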

We show our LSTM Encoder-Decoder model in Figure 1. The model takes a one-hot representation of each word in the sequence at the input layer and converts it to a 300-dimensional vector in the following embedding layer. We find that adding an embedding layer significantly improves performance when the vocabulary becomes large. We then feed the word embeddings through two LSTM layers and obtain a vector representation of the input sequence after all words have been read. Finally, we decode this vector into the output sequence through two LSTM layers and an output embedding layer.

For example, take the input sentence "Man with high intelligence" as a difficult sentence and the output sentence "a very smart man" as its simplification. Representing the pair of sentences as a pair of word-index sequences (with made-up indices), we have:

  • [Man, with, high, intelligence] → [A, very, smart, man]

We need only apply sorting, reversing, and replacement to the indices to simplify the sentence: sorting and reversing correspond closely to changing the structure of a sentence or simplifying its grammar, while replacement corresponds closely to lexical simplification or removing redundant words. Motivated by this observation, we conduct experiments to show that the LSTM Encoder-Decoder is able to learn these three rules automatically, and thus can potentially perform text simplification.

4 Experiments

In this section, we conduct experiments to show LSTM Encoder-Decoder can perform basic operations for sequence data. Intuitively, TS should include operations like replacing difficult words with easier words, removing redundant words, simplifying syntax structure by changing word order, etc. In the following experiments, we show that a very basic LSTM Encoder-Decoder model is able to reverse, sort, and replace the elements of a sequence.

We implement the LSTM Encoder-Decoder in Keras [21]. The model contains two LSTM layers for both the encoder and the decoder, and the output is fed into a softmax layer. RMSprop [22], which generates its parameter updates using momentum on the rescaled gradient, was used as the optimizer in our experiments since it achieved the best performance compared to other optimization methods. We used early stopping with patience 5 to avoid over-fitting.
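The setup described above can be sketched in a few lines of Keras. This is a sketch under our assumptions: the layer arrangement follows Figure 1 and the text, but the loss choice and exact layer arguments are ours, not the authors' released code.

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense
from keras.callbacks import EarlyStopping

VOCAB, MAXLEN, HIDDEN = 100, 25, 256   # vocabulary size, sequence length, LSTM units

model = Sequential([
    Embedding(VOCAB, 300),                       # word index -> 300-d embedding
    LSTM(HIDDEN, return_sequences=True),         # encoder layer 1
    LSTM(HIDDEN),                                # encoder layer 2 -> sequence vector
    RepeatVector(MAXLEN),                        # feed the vector to every decoder step
    LSTM(HIDDEN, return_sequences=True),         # decoder layer 1
    LSTM(HIDDEN, return_sequences=True),         # decoder layer 2
    TimeDistributed(Dense(VOCAB, activation="softmax")),  # per-step word distribution
])
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")

# stop training once validation loss has not improved for 5 epochs
early = EarlyStopping(monitor="val_loss", patience=5)
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[early])
```

The `RepeatVector` layer is one common way to hand the encoder's fixed-length vector to every decoder time step; the original implementation may have wired the decoder differently.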

We generate sequences of 25 random integers as inputs, since sentences usually contain fewer than 25 words. These integers are the indices of words in the vocabulary; we use three different vocabulary sizes (10, 100, and 1000) in our experiments. For the target outputs, we reverse, sort, and replace words in the input sequence to simulate changing sentence structure, replacing words, and removing words. The results show that the LSTM Encoder-Decoder is able to learn the reversing, sorting, and replacement rules from the provided data, and thus has the potential to simplify a complex text. We use short examples below.

For illustration, with vocabulary size 10 (so the top 20 percent of words are indices 0 and 1):

  • Reverse: [7, 2, 9, 4] → [4, 9, 2, 7]

  • Sort: [7, 2, 9, 4] → [2, 4, 7, 9]

  • Replace: [7, 2, 9, 4] → [1, 0, 1, 0]

  • Combine: [7, 2, 9, 4] → [1, 1, 0, 0] (replace, then sort, then reverse)
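The four target constructions can be generated in a few lines of plain Python. The 20-percent cutoff for the replace rule follows Section 4.3; treating the "top" words as the lowest indices, and modulo as the replacement map (as in Section 4.4), are our assumptions.

```python
import random

def make_pair(vocab_size=100, length=25, op="reverse"):
    """Generate one (input, target) sequence pair for the given operation."""
    x = [random.randrange(vocab_size) for _ in range(length)]
    k = max(1, vocab_size // 5)              # top 20 percent of the vocabulary
    if op == "reverse":
        y = x[::-1]
    elif op == "sort":
        y = sorted(x)
    elif op == "replace":
        y = [i % k for i in x]               # map every word onto a "simple" word
    elif op == "combine":                    # replace, then sort, then reverse
        y = sorted(i % k for i in x)[::-1]
    else:
        raise ValueError(f"unknown op: {op}")
    return x, y
```

Training a model on many such pairs then amounts to asking it to recover the generating rule from data alone.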

4.1 Reverse

We first conduct experiments to show that the LSTM Encoder-Decoder can reverse a sequence after training on a large set of sequence pairs $(x, y)$, where $y$ is $x$ in reverse order, i.e., $y_i = x_{T-i+1}$. The results are given in Table 1.

In the tables, V, H, and E represent the vocabulary size, the number of hidden neurons in the LSTM layers, and the number of training epochs, respectively.

The size of our training data is extremely important for the model. As shown in Table 1, performance decreases significantly if we reduce the training set from 135k to 9k examples. The size of the vocabulary also influences performance: a larger vocabulary requires more training data and more hidden neurons in the LSTM layers. Increasing the number of neurons in the LSTM layers from 128 to 256 produces a higher-capacity model that can be trained in fewer epochs to a higher accuracy.

In general, this model can reverse an input sequence with high accuracy (above 0.91 on the test set for every vocabulary size, given enough training data; see Table 1). This also shows that the LSTM is proficient at memorizing long-term dependencies.

V    H    E    Data (Train,Val,Test)  Accuracy (Train,Val,Test)
10   128  200  9k,1k,10k              0.9362, 0.8732, 0.8731
100  128  200  9k,1k,10k              0.3967, 0.1884, 0.1883
100  128  200  135k,15k,10k           0.9690, 0.9613, 0.9623
100  256  81   135k,15k,10k           0.9904, 0.9784, 0.9787
1000 256  133  135k,15k,10k           0.9410, 0.9151, 0.9155
Table 1: Reverse Sequence
V    H    E    Data (Train,Val,Test)  Accuracy (Train,Val,Test)
10   128  114  9k,1k,10k              0.9744, 0.9952, 0.9956
100  128  200  9k,1k,10k              0.6370, 0.5003, 0.4996
100  128  82   135k,15k,10k           0.9882, 0.9907, 0.9906
100  256  62   135k,15k,10k           0.9886, 0.9965, 0.9969
1000 256  127  135k,15k,10k           0.9069, 0.7958, 0.7971

Table 2: Sort Sequence
Figure 2: 1-dimensional and 2-dimensional PCA projection of number embedding

4.2 Sort

Even though the Neural Programmer-Interpreter (NPI) [23], a recent model, can represent and execute programs such as sorting and addition, the LSTM Encoder-Decoder is much simpler and more lightweight than NPI. In the following experiment, we show that the LSTM Encoder-Decoder is able to sort a sequence of integers.

The datasets consist of sequence pairs $(x, y)$, where $y$ contains the elements of $x$ in sorted order. We show the results in Table 2. As before, the sizes of the vocabulary, training data, and hidden layers influence performance. The model is harder to train when the vocabulary grows to 1000, but it can still learn the sorting rule with high accuracy if provided enough training data.

We extracted the weights of the embedding layer from the trained models with vocabulary sizes 10 and 100. Since the embedding of each word is a 300-dimensional vector, we use Principal Component Analysis (PCA) for dimensionality reduction and visualize the word embeddings in Figure 2. Interestingly, the learned embedding correctly represents the relationship between the words in both 1-dimensional and 2-dimensional space. Note that even though the inputs are integer "numbers", these numbers are actually just symbols, i.e., word indices that a priori should be mutually orthogonal and equidistant; the model does not know, for example, the order between 1 and 2 before training. The model successfully captures the meaning of, and relationships between, the words. Similarly, if we provide the model with difficult and simple English sentence pairs, the LSTM Encoder-Decoder may be able to learn how to reorder words to make a sentence simpler.
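The PCA projection can be reproduced with a few lines of NumPy. The embedding matrix below is a random stand-in for the trained weights, which we do not have; only the projection procedure is the point.

```python
import numpy as np

def pca_project(embeddings, k=2):
    """Project row vectors onto their top-k principal components."""
    X = embeddings - embeddings.mean(axis=0)          # center each dimension
    _, _, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt = principal axes
    return X @ Vt[:k].T                               # (n_words, k) coordinates

# stand-in for a trained (10 words x 300 dims) embedding matrix
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 300))
coords = pca_project(emb, k=2)                        # points to scatter-plot
```

Plotting `coords` for the real trained embeddings gives the layout shown in Figure 2.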

4.3 Replace

We next show that the LSTM Encoder-Decoder can replace words in a sequence. We construct sequence pairs $(x, y)$ in which only the top 20 percent of the words in the vocabulary are kept, and every other word in the input is replaced by one of these words in the output. Lexical simplification is similar: we can regard the top 20 percent of words in the vocabulary as simple and common words, and all other words as complex words. Likewise, we can regard the top 20 percent of words as meaningful and important, and the rest as redundant, so the task also simulates removing redundant words. The results are shown in Table 3. Compared with the sorting operation, the replacement operation is easier to train when the vocabulary is large or the amount of training data is small.

V    H    E    Data (Train,Val,Test)  Accuracy (Train,Val,Test)
10   128  180  9k,1k,10k              0.9635, 0.9172, 0.9150
100  128  200  9k,1k,10k              0.7392, 0.5472, 0.5488
100  128  61   135k,15k,10k           0.9927, 0.9911, 0.9912
100  256  40   135k,15k,10k           0.9974, 0.9997, 0.9975
1000 256  75   135k,15k,10k           0.9868, 0.9884, 0.9885

Table 3: Replace in Sequence
V    H    E    Data (Train,Val,Test)  Accuracy (Train,Val,Test)
10   128  21   9k,1k,10k              0.9150, 0.8662, 0.8660
100  128  130  9k,1k,10k              0.7570, 0.6670, 0.6599
100  128  122  135k,15k,10k           0.9909, 0.9985, 0.9982
100  256  21   135k,15k,10k           0.9897, 0.9987, 0.9988
1000 256  107  135k,15k,10k           0.9570, 0.9405, 0.9404
Table 4: Combine Three Operations

4.4 Combine Three Operations

We have shown that the LSTM Encoder-Decoder works well on the reversing, sorting, and replacement operations separately, but in reality a sentence is usually simplified by a complex combination of these three rules. Therefore, we combine the three operations to see whether the model can still discover the mapping rules between sequences.

The data thus consists of sequence pairs $(x, y)$, where $y$ is obtained from $x$ by first applying the modulo replacement to each index, then sorting, and then reversing. The results are shown in Table 4. The LSTM Encoder-Decoder continues to work very well, performing as well as on each operation alone. Therefore, the LSTM Encoder-Decoder can easily discover mapping patterns of combined operations between sequences, and thus may potentially find complicated simplification rules.

5 Conclusion and Future Work

In conclusion, we find that the LSTM Encoder-Decoder model is able to learn operation rules such as reversing, sorting, and replacing from sequence pairs, which shows that the model may potentially apply rules like modifying sentence structure, substituting words, and removing words for text simplification. This is a preliminary experimental study of solving the text simplification problem with deep neural networks. However, unlike for the machine translation task, very few text simplification training corpora are available online. Our future work therefore includes collecting complex and simple sentence pairs from online resources such as English Wikipedia and Simple English Wikipedia, and training our model on natural language.


  • [1] Tong Wang, Ping Chen, John Rochford, and Jipeng Qiang. Text simplification using neural machine translation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [3] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [4] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
  • [5] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [6] William Coster and David Kauchak. Simple english wikipedia: a new text simplification task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 665–669. Association for Computational Linguistics, 2011.
  • [7] Tong Wang, Ping Chen, and Dan Simovici. A new evaluation measure using compression dissimilarity on text summarization. Applied Intelligence, pages 1–8, 2016.
  • [8] Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.
  • [9] Goran Glavaš and Sanja Štajner. Simplifying lexical simplification: Do we need simplified corpora? page 63, 2015.
  • [10] Or Biran, Samuel Brody, and Noémie Elhadad. Putting it simply: a context-aware approach to lexical simplification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 496–501. Association for Computational Linguistics, 2011.
  • [11] Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd international conference on computational linguistics, pages 1353–1361. Association for Computational Linguistics, 2010.
  • [12] Kristian Woodsend and Mirella Lapata. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the conference on empirical methods in natural language processing, pages 409–420. Association for Computational Linguistics, 2011.
  • [13] Yahui Di. Prediction of long-lead heavy precipitation events aided by machine learning. In 2015 IEEE International Conference on Data Mining Workshop (ICDMW), pages 1496–1497. IEEE, 2015.
  • [14] Tong Wang, Vish Viswanath, and Ping Chen. Extended topic model for word dependency. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL-2015, Short Papers), volume 2, page 506, 2015.
  • [15] Dan A Simovici, Tong Wang, Ping Chen, and Dan Pletea. Compression and data mining. In 2015 International Conference on Computing, Networking and Communications (ICNC), pages 551–555. IEEE, 2015.
  • [16] Kaixun Hua and Dan A Simovici. Long-lead term precipitation forecasting by hierarchical clustering-based bayesian structural vector autoregression. In 2016 IEEE 13th International Conference on Networking, Sensing, and Control (ICNSC), pages 1–6. IEEE, 2016.
  • [17] Yong Zhuang, Kui Yu, Dawei Wang, and Wei Ding. An evaluation of big data analytics in feature selection for long-lead extreme floods forecasting. In 2016 IEEE 13th International Conference on Networking, Sensing, and Control (ICNSC), pages 1–6. IEEE, 2016.
  • [18] Jiayin Wang, Yi Yao, Ying Mao, Bo Sheng, and Ningfang Mi. Fresh: Fair and efficient slot configuration and scheduling for hadoop clusters. In Cloud Computing (CLOUD), 2014 IEEE 7th International Conference on, pages 761–768. IEEE, 2014.
  • [19] Yi Ren, Jun Suzuki, Athanasios Vasilakos, Shingo Omura, and Katsuya Oba. Cielo: An evolutionary game theoretic framework for virtual machine placement in clouds. In Future Internet of Things and Cloud (FiCloud), 2014 International Conference on, pages 1–8. IEEE, 2014.
  • [20] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. Lstm: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.
  • [21] François Chollet. Keras, 2015.
  • [22] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4:2, 2012.
  • [23] Scott Reed and Nando de Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.