
ReDecode Framework for Iterative Improvement in Paraphrase Generation

by Milan Aggarwal, et al.

Generating paraphrases, that is, different variations of a sentence conveying the same meaning, is an important yet challenging task in NLP. Automatically generating paraphrases is useful in many NLP tasks such as question answering, information retrieval, and conversational systems. In this paper, we introduce iterative refinement of generated paraphrases within a VAE based generation framework. Current sequence generation models lack the capability to (1) make improvements once the sentence is generated, and (2) rectify errors made while decoding. We propose a technique to iteratively refine the output using multiple decoders, each one attending on the output sentence generated by the previous decoder. We improve current state-of-the-art results significantly on the Quora question pairs and MSCOCO datasets. We also show qualitatively, through examples, that our re-decoding approach generates better paraphrases than a single decoder by rectifying errors and making improvements in paraphrase structure, inducing variations, and introducing new but semantically coherent information.




Paraphrases refer to texts that express the same meaning in different ways. For example, "Can time travel ever be possible?" and "Is time travel a possibility?" are paraphrases of each other. Human conversations typically involve a high level of paraphrasing to express similar intent, but comprehending such sentences as semantically similar and generating them is a difficult task for a machine. Automatic paraphrase generation is an important task in NLP with practical significance in many text-to-text generation tasks such as question answering, conversational systems, information retrieval, and summarization. Knowledge-based QA systems are highly sensitive to the way a question is asked; using paraphrases of the asked question while ranking answers in the knowledge base improves system performance [Dong et al.2017]. Paraphrasing also helps incorporate variation into domain-specific conversational bots, which have a fixed set of responses, to prevent them from being repetitive. In query reformulation, paraphrasing has direct utility: in search engines, a paraphrase generation module can recommend different possible variations of the user query or directly show search results after incorporating the variations into the search process. In end-to-end conversational systems, training data can be augmented with paraphrases of available dialogues, which helps improve the semantic understanding capability of the system.

Early paraphrase generation systems used handcrafted rules [McKeown1983], relied on automatic extraction of paraphrase patterns from available parallel corpus data [Barzilay and Lee2003], or used knowledge bases like WordNet [Bolshakov and Gelbukh2004]. Statistical machine translation tools have also been applied to paraphrase generation [Quirk, Brockett, and Dolan2004]. These approaches are limited by their methodology and do not generalize well.

Recent advances in deep neural network based models for sequence generation have advanced the state of the art in various NLP tasks such as machine translation [Bahdanau, Cho, and Bengio2014] and question answering [Yin et al.2015]. For the task of paraphrase generation, Prakash et al. [2016] were the first to explore a sequence-to-sequence (Seq2Seq) [Sutskever, Vinyals, and Le2014] based neural network model, and proposed an improved variant: a stacked LSTM Seq2Seq network with residual connections.

In this paper, we present a framework for automatic paraphrase generation based on the variational autoencoder (VAE) [Kingma and Welling2013]. VAE is used extensively for generative tasks in the image domain and has been experimented with in the text domain [Bowman et al.2015] as well; the model usually consists of an LSTM RNN [Sundermeyer, Schlüter, and Ney2012] as encoder and decoder (VAE-LSTM) for processing sequential input. Unlike the traditional reconstruction task of VAE, paraphrasing involves generating outputs which differ in their expression but carry the same semantic meaning. To achieve this objective, Gupta et al. [2017] introduced a supervised variant (VAE-S) of VAE-LSTM in which the decoder is conditioned on a vector representation of the input sentence obtained through another RNN, instead of depending only on the latent representation. Our approach is based on supervised generative sequence modeling through VAE, where supervision is obtained through the decoder attending over the hidden states of an LSTM RNN that encodes the input sentence.

In this work, we introduce a methodology for iterative improvement of the output within the VAE-S framework, in contrast to previous sequence generation models that decode the output sequence only once. The concept is inspired by the idea that, given a crude paraphrase and the original sentence, the model should be able to generate a better quality paraphrase in the next iteration by rectifying errors and identifying regions of improvement, similar to what humans do. We achieve iterative improvement by having multiple decoders in the model, where each decoder except the first attends on the output of the previous decoder for supervision. We establish the effectiveness of this approach for paraphrase generation by showing significant improvements over the state of the art on standard metrics and benchmark paraphrase datasets. Our approach is applicable to any other domain that involves sequence generation, such as conversational systems and question answering; however, we do not explore its capabilities in other domains in this paper. Our contributions can be listed as:

  • We introduce an iterative improvement framework for the output using multiple decoders under VAE based generative model. The first decoder is conditioned on the input sentence encoding whereas further decoders are conditioned on the outputs generated by preceding decoders.

  • We improve the existing state of the art in paraphrase generation task by a significant margin using our above mentioned approach.

Related Work

Paraphrase generation has been modeled as a Seq2Seq learning problem from the input sentence to the target paraphrase. The first Seq2Seq neural network based approach was proposed by Prakash et al. [2016]: a stacked LSTM RNN model with residual connections, which the authors compared with other Seq2Seq variants including attention and bidirectional LSTM units. Cao et al. [2017] introduced a Seq2Seq model fusing two decoders, one a copying decoder and the other a restricted generative decoder, inspired by the human way of paraphrasing, which principally involves copying or rewriting. Gupta et al. [2017] introduced a VAE based model for paraphrase generation. VAE, as introduced by Kingma and Welling [2013], is a generative deep neural network model that maps the input to latent variables and decodes the latent variables to reconstruct the data. VAE is well suited to generating new data as it explicitly learns a probability distribution over the latent code, from which a sample is drawn for decoding. Gupta et al. [2017] condition the decoder on the input sentence and use the reference paraphrase along with the input sentence as input for generating the latent code, to obtain better quality paraphrases.

There has also been some work on improving paraphrase generation models inspired by machine translation. It has been shown that paraphrase pairs obtained using back-translated texts from bilingual machine translation corpora have data quality on par with manually written English paraphrase pairs [Wieting, Mallinson, and Gimpel2017]. There has also been work on syntactically controlled paraphrase generation, where the parse tree template of the paraphrase to be generated is given as additional input [Iyyer et al.2018].

Our work in paraphrase generation is similar to the approach of Gupta et al. [2017] in that our model is also based on VAE. The main difference lies in our methodology to iteratively improve the decoded output and our use of an attention mechanism to condition the decoder on the input sentence while training. We also introduce a specific loss term to promote the generation of varied paraphrases of a given sentence.

Figure 1: Architecture diagram of the iterative approach for the case of two decoders. The sampling encoder processes the input sentence, and its final output is used to obtain the mean and variance vectors through two fully connected layers, from which the latent code z is sampled. The sentence encoder produces output vectors as it processes the input sentence word by word. The first decoder attends on these output vectors, takes z as input, and outputs a paraphrase using the standard Seq2Seq technique. The second decoder attends on the softmax vectors produced by the first decoder and generates the final output. The dotted connections in the decoders across different time steps show that the output generated at time t is passed as input at time t+1 during inference, while inputs are pre-determined during training as per the teacher forcing technique [Williams and Zipser1989].


In this section, we explain our model architecture, which is based on VAE. We first give a brief overview of VAE and then explain our framework in detail.

Variational Autoencoder

Variational Autoencoder, as introduced by Kingma and Welling [2013], is a generative model that learns a posterior distribution over latent variables for generating output. Input data $x$ is mapped to a latent code $z$ from which $x$ can be reconstructed. It differs from traditional autoencoders in that, instead of learning a deterministic mapping to the latent code $z$, it learns a posterior distribution $q_\phi(z|x)$ from the data, starting with a prior $p(z)$. The posterior distribution is usually taken to be the diagonal Gaussian $\mathcal{N}(\mu, \sigma^2)$ and the prior as $\mathcal{N}(0, I)$ to facilitate stochastic back-propagation based training. The encoder can be a neural network with a feed forward layer at the end to estimate $\mu$ and $\sigma$ from $x$. The latent code $z$ is sampled from the normal distribution $\mathcal{N}(\mu, \sigma^2)$ and passed to the decoder as input. The decoder, which is also a neural network, learns the probability distribution $p_\theta(x|z)$ to reconstruct the input data from the latent code. The network is trained by maximizing the following objective function:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - KL\big(q_\phi(z|x)\,\|\,p(z)\big) \qquad (1)$$

Here $\phi$ and $\theta$ are the parameters of the encoder and decoder respectively, and $KL$ stands for KL divergence. The objective function maximizes the log likelihood of the reconstructed data from the posterior, and at the same time reduces the KL divergence between the prior and posterior distributions of the latent code $z$. This objective is a valid lower bound on the true log likelihood of the data, as shown by the authors; maximizing it therefore ensures that the total log likelihood of the data is maximized. The first term in equation 1 is maximized by minimizing the cross entropy error over the training dataset.

Since VAE learns the probability distribution $p_\theta(x|z)$, it is well suited to generative modeling tasks. For sequence generation in text, Bowman et al. [2015] proposed an RNN based variational autoencoder model. Both the encoder and decoder are LSTM RNNs, with a feed forward layer at the end of the encoder to estimate $\mu$ and $\sigma$. They introduce techniques like KL cost annealing and word dropout in the decoder for efficient learning. Gupta et al. [2017] improved upon this model in a supervised setting for paraphrase generation by conditioning the decoder on the input sentence encoding computed by a separate encoder and using $z$ at every time step of decoding as input. From now on we use VAE-S (S stands for supervision) to denote this model.
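For diagonal Gaussians, the KL term in equation 1 has a closed form, which is how it is typically computed in VAE training. A minimal sketch in plain Python (the function name is ours, not from the paper):

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian.

    mu and sigma are lists of per-dimension means and standard deviations.
    Each dimension contributes 0.5 * (mu^2 + sigma^2 - 1 - log sigma^2).
    """
    return sum(0.5 * (m * m + s * s - 1.0 - 2.0 * math.log(s))
               for m, s in zip(mu, sigma))
```

When the posterior matches the prior (mu = 0, sigma = 1) the KL term vanishes, which is the behavior the annealing schedule of Bowman et al. [2015] exploits.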

In our model as well, $z$ is concatenated with the word encoding as input to the decoder (as in the standard Seq2Seq technique), and the decoder is conditioned on the input sentence. Since paraphrase generation is subtly different from sentence reconstruction, using $z$ alone may not result in good paraphrases. We condition the decoder on the inputs using the well known attention mechanism [Luong, Pham, and Manning2015] while generating the paraphrases, to enable the model to learn phrase-level semantics. Attention mechanisms have been widely used in sequence tasks such as Recognizing Textual Entailment (RTE) [Rocktäschel et al.2015] and Machine Translation (MT) [Vaswani et al.2017].

We explain our attention based ReDecode model architecture in the next section.

Model Architecture

Training data consists of a sentence and its expected paraphrase. The input to the model is the sequence of vector encodings of the words in the sentence, for which we use pre-trained GloVe [Pennington, Socher, and Manning2014] embeddings instead of training word vectors from scratch. The architecture diagram of our model is shown in figure 1. It consists of a Sampling Encoder, a Sentence Encoder, and a sequence of decoders, each with its own parameters. Below we explain each module and the training strategy in detail.

Sampling Encoder

The sampling encoder is used to encode the original sentence for sampling the latent vector $z$. As shown in figure 1, it consists of a single layer LSTM RNN that sequentially processes the word embeddings of the original sentence and creates a vector representation of it. This representation is then passed through two separate fully connected layers to estimate the mean ($\mu$) and variance ($\sigma$). The final latent code $z$ is sampled from the $\mathcal{N}(\mu, \sigma^2)$ distribution.
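This sampling step uses the standard reparameterization trick of Kingma and Welling [2013], which keeps the operation differentiable with respect to $\mu$ and $\sigma$. A minimal sketch (plain Python; the function name is ours):

```python
import random

def sample_latent(mu, sigma, rng=random):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).

    Noise is sampled independently of the parameters, so gradients can
    flow through mu and sigma during training.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```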

Sentence Encoder

The sentence encoder computes a vector representation of the input sentence used for generating the output paraphrase in the decoding stage. It is a two layer stacked LSTM unit which sequentially processes the input sentence and generates a set of hidden vectors, one per time step of the input sequence. These hidden vectors are attended upon by the decoder. In the attention mechanism, given a sequence of vectors arranged along the columns of a memory matrix $M$, the decoder LSTM learns a context vector, derived as a weighted sum of the columns of $M$ as a function of its input and hidden state at time step $j$, and uses it for generating the output. The decoder thus learns to identify and focus on specific parts of the memory while generating words.
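One attention step of this kind can be sketched as follows (plain Python; the paper uses Luong-style attention inside an LSTM, so this toy omits the learned projection matrices):

```python
import math

def attend(query, memory):
    """Dot-product attention over a list of memory vectors.

    Scores each memory vector against the query, softmax-normalizes the
    scores into weights, and returns the weighted sum (context vector)
    together with the weights.
    """
    scores = [sum(q * m for q, m in zip(query, vec)) for vec in memory]
    mx = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * vec[i] for w, vec in zip(weights, memory))
               for i in range(len(memory[0]))]
    return context, weights
```

The weights form a distribution over memory positions; visualizing them per output word yields heatmaps like those in figures 2 and 3.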

Iterative Decoder

In the decoding stage, we propose to use multiple decoders $D_1, D_2, \dots, D_n$ to generate the output iteratively. While training, the input to each decoder at every time step is the sampled latent code $z$ concatenated with the encoding of the corresponding target word. During inference, the generated word is given as input to the next step of decoding, as in the standard Seq2Seq paradigm. Each decoder is a two layer stacked LSTM unit followed by a projection layer which outputs a likelihood distribution over the vocabulary. Decoder $D_i$ ($i > 1$) attends on the softmax vectors generated by $D_{i-1}$, whereas $D_1$ attends over the outputs generated by the Sentence Encoder. More formally, we iteratively generate a sequence of paraphrases such that

$$P_1 = D_1(z, H), \qquad P_i = D_i(z, C_i) \quad \text{for } i > 1$$

where $P_i$ is the sequence of words in the paraphrase generated by $D_i$, $H$ is the set of outputs generated by the Sentence Encoder, $S_{i-1}$ are the softmax vectors generated by the previous decoder $D_{i-1}$, and $C_i$ are the context vectors obtained by attending over those softmax vectors.

As shown in the experimental results section, $D_i$ ($i > 1$) iteratively improves the output generated by $D_{i-1}$. In a single decoder model, the output at time step $t$ is decided based only on the outputs at time steps before $t$. With multiple decoders, $D_i$ ($i > 1$) has information about the complete paraphrase generated by $D_{i-1}$. We hypothesize that the later decoders have a prior notion of the output to be generated at every time step; this enables them to rectify errors, modify the structure, and introduce useful variations.
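The control flow of this re-decoding scheme can be sketched with stand-in decoder functions (a toy illustration, not the paper's LSTM decoders; real decoders would attend over softmax vectors rather than token lists):

```python
def redecode(z, memory, decoders):
    """Chain decoders: the first conditions on the encoder memory, each
    later one conditions on the previous decoder's output (a stand-in
    for attending over its softmax vectors). Returns every intermediate
    paraphrase so refinements can be inspected.
    """
    outputs = []
    prev = memory
    for dec in decoders:
        out = dec(z, prev)
        outputs.append(out)
        prev = out            # the next decoder attends on this output
    return outputs

# Toy decoders: d1 emits a rough token sequence, d2 "fixes" one token,
# mirroring the kind of correction shown in table 3.
d1 = lambda z, mem: ["what", "is", "best", "time", "exercise"]
d2 = lambda z, out: ["when" if t == "what" else t for t in out]
```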

Training Technique

The training objective of our model is similar to the VAE objective function in equation 1. To increase the log likelihood of the paraphrases generated by all decoders, the average cross entropy (CE) of each decoder's output against the target paraphrase is minimized along with the KL divergence (KLD) loss. Thus our loss function is:

$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n} CE(P_i, P^{*}) + KL\big(q_\phi(z|x)\,\|\,p(z)\big)$$

where $P_i$ is the output of decoder $D_i$ and $P^{*}$ is the target paraphrase.


Also, in order to induce variations in the generated paraphrases, we conduct training by sampling three different latent vectors and generating the corresponding outputs. This is done by adding different Gaussian noises to the mean and variance vectors obtained for the input sentence and feeding the resulting samples to the decoder. We take the final state of the decoder after generating each output as the representation $h_i$ of the corresponding output sentence, and minimize the pairwise cosine similarity between them by adding the following to the loss function:

$$\mathcal{L}_{var} = CS(h_1, h_2) + CS(h_1, h_3) + CS(h_2, h_3)$$


where CS denotes cosine similarity. The objective is to tune the model such that different noises added to the mean and variance vectors while sampling $z$ result in diverse paraphrases that remain coherent with the input sentence. We now discuss the experiments conducted for the different model variations described above.
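The diversity penalty above can be sketched as follows (plain Python; plain lists stand in for the decoder final-state representations):

```python
import math

def pairwise_cosine_penalty(reps):
    """Sum of pairwise cosine similarities between output representations.

    Minimizing this pushes the paraphrases sampled from different latent
    vectors apart from each other in representation space.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    total = 0.0
    for i in range(len(reps)):
        for j in range(i + 1, len(reps)):
            total += cos(reps[i], reps[j])
    return total
```

Identical representations give a penalty of 1.0 per pair, while orthogonal ones contribute nothing, so the gradient discourages near-duplicate outputs.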



We present a qualitative and quantitative discussion of the results on two different datasets, Quora question pairs and MSCOCO, across different model variations. The Quora dataset comprises questions asked by users of the platform and consists of question pairs which are potential paraphrases of each other, as denoted by a binary 1-0 label provided for each pair. We use the pairs labeled 1 and discard the rest. The MSCOCO dataset comprises about 200k labeled images, each annotated with 5 captions which are potential paraphrases. We use the 2014 release of the dataset, which provides separate train and validation splits, in order to compare our results with previous baselines and work on paraphrase generation. We randomly select 4 of the 5 captions for each image and randomly divide them into 2 input-paraphrase sentence pairs. Before feeding the sentences to the model for training and inference, we preprocess them by removing punctuation and include only the pairs where both the input sentence and its paraphrase have length at most 15. Sentences shorter than 15 words are padded appropriately using a separate pad token. The number of sentence pairs on which the model is trained and validated after preprocessing is summarized in table 1.
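The preprocessing described above can be sketched as follows (plain Python; the pad token name and helper are ours):

```python
import string

MAX_LEN, PAD = 15, "<pad>"

def preprocess_pair(sent_a, sent_b):
    """Strip punctuation, keep only pairs where both sides fit in MAX_LEN
    tokens, and right-pad shorter sentences with a dedicated pad token.
    Returns None for pairs that are discarded.
    """
    def clean(s):
        return s.translate(str.maketrans("", "", string.punctuation)).lower().split()

    a, b = clean(sent_a), clean(sent_b)
    if len(a) > MAX_LEN or len(b) > MAX_LEN:
        return None                          # pair is discarded
    pad = lambda toks: toks + [PAD] * (MAX_LEN - len(toks))
    return pad(a), pad(b)
```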


Dataset # Training Samples # Testing Samples
Quora 87116 18773
MSCOCO 149438 73221
Table 1: Dataset Statistics
Approach | Quora (METEOR BLEU TER) | MSCOCO (METEOR BLEU TER)
Residual LSTM [Prakash et al.2016] NA NA NA 27.0 37.0 51.6
VAE-SVG [Gupta et al.2017] 32.0 37.1 40.8 30.9 41.3 40.8
VAE-S 22.75 16.55 73.63 12.11 4.44 88.94
VAE-REF 21.85 14.86 67.12 10.9 3.36 84.2
VAE-VAR 25.5 19.92 70.01 12.38 4.7 88.73
VAE-ITERDEC2 39.07 54.19 32.5 57.44 84.64 7.2
VAE-ITERVAR 39.71 54.95 30.45 59.88 87.71 5.8
VAE-ITERDEC3 41.95 61.23 26.86 53.01 77.84 11.85
Table 2: METEOR, BLEU and TER scores for different models on test sets of Quora and MSCOCO

Implementation Details

To train our model, we use pre-trained 300 dimensional GloVe embeddings to represent the input words in a sentence and keep them non-trainable. The sampling encoder uses a single layer LSTM with 600 units. The dimension of the mean and variance vectors is kept at 1100 through all the experiments, with a fixed batch size and learning rate. The sentence encoder and the decoders are two layer stacked LSTM cells with the number of units fixed at 600. We used the Adam optimizer [Kingma and Ba2014] for training the model parameters. This configuration is common across the different experimental settings.

Baseline and Evaluation Measures

We compare our model with the VAE-SVG model [Gupta et al.2017], the current state of the art on the benchmark datasets, and the Residual LSTM model [Prakash et al.2016]. We directly cite the scores reported in the respective papers. Unlike [Gupta et al.2017], we do not train the word embeddings. To make a fair comparison, we also implemented and trained the VAE-SVG model in our setting; we denote this model VAE-REF.

For quantitative evaluation of our model, we calculate scores on well known evaluation metrics from the machine translation domain: METEOR [Lavie and Agarwal2007], BLEU [Papineni et al.2002] and Translation Edit Rate (TER) [Snover et al.2006], computed using publicly available software. These scores have been shown to correlate well with human judgment, and Madnani et al. [2012] show that they also perform well for the task of paraphrase recognition. The BLEU score is based on weighted n-gram precision of the candidate paraphrase against the reference paraphrase. METEOR additionally uses stemming and synonymy detection while computing precision and recall. TER measures the edit distance between the reference and candidate sentences, so a lower TER is better.
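As a rough illustration of what two of these metrics measure, here is a simplified sketch (plain Python; `bleu2` caps at bigrams and omits smoothing and multi-reference handling, unlike the official BLEU, and `ter` ignores shift operations):

```python
import math
from collections import Counter

def ngram_precision(cand, ref, n):
    """Modified n-gram precision: clipped overlap / candidate n-gram count."""
    c = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(cnt, r[g]) for g, cnt in c.items())
    return overlap / max(sum(c.values()), 1)

def bleu2(cand, ref):
    """Toy BLEU: geometric mean of 1- and 2-gram precision with brevity penalty."""
    p1, p2 = ngram_precision(cand, ref, 1), ngram_precision(cand, ref, 2)
    if p1 == 0 or p2 == 0:
        return 0.0
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(0.5 * (math.log(p1) + math.log(p2)))

def ter(cand, ref):
    """Word-level edit distance normalized by reference length (lower is better)."""
    m, n = len(cand), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if cand[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n] / max(n, 1)
```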


In order to evaluate our approach, we experimented with the following variations of the model: (1) the basic VAE based generative model (VAE-S); (2) VAE-S with the reference paraphrase as an additional input (VAE-REF); (3) VAE with attention and the variation loss (VAE-VAR); (4) VAE with iterative decoding using 2 decoders and attention (VAE-ITERDEC2); (5) VAE comprising the variation loss and 2 decoders with attention (VAE-ITERVAR); and (6) VAE with 3 decoders and attention (VAE-ITERDEC3). Results for each of these models are summarized in table 2 for both the Quora and MSCOCO datasets. We report all results and improvements in absolute points.

Figure 2: Attention visualization examples from Quora dataset demonstrating parts of paraphrase generated by first decoder (x-axis) on which decoder 2 attends while generating its output (y-axis). Input to the model was - ‘what can substitute red wine in cooking’ (left) and ‘how do i start an export company or llc in new york city’ (right). It can be seen that second decoder corrects the output of the first one by attending on incorrect words while replacing them with a better phrase.
Input what are the top universities for computer
science in the world
Decoder 1 what are the best universities for computer
science in the world
Decoder 2 what are the best computer science colleges
Expected what are the best computer science schools
Input which is best time for exercise
Decoder 1 what is best time exercise
Decoder 2 when is the best time to exercise
Expected when is the best time to workout
Input what can substitute red wine in cooking
Decoder 1 what are the best sides in cooking
Decoder 2 what is a good substitute for red wine in
Expected what is a good replacement for red wine in
Input how do i start an export company or llc in
new york city
Decoder 1 how do i start preparing for donations in
new york
Decoder 2 how do i start new llc capital company in
new york
Expected how do i start an import/export llc in new
york city
Table 3: Few examples of paraphrases generated by our VAE-ITERDEC2 model on Quora dataset


As we can see, our proposed iterative decoding mechanism improves the scores by a huge margin compared to the VAE-S baseline. The improved scores are better than any previous work in paraphrase generation, with nearly 10 and 24 points absolute increase in METEOR and BLEU respectively over the previous best scores [Gupta et al.2017], thus establishing a new state of the art in this task. Our TER score is also lower than the previously established best score. Table 3 shows a comparison between paraphrases generated by the first decoder and the improvements made by the second decoder on a few example sentences.

In some cases, as in the first example in table 3, the output generated by the second decoder resembles the expected paraphrase more closely than that of the first decoder, which leads to a better score: the first decoder just replaces the word ‘top’ in the input sentence with ‘best’, while the second decoder changes the sentence structure by introducing the phrase ‘best computer science’, which also matches the expected paraphrase. Another observation is that the second decoder often makes the generated paraphrase grammatically correct and semantically more similar to the input sentence than the output of the first decoder, as in the third and last examples in table 3. Figure 2 shows attention heatmaps demonstrating the phrases in the output of the first decoder on which the second decoder focuses while generating the paraphrase. For the last example in table 3, figure 2 (right) shows that the second decoder attends on ‘start preparing for donations’ while replacing it with ‘start new llc’. Similarly, for the third example in table 3, the second decoder generates ‘good substitute’ while attending on ‘best sides’, as seen in figure 2 (left). Thus the second decoder focuses on mistakes in the previous output to make a guided decision while generating its output.

On adding the variation loss to the VAE-ITERDEC2 model (VAE-ITERVAR), TER reduces by about 2 points. We also extended the VAE-ITERDEC2 model (without the variation loss) with an additional decoder, resulting in 3 decoders (VAE-ITERDEC3), which further boosted METEOR to 41.95 and BLEU to 61.23 and reduced TER to 26.86.

Input a group of motorcyclists are driving down
the city street
Decoder 1 a group of people that are sitting on a street
Decoder 2 a group of motorcycles drive down a city
Expected a group of motorcycles drive down a city
Input a man sits with a traditionally decorated
Decoder 1 a man is sitting on a large grill in a
Decoder 2 an equestrian man in armor costume sitting
with a decorated cow
Expected an indian man in religious attire sitting with
a decorated cow
Input a beautiful dessert waiting to be shared by
two people
Decoder 1 a table with three plates of food and a fork
Decoder 2 there is a piece of cake on a plate with
flowers on it
Expected there is a piece of cake on a plate with
decorations on it
Input a home office with laptop printer scanner
and extra monitor
Decoder 1 a desk with a laptop and a mouse
Decoder 2 office setting with office equipment on desk
Expected office space with office equipment on desk
Table 4: Few examples of paraphrases generated by our VAE-ITERDEC2 model on MSCOCO dataset
Figure 3: Attention visualization examples from MSCOCO dataset demonstrating parts of paraphrase generated by first decoder (x-axis) on which decoder 2 attends while generating its output (y-axis). Input to the model was - ‘a group of motorcyclists are driving down the city street’ (left) and ‘a beautiful dessert waiting to be shared by two people’ (right).
Comparison of decoder 1 output | Quora (METEOR BLEU TER) | MSCOCO (METEOR BLEU TER)
with expected paraphrase 27.09 22.12 67.12 15.15 8.09 79.52
with decoder 2 output 26.2 23.19 68.52 14.99 8.02 79.65
Table 5: Comparison of METEOR, BLEU and TER scores of output of decoder 1 with - expected paraphrase and output of decoder 2 - in VAE-ITERDEC2 model on test sets of Quora and MSCOCO


Our VAE-ITERDEC2 model provides significant improvements on the MSCOCO dataset, outperforming the previously best approaches on all three metrics with a METEOR score of 57.44, BLEU of 84.64 and TER of 7.2. This is an improvement of over 26 and 43 points in METEOR and BLEU, and a reduction of over 33 points in TER, compared to the previous state of the art. Contrary to Quora, however, VAE-ITERDEC3 attains slightly lower scores than VAE-ITERDEC2 on these metrics, which shows that adding a third decoder does not necessarily lead to better results, although using the second decoder significantly improves the results. What the optimal number of decoders for a dataset is, or whether it can be decided dynamically, thus remains to be explored. Adding the variation loss to VAE-ITERDEC2 gives the best results: a METEOR score of 59.88, BLEU of 87.71 and TER of 5.8. A few example paraphrases generated by VAE-ITERDEC2 on MSCOCO are shown in table 4.

In the first example, the first decoder generates a paraphrase that has little relevance to the input; however, the second decoder corrects it by replacing ‘group of people that are sitting’ with ‘group of motorcycles drive down’, as can be seen in the attention map in figure 3 (left). In the third example in table 4, the first decoder uses the generic term ‘food’ as a replacement for ‘dessert’, while the second decoder introduces the word ‘cake’ while attending on ‘food’, as seen in the attention visualization in figure 3 (right). It also introduces ‘with flowers on it’ to capture the notion of ‘beautiful dessert’ in the original sentence. Similarly, in the last example in the table, the paraphrase generated by the second decoder includes ‘office setting’, making it coherent with the input, while its structure resembles the expected paraphrase.

To compare the outputs generated by the two decoders in the VAE-ITERDEC2 model, we computed the metric scores of decoder 1's output against both the expected paraphrase and decoder 2's output, as shown in table 5. The METEOR and BLEU scores against the expected paraphrase are substantially lower than the VAE-ITERDEC2 scores in table 2, which implies that the second decoder significantly improves upon the first; the same observation holds for TER. Comparing decoder 1's output with decoder 2's output, we get a high TER, which suggests the second decoder generates outputs that differ substantially from the first.


In this paper, we have proposed the attention based ReDecode framework for iterative refinement of generated paraphrases using a VAE based Seq2Seq model. It comprises a sequence of decoders which generate paraphrases in turn; each decoder attends on the output generated by the preceding decoder and modifies it, rectifying errors and introducing semantically coherent phrases, while generating its own output. Quantitatively, it improves the previous best scores on standard metrics and benchmark datasets, establishing a new state of the art in this task.

We experimented with a maximum of three decoders in our ReDecode framework. On the Quora dataset, using three decoders improved the scores over the two decoder model, contrary to MSCOCO. Determining the optimal number of decoders, which may be dataset dependent, remains future work. Furthermore, the proposed architecture is generic and may be beneficial in other sequence generation tasks such as machine translation.