Paraphrases are texts that express the same meaning in different ways. For example, "Can time travel ever be possible?" and "Is time travel a possibility?"
are paraphrases of each other. Human conversations typically involve a high degree of paraphrasing to express similar intent, but recognizing such sentences as semantically similar, and generating them, is difficult for a machine. Automatic paraphrase generation is an important NLP task with practical significance in many text-to-text generation problems such as question answering, conversational systems, information retrieval, and summarization. Knowledge-based QA systems are highly sensitive to the way a question is phrased; using paraphrases of the asked question while ranking answers in the knowledge base improves system performance [Dong et al.2017]. Paraphrasing also helps incorporate variation in domain-specific conversational bots that have a fixed set of responses, preventing them from being repetitive. In query reformulation, paraphrasing has direct utility: in a search engine, a paraphrase generation module can recommend different variations of the user query, or the search results can be shown directly after incorporating the variations into the search process. For end-to-end conversational systems, training data can be augmented with paraphrases of the available dialogues, which improves the semantic understanding capability of the system.
Early paraphrase generation systems used handcrafted rules [McKeown1983], relied on automatic extraction of paraphrase patterns from parallel corpora [Barzilay and Lee2003], or used knowledge bases such as WordNet [Bolshakov and Gelbukh2004]. Statistical machine translation tools have also been applied to paraphrase generation [Quirk, Brockett, and Dolan2004]. These approaches are limited by their methodology and do not generalize well.
Recent advances in deep neural network based models for sequence generation have advanced the state of the art in various NLP tasks such as machine translation [Bahdanau, Cho, and Bengio2014] and question answering [Yin et al.2015]. For paraphrase generation, [Prakash et al.2016] were the first to explore a sequence-to-sequence (Seq2Seq) [Sutskever, Vinyals, and Le2014] neural network model, and proposed an improved variant: a stacked LSTM Seq2Seq network with residual connections.
In this paper, we present a framework for automatic paraphrase generation based on the variational autoencoder (VAE) [Kingma and Welling2013]. VAE is used extensively for generative tasks in the image domain and has been explored in the text domain [Bowman et al.2015] as well; for sequential input, the model usually consists of LSTM RNNs [Sundermeyer, Schlüter, and Ney2012] as encoder and decoder (VAE-LSTM). Unlike the traditional reconstruction task of VAE, paraphrasing involves generating outputs that differ in expression but carry the same semantic meaning. To achieve this objective, [Gupta et al.2017] introduced a supervised variant (VAE-S) of VAE-LSTM in which the decoder is conditioned on a vector representation of the input sentence obtained through another RNN, instead of depending only on the latent representation. Our approach is based on supervised generative sequence modeling through VAE, where supervision is obtained through the decoder attending over the hidden states of an LSTM RNN that encodes the input sentence.
In this work, we introduce a methodology for iteratively improving the output within the VAE-S framework, in contrast to previous sequence generation models that decode the output sequence only once. The concept is inspired by the idea that, given a crude paraphrase and the original sentence, the model should be able to generate a better quality paraphrase in the next iteration by rectifying errors and identifying regions of improvement, much as humans do. We achieve iterative improvement by having multiple decoders in the model; each decoder, except the first, attends on the output of the previous decoder for supervision. We establish the effectiveness of this approach for paraphrase generation by showing significant improvements over the state of the art in the scores of standard metrics on benchmark paraphrase datasets. Our approach is applicable to other domains involving sequence generation, such as conversational systems and question answering; however, we do not explore those domains in this paper. Our contributions can be listed as:
We introduce an iterative improvement framework that refines the output using multiple decoders under a VAE based generative model. The first decoder is conditioned on the input sentence encoding, whereas subsequent decoders are conditioned on the outputs generated by the preceding decoders.
We improve the existing state of the art in the paraphrase generation task by a significant margin using the above approach.
Paraphrase generation has been modeled as a Seq2Seq learning problem from the input sentence to the target paraphrase. The first Seq2Seq neural network based approach was proposed by [Prakash et al.2016]: a stacked LSTM RNN model with residual connections. The authors compared it with other variants of the Seq2Seq model, including attention and bidirectional LSTM units. [Cao et al.2017] introduced a Seq2Seq model fusing two decoders, one a copying decoder and the other a restricted generative decoder, inspired by the human way of paraphrasing, which principally involves copying or rewriting. [Gupta et al.2017] introduced a VAE based model for paraphrase generation. VAE, as introduced by [Kingma and Welling2013], is a generative deep neural network model that maps the input to latent variables and decodes the latent variables to reconstruct the data. VAE is well suited to generating new data, as it explicitly learns a probability distribution over the latent code from which a sample is drawn for decoding. [Gupta et al.2017] condition the decoder on the input sentence and use the reference paraphrase along with the input sentence to generate the latent code, obtaining better quality paraphrases.
There has also been work on improving paraphrase generation models inspired by machine translation. Paraphrase pairs obtained by back-translating texts from bilingual machine translation corpora have been shown to match the quality of manually written English paraphrase pairs [Wieting, Mallinson, and Gimpel2017]. There is also work on syntactically controlled paraphrase generation, where the parse tree template of the paraphrase to be generated is given as an additional input [Iyyer et al.2018].
Our work is similar to the approach of [Gupta et al.2017] in that our model is also based on VAE. The main differences lie in our methodology to iteratively improve the decoded output and in our use of an attention mechanism to condition the decoder on the input sentence during training. We also introduce a specific loss term to promote the generation of varied paraphrases of a given sentence.
In this section, we explain our model architecture, which is based on VAE. We first give a brief overview of VAE and then explain our framework in detail.
The Variational Autoencoder, as introduced by [Kingma and Welling2013], is a generative model that learns a posterior distribution over latent variables for generating output. Input data $x$ is mapped to a latent code $z$ from which $x$ can be reconstructed. It differs from traditional autoencoders in that, instead of learning a deterministic mapping to the latent code $z$, it learns a posterior distribution $q_\phi(z|x)$ from the data, starting with a prior $p(z)$. The posterior distribution is usually taken to be $\mathcal{N}(\mu, \sigma^2 I)$ and the prior as $\mathcal{N}(0, I)$ to facilitate training via stochastic back propagation. The encoder can be a neural network with a feed forward layer at the end to estimate $\mu$ and $\sigma$ from $x$. The latent code $z$ is sampled from the normal distribution $\mathcal{N}(\mu, \sigma^2 I)$ and passed to the decoder as input. The decoder, also a neural network, learns the probability distribution $p_\theta(x|z)$ to reconstruct the input data from the latent code. The network is trained by maximizing the following objective function:

$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$     (1)

Here $\phi$ and $\theta$ are the parameters of the encoder and decoder respectively, and $KL$ stands for KL divergence. The objective function maximizes the log likelihood of the data reconstructed from the posterior and at the same time reduces the KL divergence between the prior and posterior distributions of the latent code $z$. As shown by the authors, this objective is a valid lower bound on the true log likelihood of the data, so maximizing it ensures that the total log likelihood of the data is maximized. The first term in equation 1 is maximized by minimizing the cross entropy error over the training dataset.
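Under the Gaussian assumptions above, the KL term in equation 1 has a well-known closed form, and $z$ is drawn via the reparameterization trick so that gradients can flow through $\mu$ and $\sigma$. A minimal illustrative sketch (function names are ours, not from the paper):

```python
import math
import random

def kl_diag_gaussian(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    summed over latent dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv
        for m, lv in zip(mu, logvar)
    )

def reparameterize(mu, logvar, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so the sampling
    step is differentiable with respect to mu and logvar."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

# A standard-normal posterior incurs zero KL penalty.
print(kl_diag_gaussian([0.0, 0.0], [0.0, 0.0]))  # 0.0
```

The KL term acts as a regularizer that keeps the learned posterior close to the prior, so that samples drawn from $\mathcal{N}(0, I)$ at generation time decode to plausible outputs.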
Since VAE learns the probability distribution $p_\theta(x|z)$, it is well suited for generative modeling tasks. For sequence generation in text, [Bowman et al.2015] proposed an RNN based variational autoencoder model. Both the encoder and decoder are LSTM RNNs, with a feed forward layer at the end of the encoder to estimate $\mu$ and $\sigma$. They introduce techniques like KL cost annealing and word dropout in the decoder for efficient learning. [Gupta et al.2017] improved upon this model in a supervised paraphrase generation setting by conditioning the decoder on an input sentence encoding computed by a separate encoder and using $z$ as input at every time step of decoding. From now on we use VAE-S (S stands for supervision) to denote this model.
In our model as well, $z$ is concatenated with the word encoding as input to the decoder (as in the standard Seq2Seq technique), and the decoder is conditioned on the input sentence. Since paraphrase generation is subtly different from sentence reconstruction, using $z$ alone may not result in good paraphrases. We condition the decoder on the input using the well known attention mechanism [Luong, Pham, and Manning2015] while generating the paraphrases, enabling the model to learn phrase level semantics. Attention has been widely used in sequence tasks such as Recognizing Textual Entailment (RTE) [Rocktäschel et al.2015] and Machine Translation (MT) [Vaswani et al.2017].
We explain our attention based ReDecode model architecture in the next section.
Training data consists of an input sentence and its expected paraphrase. The input to the model is the sequence of word vectors of the input sentence, for which we use pre-trained GloVe [Pennington, Socher, and Manning2014] embeddings instead of training word vectors from scratch. The architecture of our model is shown in figure 1. It consists of a Sampling Encoder, a Sentence Encoder, and a sequence of decoders $D_1, D_2, \ldots, D_n$, each with its own parameters. Below we explain each module and the training strategy in detail.
Sampling Encoder: encodes the original sentence for sampling the latent vector $z$. As shown in figure 1, it consists of a single layer LSTM RNN that sequentially processes the word embeddings of the words in the original sentence and creates a vector representation of the sentence. This representation is then passed through two separate fully connected layers to estimate the mean ($\mu$) and variance ($\sigma^2$). The final latent code $z$ is sampled from the $\mathcal{N}(\mu, \sigma^2 I)$ distribution.
Sentence Encoder: computes a vector representation of the input sentence used for generating the output paraphrase in the decoding stage. It is a two layer stacked LSTM unit that sequentially processes the input sentence and generates a set of hidden vectors $H$, one per time step of the input sequence. These hidden vectors are attended upon by the decoder. In the attention mechanism, given a sequence of vectors arranged along the columns of a memory matrix $M$, the decoder LSTM learns a context vector, computed as a weighted sum of the columns of $M$ as a function of its input and hidden state at time step $j$, and uses it for generating the output. The decoder thereby learns to identify and focus on specific parts of the memory while generating the words.
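The context-vector computation described above can be sketched with dot-product scoring, one common choice in [Luong, Pham, and Manning2015]; the code below is an illustrative simplification, not the paper's implementation:

```python
import math

def attention_context(query, memory):
    """Softmax-weighted sum of memory vectors using dot-product scores.
    `query` stands in for the decoder's current hidden state; `memory`
    is the list of encoder hidden vectors (columns of M)."""
    scores = [sum(q * m for q, m in zip(query, mem)) for mem in memory]
    mx = max(scores)                       # for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * mem[d] for w, mem in zip(weights, memory))
               for d in range(len(memory[0]))]
    return context, weights

memory = [[1.0, 0.0], [0.0, 1.0]]
ctx, w = attention_context([10.0, 0.0], memory)
# the weights concentrate on the first memory vector
```

A query aligned with one memory slot yields a weight distribution peaked on that slot, which is exactly the "focus on specific parts of the memory" behavior described above.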
Decoders: In the decoding stage, we propose to use multiple decoders $D_1, D_2, \ldots, D_n$ to generate the output iteratively. While training, the input to each decoder at every time step is the latent code $z$ concatenated with the embedding of the corresponding word of the expected paraphrase. During inference, the generated word is given as input to the next step of decoding, as in the standard Seq2Seq paradigm. Each decoder is a two layer stacked LSTM unit followed by a projection layer that outputs a likelihood distribution over the vocabulary. Decoder $D_i$ ($i > 1$) attends on the softmax vectors generated by $D_{i-1}$, whereas $D_1$ attends over the outputs generated by the Sentence Encoder. More formally, we iteratively generate a sequence of paraphrases $\hat{p}_1, \ldots, \hat{p}_n$ such that:

$\hat{p}_1 = D_1(z, H)$, and $\hat{p}_i = D_i(z, C_{i-1})$ for $i > 1$

where $\hat{p}_i$ is the sequence of words in the paraphrase generated by $D_i$, $H$ is the set of outputs generated by the Sentence Encoder, $S_{i-1}$ are the softmax vectors generated by the previous decoder $D_{i-1}$, and $C_{i-1}$ are the context vectors obtained by attending over those softmax vectors.
As shown in the experimental results section, $D_i$ ($i > 1$) iteratively improves the output generated by $D_{i-1}$. In a single decoder model, the output at time step $t$ is decided based only on the outputs at time steps before $t$. With multiple decoders, $D_i$ ($i > 1$) has information about the complete paraphrase generated by $D_{i-1}$. We hypothesize that the later decoders thus have a prior notion of the output to be generated at every time step, which enables them to rectify errors, modify the structure, and introduce useful variations.
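Independent of the LSTM internals, the ReDecode control flow can be sketched as a chain in which each stage conditions on its predecessor's complete output; the toy decoders below are hypothetical stand-ins for the trained LSTM decoders:

```python
def redecode(z, encoder_states, decoders):
    """Chain decoders: the first conditions on the encoder states,
    each later decoder conditions on its predecessor's full output."""
    outputs = []
    prev = encoder_states
    for dec in decoders:
        out = dec(z, prev)
        outputs.append(out)
        prev = out  # the next decoder attends over this output
    return outputs

def first_draft(z, ctx):
    # stand-in for D1: a crude paraphrase (hypothetical example)
    return ["what", "is", "best", "time"]

def insert_article(z, prev_tokens):
    # stand-in for D2: a toy "refinement" that fixes a missing article,
    # made possible because the full draft is visible
    out = []
    for tok in prev_tokens:
        out.append(tok)
        if tok == "is":
            out.append("the")
    return out

drafts = redecode(None, None, [first_draft, insert_article])
# drafts[-1] == ["what", "is", "the", "best", "time"]
```

The point of the sketch is structural: unlike left-to-right decoding, every decoder after the first sees a complete draft, so its "refinement" can depend on tokens anywhere in the sequence.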
The training objective of our model is similar to the VAE objective in equation 1. To increase the log likelihood of the paraphrases generated by all decoders, the average of the cross entropy (CE) of each decoder's output against the target paraphrase is minimized along with the KL divergence loss. Thus our loss function is:

$\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} CE(\hat{p}_i, p) + KL(q_\phi(z|x) \,\|\, p(z))$
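The averaged cross entropy plus KL objective described above can be sketched with toy softmax outputs standing in for real decoder distributions (the helper names are ours):

```python
import math

def cross_entropy(probs_seq, target_ids):
    """Per-token negative log likelihood of the target sequence under
    a decoder's softmax outputs, averaged over time steps."""
    nll = [-math.log(probs[t]) for probs, t in zip(probs_seq, target_ids)]
    return sum(nll) / len(nll)

def redecode_loss(decoder_outputs, target_ids, kl_term):
    """Average cross entropy across all decoders plus the KL term."""
    ce = sum(cross_entropy(out, target_ids) for out in decoder_outputs)
    return ce / len(decoder_outputs) + kl_term

# two decoders over a vocabulary of 3, target sequence [0, 2];
# the second decoder is more confident, so its CE is lower
out1 = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
out2 = [[0.9, 0.05, 0.05], [0.05, 0.05, 0.9]]
loss = redecode_loss([out1, out2], [0, 2], kl_term=0.0)
```

Averaging the per-decoder cross entropies (rather than summing) keeps the reconstruction and KL terms on comparable scales regardless of the number of decoders.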
Also, in order to induce variations in the generated paraphrases, we conduct training by sampling three different latent vectors $z_1, z_2, z_3$ and generating the corresponding outputs. This is done by adding different Gaussian noises to the mean and variance vectors obtained for the input sentence and feeding the resulting samples to the decoder. We take the final state of the decoder after generating each output as the representation $v_i$ of the corresponding output sentence and minimize the pairwise cosine similarity between these representations by adding the following term to the loss function:

$\mathcal{L}_{CS} = CS(v_1, v_2) + CS(v_1, v_3) + CS(v_2, v_3)$

where CS denotes cosine similarity. The objective is to tune the model so that the different noises added to the mean and variance vectors while sampling $z$ result in diverse paraphrases that remain coherent with the input sentence. We now discuss the experiments conducted for the model variations described above.
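The pairwise penalty can be sketched directly from its definition; `pairwise_cs_loss` is our illustrative name, not the paper's:

```python
import math

def cosine_similarity(u, v):
    """Standard cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pairwise_cs_loss(states):
    """Sum of cosine similarities over all pairs of final decoder
    states; minimizing this pushes the sampled outputs apart."""
    total = 0.0
    for i in range(len(states)):
        for j in range(i + 1, len(states)):
            total += cosine_similarity(states[i], states[j])
    return total

# three identical states -> maximal penalty of 3.0 over the 3 pairs
print(pairwise_cs_loss([[1.0, 0.0]] * 3))  # 3.0
```

Identical representations incur the maximum penalty, while mutually orthogonal ones incur none, which is what drives the three sampled outputs toward distinct paraphrases.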
We present a qualitative and quantitative discussion of the results on two different datasets, Quora question pairs and MSCOCO, across different model variations. The Quora dataset111https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs comprises questions asked by users of the platform and consists of question pairs that are potential paraphrases of each other, as denoted by a binary 1-0 value provided against each pair. We use the pairs with value 1 and discard the rest. The MSCOCO222http://cocodataset.org/#download dataset comprises about 200k labeled images, each annotated with 5 captions that are potential paraphrases. We use the 2014 release of the dataset, which provides separate train and validation splits, in order to compare our results with previous baselines and work on paraphrase generation. We randomly select 4 captions out of 5 for each image and randomly divide them into 2 input-paraphrase sentence pairs. Before feeding the sentences to the model for training and inference, we preprocess them by removing punctuation and include only the pairs where both the input sentence and its paraphrase have length at most 15. Sentences shorter than 15 words are padded appropriately using a separate pad token. The number of sentence pairs on which the model is trained and validated after preprocessing is summarized in table 1.
| Dataset | # Training Samples | # Testing Samples |
| Model | Quora METEOR | Quora BLEU | Quora TER | MSCOCO METEOR | MSCOCO BLEU | MSCOCO TER |
| Residual LSTM [Prakash et al.2016] | NA | NA | NA | 27.0 | 37.0 | 51.6 |
| VAE-SVG [Gupta et al.2017] | 32.0 | 37.1 | 40.8 | 30.9 | 41.3 | 40.8 |
To train our model, we use pre-trained 300 dimensional GloVe embeddings333https://nlp.stanford.edu/projects/glove/ to represent the input words in a sentence and keep them non-trainable. The encoder in the Sampling Encoder is a single layer LSTM with 600 units. The dimension of the mean and variance vectors is kept at 1100 through all the experiments, with the batch size and learning rate also kept fixed. The Sentence Encoder and the decoders are two layer stacked LSTM cells with the number of units fixed at 600. We use the Adam optimizer [Kingma and Ba2014] for training the model parameters. This configuration is common across the different experimental settings.
Baseline and Evaluation Measures
We compare our model with the VAE-SVG model [Gupta et al.2017], the current state of the art on the benchmark datasets, and the Residual LSTM model [Prakash et al.2016]. We directly cite the scores as reported in the respective papers. Unlike [Gupta et al.2017], we do not train the word embeddings in our work. To make a fair comparison, we also implemented and trained the VAE-SVG model in the same setting; we denote this model as VAE-REF.
For quantitative evaluation of our model, we calculate scores on well known evaluation metrics from the machine translation domain444We used the software available at https://github.com/jhclark/multeval : METEOR [Lavie and Agarwal2007], BLEU [Papineni et al.2002] and Translation Edit Rate (TER) [Snover et al.2006]. These scores have been shown to correlate well with human judgment, and [Madnani, Tetreault, and Chodorow2012] show that they also perform well for the task of paraphrase recognition. The BLEU score is based on weighted n-gram precision of the candidate paraphrase against the reference paraphrase. METEOR additionally uses stemming and synonymy detection while computing precision and recall. TER measures the edit distance between the reference and candidate sentences, so a lower TER indicates a better score.
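As a rough illustration of the clipped n-gram precision underlying BLEU (the full metric uses up to 4-grams and a brevity penalty; this toy two-gram version is ours):

```python
from collections import Counter
import math

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = Counter(tuple(candidate[i:i + n])
                   for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

def bleu2(candidate, reference):
    """Toy BLEU: geometric mean of 1- and 2-gram precision,
    without the brevity penalty used by the full metric."""
    p1 = ngram_precision(candidate, reference, 1)
    p2 = ngram_precision(candidate, reference, 2)
    return math.sqrt(p1 * p2) if p1 * p2 > 0 else 0.0

ref = "when is the best time to exercise".split()
hyp = "what is best time exercise".split()
# identical sentences score 1.0; the crude paraphrase scores lower
```

This makes the later discussion concrete: a draft that shares words but not phrasing with the reference keeps some unigram overlap while losing most bigram overlap, and its BLEU drops accordingly.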
In order to evaluate our approach, we experimented with the following variations of the model: (1) the basic VAE based generative model (VAE-S), (2) VAE-S with the reference paraphrase as an additional input (VAE-REF), (3) VAE with attention and the pairwise cosine similarity loss (VAE-VAR), (4) VAE with iterative decoding using 2 decoders and attention (VAE-ITERDEC2), (5) VAE with the pairwise cosine similarity loss and 2 decoders with attention (VAE-ITERVAR), and (6) VAE with 3 decoders and attention (VAE-ITERDEC3). Results for each of these models are summarized in table 2 for both the Quora and MSCOCO datasets. We report all our results and improvements in absolute points.
| Input | what are the top universities for computer science in the world |
| Decoder 1 | what are the best universities for computer science in the world |
| Decoder 2 | what are the best computer science colleges |
| Expected | what are the best computer science schools |
| Input | which is best time for exercise |
| Decoder 1 | what is best time exercise |
| Decoder 2 | when is the best time to exercise |
| Expected | when is the best time to workout |
| Input | what can substitute red wine in cooking |
| Decoder 1 | what are the best sides in cooking |
| Decoder 2 | what is a good substitute for red wine in |
| Expected | what is a good replacement for red wine in |
| Input | how do i start an export company or llc in new york city |
| Decoder 1 | how do i start preparing for donations in |
| Decoder 2 | how do i start new llc capital company in |
| Expected | how do i start an import/export llc in new |
As we can see, our proposed iterative decoding mechanism improves the scores by a large margin compared to the VAE-S baseline. The improved scores are better than any previous work on paraphrase generation, with substantial absolute gains in BLEU and METEOR over the previous best scores [Gupta et al.2017], thus establishing a new state of the art on this task. Our TER score is also lower than the previously established best. Table 3 shows a comparison between the paraphrases generated by the first decoder and the improvements made by the second decoder on a few example sentences.
In some cases, as in the first example in table 3, the output generated by the second decoder resembles the expected paraphrase more than the paraphrase generated by the first decoder, which leads to a better score: the first decoder merely replaces the word 'top' in the input sentence with 'best', while the second decoder changes the sentence structure by introducing the phrase 'best computer science', which also matches the expected paraphrase. Another observation is that the second decoder often makes the generated paraphrase more correct and semantically more similar to the input sentence than the output of the first decoder, as in the third and last examples in table 3. Figure 2 shows attention heatmaps demonstrating the phrases in the first decoder's output on which the second decoder focuses while generating the paraphrase. For the last example in table 3, figure 2 (right) shows that the second decoder attends on 'start preparing for donations' while replacing it with 'start new llc'. Similarly, for the third example in table 3, the second decoder generates 'good substitute' while attending on 'best sides', as seen in figure 2 (left). Thus the second decoder focuses on mistakes in the previous output to make a guided decision while generating its own output.
Adding the pairwise cosine similarity loss to the VAE-ITERDEC2 model (VAE-ITERVAR) changes the scores only marginally on this dataset. We also extended the VAE-ITERDEC2 model (without this loss) with an additional decoder; the resulting 3-decoder model further improved the BLEU and METEOR scores and reduced the TER.
| Input | a group of motorcyclists are driving down the city street |
| Decoder 1 | a group of people that are sitting on a street |
| Decoder 2 | a group of motorcycles drive down a city |
| Expected | a group of motorcycles drive down a city |
| Input | a man sits with a traditionally decorated |
| Decoder 1 | a man is sitting on a large grill in a |
| Decoder 2 | an equestrian man in armor costume sitting with a decorated cow |
| Expected | an indian man in religious attire sitting with a decorated cow |
| Input | a beautiful dessert waiting to be shared by |
| Decoder 1 | a table with three plates of food and a fork |
| Decoder 2 | there is a piece of cake on a plate with flowers on it |
| Expected | there is a piece of cake on a plate with decorations on it |
| Input | a home office with laptop printer scanner and extra monitor |
| Decoder 1 | a desk with a laptop and a mouse |
| Decoder 2 | office setting with office equipment on desk |
| Expected | office space with office equipment on desk |
| Comparison of decoder 1 output | Quora (METEOR / BLEU / TER) | MSCOCO (METEOR / BLEU / TER) |
| with expected paraphrase | 27.09 / 22.12 / 67.12 | 15.15 / 8.09 / 79.52 |
| with decoder 2 output | 26.2 / 23.19 / 68.52 | 14.99 / 8.02 / 79.65 |
Our VAE-ITERDEC2 model provides significant improvements on this dataset, outperforming the previously best approaches on all three metrics, with higher BLEU and METEOR and lower TER compared to the previous state of the art. Contrary to Quora, however, VAE-ITERDEC3 scores slightly lower than VAE-ITERDEC2 on these metrics, which shows that adding a third decoder does not necessarily lead to better results, while adding the second decoder significantly improves them. It thus remains to be explored what the optimal number of decoders for a dataset is, or whether it can be decided dynamically. Adding the pairwise cosine similarity loss to VAE-ITERDEC2 gives the best results on this dataset. A few example paraphrases generated by VAE-ITERDEC2 on MSCOCO are shown in table 4.
In the first example, the first decoder generates a paraphrase that bears little relevance to the input; the second decoder corrects it by replacing 'group of people that are sitting' with 'group of motorcycles drive down', as can also be seen in the attention map in figure 3 (left). In the third example in table 4, the first decoder uses the generic term 'food' as a replacement for 'dessert', while the second decoder introduces the word 'cake' while attending on 'food', as seen in the attention visualization in figure 3 (right). It also introduces 'with flowers on it' to represent the notion of 'beautiful dessert' in the original sentence. Similarly, in the last example in the table, the paraphrase generated by the second decoder includes 'office setting', making it coherent with the input while its structure resembles the expected paraphrase.
To compare the outputs generated by the two decoders in the VAE-ITERDEC2 model, we computed the metric scores of the decoder 1 output against both the expected paraphrase and the decoder 2 output, as shown in table 5. The scores against the expected paraphrase are considerably lower than the corresponding VAE-ITERDEC2 scores in table 2, which implies that the second decoder significantly improves upon the first decoder; this observation holds across all three metrics. Comparing the decoder 1 output with the decoder 2 output, we get a high TER, which suggests that the second decoder generates outputs sufficiently different from the first.
In this paper, we have proposed the attention based ReDecode framework for iterative refinement of generated paraphrases using a VAE based Seq2Seq model. It comprises a sequence of decoders that generate paraphrases in turn: each decoder after the first attends on the output generated by the preceding decoder and modifies it, rectifying errors and introducing semantically coherent phrases, while generating its own output. Quantitatively, it improves the previous best scores on standard metrics and benchmark datasets, establishing a new state of the art on this task.
We experimented with a maximum of three decoders in our ReDecode framework. On the Quora dataset, using three decoders improved the scores over the two decoder model, contrary to MSCOCO. Determining the optimal number of decoders, which may be dataset dependent, remains future work. Furthermore, the proposed architecture is generic and may be beneficial in other sequence generation tasks such as machine translation.
- [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- [Barzilay and Lee2003] Barzilay, R., and Lee, L. 2003. Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, 16–23. Association for Computational Linguistics.
- [Bolshakov and Gelbukh2004] Bolshakov, I. A., and Gelbukh, A. 2004. Synonymous paraphrasing using wordnet and internet. In International Conference on Application of Natural Language to Information Systems, 312–323. Springer.
- [Bowman et al.2015] Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; Jozefowicz, R.; and Bengio, S. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
- [Cao et al.2017] Cao, Z.; Luo, C.; Li, W.; and Li, S. 2017. Joint copying and restricted generation for paraphrase. In AAAI, 3152–3158.
- [Dong et al.2017] Dong, L.; Mallinson, J.; Reddy, S.; and Lapata, M. 2017. Learning to paraphrase for question answering. arXiv preprint arXiv:1708.06022.
- [Gupta et al.2017] Gupta, A.; Agarwal, A.; Singh, P.; and Rai, P. 2017. A deep generative framework for paraphrase generation. arXiv preprint arXiv:1709.05074.
- [Iyyer et al.2018] Iyyer, M.; Wieting, J.; Gimpel, K.; and Zettlemoyer, L. 2018. Adversarial example generation with syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- [Lavie and Agarwal2007] Lavie, A., and Agarwal, A. 2007. Meteor: An automatic metric for mt evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, 228–231. Association for Computational Linguistics.
- [Luong, Pham, and Manning2015] Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
- [Madnani, Tetreault, and Chodorow2012] Madnani, N.; Tetreault, J.; and Chodorow, M. 2012. Re-examining machine translation metrics for paraphrase identification. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 182–190. Association for Computational Linguistics.
- [McKeown1983] McKeown, K. R. 1983. Paraphrasing questions using given and new information. Computational Linguistics 9(1):1–10.
- [Papineni et al.2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, 311–318. Association for Computational Linguistics.
- [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532–1543.
- [Prakash et al.2016] Prakash, A.; Hasan, S. A.; Lee, K.; Datla, V.; Qadir, A.; Liu, J.; and Farri, O. 2016. Neural paraphrase generation with stacked residual lstm networks. arXiv preprint arXiv:1610.03098.
- [Quirk, Brockett, and Dolan2004] Quirk, C.; Brockett, C.; and Dolan, B. 2004. Monolingual machine translation for paraphrase generation. Association for Computational Linguistics.
- [Rocktäschel et al.2015] Rocktäschel, T.; Grefenstette, E.; Hermann, K. M.; Kočiskỳ, T.; and Blunsom, P. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.
- [Snover et al.2006] Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; and Makhoul, J. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of association for machine translation in the Americas, volume 200.
- [Sundermeyer, Schlüter, and Ney2012] Sundermeyer, M.; Schlüter, R.; and Ney, H. 2012. Lstm neural networks for language modeling. In Thirteenth annual conference of the international speech communication association.
- [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, 3104–3112.
- [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
- [Wieting, Mallinson, and Gimpel2017] Wieting, J.; Mallinson, J.; and Gimpel, K. 2017. Learning paraphrastic sentence embeddings from back-translated bitext. arXiv preprint arXiv:1706.01847.
- [Williams and Zipser1989] Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation 1(2):270–280.
- [Yin et al.2015] Yin, J.; Jiang, X.; Lu, Z.; Shang, L.; Li, H.; and Li, X. 2015. Neural generative question answering. arXiv preprint arXiv:1512.01337.