The latent bag of words model for paraphrase generation
Paraphrase generation is a longstanding important problem in natural language processing. In addition, recent progress in deep generative models has shown promising results on discrete latent variables for text generation. Inspired by variational autoencoders with discrete latent structures, in this work, we propose a latent bag of words (BOW) model for paraphrase generation. We ground the semantics of a discrete latent variable by the BOW from the target sentences. We use this latent variable to build a fully differentiable content planning and surface realization model. Specifically, we use source words to predict their neighbors and model the target BOW with a mixture of softmax. We use Gumbel top-k reparameterization to perform differentiable subset sampling from the predicted BOW distribution. We retrieve the sampled word embeddings and use them to augment the decoder and guide its generation search space. Our latent BOW model not only enhances the decoder, but also exhibits clear interpretability. We show the model interpretability with regard to (i) unsupervised learning of word neighbors (ii) the step-by-step generation procedure. Extensive experiments demonstrate the transparent and effective generation process of this model.
The generation of paraphrases is a longstanding problem for natural language learning (McKeown, 1983). Paraphrases are defined as sentences conveying the same meaning but with different surface realizations. For example, on a question answering website, people may ask duplicate questions such as "How do I improve my English" vs. "What is the best way to learn English". Paraphrase generation is important, not only because paraphrases demonstrate the diverse nature of human language, but also because the generation system can be a key component of other important language understanding tasks, such as question answering (Buck et al., 2018; Dong et al., 2017), machine translation (Cho et al., 2014), and semantic parsing (Su and Yan, 2017).
Traditional models are generally rule-based: they find lexical substitutions from WordNet-style (Miller, 1995) resources, then substitute the content words accordingly (Bolshakov and Gelbukh, 2004; Narayan et al., 2016; Kauchak and Barzilay, 2006). Recent neural models primarily rely on the sequence-to-sequence (seq2seq) learning framework (Sutskever et al., 2014; Prakash et al., 2016), achieving inspiring performance gains over the traditional methods. Despite its effectiveness, seq2seq learning offers no strong interpretability. The sentence embedding produced by the encoder is not directly associated with any linguistic aspect of that sentence (the linguistic aspects we refer to include, but are not limited to, words, phrases, syntax, and semantics). On the other hand, although interpretable, many traditional methods suffer from suboptimal performance (Prakash et al., 2016). In this work we introduce a model that achieves strong performance while maintaining and benefiting from semantic interpretability.
To improve model interpretability, researchers typically follow two paths. First, from a probabilistic perspective, one might encode the source sentence into a latent code with certain structures (Kim et al., 2018) (e.g. a Gaussian variable for the MNIST dataset (Kingma and Welling, 2013)). Second, from a traditional natural language generation (NLG) perspective, one might explicitly separate content planning and surface realization (Moryossef et al., 2019). The traditional word substitution models for paraphrase generation are an example of planning and realization: first, word neighbors are retrieved from WordNet (the planning stage), and then words are substituted and re-organized to form a paraphrase (the realization stage). Neighbors of a given word are words that are semantically close to it (e.g. improve → learn). Here the interpretability comes from a linguistic perspective, since the model performs generation step by step: it first proposes the content, then generates according to the proposal. Although effective across many applications, both approaches have their own drawbacks. The probabilistic approach lacks an explicit connection between the code and the semantic meaning, whereas for the traditional NLG approach, the separation of planning and realization is (across most models) non-differentiable (Cao et al., 2017; Moryossef et al., 2019), which sacrifices the end-to-end learning capability of neural network models, a capability that has proven critical in a vast number of deep learning settings.
In an effort to bridge these two approaches, we propose a hierarchical latent bag of words model for planning and realization. Our model uses the words of the source sentence to predict their neighbors in the bag of words from the target sentences (in practice, we gather the words from the target sentences into a set; this set is our target BOW). From the predicted word neighbors, we sample a subset of words as our content plan, and organize these words into a full sentence. We use Gumbel top-k reparameterization (Jang et al., 2017; Maddison et al., 2017) for differentiable subset sampling (Xie and Ermon, 2019), making the planning and realization fully end-to-end. Our model thus exhibits interpretability from both perspectives: from the probabilistic perspective, since we optimize a discrete latent variable towards the bag of words of the target sentence, the meaning of this variable is grounded with explicit lexical semantics; from the traditional NLG perspective, our model follows the planning and realization steps, yet remains fully differentiable. Our contributions are:
We endow a hierarchical discrete latent variable with explicit lexical semantics. Specifically, we use the bag of words from the target sentences to ground the latent variable.
We use this latent variable model to build a differentiable step by step content planning and surface realization pipeline for paraphrase generation.
We demonstrate the effectiveness of our model with extensive experiments and show its interpretability with respect to clear generation steps and the unsupervised learning of word neighbors.
Our goal is to extend the seq2seq model (figure 1 lower part) with differentiable content planning and surface realization (figure 1 upper part). We begin with a discussion about the seq2seq base model.
The classical seq2seq model encodes the source sequence $x = x_1, x_2, \dots, x_m$ into a code $h$, and decodes it to the target sequence $y = y_1, y_2, \dots, y_n$ (Figure 1 lower part), where $m$ and $n$ are the lengths of the source and the target, respectively (Sutskever et al., 2014; Bahdanau et al., 2015). The encoder $\mathrm{enc}_\psi$ and the decoder $\mathrm{dec}_\theta$ are both deep networks. In our work, they are implemented as LSTMs (Hochreiter and Schmidhuber, 1997). The loss function is the negative log likelihood:
$$h = \mathrm{enc}_\psi(x), \qquad p_\theta(y \mid x) = \mathrm{dec}_\theta(h), \qquad \mathcal{L}_{\mathrm{S2S}} = \mathbb{E}_{(x^\star, y^\star) \sim P^\star}\left[-\log p_\theta(y^\star \mid x^\star)\right] \tag{1}$$
where $P^\star$ is the true data distribution. The model is trained with a gradient-based optimizer, and we use Adam (Kingma and Ba, 2014) in this work. In this setting, the code $h$ does not have direct interpretability. To add interpretability, in our model, we ground the meaning of a latent structure with lexical semantics.
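As a minimal sketch of the negative log likelihood above, the snippet below scores one target sequence against per-step decoder distributions. The function name `sequence_nll` and the toy probabilities are illustrative assumptions, not part of the original model code.

```python
import math

def sequence_nll(step_probs, target_ids):
    """Negative log likelihood of one target sequence.

    step_probs[t][w] is assumed to be the decoder probability of word id w
    at step t (hypothetical values, e.g. from a softmax layer).
    """
    return -sum(math.log(step_probs[t][w]) for t, w in enumerate(target_ids))

# Toy example: vocabulary of 3 words, target sequence of length 2.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = sequence_nll(step_probs, [0, 1])
```

Averaging this quantity over training pairs gives an empirical estimate of the expectation in the loss.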
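Now we consider formulating a plan as a bag of words before the surface realization process. Formally, let $\mathcal{V}$ be the vocabulary of size $V$; then a bag of words (BOW) $z$ of size $k$ is a random set formulated as a $k$-hot vector in $\{0, 1\}^V$. We assume $z$ is obtained by sampling $k$ times without replacement from a base categorical distribution $p(\tilde{z} \mid x)$. Directly modeling the distribution of $z$ is hard due to the combinatorial complexity, so instead we model its base categorical distribution $p(\tilde{z} \mid x)$. In paraphrase generation datasets, one source sentence may correspond to multiple target sentences. Our key modeling assumption is that the BOW from target sentences (target BOW) should be similar to the neighbors of the words in the source sentence. As such, we define the base categorical variable $\tilde{z}$ as the mixture of all neighbors of all source words. Namely, first, for each source word $x_i$, we model its $j$-th neighbor word with a one-hot $z_{ij}$:
$$p(z_{ij} \mid x_i) = \mathrm{Categorical}\left(\phi_{ij}(x_i)\right) \tag{2}$$
The support of $z_{ij}$ is also the word vocabulary $\mathcal{V}$, and we predict $l$ neighbors for each $x_i$ ($i = 1, \dots, m$; $j = 1, \dots, l$). We then mix the probabilities of these neighbors:
$$\tilde{z} \sim p_\phi(\tilde{z} \mid x) = \frac{1}{ml} \sum_{i=1}^{m} \sum_{j=1}^{l} p(z_{ij} \mid x_i) \tag{3}$$
where $ml$ is the maximum number of predicted words. $\tilde{z}$ is a categorical variable mixing all neighbor words. We construct the bag of words $z$ by sampling from $p_\phi(\tilde{z} \mid x)$ $k$ times without replacement. Then we use $z$ as the plan for decoding $y$. The generative process can be written as:
$$z \sim p_\phi(z \mid x) \;\; \text{(sample } k \text{ items from } p_\phi(\tilde{z} \mid x) \text{ without replacement)}, \qquad y \sim p_\theta(y \mid x, z) = \mathrm{dec}_\theta(x, z) \tag{4}$$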
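The mixing and subset sampling described above can be sketched in plain Python. This is a non-differentiable illustration under assumed toy values: the helper names and the four hand-written neighbor distributions are hypothetical, and in the model these distributions would come from softmax layers on the encoder.

```python
import random

def mix_neighbor_distributions(neighbor_probs):
    """Uniformly average categorical distributions, one per (source word, neighbor slot)."""
    count = len(neighbor_probs)
    vocab_size = len(neighbor_probs[0])
    return [sum(p[w] for p in neighbor_probs) / count for w in range(vocab_size)]

def sample_without_replacement(probs, k, rng):
    """Draw k distinct word ids sequentially; remaining weights act as a renormalised categorical."""
    remaining = list(range(len(probs)))
    weights = list(probs)
    chosen = []
    for _ in range(k):
        idx = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(idx))
        weights.pop(idx)
    return chosen

# Toy setting: 2 source words x 2 neighbor slots, vocabulary of 4 word ids.
neighbor_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
mixture = mix_neighbor_distributions(neighbor_probs)
bag = sample_without_replacement(mixture, k=2, rng=random.Random(0))
```

Because each component is a valid distribution and the mixture weights are uniform, the mixture itself sums to one, so it can be sampled directly.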
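For optimization, we minimize the negative log likelihood of $p_\theta(y \mid x, z)$ and $p_\phi(\tilde{z} \mid x)$:
$$\mathcal{L}_{\mathrm{S2S}'} = \mathbb{E}_{(x^\star, y^\star) \sim P^\star,\; z \sim p_\phi(z \mid x^\star)}\left[-\log p_\theta(y^\star \mid x^\star, z)\right] \tag{5}$$
$$\mathcal{L}_{\mathrm{BOW}} = \mathbb{E}_{(x^\star, z^\star) \sim P^\star}\left[-\log p_\phi(z^\star \mid x^\star)\right] \tag{6}$$
$$\mathcal{L}_{\mathrm{tot}} = \mathcal{L}_{\mathrm{S2S}'} + \mathcal{L}_{\mathrm{BOW}} \tag{7}$$
where $P^\star$ is the true distribution of the BOW from the target sentences, and $z^\star$ is a $k$-hot vector representing the target bag of words. One could also view $\mathcal{L}_{\mathrm{BOW}}$ as a regularization of $p_\phi$ using the weak supervision from the target bag of words. Another choice is to view $z$ as completely latent and infer it as in a canonical latent variable model. (In that case, one could consider variational inference (VI) and regularize the variational posterior. As with our relaxation of the distribution of $z$ via $p_\phi(\tilde{z} \mid x)$, certain relaxations over the variational family would also be needed to make the inference tractable. We leave this to future work.) We find that using the target BOW regularization significantly improves the performance and interpretability. $\mathcal{L}_{\mathrm{tot}}$ is the total loss to optimize over the parameters of the encoder, the decoder, and the BOW predictor ($\psi$, $\theta$, $\phi$). Note that for a given source word in a particular training instance, the NLL loss does not penalize the predictions not included in the targets of this instance. This property enables the model to learn different neighbors from different data points, i.e., the learned neighbors will be at a corpus level, rather than a sentence level. We will further demonstrate this property in our experiments.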
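One plausible reading of the BOW loss, shown here only as a sketch, is to sum the negative log probability of every word in the target bag under the predicted categorical distribution. The function `bow_nll` and the toy distributions are assumptions for illustration; the exact form used in the original implementation is not given in this text.

```python
import math

def bow_nll(word_probs, target_bow):
    """Sum of -log p(w) over the word ids w in the target bag of words."""
    return -sum(math.log(word_probs[w]) for w in target_bow)

target_bow = {0, 1}  # hypothetical word ids of the target BOW

# Two predictions that agree on the target words but differ on non-target words.
loss_a = bow_nll([0.5, 0.3, 0.10, 0.10], target_bow)
loss_b = bow_nll([0.5, 0.3, 0.15, 0.05], target_bow)
```

In this sketch `loss_a` equals `loss_b`: only the probabilities of target words enter the loss, so how the remaining mass is spread over non-target words is not directly penalized. This is the property that lets different training instances contribute different neighbors for the same source word.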
As discussed in the previous section, the sampling of $z$ (sample $k$ items from a categorical distribution) is non-differentiable. (One choice could be the score function estimator, but empirically it suffers from high variance.) To back-propagate the gradients through the sampling of $z$ in the decoding loss (equation 5), we choose a reparameterized gradient estimator, which relies on the Gumbel-softmax trick. Specifically, we perform differentiable subset sampling with the Gumbel top-k reparameterization (Kool et al., 2019). Let $\pi_i$ be the probability that the base categorical variable $\tilde{z}$ takes word $i$, for $i \in \{1, \dots, V\}$. We obtain the perturbed weights by:
$$a_i = \log \pi_i + g_i, \qquad g_i \sim \mathrm{Gumbel}(0, 1)$$
Retrieving the $k$ largest weights $\mathrm{topk}(a_1, \dots, a_V)$ gives us $k$ samples without replacement. This process is shown in dashed lines in figure 1. We retrieve the $k$ sampled word embeddings and re-weight them with their probabilities $\pi_i$. We then use the average of the weighted word embeddings as the decoder LSTM's initial state to perform surface realization.
Intuitively, in addition to the sentence code $h$, the decoder also takes the weighted sample word embeddings and attends to them (Bahdanau et al., 2015); thus differentiability is achieved. This generated plan restricts the decoding space towards the bag of words of the target sentences. More details about the network architecture and the parameters are in the appendix. In section 4, we use extensive experiments to demonstrate the effectiveness of our model.
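The Gumbel top-k sampling step and the probability-weighted embedding average can be sketched as follows. This is a forward-pass illustration only, in plain Python with assumed toy probabilities and 2-dimensional embeddings; the function names are hypothetical, and in a real model gradients would reach the BOW predictor through the probability weights inside an automatic differentiation framework.

```python
import math
import random

def gumbel_topk(probs, k, rng):
    """Sample k distinct indices by adding Gumbel(0, 1) noise to log-probabilities."""
    perturbed = []
    for i, p in enumerate(probs):
        u = max(rng.random(), 1e-12)        # guard against log(0)
        g = -math.log(-math.log(u))         # Gumbel(0, 1) via inverse transform
        perturbed.append((math.log(p) + g, i))
    perturbed.sort(reverse=True)            # largest perturbed weights first
    return [i for _, i in perturbed[:k]]

def weighted_plan_embedding(indices, probs, embeddings):
    """Average of the sampled word embeddings, each scaled by its probability."""
    dim = len(embeddings[0])
    return [
        sum(probs[i] * embeddings[i][d] for i in indices) / len(indices)
        for d in range(dim)
    ]

probs = [0.4, 0.3, 0.2, 0.1]                                    # toy BOW distribution
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]   # toy word embeddings
sample = gumbel_topk(probs, k=2, rng=random.Random(0))
plan = weighted_plan_embedding(sample, probs, embeddings)
```

Taking the top k perturbed weights is equivalent in distribution to drawing k items sequentially without replacement, which is why a single sort suffices.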
Paraphrase Generation. Paraphrases capture the essence of language diversity (Pavlick et al., 2015) and often play important roles in many challenging natural language understanding tasks like question answering (Buck et al., 2018; Dong et al., 2017), semantic parsing (Su and Yan, 2017) and machine translation (Cho et al., 2014). Traditional methods generally employ rule-based content planning and surface realization procedures (Bolshakov and Gelbukh, 2004; Narayan et al., 2016; Kauchak and Barzilay, 2006). These methods often rely on WordNet-style (Miller, 1995) word neighbors for selecting substitutions. Our model learns word neighbors without supervision and predicts them on the fly. Recent end-to-end models for paraphrase generation include the attentive seq2seq model (Su and Yan, 2017), the Residual LSTM model (Prakash et al., 2016), the Gaussian VAE model (Gupta et al., 2018), the copy and constrained decoding model (Cao et al., 2017), and the reinforcement learning approach (Li et al., 2018). Our model has connections to the copy and constrained decoding model by Cao et al. (2017). They use an IBM alignment model (Collins, ) to restrict the decoder's search space, which is not differentiable. We use the latent BOW model to guide the decoder and use Gumbel top-k to make the whole generation process differentiable. Compared with previous models, our model learns word neighbors in an unsupervised way and exhibits a differentiable planning and realization process.
Latent Variable Models for Text. Deep latent variable models have been an important recent trend (Kim et al., 2018; Fu, 2018) in text modeling. One common path is to start from a standard VAE with a Gaussian prior (Bowman et al., 2016), which may encounter issues due to posterior collapse (Dieng et al., 2018; He et al., 2019). Multiple approaches have been proposed to control the tradeoff between the inference network and the generative network (Zhao et al., 2018; Xu and Durrett, 2018). In particular, the β-VAE (Higgins et al., 2017) uses a balance parameter β to balance the two models in an intuitive way. This approach will form one of our baselines.
Many discrete aspects of the text may not be captured by a continuous latent variable. To better fit the discrete nature of sentences, with the help of the Gumbel-softmax trick (Maddison et al., 2017; Jang et al., 2017), recent works try to add discrete structures to the latent variable (Ziegler and Rush, 2019; Wiseman et al., 2018; Choi et al., 2018). Our work directly maps the meaning of a discrete latent variable to the bag of words from the target sentences. To achieve this, we utilize the recent differentiable subset sampling (Xie and Ermon, 2019) with the Gumbel top-k (Kool et al., 2019) reparameterization. It has also been noted that the multimodal nature of text can pose challenges for the modeling process (Ziegler and Rush, 2019). Previous works show that mixture models can help (Arora et al., 2017; Yang et al., 2017). In our work, we show the effectiveness of the mixture of softmax for the multimodal bag of words distribution.
Content Planning and Surface Realization. The generation process of natural language can be decomposed into two steps: content planning and surface realization (also called sentence generation) (Moryossef et al., 2019). The seq2seq model (Sutskever et al., 2014) implicitly performs the two steps by encoding the source sentence into an embedding and generating the target sentence with the decoder LSTM. A downside is that this intermediate embedding makes it hard to explicitly control or interpret the generation process (Bahdanau et al., 2015; Moryossef et al., 2019). Previous works have shown that explicit planning before generation can improve the overall performance. Puduppully et al. (2019); Sha et al. (2018); Gehrmann et al. (2018); Liu et al. (2018) embed the planning process into the network architecture. Moryossef et al. (2019) use a rule based model for planning and a neural model for realization. Wiseman et al. (2018) use a latent variable to model the sentence template. Wang et al. (2019) use a latent topic to model the topic BOW, while Ma et al. (2018) use BOW as regularization. Conceptually, our model is similar to Moryossef et al. (2019) as we both perform generation step by step. Our model is also related to Ma et al. (2018). While they use BOW for regularization, we map the meaning of the latent variable to the target BOW, and use the latent variable to guide the generation.
Datasets and Metrics. Following the settings in previous works (Li et al., 2018; Gupta et al., 2018), we use the Quora dataset (https://www.kaggle.com/aymenmouelhi/quora-duplicate-questions) and the MSCOCO dataset (Lin et al., 2014) for our experiments. The MSCOCO dataset was originally developed for image captioning. Each image is associated with 5 different captions. These captions are generally close to each other since they all describe the same image. Although there is no guarantee that the captions are paraphrases, as they may describe different objects in the same image, the overall quality of this dataset is favorable. In our experiments, we use 1 of the 5 captions as the source and all content words from the remaining 4 sentences as our BOW objective (we view nouns, verbs, adverbs, and adjectives as content words, and pronouns, prepositions, conjunctions and punctuation as non-content words). We randomly choose one of the remaining 4 captions as the seq2seq target. The Quora dataset was originally developed for duplicate question detection. Duplicate questions are labeled by human annotators and guaranteed to be paraphrases. In this dataset we only have two sentences for each paraphrase set, so we randomly choose one as the source and the other as the target. After processing, for the Quora dataset, there are 50K training instances and 20K testing instances, and the vocabulary size is 8K. For the MSCOCO dataset, there are 94K training instances and 23K testing instances, and the vocabulary size is 11K. We set the maximum sentence length for the two datasets to be 16. More details about datasets and pre-processing are given in the appendix.
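The MSCOCO instance construction described above can be sketched as follows. The small `NON_CONTENT` word set is a hypothetical stand-in for the part-of-speech filter, and the captions are made up for illustration; the actual pre-processing is described in the appendix and is not reproduced here.

```python
import random

# Hypothetical stand-in for a part-of-speech based content-word filter.
NON_CONTENT = {"a", "an", "the", "on", "in", "of", "and", "is", "with", "."}

def build_instance(captions, source_index, rng):
    """Pick one caption as source, one other as seq2seq target, and pool the rest into a BOW."""
    source = captions[source_index]
    rest = captions[:source_index] + captions[source_index + 1:]
    target_bow = {w for c in rest for w in c.lower().split() if w not in NON_CONTENT}
    target = rng.choice(rest)
    return source, target, target_bow

captions = [
    "a man rides a horse",
    "a person riding a brown horse",
    "the man is on a horse",
]
source, target, target_bow = build_instance(captions, 0, random.Random(0))
```

Pooling the content words of all remaining captions into one set gives a richer BOW objective than the single caption used as the seq2seq target.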
Although the evaluation of text generation can be challenging (Novikova et al., 2017; Liu et al., 2016; Wang et al., 2018), previous works show that matching based metrics like BLEU (Papineni et al., 2002) or ROUGE (Lin, 2004) are suitable for this task as they correlate with human judgment well (Li et al., 2018). We report all lower ngram metrics (1-4 grams in BLEU, 1-2 gram in ROUGE) because these have been shown preferable for short sentences (Li et al., 2018; Liu et al., 2016).
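To make the matching-based metrics concrete, the sketch below computes the clipped n-gram precision that underlies BLEU for a single candidate and a single reference. It is only the precision component, without the brevity penalty or the geometric mean over n-gram orders, and the example sentences are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as often as in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

candidate = "how can i improve my english".split()
reference = "how do i improve my english".split()
p1 = modified_precision(candidate, reference, 1)  # 5 of 6 unigrams match
p2 = modified_precision(candidate, reference, 2)  # 3 of 5 bigrams match
```

For sentences capped at 16 tokens, higher-order n-grams become sparse quickly, which is one intuition for why the lower-order scores are reported.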
We use the seq2seq LSTM with residual connections (Prakash et al., 2016) and attention mechanism (Bahdanau et al., 2015) as our baseline (Residual Seq2seq-Attn). We also use the β-VAE as a baseline generative model and control the β parameter to balance the reconstruction and the recognition networks. Since the VAE models do not utilize the attention mechanism, we also include a vanilla sequence-to-sequence baseline without attention (Seq2seq). It should be noted that although we do not include other SOTA models like the Transformer (Vaswani et al., 2017), the Seq2seq-Attn model is trained with a state size of 500 and 2 stacked LSTM layers, making it a strong baseline that is hard to beat. We also use a hard version of our BOW model (BOW-Hard) as a lower bound, which optimizes the encoder and the decoder separately, and passes no gradient back from the decoder to the encoder. We compare two versions of our latent BOW model: the topk version (LBOW-Topk), which directly chooses the k most probable words from the encoder, and the Gumbel version (LBOW-Gumbel), which samples from the BOW distribution with Gumbel reparameterization, thus injecting randomness into the model. Additionally, we consider a cheating model that is able to see the BOW of the actual target sentences during generation (Cheating BOW). This model can be considered an upper bound for our models. The LBOW models are evaluated on the held-out test set, so they cannot see the target BOW. All of the above models are approximately the same size, so the comparison is fair. In addition, we compare our results with Li et al. (2018). Their model is SOTA on the Quora dataset. The numbers of their model are not directly comparable to ours since they use twice as much data, containing negative samples for inverse reinforcement learning. (They do not release their code, so their detailed data processing likely also differs from ours, making the results not directly comparable.)
Experiments are repeated three times with different random seeds. The average performance is reported. More configuration details are listed in the appendix.
Table 1 (Quora). B-n: BLEU-n; R-n: ROUGE-n; R-L: ROUGE-L.
| Model | B-1 | B-2 | B-3 | B-4 | R-1 | R-2 | R-L |
|---|---|---|---|---|---|---|---|
| Seq2seq (Prakash et al., 2016) | 54.62 | 40.41 | 31.25 | 24.97 | 57.27 | 33.04 | 54.62 |
| Residual Seq2seq-Attn (Prakash et al., 2016) | 54.59 | 40.49 | 31.25 | 24.89 | 57.10 | 32.86 | 54.61 |
| β-VAE, (Higgins et al., 2017) | 43.02 | 28.60 | 20.98 | 16.29 | 41.81 | 21.17 | 40.09 |
| β-VAE, (Higgins et al., 2017) | 47.86 | 33.21 | 24.96 | 19.73 | 47.62 | 25.49 | 45.46 |
| BOW-Hard (lower bound) | 33.40 | 21.18 | 14.43 | 10.36 | 36.08 | 16.23 | 33.77 |
| RbM-SL (Li et al., 2018) | - | 43.54 | - | - | 64.39 | 38.11 | - |
| RbM-IRL (Li et al., 2018) | - | 43.09 | - | - | 64.02 | 37.72 | - |
| Cheating BOW (upper bound) | 72.96 | 61.78 | 54.40 | 49.47 | 72.15 | 52.61 | 68.53 |

Table 1 (MSCOCO).
| Model | B-1 | B-2 | B-3 | B-4 | R-1 | R-2 | R-L |
|---|---|---|---|---|---|---|---|
| Seq2seq (Prakash et al., 2016) | 69.61 | 47.14 | 31.64 | 21.65 | 40.11 | 14.31 | 36.28 |
| Residual Seq2seq-Attn (Prakash et al., 2016) | 71.24 | 49.65 | 34.04 | 23.66 | 41.07 | 15.26 | 37.35 |
| β-VAE, (Higgins et al., 2017) | 68.81 | 45.82 | 30.56 | 20.99 | 39.63 | 13.86 | 35.81 |
| β-VAE, (Higgins et al., 2017) | 70.04 | 47.59 | 32.29 | 22.54 | 40.72 | 14.75 | 36.75 |
| BOW-Hard (lower bound) | 48.14 | 28.35 | 16.25 | 9.28 | 31.66 | 8.30 | 27.37 |
| Cheating BOW (upper bound) | 80.87 | 75.09 | 62.24 | 52.64 | 49.95 | 23.94 | 43.77 |
Table 1 shows the overall performance of all models. Our models perform the best compared with the baselines. The Gumbel version performs slightly worse than the topk version, but they are generally on par. The margins over the Seq2seq-Attn are not that large (roughly 1 BLEU point or more). This is because the capacity of all models is large enough to fit the datasets fairly well. The BOW-Hard model does not perform as well, indicating that the differentiable subset sampling is important for training our discrete latent model. Although not directly comparable, the numbers of the RbM models are higher than ours since they are SOTA models on Quora. But they are still not as high as those of the Cheating BOW, which is consistent with our analysis. The Cheating BOW outperforms all other models by a large margin thanks to the leaked BOW information from the target sentences. This shows that the Cheating BOW is indeed a meaningful upper bound and that the accuracy of the predicted BOW is essential for an effective decoding process. Additionally, we notice that the VAEs are not as good as the vanilla Seq2seq models. Our conjecture is that it is difficult to find a good balance between the latent code and the generative model. In comparison, our model directly grounds the meaning of the latent variable to be the bag of words from the target sentences. In the next section, we show that this approach further induces the unsupervised learning of word neighbors and the interpretable generation stages.
Figure 2 shows the planning and realization stages of our model. Given a source sentence, it first generates the word neighbors, samples from the generated BOW (planning), and generates the sentence (realization). In addition to the statistical interpretability, our model shows clear linguistic interpretability. Compared to the vanilla seq2seq model, the interpretability comes from (1) the unsupervised learning of word neighbors and (2) the step-by-step generation process.
Unsupervised Learning of Word Neighbors. As highlighted in Figure 2, we notice that the model discovers multiple types of lexical semantics among word neighbors, including: (1) word morphology, e.g., speak - speaking - spoken; (2) synonymy, e.g., big - large, racket - racquet; (3) entailment, e.g., improve - english; (4) metonymy, e.g., search - googling. (Informally, if A is a metonym of B, then A is a stand-in for B, e.g., the White House - the US government; Google - search engines; Facebook - social media.) The identity mapping is also learned (e.g., blue - blue) since all words are neighbors to themselves.
The model can learn these relations because, even without explicit alignment, words from the target sentences are semantically close to the source words. The mixture model drops the order information of the source words and effectively matches the predicted word set to the BOW from the target sentences. The most prominent word-neighbor relations are back-propagated to words in the source sentences during optimization. Consequently, the model discovers the word similarity structures.
The Generation Steps. A generation plan is formulated by the sampling procedure from the BOW prediction. Consequently, an accurate prediction of the BOW is essential for guiding the decoder, as is demonstrated by the Cheating BOW model in the previous section. The decoder then performs surface realization based on the plan. During realization, the source of word choice comes from (1). the plan (2). the decoder’s language model. As we can see from the second example in Figure 2, the planned words include english, speak, improve, and the decoder generates other necessary words like how, i, my from its language model to connect the plan words, forming the output: how can i improve my english speaking? In the next section, we quantitatively analyze the performance of the BOW prediction and investigate how it is utilized by the decoder.
Distributional Coverage. We first verify the effectiveness of our model for the multimodal BOW distribution. Figure 3 (left) shows the number of learned modes during the training process, compared with the number of target modes (the number of words in the target sentences). For a single categorical variable in our model, if the largest probability in the softmax is greater than 0.5, we define it as a discovered mode. The figure shows an increasing trend in mode discovery. In the MSCOCO dataset, after convergence, the number of discovered modes is less than the number of target modes, while in the Quora dataset, the model learns more modes than the target. This difference comes from two aspects of these datasets. First, the MSCOCO dataset has more target sentences (4 sentences) than the Quora dataset (1 sentence), which is intuitively harder to cover. Second, the MSCOCO dataset is noisier because the sentences are not guaranteed to be paraphrases, so the words in the target sentences might not be as strongly correlated with the source. For the Quora dataset, since the NLL loss does not penalize modes not in the label, the model can discover the neighbors of a word from different contexts in multiple training instances. In Figure 3 (lower right), word neighbors like pokemon-manaply and much-spending are not in the target sentence; they are generalized from other instances in the training set. In fact, this property of the NLL loss allows the model to learn corpus-level word similarity (instead of sentence-level), and results in more predicted word neighbors than the BOW from one particular target sentence.
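The mode-discovery criterion above (largest softmax probability greater than 0.5) can be expressed as a short sketch. The function name and the toy distributions are illustrative assumptions.

```python
def count_discovered_modes(neighbor_probs, threshold=0.5):
    """Count categorical variables whose largest probability exceeds the threshold."""
    return sum(1 for probs in neighbor_probs if max(probs) > threshold)

# Three neighbor distributions over a toy 3-word vocabulary.
neighbor_probs = [
    [0.7, 0.2, 0.1],   # confident: counts as a discovered mode
    [0.4, 0.3, 0.3],   # diffuse: does not count
    [0.1, 0.1, 0.8],   # confident: counts as a discovered mode
]
num_modes = count_discovered_modes(neighbor_probs)
```

Tracking this count over training steps and comparing it with the number of words in the target sentences gives the kind of curve described for Figure 3 (left).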
BOW Prediction Performance and Utilization. As shown in Figure 3 (right), the precision and recall of the BOW prediction are not very high (39+ recall for MSCOCO, 46+ precision for Quora). The supports of the precision and recall correspond to the numbers of predicted and target modes, respectively, in the left figure. We notice that the decoder heavily utilizes the predicted words, since more than 50% of the decoder's word choices come from the BOW. If the encoder's prediction were more accurate, the decoder's search space would be more effectively restricted to the target space. This is why leaking the BOW information from the target sentences results in the best BLEU and ROUGE scores in Table 1. However, although not perfect, the additional information from the encoder still provides meaningful guidance and improves the decoder's overall performance. Furthermore, our model is orthogonal to other techniques, such as conditioning each of the decoder's inputs on the average BOW embedding (BOW emb) or the copy mechanism (Gu et al., 2016) (copy). When we integrate our model with such techniques that better exploit the BOW information, we see consistent performance improvement (Figure 4 left).
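The two quantities discussed above, the precision and recall of the predicted BOW and the share of output words that come from it, can be sketched with set operations. The helper names and the example word sets are illustrative assumptions loosely based on the Figure 2 example, not measured values.

```python
def bow_precision_recall(predicted, target):
    """Set-level precision and recall of a predicted BOW against a target BOW."""
    predicted, target = set(predicted), set(target)
    overlap = len(predicted & target)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(target) if target else 0.0
    return precision, recall

def bow_utilization(output_tokens, predicted):
    """Fraction of generated tokens that appear in the predicted BOW."""
    predicted = set(predicted)
    if not output_tokens:
        return 0.0
    return sum(1 for tok in output_tokens if tok in predicted) / len(output_tokens)

predicted_bow = {"english", "speak", "improve", "learn"}
target_bow = {"improve", "english", "speaking"}
precision, recall = bow_precision_recall(predicted_bow, target_bow)

output = "how can i improve my english speaking".split()
utilization = bow_utilization(output, predicted_bow)
```

Here precision is measured over the predicted words and recall over the target words, matching the supports described for the two curves.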
One advantage of latent variable models is that they allow us to control the final output from the latent code. Figure 4 shows this property of our model. While the interpolation in previous Gaussian VAEs (Kingma and Welling, 2013; Bowman et al., 2016) can only be interpreted as the arithmetic of latent vectors from a geometric perspective, our discrete version of interpolation can be interpreted from a lexical semantics perspective: adding, deleting, or changing certain words. In the first example, the word sitting is changed to riding, and one additional word, road, is added. As a result, the final sentence changes from man … sitting … to man riding … on … road. The second example is another addition example, where holding and picture are added to the sentence. Although not quite stable in our experiments (the model does not guarantee that the input words appear in the output; to add such constraints one could consider constrained decoding like Grid Beam Search (Hokamp and Liu, 2017)), this property may enable further applications in lexically controllable text generation.
The latent BOW model serves as a bridge between the latent variable models and the planning-and-realization models. The interpretability comes from the clear generation stages, while the performance improvement comes from the guidance of the sampled bag of words plan. Although the model is effective, we find that the decoder relies heavily on the BOW prediction, yet the prediction is not very accurate. On the other hand, when BOW information from the target sentences is leaked, the decoder can achieve significantly higher performance. This indicates that a future direction is to improve the BOW prediction to better restrict the decoder's search space. Overall, the step-by-step generation process serves as a move towards more interpretable generative models, and it opens new possibilities for controllable realization by directly injecting lexical information into the middle stages of surface realization.
We thank the reviewers for their detailed feedback and suggestions. We thank Luhuan Wu and Yang Liu for the meaningful discussions. This research is supported by China Scholarship Council, Sloan Fellowship, McKnight Fellowship, NIH, and NSF.
Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 224–232. JMLR. org, 2017.
Thirty-First AAAI Conference on Artificial Intelligence, 2017.
Bag-of-words as target for neural machine translation. In ACL, 2018.
Why we need new evaluation metrics for NLG. In EMNLP, 2017.