Learning Sparse Prototypes for Text Generation

06/29/2020 ∙ by Junxian He, et al. ∙ Carnegie Mellon University ∙ University of California, San Diego

Prototype-driven text generation uses non-parametric models that first choose from a library of sentence "prototypes" and then modify the prototype to generate the output text. While effective, these methods are inefficient at test time as a result of needing to store and index the entire training corpus. Further, existing methods often require heuristics to identify which prototypes to reference at training time. In this paper, we propose a novel generative model that automatically learns a sparse prototype support set that, nonetheless, achieves strong language modeling performance. This is achieved by (1) imposing a sparsity-inducing prior on the prototype selection distribution, and (2) utilizing amortized variational inference to learn a prototype retrieval function. In experiments, our model outperforms previous prototype-driven language models while achieving up to a 1000x memory reduction, as well as a 1000x speed-up at test time. More interestingly, we show that the learned prototypes are able to capture semantics and syntax at different granularity as we vary the sparsity of prototype selection, and that certain sentence attributes can be controlled by specifying the prototype for generation.




1 Introduction

Language models (LMs) predict a probability distribution over text, and are a fundamental technology widely studied in the natural language processing (NLP) community (Bengio et al., 2003; Merity et al., 2018; Dai et al., 2019). Modern LMs are almost exclusively based on parametric recurrent (Mikolov et al., 2010; Sundermeyer et al., 2012) or self-attentional (Vaswani et al., 2017; Al-Rfou et al., 2019) neural networks. These models are of interest scientifically as one of the purest tests of our ability to capture the intricacies of human language mathematically (Linzen et al., 2016; Kuncoro et al., 2017; Petroni et al., 2019). They also have broad downstream applications in generating text in systems such as machine translation (Bahdanau et al., 2015), summarization (Rush et al., 2015), or dialog generation (Sordoni et al., 2015), as well as in the unsupervised representation learners that now power many applications in NLP (Devlin et al., 2018; Liu et al., 2019; Yang et al., 2019).

However, there has been a recent move towards non-parametric neural LMs (Guu et al., 2018) that generate sentences in a two-step process of (1) selecting a prototype sentence from a large database, and (2) editing this prototype to the final desired output. Intuitively, these methods are attractive because they help remove some of the pressure on the parametric model to memorize the entirety of the language it must model. These intuitive advantages are also borne out in superior performance on language modeling tasks (Guu et al., 2018), as well as downstream applications such as dialogue response generation (Weston et al., 2018; Wu et al., 2019), style transfer (Li et al., 2018), and code generation (Hashimoto et al., 2018; Hayati et al., 2018). In addition, the prototypes and continuous representations of the edits in such models lend an element of interpretability to the modeling process. On the downside, however, previous prototype-driven generation methods usually need to store and index a large prototype candidate pool (in general the whole training dataset), leading to significant issues with memory and speed efficiency at test time.

In this paper, we hypothesize that, in fact, a small set of prototypes is sufficient to achieve the great majority of the gains afforded by such non-parametric models. Intuitively, in a large corpus many sentences look very similar and may be represented by minor transformations of a single prototype sentence. For example, the sentence “I ordered a burger with fries” can serve as the prototype for data samples with the form “I ordered [NOUN PHRASE] with [NOUN PHRASE]”. This is evidenced by the observation of Guu et al. (2018) that 70% of the test set in the Yelp restaurant review corpus (Yelp, 2017) is within word-token Jaccard distance 0.5 of one training sentence.
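To make the similarity measure concrete, the following is a minimal sketch of word-token Jaccard distance between two sentences. The lowercasing and whitespace tokenization are assumptions made for illustration; the preprocessing used by Guu et al. (2018) is not described here and may differ.

```python
def jaccard_distance(sentence_a: str, sentence_b: str) -> float:
    """Word-token Jaccard distance: 1 - |A & B| / |A | B| over token sets."""
    # Assumption: lowercase + whitespace tokenization, chosen only for illustration.
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    union = tokens_a | tokens_b
    if not union:  # two empty sentences are treated as identical
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(union)


prototype = "I ordered a burger with fries"
example = "I ordered a salad with fries"
distance = jaccard_distance(prototype, example)
print(f"{distance:.3f}")  # 5 shared tokens out of 7 distinct ones -> 0.286
```

Under this definition the two sentences above are well within distance 0.5 of each other, which is the sense in which a single prototype can cover many near-duplicate sentences.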

To take advantage of this intuition, we propose a novel generative model that samples prototypes from a latent prototype distribution, which itself is sampled from a symmetric Dirichlet prior, as shown in Figure 1 (Section 3.1). The Dirichlet prior with appropriate hyperparameters is able to encourage a sparse prototype selection distribution, allowing us to reduce the prototype support set at test time to greatly improve efficiency. Moreover, we utilize amortized variational inference (Kingma and Welling, 2013) to train our model, which introduces a learnable prototype retriever to identify prototypes useful for generating each sentence (Section 3.2). This is different from Guu et al. (2018), where prototypes for each sentence are fixed before training through edit distance heuristics.

We evaluate our approach on the MSCOCO (Lin et al., 2014) and Yelp restaurant review (Yelp, 2017) corpora. Our method is able to improve perplexity over the neural language model baseline by up to 14 points and over the previous neural editor model by 6 points while achieving over 1000x memory savings and a 1000x speedup at test time (Section 4.2). Interestingly, we find that the learned prototypes are able to represent different features when varying sparsity levels: a strong sparsity prior forces the model to share prototypes, and the induced prototypes turn out to represent more generic features (e.g. the syntactic form of the sentence). On the text generation side, our model is able to generate sentences that resemble the given prototype while allowing for smooth interpolation on the edit space as well (Section 4.3).


2 Background

Figure 1: Left: the proposed generative model to generate data by editing prototypes. Shaded circles denote the observed variables and unshaded circles denote the latents. Prototypes are sampled from a sparse prototype distribution which itself is a random variable sampled from a Dirichlet prior distribution. Right: the inference diagram of the model, with $q(t \mid x)$ being the prototype retriever and $q(z \mid t, x)$ being the inverse editor.

The prototype-then-edit framework defines a non-parametric way to augment text generation models. Formally, given a corpus $X = \{x_n\}_{n=1}^{N}$ (below, we sometimes ignore the subscript to simplify notation when there is no confusion), the model generates each observed example $x_n$ by: (1) retrieving a prototype sequence $t_n$, (2) generating a continuous edit representation $z_n$, and (3) generating $x_n$ conditioned on $t_n$ and $z_n$. These intermediate prototype and edit representation variables can depend on extra context in conditional generation tasks (Hodosh et al., 2013; Gu et al., 2018), or are randomly sampled in unconditioned language modeling. In this paper, we focus on the latter, but our methods could likely be applied to the former as well.

For unconditioned language modeling, Guu et al. (2018) define the data likelihood as:

$$p(X) = \prod_{x_n \in X} \sum_{t_n \in X} \int_{z_n} p(z_n)\, p(t_n)\, p_\gamma(x_n \mid t_n, z_n)\, dz_n \tag{1}$$

where $p(t)$ is the prior distribution over prototypes, defined as a uniform distribution over all training examples, $p(z)$ is a continuous distribution over the edit vector, and $p_\gamma(x \mid t, z)$ represents a sequence-to-sequence model parameterized by $\gamma$. The stated goal of this prototype-driven model in Guu et al. (2018) is to take direct advantage of training examples to improve language modeling performance while capturing interpretable semantic or syntactic properties in the latent prototype and edit vector variables. However, because the prototypes are selected from the entire training dataset, such a formulation sacrifices memory and speed efficiency due to the necessity of indexing and searching every training example at test time. In the following section, we detail our approach to mitigate this issue through the learning of sparse prototypes.

3 Method

First we present our proposed generative model, then we describe the learning and inference techniques for this model class.

3.1 Model Structure

In the previous formulation, Eq. 1, maintaining the entire training dataset at test time is necessary because a uniform prior over prototypes $p(t)$ is assumed. Motivated by the hypothesis that a much smaller prototype set would suffice to achieve comparable performance, however, we believe that $p(t)$ can be a sparse distribution where the probability mass concentrates on only a few representative prototypes. Since it is unknown in advance which training examples are representative as prototypes, we propose to model the prototype distribution as a latent variable, endowing the model with freedom to infer a sparse prototype posterior automatically. We define $p(t = x_k \mid \theta) = \theta_k$ and further assume that the latent prototype distribution $\theta$ is sampled from a prior distribution $p_\alpha(\theta)$ (detailed below) which is able to encourage a sparse probability distribution, given appropriate hyperparameters. The graphical model is depicted in Figure 1, which gives the following joint likelihood:

$$p(X, \{t_n, z_n\}_{n=1}^{N}, \theta; \alpha) = p_\alpha(\theta) \prod_{n=1}^{N} p(t_n \mid \theta)\, p(z_n)\, p_\gamma(x_n \mid t_n, z_n) \tag{2}$$

The log marginal likelihood of the data, which we will approximate during training, is:

$$\log p(X; \alpha) = \log \int_{\theta} p_\alpha(\theta) \prod_{n=1}^{N} \sum_{t_n} p(t_n \mid \theta) \int_{z_n} p(z_n)\, p_\gamma(x_n \mid t_n, z_n)\, dz_n\, d\theta \tag{3}$$
This is a general framework for learning sparse prototypes, and in this work we specifically experiment with the following parameterization to instantiate this model class:

Prior over prototype distribution $p_\alpha(\theta)$:

We employ the Dirichlet distribution as $p_\alpha(\theta)$: $p_\alpha(\theta) = \mathrm{Dir}(\theta; \alpha)$. The support of the Dirichlet distribution is the standard probability simplex. Here we use the symmetric Dirichlet distribution, which has the same $\alpha$ value for all components, since we have no prior knowledge favoring one component over another. $\alpha$ is positive and also referred to as the concentration parameter, where smaller $\alpha$ prefers a sparser prototype distribution $\theta$, with $\alpha = 1$ equivalent to a uniform distribution over the probability simplex. In our experiments, we often choose $\alpha < 1$ to encourage sparsity.
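The effect of the concentration parameter can be seen in a small simulation. The sketch below draws from a symmetric Dirichlet by normalizing independent Gamma draws (a standard construction) and counts how many of the largest components are needed to cover 90% of the mass. The dimension of 1000, the two concentration values, and the 0.9 threshold are illustrative assumptions, not settings taken from the paper.

```python
import random


def sample_symmetric_dirichlet(num_components: int, alpha: float, rng: random.Random) -> list:
    """Draw theta ~ Dir(alpha, ..., alpha) by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_components)]
    total = sum(draws)
    return [d / total for d in draws]


def components_for_mass(theta: list, mass: float) -> int:
    """Smallest number of largest components whose probabilities reach `mass`."""
    running = 0.0
    for count, prob in enumerate(sorted(theta, reverse=True), start=1):
        running += prob
        if running >= mass:
            return count
    return len(theta)


rng = random.Random(0)
sparse_theta = sample_symmetric_dirichlet(1000, alpha=0.1, rng=rng)   # illustrative alpha < 1
dense_theta = sample_symmetric_dirichlet(1000, alpha=10.0, rng=rng)   # illustrative alpha > 1
sparse_count = components_for_mass(sparse_theta, 0.9)
dense_count = components_for_mass(dense_theta, 0.9)
print(sparse_count, dense_count)  # the small-alpha draw needs far fewer components
```

A draw with a small concentration value concentrates most of its mass on a small subset of components, which is the behavior the prior is meant to encourage in the prototype distribution.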

Prior over edit vector $p(z)$:

We follow Guu et al. (2018) and utilize a von Mises-Fisher (vMF) distribution to model $p(z)$. The vMF distribution places mass on the surface of the unit sphere, and is parameterized by the mean direction vector $\mu$ and concentration parameter $\kappa$ as $\mathrm{vMF}(\cdot \mid \mu, \kappa)$. Thus, information about the edit is captured through the directions of different unit vector samples. Xu and Durrett (2018) show that the vMF distribution has the advantage of being free of the posterior collapse issue that plagues a large amount of previous work in latent variable models of text (Bowman et al., 2016; He et al., 2019). While Guu et al. (2018) add additional randomness on the norm of edit vectors by multiplying the vMF distribution with another uniform distribution, we sample edit vectors from a uniform vMF distribution directly, which simplifies the model and which we nonetheless found sufficient to obtain competitive results. Formally, we define $p(z) = \mathrm{vMF}(z \mid \cdot, 0)$.
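A vMF distribution with zero concentration is the uniform distribution on the unit sphere, so sampling from the prior needs no vMF-specific machinery. A minimal sketch, assuming the standard trick of normalizing an isotropic Gaussian draw; the dimension of 8 is an arbitrary illustrative choice.

```python
import math
import random


def sample_uniform_unit_vector(dim: int, rng: random.Random) -> list:
    """Uniform sample on the unit sphere: normalize an isotropic Gaussian draw."""
    while True:
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 0.0:  # guard against the (measure-zero) all-zero draw
            return [v / norm for v in vec]


rng = random.Random(0)
z = sample_uniform_unit_vector(dim=8, rng=rng)  # dim=8 is illustrative only
z_norm = math.sqrt(sum(v * v for v in z))
print(round(z_norm, 6))
```

Because every sample has unit norm, only its direction can carry information about the edit, in line with the description above.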

The editor $p_\gamma(x \mid t, z)$:

Generally $p_\gamma(x \mid t, z)$ can be parameterized by any standard Seq2Seq model with the edit vector $z$ incorporated. To compare with Guu et al. (2018) directly in the experiments, in this work we adopt the same attentional LSTM architecture (Hochreiter and Schmidhuber, 1997). $z$ is used to predict the initial hidden state of the decoder and is concatenated to the decoder input.

3.2 Learning and Inference

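Ideally the log marginal likelihood in Eq. 3 should be optimized during training. However, its computation is intractable due to the marginalization of latent variables, and we resort to amortized variational inference (Kingma and Welling, 2013), optimizing its evidence lower bound (ELBO) instead:

$$\log p(X; \alpha) \ge \mathcal{L}_{\mathrm{ELBO}} = \sum_{n=1}^{N} \Big\{ \mathbb{E}_{q(t_n \mid x_n)\, q(z_n \mid t_n, x_n)}\big[\log p_\gamma(x_n \mid t_n, z_n)\big] - \mathbb{E}_{q(t_n \mid x_n)}\big[D_{\mathrm{KL}}\big(q(z_n \mid t_n, x_n)\,\|\,p(z_n)\big)\big] - \mathbb{E}_{q_\lambda(\theta \mid X)}\big[D_{\mathrm{KL}}\big(q(t_n \mid x_n)\,\|\,p(t_n \mid \theta)\big)\big] \Big\} - D_{\mathrm{KL}}\big(q_\lambda(\theta \mid X)\,\|\,p_\alpha(\theta)\big) \tag{4}$$

where $q(\theta, \{t_n, z_n\}_{n=1}^{N} \mid X)$ represents the variational distribution that approximates the model posterior distribution and admits the following factorization:

$$q(\theta, \{t_n, z_n\}_{n=1}^{N} \mid X) = q_\lambda(\theta \mid X) \prod_{n=1}^{N} q_{\phi_{t|x}}(t_n \mid x_n)\, q_{\phi_{z|t,x}}(z_n \mid t_n, x_n) \tag{5}$$

Note that we make a conditional independence assumption between $\theta$ and the other latent variables in $q$ to simplify the approximate posterior, following common practice in traditional mean field variational inference. The inference diagram is depicted in Figure 1. The optimal $q_\lambda(\theta \mid X)$ to maximize Eq. 4 is a Dirichlet distribution parameterized by $\lambda \in \mathbb{R}^{N}$ (the proof is in Appendix A), i.e., $q_\lambda(\theta \mid X) = \mathrm{Dir}(\lambda)$; unlike the prior, $q_\lambda(\theta \mid X)$ is not symmetric and $\lambda$ is a vector. The prototype retriever $q_{\phi_{t|x}}(t \mid x)$ is a multinomial distribution over training examples parameterized by a neural network $\phi_{t|x}$. We assume that $q_{\phi_{z|t,x}}(z \mid t, x)$, the inverse neural editor, is a vMF distribution whose mean direction parameter is produced by an encoder that encodes $t$ and $x$ and is parameterized by $\phi_{z|t,x}$, and whose scalar concentration parameter $\kappa$ is a hyperparameter. Pre-fixing $\kappa$ results in a constant KL divergence term associated with $z$ and proves to be effective in mitigating the posterior collapse issue (Xu and Durrett, 2018), where $x$ and $z$ become independent. There might still be posterior collapse on $t$, however, so in practice we follow Li et al. (2019) and combine annealing (Bowman et al., 2016) and free-bits techniques (Kingma et al., 2016) on the corresponding KL term to mitigate this issue.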

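Notably, the variational distribution family defined in Eq. 5 admits tractable closed-form expressions of all three KL divergence terms in Eq. 4 (detailed derivations and expressions are in Appendix A). To compute the reconstruction log likelihood $\mathbb{E}_{q(t \mid x)\, q(z \mid t, x)}[\log p_\gamma(x \mid t, z)]$, expectations over $z$ can be efficiently approximated by the reparameterization trick for the vMF distribution (Guu et al., 2018). However, the prototype $t$ is discrete and non-differentiable, and summing over all prototypes to compute the expectation exactly is infeasible due to the evaluation burden of $\log p_\gamma(x \mid t, z)$. Thus, we use the REINFORCE algorithm (Williams, 1992) to compute the gradients of the reconstruction term with respect to $\phi_{t|x}$ as:

$$\frac{\partial \mathcal{L}}{\partial \phi_{t|x}} \approx \frac{1}{L} \sum_{l=1}^{L} \Big( \mathbb{E}_{q(z \mid t^{(l)}, x)}\big[\log p_\gamma(x \mid t^{(l)}, z)\big] - b \Big) \frac{\partial \log q(t^{(l)} \mid x)}{\partial \phi_{t|x}} \tag{6}$$

where $t^{(l)}$ are samples from $q(t \mid x)$. We use the average reward from the $L$ samples as the baseline $b$. The neural parameters $\gamma$, $\phi_{t|x}$, and $\phi_{z|t,x}$ are updated with stochastic gradient descent to maximize Eq. 4.

With respect to the posterior Dirichlet parameter $\lambda$, we found in preliminary experiments that classic gradient descent was unable to update it effectively: $\lambda$ was updated too slowly and the Dirichlet prior became decoupled from the model. Thus, we instead update $\lambda$ with stochastic variational inference (SVI; Hoffman et al., 2013) based on the formula of the optimal $\lambda^{*}$ given $q(t \mid x)$ (derivations can be found in Appendix A):

$$\lambda_k^{*} = \alpha + \sum_{n=1}^{N} q(t_n = x_k \mid x_n) \tag{7}$$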
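It is infeasibly expensive to keep $\lambda$ optimal under the current $q(t \mid x)$ at each training step, as it would involve summing over all training examples. Thus we perform SVI, which uses a batch of examples to approximate Eq. 7, leading to the following update form:

$$\lambda_k^{(s)} = (1 - \rho_s)\, \lambda_k^{(s-1)} + \rho_s \Big( \alpha + \frac{N}{B} \sum_{n=1}^{B} q(t_n = x_k \mid x_n) \Big), \qquad \rho_s = (s + \omega)^{-\eta} \tag{8}$$

where $B$ is the batch size, $\rho_s$ is the step size at iteration $s$, $\eta$ is the forgetting rate, and $\omega$ is the delay parameter that down-weights early iterations.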
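The batch update can be sketched in a few lines. The code below assumes the standard stochastic variational inference form: the batch sum of retriever probabilities is rescaled by the ratio of corpus size to batch size, the prior concentration is added, and the result is blended with the previous value using a decaying step size. Function and argument names are hypothetical, and the numbers are toy values.

```python
def svi_step_size(iteration: int, delay: float, forgetting_rate: float) -> float:
    """Decaying step size: (iteration + delay) ** (-forgetting_rate)."""
    return (iteration + delay) ** (-forgetting_rate)


def svi_update(lam, alpha, batch_responsibilities, num_train, step_size):
    """One stochastic update of the Dirichlet posterior parameters.

    batch_responsibilities[n][k] is the retriever probability that the n-th
    batch example selects the k-th candidate prototype.
    """
    batch_size = len(batch_responsibilities)
    scale = num_train / batch_size  # rescale the batch sum to the full corpus
    new_lam = []
    for k, old in enumerate(lam):
        batch_sum = sum(row[k] for row in batch_responsibilities)
        target = alpha + scale * batch_sum  # noisy estimate of the optimal value
        new_lam.append((1.0 - step_size) * old + step_size * target)
    return new_lam


lam = [1.0, 1.0]
batch_q = [[1.0, 0.0], [0.5, 0.5]]  # two batch examples, two candidate prototypes
updated = svi_update(lam, alpha=0.5, batch_responsibilities=batch_q,
                     num_train=4, step_size=0.5)
print(updated)  # [2.25, 1.25]
```

In the toy run, the first prototype receives more retriever mass in the batch, so its posterior parameter grows faster than the second one.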

We note that our training algorithm is different from that of Guu et al. (2018) in that we use a learnable prototype retriever $q(t \mid x)$ to derive a lower bound as the objective, while Guu et al. (2018) directly approximate the marginalization over $t$. They use heuristics to fix the prototype set for each $x$ to be examples similar to $x$ in terms of edit distance, which might produce suboptimal prototypes for the generative model and also does not permit the learning of a sparse prototype support.
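The score-function (REINFORCE) estimator with an average-reward baseline that trains the retriever can be illustrated on a toy categorical distribution. This is a sketch under stated assumptions: the retriever is reduced to a softmax over free logits, and a fixed hypothetical reward table stands in for the reconstruction log-likelihood of each prototype. It is not the authors' implementation.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, reward_fn, num_samples, rng):
    """Score-function estimate of d E_q[reward] / d logits with a mean baseline."""
    probs = softmax(logits)
    samples = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    rewards = [reward_fn(t) for t in samples]
    baseline = sum(rewards) / len(rewards)  # average reward over the samples
    grad = [0.0] * len(logits)
    for t, reward in zip(samples, rewards):
        for k in range(len(logits)):
            # d log softmax(logits)[t] / d logits[k] = 1[k == t] - probs[k]
            score = (1.0 if k == t else 0.0) - probs[k]
            grad[k] += (reward - baseline) * score / num_samples
    return grad


# Hypothetical rewards standing in for per-prototype reconstruction log-likelihoods.
toy_rewards = [-5.0, -1.0, -3.0]
grad = reinforce_gradient([0.0, 0.0, 0.0], lambda t: toy_rewards[t],
                          num_samples=10, rng=random.Random(0))
print(grad)
```

Subtracting the baseline lowers the variance of the estimate; computing it from the same samples, as done here for brevity, introduces a small bias that a leave-one-out baseline would avoid.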

Sparsity and scalability:

After training we expect to be able to infer a sparse prototype distribution $\theta$ with most components being almost zero, based on which we can prune and store only the entries over a particular probability threshold, improving memory and time efficiency at test time. Specifically, we compute the mean of $\theta$ under the Dirichlet posterior, $\mathbb{E}_{q_\lambda(\theta \mid X)}[\theta_k] = \lambda_k / \sum_i \lambda_i$, and then take the $M$ largest entries that together occupy a fixed share of the probability mass. At test time, we only maintain these $M$ prototypes and the prototype retriever $q(t \mid x)$ is re-normalized accordingly. One issue present during training is that $q(t \mid x)$ cannot fit into memory when dealing with large datasets, since it is a categorical distribution over all training examples. In this work, we randomly downsample a subset of the training data as our prototype library before training if memory is unable to fit all training examples, and learn the sparse prototypes on top of this subsampled corpus. This acts like a rough pre-filtering, and in Section 4 we show that it suffices to learn good prototypes and achieve competitive language modeling performance. We leave more advanced techniques to address this issue (e.g. dynamically updating the prototype library) as future work.
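The pruning step can be sketched as follows: compute the Dirichlet posterior mean, keep the highest-mean prototypes until a target share of the mass is covered, and renormalize the retriever over the survivors. The function names, the toy posterior parameters, and the 0.85 threshold are all illustrative assumptions.

```python
def prune_prototypes(lam, mass):
    """Indices of the highest-mean prototypes that together cover at least `mass`."""
    total = sum(lam)
    mean_theta = [value / total for value in lam]  # Dirichlet mean: lam_k / sum(lam)
    order = sorted(range(len(lam)), key=lambda k: mean_theta[k], reverse=True)
    kept, running = [], 0.0
    for k in order:
        kept.append(k)
        running += mean_theta[k]
        if running >= mass:
            break
    return kept


def renormalize(retriever_probs, kept):
    """Restrict a retriever distribution to the kept prototypes and renormalize it."""
    total = sum(retriever_probs[k] for k in kept)
    return {k: retriever_probs[k] / total for k in kept}


lam = [8.0, 1.0, 0.5, 0.5]               # toy posterior Dirichlet parameters
kept = prune_prototypes(lam, mass=0.85)  # 0.85 is an illustrative threshold
pruned_q = renormalize([0.6, 0.2, 0.1, 0.1], kept)
print(kept, pruned_q)
```

Only the kept prototypes need to be stored and scored at test time, which is where the memory and speed savings come from.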


Figure 2: Example of aligned sequences.

We now describe the neural architectures we use for the prototype retriever $q(t \mid x)$ and the inverse neural editor $q(z \mid t, x)$. $q(t \mid x)$ is defined as:

$$q(t = x_k \mid x) \propto \begin{cases} \exp\big(h(x_k, x) / \tau\big) & \text{if } x_k \neq x \\ 0 & \text{otherwise} \end{cases}, \qquad h(x_k, x) = \mathrm{Embed}(x_k)^{\top} W\, \mathrm{Embed}(x)$$

where we prevent selecting the data example itself as the prototype during training to avoid overfitting. $\mathrm{Embed}(\cdot)$ is a pretrained sentence encoder, $W$ is a linear transformation matrix, and $\tau$ is a temperature hyperparameter that controls the entropy of $q(t \mid x)$, which is critical to stabilize the REINFORCE algorithm at the initial training stage. To ease the computation of encoding all training examples at each update step, we fix the parameters of $\mathrm{Embed}(\cdot)$ and update $W$ only, which proves to be sufficient in our experiments. We use the average embeddings of the last layer in pretrained BERT (Devlin et al., 2018) as our sentence encoder; specifically, we use the pretrained uncased BERT base model from the transformers library (Wolf et al., 2019).
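A minimal sketch of such a retriever follows, assuming a bilinear score between fixed sentence embeddings and a temperature-scaled softmax that masks out the query's own entry. The 2-d vectors and identity matrix are toy stand-ins for a frozen pretrained encoder and the learned matrix; all names are hypothetical.

```python
import math


def bilinear_score(candidate_vec, weight, query_vec):
    """Compute candidate^T W query for plain Python lists."""
    projected = [sum(w * q for w, q in zip(row, query_vec)) for row in weight]
    return sum(c * p for c, p in zip(candidate_vec, projected))


def retriever_distribution(candidate_vecs, weight, query_vec, temperature, self_index=None):
    """Temperature softmax over candidates, with the query's own index masked out."""
    scores = [bilinear_score(c, weight, query_vec) / temperature for c in candidate_vecs]
    valid = [k for k in range(len(scores)) if k != self_index]
    peak = max(scores[k] for k in valid)  # subtract the max for numerical stability
    exps = {k: math.exp(scores[k] - peak) for k in valid}
    total = sum(exps.values())
    return [exps[k] / total if k in exps else 0.0 for k in range(len(scores))]


# Toy 2-d "sentence embeddings"; a real system would use a frozen pretrained encoder.
library = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
query = library[2]  # the query is itself in the library, so index 2 is masked
soft = retriever_distribution(library, identity, query, temperature=1.0, self_index=2)
sharp = retriever_distribution(library, identity, query, temperature=0.25, self_index=2)
print(soft, sharp)
```

Lowering the temperature sharpens the distribution toward the best-scoring prototype, while a higher temperature keeps the entropy up, which matters for the sampled gradient estimates early in training.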

While Guu et al. (2018) use the sum of inserted/deleted word vectors as the mean direction parameter of the vMF distribution $q(z \mid t, x)$, we choose a more powerful encoder following recent advances in representing edits (Yin et al., 2019). Specifically, a standard diffing algorithm is run to compute an alignment of the tokens in $t$ and $x$, producing two aligned sequences and an additional edit sequence that indicates the edit operation (insertion, deletion, substitution, or equal) at each position. A dedicated padding symbol fills the gaps in the aligned sequences. This is illustrated in Figure 2. Word embeddings of all three sequences are concatenated and fed into a single-layer LSTM to obtain the edit representation.
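The alignment step can be sketched with Python's standard `difflib`. The operation labels (`=`, `sub`, `-`, `+`) and the `<pad>` token are hypothetical choices made for this example, and the diffing algorithm actually used in the paper may differ from `SequenceMatcher`.

```python
from difflib import SequenceMatcher

PAD = "<pad>"  # hypothetical padding token


def align(prototype_tokens, target_tokens):
    """Return (aligned prototype, aligned target, edit ops), all of equal length."""
    src, tgt, ops = [], [], []
    matcher = SequenceMatcher(None, prototype_tokens, target_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a, b = prototype_tokens[i1:i2], target_tokens[j1:j2]
        if tag == "equal":
            labels = ["="] * len(a)
        elif tag == "replace":
            # Pair tokens as substitutions; leftovers become deletions or insertions.
            paired = min(len(a), len(b))
            labels = ["sub"] * paired + ["-"] * (len(a) - paired) + ["+"] * (len(b) - paired)
        elif tag == "delete":
            labels = ["-"] * len(a)
        else:  # "insert"
            labels = ["+"] * len(b)
        width = len(labels)
        src.extend(a + [PAD] * (width - len(a)))
        tgt.extend(b + [PAD] * (width - len(b)))
        ops.extend(labels)
    return src, tgt, ops


src, tgt, ops = align("a b c".split(), "a x c d".split())
print(src)  # ['a', 'b', 'c', '<pad>']
print(tgt)  # ['a', 'x', 'c', 'd']
print(ops)  # ['=', 'sub', '=', '+']
```

The three equal-length sequences are what an edit encoder would consume position by position, for example by concatenating their embeddings before a recurrent layer.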

4 Experiments

Our experiments below are designed to (1) examine the efficacy of the proposed method on language modeling, (2) examine the efficiency of the proposed method on memory savings and speed-up at test time, and (3) demonstrate the interpretable semantics/syntax captured by prototypes and edit vectors.

4.1 Setup

We perform experiments on three datasets of different scales to test our method in different scenarios:

  • MSCOCO (Lin et al., 2014): MSCOCO is an image caption dataset and we only focus on its captions as our data. The average length of captions is 12.6. The sentences do not have complex variations, and it is easy to find similar sentences to serve as prototypes. We randomly sample 40K sentences as our training data and 4K each for the validation and test sets. This dataset represents a simple and small-scale setting to test our approach.

  • Yelp Medium/Yelp Large: These datasets consist of sentences from Yelp restaurant reviews (Yelp, 2017) preprocessed by Guu et al. (2018), allowing us to perform a direct comparison with their method. The medium and large datasets consist of 1.5M and 17M sentences respectively, allowing us to test in moderate and relatively large data settings. Note that Yelp Medium is obtained by further filtering Yelp Large to keep sentences that are generally shorter and have fewer variations, but the test sets for these two are the same.


We mainly consider neural language models (NLM) and the neural editor (Guu et al., 2018) as our baselines. It is worth noting that the neural editor model does not have a likelihood defined on test sentences that are not similar to any example in the prototype library, so it must be interpolated with another NLM at test time for smoothing purposes, while our model can be used on its own. Note that we only report the neural editor baseline on Yelp Large since their public code is not ready to run on other datasets due to the required pre-built index to retrieve prototypes.


We evaluate language modeling performance with perplexity (PPL). For our model, we approximate the log marginal data likelihood through 1000 importance-weighted latent variable samples (Burda et al., 2015) and compute PPL based on this likelihood. At test time our approach prunes the prototype set and has access to only the $M$ prototypes that occupy the target share of the probability mass of the posterior prototype distribution, as described in Section 3.2. We report $M$ as “#prototypes”. To give an intuitive notion of how similar the prototypes and data examples are, we compute the average of smoothed sentence-level BLEU scores (Lin and Och, 2004) of the data examples on the validation set with their most likely prototype as the reference. We also report BLEU scores based on part-of-speech sequences (POS-BLEU) to view the similarity from a more syntactic perspective; POS tagging is performed using the Stanza library (Qi et al., 2020). Test speed is evaluated on a single Nvidia 1080 Ti GPU.
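The evaluation arithmetic can be sketched as below: an importance-weighted estimate of a sentence's log marginal likelihood from K latent samples, followed by corpus-level perplexity. The inputs are placeholder numbers; in practice each log weight would come from the model's joint probability and the variational distribution evaluated at a sampled prototype and edit vector.

```python
import math


def log_mean_exp(log_values):
    """Numerically stable log of the average of exp(log_values)."""
    peak = max(log_values)
    return peak + math.log(sum(math.exp(v - peak) for v in log_values) / len(log_values))


def importance_weighted_log_likelihood(log_joint, log_proposal):
    """Estimate log p(x) from K samples: log (1/K) sum_k p(x, h_k) / q(h_k | x)."""
    log_weights = [p - q for p, q in zip(log_joint, log_proposal)]
    return log_mean_exp(log_weights)


def perplexity(sentence_log_likelihoods, token_counts):
    """Corpus perplexity: exp of the negative average per-token log-likelihood."""
    return math.exp(-sum(sentence_log_likelihoods) / sum(token_counts))


# Placeholder values: two latent samples for one sentence, then a two-sentence corpus.
estimate = importance_weighted_log_likelihood([-12.0, -12.0], [-2.0, -2.0])
ppl = perplexity([-10.0, -20.0], [5, 5])
print(estimate, round(ppl, 2))  # -10.0 and exp(3), roughly 20.09
```

Averaging the weights inside the logarithm gives a lower bound on the log marginal likelihood that tightens as the number of samples grows (Burda et al., 2015).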


We try different Dirichlet prior parameters $\alpha$ to control the sparsity and report different sparsity settings for all the datasets. The temperature parameter $\tau$ in the prototype retriever is tuned on the MSCOCO validation data, and the same value is used for all datasets. The concentration parameter $\kappa$ of the vMF distribution is tuned and set to 30 for MSCOCO and Yelp Medium and 40 for Yelp Large. The number of REINFORCE samples $L$ is set to 10 across all datasets. We sample 50K examples for Yelp Medium and 100K examples for Yelp Large as our training prototype library to address the training memory issue discussed in Section 3.2. We employ the same attentional LSTM Seq2Seq model as in Guu et al. (2018) to parameterize $p_\gamma(x \mid t, z)$ for a direct comparison. Our implementation is based on the fairseq toolkit (Ott et al., 2019) and complete hyperparameter settings can be found in Appendix B.

4.2 Results

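| Dataset | Model | PPL | #prototypes (% of training set) | Test speed (sent/s) | BLEU | POS-BLEU |
|---|---|---|---|---|---|---|
| MSCOCO | random retrieval | - | - | - | 10.9 | 31.6 |
| MSCOCO | NLM | 20.0 | - | 3714 | - | - |
| MSCOCO | Ours | 18.9 | 25 (0) | 388 | 13.2 | 38.8 |
| MSCOCO | Ours | 18.6 | 778 (2) | 313 | 17.2 | 42.5 |
| MSCOCO | Ours | 19.0 | 16K (40) | 250 | 20.9 | 46.7 |
| MSCOCO | Ours | 19.2 | 22K (56) | 217 | 22.2 | 47.9 |
| Yelp Medium | random retrieval | - | - | - | 8.1 | 17.8 |
| Yelp Medium | NLM | 74.7 | - | 236 | - | - |
| Yelp Medium | Ours | 63.6 | 77 (0) | 157 | 12.3 | 24.7 |
| Yelp Medium | Ours | 61.9 | 1.5K (0.1) | 107 | 21.6 | 38.4 |
| Yelp Medium | Ours | 63.2 | 31K (2.1) | 95 | 29.9 | 48.3 |
| Yelp Large | random retrieval | - | - | - | 6.6 | 16.0 |
| Yelp Large | NLM | 34.2 | - | 272 | - | - |
| Yelp Large | Ours | 30.2 | 2K (0.01) | 108 | 10.5 | 24.8 |
| Yelp Large | Ours | 30.3 | 5.5K (0.03) | 98 | 10.8 | 25.3 |
| Yelp Large, interpolated w/ NLM | Neural Editor (Guu et al., 2018) | 26.9 | 17M (100) | - | - | - |
| Yelp Large, interpolated w/ NLM | Neural Editor (our runs) | 31.2 | 17M (100) | 0.1 | - | - |
| Yelp Large, interpolated w/ NLM | Ours | 20.2 | 2K (0.01) | 108 | 10.5 | 24.8 |

Table 1: Results on three datasets. The rows labeled “Ours” within each dataset correspond to different Dirichlet hyperparameter settings. Numbers in parentheses indicate the percentage of prototypes over all training examples. The BLEU score is computed by comparing validation examples against their most likely prototypes. POS-BLEU represents the BLEU score on part-of-speech sequences. We also list BLEU scores from random prototype retrieval as a reference point. Results in the “Neural Editor (our runs)” row are obtained by running the public code of the neural editor.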

Results are shown in Table 1. Our approach outperforms the NLM baseline across all datasets in terms of PPL, often by a large margin. When interpolated with an NLM at test time, our method outperforms the NLM baseline by 14 PPL points and the neural editor by 6.7 PPL points. (For a fair comparison, we interpolate with the same pretrained NLM as Guu et al. (2018).) This effect is also observed in Guu et al. (2018): prototype-driven language models are especially strong at modeling test sentences that have similar prototypes but relatively weak at modeling others, thus interpolation with a normal NLM is likely to help. Furthermore, in line with the goal of this work to learn a sparse prototype set, our method is able to achieve superior language modeling performance while utilizing only a small fraction of training examples as prototypes. This verifies our hypothesis that a sparse prototype set suffices for such non-parametric language modeling. Also, the sparsity learned in our model allows for over 1000x memory savings and a 1000x speed-up at test time on Yelp Large compared with the previous neural editor, which memorizes all training examples. (We include the time to retrieve prototypes for new test sentences when computing test speed. We use the public implementation of the neural editor from Guu et al. (2018), where the computation of edit distance between all training examples and test sentences to find nearest neighbors accounts for much of the runtime. A more efficient implementation of this operation, for example through tries, may speed this up to some extent.)
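As a quick check on the efficiency claims, the ratios can be recomputed from the Yelp Large rows of Table 1. Using the prototype count as a proxy for memory is an assumption of this sketch; the actual memory footprint also depends on how prototypes are stored and indexed.

```python
# Figures taken from the Yelp Large rows of Table 1.
neural_editor_prototypes = 17_000_000  # full training set kept as prototypes
our_prototypes = 2_000                 # pruned prototype set
neural_editor_speed = 0.1              # sentences per second
our_speed = 108.0                      # sentences per second

# Assumption: prototype count is used as a stand-in for memory footprint.
memory_ratio = neural_editor_prototypes / our_prototypes
speed_ratio = our_speed / neural_editor_speed
print(f"prototype-count ratio: {memory_ratio:.0f}x, speed-up: {speed_ratio:.0f}x")
```

Both ratios exceed 1000, consistent with the savings reported above.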

Table 1 demonstrates the trend that a smaller Dirichlet hyperparameter $\alpha$ leads to a sparser prototype set, which agrees with our expectation. BLEU scores also improve as $\alpha$ increases, implying that the less sparse the prototype set, the closer the match between the sentence and its prototype. Interestingly, the BLEU score on Yelp Large is a bit low and the model tends to generally favor sparse prototypes. We suspect that this is because it is difficult to learn prototypes that capture fine-grained semantic features and large sentence variations among 17M examples with a limited prototype memory budget, so the model has to learn prototypes that represent more generic shared features among examples to reach an optimum, for example the syntactic features somewhat reflected by the decent POS-BLEU scores.

We want to emphasize that different sparsity levels may lead to different notions of prototypes, and it is hard to judge which one should be preferred: memorizing more prototypes can come at a cost to language modeling performance and does not necessarily produce better PPL. Also, prototypes that are “less similar” to the examples on the superficial level but capture coarse-grained features may have potentially interesting applications in sentence generation, since the model is able to generate more diverse output conditioned on prototypes.

4.3 Analysis

Ours (31K prototypes) 91.2K 14.4K 9.6K 9.3K 9.0K 7.2K 6.4K 5.5K
Ours (1.5K prototypes) 74.7K 9.9K 8.5K 8.2K 7.3K 5.6K 4.4K 5.0K
Relative Change -18.1 -31.3 -11.5 -11.8 -18.9 -22.2 -31.3 -9.1
Table 2: Number of matching tokens between examples and their prototypes on the Yelp Medium validation set. Results are grouped by POS tag, and relative changes are given in percent. Relative changes that are larger than the overall change are bolded.
| Data example | Dense prototype | Sparse prototype |
|---|---|---|
| the best corned beef hash i ’ve ever had ! | the best real corned beef hash i ’ve had . | the chicken satay is the best i ’ve ever had . |
| the grilled chicken was flavorful , but too flavorful . | the chicken was moist but it lacked flavor . | my sandwich was good but the chicken was a little plain . |
| i asked her what time they close and she said <cardinal> o’clock . | i asked what time they closed <date> , and was told <cardinal> . | we asked how long the wait was and we were informed it would be <time> . |

Table 3: Qualitative examples of prototypes when using denser and sparser prototype supports.
| Prototype: A man is using a small laptop computer | Prototype: A cat sitting on a sidewalk behind a bush |
|---|---|
| A man is using his laptop computer with his hands on the keyboard | A cat laying on top of a wooden bench |
| A man is using a laptop computer with his hands on the keyboard | A cat standing next to a tree in a park |
| A man is using a laptop computer while sitting on a bench | Two cats sitting on a bench near a park bench |
| A man is using a laptop computer in the middle of a room | A dog sitting on a bench near a park bench |
| A young man is using a laptop computer in the middle of a room | A dog sitting on a bench near a park bench |

Table 4: Qualitative examples from the MSCOCO dataset on interpolated sentence generation given the prototype. The first row is the given prototype; the second-row and last-row sentences are obtained by sampling edit vectors from the prior, and the remaining three sentences are generated by interpolating between the two edit vectors.

How do prototypes change when they grow sparser?

In Table 1 we notice that the BLEU scores usually drop when the prototype set is sparser, implying that the learned prototypes change in some way under different sparsity settings. Here we take the Yelp Medium dataset as an example to analyze how prototypes are trained to capture sentence attributes differently in the sparse (1.5K prototypes) and relatively dense (31K prototypes) situations. Specifically, we align prototype and example sequences to minimize edit distance, and focus on words that were aligned between the two sequences. This allows us to obtain a notion of what kinds of words are more likely to be kept the same as in the prototype during the model’s editing process and how this pattern changes in different sparsity settings. We cluster these matched words in terms of POS tag and report the most common ones. Note that alignment is performed on word sequences instead of POS sequences.

Results are shown in Table 2. While the overall number of matching tokens decreases as the prototype set becomes sparser, the content words exhibit a more drastic change (e.g. the nouns, adjectives, and verbs). In contrast, the function words experience a moderate decrease only (e.g. the determiners, auxiliaries, and coordinating conjunctions). This shows that the model tends to learn prototypes that drop fine-grained semantic distinctions but keep the same general syntax when a sparsity constraint is enforced, which is not surprising since it is difficult for a limited number of prototypes to capture large semantic variations in a large dataset. We list qualitative examples in Table 3 to demonstrate this phenomenon.

Interpolation on the edit vector space:

We take our model with 1.5K prototypes on the MSCOCO dataset and perform sentence interpolation in edit vector space. Specifically, we sample two edit vectors from the uniform vMF prior to produce two sentences for each prototype with beam search decoding, then we perform spherical linear interpolation of the two edit vectors to generate interpolated sentences in between. Qualitative examples in Table 4 (more examples in Appendix C) show that the edit vectors are able to smoothly capture minor edits over the given prototypes.
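Spherical linear interpolation keeps every intermediate vector on the unit sphere, which matches the support of the vMF prior. A minimal sketch for unit vectors stored as plain lists; the antipodal case, where the interpolation path is not unique, is not handled here.

```python
import math


def slerp(start, end, fraction):
    """Spherical linear interpolation between two unit vectors."""
    dot = sum(a * b for a, b in zip(start, end))
    dot = max(-1.0, min(1.0, dot))  # clamp against rounding error
    angle = math.acos(dot)
    if angle < 1e-8:  # vectors (nearly) coincide; nothing to interpolate
        return list(start)
    weight_start = math.sin((1.0 - fraction) * angle) / math.sin(angle)
    weight_end = math.sin(fraction * angle) / math.sin(angle)
    return [weight_start * a + weight_end * b for a, b in zip(start, end)]


midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print([round(v, 4) for v in midpoint])  # [0.7071, 0.7071]
```

Evaluating `slerp` at fractions such as 0.25, 0.5, and 0.75 between two sampled edit vectors gives three intermediate vectors, each of which can then be decoded with the same prototype.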

5 Conclusion

In this work, we propose a novel generative model that discovers a sparse prototype set automatically by optimizing a variational lower bound of the log marginal data likelihood. We demonstrate its effectiveness on language modeling and its efficiency advantages over previous prototype-driven generative models. The framework proposed here might be generalized to automatically discover salient prototypes from a large corpus. New kinds of prototype structure in text might be discovered through either injecting different biases into the model (e.g. sparsity biases in this paper), or incorporating prior knowledge into the prototype library before training. Finally, the approach might be easily extended to conditional generation (e.g. with the edit vectors depending on other data input), and we envision that inducing a sparse prototype set in this case may potentially facilitate controlling text generation through prototypes. We leave exploration in this direction as our future work.


This work was supported in part by the DARPA GAILA project (award HR00111990063), and a gift of computation credits from Amazon AWS. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government or Amazon.


  • R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones (2019) Character-level language modeling with deeper self-attention. In Proceedings of AAAI, Cited by: §1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, Cited by: §1.
  • Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin (2003) A neural probabilistic language model.

    Journal of machine learning research

    3 (Feb), pp. 1137–1155.
    Cited by: §1.
  • C. M. Bishop (2006) Pattern recognition and machine learning. springer. Cited by: §A.1.
  • S. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio (2016) Generating sentences from a continuous space. In Proceedings of CoNLL, Cited by: §3.1, §3.2.
  • Y. Burda, R. Grosse, and R. Salakhutdinov (2015) Importance weighted autoencoders. arXiv preprint arXiv:1509.00519. Cited by: §4.1.
  • Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of ACL, Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §3.2.
  • J. Gu, Y. Wang, K. Cho, and V. O. Li (2018) Search engine guided neural machine translation. In Proceedings of AAAI, Cited by: §2.
  • K. Guu, T. B. Hashimoto, Y. Oren, and P. Liang (2018) Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics 6, pp. 437–450. Cited by: Appendix B, §1, §1, §1, §2, §3.1, §3.1, §3.2, §3.2, §3.2, 2nd item, §4.1, §4.1, §4.2, Table 1, footnote 5, footnote 6.
  • T. B. Hashimoto, K. Guu, Y. Oren, and P. S. Liang (2018) A retrieve-and-edit framework for predicting structured outputs. In Proceedings of NeurIPS, Cited by: §1.
  • S. A. Hayati, R. Olivier, P. Avvaru, P. Yin, A. Tomasic, and G. Neubig (2018) Retrieval-based neural code generation. In Proceedings of EMNLP, Cited by: §1.
  • J. He, D. Spokoyny, G. Neubig, and T. Berg-Kirkpatrick (2019) Lagging inference networks and posterior collapse in variational autoencoders. In Proceedings of ICLR, Cited by: §3.1, footnote 8.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.1.
  • M. Hodosh, P. Young, and J. Hockenmaier (2013) Framing image description as a ranking task: data, models and evaluation metrics. Journal of Artificial Intelligence Research 47, pp. 853–899. Cited by: §2.
  • M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley (2013) Stochastic variational inference. The Journal of Machine Learning Research 14 (1), pp. 1303–1347. Cited by: §3.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix B.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §3.2.
  • D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling (2016) Improved variational inference with inverse autoregressive flow. In Proceedings of NeurIPS, Cited by: §3.2.
  • A. Kuncoro, M. Ballesteros, L. Kong, C. Dyer, G. Neubig, and N. A. Smith (2017) What do recurrent neural network grammars learn about syntax?. In Proceedings of EACL, Cited by: §1.
  • B. Li, J. He, G. Neubig, T. Berg-Kirkpatrick, and Y. Yang (2019) A surprisingly effective fix for deep latent variable modeling of text. In Proceedings of EMNLP, Cited by: Appendix B, §3.2.
  • J. Li, R. Jia, H. He, and P. Liang (2018) Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of NAACL, Cited by: §1.
  • C. Lin and F. J. Och (2004) Orange: a method for evaluating automatic evaluation metrics for machine translation. In Proceedings of COLING, Cited by: §4.1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In Proceedings of ECCV, Cited by: §1, 1st item.
  • T. Linzen, E. Dupoux, and Y. Goldberg (2016) Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4, pp. 521–535. Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1.
  • S. Merity, N. S. Keskar, and R. Socher (2018) Regularizing and optimizing LSTM language models. In Proceedings of ICLR, Cited by: §1.
  • T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association, Cited by: §1.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL (Demo Track), Cited by: §4.1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of EMNLP, Cited by: Appendix B.
  • F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller (2019) Language models as knowledge bases?. In Proceedings of EMNLP, Cited by: §1.
  • P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning (2020) Stanza: a Python natural language processing toolkit for many human languages. In Proceedings of ACL (Demo Track), Cited by: footnote 4.
  • A. M. Rush, S. Chopra, and J. Weston (2015) A neural attention model for abstractive sentence summarization. In Proceedings of EMNLP, Cited by: §1.
  • A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan (2015) A neural network approach to context-sensitive generation of conversational responses. In Proceedings of NAACL, Cited by: §1.
  • M. Sundermeyer, R. Schlüter, and H. Ney (2012) LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association, Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of NeurIPS, Cited by: §1.
  • J. Weston, E. Dinan, and A. Miller (2018) Retrieve and refine: improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, Cited by: §1.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256. Cited by: §3.2.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: footnote 3.
  • Y. Wu, F. Wei, S. Huang, Y. Wang, Z. Li, and M. Zhou (2019) Response generation by context-aware prototype editing. In Proceedings of AAAI, Cited by: §1.
  • J. Xu and G. Durrett (2018) Spherical latent spaces for stable variational autoencoders. In Proceedings of EMNLP, Cited by: §A.2, Appendix B, §3.1, §3.2.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Proceedings of NeurIPS, Cited by: §1.
  • Yelp (2017) Yelp dataset challenge, round 8. Cited by: §1, §1, 2nd item.
  • P. Yin, G. Neubig, M. Allamanis, M. Brockschmidt, and A. L. Gaunt (2019) Learning to represent edits. In Proceedings of ICLR, Cited by: §3.2.

Appendix A Derivations of Variational Inference and ELBO

A.1 Derivation of optimal

Here we show that the optimal variational distribution over , , is a Dirichlet distribution and derive its optimal as given in Eq. 7. Following Bishop (2006, Chapter 10.1.1), we have:


where denotes expectation over all latent variables except for . We expand Eq. 10 as:


where is the indicator function. We conclude that has the form of Dirichlet distribution and the optimal Dirichlet parameter .
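The conclusion above has the usual conjugate form for a Dirichlet prior over categorical assignments: each Dirichlet parameter is the prior pseudo-count plus the expected number of examples assigned to that prototype. The following is a minimal Python sketch of that update under stated assumptions: a symmetric prior hyperparameter `alpha` and a matrix `q_t` of per-example prototype posteriors. Both names are illustrative and are not the paper's notation.

```python
def dirichlet_posterior_params(q_t, alpha):
    """Mean-field update: lambda_k = alpha + sum_n q(t_n = k | x_n).

    q_t   -- list of rows; q_t[n][k] is the (assumed) posterior probability
             that example n uses prototype k; each row sums to 1
    alpha -- symmetric Dirichlet prior hyperparameter (assumed to be a scalar)
    """
    num_prototypes = len(q_t[0])
    expected_counts = [0.0] * num_prototypes
    for row in q_t:
        for k, prob in enumerate(row):
            expected_counts[k] += prob
    return [alpha + count for count in expected_counts]


# Toy example: three examples, two candidate prototypes.
q_t = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5]]
lam = dirichlet_posterior_params(q_t, alpha=0.1)
```

With a small `alpha`, prototypes that receive little posterior mass keep a parameter close to `alpha`, which is how a sparsity-inducing prior can push rarely used prototypes out of the support set.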

A.2 Derivation of Three KL Divergence Terms

There are three KL divergence terms in our training objective, the ELBO (Eq. 4). We now show that all three can be computed exactly and efficiently at training time, and derive an expression for each:

(1). :

As shown in (Xu and Durrett, 2018), the KL divergence between any vMF distribution with fixed concentration parameter and a uniform vMF distribution is a constant:


where is the number of dimensions, stands for the modified Bessel function of the first kind at order . Therefore,


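The constant in term (1) can be evaluated numerically. The sketch below uses the standard closed form for the KL divergence between a vMF distribution and the uniform distribution on the unit sphere. The function names and the power-series evaluation of the Bessel function are illustrative choices made here to keep the example dependency-free; they are not the authors' implementation.

```python
import math


def log_bessel_i(order, x, terms=500):
    """log I_order(x) via the power series, summed in log space.

    I_v(x) = sum_m (x/2)^(2m+v) / (m! * Gamma(m+v+1)), for x > 0 and v >= 0.
    500 terms is ample for moderate x (up to a few hundred).
    """
    log_half_x = math.log(x / 2.0)
    log_terms = [
        (2 * m + order) * log_half_x
        - math.lgamma(m + 1)
        - math.lgamma(m + order + 1)
        for m in range(terms)
    ]
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def kl_vmf_uniform(kappa, dim):
    """KL(vMF(mu, kappa) || uniform on the unit sphere in R^dim).

    The value does not depend on the mean direction mu.
    """
    nu = dim / 2.0 - 1.0
    log_i_nu = log_bessel_i(nu, kappa)
    log_i_nu_plus_1 = log_bessel_i(nu + 1.0, kappa)
    # E[mu^T z] = I_{d/2}(kappa) / I_{d/2-1}(kappa)
    mean_resultant = math.exp(log_i_nu_plus_1 - log_i_nu)
    # Log normalizer of the vMF density.
    log_norm_vmf = (
        nu * math.log(kappa) - (dim / 2.0) * math.log(2.0 * math.pi) - log_i_nu
    )
    # Log surface area of the unit sphere: 2 * pi^(d/2) / Gamma(d/2).
    log_sphere_area = (
        math.log(2.0) + (dim / 2.0) * math.log(math.pi) - math.lgamma(dim / 2.0)
    )
    return kappa * mean_resultant + log_norm_vmf + log_sphere_area
```

For example, `kl_vmf_uniform(30.0, 50)` gives the constant for a concentration of 30 in 50 dimensions. Because the concentration is fixed, this value can be computed once before training rather than at every step.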
(2). :


where is the digamma function, and step computes the expectation of over Dirichlet variable by using the general fact that the derivative of the log normalization factor with respect to the natural parameter is equal to the expectation of the sufficient statistic.
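The general fact invoked here can be written out explicitly. In the LaTeX sketch below the symbols are assumptions introduced for illustration: $\theta \sim \mathrm{Dir}(\lambda)$ over $K$ prototypes, and $q_k$ is the retrieval probability of prototype $k$ for a single example.

```latex
% Dirichlet as an exponential family: natural parameters \lambda_k - 1,
% sufficient statistics \log\theta_k, log-normalizer \log B(\lambda).
\log B(\lambda) = \sum_{j=1}^{K} \log\Gamma(\lambda_j)
                  - \log\Gamma\Big(\sum_{j=1}^{K} \lambda_j\Big),
\qquad
\mathbb{E}_{\theta \sim \mathrm{Dir}(\lambda)}[\log\theta_k]
  = \frac{\partial \log B(\lambda)}{\partial \lambda_k}
  = \psi(\lambda_k) - \psi\Big(\sum_{j=1}^{K} \lambda_j\Big).

% Consequently, for a categorical distribution q with probabilities q_k:
\mathbb{E}_{\theta \sim \mathrm{Dir}(\lambda)}
  \Big[ D_{\mathrm{KL}}\big(q \,\|\, \mathrm{Cat}(\theta)\big) \Big]
  = \sum_{k=1}^{K} q_k \Big( \log q_k - \psi(\lambda_k)
      + \psi\Big(\sum_{j=1}^{K} \lambda_j\Big) \Big).
```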

(3). :


where is the multivariate beta function and is the normalization factor for Dirichlet distribution parameterized by .
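The KL divergence between two Dirichlet distributions has a well-known closed form in terms of the multivariate beta function and the digamma function. A minimal Python sketch follows; the parameter names `lam` and `alpha` are illustrative, and the small `digamma` routine is a stand-in written here to avoid external dependencies (a library routine such as SciPy's would normally be used instead).

```python
import math


def digamma(x):
    """Digamma for x > 0: recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def log_multivariate_beta(params):
    """log B(params) = sum_k log Gamma(params_k) - log Gamma(sum_k params_k)."""
    return sum(math.lgamma(p) for p in params) - math.lgamma(sum(params))


def kl_dirichlet(lam, alpha):
    """KL(Dir(lam) || Dir(alpha)) for two parameter lists of equal length."""
    psi_total = digamma(sum(lam))
    expected_log_ratio = sum(
        (l_k - a_k) * (digamma(l_k) - psi_total) for l_k, a_k in zip(lam, alpha)
    )
    return (
        log_multivariate_beta(alpha)
        - log_multivariate_beta(lam)
        + expected_log_ratio
    )
```

As a sanity check, `kl_dirichlet([2.0, 1.0], [1.0, 1.0])` should equal `log 2 - 1/2`, which can be verified by direct integration of the density `2 * theta` against the uniform distribution on the unit interval.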

Appendix B Experimental Details

On the MSCOCO dataset, we use a single-layer attentional LSTM seq2seq architecture with word embedding size 100 and hidden state size 400 as the editor; the latent edit vector dimension is 50. This configuration follows the hyperparameters for vMF-VAE (Xu and Durrett, 2018). On the Yelp Medium and Yelp Large datasets, we follow Guu et al. (2018) and use a three-layer attentional LSTM seq2seq architecture as the editor, with word embedding size 300 and hidden state size 256; the edit vector dimension is 128. Skip connections are also used between adjacent LSTM layers. In the inverse editor, we use a single-layer LSTM to encode three sequences: the aligned prototype, the aligned data example, and the edit operation sequence. The word embedding size for text sequences and the hidden state size are the same as in the editor, and the word embedding size for the edit operation sequence is 10 (since the vocabulary of edit operations is very small). Across all datasets, we initialize word embeddings in the editor (both encoder and decoder sides) and the inverse editor with GloVe word embeddings (Pennington et al., 2014), following Guu et al. (2018). All NLM baselines use the same architecture as in our model for a fair comparison.

With respect to hyperparameter tuning, we tune the temperature parameter in the prototype retriever on the MSCOCO validation data in the range of , and set as for all datasets. The concentration parameter of the vMF distribution is tuned in the range of for all datasets, and is set to 30 for MSCOCO and Yelp Medium and 40 for Yelp Large. We run different as for each dataset to obtain prototype sets with varying sparsity. On the MSCOCO dataset we also add an additional run with since is a large value on this dataset already (90% of training examples are selected as the prototype set when ). We apply annealing and free-bits techniques following Li et al. (2019) to the KL term on the prototype variable, , to mitigate posterior collapse. Specifically, in our training objective becomes in practice. This objective means that we can downweight this KL term with and optimize it only when this KL is larger than a threshold value . We increase from 0 to 1 linearly in the first epochs (annealing). is tuned in the range of for MSCOCO and for Yelp Medium and Yelp Large. (Footnote 8: It is unreasonable to set a large on the Yelp datasets since there are tens of thousands of update steps per epoch on Yelp; the annealing process would be too slow if is large, which usually hurts language modeling performance (He et al., 2019).) is tuned in the range of . To obtain the reported results in Section 4, is set as 5 for MSCOCO, 2 for Yelp Medium and 3 for Yelp Large. is set as 5 for MSCOCO, 6 for Yelp Medium and 8 for Yelp Large. We use Adam (Kingma and Ba, 2014) to optimize the training objective with learning rate 0.001.
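The annealing and free-bits schedule described above can be sketched as follows. The exact expression used in the paper is not recoverable from this text, so the form below (a weight multiplied by the maximum of the KL value and a threshold) is an assumption based on the common free-bits formulation, and the names `beta`, `threshold`, and `anneal_epochs` are illustrative.

```python
def annealed_free_bits_kl(kl_value, epoch_progress, anneal_epochs, threshold):
    """Return the KL contribution to the loss under annealing plus free bits.

    kl_value       -- current KL term on the prototype variable
    epoch_progress -- training progress in (possibly fractional) epochs
    anneal_epochs  -- number of epochs over which the weight rises from 0 to 1
                      (assumed to be positive)
    threshold      -- free-bits threshold; below it the term is a constant,
                      so it contributes no gradient in an autograd framework
    """
    beta = min(1.0, epoch_progress / anneal_epochs)
    return beta * max(kl_value, threshold)


# Halfway through a 5-epoch warm-up, a KL of 10 with threshold 6 is weighted by 0.5.
example = annealed_free_bits_kl(10.0, 2.5, 5, 6.0)
```

Clamping with `max` means the optimizer only pushes the KL down while it exceeds the threshold, which is the behaviour the paragraph describes; the linear ramp on `beta` is the annealing part.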

Appendix C Qualitative Results on Interpolation

As in Section 4.3, here we show more examples generated through interpolation on the MSCOCO dataset.

Prototype: a horse drawn carriage on the side of a city street | Prototype: A baseball pitcher on the mound having just threw a pitch
Two horses drawn carriage on a city street | a baseball player swinging a bat at home plate
Two horses standing next to each other on a city street | a man about to hit a ball with his bat
Two horses on the side of a city street | a man swinging a bat at the ball during a game
Two horses on the side of a city street | a person swinging a bat at the ball during a game
A brown and white horse drawn carriage on a city street | A person swinging a bat during a baseball game

Prototype: A man walking on the beach carrying a surfboard | Prototype: A group of people are raising an umbrella on a beach
Two people standing next to each other on a beach | A group of people are walking on the beach with umbrellas
A person standing on the beach holding a surfboard | A group of people are walking on the beach next to each other
A man walking along the beach with a surfboard | A group of people are walking on the beach with umbrellas
A man walking on the beach with a surfboard | A group of people are holding umbrellas on the beach
A young man walking on the beach with a surfboard | A group of people are walking on the beach

Prototype: there is a white truck that is driving on the road | Prototype: A couple of bags of luggage sitting up against a wall
there are many cows that are standing in the dirt | A large pile of luggage sitting on top of a wall
there are many cows that are standing in the dirt | A pile of luggage sitting on top of a wall
the truck is driving down the road in the rain | Two bags of luggage sitting on the ground
this truck is driving down the road in the rain | Two bags of luggage sitting in a room
This truck is pulled up to the side of the road | A couple of bags of luggage on a wooden floor

Prototype: A man riding a sailboat in the ocean next to a shore | Prototype: A beer bottle sitting on a bathroom sink next to a mirror
A man on a boat in a body of water | A white cell phone sitting next to a toilet in a bathroom
A man riding a boat on a body of water | A white bottle of wine sitting next to a toilet
A man riding a boat in a body of water | A glass of wine sitting next to a toilet in a bathroom
A man riding a small boat on a body of water | A pair of scissors is placed next to a toilet
A man riding a wave on top of a boat | A pair of scissors sitting next to each other on a toilet

Prototype: A little boy sitting on a mattress holding a stuffed animal | Prototype: A giraffe has its nose pressed against the trunk of a tree
A little girl playing with a stuffed animal | Two giraffes look at a wire fence to eat
A little girl playing with a stuffed animal | Two giraffes look at a fence to eat
A little boy holding a stuffed animal in his mouth | a couple of giraffes are standing by a fence
A little girl sitting on a bed with stuffed animals | a close up of a giraffe is eating a carrot
A little girl sitting on a bed with stuffed animals | a close up of a giraffe has its mouth open
Table 5: Qualitative examples from the MSCOCO dataset on interpolated sentence generation given the prototype. For each example, the first row is the given prototype; the sentences in the second and last rows are obtained by sampling edit vectors from the prior, and the remaining three sentences are generated by interpolating between the two edit vectors.
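The text does not state which interpolation scheme is used between the two sampled edit vectors. Since vMF-distributed edit vectors lie on the unit hypersphere, spherical linear interpolation is one natural choice. The sketch below illustrates it as an assumption, not as the authors' procedure; the function name `slerp` and the toy vectors are hypothetical.

```python
import math


def slerp(z_start, z_end, fraction):
    """Spherical linear interpolation between two unit vectors.

    fraction = 0 returns z_start and fraction = 1 returns z_end.
    Antipodal endpoints (angle of pi) are not handled, because the
    interpolation path is not unique in that case.
    """
    dot = sum(a * b for a, b in zip(z_start, z_end))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    angle = math.acos(dot)
    if angle < 1e-8:  # nearly identical endpoints
        return list(z_start)
    sin_angle = math.sin(angle)
    w_start = math.sin((1.0 - fraction) * angle) / sin_angle
    w_end = math.sin(fraction * angle) / sin_angle
    return [w_start * a + w_end * b for a, b in zip(z_start, z_end)]


# Five points per example: two sampled endpoints plus three interpolated vectors.
z_a, z_b = [1.0, 0.0], [0.0, 1.0]
path = [slerp(z_a, z_b, i / 4.0) for i in range(5)]
```

Each vector in `path` stays on the unit sphere, so every intermediate point remains a valid input for a decoder that expects unit-norm edit vectors.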