POINTER: Constrained Text Generation via Insertion-based Generative Pre-training

05/01/2020 ∙ by Yizhe Zhang, et al. ∙ Duke University ∙ Microsoft

Large-scale pre-trained language models, such as BERT and GPT-2, have achieved excellent performance in language representation learning and free-form text generation. However, these models cannot be directly employed to generate text under specified lexical constraints. To address this challenge, we present POINTER, a simple yet novel insertion-based approach for hard-constrained text generation. The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner. This procedure is recursively applied until a sequence is completed. The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable. Since our training objective resembles the objective of masked language modeling, BERT can be naturally utilized for initialization. We pre-train our model with the proposed progressive insertion-based objective on a 12GB Wikipedia dataset, and fine-tune it on downstream hard-constrained generation tasks. Non-autoregressive decoding yields a logarithmic time complexity during inference time. Experimental results on both News and Yelp datasets demonstrate that POINTER achieves state-of-the-art performance on constrained text generation. We intend to release the pre-trained model to facilitate future research.




1 Introduction

Real-world editorial assistant applications must often generate text under specified lexical constraints, for example, converting a meeting note with key phrases into a concrete meeting summary, recasting a user-input search query as a fluent sentence, generating a conversational response using grounding facts Mou et al. (2016), or creating a story using a pre-specified set of keywords Fan et al. (2018).

Generating text under specific lexical constraints is challenging. Constrained text generation broadly falls into two categories, depending on whether inclusion of specified keywords in the output is mandatory. In soft-constrained (a.k.a. priming) generation Qin et al. (2019); Tang et al. (2019), keyword-text pairs are typically first constructed (sometimes along with other conditioning information), and a conditional text generation model is trained to capture their co-occurrence, so that the model learns to incorporate the constrained keywords into the generated text. While soft-constrained models are easy to design, keywords are apt to be lost during generation, especially when multiple keywords must be included, or the keywords are less correlated. Soft enforcing algorithms such as attention and copy mechanisms  Bahdanau et al. (2014); Gu et al. (2016); Chen et al. (2019a) can be helpful in preserving keywords, but do not guarantee that constraints will be included in the output sentence.

| Stage | Generated text sequence |
| --- | --- |
| 0 ($X^0$) | sources sees structure perfectly |
| 1 ($X^1$) | sources company sees change structure perfectly legal |
| 2 ($X^2$) | sources suggested company sees reason change tax structure which perfectly legal . |
| 3 ($X^3$) | my sources have suggested the company sees no reason to change its tax structure , which are perfectly legal . |
| 4 ($X^4$) | my sources have suggested the company sees no reason to change its tax structure , which are perfectly legal . |

Table 1: Generated example of the progressive generation process with multiple stages from the proposed Pointer model. Words that do not appear in the previous stage are newly generated at the current stage. $X^k$ denotes the generated partial sentence at Stage $k$. Five stages are considered in this example; $X^4$ and $X^3$ are the same, which indicates the end of the generation process. Interestingly, our method allows informative words (e.g., company, change) to be generated first, while non-informative words (e.g., the, to) are generated at the end.

Hard-constrained generation Hokamp and Liu (2017); Post and Vilar (2018); Hu et al. (2019); Miao et al. (2019); Welleck et al. (2019), on the other hand, requires that all the lexical constraints be present in the output sentence. This approach typically involves sophisticated design of network architectures. Hokamp and Liu (2017) construct a lexical-constrained grid beam search decoding algorithm to incorporate constraints. However, Hu et al. (2019) observe that a naive implementation of this algorithm has a high running time complexity. Miao et al. (2019) introduce a sampling-based conditional generation method, where the constraints are first placed in a template, then words in a random position are either inserted, deleted or updated under a Metropolis-Hastings-like scheme. However, individually sampling each token results in slow convergence, as the tokens in a sentence are highly correlated under their joint distribution (for example, it may take infinite time for the model to move from “Hong Kong” to “New York”). Welleck et al. (2019) propose a tree-based text generation scheme, where a token is first generated in an arbitrary position, and then the model recursively generates words to its left and right, yielding a binary tree. However, the constructed tree may not reflect the progressive hierarchy/granularity from high-level concepts to low-level details. Further, the time complexity of generating a sentence using this approach is $\mathcal{O}(n)$, like standard auto-regressive generation methods.

Motivated by the above, we propose a novel non-autoregressive model for hard-constrained text generation, called Pointer (PrOgressive INsertion-based TransformER). As illustrated in Table 1, generation of words in Pointer is progressive and iterative. Given lexical constraints, Pointer first generates high-level words (e.g., informative nouns, verbs and adjectives) that bridge the keyword constraints, then these words are used as pivoting points at which to insert details of finer granularity. This process iterates until a sentence is finally completed by adding the least informative words (typically pronouns and prepositions).

Since the training objective of our method is similar to the masked language modeling (MLM) objective as used in BERT Devlin et al. (2018), we use BERT to initialize our model training. Further, we perform large-scale pre-training on 12GB of Wikipedia raw data to obtain a pre-trained Pointer model that can be readily fine-tuned on specific downstream tasks, boosting performance.

Compared with previous work, Pointer manifests several advantages: (i) it allows long-term control when generating long paragraphs due to the top-down hierarchical structure of the progressive generation process; (ii) it provides a novel way of utilizing BERT for text generation (other large-scale pre-trained language models can also be directly adopted); (iii) during inference time, the token generation at each position can be parallel, yielding a significant reduction in time complexity from $\mathcal{O}(n)$ to $\mathcal{O}(\log n)$. This is because in each round, our model inserts the tokens at each gap simultaneously over the entire sentence. Moreover, to better coordinate the parallel generation process, we develop a novel beam search algorithm customized to our approach, further improving the generation quality.

The main contributions of this paper are summarized as follows. (i) We present Pointer, a novel insertion-based Transformer model for hard-constrained text generation. (ii) Large-scale pre-training and novel beam search algorithms are proposed to further boost performance. (iii) Experiments are conducted on several datasets across different domains (including News, Yelp and Wiki), demonstrating the superiority of Pointer over strong baselines. Our approach is simple to understand and implement, yet powerful, and can be leveraged as a building block for future research.

2 Related Work

Language Model Pre-training

Large-scale pre-trained language models, such as BERT Devlin et al. (2018), RoBERTa Liu et al. (2019), XLNet Yang et al. (2019), Text-to-text Transformer Raffel et al. (2019) and ELECTRA Clark et al. (2020), have achieved great success on natural language understanding benchmarks. GPT-2 Radford et al. (2018) first demonstrates great potential for leveraging Transformer models in generating realistic text with given prompts. MASS Song et al. (2019) and BART Lewis et al. (2019) propose methods for sequence-to-sequence pre-training. UniLM Dong et al. (2019) unifies the generation and understanding tasks within a single pre-training scheme. DialoGPT Zhang et al. (2020) and MEENA Adiwardana et al. (2020) focus on open-domain conversations, while SC-GPT Peng et al. (2020) focuses on task-oriented dialog; these models demonstrate potential for human-like response generation. Controllable pre-trained language generation models have also been proposed. For example, CTRL Keskar et al. (2019) and Grover  Zellers et al. (2019) guide text generation with pre-defined control codes, Optimus Li et al. (2020) guides text generation with the abstract-level latent codes. Complementary to this, PPLM Dathathri et al. (2020) introduces a controllable scheme in the text decoding stage. In addition, recent work has also investigated how to leverage BERT for conditional text generation Chen et al. (2019b); Mansimov et al. (2019). With massive training data, these models exhibit strong capacity for generating realistic chunks of text.

Despite their great success, however, existing pre-trained models cannot generate text with user-specified keywords in the hard-constrained text generation setting. To the best of our knowledge, ours is the first work that studies large-scale pre-training for hard-constrained text generation.

Non-autoregressive Generation

  There have been many attempts to employ non-autoregressive models for text generation tasks. For neural machine translation, the promise of such methods mostly lies in their decoding efficiency. For example, Gu et al. (2018) employ a non-autoregressive decoder that generates all the tokens simultaneously. Generation can be further refined with a post-processing step to remedy the conditional independence of the parallel decoding process Lee et al. (2018); Ghazvininejad et al. (2019); Ma et al. (2019); Sun et al. (2019); Kasai et al. (2020). Deconvolutional decoders Zhang et al. (2017); Wu et al. (2019) have also been studied for title generation and machine translation. The Insertion Transformer Stern et al. (2019); Gu et al. (2019); Chan et al. (2019) is a partially autoregressive model, which predicts both insertion positions and tokens, and is trained to maximize the entropy over all valid insertions, providing fast inference while maintaining good performance.

The Pointer model hybridizes the BERT and Insertion Transformer models, inheriting the advantages of both, and generates text in a progressive coarse-to-fine manner. Further, Pointer pushes this line of research to a large-scale setting with pre-training.

3 Method

Let $X = \{x_0, x_1, \cdots, x_T\}$ denote a sequence of discrete tokens (such as a sentence represented at the subword level), where each token $x_t \in V$, and $V$ is a finite vocabulary set. For the hard-constrained text generation task, the goal is to generate a complete text sequence $X$, given a set of key words $\hat{X}$ as constraints, where the key words have to be exactly included in the final generated sequence in the same order. We first introduce the general framework of the Pointer model (Sec. 3.1), then describe data preparation (Sec. 3.2), model training (Sec. 3.3), and inference (Sec. 3.4) in detail.

3.1 Model Overview

Denoting the lexical constraints as $X^0 = \hat{X}$, the generation procedure of our method can be understood as a (progressive) sequence of $K$ stages: $S = \{X^0, X^1, \cdots, X^{K-1}, X^K\}$, such that for each $k \in \{1, \cdots, K\}$, $X^{k-1}$ is a sub-sequence of $X^k$. Each stage can be perceived as a finer-resolution text sequence compared to the preceding stage. $X^K$ is the final generation, under the condition that the iterative procedure has converged (i.e., $X^{K-1} = X^K$).
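The sub-sequence property can be checked mechanically on the stages of Table 1. The snippet below is a minimal sketch, assuming whitespace tokenization; the helper name `is_subsequence` is introduced here for illustration only.

```python
def is_subsequence(shorter, longer):
    """Return True if `shorter` can be obtained from `longer` by deleting tokens."""
    remaining = iter(longer)
    # `tok in remaining` advances the iterator, so token order is respected.
    return all(tok in remaining for tok in shorter)


# Stages 0 to 3 of the example in Table 1.
stages = [
    "sources sees structure perfectly",
    "sources company sees change structure perfectly legal",
    "sources suggested company sees reason change tax structure which perfectly legal .",
    "my sources have suggested the company sees no reason to change its tax "
    "structure , which are perfectly legal .",
]
stages = [s.split() for s in stages]

# Every stage must be a sub-sequence of the stage that follows it.
valid = all(is_subsequence(a, b) for a, b in zip(stages, stages[1:]))
```
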

Table 1 shows an example of our progressive text generation process. Starting from the original lexical constraints ($X^0$), at each stage, the algorithm inserts tokens progressively to formulate the target sequence. At each step, at most one new token can be generated between two existing tokens. Formally, we propose to factorize the distribution according to the importance of each token (importance will be defined later):

$$p(X) = p(X^0) \prod_{k=1}^{K} p(X^k \mid X^{k-1}) \qquad (1)$$

where the more important tokens that form the skeleton of the sentence, such as nouns and verbs, appear in earlier stages, and the auxiliary tokens, such as articles and prepositions, are generated at the later stages. Contrast our progressive generation process in (1) with the commonly used autoregressive model, which factorizes the joint distribution of $X$ in a standard left-to-right manner, i.e., $p(X) = p(x_0) \prod_{t=1}^{T} p(x_t \mid x_{<t})$, ignoring the word importance.

Though the Insertion Transformer Stern et al. (2019) attempts to implement the progressive generation agenda in (1), it does not directly address how to effectively train the model to generate important tokens first. One possible solution is to construct a training objective under which generating an important token first yields a lower loss. This can complicate the design of the loss function, and there is no obvious way to achieve it. To obviate the difficulty in hand-crafting a proper training objective, we prepare data in a form that eases model training.

3.2 Data Preparation

To prepare data for efficient training, we construct pairs of text sequences at adjacent stages, i.e., $(X^{k-1}, X^k)$, as the model input. Therefore, each training instance $X$ is broken into a consecutive series of pairs: $(X^0, X^1), \cdots, (X^{K-1}, X^K)$, where $K$ is the number of such pairs. Two properties are desired when constructing such a dataset: (i) important tokens should appear in an earlier stage (corresponding to a smaller $k$), so that the generation follows a coarse-to-fine progressive manner; (ii) the number of steps $K$ is small, so that the generation of a sentence is fast during inference time.

Token Importance Scoring  We consider three different schemes to assess the importance score of a token: term frequency-inverse document frequency (TF-IDF), part-of-speech (POS) tagging, and Yet-Another-Keyword-Extractor (YAKE) Campos et al. (2018, 2020). The TF-IDF score measures the uniqueness and local enrichment of a token at the corpus level. POS tagging, also called word-category disambiguation, indicates the role of a token at the sequence level. In our experiments, tokens from the noun or verb category are considered more important than tokens from other categories, and are given a higher POS tagging score. YAKE is a commonly used unsupervised automatic keyword extraction method, which relies on statistical text features extracted from single documents to select the most important keywords of a sentence Campos et al. (2020). YAKE is good at extracting common key words, but relatively weak at extracting special nouns (e.g., names), and also does not provide any importance level for less informative tokens. Therefore, we propose to combine the above three metrics for token importance scoring. Specifically, the overall score $\alpha_t$ of a token $x_t$ is defined as:

$$\alpha_t = \alpha_t^{\text{TF-IDF}} + \alpha_t^{\text{POS}} + \alpha_t^{\text{YAKE}} \qquad (2)$$

where $\alpha_t^{\text{TF-IDF}}$, $\alpha_t^{\text{POS}}$ and $\alpha_t^{\text{YAKE}}$ represent the TF-IDF, POS tagging and YAKE scores (each is rescaled to $[0, 1]$), respectively. Additionally, stop words are manually assigned a low importance score. If a token appears several times in a sequence, the later occurrences are assigned a decayed importance score; otherwise, the model tends to generate the same token multiple times in one step during the inference stage.
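The scoring scheme described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes the three per-token scores are already computed with "higher means more important", that they are min-max rescaled and summed, and that the stop-word score (`stop_score`) and the repetition decay factor (`decay`) take the hypothetical values shown.

```python
def rescale(values):
    """Min-max rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def importance_scores(tokens, tfidf, pos, yake, stop_words=frozenset(),
                      stop_score=0.01, decay=0.5):
    """Combine three per-token scores into one importance score per token.

    stop_score and decay are hypothetical values chosen for illustration.
    """
    tfidf, pos, yake = rescale(tfidf), rescale(pos), rescale(yake)
    seen = {}  # how many times each token has occurred so far
    scores = []
    for tok, a, b, c in zip(tokens, tfidf, pos, yake):
        score = stop_score if tok in stop_words else a + b + c
        # Later occurrences of the same token get a decayed score.
        score *= decay ** seen.get(tok, 0)
        seen[tok] = seen.get(tok, 0) + 1
        scores.append(score)
    return scores


tokens = ["the", "food", "is", "good", "food"]
scores = importance_scores(
    tokens,
    tfidf=[0.1, 0.9, 0.1, 0.5, 0.9],
    pos=[0, 1, 1, 0, 1],
    yake=[0.0, 1.0, 0.0, 0.5, 1.0],
    stop_words={"the", "is"},
)
```

In this toy input the second occurrence of "food" receives half the score of the first, which is the behaviour the decay rule is meant to produce.
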

DP-based Data Pair Construction  The construction of data-instance pairs reverses the generation process. Specifically, starting from the original sentence $X$, at each iteration, the algorithm masks out a proportion of the existing tokens $X^k$ to yield a sub-sequence $X^{k-1}$, creating a training instance pair $(X^{k-1}, X^k)$. This procedure is iterated until fewer than $c$ tokens are left ($c$ is small). The less important tokens, according to the calculated scores, are masked at earlier iterations.

Since the Insertion Transformer, on which our method is based, allows at most one new token to be generated between each two existing tokens, sentence length at most doubles at each iteration. Consequently, the optimal number of iterations $N$ is $\log(T)$, where $T$ is the length of the sequence. Therefore, generation efficiency can be optimized by encouraging more tokens to be discarded during each masking step when preparing the data. However, simply masking every other token (positional interleaving) ignores token importance, and thus loses the property of progressive planning from high-level concepts to low-level details at inference time. In practice, sequences generated by such an approach can be less semantically consistent, as less important tokens occasionally steer the generation towards random content.
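The doubling argument can be made concrete with a short calculation. The sketch below assumes that a sequence of `length` tokens (including the start and end tokens) has `length - 1` slots, each of which receives at most one new token per stage; the function name `min_stages` is hypothetical.

```python
def min_stages(initial_len, target_len):
    """Fewest insertion stages needed to grow from initial_len to target_len tokens,
    when every slot between two adjacent tokens gains at most one token per stage."""
    if initial_len < 2:
        raise ValueError("need at least two tokens to form a slot")
    stages, length = 0, initial_len
    while length < target_len:
        length = 2 * length - 1  # length - 1 slots, each filled with one token
        stages += 1
    return stages


# Table 1 example: 4 keywords plus start/end tokens -> 20 words plus start/end tokens.
table1_stages = min_stages(4 + 2, 20 + 2)
```

For the Table 1 example this lower bound is 3 insertion stages, which matches the three stages in which new words appear there.
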

We design an approach to mask the sequence by considering both token importance and efficiency using dynamic programming (DP). The masking procedure is under the constraint that no two consecutive tokens can be masked. The reason for this constraint is that insertion-based generation allows at most one new token to be generated between two existing tokens, so two consecutive tokens cannot be masked at the same stage. Under this condition, we score each token and select a subset of tokens that add up to the highest score (all scores are positive). This allows the algorithm to adaptively choose the highest-scored tokens while selecting as many tokens as possible to mask.

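We formulate the above objective as an integer linear programming problem Richards and How (2002), where the objective is to find an optimal masking pattern $\Phi = (\phi_1, \cdots, \phi_T)$, where $\phi_t \in \{0, 1\}$; $\phi_t = 1$ represents discarding the corresponding token $x_t$, and $\phi_t = 0$ indicates that $x_t$ remains. For a sub-sequence $X'$, the objective can be formulated as:

$$\max_{\Phi} \sum_{t=1}^{T} \phi_t (\alpha_{\max} - \alpha_t), \quad \text{s.t.} \;\; \phi_t \phi_{t+1} \neq 1 \;\; \forall t \qquad (3)$$

where $\alpha_{\max} = \max_t \{\alpha_t\}$ and $T$ is the length of the current sub-sequence $X'$. The condition $\phi_t \phi_{t+1} \neq 1$ prevents two adjacent tokens from being masked at the same stage.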

Though solving (3) is computationally expensive, one can resort to an analogous problem for a solution, the so-called House Robbery Problem, a variant of the Maximum Subarray Problem Bentley (1984), where a professional burglar plans to rob houses along a street and tries to maximize the outcome, but cannot break into two adjacent houses without triggering an alarm. This can be solved using dynamic programming Bellman (1954) (closely related to Kadane’s algorithm Gries (1982)), as shown in Algorithm 1.

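Input: A sequence of discrete tokens $X = (x_1, \cdots, x_T)$ and its corresponding score list $(\alpha_{\max} - \alpha_1, \cdots, \alpha_{\max} - \alpha_T)$
Output: Masking pattern $\Phi = (\phi_1, \cdots, \phi_T)$
Initialization: Accumulating scores $s_0 \leftarrow 0$, $s_1 \leftarrow \alpha_{\max} - \alpha_1$ and $s_2 \leftarrow \alpha_{\max} - \alpha_2$; position tracker $p_1 \leftarrow 0$ and $p_2 \leftarrow 0$; $\Phi = 0$; $t \leftarrow 3$
while $t \leq T$ do
     if $s_{t-2} \geq s_{t-3}$ then $s_t \leftarrow s_{t-2} + \alpha_{\max} - \alpha_t$, $p_t \leftarrow t - 2$
     else $s_t \leftarrow s_{t-3} + \alpha_{\max} - \alpha_t$, $p_t \leftarrow t - 3$
     end if
     $t \leftarrow t + 1$
end while
$t \leftarrow T$
if $T \geq 2$ and $s_{T-1} > s_T$ then
     $t \leftarrow T - 1$
end if
while $t \geq 1$ do
     $\phi_t \leftarrow 1$, $t \leftarrow p_t$
end while
Algorithm 1 DP-based Data Pair Construction. Here $s_t$ is the best accumulated score of a valid masking pattern whose last masked token is $x_t$, and $p_t$ is the position of the masked token preceding $x_t$ in that pattern.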
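The masking step can be sketched in Python with the textbook house-robber dynamic program. This is a minimal sketch under stated assumptions rather than the authors' code: the per-token masking score is taken to be the maximum importance minus the token's importance, and the names `select_mask`, `mask_scores`, `build_pairs` and the stopping threshold `min_len` are introduced here for illustration.

```python
def mask_scores(importance):
    """Masking score of each token: less important tokens score higher (assumed form)."""
    top = max(importance)
    return [top - a for a in importance]


def select_mask(scores):
    """Choose non-adjacent positions with the largest total score (house-robber DP).

    Returns a 0/1 list where 1 means the token is masked.
    """
    n = len(scores)
    best = [0.0] * (n + 1)   # best[i]: best total over the first i tokens
    take = [False] * (n + 1)  # take[i]: whether token i-1 is masked in that optimum
    for i in range(1, n + 1):
        skip = best[i - 1]
        pick = scores[i - 1] + (best[i - 2] if i >= 2 else 0.0)
        take[i] = pick > skip
        best[i] = max(pick, skip)
    mask = [0] * n
    i = n
    while i >= 1:  # backtrack through the decisions
        if take[i]:
            mask[i - 1] = 1
            i -= 2  # the neighbour to the left must stay
        else:
            i -= 1
    return mask


def build_pairs(tokens, importance, min_len=3):
    """Repeatedly mask tokens, returning (shorter, longer) pairs, coarsest pair first."""
    pairs = []
    while len(tokens) >= min_len:
        mask = select_mask(mask_scores(importance))
        kept = [t for t, m in zip(tokens, mask) if m == 0]
        if len(kept) == len(tokens):  # nothing masked (all scores tie): stop
            break
        pairs.append((kept, tokens))
        importance = [a for a, m in zip(importance, mask) if m == 0]
        tokens = kept
    return pairs[::-1]


tokens = ["the", "company", "sees", "no", "reason"]
importance = [0.1, 3.0, 2.0, 0.2, 1.5]
first_mask = select_mask(mask_scores(importance))
pairs = build_pairs(tokens, importance)
```

In the toy input the two least important tokens ("the" and "no") are masked first, since they are not adjacent and together give the largest total masking score.
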

3.3 Model Training

Stage-wise Insertion Prediction

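With all the data-instance pairs $(X^{k-1}, X^k)$ created as described above as the model input, the model is trained using an objective similar to the masked language modeling (MLM) objective in BERT Devlin et al. (2018). For each instance, we optimize the following objective:

$$\mathcal{L} = -\log p(X^k \mid X^{k-1}) = -\sum_{x \in X^{+}} \log p(x \mid \Phi^{k-1}, X^{k-1}) \, p(\Phi^{k-1} \mid X^{k-1}) \qquad (4)$$

where $X^{+} = X^k - X^{k-1}$, and $\Phi^{k-1}$ denotes an indicator vector in the $k$-th stage, representing whether an insertion operation is applied in a slot.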

While the MLM objective in BERT only predicts the token of a masked placeholder, our objective differs in that it comprises both (i) the likelihood of an insertion indicator for each slot (between two existing tokens), and (ii) the likelihood of each new token conditioned on the activated slot. To handle this case, we expand the vocabulary with a special no-insertion token [NOI]. As a result, during inference time, the model can predict either a token from the vocabulary to insert, or a [NOI] token indicating that no new token will be inserted at a certain slot at the current stage. By utilizing this special token, the two objectives are merged. See Figure 1 for an illustration. Note that the same insertion transformer module is re-used at different stages.

Figure 1: Illustration of the generation process ($X^0 \to X$) of the proposed Pointer model. At each stage, the Insertion Transformer module generates either a regular token or a special [NOI] (no-insertion) token for each gap between two existing tokens. The generation stops when all the gaps predict [NOI]. The data preparation process ($X \to X^0$) reverses the above generative process.

During inference time, once all the slots in a stage ($X^k$) predict [NOI] for the next stage, the generation procedure has converged and $X^k$ is the final output sequence. Note that to account for this final stage $X^k$, during data preparation we incorporate an $(X, N)$ pair for each sentence in the training data, where $N$ denotes a sequence of [NOI] with the same length as $X$. To enable the model to insert at the beginning and end of the sequence, an [SOS] (start of sentence) token and an [EOS] (end of sentence) token are added at the beginning and at the end of each sentence, respectively.
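To illustrate how a masking pattern turns into insertion targets, the sketch below builds one training pair. It assumes, for illustration only, that the target for each slot is attached to the kept token on its left and that the last token always receives the no-insertion label; `make_training_pair` is a hypothetical helper name.

```python
NOI = "[NOI]"


def make_training_pair(tokens, mask):
    """Build (previous-stage tokens, per-token insertion labels) from a masking pattern.

    mask[i] == 1 means tokens[i] is removed from the previous stage. The first and
    last tokens ([SOS], [EOS]) must be kept and no two adjacent tokens may be masked.
    """
    prev, labels = [], []
    n = len(tokens)
    for i in range(n):
        if mask[i]:
            continue
        prev.append(tokens[i])
        nxt = i + 1
        # If the token right after a kept token was masked, it is the insertion target.
        labels.append(tokens[nxt] if nxt < n and mask[nxt] else NOI)
    return prev, labels


sentence = ["[SOS]", "the", "company", "sees", "no", "reason", "[EOS]"]
prev, labels = make_training_pair(sentence, [0, 1, 0, 0, 1, 0, 0])

# The final-stage pair: nothing is masked, so every label is [NOI].
final_prev, final_labels = make_training_pair(sentence, [0] * len(sentence))
```

With an all-zero mask the labels form a sequence of [NOI] of the same length as the sentence, which corresponds to the extra pair added for the final stage.
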

In light of the similarity with the MLM objective, we use the BERT-base model to initialize the Insertion Transformer module. Since our model is non-autoregressive, all the tokens in Stage $k-1$ can be attended to when generating new tokens in Stage $k$. Such a structure empowers the model to utilize all the context information at the previous stage to predict the new tokens.

Large-scale Pre-training  In order to provide a general large-scale pretrained model that can benefit various downstream tasks with fine-tuning, we train a model on the massive publicly available English Wiki dataset, which covers a wide range of topics. The Wiki dataset is first preprocessed according to Sec. 3.2. We then initialize the model with BERT, and perform model training on the processed data using our training objective (4). After pre-training, the model can be used to generate an appropriate sentence with open-domain keyword constraints, in a tone that represents the Wiki style. In order to adapt the pre-trained model to a new domain (e.g., News and Yelp reviews), the pre-trained model is further fine-tuned on new datasets, which empirically demonstrates better performance than training the model on the target domain alone.

3.4 Inference

During inference time, the proposed model generates text stage-by-stage, by applying the insertion module repeatedly until convergence (i.e., no additional token is generated), starting from the given lexical constraint $X^0$. Specifically, at each stage in the generation process, the model first computes the probability of the possible next stages via (4). Each new token is then generated according to this probability, using either greedy search or top-K sampling Fan et al. (2018). If a [NOI] token is generated, it is deleted at the next round, as it indicates that no token should be inserted at that location. The newly generated sequence is then re-fed into the Insertion Transformer module as input to recursively generate the sequence of the next stage. The generation process stops when all the positions predict [NOI] tokens for the next stage.
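The stage-wise decoding loop can be sketched as below. The model itself is abstracted away: `predict_slots` is a hypothetical callable that returns one token (or the no-insertion token) per slot, and the lookup-table predictor is a toy stand-in, not a trained model. The cap `max_stages` is likewise an assumption added as a safety net.

```python
NOI = "[NOI]"


def generate(constraints, predict_slots, max_stages=10):
    """Insert tokens stage by stage until every slot predicts [NOI]."""
    seq = ["[SOS]"] + list(constraints) + ["[EOS]"]
    for _ in range(max_stages):
        preds = predict_slots(seq)  # one prediction per slot, len(seq) - 1 in total
        if all(p == NOI for p in preds):
            break  # converged: no new token is generated
        nxt = [seq[0]]
        for pred, tok in zip(preds, seq[1:]):
            if pred != NOI:  # [NOI] predictions are dropped, not inserted
                nxt.append(pred)
            nxt.append(tok)
        seq = nxt
    return seq[1:-1]


# Toy predictor: maps a (left token, right token) slot to the token to insert.
TOY_TABLE = {
    ("sources", "sees"): "company",
    ("[SOS]", "sources"): "my",
    ("sources", "company"): "the",
}


def toy_predict_slots(seq):
    return [TOY_TABLE.get(pair, NOI) for pair in zip(seq, seq[1:])]


output = generate(["sources", "sees"], toy_predict_slots)
```

All slots are filled from the same input sequence within a stage, which is what makes each stage parallel; only the stages themselves are sequential.
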

Inner-Layer Beam Search  According to (4), all new tokens are simultaneously generated based on the existing tokens at the previous stage. Despite being fully parallel, like BERT Yang et al. (2019) and NAT Ghazvininejad et al. (2019); Kasai et al. (2020), this approach suffers from a conditional independence problem, where the predicted tokens are generated conditionally independently and are agnostic of each other. This can result in repeated or inconsistent new tokens at each generation round (for example, from an existing token “and”, the model generates “clean and clean”).

To address this weak-dependency issue, we perform a modified beam search algorithm for decoding. Specifically, at stage $k$, suppose the existing tokens from the last stage are $X^{k-1} = \{x^{k-1}_1, \cdots, x^{k-1}_{T_{k-1}}\}$, where $T_{k-1}$ is the length of $X^{k-1}$. For predicting the next stage $X^k$, there will be $T_{k-1}$ available slots. A naive approach to beam search would maintain a priority queue of the top $B$ candidate token series when moving from the leftmost slot to the rightmost slot. The token series are initialized by interleaving the tokens in $X^{k-1}$ with [NOI]. At the $t$-th move, the priority queue contains the top $B$ sequences: $(s^{(b)}_1, x^{k-1}_1, \cdots, s^{(b)}_{t-1}, x^{k-1}_{t-1}, \text{[NOI]}, x^{k-1}_t, \cdots, \text{[NOI]}, x^{k-1}_{T_{k-1}})$, where $b \in \{1, \cdots, B\}$ and $s^{(b)}_i$ denotes the predicted token for the $i$-th slot in the $b$-th top sequence. The model then evaluates the likelihood of each token (including [NOI]) in the vocabulary in the slot $s_t$, i.e., computing the likelihood of $(s^{(b)}_1, x^{k-1}_1, \cdots, s^{(b)}_{t-1}, x^{k-1}_{t-1}, s_t, x^{k-1}_t, \cdots, \text{[NOI]}, x^{k-1}_{T_{k-1}})$ for every $s_t \in V$. This is followed by a ranking step to select the top $B$ most likely series among the $B \cdot |V|$ series to grow. However, such a naive approach is expensive, as it takes $\mathcal{O}(T_{k-1} \cdot B \cdot |V|)$ evaluations.

Instead, we approximate the search by constraining it to a narrow band. We design a customized beam search algorithm for our model, called inner-layer beam search (ILBS). This method applies an approximate local beam search at each iteration to find the optimal stage-wise decoding. At the $t$-th slot, ILBS first generates the top $B$ token candidates ($B$ being the beam size) by applying one evaluation step based on the existing generation. Then, the prediction is limited to these top $B$ token candidates, so that the beam search procedure described above is applied on a narrow band of $B$ tokens instead of the full vocabulary $V$. This reduces the computation to $\mathcal{O}(T B^2)$ evaluations of the model, where $T$ is the number of slots.

After iterating over all the slots, the beam search terminates with the final set of hypotheses at the current stage. We then choose the top hypotheses as the parent stage for the next stage generation. For the final stage, we choose the hypothesis with the highest score as the final output sequence.
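The narrow-band idea can be illustrated with a toy left-to-right beam search over the slots of one stage. This is a simplified sketch, not the authors' ILBS implementation: `propose` and `rescore` are hypothetical callables standing in for model evaluations, and the toy scores (log-probabilities plus a repetition penalty) are invented for the example.

```python
import heapq

NOI = "[NOI]"


def ilbs_stage(seq, propose, rescore, beam=2):
    """Approximate beam search over the slots of one stage.

    propose(seq) -> list (one entry per slot) of {token: log-probability};
    rescore(seq, chosen, slot, token) -> log-probability of `token` in `slot`
    given the tokens already chosen for the slots to its left.
    """
    proposals = propose(seq)          # one parallel evaluation for all slots
    hypotheses = [(0.0, [])]          # (accumulated score, chosen slot tokens)
    for slot, candidates in enumerate(proposals):
        # Narrow band: only the top `beam` candidates of this slot are considered.
        band = heapq.nlargest(beam, candidates, key=candidates.get)
        expanded = [
            (score + rescore(seq, chosen, slot, token), chosen + [token])
            for score, chosen in hypotheses
            for token in band
        ]
        hypotheses = heapq.nlargest(beam, expanded, key=lambda h: h[0])
    return hypotheses                 # best hypothesis first


# Toy stand-ins for the model (values are made up for illustration).
TOY_PROPOSALS = [
    {"clean": -0.1, "fresh": -0.5, NOI: -3.0},  # slot before "and"
    {"clean": -0.2, "tidy": -0.4, NOI: -3.0},   # slot after "and"
]


def toy_propose(seq):
    return TOY_PROPOSALS


def toy_rescore(seq, chosen, slot, token):
    # Penalise repeating a token that was already chosen for an earlier slot.
    penalty = 2.0 if token != NOI and token in chosen else 0.0
    return TOY_PROPOSALS[slot][token] - penalty


seq = ["[SOS]", "and", "[EOS]"]
best_score, best_tokens = ilbs_stage(seq, toy_propose, toy_rescore, beam=2)[0]
greedy_tokens = [max(c, key=c.get) for c in TOY_PROPOSALS]
```

In this sketch each slot costs at most `beam * beam` calls to `rescore`. Independent greedy decoding picks "clean" for both slots, while the beam search prefers "clean and tidy" because the second choice is scored in the context of the first.
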

4 Experiments

We evaluate the Pointer model on constrained text generation over News and Yelp datasets. Details of the datasets and experimental results are provided in the following sub-sections.

4.1 Experimental Setup

Datasets and Pre-processing

We evaluate our model on two datasets. The EMNLP2017 WMT News dataset (http://www.statmt.org/wmt17/) contains 268,586 sentences, and we randomly pick 10k sentences as the validation set, and 1k sentences as the test set. The average length of sentences is words. The Yelp English review dataset is from Cho et al. (2018), which contains 160k training examples, 10k validation examples and 1k test examples. The average length of the examples is words. These two datasets vary in text length and domain, which enables the assessment of our model’s performance in different scenarios.

The English Wikipedia dataset we used for pre-training is first pre-processed into a set of natural sentences, with maximum sequence length of 64 tokens, which results in 1.99 million sentences for model training in total (12.6 GB raw text). On average, each sentence contains 27.4 tokens.

For inference, we extract the testing lexical constraints for all the compared methods using the third-party extraction tool YAKE (https://github.com/LIAAD/yake). To account for the fact that sentences from Yelp are much longer than sentences from News, the maximum length of the lexical constraints we used for News and Yelp is set to 4 and 7, respectively.

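| Method | N-2 | N-4 | B-2 | B-4 | METEOR | E-4 | D-1 | D-2 | PPL | Avg. Len. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CGMH | 1.60 | 1.61 | 7.09% | 1.61% | 12.55% | 9.32 | 16.60% | 70.55% | 215.3 | 14.29 |
| NMSTG | 2.70 | 2.70 | 10.67% | 1.58% | 13.56% | 10.10 | 11.09% | 65.96% | 267.1 | 27.85 |
| Greedy | 2.90 | 2.80 | 12.13% | 1.63% | 15.66% | 10.41 | 5.89% | 39.42% | 97.1 | 47.40 |
| Greedy (+Wiki) | 3.04 | 3.06 | 13.01% | 2.51% | 16.38% | 10.22 | 11.10% | 57.78% | 56.7 | 31.32 |
| ILBS (+Wiki) | 3.20 | 3.22 | 14.00% | 2.99% | 15.71% | 9.86 | 13.17% | 61.22% | 66.4 | 22.59 |
| Human | - | - | - | - | - | 10.05 | 11.80% | 62.44% | 47.4 | 27.85 |

Table 2: Results on the News dataset. N-n, B-n, E-n and D-n denote NIST, BLEU, Entropy and Dist-n, respectively; PPL is perplexity and Avg. Len. is the average length of the generated text. ILBS denotes beam search. “+Wiki” denotes fine-tuning on the Wiki-pretrained model. “Human” represents the held-out human reference.

| Method | N-2 | N-4 | B-2 | B-4 | METEOR | E-4 | D-1 | D-2 | PPL | Avg. Len. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CGMH | 0.50 | 0.51 | 4.53% | 1.45% | 11.87% | 9.48 | 12.18% | 57.10% | 207.2 | 16.70 |
| NMSTG | 1.11 | 1.12 | 10.06% | 1.92% | 13.88% | 10.09 | 8.39% | 50.80% | 326.4 | 27.92 |
| Greedy | 2.15 | 2.15 | 11.48% | 2.16% | 17.12% | 11.00 | 4.19% | 31.42% | 99.5 | 87.30 |
| Greedy (+Wiki) | 3.27 | 3.30 | 15.63% | 3.32% | 16.14% | 10.64 | 7.51% | 46.12% | 71.9 | 48.22 |
| ILBS (+Wiki) | 3.34 | 3.38 | 16.68% | 3.65% | 15.57% | 10.44 | 9.43% | 50.66% | 61.0 | 35.18 |
| Human | - | - | - | - | - | 10.70 | 10.67% | 52.57% | 55.4 | 50.36 |

Table 3: Results on the Yelp dataset. Column abbreviations are the same as in Table 2. ILBS denotes beam search. “+Wiki” denotes fine-tuning on the Wiki-pretrained model. “Human” represents the held-out human reference.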

Baselines   We compare our model with two state-of-the-art methods for hard-constrained text generation: (i) Non-Monotonic Sequential Text Generation (NMSTG) Welleck et al. (2019), and (ii) Constrained Sentence Generation by Metropolis-Hastings Sampling (CGMH) Miao et al. (2019). Note that the Insertion Transformer Stern et al. (2019) focuses on machine translation rather than hard-constrained generation tasks, and therefore is not considered for comparison. Other methods based on grid beam search typically have long inference time, and they only operate on the inference stage; these are also excluded from comparison.

For NMSTG, we first convert the lexical constraints into a prefix sub-tree, and then sample a sentence to complete the sub-tree. We use the default settings suggested by the authors, use an LSTM with a hidden size of 1024 as the text generator, and select the best-performing variant (annealed) as our baseline. For CGMH, we use the default setting, which uses an LSTM with a hidden size of 300, and set the vocabulary size to 50k. Both models are trained until the evaluation loss no longer decreases. During inference, we run CGMH for 500 iterations with default hyperparameters.

Implementation Details  We employ the tokenizer from BERT, and use WordPiece Embeddings Wu et al. (2016) with a 30k token vocabulary for all the tasks. A special no-insertion token [NOI] is added to the vocabulary. We utilize the BERT-base model with 12 self-attention layers and a hidden dimension of 768 as our model initialization. Each model is trained until there is no progress on the validation loss. We use a learning rate of 3e-5 without any warm-up schedule for all the training procedures. The optimization algorithm is Adam Kingma and Ba (2015). We pre-train our model on the Wiki dataset for 2 epochs, and fine-tune on the News and Yelp datasets for around 10 epochs.

Evaluation Metrics  Following Galley et al. (2019); Zhang et al. (2020), we perform automatic evaluation using commonly adopted text generation metrics, including BLEU Papineni et al. (2002), METEOR Lavie and Agarwal (2007), and NIST Doddington (2002). NIST is a variant of BLEU that uses information gain to weight n-gram matches, and thus indirectly penalizes uninformative n-grams. The perplexity over the test set is reported to assess the syntactic and semantic coherence of generated sentences; it is evaluated by running inference with the pre-trained GPT-2 medium (345M) model. We also use Entropy Zhang et al. (2018) and Dist-n Li et al. (2016) to evaluate lexical diversity.
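As a rough illustration of the two diversity metrics, the sketch below computes corpus-level Dist-n as the ratio of distinct n-grams to all n-grams, and Entropy as the Shannon entropy (natural log) of the n-gram frequency distribution. These are common formulations and are assumed here; the exact definitions used in the cited papers may differ in detail.

```python
import math
from collections import Counter


def ngram_counts(sentences, n):
    """Count all n-grams over a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def dist_n(sentences, n):
    """Number of distinct n-grams divided by the total number of n-grams."""
    counts = ngram_counts(sentences, n)
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def entropy_n(sentences, n):
    """Shannon entropy (in nats) of the empirical n-gram distribution."""
    counts = ngram_counts(sentences, n)
    total = sum(counts.values())
    if not total:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


generated = [["great", "food", "great", "service"], ["great", "food"]]
d1 = dist_n(generated, 1)
d2 = dist_n(generated, 2)
e1 = entropy_n(generated, 1)
```
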

| Method | Generated text |
| --- | --- |
| Constraints | estate pay stay policy |
| CGMH | An economic estate developer that could pay for it is that a stay policy |
| NMSTG | As estate owners , they cannot pay for households for hundreds of middle - income property , buyers stay in retail policy . |
| Pointer | if you buy new buildings from real estate company, you may have to pay down a mortgage and stay with the policy for financial reasons . |
| Pointer (ILBS) | but no matter what foreign buyers do , real estate agents will have to pay a small fee to stay consistent with the policy . |
| Constraints | looked report realized wife |
| CGMH | He looked at the report and said he realized that if his wife Jane |
| NMSTG | I looked at my report about before I realized I return to travel holidays but - it doesn ’ t haven ’ t made anything like my wife . |
| Pointer | when i turned and looked at a file report from the airport and realized it was not my wife and daughter . |
| Pointer (ILBS) | when i turned around and looked down at the pictures from the report , i realized that it was my wife . |

Table 4: Generated examples from the News dataset.
| Method | Generated text |
| --- | --- |
| Constraints | service perfect delicious service awesome good place |
| CGMH | great service perfect food delicious atmosphere very good service very good awesome good place . |
| NMSTG | service was perfect , delicious and great service awesome service good food . this place will go back . |
| Pointer | excellent food , great service , really nice atmosphere , perfect amount of spring rolls , delicious especially the chicken and eel . the service was very friendly and the prices are awesome too . for a female who loves good japanese restaurant , this is definitely your place ! |
| Pointer (ILBS) | from the food to service . the foods are perfect , they were delicious . and service is beyond expectation . christina was awesome , so many good things about this place . |
| Constraints | joint great food great drinks greater staff |
| CGMH | super this joint has great food , great drinks and always greater wait staff . |
| NMSTG | awesome joint . great service . great food great drinks . good to greater and great staff ! |
| Pointer | my favorite local joint around old town . great atmosphere , amazing food , delicious and delicious coffee , great wine selection and delicious cold drinks , oh and maybe even a greater patio space and energetic front desk staff . |
| Pointer (ILBS) | the best breakfast joint in charlotte . great service and amazing food . they have great selection of drinks that suits the greater aesthetic of the staff . |

Table 5: Generated examples from the Yelp dataset.

4.2 Experimental Results

News Generation   We first conduct experiments on the News dataset to generate sentences from 4 lexical constraints. Quantitative results are summarized in Table 2, and some qualitative examples are provided in Table 4 and Appendix A. Pointer is able to take full advantage of BERT initialization and Wiki pre-training to improve relevance scores (NIST, BLEU and METEOR). Leveraging ILBS further improves performance. For diversity scores, CGMH achieves the highest Dist-n scores, as it is by nature a sampling-based method. We observed that the sentences generated by CGMH are relatively short, and that CGMH may yield less fluent generation when the constraints are more disconnected.

Semantics: A and B, which is more semantically meaningful and consistent?
News dataset                                 | Yelp dataset
System A         Neutral   System B          | System A         Neutral   System B
Pointer  60.9%   17.4%     21.8%  CGMH       | Pointer  59.8%   17.3%     23.0%  CGMH
Pointer  55.2%   21.7%     23.1%  NMSTG      | Pointer  57.5%   23.0%     19.6%  NMSTG
Pointer  21.7%   21.4%     56.9%  Human      | Pointer  26.8%   25.9%     47.3%  Human

Fluency: A and B, which is more grammatical and fluent?
News dataset                                 | Yelp dataset
System A         Neutral   System B          | System A         Neutral   System B
Pointer  57.7%   19.9%     22.4%  CGMH       | Pointer  54.2%   20.0%     25.8%  CGMH
Pointer  52.7%   24.1%     23.2%  NMSTG      | Pointer  59.0%   22.8%     18.2%  NMSTG
Pointer  16.6%   20.0%     63.4%  Human      | Pointer  24.0%   26.1%     49.9%  Human

Informativeness: A and B, which is more informative?
News dataset                                 | Yelp dataset
System A         Neutral   System B          | System A         Neutral   System B
Pointer  70.4%   12.8%     16.8%  CGMH       | Pointer  69.9%   10.9%     19.3%  CGMH
Pointer  57.7%   18.7%     23.6%  NMSTG      | Pointer  65.2%   18.1%     16.7%  NMSTG
Pointer  31.7%   19.0%     49.4%  Human      | Pointer  32.8%   19.0%     48.2%  Human
Table 6: Results of Human Evaluation on News and Yelp dataset for semantic consistency, fluency and informativeness, showing preferences (%) for our Pointer model vis-a-vis baselines and real human responses. Numbers in bold indicate the most preferred systems. Differences in mean preferences are statistically significant at .
Model     Training      Inference
CGMH      4382 toks/s   33h
NMSTG     357 toks/s    487s
Pointer   5096 toks/s   67s
Table 7: Speed comparison among different methods. “toks/s” represents tokens per second. Inference time is computed on 1000 test examples.

Yelp Generation   We further evaluate our method on the Yelp dataset, where the goal is to generate long-form text from a larger number of constraint words. Generating a longer piece of text with more lexical constraints is generally more challenging, since the model needs to capture the long-term dependency structure of the text and effectively come up with a plan to realize the generation. Results of automatic evaluation are provided in Table 3. Generated examples are shown in Table 5 and Appendix B. Generally, the generation from our model effectively considers all the lexical constraints, and is semantically more coherent and grammatically more fluent than that of the baseline methods. We also observe that greedy generation occasionally contains repeated words within a single generation stage (e.g., Table 5 example 2, “delicious and delicious coffee”), while ILBS dramatically alleviates this issue by generating the tokens of a stage sequentially, at a cost in efficiency. Compared with the greedy approach, ILBS output is typically more concise and contains less repeated information.

We perform additional experiments on zero-shot generation from the pre-trained model on both datasets, to test the versatility of pre-training. The generated sentences, albeit Wiki-like, are relatively fluent and coherent (see examples in Appendix A and B), and yield relatively high relevance scores as shown in Appendix C.

Human Evaluation  We conducted a human evaluation of 300 randomly sampled outputs of CGMH, NMSTG and our greedy method. Systems were paired, and each pair of system outputs was presented in random order to 5 crowd-sourced judges, who ranked the outputs pairwise for coherence, informativeness and fluency using a 5-point Likert-like scale. The human evaluation template is provided in Appendix E. The overall judge preferences for fluency, informativeness and semantic coherence are presented as percentages of the total “vote” in Table 6. Inter-annotator agreement was only “slight”, with Krippendorff’s alpha of 0.23 on the News dataset and 0.18 on the Yelp dataset. Despite the noise, the judgments show a strong across-the-board preference for Pointer over the two baseline systems on all categories. A clear preference for the human ground truth over our method is also observed.
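As an illustration of how pairwise Likert judgments can be turned into the preference percentages reported in Table 6, the sketch below pools all judge votes for one system pair. The mapping from scores to preferences (1 and 2 favor System A, 3 is neutral, 4 and 5 favor System B) is an assumption made for this example; the section does not specify the exact convention, and `preference_shares` is a hypothetical helper.

```python
from collections import Counter


def preference_shares(scores):
    """Convert 5-point pairwise scores into percentage shares of the total vote.

    Assumed convention (not confirmed by the paper):
    1-2 -> prefers System A, 3 -> neutral, 4-5 -> prefers System B.
    """
    if not scores:
        raise ValueError("at least one judgment is required")

    def bucket(score):
        if score not in (1, 2, 3, 4, 5):
            raise ValueError(f"score out of range: {score}")
        if score < 3:
            return "A"
        return "neutral" if score == 3 else "B"

    counts = Counter(bucket(s) for s in scores)
    total = len(scores)
    return {k: 100.0 * counts[k] / total for k in ("A", "neutral", "B")}


if __name__ == "__main__":
    # Five hypothetical judgments for a single output pair.
    print(preference_shares([1, 2, 2, 3, 5]))
```

Pooling votes in this way reports each judge's opinion directly rather than a per-item majority, which suits a setting where inter-annotator agreement is low.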

Running-time Comparison  One of the motivations for applying non-autoregressive generation is that generation at each stage can be parallelized, leading to a significant reduction in training and inference time. We compare the model training time and the inference decoding time of all the methods on the Yelp dataset, and summarize the results in Table 7. The evaluation is based on a single Nvidia V100 GPU. For training time, CGMH and Pointer are relatively fast, while NMSTG processes fewer tokens per second since it needs to generate a tree-like structure for each sentence. With respect to inference time, CGMH is slow, as it typically needs hundreds of sampling iterations to decode one sentence. Our method only requires approximately 3 rounds of BERT-like decoding, which enables decoding of 1000 sentences in about one minute. Note that our method in Table 7 uses greedy decoding. ILBS is around 20 times slower than greedy decoding.
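The small number of decoding rounds can be related to sequence length with a back-of-the-envelope calculation. The sketch below assumes, as a simplification not stated in this section, that one stage inserts at most one new token after each existing token, so the sequence length can at most double per stage. `min_insertion_stages` is a hypothetical helper, and the lengths in the usage lines are illustrative rather than measured.

```python
def min_insertion_stages(num_constraints, target_len):
    """Lower bound on the number of parallel insertion stages.

    Assumption: each stage adds at most one token per existing token,
    so the length at most doubles from one stage to the next.
    """
    if num_constraints < 1:
        raise ValueError("need at least one constraint token")
    if target_len < num_constraints:
        raise ValueError("target length cannot be shorter than the constraints")
    stages, length = 0, num_constraints
    while length < target_len:
        length *= 2  # best case: every slot receives a new token
        stages += 1
    return stages


if __name__ == "__main__":
    # 4 constraints growing to a 30-token sentence: 4 -> 8 -> 16 -> 32
    print(min_insertion_stages(4, 30))
    # 7 constraints growing to a 50-token review: 7 -> 14 -> 28 -> 56
    print(min_insertion_stages(7, 50))
```

Under this assumption the stage count grows logarithmically with the ratio of output length to constraint count, which is consistent with a handful of decoding rounds for sentence-length outputs.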

5 Conclusion

We have presented Pointer, a simple yet powerful approach to generating text from a given set of lexical constraints in a non-autoregressive manner. The proposed method leverages large-scale pre-training (BERT initialization and our insertion-based pre-training on Wikipedia) to generate text in a progressive manner using an insertion-based Transformer. Both automatic and human evaluation demonstrate the effectiveness of Pointer and its potential in constrained text generation. In future work, we plan to leverage understanding of sentence structure, such as constituency parsing, to further enhance the design of the progressive hierarchy. Our model can also be extended to allow inflected/variant forms and arbitrary ordering of the given lexical constraints.


References

  • D. Adiwardana, M. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al. (2020) Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977. Cited by: §2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv. Cited by: §1.
  • R. Bellman (1954) The theory of dynamic programming. Technical report Rand corp santa monica ca. Cited by: §3.2.
  • J. Bentley (1984) Programming pearls: algorithm design techniques. Communications of the ACM 27 (9), pp. 865–873. Cited by: §3.2.
  • R. Campos, V. Mangaravite, A. Pasquali, A. M. Jorge, C. Nunes, and A. Jatowt (2018) YAKE! collection-independent automatic keyword extractor. In European Conference on Information Retrieval, Cited by: §3.2.
  • R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, and A. Jatowt (2020) YAKE! keyword extraction from single documents using multiple local features. Information Sciences. Cited by: §3.2.
  • W. Chan, N. Kitaev, K. Guu, M. Stern, and J. Uszkoreit (2019) KERMIT: generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604. Cited by: §2.
  • L. Chen, Y. Zhang, R. Zhang, C. Tao, Z. Gan, H. Zhang, B. Li, D. Shen, C. Chen, and L. Carin (2019a) Improving sequence-to-sequence learning via optimal transport. In ICLR, Cited by: §1.
  • Y. Chen, Z. Gan, Y. Cheng, J. Liu, and J. Liu (2019b) Distilling the knowledge of bert for text generation. arXiv preprint arXiv:1911.03829. Cited by: §2.
  • W. S. Cho, P. Zhang, Y. Zhang, X. Li, M. Galley, C. Brockett, M. Wang, and J. Gao (2018) Towards coherent and cohesive long-form text generation. arXiv preprint arXiv:1811.00511. Cited by: §4.1.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. In ICLR, Cited by: §2.
  • S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu (2020) Plug and play language models: a simple approach to controlled text generation. In ICLR, Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv. Cited by: §1, §2, §3.3.
  • G. Doddington (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second international conference on Human Language Technology Research, Cited by: §4.1.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. In NeurIPS, Cited by: §2.
  • A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical neural story generation. In ACL, Cited by: §1, §3.4.
  • M. Galley, C. Brockett, X. Gao, J. Gao, and B. Dolan (2019) Grounded response generation task at dstc7. In AAAI Dialog System Technology Challenges Workshop, Cited by: §4.1.
  • M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer (2019) Mask-predict: parallel decoding of conditional masked language models. In EMNLP, Cited by: §2, §3.4.
  • D. Gries (1982) A note on a standard strategy for developing loop invariants and loops. Science of Computer Programming. Cited by: §3.2.
  • J. Gu, J. Bradbury, C. Xiong, V. O. Li, and R. Socher (2018) Non-autoregressive neural machine translation. In ICLR, Cited by: §2.
  • J. Gu, Q. Liu, and K. Cho (2019) Insertion-based decoding with automatically inferred generation order. TACL. Cited by: §2.
  • J. Gu, Z. Lu, H. Li, and V. O. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393. Cited by: §1.
  • C. Hokamp and Q. Liu (2017) Lexically constrained decoding for sequence generation using grid beam search. In ACL, Cited by: §1.
  • J. E. Hu, H. Khayrallah, R. Culkin, P. Xia, T. Chen, M. Post, and B. Van Durme (2019) Improved lexically constrained decoding for translation and monolingual rewriting. In NAACL, Cited by: §1.
  • J. Kasai, J. Cross, M. Ghazvininejad, and J. Gu (2020) Parallel machine translation with disentangled context transformer. arXiv preprint arXiv:2001.05136. Cited by: §2, §3.4.
  • N. S. Keskar, B. McCann, L. Varshney, C. Xiong, and R. Socher (2019) CTRL - A Conditional Transformer Language Model for Controllable Generation. arXiv preprint arXiv:1909.05858. Cited by: §2.
  • D. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
  • A. Lavie and A. Agarwal (2007) METEOR: an automatic metric for mt evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, Cited by: §4.1.
  • J. Lee, E. Mansimov, and K. Cho (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. EMNLP. Cited by: §2.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §2.
  • C. Li, X. Gao, Y. Li, X. Li, B. Peng, Y. Zhang, and J. Gao (2020) Optimus: organizing sentences via pre-trained modeling of a latent space. arXiv preprint arXiv:2004.04092. Cited by: §2.
  • J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In NAACL, Cited by: §4.1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2.
  • X. Ma, C. Zhou, X. Li, G. Neubig, and E. Hovy (2019) FlowSeq: non-autoregressive conditional sequence generation with generative flow. arXiv preprint arXiv:1909.02480. Cited by: §2.
  • E. Mansimov, A. Wang, and K. Cho (2019) A generalized framework of sequence generation with application to undirected sequence models. arXiv preprint arXiv:1905.12790. Cited by: §2.
  • N. Miao, H. Zhou, L. Mou, R. Yan, and L. Li (2019) Cgmh: constrained sentence generation by metropolis-hastings sampling. In AAAI, Cited by: §1, §4.1.
  • L. Mou, Y. Song, R. Yan, G. Li, L. Zhang, and Z. Jin (2016) Sequence to backward and forward sequences: a content-introducing approach to generative short-text conversation. arXiv preprint arXiv:1607.00970. Cited by: §1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, Cited by: §4.1.
  • B. Peng, C. Zhu, C. Li, X. Li, J. Li, M. Zeng, and J. Gao (2020) Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328. Cited by: §2.
  • M. Post and D. Vilar (2018) Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In NAACL, Cited by: §1.
  • L. Qin, M. Galley, C. Brockett, X. Liu, X. Gao, B. Dolan, Y. Choi, and J. Gao (2019) Conversing by reading: contentful neural conversation with on-demand machine reading. In ACL, Cited by: §1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2018) Language models are unsupervised multitask learners. Technical report OpenAI. Cited by: §2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §2.
  • A. Richards and J. P. How (2002) Aircraft trajectory planning with collision avoidance using mixed integer linear programming. In Proceedings of American Control Conference, Cited by: §3.2.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) Mass: masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450. Cited by: §2.
  • M. Stern, W. Chan, J. Kiros, and J. Uszkoreit (2019) Insertion transformer: flexible sequence generation via insertion operations. arXiv preprint arXiv:1902.03249. Cited by: §2, §3.1, §4.1.
  • Z. Sun, Z. Li, H. Wang, D. He, Z. Lin, and Z. Deng (2019) Fast structured decoding for sequence models. In NeurIPS, Cited by: §2.
  • J. Tang, T. Zhao, C. Xiong, X. Liang, E. P. Xing, and Z. Hu (2019) Target-guided open-domain conversation. In ACL, Cited by: §1.
  • S. Welleck, K. Brantley, H. Daumé III, and K. Cho (2019) Non-monotonic sequential text generation. arXiv preprint arXiv:1902.02192. Cited by: §1, §4.1.
  • F. Wu, A. Fan, A. Baevski, Y. N. Dauphin, and M. Auli (2019) Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430. Cited by: §2.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §4.1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. In NeurIPS, Cited by: §2, §3.4.
  • R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi (2019) Defending against neural fake news. In NeurIPS, Cited by: §2.
  • Y. Zhang, M. Galley, J. Gao, Z. Gan, X. Li, C. Brockett, and B. Dolan (2018) Generating informative and diverse conversational responses via adversarial information maximization. In NeurIPS, Cited by: §4.1.
  • Y. Zhang, D. Shen, G. Wang, Z. Gan, R. Henao, and L. Carin (2017) Deconvolutional paragraph representation learning. In NIPS, Cited by: §2.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2020) DialoGPT: large-scale generative pre-training for conversational response generation. In ACL (system demonstration), Cited by: §2, §4.1.


Appendix A Additional Generated Examples for News Dataset

Constraints aware negative immediately sites
Oracle Where we become aware of any accounts that may be negative , we immediately contact companies such as Instagram , although we have no control over what they allow on their sites .
CGMH Not even aware of negative events including video events immediately at stations , Facebook sites.
NMSTG Health providers in a country for England are aware of small health systems - and not non - health care but all negative is immediately treated by heads of businesses and departments in the sites .
Pointer ‘ if users are aware of the negative impact of blocking , how can they so immediately ban these sites ? ’ the researchers wrote .
Pointer (ILBS) if the users are aware of or the negative messages , they can immediately be transferred to other sites .
Wiki zero-shot he is not aware of the negative , and will immediately go to the positive sites .
Constraints children fault left charge
Oracle My relationship with my children was seriously affected as they were told time and again that everything was my fault , they were even left ‘ in charge ’ of me if my wife went out of the house .
CGMH His two children are the rare fault that left the police charge
NMSTG But despite children from hospitals to last one by fault backing this month , there have arrived as Mr Hunt has been left charge .
Pointer but i found that these children were not at school however this was not their fault , and if so they were left without a parent in charge .
Pointer (ILBS) but my lovely wife and children consider that it is not our own fault and we should not be left alone in charge .
Wiki zero-shot but for the children who are not at a fault , they are left behind on the charge .
Constraints estate pay stay policy
Oracle How many people on the estate does he think will be affected by the new pay - to - stay policy ?
CGMH An economic estate developer that could pay for it is that a stay policy
NMSTG As estate owners , they cannot pay for households for hundreds of middle - income property , buyers stay in retail policy .
Pointer if you buy new buildings from real estate company, you may have to pay down a mortgage and stay with the policy for financial reasons .
Pointer (ILBS) but no matter what foreign buyers do , real estate agents will have to pay a small fee to stay consistent with the policy .
Wiki zero-shot however , his real estate agent agreed to pay him for the stay under the same policy .
Constraints managers cut costs million
Oracle He was the third of four managers sent in to cut costs and deal with the city ’ s $ 13 million deficit .
CGMH The managers , who tried to cut off their costs , added 20 million euros
NMSTG Business managers cut demand for more expensive costs in 2017 - by October - is around 5 million 8 per cent , and has fallen by 0 . 3 per cent in January and 2017 .
Pointer under one of its general managers , the firm had already cut its annual operating costs from $ 13 . 5 million to six million euros .
Pointer (ILBS) and last month , the managers announced that it had cut its operating costs by $ 30 million .
Wiki zero-shot but then he and all of his managers agreed to cut off all of the operating costs by about 1 million .
Constraints looked report realized wife
Oracle I looked at the report and saw her name , and that’s when I realized it was my ex-wife .
CGMH He looked at the report and said he realized that if his wife Jane
NMSTG I looked at my report about before I realized I return to travel holidays but - it doesn ’ t haven ’ t made anything like my wife .
Pointer when i turned and looked at a file report from the airport and realized it was not my wife and daughter .
Pointer (ILBS) when i turned around and looked down at the pictures from the report , i realized that it was my wife .
Wiki zero-shot but when he looked up at the report , he realized that it was not his wife .
Constraints time claim tax year
Oracle Walker says there is still time to claim this higher protection if you haven ’ t already as the deadline is the end of the 2016 / 2017 tax year .
CGMH ” Two states , one - time voters can claim a federal tax year
NMSTG This time they had three to claim of an equal tax and 34 women at which indicated they should leave that over the year of 16 .
Pointer it is the very first time in history that trump will ever claim over $ 400 million in federal income tax that he had held last year , the same report says .
Pointer (ILBS) is this the very first time someone has to claim federal income tax twice in a single year ?
Wiki zero-shot but at the time , the claim was that the same sales tax that was from the previous fiscal year .
Constraints great past decade city
Oracle It ’ s been a great time , the past decade or so , to be the mayor of a major capital city .
CGMH The great past decade is that so much of a new home city
NMSTG I like to thank you for me and I ’ ve wanted it to grow in every great past decade over the city , a very amazing time .
Pointer this is one of the great cities that he have visited in the past two decade , the kansas city , missouri , he says .
Pointer (ILBS) you don ’ t feel as great as you ’ ve been in the past decade in a major city .
Wiki zero-shot there was a great success in the past during the last decade for the city .
Constraints model years big drama
Oracle The former model said : “ I haven ’ t seen him in so many years , I can ’ t make a big drama out of it . ”
CGMH The “ model ” continues , like many years of sexual and big drama going
NMSTG After model two years and did it like , could we already get bigger than others in a big drama ?
Pointer but i am a good role model , who has been around for 10 years now , and that is a big example of what i can do in drama on screen .
Pointer (ILBS) but the young actress and model , for 15 years , made a very big impact on the drama .
Wiki zero-shot she was a model actress for many years and was a big star in the drama .
Constraints made year resolution managed
Oracle I once made this my new year ’ s resolution , and it is the only one that I ’ ve actually ever managed to keep .
CGMH Indeed , as he made up the previous year , the GOP resolution was managed
NMSTG While additional sanctions had been issued last week made a year from the latest resolution , Russia ’ s Russian ministers have but have managed .
Pointer no progress has been made in syria since the security council started a year ago , when a resolution expressed confidence that moscow managed to save aleppo .
Pointer (ILBS) and the enormous progress we have made over the last year is to bring about a resolution that has not been managed .
Wiki zero-shot but despite all the same changes made both in both the previous fiscal year , and by the un resolution itself , only the federal government managed …
Constraints club believed centre window
Oracle The club are believed to be keen on bringing in cover at centre - back during the current transfer window , with a loan move most likely .
CGMH The club has also been believed that more than a new centre - up window
NMSTG One club believed it was not clear that the centre would hold place on the window until there were no cases that they had heard or had the decision disappeared .
Pointer he had been talking to the club since he is believed to have reached the centre spot in the queue before the january transfer window was suspended .
Pointer (ILBS) when he left his old club , chelsea , he was believed to be at the centre of the transfer window .
Wiki zero-shot during his first club as manager he was widely believed to be at the centre forward in the january transfer window .

Appendix B Additional Generated Examples for Yelp Dataset

Constraints service perfect delicious service awesome good place
Oracle yummy excellent service . ordered the carne asada medium rare . it was perfect . and delicious . their customer service was awesome . they were so friendly and made sure all was good . i definitely recommend this place .
CGMH great service perfect food delicious atmosphere very good service very good awesome good place .
NMSTG service was perfect , delicious and great service awesome service good food . this place will go back .
Pointer excellent food , great service , really nice atmosphere , perfect amount of spring rolls , delicious especially the chicken and eel . the service was very friendly and the prices are awesome too . for a female who loves good japanese restaurant , this is definitely your place !
Pointer (ILBS) from the food to service . the foods are perfect , they were delicious . and service is beyond expectation . christina was awesome , so many good things about this place .
Wiki zero-shot he said the service was perfect , and delicious , and the service that is awesome , and very good in its place .
Constraints good drinks love clients tighter great service
Oracle great atmosphere , good food and drinks . i love coming here in the fall to spring to meet with clients . their inside is a little small and makes summer a bit tighter , but still a great staff with excellent service .
CGMH good drinks . i love how out clients are tighter . great customer service .
NMSTG such good place with i love the mushroom drinks . the menu they love the clients . and tighter out the menu are great service .
Pointer this place is good . they have a wide variety of drinks . this really fits your taste . love the cozy bar that allows clients to be able to fit very tightly and tighter , better blending with the crowd . great coffee , reasonable prices , and friendly service !
Pointer (ILBS) nice place , with good vibe . nice mix of drinks and intimate space . what i really love about was there were so more mature clients , and they can fit in a tighter timeline . overall , great atmosphere and excellent service .
Wiki zero-shot she is a good at drinking , and in love for him and all his clients , and he enjoys a tighter schedule and has a great food and a generous service .
Constraints joint great food great drinks greater staff
Oracle apteka is seriously all around the best vegan joint in the burgh . great food , great drinks , greater staff .
CGMH super this joint has great food , great drinks and always greater wait staff .
NMSTG awesome joint . great service . great food great drinks . good to greater and great staff !
Pointer my favorite local joint around old town . great atmosphere , amazing food , delicious and delicious coffee , great wine selection and delicious cold drinks , oh and maybe even a greater patio space and energetic front desk staff .
Pointer (ILBS) the best breakfast joint in charlotte . great service and amazing food . they have great selection of drinks that suits the greater aesthetic of the staff .
Wiki zero-shot it is a joint owner of the great society of irish food , and the great britain and soft drinks , and the greater britain and its staff .
Constraints service polite professional affordable work safe tree
Oracle aron’s tree service were very polite and professional . they are very affordable . they arrived a little early and got right to work . they were quick and safe . they cleaned up and hauled out the tree trimmings . i highly recommend them .
CGMH excellent customer service , polite , professional , and affordable work , safe bike tree .
NMSTG excellent food and service and are amazing service and polite and professional . affordable it work out safe on sun tree !
Pointer amazing customer service . so polite , and very professional , and very affordable . such great work done at the safe end of a tree .
Pointer (ILBS) excellent customer service , very polite , and very professional . honest and affordable pricing . i will definitely get the work done here for the safe parts of my tree .
Wiki zero-shot customer service should be more polite , and more professional , and more affordable , and will work in a safe place under the family tree .
Constraints great great service happy found close home
Oracle great sushi and great service . i m really happy to have found a good sushi place so close to home !
CGMH great price and great customer service . very happy that i found this place close to my home .
NMSTG great food and great service . a happy and found a year in close for them . keep them home here .
Pointer amazing food . great quality food . great prices and friendly service staff . so happy and surprised to have finally found such a wonderful nail salon so close to my work and home .
Pointer (ILBS) this is just great food . great food and wonderful service . very happy to have finally found a chinese restaurant close to my home .
Wiki zero-shot he was a great teacher and a great love of the service he was very happy , and he found himself in the close to his home .
Constraints hesitate give customers chicken rice decent list
Oracle i hesitate to give them the five stars they deserve because they have a really small dining area and more customers , selfishly , would complicate things for me . chicken panang is quite good with a superb brown rice . decent wine list . after three visits the wait staff remembered what i like ( complicated ) and always get the order right .
CGMH i dont hesitate to give customers the chicken rice plate at a decent wine list .
NMSTG they hesitate to an wonderful time to give it about a table , love the customers chicken rice and dishes seafood and decent at the list .
Pointer i just did not even hesitate to admit , i should give credit cards to my customers here . the beijing chicken and fried rice were spot on , a decent side on my favorite list .
Pointer (ILBS) i don’t have to hesitate that they should give five stars . i will be one of their repeat customers . like the basil chicken and basil fried rice , it was decent on my list .
Wiki zero-shot he did not hesitate himself to give it to his customers , such as chicken , and steamed rice , a very decent item on the list .
Constraints good potential bad maintained replaced dirty disgusting
Oracle has good potential but very bad maintained . the padding is done , needs to be replaced , holes everywhere . so are those huge flowers or what ever those are . ripped . very dirty too . there was a a very dirty towel laying on the floor disgusting . please the city of vegas come and clean it !
CGMH great place but good potential bad management poorly maintained owner replaced the restroom dirty disgusting
NMSTG do a good price . not like the and potential bad maintained has disgusting . replaced been , dirty and disgusting .
Pointer the food was very good . it really has more potential maybe , but it smells really bad . its not very well maintained either . trash cans were replaced only when they were dirty . the floors were utterly disgusting .
Pointer (ILBS) the food is really good . this location has potential to be pretty bad and not very well maintained when it was replaced , its super dirty , just plain disgusting .
Wiki zero-shot it is good it has no potential , and the bad taste can be maintained until they are replaced by a dirty , and disgusting one .
Constraints love animal style long line expected quick
Oracle who doesn t love in and out . animal style is a must . long line but expected , it goes quick anyways so don t let that discourage you .
CGMH love this place . animal style food . long line than expected for quick .
NMSTG love animal chicken . it was style long a bit so good . the line is it was even on on a time and we expected to go but quick .
Pointer great little breakfast spot . i love having the double with animal style fries and protein style etc . have a super long wait line , but its just as expected and it always moves pretty quick too .
Pointer (ILBS) y all you just gotta love about this place is the double animal style and protein style . it was a long line , but i expected it to be quick .
Wiki zero-shot he also has love with the animal and his style , and was long as the finish line , and was expected to be quick .
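| Method | N-2 | N-4 | B-2 | B-4 | METEOR | E-4 | D-1 | D-2 | PPL | Avg Len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Greedy (+Wiki) | 3.04 | 3.06 | 13.01% | 2.51% | 16.38% | 10.22 | 11.10% | 57.78% | 56.7 | 31.32 |
| ILBS (+Wiki) | 3.20 | 3.22 | 14.00% | 2.99% | 15.71% | 9.86 | 13.17% | 61.22% | 66.4 | 22.59 |
| Wiki zero-shot | 2.80 | 2.82 | 11.38% | 1.84% | 15.12% | 9.73 | 14.33% | 53.97% | 62.9 | 20.68 |
| Human | - | - | - | - | - | 10.05 | 11.80% | 62.44% | 47.4 | 27.85 |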
Table 8: Additional evaluation results on the News dataset. ILBS denotes beam search. “+Wiki” denotes fine-tuning on the Wiki-pretrained model. “Human” represents the held-out human reference. “Wiki zero-shot” represents zero-shot generation from the pre-trained model.
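| Method | N-2 | N-4 | B-2 | B-4 | METEOR | E-4 | D-1 | D-2 | PPL | Avg Len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Greedy (+Wiki) | 3.27 | 3.30 | 15.63% | 3.32% | 16.14% | 10.64 | 7.51% | 46.12% | 71.9 | 48.22 |
| ILBS (+Wiki) | 3.34 | 3.38 | 16.68% | 3.65% | 15.57% | 10.44 | 9.43% | 50.66% | 61.0 | 35.18 |
| Wiki zero-shot | 0.86 | 0.87 | 8.56% | 1.30% | 12.85% | 9.90 | 10.09% | 41.97% | 62.9 | 26.80 |
| Human | - | - | - | - | - | 10.70 | 10.67% | 52.57% | 55.4 | 50.36 |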
Table 9: Additional evaluation results on the Yelp dataset. ILBS denotes beam search. “+Wiki” denotes fine-tuning on the Wiki-pretrained model. “Human” represents the held-out human reference. “Wiki zero-shot” represents zero-shot generation from the pre-trained model.

Appendix C Full Evaluation Data

We provide the full evaluation results, including the Wikipedia zero-shot results, in Table 8 and Table 9. Note that zero-shot generations from the Wikipedia pre-trained model yield the lowest perplexity, presumably because the Wikipedia dataset is large enough for a model trained on it to learn language variability and thus produce fluent text.

Appendix D Inference Details

During inference, we use a decaying schedule to discourage the model from generating uninteresting tokens, including certain special tokens, punctuation, and stop words. To do this, we apply a decay multiplier to the logits of these tokens before computing the softmax. The multiplier is set as a function of the current stage and an annealing hyper-parameter, which is held at a fixed value in most of the experiments.
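The logit-decay step described above can be illustrated with a minimal Python sketch. The schedule in `decay_multiplier` (a linear decrease floored at zero), the function names, and the example values are all assumptions made for illustration; they are not taken from the paper, whose exact formula and hyper-parameter value are not reproduced here.

```python
import math


def decay_multiplier(stage, lam):
    """Hypothetical schedule (an assumption, not the paper's formula):
    decreases linearly with the stage index and is floored at 0."""
    return max(0.0, 1.0 - stage * lam)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decayed_probs(logits, penalized_ids, eta):
    """Multiply the logits of the penalized tokens by eta, then apply softmax."""
    scaled = [x * eta if i in penalized_ids else x
              for i, x in enumerate(logits)]
    return softmax(scaled)


# Toy vocabulary of three tokens; token 0 stands in for a penalized token
# (for example a stop word). All values are illustrative only.
logits = [2.0, 1.0, 0.5]
eta = decay_multiplier(stage=1, lam=0.5)
probs = decayed_probs(logits, penalized_ids={0}, eta=eta)
```

Because the multiplier scales the logit rather than subtracting from it, this sketch only lowers a token's probability when its logit is positive; a real implementation would need to decide how to treat negative logits.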

Appendix E Human Evaluation Template

The human evaluation template is provided in Figure 2.

Figure 2: Human evaluation template.