1 Introduction
Automatic generation of coherent text is an important and challenging task and the basis of many downstream applications, such as automatic story generation Yao et al. (2019); Tan et al. (2020), image captioning Vinyals et al. (2015); Xu et al. (2015), machine translation Bahdanau et al. (2015); Liu et al. (2020), and dialogue systems/chatbots Li et al. (2017b, a). From a machine learning perspective, the algorithmic essence of such problems is usually sequence modeling, also known as language modeling.
The autoregressive language model is the most prevalent generative model for language. During training, it minimizes the negative log-likelihood of a sequence of $n$ tokens under a left-to-right factorization: $-\log p(x_1, \dots, x_n) = -\sum_{i=1}^{n} \log p(x_i \mid x_1, \dots, x_{i-1})$. During decoding, sentences are then generated token by token from left to right. With the transformer architecture Vaswani et al. (2017), every step of the likelihood estimation can be calculated in parallel while sharing the computation of the prefix context encoding. This makes it possible to build powerful, efficiently trainable sequence models such as the GPT family (Radford et al., 2018, 2019; Brown et al., 2020).
Despite the practical success, there are several notable concerns about left-to-right autoregressive generators. One important concern is that the left-to-right formulation does not sufficiently reflect the recursive and compositional nature of language (Figure 1). As a result, progressive refinement of text with left-to-right sequence models is nontrivial. This has motivated the community to explore paradigms beyond left-to-right autoregressive generation.
One direction is out-of-order generation, which formulates the generation process as an insertion process. A sentence is still gradually formed from void to completion, but each token (or group of tokens) can now be inserted at an arbitrary position of the existing context Stern et al. (2019); Welleck et al. (2019). Explorations of this generation paradigm focus on three potential benefits: 1) leveraging parallel decoding to reduce the number of inference iterations to sublinear complexity w.r.t. the sequence length Stern et al. (2019); 2) exploring the possibility of automatically learning the latent structures of sequences and revealing the compositional nature of language (Welleck et al., 2019); 3) the ability to achieve perfect lexical control Zhang et al. (2020), where several given words are required to appear in the generated sentences non-consecutively. This setup has broad applications in story generation Yao et al. (2019), task-oriented dialog systems Wen et al. (2016), RDF-to-text generation Gao et al. (2020), and lexically constrained machine translation Susanto et al. (2020).
However, out-of-order generation brings computational challenges. Unlike in traditional left-to-right generators, the absolute positions of the tokens are dynamic, as shown in Figure 2: every insertion shifts the absolute positions of all tokens to its right. In other words, the position encoding of each token is volatile as the generation proceeds, requiring a re-encoding of the context after each expansion. Computation sharing among the steps of a complete likelihood estimation is therefore usually considered impossible in such models.
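The volatility of absolute positions can be seen on a toy example. The sketch below is purely illustrative: the helper names `absolute_positions` and `shifted_tokens` are hypothetical, and the partial context “I have pen .” is the one used later in Section 3.1.

```python
def absolute_positions(tokens):
    # Map each token to its absolute index (assumes tokens are unique).
    return {token: index for index, token in enumerate(tokens)}


def shifted_tokens(context, slot, new_token):
    # Insert new_token so that it ends up at index `slot`, then report
    # which existing tokens changed their absolute position.
    expanded = context[:slot] + [new_token] + context[slot:]
    before = absolute_positions(context)
    after = absolute_positions(expanded)
    shifted = [tok for tok in context if before[tok] != after[tok]]
    return expanded, shifted


expanded, shifted = shifted_tokens(["I", "have", "pen", "."], 2, "a")
print(expanded)  # the context after the insertion
print(shifted)   # tokens whose absolute position encoding would change
```

Every token to the right of the insertion point is reported as shifted, so a model that ties its encoding to absolute positions has to re-encode those tokens after each insertion.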
We design an efficiently trainable insertion-based sequence generator that enjoys the benefits of out-of-order generation while achieving training efficiency comparable to that of left-to-right generation models. We achieve this by leveraging a modified relative position encoding that suits the model’s insertion-based nature, so that the positional encoding of already inserted tokens does not change during the course of the insertion-based expansion. With this new component, it becomes possible to share computation among different likelihood estimation steps as the generation progresses.
With efficient training, we demonstrate that insertion-based generation can scale up to long, diverse text such as stories Mostafazadeh et al. (2016). We show the potential of such models in achieving perfect lexical control in a structure-to-text generation setting Yao et al. (2019), as well as better compositional generalization in captioning scenes rendered under the settings of the CleVR-CoGENT dataset Johnson et al. (2017). This opens up new possibilities for diverse and creative text generation.
2 Background
The core challenge for efficient training of insertion-based models is the encoding of position information. We need a proper way to represent how the position information changes as the context expands, while keeping the representation incrementally computable so that we can amortize the encoding of the existing context. Our solution is largely inspired by the ideas of the XLNet family Yang et al. (2019); Dai et al. (2019); Shih et al. (2019). We also take the Insertion Transformer Stern et al. (2019) as an important baseline. Therefore, we first review these two lines of work.
Insertion Transformer The Insertion Transformer Stern et al. (2019) (IT-vanilla or simply IT) proposes a design for insertion-based text generation. In each step, a bidirectional encoder transformer is applied to the expanded sequence to compose the representation for each candidate slot between every two consecutive positions. The joint distribution over position and token is then optimized to support insertion-based text generation.
There are multiple variants of the Insertion Transformer proposed in the original paper, varying in whether parallel prediction/decoding is enabled, how the model determines the termination of generation, and how the model factorizes the text into a sequence of insertions. The probabilistic formulation differs slightly across these variants; what they have in common is how each insertion is modeled in the step loss. At a step where a token is inserted between two adjacent positions of the current context, the log step likelihood is computed from the concatenation of the bidirectional encodings of the sequence at those two positions. Note that since IT-vanilla adopts the original absolute positional encoding of transformers (as illustrated in Figure 2), the representation of the generated sequence has to be completely re-encoded after each step of context expansion to match the position changes of tokens. The expectation of the negative log step likelihood over all permitted context-insertion pairs at each step is computed as the step loss. The step losses from the first step to the last one are summed up as the sequence loss.
Transformer-XL and XLNet Transformer-XL Dai et al. (2019) proposes a powerful framework that supports relative position encoding and truncated gradient propagation in transformer-based sequence models. Instead of absolute positions that are tied to each token in the sequence, the spatial layout of the sequence is defined by a matrix that records the directed distance from the column token to the row token, as illustrated in Figure 3. Here, the signed value in each cell denotes “being that many units away on the left/right”.
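For intuition, such a directed distance matrix can be built in a few lines. This is a minimal NumPy sketch rather than the Transformer-XL implementation; the function name `directed_distance_matrix` and the sign convention (negative when the column token lies to the left of the row token) are assumptions made for illustration.

```python
import numpy as np


def directed_distance_matrix(length):
    # Entry (r, c) is the signed distance from row token r to column token c:
    # negative if c is to the left of r, positive if it is to the right.
    index = np.arange(length)
    return index[None, :] - index[:, None]


print(directed_distance_matrix(4))
```

Appending a token on the right leaves all existing entries untouched, which is why this layout is compatible with left-to-right computation sharing; inserting a token in the middle, by contrast, changes the distances between tokens on opposite sides of the insertion.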
XLNet Yang et al. (2019) exploits such an architecture to implement a generalized form of autoregressive language modeling, called permutation language modeling (PLM). Permutation language models shuffle the factorization order of the joint probability of a sequence and predict each token conditioned on the observed part of the sequence, given the predicted token’s position information.
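The objective of PLM can be summarized as:
$$\max_{\theta}\; \mathbb{E}_{z \sim \mathcal{Z}_n} \left[ \sum_{t=1}^{n} \log p_{\theta}\left(x_{z_t} \mid c_t, p_{z_t}\right) \right] \qquad (1)$$
Here $\mathcal{Z}_n$ is the set of permutations of the $n$ positions, $z_t$ and $p_{z_t}$ are the actual position and its encoding of the $t$-th element of the permutation, and $c_t$ is the known/observed context up to step $t$. Specifically, $c_1 = \emptyset$ and $c_{t+1} = c_t \cup \{(x_{z_t}, p_{z_t})\}$. For each predicted token $x_{z_t}$, a dummy token is created in an additional attention stream. The dummy token shares its positional information $p_{z_t}$, but it contains no content information about what exactly $x_{z_t}$ is. The encoding of the dummy token from the second attention stream of the model results in a representation of $(c_t, p_{z_t})$.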
Note that the permutation view in XLNet resembles a random insertion order for generating a sequence in an insertion-based model. However, when XLNet computes the relative position encoding, it assumes a global view of the oracle sequence. This implicitly assumes that the span length between every two observed tokens is known a priori, violating the assumption of insertion-based generation, where in theory we can insert arbitrarily many tokens between two generated tokens. According to Shih et al. (2019), this prevents XLNet from acting as an insertion-based generator.
Computation Sharing for Context Encoding A naive process of likelihood estimation involves a separate step of context encoding for every generated token. In practice, this is inefficient, particularly for modeling long sequences. In the pre-transformer era, some efforts towards exploiting the parallelism of GPUs proved useful (Bradbury et al., 2016; Gehring et al., 2017). A key insight behind such attempts is that the context representations of incoming steps can usually be calculated incrementally from the representations of previous steps, so that computation can be shared among different steps of context encoding. Left-to-right decoder transformers enable this by applying a lower-triangular mask that only allows leftwards attention. As a result, the representation of each position does not change as the sequence grows, and the computation of the representation for a new token can rely on the previously computed prefix representations.
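The prefix-invariance property behind this computation sharing can be checked numerically. The following is a minimal single-head sketch in NumPy under simplifying assumptions (no learned projections, no positional encoding, a single layer); the function name `causal_self_attention` is hypothetical and the code is not meant to reproduce any particular implementation.

```python
import numpy as np


def causal_self_attention(x):
    # x: (length, dim) token representations.
    length, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)
    # Lower-triangular mask: position i may only attend to positions <= i.
    mask = np.tril(np.ones((length, length), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x


rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
full = causal_self_attention(tokens)        # encode all 6 tokens at once
prefix = causal_self_attention(tokens[:4])  # encode only the first 4 tokens
print(np.allclose(full[:4], prefix))
```

Because the representations of the first four positions are identical in both calls, a single pass over the full sequence yields the context encoding for every prefix at once.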
With the volatile positional information in insertion-based models, leftwards attention no longer enables computation sharing in context encoding. Efficient computation thus becomes a challenge. Our model aims to solve this problem.
3 Efficient Insertion-based Generation
Since we generate sentences token by token, there are three major components we need to consider: 1) context encoding, which composes the representation of the generated context, 2) position/token prediction, which answers the question of where the to-be-inserted token goes and what it is, and 3) termination criteria during the decoding process. We discuss these three components for our model in detail.
3.1 Context Encoding
We start with the context encoding of our model, specifically the encoding of position information. In previous sections, we discussed the challenges caused by the phenomenon shown in Figure 2, which makes an incremental calculation of the new context representation, i.e. efficient likelihood estimation, seemingly impossible. We argue that with a relative position encoding mechanism, incremental calculation of the new context representation after each insertion is still feasible.
Consider a case where the partial context (to be completed by further insertions) is “I have pen .” One minimal way to make this a grammatical sentence is to insert “a” between “have” and “pen”. The directed distance vector for the token “a” is illustrated in Figure 4. This relative position annotation clearly defines where the insertion happens by describing only the spatial relationship between each pair of tokens. If we pack all the relative position vectors together, we get a matrix that reflects the relative spatial relations along the trajectory of insertions, with each row corresponding to an insertion step. We call it the offset matrix. For example, for the sequence “I have a pen .” in the insertion-based generation order “BOS” → “EOS” → “have” → “pen” → “I” → “.” → “a”, the complete offset matrix is shown in Figure 5(a). (Given the partial generation “I have pen .”, the representations of the already generated tokens do not attend to the token “a”, so we can simply mask out these slots in both the token and position attention masks, resulting in a lower-triangular offset matrix.) Figure 5(b) shows the offset matrix in an alternative view arranged in the original sequence order. We can see that the relative position encoding reflects the order of the original sequence, with the masked positions, i.e. the “later inserted tokens”, correctly skipped.
We disentangle the position information from the token embeddings as in XLNet Yang et al. (2019). Since the token-only information can be perfectly shared among different steps, constructing such an insertion-based, lower-triangular relative position matrix allows us to adopt the computation sharing trick of traditional decoder transformers and considerably speed up training.
Offset Compression We now show that, given the insertion order of a sequence described as a permutation of absolute position indices (in the previous example, 0, 6, 2, 4, 1, 5, 3), the offset matrix can be computed efficiently with a process called offset compression. See Figure 6.
Specifically, we first convert the absolute position vector into a matrix by duplication. Then, the upper-triangular elements are masked with “infinity” to remove their impact on the relative position computation, because an inserted token should not attend to future to-be-inserted tokens. In the third step, each element is replaced by its in-row low-to-high rank, i.e. its absolute position when the masked positions are skipped. In the last step, each row is baselined by its diagonal element to reflect the fact that the model is attending from the last inserted token to the previous ones.
Note that when the insertion order is left-to-right, the resulting model is equivalent to a traditional autoregressive language model with relative positional encoding.
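The four steps of offset compression can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not a reference implementation: the function name `offset_compression` is hypothetical, masked (upper-triangular) cells are reported as 0 and are expected to be excluded by the attention mask, and the sign convention (negative for tokens to the left of the newly inserted one) is an assumption.

```python
import numpy as np


def offset_compression(insertion_order):
    # insertion_order[t] is the absolute position, in the final sequence,
    # of the token inserted at step t.
    pos = np.asarray(insertion_order, dtype=float)
    n = pos.shape[0]
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)

    # Step 1: duplicate the absolute position vector into an n x n matrix.
    mat = np.tile(pos, (n, 1))
    # Step 2: mask future (not yet inserted) tokens with infinity.
    mat[upper] = np.inf
    # Step 3: replace each element by its in-row low-to-high rank.
    order = np.argsort(mat, axis=1, kind="stable")
    ranks = np.argsort(order, axis=1, kind="stable")
    # Step 4: baseline each row by its diagonal element.
    offsets = ranks - np.diag(ranks)[:, None]
    offsets[upper] = 0  # masked cells carry no information
    return offsets


# Running example: "BOS I have a pen . EOS" generated in the order
# BOS, EOS, have, pen, I, ".", a.
print(offset_compression([0, 6, 2, 4, 1, 5, 3]))
```

In the last row, the token “a” sees “have” at offset -1 and “pen” at offset +1, matching the insertion described above. For a left-to-right order such as `[0, 1, 2, 3]`, row t reduces to the offsets -t, ..., -1, 0, i.e. the usual causal relative positions.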
3.2 Slot Representation for Token Prediction
With the context encoding in place, the next step is to aggregate the prefix representations for the next token/position prediction. Building upon the ideas of the Insertion Transformer and XLNet, we propose two ways of aggregation, illustrated in Figure 7.
The deep aggregation uses the two-stream attention mechanism proposed in XLNet to aggregate the information for token prediction. The advantage is that it fully utilizes the model capacity to compute slot representations. However, the computation of each slot representation requires a separate attention stream. In position prediction, we need to simultaneously obtain the representation of every candidate slot, which requires additional attention streams and makes deep aggregation computationally expensive.
The shallow aggregation, mimicking the behavior of the vanilla Insertion Transformer, uses a concatenation of the representation vectors from the left-neighboring and right-neighboring positions as the slot representation. Since the aggregation from context embeddings to slot embeddings only includes sparse operations like selection and concatenation, we can efficiently enumerate the slot representations in parallel for each time step, allowing us to compute the position likelihood and perform the sequence-level termination control. A minor concern about the naive implementation of shallow aggregation is that, in some corner cases, the slot representation is not correctly updated by new insertions. The example in Figure 8 demonstrates such a case: assume we fill SLOT A at step 10 while SLOT B’s representation is determined by the representation vectors of step 4 and step 9. The token inserted in SLOT A will not affect the representation of SLOT B, which is problematic. One remedy is to also concatenate the representation from the latest insertion step into each slot’s representation, so that the information is complete.
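Shallow aggregation, including the remedy of appending the latest step’s representation, amounts to index selection and concatenation. The sketch below is illustrative only: the function name `shallow_slot_representations`, its arguments, and the layout of `hidden` (one row per insertion step) are assumptions rather than the interface of an actual implementation.

```python
import numpy as np


def shallow_slot_representations(hidden, sequence_order, latest_step=None):
    # hidden: (num_steps, dim); hidden[s] is the representation of the
    #   token inserted at step s.
    # sequence_order: insertion-step indices arranged in the left-to-right
    #   order of the current partial sequence.
    # Returns one row per slot between two adjacent tokens.
    left = hidden[sequence_order[:-1]]
    right = hidden[sequence_order[1:]]
    parts = [left, right]
    if latest_step is not None:
        # Remedy: every slot also sees the most recently inserted token.
        parts.append(np.broadcast_to(hidden[latest_step], left.shape))
    return np.concatenate(parts, axis=1)


# Partial sequence "BOS I have pen . EOS", whose tokens were inserted at
# steps 0, 4, 2, 3, 5, 1 respectively; the latest insertion is step 5.
hidden = np.arange(12, dtype=float).reshape(6, 2)
slots = shallow_slot_representations(hidden, [0, 4, 2, 3, 5, 1], latest_step=5)
print(slots.shape)  # 5 slots, each of size 3 * dim
```

Since only gathers and a concatenation are involved, all slot representations of a step are obtained in one batched operation.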
3.3 Inference and Termination
The original focus of the Insertion Transformer is to accelerate the decoding of machine translation systems to sublinear complexity. Its design relies heavily on parallel decoding, i.e. simultaneously inserting multiple tokens in each step of the generation process. However, we found it hard to make parallel decoding work well on more general text generation tasks with high conditional entropy, such as story generation. A typical failure of parallel decoding is shown in Figure 9, where it tends to generate extremely repetitive content. Thus, we mostly focus on the “uniform” decoding variant of the Insertion Transformer, which uniform-randomly predicts the next insertion position and token and performs one insertion operation per step.
For termination, we follow the Insertion Transformer and use sequence-level control, which relies on the estimated position distribution (with a special termination position to terminate the generation) to determine whether the algorithm should stop generating. In addition, for longer and more diverse text generation, we force the model to keep expanding the context until the log-likelihood of the termination position reaches the expected log-likelihood measured on the development set.
4 Experimental setup
We examine InsNet’s ability as an insertion-based generative sequence model in multiple aspects, including computational efficiency, compatibility with the traditional left-to-right formulation, the ability to achieve lexically constrained text generation, and compositional generalizability. We introduce the datasets and experimental setup we use to support our empirical studies.
4.1 Efficient and Controllable Insertion-based Generation
To demonstrate that InsNet can scale up to longer, more diverse text generation, we evaluate the model on both synthetic and real-world datasets to show its efficiency and controllability.
Computational Efficiency Benchmark Because of architectural parallelism, the running time observed in practice can deviate considerably from the theoretical analysis of computational efficiency, depending on the model size and the available computational resources. We therefore create a synthetic benchmark to empirically show the computational efficiency of our model.
We create a random sequence dataset with variable length to reflect how each model’s running time grows w.r.t. the length of the predicted sequence. We set the vocabulary size to 30000, mimicking the vocabulary size of the majority of frequently used tokenizers. For each selected length, we sample 25000 random sequences of that length. We train three sequence models (InsNet, IT-vanilla, and an L2R model with the GPT-2 base architecture) on the random sequence dataset, record the time cost per epoch for 5 epochs, and take the average.
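A dataset of this kind can be generated as sketched below. The vocabulary size and the number of sequences per length follow the description above; drawing token ids independently and uniformly, the function name `make_random_dataset`, and the fixed seed are assumptions made for illustration.

```python
import numpy as np

VOCAB_SIZE = 30000     # from the setup described above
NUM_SEQUENCES = 25000  # sequences sampled per selected length


def make_random_dataset(length, num_sequences=NUM_SEQUENCES, seed=0):
    # Assumption: token ids are drawn i.i.d. uniformly from the vocabulary.
    rng = np.random.default_rng(seed)
    return rng.integers(0, VOCAB_SIZE, size=(num_sequences, length))


dataset = make_random_dataset(20)
print(dataset.shape)
```

Calling the function once per selected length yields one benchmark set per length, on which the per-epoch training time of each model can then be measured.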
Long Text Generation Benchmark We use story generation as the task to showcase the models’ ability to generate long texts. The ROCStories corpus Mostafazadeh et al. (2016) contains 98,162 five-line stories with a title. In addition, 1,817 title-less stories are provided for development and for test, respectively. The average length of the stories in the corpus is 50, which makes the dataset a good testbed for diverse, medium-length generative sequence modeling. Following the data split in Yao et al. (2019), we further split the 98,162 training stories with titles into train, development, and test splits, with a ratio of approximately 8:1:1.
Lexically Constrained Generation ROCStories with storyline annotation (Figure 10), first created and used in (Yao et al., 2019), is a good testbed for evaluating a model’s ability for lexically controlled text generation. Beyond the basic title-story pair, a sequence of keywords, called the “storyline”, is extracted from each story. In the lexically controlled generation setting, the model is trained to generate a story from a given storyline, and the generated story must contain all the storyline keywords.
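The requirement that a generated story contain all storyline keywords can be checked mechanically. The sketch below is one possible way to do so and is not necessarily how the incorporation rate reported later is computed: the function names, lowercasing, whitespace tokenization, and exact single-word matching are all assumptions.

```python
def storyline_incorporated(story, storyline):
    # True if every storyline keyword occurs as a token of the story.
    # Assumption: single-word keywords, whitespace tokenization.
    tokens = set(story.lower().split())
    return all(keyword.lower() in tokens for keyword in storyline)


def incorporation_rate(stories, storylines):
    # Percentage of stories that contain all of their storyline keywords.
    hits = sum(
        storyline_incorporated(story, storyline)
        for story, storyline in zip(stories, storylines)
    )
    return 100.0 * hits / len(stories)


stories = ["I have a pen .", "I like tea ."]
storylines = [["have", "pen"], ["tea", "pen"]]
print(incorporation_rate(stories, storylines))
```

An insertion-based generator that starts from the storyline tokens and never deletes them passes this check by construction.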
4.2 Compositional Generalization
In addition to the story generation task, we create a simplified version of the compositional generalization (CoGENT) problem on the CleVR dataset (Johnson et al., 2017) to study the impact of the insertion-based, out-of-order formulation on the model’s causal preference and how it affects the model’s compositional generalization ability. CleVR is a dataset/data creator that contains scenes where one or more objects are placed on a gray table. The objects have five properties: size, color, shape, material and location. In the basic setting of CleVR, the possible shapes are cubes, cylinders and spheres. The possible colors are gray, red, blue, green, brown, purple, cyan and yellow. The material of each object is either plastic or metal. CoGENT is a specialized task that challenges the evaluated model’s generalization ability when the i.i.d. assumption is violated. CoGENT contains two constrained subsets of configurations. Under both settings, CoGENT_A and CoGENT_B, there are no limitations on the spheres, so that the model should learn that the colors are reasonably interchangeable values of the same property. In CoGENT_A, the cubes can only be gray, blue, brown or yellow, and the cylinders can only be red, green, purple or cyan. In CoGENT_B, the color limitations are exchanged between cubes and cylinders. Models are trained and developed on CoGENT_A, then tested on CoGENT_B.
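The color constraints of the two settings can be written down as a small lookup table. The constraint values below are taken from the description above; the names `ALLOWED_COLORS` and `is_allowed` are hypothetical and serve only as an illustration.

```python
ALL_COLORS = {"gray", "red", "blue", "green", "brown", "purple", "cyan", "yellow"}
GROUP_1 = {"gray", "blue", "brown", "yellow"}
GROUP_2 = {"red", "green", "purple", "cyan"}

# Spheres are unconstrained; cubes and cylinders swap groups between A and B.
ALLOWED_COLORS = {
    "CoGENT_A": {"cube": GROUP_1, "cylinder": GROUP_2, "sphere": ALL_COLORS},
    "CoGENT_B": {"cube": GROUP_2, "cylinder": GROUP_1, "sphere": ALL_COLORS},
}


def is_allowed(setting, shape, color):
    return color in ALLOWED_COLORS[setting][shape]


# Shape-color combinations that appear under CoGENT_B but never under CoGENT_A.
novel = sorted(
    (shape, color)
    for shape in ("cube", "cylinder", "sphere")
    for color in ALL_COLORS
    if is_allowed("CoGENT_B", shape, color) and not is_allowed("CoGENT_A", shape, color)
)
print(len(novel))
```

Every color and every shape is observed during training on CoGENT_A, but the eight cube and cylinder combinations listed in `novel` are only encountered at test time, which is what makes the task a test of compositional generalization.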
Dataset Preparation To make the scenario closer to the cases we could encounter in language generation tasks, we reshape the dataset into a simple image captioning problem, namely CoGENT-caption. CoGENT-caption includes 2000 single-object image-caption pairs under the CoGENT_A setting for training, 500 single-object image-caption pairs under the CoGENT_A setting for development, and 500 single-object image-caption pairs under the CoGENT_B setting for compositional generalization testing. Figure 11 provides several examples from the dataset with visual descriptions.
4.3 Implementation Details
For all language generation tasks, a BPE tokenizer is applied for word-piece level tokenization. Each of the evaluated transformer models, if not otherwise stated, is implemented as a base-sized transformer with 12 layers, 12 attention heads, and 768 hidden dimensions. The batch size is set to 64. In cases where the model size exceeds the device capacity, gradient accumulation is applied to achieve an equivalent optimization effect. The learning rate is selected from [5e-5, 1e-4, 2e-4]. The dropout rate is selected from [0.1, 0.2, 0.333333]. The weight decay rate is selected from [0.02, 0.05]. All the models are trained with 400 warm-up iterations and 80000 iterations of training in total. A linear-decay learning rate scheduler is applied for fine-grained training of the model. For the CoGENT-caption task, the image encoding is provided by a ResNet-50 (He et al., 2016) model. All the hyperparameters are determined by grid search, with a 5-epoch trial run under each setting.
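The search space and the learning rate schedule can be sketched as follows, reading the learning rates as 5e-5, 1e-4, and 2e-4. The candidate values and iteration counts come from the description above; the function names, the linear shape of the warm-up, and decaying to exactly zero at the last iteration are assumptions.

```python
from itertools import product

LEARNING_RATES = [5e-5, 1e-4, 2e-4]
DROPOUT_RATES = [0.1, 0.2, 0.333333]
WEIGHT_DECAYS = [0.02, 0.05]
WARMUP_STEPS = 400
TOTAL_STEPS = 80000


def hyperparameter_grid():
    # Enumerate every combination explored by the grid search.
    return [
        {"lr": lr, "dropout": dropout, "weight_decay": weight_decay}
        for lr, dropout, weight_decay in product(
            LEARNING_RATES, DROPOUT_RATES, WEIGHT_DECAYS
        )
    ]


def linear_decay_lr(step, peak_lr, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    # Assumption: linear warm-up to peak_lr, then linear decay to zero.
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))


print(len(hyperparameter_grid()))
```

With three learning rates, three dropout rates, and two weight decay rates, the grid contains 18 settings, each of which receives a 5-epoch trial run.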
5 Results and Analysis
5.1 Empirical Computational Cost Analysis
On a machine with an RTX 3090 GPU and a 12-core, 24-thread CPU, we collect results under sequence lengths of [20, 40, 60, 80, 100, 120, 160]. The results are illustrated in Figure 12.
Discussion As the illustration shows, when the text length increases, IT-vanilla quickly uses up the parallelization capacity and degenerates to an algorithm with approximately sequential complexity, whereas the time cost of InsNet and of traditional left-to-right sequence models grows more slowly and near-linearly.
5.2 Efficient Sequence Generation
Short Sequence Generation In the last subsection, we showed that it may not be practically affordable to obtain a well-trained IT-vanilla on longer sequences. However, before moving on to use the more efficient InsNet as the representative of insertion-based methods, it is both interesting and important to verify the performance consistency between InsNet and IT-vanilla. In addition to the likelihood measure, for a better comparison, we also collect and compare the decoding results from the two insertion-based models and a traditional left-to-right model. We found that the L2R model overfits severely if trained for the full 80000 iterations, so the L2R baseline is trained for only 20000 iterations, with the same linear-decay learning rate scheduler. The results are shown in Table 1.
Table 1
Model       BLEU-1  BLEU-2  Token NLL
InsNet      9.07    2.21    79.42/76.84
IT-vanilla  7.41    1.72    79.56/77.10
L2R         7.68    1.65    74.22/72.85
Table 2
Model       NLL              BLEU-1 / BLEU-2 / BLEU-3 / BLEU-4
InsNet-l2r  168.80 / 167.90  28.81 / 10.91 / 5.01 / 2.31
L2R         164.26 / 161.31  28.56 / 11.33 / 5.30 / 2.43
Table 3
Model                                        BLEU-1  BLEU-2  BLEU-3  BLEU-4  Storyline Inc. %
Static (Yao et al., 2019)                    28.20   12.80   6.36    3.44    78%
Dynamic (Yao et al., 2019)                   28.47   11.49   5.21    2.62    75%
Cond-LM (Yao et al., 2019)                   28.07   11.62   5.11    2.55    -
InsNet-full (w/ Generated Storyline Input)   27.85   12.27   5.80    2.97    100%
InsNet-sorted (w/ Generated Storyline Input) 27.33   12.09   5.79    2.98    100%
L2R-PNW (w/ Generated Storyline Input)       27.47   11.57   5.27    2.53    91.63%
InsNet-full (Golden Storyline Input)         52.86   36.86   26.39   19.35   100%
InsNet-sorted (Golden Storyline Input)       51.75   34.35   22.13   16.71   100%
L2R-PNW (Golden Storyline Input)             51.74   35.74   25.64   18.95   95.40%
Left-to-right Long Sequence Generation To show that our proposed method is a generalized form of the traditional left-to-right sequence model, we verify that it can correctly reproduce a regular left-to-right sequence model.
We perform the experiment on the ROCStories dataset, training a title-to-story conditional language model. For InsNet-l2r, we always feed a regular left-to-right insertion order to see whether the model can degenerate to a regular left-to-right, in-order sequence model and achieve reasonable performance. We compare the results in terms of likelihood prediction (NLL) and generation performance (BLEU-1, 2, 3, 4). The results are shown in Table 2. Although InsNet-l2r does not outperform the left-to-right baseline, its performance is comparable.
5.3 Lexically Constrained Generation
We now show one of the most appealing properties of insertion-based sequence models over non-insertion-based autoregressive generators: lexically constrained generation. Since an insertion-based sequence model expands the context without rewriting the context from the last iteration, the model can strictly follow a given storyline to generate the complete story instead of omitting part of the given guidance. We train a traditional left-to-right language model, conditioned on the given storylines, as the baseline, and compare its performance with two InsNet variants, namely InsNet-sorted and InsNet-full. During training, both variants are trained to first generate the storyline and then expand it into the full context. Given the storyline, InsNet-sorted is trained to reconstruct the context in left-to-right order, while InsNet-full completes it in completely random order. We collect and report the evaluated models’ performance on the title-storyline-story generation pipeline. We also report the performance given golden storylines as inputs. The results are shown in Table 3. To verify the generation quality, we also conduct a human evaluation on Amazon Mechanical Turk for 200 randomly sampled generated stories, each evaluated by five annotators with Likert-scale ratings from 1 to 5. The average scores are shown in Table 4.
Table 4
Model   Fidelity  Fluency  Coherence
InsNet  3.92      3.26     3.43
L2R     3.89      3.26     3.37
Discussion Results from our experiments indicate that, in the lexically controlled story generation task, the proposed InsNet achieves at least comparable performance to traditional left-to-right generators in terms of BLEU score and human ratings. As for the prompt incorporation rate, we observe a remarkable gain for transformer-based left-to-right models compared to the LSTM-based models in (Yao et al., 2019). However, all left-to-right models fail to guarantee perfect storyline incorporation, while InsNet naturally has a 100% incorporation rate due to its insertion-based nature.
5.4 Compositional Generalization
Another interesting property of out-of-order sequence models is their generalizability over compositional properties. Specifically, we argue that if novel samples are created from observed properties in unobserved combinations, out-of-order sequence models have better compositional generalizability than left-to-right sequence models. We conduct a synthetic experiment on the CoGENT-caption dataset and show the results in Table 5.
Table 5
Model   Color Acc.  Shape Acc.  Joint Acc.
InsNet  44.00%      37.60%      22.67%
L2R     94.93%      6.93%       1.87%
Discussion Although fully achieving compositional generalization is still hard, the out-of-order sequence model shows a remarkable gain in joint accuracy on the CoGENT-caption dataset over the baseline. The proposed InsNet shows a more balanced accuracy on the two attributes, color and shape. In contrast, the left-to-right model is biased towards recognizing the color of the object, possibly because the majority (2/3) of the templates describe the color before the shape. One possible explanation for these observations is that, in the stochastic observation reordering process of insertion-based sequence models, the model is equally likely to predict the shape before the color as the color before the shape. Thus, the model is forced to enumerate and analyze all possible logical dependencies within the context. The left-to-right model, on the contrary, learns to rely overly on the predicted color to help the shape inference, which is erroneous in compositional generalization. We believe this shows that insertion-based sequence models are more robust and have better compositional generalizability.
6 Conclusion and Future Work
We propose InsNet, an insertion-based sequence model with the capacity for efficient likelihood estimation. We empirically show the computational efficiency of such a model over IT-vanilla with a synthetic variable-length experiment. We also show several promising properties of our model, including its compatibility with the left-to-right generation order, the ability to generate long and diverse text, the power to achieve perfect lexical control in a structure-to-text generation setting, and better compositional generalization.
One interesting future direction is to train a large-scale version of InsNet as a universal pre-trained encoder for natural language understanding and lexically constrained natural language generation tasks. Another interesting direction is to investigate how to combine InsNet with parallel decoding on tasks like machine translation, so that we can build a model that is efficient during both training and inference.
References
 Neural machine translation by jointly learning to align and translate. (English (US)). Note: 3rd International Conference on Learning Representations, ICLR 2015 ; Conference date: 07052015 Through 09052015 Cited by: §1.
 . arXiv preprint arXiv:1611.01576. Cited by: §2.
 Language models are fewshot learners. arXiv preprint arXiv:2005.14165. Cited by: §1.
 TransformerXL: attentive language models beyond a fixedlength context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2978–2988. External Links: Link, Document Cited by: §2, §2.

Rdftotext generation with graphaugmented structural neural encoders.
In
Proceedings of the TwentyNinth International Joint Conference on Artificial Intelligence, IJCAI20
, pp. 3030–3036. Cited by: §1.  Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 1243–1252. External Links: Link Cited by: §2.

Deep residual learning for image recognition.
In
Proceedings of the IEEE conference on computer vision and pattern recognition
, pp. 770–778. Cited by: §4.3.  Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910. Cited by: §1, §4.2.
 Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547. Cited by: §1.
 Endtoend taskcompletion neural dialogue systems. arXiv preprint arXiv:1703.01008. Cited by: §1.
 Multilingual denoising pretraining for neural machine translation. Transactions of the Association for Computational Linguistics 8, pp. 726–742. Cited by: §1.
 A corpus and cloze evaluation framework for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016). Cited by: §1, §4.1.
 Improving language understanding by generative pre-training. Cited by: §1.
 Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §1.
 XL-Editor: post-editing sentences with XLNet. arXiv preprint arXiv:1910.10479. Cited by: §2, §2.
 Insertion transformer: flexible sequence generation via insertion operations. In International Conference on Machine Learning, pp. 5976–5985. Cited by: §1, §2, §2.
 Lexically constrained neural machine translation with levenshtein transformer. arXiv preprint arXiv:2004.12681. Cited by: §1.
 Progressive generation of long text. arXiv preprint arXiv:2006.15720. Cited by: §1.
 Attention is all you need. In NIPS. Cited by: §1.
 Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §1.
 Non-monotonic sequential text generation. In International Conference on Machine Learning, pp. 6716–6726. Cited by: §1.
 A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562. Cited by: §1.
 Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 2048–2057. Cited by: §1.
 XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §2, §2, §3.1.
 Plan-and-write: towards better automatic storytelling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7378–7385. Cited by: §1, §1, §1, §4.1, §4.1, §5.3, Table 3.
 POINTER: constrained text generation via insertion-based generative pre-training. arXiv preprint arXiv:2005.00558. Cited by: §1.
Appendix A Concrete Examples of the Progressive Growing of Text on CoGENTcaption
Condition  Context 

the  
in the  
is in the  
is in the picture  
is a in the picture  
is a red in the picture  
is a red in the picture .  
There is a red in the picture .  
There is a red cube in the picture .  
have  
have a  
have a the  
have a object the  
have a object the a  
have a object the shape a  
have a object the shape a .  
have a object the shape a cylinder .  
have a object the shape of a cylinder .  
have a object in the shape of a cylinder .  
have a yellow object in the shape of a cylinder .  
We have a yellow object in the shape of a cylinder .  
.  
is .  
A is .  
A sphere is .  
A sphere is green .  
A sphere is is green .  
A sphere is table is green .  
A sphere is table it is green .  
A sphere is on table it is green .  
A sphere is on table and it is green .  
A sphere is placed on table and it is green .  
A sphere is placed on the table and it is green .  
have  
have a  
have a .  
have a cube .  
We have a cube .  
We have a in cube .  
We have a in the cube .  
We have a in the a cube .  
We have a in the of a cube .  
We have a in the shape of a cube .  
We have a object in the shape of a cube .  
We have a cyan object in the shape of a cube . 
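Each row of the table above differs from the previous one by a single inserted token, so a whole trace can be replayed from a list of (slot, token) pairs. The following is a minimal sketch of that replay, not the model's decoding code. The function name `replay_insertions` and the 0-based slot convention (slot k is the gap before the k-th existing token) are assumptions made for illustration; the operations are read off the first example.

```python
from typing import List, Tuple


def replay_insertions(ops: List[Tuple[int, str]]) -> List[str]:
    """Apply (slot, token) insertions to an empty context and record every state.

    Hypothetical convention: slot k is the gap before the k-th existing token,
    and slot len(context) is the gap after the last token.
    """
    context: List[str] = []
    trace: List[str] = []
    for slot, token in ops:
        if not 0 <= slot <= len(context):
            raise ValueError(f"slot {slot} is invalid for a context of {len(context)} tokens")
        context.insert(slot, token)
        trace.append(" ".join(context))
    return trace


# Insertion operations read off the first example in the table above.
ops = [(0, "the"), (0, "in"), (0, "is"), (3, "picture"), (1, "a"),
       (2, "red"), (6, "."), (0, "There"), (4, "cube")]
trace = replay_insertions(ops)
for state in trace:
    print(state)
```

Under this convention, a context of t tokens always offers t + 1 candidate slots, which is why the number of choices per step grows with the partial sentence.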
Appendix B Concrete Examples of the Progressive Growing of Text on ROCStories
Title  Context 

the birthday party  claire turning/ party/ decided invite class/ party showed gifts/ excited 
claire turning./ party/ decided invite class/ party showed gifts/ excited  
claire turning./ party/ decided invite class/ party showed gifts/ excited  
claire turning eight./ party/ decided invite class/ party showed gifts/ excited  
claire turning eight./ party/ decided invite class/ party showed gifts/ was excited  
claire turning eight./ party/ decided invite class/ the party showed gifts/ was excited  
claire turning eight./ her party/ decided invite class/ the party showed gifts/ was excited  
claire turning eight./ her party / decided invite class/ the party showed gifts/ was excited  
claire turning eight./ her party / decided invite class/ the party showed up gifts/ was excited  
claire turning eight./ her party / decided invite class/ the party showed up gifts/ was excited  
claire turning eight./ her party / decided invite class / the party showed up gifts/ was excited  
claire turning eight./ her party / decided invite class / the party showed up gifts/ire was excited  
claire turning eight./ her party / decided invite class / the party showed up gifts/claire was excited  
claire turning eight./ her party / decided invite her class / the party showed up gifts/claire was excited  
claire turning eight./ her birthday party / decided invite her class / the party showed up gifts/claire was excited  
claire turning eight./ her birthday party / decided invite her class / the party showed up gifts./claire was excited  
claire turning eight./ her birthday party / decided invite her class / the party everyone showed up gifts./claire was excited  
claire turning eight./ her birthday party / decided invite her class / the party everyone showed up gifts./ claire was excited  
claire was turning eight./ her birthday party / decided invite her class / the party everyone showed up gifts./ claire was excited  
claire was turning eight./ her birthday party / decided invite her class / the party everyone showed up gifts./ claire was excited  
claire was turning eight./ her birthday party / decided invite her class./ the party everyone showed up gifts./ claire was excited  
claire was turning eight./ her birthday party / she decided invite her class./ the party everyone showed up gifts./ claire was excited  
claire was turning eight./ her birthday party./ she decided invite her class./ the party everyone showed up gifts./ claire was excited  
claire was turning eight./ her birthday party./ she decided invite her class./ the party everyone showed up gifts./ claire was excited.  
claire was turning eight./ her birthday party./ she decided to invite her class./ the party everyone showed up gifts./ claire was excited.  
claire was turning eight./ her birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited.  
claire was turning eight./ her a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited.  
claire was turning eight./ her had a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited.  
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited.  
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited to.  
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited to see.  
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited to see her.  
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited to see her party.  
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ at the party everyone showed up with gifts./ claire was excited to see her party.  
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ at the party, everyone showed up with gifts./ claire was excited to see her party.  
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ at the party, everyone showed up with gifts./ claire was excited to see her party.  
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ at the party, everyone showed up with gifts./ claire was so excited to see her party. 
john goes to the store.  john store/ tomatoes/ looked / sold / was 

john store/ he tomatoes/ looked / sold / was  
john to store/ he tomatoes/ looked / sold / was  
john to store/ he some tomatoes/ looked / sold / was  
john to store/ he some tomatoes/ looked / sold / was able  
john to store/ he some tomatoes/ looked / sold / was able to  
john to store/ he some tomatoes/ looked / sold / was able to buy  
john to store/ he some tomatoes/ looked / sold them / was able to buy  
john to store/ he some tomatoes/ looked / sold them / was able to buy some  
john to store./ he some tomatoes/ looked / sold them / was able to buy some  
john to store./ he some tomatoes/ looked at / sold them / was able to buy some  
john to store./ he some tomatoes/ looked at the / sold them / was able to buy some  
john to store./ he some tomatoes./ looked at the / sold them / was able to buy some  
john to store./ he some tomatoes./ looked at the produce / sold them / was able to buy some  
john to store./ he got some tomatoes./ looked at the produce / sold them / was able to buy some  
john to store./ he got some tomatoes./ looked at the produce / sold them / was able to buy some  
john to store./ he got some tomatoes./ looked at the produce / sold them / john was able to buy some  
john to store./ he got some tomatoes./ looked at the produce / he sold them / john was able to buy some  
john to store./ he got some tomatoes./ looked at the produce / he sold them / john was able to buy some  
john went to store./ he got some tomatoes./ looked at the produce / he sold them / john was able to buy some  
john went to store./ he got some tomatoes./ he looked at the produce / he sold them / john was able to buy some  
john went to store./ he got some tomatoes./ he looked at the produce / he sold them / john was able to buy some.  
john went to store./ he got some tomatoes./ he looked at the produce / he sold them / john was able to buy some.  
john went to the store./ he got some tomatoes./ he looked at the produce / he sold them / john was able to buy some.  
john went to the store./ he got some tomatoes./ he looked at the produce / he sold them./ john was able to buy some.  
john went to the store./ he got some tomatoes./ he looked at the produce./ he sold them./ john was able to buy some.  
john went to the store./ he got some tomatoes./ he looked at the produce aisle./ he sold them./ john was able to buy some.  
john went to the store./ he got some tomatoes./ he looked at the produce aisle./ he sold them./ john was able to buy some new.  
john went to the store./ he got some tomatoes./ he looked at the produce aisle./ he sold them./ john was able to buy some new vegetables. 
the minor flying  jessica flying/ time/ scared/ held hand/ thanks 

jessica flying/ time/ scared/ woman held hand / thanks  
jessica flying/ time/ scared/ a woman held hand / thanks  
jessica flying/ time/ scared./ a woman held hand / thanks  
jessica flying/ time/ scared./ a woman held hand / thanks  
jessica flying/ first time/ scared./ a woman held hand / thanks  
jessica flying/ was first time/ scared./ a woman held hand / thanks  
jessica flying/ was first time/ scared./ a woman held hand /s thanks  
jessica flying/ was first time/ scared./ a woman held hand /sica thanks  
jessica flying/ was first time/ was scared./ a woman held hand /sica thanks  
jessica flying / was first time/ was scared./ a woman held hand /sica thanks  
jessica flying / was first time/ was scared./ a woman held hand /jesica thanks  
jessica flying / was the first time/ was scared./ a woman held hand /jesica thanks  
jessica flying / was the first time/ she was scared./ a woman held hand /jesica thanks  
jessica flying / was the first time./ she was scared./ a woman held hand /jesica thanks  
jessica flying / was the first time./ she was scared./ a woman held hand /jesica thanks  
jessica flying / it was the first time./ she was scared./ a woman held hand /jesica thanks  
jessica flying / it was the first time./ she was scared./ a woman held her hand /jesica thanks  
jessica flying / it was the first time./ she was scared./ a woman held her hand /jesica was thanks  
jessica flying / it was the first time./ she was scared./ a woman held her hand /jesica was thanks  
jessica flying / it was the first time./ she was scared./ a woman held her hand /jesica was to thanks  
jessica flying / it was the first time./ she was scared./ a woman and held her hand /jesica was to thanks  
jessica flying / it was the first time./ she was scared./ a woman and held her hand /jesica was able to thanks  
jessica was flying / it was the first time./ she was scared./ a woman and held her hand /jesica was able to thanks  
jessica was flying / it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks  
jessica was flying / it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks.  
jessica was flying./ it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks.  
jessica was flying airplane./ it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks.  
jessica was flying an airplane./ it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks.  
jessica was flying an airplane./ it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks to.  
jessica was flying an airplane./ it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks to her.  
jessica was flying an airplane./ it was the first time./ she was scared./ a woman and held her hand./ jesica was able to thanks to her.  
jessica was flying an airplane./ it was the first time./ she was scared./ a woman came and held her hand./ jesica was able to thanks to her.  
jessica was flying an airplane./ it was the first time./ she was scared./ a woman came and held her hand./ jesica was able to fly thanks to her. 
Appendix C Formulations of InsNet Layers and Aggregation Methods
C.1 Layer Formulation
Most of the formulation of InsNet follows that of XLNet, with some minor differences. In this section, we give a brief mathematical description of an InsNet layer and of the aggregation methods.
We mostly follow the ideas of Transformer-XL/XLNet to incorporate the insertion-based relative position offset matrix. Suppose the tokens of the original sequence are inserted in a given permutation order, and we want to use InsNet to predict the insertion-based likelihood of this process. Each layer takes as input the sequence of representation vectors (from the previous InsNet layer or the word-embedding layer) and the sinusoidal relative position embedding matrix. The latter is a 3-d tensor whose sections correspond to the offset matrix (as shown in Figure 5 in the main text). As in XLNet/Transformer-XL, when computing the attention, the model needs to handle four groups of feature interactions:
Query Content-Key Content interaction, an n × n matrix;
Query Content-Key Position interaction, an n × n matrix;
Query Position-Key Content interaction, a vector;
Query Position-Key Position interaction, an n × n matrix.
The matrices in these terms are parameters that perform linear transformations on the embeddings, as in the standard bilinear multiplicative attention formulation. The additional vectors are the parts of the bilinear interaction that stay invariant because of the relative position encoding. The overall attention score is the sum of the four terms.
For i = 1,…,N, each layer of an N-layer InsNet applies four steps to its input, where the input to the first layer is the word-embedding sequence. The first step produces the linearly transformed versions of the input embeddings, namely the query head, key head, value head and position head. The second step performs the necessary bilinear interactions. The third step produces the actual attention probabilities, and the last step introduces nonlinearity, following the design of XLNet.
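The first three steps of a layer can be sketched for a single attention head as follows. This is a minimal illustration in the style of Transformer-XL relative attention, not the authors' implementation. All names (`relative_attention`, `H`, `R`, `Wq`, `Wk`, `Wv`, `Wr`, `u`, `v`) are assumptions, as are the scaling by the square root of the head width and the lower-triangular mask over tokens listed in insertion order. The final nonlinearity step is omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def relative_attention(H, R, Wq, Wk, Wv, Wr, u, v, mask):
    """One attention head with four interaction terms (hypothetical names).

    H:    (n, d)    token representations from the previous layer
    R:    (n, n, d) position embeddings already looked up through an offset matrix
    mask: (n, n)    True where query j is allowed to attend to key k
    """
    # Step 1: query, key, value and position heads.
    q, k, val = H @ Wq, H @ Wk, H @ Wv
    p = R @ Wr                               # (n, n, d_head)
    # Step 2: the four bilinear interactions.
    cc = q @ k.T                             # query content - key content,  (n, n)
    cp = np.einsum("jd,jkd->jk", q, p)       # query content - key position, (n, n)
    pc = k @ u                               # query position - key content, (n,)
    pp = p @ v                               # query position - key position, (n, n)
    scores = (cc + cp + pc[None, :] + pp) / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)    # block disallowed positions
    # Step 3: attention probabilities and the attended values.
    probs = softmax(scores, axis=-1)
    return probs, probs @ val


rng = np.random.default_rng(0)
n, d, d_head = 4, 8, 8
H = rng.normal(size=(n, d))
R = rng.normal(size=(n, n, d))               # stand-in for sinusoidal embeddings
Wq, Wk, Wv, Wr = (rng.normal(size=(d, d_head)) for _ in range(4))
u, v = rng.normal(size=d_head), rng.normal(size=d_head)
mask = np.tril(np.ones((n, n), dtype=bool))  # assumed: attend to earlier insertions only
probs, out = relative_attention(H, R, Wq, Wk, Wv, Wr, u, v, mask)
```

The third term does not depend on the query index, which is why it is a vector that is broadcast over the rows of the score matrix rather than a full matrix.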
C.2 Formulation of Aggregation Methods
After context encoding, we obtain a context-aware representation for each existing token. We then need representations for predicting the next insertion position and token. In insertion-based generation, tokens can only be inserted into slots, i.e., the gaps between two existing tokens, so we need a process that transforms the sequence of token representations into slot representations. We call this process aggregation.
Shallow Aggregation At each step, shallow aggregation takes the output representations from the transformer together with an unshuffling matrix and, with an information-update remedy, builds the slot representations by concatenating representations along the model-width dimension. The unshuffling matrix is easily obtained when running the offset compression algorithm, since it is simply the inverse of the algorithm's output.
At each step, there are a number of possible token slots plus a terminating slot. For simplicity, we denote the terminating slot as slot 0 and number the token slots from 1. Token likelihood prediction is simple: we index the corresponding slot representation and pass it directly through a log-linear layer to obtain the vocabulary distribution. For position prediction, we still need a representation for the termination slot. Here we directly use a global pooling vector that contains all the information needed for termination prediction. We add two modules that transform the actual slot representations and the dummy termination-slot representation, respectively, into logits for the slot likelihood, and we insert at the slot with the highest probability.
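One possible reading of shallow aggregation and slot prediction is sketched below. It is an illustration under several stated assumptions, not the authors' formulation: the unshuffling matrix is represented by an index array `order`; a slot is described by its left and right neighbours with zero padding at the boundaries; the information-update remedy is assumed to be the concatenation of the most recently inserted token's representation; that same vector is assumed to serve as the global pooling vector for the terminating slot; and the two scoring modules are plain weight vectors `w_slot` and `w_term`. All of these names are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def shallow_aggregate(H, order):
    """Turn token representations into slot representations.

    H:     (t, d) transformer outputs, listed in insertion order
    order: order[i] is the row of H holding the token at spatial position i
           (an index-array stand-in for the unshuffling matrix)
    """
    t, d = H.shape
    spatial = H[order]                          # unshuffle to left-to-right order
    pad = np.zeros((1, d))                      # assumed boundary padding
    left = np.vstack([pad, spatial])            # left neighbour of each slot
    right = np.vstack([spatial, pad])           # right neighbour of each slot
    latest = np.repeat(H[-1:], t + 1, axis=0)   # assumed information-update vector
    return np.concatenate([left, right, latest], axis=-1)   # (t + 1, 3d)


def predict_slot(H, order, w_slot, w_term):
    """Return the chosen slot (0 means terminate) and the slot distribution."""
    slots = shallow_aggregate(H, order)
    slot_logits = slots @ w_slot                # one logit per token slot
    term_logit = H[-1] @ w_term                 # latest token as global pooling vector
    probs = softmax(np.concatenate([[term_logit], slot_logits]))
    return int(probs.argmax()), probs


rng = np.random.default_rng(0)
d = 4
H = rng.normal(size=(3, d))                     # e.g. "the", "in", "is" in insertion order
order = np.array([2, 1, 0])                     # spatial order: "is in the"
slots = shallow_aggregate(H, order)
choice, probs = predict_slot(H, order, rng.normal(size=3 * d), rng.normal(size=d))
```

Token prediction would then index `slots[choice - 1]` and pass it through a log-linear layer over the vocabulary. Because every slot representation is assembled from token outputs that already exist, no extra input positions are required.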
Deep Aggregation Following the two-stream attention idea proposed in XLNet, the formulation for deep aggregation is straightforward, since it is essentially equivalent to adding a sequence of mask tokens that share the offset matrix with the original context. These mask tokens are trained to capture the content-free information for the slots that are occupied by the actually inserted tokens at each step. After model encoding, the context-aware representation of each mask token is used directly to represent the corresponding real token. The prediction formulation is otherwise the same as the one described above, except that the slot representations are taken from the mask tokens.
When using deep aggregation, the model's inputs are extended: the original offset matrix is augmented with the relative positions of the slots, which are also defined in the form of an offset matrix. Each mask token that stands for a slot can attend only to the positions that exist at that step.
Because each position in the model's outputs yields only one representation vector, every aggregated slot requires its own additional mask position in the input. When predicting the position, we need the representation of every possible slot at each step, so a whole sequence of length n would need O(n²) mask positions, making deep aggregation practically too expensive in terms of space complexity (and/or time complexity if we use up the parallelization capacity). Thus, in our experiments, we choose the cheaper method, i.e., shallow aggregation. Empirically, the performance of shallow aggregation is comparable with that of deep aggregation.
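The cost of deep aggregation can be made concrete with a rough count. The assumption here, which is ours, is that a context of t tokens exposes t + 1 slots (both ends included) and that n tokens are inserted in total, with one mask position per slot per step:

```latex
\sum_{t=0}^{n-1} (t+1) \;=\; \frac{n(n+1)}{2} \;=\; O(n^2)
```

For n = 32, this amounts to 528 mask positions on top of the 32 real tokens, whereas shallow aggregation reuses the token representations and adds no input positions.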