On Efficient Training, Controllability and Compositional Generalization of Insertion-based Language Generators

02/12/2021 · by Sidi Lu, et al.

Auto-regressive language models with the left-to-right generation order have been a predominant paradigm for language generation. Recently, out-of-order text generation beyond the traditional left-to-right paradigm has attracted extensive attention, with a notable variant being insertion-based generation, where a model gradually extends the context into a complete sentence purely through insertion operations. However, since insertion operations disturb the position information of each token, it is often believed that each step of the insertion-based likelihood estimation requires a bi-directional re-encoding of the whole generated sequence. This computational overhead prohibits the model from scaling up to generate long, diverse texts such as stories, news articles, and reports. To address this issue, we propose InsNet, an insertion-based sequence model that can be trained as efficiently as traditional transformer decoders while matching the performance of models with a bi-directional context encoder. We evaluate InsNet on story generation and CleVR-CoGENT captioning, showing the advantages of InsNet in several dimensions, including computational costs, generation quality, the ability to perfectly incorporate lexical controls, and better compositional generalization.


1 Introduction

Automatic generation of coherent text is an important and challenging task and the basis of many downstream applications, such as automatic story generation Yao et al. (2019); Tan et al. (2020), image captioning Vinyals et al. (2015); Xu et al. (2015), machine translation Bahdanau et al. (2015); Liu et al. (2020), and dialogue systems/chatbots Li et al. (2017b, a). The algorithmic essence of such a problem from a machine learning perspective is usually sequence modeling, also known as language modeling.

Auto-regressive language models are the most prevalent generative models for language. During training, they minimize the negative log-likelihood of a sequence of $n$ tokens under a left-to-right factorization: $\mathcal{L} = -\sum_{t=1}^{n} \log p_\theta(x_t \mid x_{<t})$. During decoding, sentences are generated token by token from left to right. With the transformer architecture Vaswani et al. (2017), each step of likelihood estimation can be calculated in parallel while sharing the prefix context encoding calculations. This makes it possible to build powerful, efficiently trainable sequence models, like the GPT family (Radford et al., 2018, 2019; Brown et al., 2020).

Figure 1: Example of the recursive nature of language generation. Bold sentences are already complete ones, but can still be progressively enriched. Left-to-right decoding cannot support such recursive enrichment.

Despite their practical success, there are several notable concerns about left-to-right auto-regressive generators. One important concern is that the left-to-right formulation does not sufficiently reflect the recursive and compositional nature of language (Figure 1). As a result, progressive refinement of text with left-to-right sequence models is non-trivial. This has motivated the community to explore paradigms outside of left-to-right auto-regressive generation.

One direction is out-of-order generation, which formulates the generation process as an insertion process. A sentence is still gradually formed from void to completion, but each token (or group of tokens) can now be inserted into arbitrary positions of an existing context Stern et al. (2019); Welleck et al. (2019). Explorations of this generation paradigm focus on three potential benefits: 1) leveraging parallel decoding to reduce the number of iterations at inference time to sub-linear complexity w.r.t. the sequence length Stern et al. (2019); 2) exploring the possibility of automatically learning the latent structures of sequences and revealing the compositional nature of languages (Welleck et al., 2019); 3) the ability to achieve perfect lexical control Zhang et al. (2020), where several given words are required to appear, non-consecutively, in the generated sentences. This setup has broad applications in story generation Yao et al. (2019), task-oriented dialog systems Wen et al. (2016), RDF-to-text generation Gao et al. (2020), and lexically constrained machine translation Susanto et al. (2020).

Figure 2: The absolute position information of each token is volatile for insertion-based models. Thus, previous models usually have to re-encode the sequence after each expansion.

However, out-of-order generation brings computational challenges. Unlike traditional left-to-right generators, the absolute positions of the inserted tokens are dynamic, as is shown in Figure 2. In other words, the position encoding of each token is volatile as the generation proceeds, requiring a re-encoding of the context after each expansion. Computation-sharing among steps of a complete likelihood estimation in such models is usually considered impossible.

We design an efficiently trainable insertion-based sequence generator that enjoys the benefits of out-of-order generation while having training efficiency comparable to left-to-right generation models. We achieve this by leveraging a modified relative position encoding that suits the model's insertion-based nature, so that the positional encoding of the inserted tokens does not change during the course of the insertion-based expansion. With this new component, we make it possible to share computation among different likelihood estimation steps as the generation progresses.

With efficient training, we demonstrate that insertion-based generation can scale up to long, diverse text such as stories Mostafazadeh et al. (2016). We show the potential of such models in achieving perfect lexical control in a structure-to-text generation setting Yao et al. (2019), and also better compositional generalization in captioning scenes rendered under the settings of CleVR-CoGENT dataset Johnson et al. (2017). This opens up new possibilities for diverse and creative text generation.

2 Background

The core challenge for efficient training of insertion-based models is the encoding of position information. We need a proper way to represent how the position information changes as the context expands, while keeping the representation incrementally computable, so that the encoding of the existing context can be amortized. Our solution is largely inspired by the ideas of the XLNet family Yang et al. (2019); Dai et al. (2019); Shih et al. (2019). We also take the insertion transformer Stern et al. (2019) as an important baseline. Therefore, we first review these two lines of work.

Insertion Transformer   Insertion transformer Stern et al. (2019) (IT-vanilla, or simply IT) proposes a design for insertion-based text generation. In each step, a bi-directional transformer encoder is run over the expanded sequence to compose the representation of each candidate slot between every two consecutive positions. The joint distribution over positions and tokens is then optimized to support insertion-based text generation.

There are multiple variants of insertion transformer proposed in the original paper, varying in whether parallel prediction/decoding is enabled, how the model determines the termination of generation, and how the model factorizes the text into a sequence of insertions. The probabilistic formulation for the different variants is slightly different. The common part of these variants is how each insertion is modeled in the step loss. On step $t$, where a token $w_t$ is inserted in between position $l$ and position $l+1$ of context $c_t$, the log step likelihood can be written as:

$$\log p_\theta(w_t, l \mid c_t) = \log p_\theta\big(w_t, l \mid [h_l \,;\, h_{l+1}]\big),$$

where $h_i$ stands for the $i$-th position of the bi-directional encoding of the sequence and $[\,\cdot\,;\,\cdot\,]$ stands for vector concatenation. Note that since IT-vanilla adopts the original absolute positional encoding of transformers (as illustrated in Figure 2), the representation of the generated sequence has to be completely re-encoded after each step of context expansion to match the position changes of tokens. The expectation of the negative log step likelihood over all permitted context-insertion pairs at each step is computed as the step loss. The step losses from the first step to the last one are summed up as the sequence loss.

Transformer-XL and XLNet   Transformer-XL Dai et al. (2019) proposes a powerful framework that supports relative position encoding and truncated gradient propagation in transformer-based sequence models. In place of absolute positions tied to each token in the sequence, the spatial layout of the sequence is defined by a matrix that records the directed distance from the column token to the row token, as illustrated in Figure 3. Here the value $d$ in each cell denotes "being $|d|$ units away to the left/right", depending on the sign.

Figure 3: Comparison between absolute and relative position encoding used in Transformer-XL/XLNet.
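As a toy illustration (ours, not from the paper), the relative-offset matrix for a plain left-to-right sequence can be written down directly:

```python
import numpy as np

n = 5
# rel[i, j] is the directed distance from the column token j to the row token i:
# negative values mean "j is to the left of i", positive values "to the right".
rel = np.arange(n)[None, :] - np.arange(n)[:, None]
print(rel)
```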

XLNet Yang et al. (2019) exploits such an architecture to implement a generalized form of auto-regressive language modeling, called permutation language modeling (PLM). Permutation language models shuffle the factorization order of the joint probability of a sequence and predict each token conditioned on the observed part of the sequence, given the predicted token's position information.

The objective of PLM could be summarized as:

$$\max_\theta \; \mathbb{E}_{z \sim \mathcal{Z}_n} \left[ \sum_{t=1}^{n} \log p_\theta\big(x_{z_t} \mid x_{z_{<t}}, p_{z_t}\big) \right] \qquad (1)$$

Here $z_t$ and $p_{z_t}$ are the actual position and its encoding of the $t$-th element of the permutation $z$. $x_{z_{<t}}$ is the known/observed context up to step $t$. Specifically, $\mathcal{Z}_n$ is the set of all permutations of $\{1, \dots, n\}$ and $x_{z_{<t}} = \{x_{z_1}, \dots, x_{z_{t-1}}\}$. For each predicted token $x_{z_t}$, a dummy token is created in an additional attention stream. The dummy token shares its positional information $p_{z_t}$, but it contains no content information about what exactly $x_{z_t}$ is. The encoding of the dummy token from the second attention stream of the model results in a representation of $g_\theta(x_{z_{<t}}, z_t)$.

Note that the permutation view in XLNet resembles a random insertion order for generating a sequence in an insertion-based model. However, when XLNet computes the relative position encoding, it assumes a global view of the oracle sequence. This implicitly assumes the span length between every two observed tokens is known a priori, violating the assumption of insertion-based generation, where in theory arbitrarily many tokens can be inserted between two generated tokens. This prevents XLNet from acting as an insertion-based generator (Shih et al., 2019).

Computation Sharing for Context Encoding   A naive process of likelihood estimation for a length-$n$ sequence involves $n$ steps of context encoding. In practice, this is inefficient, particularly for modeling long sequences. In the pre-transformer era, some efforts towards exploiting the parallelism of GPUs proved useful (Bradbury et al., 2016; Gehring et al., 2017). A key insight behind such attempts is that the context representations of incoming steps can usually be calculated incrementally from the representations of previous steps, so computation can be shared among different steps of context encoding. Left-to-right decoder transformers enable this by applying a lower-triangular mask that only allows leftward attention. As a result, the representation of each position does not change as the sequence grows, and the computation of the representation for a new token can rely on previously computed prefix representations.
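The following toy single-layer attention sketch (ours, not part of the paper) illustrates this invariance: with a lower-triangular mask, the prefix representations do not change when a new token is appended, so they can be cached and reused.

```python
import torch

n, d = 6, 8
x = torch.randn(n, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))

def causal_encode(h):
    # Single attention layer with a lower-triangular (leftward) mask:
    # position i attends only to positions j <= i.
    scores = (h @ w_q) @ (h @ w_k).t() / d ** 0.5
    mask = torch.tril(torch.ones(len(h), len(h), dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ (h @ w_v)

# Prefix representations are identical with or without the appended token,
# so they can be cached and reused across likelihood-estimation steps.
print(torch.allclose(causal_encode(x)[:-1], causal_encode(x[:-1]), atol=1e-5))  # True
```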

With the volatile positional information in insertion-based models, the left-wards attention no longer enables computation sharing in context encoding. Efficient computation thus becomes a challenge. Our model aims to solve this problem.

3 Efficient Insertion-based Generation

Since we generate sentences token by token, there are three major components to consider: 1) context encoding, which composes the representation of the generated context; 2) position/token prediction, which answers where and what the to-be-inserted token is; and 3) the termination criterion during the decoding process. We discuss these three components of our model in detail.

Figure 4: The relative position information between the context and the updated token.
(a) Permutation View
(b) Natural View
Figure 5: The offset matrix shown in the insertion order (permutation view) and the natural order (natural view). In the permutation view, the elements are arranged in the order in which the insertion of each token happened. In the natural view, the elements are arranged according to where they are located in the original sequence.
Figure 6: The offset compression algorithm. It transforms the insertion order/permutation indices into an offset matrix.

3.1 Context Encoding

We start with a discussion of the context encoding of our model, specifically the encoding of position information. In previous sections, we have discussed the challenges caused by the phenomenon shown in Figure 2, which makes incremental calculation of new context representations, i.e. efficient likelihood estimation, seemingly impossible. We argue that with a relative position encoding mechanism, incremental calculation of the new context representation after each insertion is still feasible.

Consider a case where the partial context (to be completed by further insertions) is "I have pen ." To make this a grammatical sentence, one minimal way is to insert "a" in between "have" and "pen". The directed distance vector for the token "a" is illustrated in Figure 4. This relative position annotation clearly defines where the insertion happens by only describing the spatial relationship between each pair of tokens. If we pack all the relative position vectors together, we get a matrix that reflects the relative spatial relations along the trajectory of insertions, with each row corresponding to an insertion step. We name it the offset matrix. For example, for the sequence "I have a pen ." in the insertion-based generation order "BOS", "EOS", "have", "pen", "I", ".", "a", the complete offset matrix is shown in Figure 5(a).¹ Figure 5(b) shows the offset matrix in an alternative view, arranged in the original sequence order. We can see that the relative position encoding reflects the order of the original sequence, with the masked positions, i.e. "later inserted tokens", correctly skipped.

¹Given the partial generation "I have pen .", representations for the generated tokens will not attend to the token "a"; we can simply mask out these slots in both the token and position attention masks, resulting in a lower-triangular offset matrix.

We disentangle the position information from the token embeddings, as in XLNet Yang et al. (2019). Since the token-only information can be perfectly shared among different steps, constructing such an insertion-based, lower-triangular relative position matrix allows us to adopt the computation-sharing trick of traditional decoder transformers and substantially speed up training.

Offset Compression   We now show that, given the insertion order of a sequence described in the form of absolute-position permutation indices (in the previous example, 0-6-2-4-1-3-5), the offset matrix can be computed efficiently with a process called offset compression. See Figure 6.

Specifically, we first convert the absolute position vector into a matrix by duplication. Then, the upper-triangular elements are masked with "infinity" to remove their impact on the relative position computation, because an inserted token should not attend to future to-be-inserted tokens. In the third step, each element is replaced by its in-row low-to-high rank, i.e. its absolute position when the masked positions are skipped. In the last step, each row is baselined by its diagonal element, reflecting the fact that the model is attending from the last inserted token to the previous ones.
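A minimal NumPy sketch of this procedure (our own illustration; the function name and implementation details are not from the paper) is given below.

```python
import numpy as np

def offset_compression(perm):
    """Turn absolute-position permutation indices (insertion order) into the
    lower-triangular offset matrix described above."""
    perm = np.asarray(perm, dtype=float)
    n = len(perm)
    # Step 1: duplicate the absolute-position vector into a matrix.
    mat = np.tile(perm, (n, 1))
    # Step 2: mask upper-triangular elements (future insertions) with infinity.
    mat[np.triu_indices(n, k=1)] = np.inf
    # Step 3: replace each element by its in-row low-to-high rank,
    # i.e. its position once the masked slots are skipped.
    ranks = np.argsort(np.argsort(mat, axis=1), axis=1)
    # Step 4: baseline each row by its diagonal element, so each row holds the
    # offsets from the newly inserted token to the previously inserted ones.
    offsets = ranks - np.diag(ranks)[:, None]
    # Later-inserted tokens are masked out anyway, so keep the lower triangle.
    return np.tril(offsets)

# Permutation indices quoted in the text above.
print(offset_compression([0, 6, 2, 4, 1, 3, 5]))
```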

Obviously, when the insertion order is left-to-right, the resulting model is equivalent to a traditional auto-regressive language model with relative positional encoding.

3.2 Slot Representation for Token Prediction

With the context encoding, the next step is to aggregate prefix representations for the next token/position prediction. Building upon the ideas of insertion transformer and XLNet, we propose two ways of aggregation, illustrated in Figure 7.

(a) Shallow Aggregation
(b) Deep Aggregation
Figure 7: Comparison between two ways of slot representation aggregation.

Deep aggregation uses the two-stream attention mechanism proposed in XLNet to aggregate the information for token prediction. The advantage is that it fully utilizes the model capacity to compute slot representations. However, the computation of each slot representation requires a separate attention stream. In position prediction, we need to simultaneously obtain the representation of every candidate slot, which requires one additional attention stream per slot, making deep aggregation computationally expensive.

Shallow aggregation, mimicking the behavior of the vanilla insertion transformer, uses a concatenation of the representation vectors from the left-neighboring and right-neighboring positions as the slot representation. Since the aggregation from context embeddings to slot embeddings only includes sparse operations like selection and concatenation, we can efficiently enumerate the slot representations in parallel at each time step, allowing us to compute the position likelihood and perform sequence-level termination control. A minor concern about the naive implementation of shallow aggregation is that, in some corner cases, the slot representation will not be correctly updated by new insertions. The example in Figure 8 demonstrates such a case: suppose we fill SLOT A at step 10, while SLOT B's representation is determined by the representation vectors from steps 4 and 9. The token inserted into SLOT A will then not affect the representation of SLOT B, which is problematic. One remedy is to also concatenate the representation from the latest insertion step into the computation of each slot's representation, as sketched below, to make sure the information is complete.

Figure 8: A possible failure case of the naive implementation of shallow aggregation in InsNet. SLOT B's representation is determined by the step-4 and step-9 representation vectors and cannot be effectively updated by the step-10 insertion.
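A rough PyTorch sketch of shallow aggregation with this remedy (our own simplification; the tensor layout and names are assumptions):

```python
import torch

def shallow_aggregate(h, h_latest):
    """Build slot representations from context encodings.

    h: [n, d] context-aware token representations in natural (left-to-right)
       order; h_latest: [d] representation of the most recent insertion step.
    Returns [n - 1, 3 * d]: one representation per slot between adjacent tokens.
    """
    left, right = h[:-1], h[1:]          # neighbours of each candidate slot
    latest = h_latest.expand_as(left)    # remedy: inject the newest insertion
    return torch.cat([left, right, latest], dim=-1)

# Example: 5 encoded tokens with hidden size 768 give 4 candidate slots.
h = torch.randn(5, 768)
slots = shallow_aggregate(h, h[2])       # pretend token 2 was inserted last
print(slots.shape)                       # torch.Size([4, 2304])
```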
Figure 9: Typical failure of applying parallel decoding to high-entropy generative sequence modeling like creative text generation.

3.3 Inference and Termination

The original focus of the insertion transformer is to accelerate the decoding of machine translation systems to sub-linear complexity. Its design relies heavily on parallel decoding, i.e. simultaneously inserting multiple tokens in each step of the generation process. However, we found that this does not work well on more general, high-conditional-entropy text generation tasks such as story generation. A typical failure of parallel decoding is shown in Figure 9, where it tends to generate extremely repetitive content. Thus, we mostly focus on the "uniform" decoding variant of the insertion transformer, which uniform-randomly predicts the next insertion position and token and performs one insertion operation per step.

For termination, we follow the insertion transformer in using sequence-level control, which relies on the estimated position distribution (with a special termination position) to determine whether the algorithm should stop generating. In addition, for longer and more diverse text generation, we force the model to keep expanding the context until the log-likelihood of the termination position reaches its expected value measured on the development set.
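Putting these pieces together, a simplified decoding loop might look as follows; the model interface (slot_and_token_logits), the sampling strategy, and the exact termination check are our assumptions rather than the paper's implementation.

```python
import torch

def insertion_decode(model, context, max_steps, term_logp_threshold):
    """One-insertion-per-step decoding with sequence-level termination.

    context: a plain list of token ids. Slot 0 is treated as the special
    termination slot; slots 1..n map to positions between existing tokens.
    """
    for _ in range(max_steps):
        # Assumed interface: logits over slots, shape [n_slots + 1], and
        # token logits per actual slot, shape [n_slots, vocab_size].
        slot_logits, token_logits = model.slot_and_token_logits(context)
        slot_logp = torch.log_softmax(slot_logits, dim=-1)
        if slot_logp[0] >= term_logp_threshold:   # confident enough to stop
            break
        slot = int(torch.distributions.Categorical(logits=slot_logits[1:]).sample()) + 1
        token = int(torch.distributions.Categorical(logits=token_logits[slot - 1]).sample())
        context.insert(slot, token)               # place the token in the chosen slot
    return context
```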

4 Experimental setup

We examine InsNet's ability as an insertion-based generative sequence model in multiple aspects, including computational efficiency, its compatibility with the traditional left-to-right formulation, the ability to achieve lexically constrained text generation, and compositional generalizability. We introduce the datasets and experimental setup we use to support our empirical studies.

4.1 Efficient and Controllable Insertion-based Generation

To demonstrate that InsNet can scale up to longer, more diverse text generation, we evaluate the model on both synthetic and real-world datasets to show its efficiency and controllability.

Computational Efficiency Benchmark   Because of architectural parallelism, the practical computational efficiency of a model can deviate substantially from theoretical analysis, depending on the model size and the available computational resources. We therefore create a synthetic benchmark to empirically measure the computational efficiency of our model.

We create a random-sequence dataset with variable lengths to reflect how each model's running time grows with the length of the predicted sequence. We set the vocabulary size to 30,000, mimicking the vocabulary size of most frequently used tokenizers. For each selected length $l$, we sample 25,000 random sequences of length $l$. We train three sequence models (InsNet, IT-vanilla, and an L2R model with the GPT-2-base architecture) to model the random-sequence dataset, record the time cost per epoch for 5 epochs, and take the average.

Figure 10: A concrete example from the ROCStories dataset with storyline annotation.

Long Text Generation Benchmark   We use story generation as the task to showcase a model's ability to generate long texts. The ROCStories corpus Mostafazadeh et al. (2016) contains 98,162 five-line stories with titles. In addition, 1,817 title-less stories are provided for development and test, respectively. The average length of the stories in the corpus is 50, which makes the dataset a good testbed for diverse, medium-length generative sequence modeling. Following the data split in Yao et al. (2019), we further split the 98,162 training stories with titles into train, development, and test splits, approximately in the ratio of 8:1:1.

Lexically Constrained Generation   ROCStories with storyline annotations (Figure 10), first created and used in Yao et al. (2019), is a good testbed for evaluating a model's ability at lexically controlled text generation. Beyond the basic title-story pair, a sequence of keywords, called a "storyline", is extracted from each story. In the lexically controlled generation setting, the model is trained to generate a story from a given storyline, and the generated story must contain all the storyline keywords.

4.2 Compositional Generalization

In addition to the story generation task, we created a simplified version of the compositional generalization (CoGENT) problem on the CleVR dataset (Johnson et al., 2017) to study the impact of the insertion-based, out-of-order formulation on the model's causal preferences and how it affects compositional generalization. CleVR is a dataset (and data generator) containing scenes in which one or more objects are placed on a gray table. The objects have five properties: size, color, shape, material, and location. In the basic setting of CleVR, the possible shapes are cubes, cylinders, and spheres; the possible colors are gray, red, blue, green, brown, purple, cyan, and yellow; and the material of each object can be either plastic or metal. CoGENT is a specialized task that challenges a model's generalization ability when the i.i.d. assumption between training and test data is violated. CoGENT contains two constrained configuration subsets. Under both CoGENT_A and CoGENT_B, there are no limitations on the spheres, so the model can learn that the colors are interchangeable values of the same property. In CoGENT_A, the cubes can only be gray, blue, brown, or yellow, and the cylinders can only be red, green, purple, or cyan. In CoGENT_B, the color limitations for cubes and cylinders are exchanged. Models are trained and developed on CoGENT_A, then tested on CoGENT_B.

Dataset Preparation   To make the scenario closer to the cases we could encounter in language generation tasks, we reshape the dataset into a simple image captioning problem, namely CoGENT-caption. CoGENT-caption includes 2,000 single-object image-caption pairs under the CoGENT_A setting for training, 500 single-object image-caption pairs under the CoGENT_A setting for development, and 500 single-object image-caption pairs under the CoGENT_B setting for compositional generalization testing. Figure 11 provides several examples from the dataset with visual descriptions.

Figure 11: CoGENT-caption dataset.

4.3 Implementation Details

For all language generation tasks, a BPE tokenizer is applied for word-piece-level tokenization. Each of the evaluated transformer models, if not otherwise stated, is implemented as a base-sized transformer with 12 layers, 12 attention heads, and 768 hidden dimensions. The batch size is set to 64. In cases where the model size exceeds the device capacity, gradient accumulation is applied to achieve an equivalent optimization effect. The learning rate is selected from [5e-5, 1e-4, 2e-4], the dropout rate from [0.1, 0.2, 0.333333], and the weight decay rate from [0.02, 0.05]. All models are trained with 400 warm-up iterations and 80,000 iterations of training in total, using a linear-decay learning rate scheduler. For the CoGENT-caption task, image encoding is provided by a ResNet-50 (He et al., 2016) model. All hyper-parameters are determined by grid search with 5 epochs of trial run under each setting.
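For reference, the search space described above can be summarized as a small configuration sketch (the dictionary layout is ours; the values are taken from the description above):

```python
# Hyper-parameter grid used for model selection (values as reported above).
search_space = {
    "learning_rate": [5e-5, 1e-4, 2e-4],
    "dropout": [0.1, 0.2, 0.333333],
    "weight_decay": [0.02, 0.05],
}

fixed_settings = {
    "architecture": "transformer-base (12 layers, 12 heads, 768 dims)",
    "batch_size": 64,            # with gradient accumulation if memory-bound
    "warmup_iterations": 400,
    "total_iterations": 80_000,
    "lr_scheduler": "linear decay",
    "trial_epochs_per_config": 5,
}
```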

5 Results and Analysis

5.1 Empirical Computational Cost Analysis

On a machine with an RTX 3090 GPU and a 12-core, 24-thread CPU, we collect our results under sequence-length settings of [20, 40, 60, 80, 100, 120, 160]. The results are illustrated in Figure 12.

Figure 12: Curves of time cost per epoch of different models on the synthetic dataset as the sequence length increases.

Discussion   As we can observe from the figure, as the length of the text increases, IT-vanilla quickly uses up the parallelization capacity and degenerates to an algorithm with approximately sequential complexity, whereas InsNet and the traditional left-to-right sequence model maintain a near-linear and much more slowly increasing time cost.

5.2 Efficient Sequence Generation

Short Sequence Generation

Model BLEU-1 BLEU-2 Token NLL
InsNet 9.07 2.21 79.42/76.84
IT-vanilla 7.41 1.72 79.56/77.10
L2R 7.68 1.65 74.22/72.85
Table 1: Performance comparison in the conditional storyline modeling experiment. NLL estimates from the insertion-based and left-to-right models may not be directly comparable to each other.

In the last subsection we showed that it may not be practically affordable to obtain a well-trained IT-vanilla model on longer sequences. However, before moving on to using the efficiency-improved InsNet as the representative of insertion-based methods, it is both interesting and important to verify the performance consistency between InsNet and IT-vanilla. In addition to the likelihood measure, for a better comparison, we also collect and compare the decoding results from the two insertion-based models and the traditional left-to-right model. We found that when trained for the full 80,000 iterations the L2R model overfits severely, so the L2R baseline is only trained for 20,000 iterations, with the same linear-decay learning rate scheduler. The results are shown in Table 1.

Model NLL BLEU-1 / BLEU-2 / BLEU-3 / BLEU-4
InsNet-l2r 168.80 / 167.90 28.81 / 10.91 / 5.01 / 2.31
L2R 164.26 / 161.31 28.56 / 11.33 / 5.30 / 2.43
Table 2: Performance comparison in the conditional left-to-right sequence modeling experiment.
Model BLEU-1 BLEU-2 BLEU-3 BLEU-4 Storyline Inc. %
Static(Yao et al., 2019) 28.20 12.80 6.36 3.44 78%
Dynamic(Yao et al., 2019) 28.47 11.49 5.21 2.62 75%
CondLM (Yao et al., 2019) 28.07 11.62 5.11 2.55 -
InsNet-full (w/Generated Storyline Input) 27.85 12.27 5.80 2.97 100%
InsNet-sorted (w/Generated Storyline Input) 27.33 12.09 5.79 2.98 100%
L2R-PNW (w/Generated Storyline Input) 27.47 11.57 5.27 2.53 91.63%
InsNet-full (Golden Storyline Input) 52.86 36.86 26.39 19.35 100%
InsNet-sorted (Golden Storyline Input) 51.75 34.35 22.13 16.71 100%
L2R-PNW (Golden Storyline Input) 51.74 35.74 25.64 18.95 95.40%
Table 3: Performance comparison in the progressive story generation/completion experiment.

Left-to-right Long Sequence Generation   To show our proposed method is a generalized form of the traditional left-to-right sequence model, we verify that it can correctly reproduce a regular left-to-right sequence model.

We perform the experiment on the ROCStories dataset to train a title-to-story conditional language model. For InsNet-l2r, we always feed a regular left-to-right factorization to see whether the model can quickly degenerate to a regular left-to-right, in-order sequence model and achieve reasonable performance. We compare the results in terms of likelihood estimation (NLL) and generation performance (BLEU-1, 2, 3, 4). The results are shown in Table 2. We see that although InsNet-l2r does not show superior performance compared to the left-to-right baseline, its performance is comparable.

5.3 Lexically Constrained Generation

We now demonstrate one of the most appealing properties of insertion-based sequence models over non-insertion-based auto-regressive generators: lexically constrained generation. Since an insertion-based sequence model can expand the context without rewriting the context from the last iteration, the model can strictly follow a given storyline to generate the complete story instead of omitting part of the given guidance. We train a traditional left-to-right language model conditioned on the given storylines as the baseline, and compare its performance with two InsNet variants, InsNet-sorted and InsNet-full. During training, both models are trained to first generate the storyline and then expand it into the full context. Given the storyline, InsNet-sorted is trained to reconstruct the context in left-to-right order, while InsNet-full performs the completion in a completely random order. We collect and report the evaluated models' performance on the title-storyline-story generation pipeline. We also report performance given golden storylines as inputs. The results are shown in Table 3. To verify the generation quality, we also conduct a human evaluation on Amazon Mechanical Turk for 200 randomly sampled generated stories, each evaluated by five annotators with a 1-5 Likert scale rating. The average scores are shown in Table 4.

Model Fidelity Fluency Coherence
InsNet 3.92 3.26 3.43
L2R 3.89 3.26 3.37
Table 4: Human evaluation of controllable story generation results given the gold storyline. Scores are averaged over all data points on a 1-5 Likert scale.

Discussion   Results from our experiments indicate that, in the lexically controlled story generation task, the proposed InsNet achieves at least comparable performance to traditional left-to-right generators in terms of BLEU score and human ratings. As for the prompt incorporation rate, we observe a remarkable gain for transformer-based left-to-right models compared to the LSTM-based models in Yao et al. (2019). However, all left-to-right models fail to guarantee perfect storyline incorporation, while InsNet naturally has a 100% incorporation rate due to its insertion-based nature.

5.4 Compositional Generalization

Another interesting property of out-of-order sequence models is their generalizability over compositional properties. Specifically, we argue that when novel samples are created from observed properties in unobserved combinations, out-of-order sequence models have better compositional generalizability than left-to-right sequence models. We conduct a synthetic experiment on the CoGENT-caption dataset and show the results in Table 5.

Model Color Acc. Shape Acc. Joint Acc.
InsNet 44.00% 37.60% 22.67%
L2R 94.93% 6.93% 1.87%
Table 5: Performance comparison in the compositional generalization experiment.

Discussion   Although fully achieving compositional generalization remains hard, the out-of-order sequence model shows a remarkable gain in joint accuracy on the CoGENT-caption dataset over the baseline. The proposed InsNet shows a more balanced accuracy on the two attributes, color and shape. In contrast, the left-to-right model is biased towards recognizing the color of the object, possibly because the majority (2/3) of the templates describe the color before the shape. One possible explanation for these observations is that, in the stochastic re-ordering of observations in insertion-based sequence models, the model is equally likely to first predict the shape and then the color as the reverse. Thus, the model is forced to enumerate and analyze all possible logical dependencies within the context. The left-to-right model, on the contrary, learns to overly rely on the predicted color to help shape inference, which is erroneous under compositional generalization. We believe this shows that insertion-based sequence models are more robust and have better compositional generalizability.

6 Conclusion and Future Work

We propose InsNet, an insertion-based sequence model with the capacity for efficient likelihood estimation. We empirically show the computational efficiency of such a model over IT-vanilla with a synthetic variable-length experiment. We also show several promising properties of our model, including its compatibility with left-to-right generation order, the ability to generate long and diverse text, the power to achieve perfect lexical control in a structure-to-text generation setting, and also better compositional generalization.

One interesting future direction is to train a large-scale version of InsNet as a universal pre-trained encoder for natural language understanding and lexically constrained natural language generation tasks. Another interesting direction is to investigate how to combine InsNet with parallel decoding on tasks like machine translation, so that we can build a model that is efficient during both training and inference.

References

  • D. Bahdanau, K. H. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015. Cited by: §1.
  • J. Bradbury, S. Merity, C. Xiong, and R. Socher (2016) Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576. Cited by: §2.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1.
  • Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. Le, and R. Salakhutdinov (2019) Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2978–2988. External Links: Link, Document Cited by: §2, §2.
  • H. Gao, L. Wu, P. Hu, and F. Xu (2020) RDF-to-text generation with graph-augmented structural neural encoders. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pp. 3030–3036. Cited by: §1.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 1243–1252. External Links: Link Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.3.
  • J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick (2017) Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910. Cited by: §1, §4.2.
  • J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. Jurafsky (2017a) Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547. Cited by: §1.
  • X. Li, Y. Chen, L. Li, J. Gao, and A. Celikyilmaz (2017b) End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008. Cited by: §1.
  • Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, pp. 726–742. Cited by: §1.
  • N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016) A corpus and cloze evaluation framework for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016), Cited by: §1, §4.1.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §1.
  • Y. Shih, W. Chang, and Y. Yang (2019) XL-editor: post-editing sentences with xlnet. arXiv preprint arXiv:1910.10479. Cited by: §2, §2.
  • M. Stern, W. Chan, J. Kiros, and J. Uszkoreit (2019) Insertion transformer: flexible sequence generation via insertion operations. In International Conference on Machine Learning, pp. 5976–5985. Cited by: §1, §2, §2.
  • R. H. Susanto, S. Chollampatt, and L. Tan (2020) Lexically constrained neural machine translation with levenshtein transformer. arXiv preprint arXiv:2004.12681. Cited by: §1.
  • B. Tan, Z. Yang, M. AI-Shedivat, E. P. Xing, and Z. Hu (2020) Progressive generation of long text. arXiv preprint arXiv:2006.15720. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §1.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §1.
  • S. Welleck, K. Brantley, H. Daumé III, and K. Cho (2019) Non-monotonic sequential text generation. In International Conference on Machine Learning, pp. 6716–6726. Cited by: §1.
  • T. Wen, D. Vandyke, N. Mrksic, M. Gasic, L. M. Rojas-Barahona, P. Su, S. Ultes, and S. Young (2016) A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562. Cited by: §1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 2048–2057. External Links: Link Cited by: §1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §2, §2, §3.1.
  • L. Yao, N. Peng, R. Weischedel, K. Knight, D. Zhao, and R. Yan (2019) Plan-and-write: towards better automatic storytelling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7378–7385. Cited by: §1, §1, §1, §4.1, §4.1, §5.3, Table 3.
  • Y. Zhang, G. Wang, C. Li, Z. Gan, C. Brockett, and B. Dolan (2020) Pointer: constrained text generation via insertion-based generative pre-training. arXiv preprint arXiv:2005.00558. Cited by: §1.

Appendix A Concrete Examples of the Progressive Growing of Text on CoGENT-caption

Condition Context
the
in the
is in the
is in the picture
is a in the picture
is a red in the picture
is a red in the picture .
There is a red in the picture .
There is a red cube in the picture .
have
have a
have a the
have a object the
have a object the a
have a object the shape a
have a object the shape a .
have a object the shape a cylinder .
have a object the shape of a cylinder .
have a object in the shape of a cylinder .
have a yellow object in the shape of a cylinder .
We have a yellow object in the shape of a cylinder .
.
is .
A is .
A sphere is .
A sphere is green .
A sphere is is green .
A sphere is table is green .
A sphere is table it is green .
A sphere is on table it is green .
A sphere is on table and it is green .
A sphere is placed on table and it is green .
A sphere is placed on the table and it is green .
have
have a
have a .
have a cube .
We have a cube .
We have a in cube .
We have a in the cube .
We have a in the a cube .
We have a in the of a cube .
We have a in the shape of a cube .
We have a object in the shape of a cube .
We have a cyan object in the shape of a cube .

Appendix B Concrete Examples of the Progressive Growing of Text on ROCStories

Title Context
the birthday party claire turning/ party/ decided invite class/ party showed gifts/ excited
claire turning./ party/ decided invite class/ party showed gifts/ excited
claire turning./ party/ decided invite class/ party showed gifts/ excited
claire turning eight./ party/ decided invite class/ party showed gifts/ excited
claire turning eight./ party/ decided invite class/ party showed gifts/ was excited
claire turning eight./ party/ decided invite class/ the party showed gifts/ was excited
claire turning eight./ her party/ decided invite class/ the party showed gifts/ was excited
claire turning eight./ her party / decided invite class/ the party showed gifts/ was excited
claire turning eight./ her party / decided invite class/ the party showed up gifts/ was excited
claire turning eight./ her party / decided invite class/ the party showed up gifts/ was excited
claire turning eight./ her party / decided invite class / the party showed up gifts/ was excited
claire turning eight./ her party / decided invite class / the party showed up gifts/ire was excited
claire turning eight./ her party / decided invite class / the party showed up gifts/claire was excited
claire turning eight./ her party / decided invite her class / the party showed up gifts/claire was excited
claire turning eight./ her birthday party / decided invite her class / the party showed up gifts/claire was excited
claire turning eight./ her birthday party / decided invite her class / the party showed up gifts./claire was excited
claire turning eight./ her birthday party / decided invite her class / the party everyone showed up gifts./claire was excited
claire turning eight./ her birthday party / decided invite her class / the party everyone showed up gifts./ claire was excited
claire was turning eight./ her birthday party / decided invite her class / the party everyone showed up gifts./ claire was excited
claire was turning eight./ her birthday party / decided invite her class / the party everyone showed up gifts./ claire was excited
claire was turning eight./ her birthday party / decided invite her class./ the party everyone showed up gifts./ claire was excited
claire was turning eight./ her birthday party / she decided invite her class./ the party everyone showed up gifts./ claire was excited
claire was turning eight./ her birthday party./ she decided invite her class./ the party everyone showed up gifts./ claire was excited
claire was turning eight./ her birthday party./ she decided invite her class./ the party everyone showed up gifts./ claire was excited.
claire was turning eight./ her birthday party./ she decided to invite her class./ the party everyone showed up gifts./ claire was excited.
claire was turning eight./ her birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited.
claire was turning eight./ her a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited.
claire was turning eight./ her had a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited.
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited.
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited to.
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited to see.
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited to see her.
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ the party everyone showed up with gifts./ claire was excited to see her party.
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ at the party everyone showed up with gifts./ claire was excited to see her party.
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ at the party, everyone showed up with gifts./ claire was excited to see her party.
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ at the party, everyone showed up with gifts./ claire was excited to see her party.
claire was turning eight./ her mom had a birthday party./ she decided to invite her class./ at the party, everyone showed up with gifts./ claire was so excited to see her party.
john goes to the store. john store/ tomatoes/ looked / sold / was
john store/ he tomatoes/ looked / sold / was
john to store/ he tomatoes/ looked / sold / was
john to store/ he some tomatoes/ looked / sold / was
john to store/ he some tomatoes/ looked / sold / was able
john to store/ he some tomatoes/ looked / sold / was able to
john to store/ he some tomatoes/ looked / sold / was able to buy
john to store/ he some tomatoes/ looked / sold them / was able to buy
john to store/ he some tomatoes/ looked / sold them / was able to buy some
john to store./ he some tomatoes/ looked / sold them / was able to buy some
john to store./ he some tomatoes/ looked at / sold them / was able to buy some
john to store./ he some tomatoes/ looked at the / sold them / was able to buy some
john to store./ he some tomatoes./ looked at the / sold them / was able to buy some
john to store./ he some tomatoes./ looked at the produce / sold them / was able to buy some
john to store./ he got some tomatoes./ looked at the produce / sold them / was able to buy some
john to store./ he got some tomatoes./ looked at the produce / sold them / was able to buy some
john to store./ he got some tomatoes./ looked at the produce / sold them / john was able to buy some
john to store./ he got some tomatoes./ looked at the produce / he sold them / john was able to buy some
john to store./ he got some tomatoes./ looked at the produce / he sold them / john was able to buy some
john went to store./ he got some tomatoes./ looked at the produce / he sold them / john was able to buy some
john went to store./ he got some tomatoes./ he looked at the produce / he sold them / john was able to buy some
john went to store./ he got some tomatoes./ he looked at the produce / he sold them / john was able to buy some.
john went to store./ he got some tomatoes./ he looked at the produce / he sold them / john was able to buy some.
john went to the store./ he got some tomatoes./ he looked at the produce / he sold them / john was able to buy some.
john went to the store./ he got some tomatoes./ he looked at the produce / he sold them./ john was able to buy some.
john went to the store./ he got some tomatoes./ he looked at the produce./ he sold them./ john was able to buy some.
john went to the store./ he got some tomatoes./ he looked at the produce aisle./ he sold them./ john was able to buy some.
john went to the store./ he got some tomatoes./ he looked at the produce aisle./ he sold them./ john was able to buy some new.
john went to the store./ he got some tomatoes./ he looked at the produce aisle./ he sold them./ john was able to buy some new vegetables.
the minor flying jessica flying/ time/ scared/ held hand/ thanks
jessica flying/ time/ scared/ woman held hand / thanks
jessica flying/ time/ scared/ a woman held hand / thanks
jessica flying/ time/ scared./ a woman held hand / thanks
jessica flying/ time/ scared./ a woman held hand / thanks
jessica flying/ first time/ scared./ a woman held hand / thanks
jessica flying/ was first time/ scared./ a woman held hand / thanks
jessica flying/ was first time/ scared./ a woman held hand /s thanks
jessica flying/ was first time/ scared./ a woman held hand /sica thanks
jessica flying/ was first time/ was scared./ a woman held hand /sica thanks
jessica flying / was first time/ was scared./ a woman held hand /sica thanks
jessica flying / was first time/ was scared./ a woman held hand /jesica thanks
jessica flying / was the first time/ was scared./ a woman held hand /jesica thanks
jessica flying / was the first time/ she was scared./ a woman held hand /jesica thanks
jessica flying / was the first time./ she was scared./ a woman held hand /jesica thanks
jessica flying / was the first time./ she was scared./ a woman held hand /jesica thanks
jessica flying / it was the first time./ she was scared./ a woman held hand /jesica thanks
jessica flying / it was the first time./ she was scared./ a woman held her hand /jesica thanks
jessica flying / it was the first time./ she was scared./ a woman held her hand /jesica was thanks
jessica flying / it was the first time./ she was scared./ a woman held her hand /jesica was thanks
jessica flying / it was the first time./ she was scared./ a woman held her hand /jesica was to thanks
jessica flying / it was the first time./ she was scared./ a woman and held her hand /jesica was to thanks
jessica flying / it was the first time./ she was scared./ a woman and held her hand /jesica was able to thanks
jessica was flying / it was the first time./ she was scared./ a woman and held her hand /jesica was able to thanks
jessica was flying / it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks
jessica was flying / it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks.
jessica was flying./ it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks.
jessica was flying airplane./ it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks.
jessica was flying an airplane./ it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks.
jessica was flying an airplane./ it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks to.
jessica was flying an airplane./ it was the first time./ she was scared./ a woman and held her hand./jesica was able to thanks to her.
jessica was flying an airplane./ it was the first time./ she was scared./ a woman and held her hand./ jesica was able to thanks to her.
jessica was flying an airplane./ it was the first time./ she was scared./ a woman came and held her hand./ jesica was able to thanks to her.
jessica was flying an airplane./ it was the first time./ she was scared./ a woman came and held her hand./ jesica was able to fly thanks to her.

Appendix C Formulations of InsNet Layers and Aggregation Methods

C.1 Layer Formulation

Most formulations of InsNet follow those in XLNet. However, there are still some minor differences. In this section, we aim to give a brief mathematical description of the formulation of an InsNet layer and the aggregation methods.

We mostly follow the ideas of Transformer-XL/XLNet to incorporate the insertion-based relative position offset matrix. Suppose the tokens of the original sequence are inserted in the permutation order $z$, i.e. in the order $x_{z_1}, x_{z_2}, \dots, x_{z_n}$, and we want to use InsNet to predict the insertion-based likelihood of such a process. Each layer is given the sequence of representation vectors $H$ (from the last InsNet layer or the word embedding layer) and the sinusoidal relative position embedding tensor $R$ as its input. Here $R$ is an $n \times n \times d$ 3-d tensor, whose $(i, j)$ slice is the sinusoidal embedding of the $(i, j)$ entry of the offset matrix shown in Figure 5 in the main text. The same as in XLNet/Transformer-XL, when computing the attention, the model needs to handle four groups of feature interactions: Query Content-Key Content, Query Position-Key Content, Query Content-Key Position, and Query Position-Key Position. The formulation of each interaction is shown as follows:

Query Content-Key Content interaction, an $n \times n$ matrix:

$$A^{cc}_{ij} = (W_q h_i)^\top (W_{k,E}\, h_j)$$

Query Content-Key Position interaction, an $n \times n$ matrix:

$$A^{cp}_{ij} = (W_q h_i)^\top (W_{k,R}\, R_{ij})$$

Query Position-Key Content interaction, a length-$n$ vector (it depends only on the key position $j$):

$$A^{pc}_{j} = u^\top (W_{k,E}\, h_j)$$

Query Position-Key Position interaction, an $n \times n$ matrix:

$$A^{pp}_{ij} = v^\top (W_{k,R}\, R_{ij})$$

where the matrices $W_q$, $W_{k,E}$, and $W_{k,R}$ are parameters that perform linear transformations on the embeddings, as in the standard bi-linear multiplicative attention formulation, and the vectors $u$ and $v$ are the invariant parts of the bilinear interaction due to the relative position encoding. The overall attention score is the sum of the four terms:

$$A_{ij} = A^{cc}_{ij} + A^{cp}_{ij} + A^{pc}_{j} + A^{pp}_{ij}$$
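A compact PyTorch sketch of this four-term score for a single head (no scaling or masking; parameter names mirror the equations above but the code itself is ours):

```python
import torch

def relative_attention_scores(h, r, w_q, w_ke, w_kr, u, v):
    """Four-term relative attention score, Transformer-XL style.
    h: [n, d] content representations; r: [n, n, d] sinusoidal embeddings of
    the offset matrix; w_*: [d, d] projections; u, v: [d] global bias vectors."""
    q = h @ w_q                                  # query heads
    k_e = h @ w_ke                               # content key heads
    k_r = r @ w_kr                               # position key heads, [n, n, d]
    cc = q @ k_e.t()                             # content-content
    cp = torch.einsum("id,ijd->ij", q, k_r)      # content-position
    pc = (u @ k_e.t()).unsqueeze(0)              # position-content, broadcast over rows
    pp = torch.einsum("d,ijd->ij", v, k_r)       # position-position
    return cc + cp + pc + pp                     # [n, n] attention scores
```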

In general, for $i = 1, \dots, N$, the formulation of each layer of an $N$-layer InsNet could be written as:

$$\begin{aligned}
Q^{(i)},\, K^{(i)},\, V^{(i)},\, P^{(i)} &= H^{(i-1)} W_q^{(i)},\; H^{(i-1)} W_{k,E}^{(i)},\; H^{(i-1)} W_v^{(i)},\; R\, W_{k,R}^{(i)} \\
A^{(i)} &= A^{(i),cc} + A^{(i),cp} + A^{(i),pc} + A^{(i),pp} \\
\alpha^{(i)} &= \mathrm{softmax}\!\big(A^{(i)}/\sqrt{d} + \text{attention mask}\big) \\
H^{(i)} &= \mathrm{FFN}\big(\alpha^{(i)} V^{(i)}\big)
\end{aligned}$$

Specially, $H^{(0)}$ is the word-embedding sequence. In the first step, the model produces the linearly transformed versions of the input embeddings, namely the query head, key head, value head, and position head. The second step performs the necessary bi-linear interactions. The third step produces the actual attention probabilities, and the last one introduces non-linearity, following the design of XLNet.

C.2 Formulation of Aggregation Methods

After the context encoding process, we get a sequence of representations that are context-aware encodings of each existing token. Now we need the representations for the next insertion position and token prediction. For insertion-based generation, we can only insert tokens in between two existing tokens, i.e. into slots. We need a process that transforms the sequence representations into slot representations. We call such a process aggregation.

Shallow Aggregation  For shallow aggregation at each step $t$, after the presence of the inserted tokens $x_{z_1}, \dots, x_{z_t}$, suppose the output representation from the transformer is $H^{(N)} = [h^{(N)}_1, \dots, h^{(N)}_t]$ (in insertion order), and let $U_t$ be an unshuffling (permutation) matrix such that $\tilde{H} = U_t H^{(N)}$ and $\tilde{h}_j$ is the representation of the token at the $j$-th position of the current partial sequence in its natural left-to-right order. The shallow aggregation with the information-update remedy can be written as:

$$s_j = \big[\,\tilde{h}_j \,;\, \tilde{h}_{j+1} \,;\, h^{(N)}_t\,\big], \qquad j = 1, \dots, t - 1,$$

where $[\,\cdot\,;\,\cdot\,]$ stands for the tensor concatenation operation on the model-width dimension. The unshuffling matrix can be easily obtained when running the offset compression algorithm, since it is just the inverse operation of the algorithm's output.

At each step $t$, there are $t - 1$ possible token slots and a terminating slot. For simplicity, we denote the terminating slot as slot 0 and the rest as slots $1, \dots, t - 1$. The token likelihood prediction is simple, since we only need to index the corresponding slot representation $s_j$ and directly pass it through a log-linear layer to obtain the vocabulary distribution. For position prediction, we have yet to obtain a representation for the termination slot. Here we directly use $h^{(N)}_t$ as a global pooling vector that includes all the information we need for termination prediction. We add two modules, $f_{\text{slot}}$ and $f_{\text{term}}$, to transform the actual slot representations and the dummy termination slot representation into the logits for the slot likelihood, and we take the slot with the highest probability to insert. Mathematically, we conduct the computations as follows:

$$p(\text{slot} = j \mid c_t) = \mathrm{softmax}\big(\big[\, f_{\text{term}}(h^{(N)}_t),\; f_{\text{slot}}(s_1),\; \dots,\; f_{\text{slot}}(s_{t-1}) \,\big]\big)_j$$

Deep Aggregation  Following the two-stream attention idea proposed in XLNet, the formulation of deep aggregation is straightforward, since it is basically equivalent to adding a sequence of mask tokens that share the offset matrix with the original context. These mask tokens are trained to capture the content-free information of the slots that are occupied by actually inserted tokens at each step. After the model encoding process, the context-aware representation of each token is directly used to represent the corresponding real token. In this process, the formulation still follows the one described above, with the content representations of the mask tokens replaced by their position-only (query-stream) counterparts.

Denote the original offset matrix as $M$. When using deep aggregation, the model's inputs would be extended as:

$$X \rightarrow [\,X \,;\, X_{\text{mask}}\,], \qquad M \rightarrow \begin{bmatrix} M \\ M_{\text{slot}} \end{bmatrix},$$

where $M_{\text{slot}}$ stands for the relative positions of the slots, defined in the form of an offset matrix. Each mask token that stands for a slot can only attend to the positions that already exist at that step.

For every aggregated slot, because we can only get one representation vector from one position in the model's outputs, we need to add another mask position to the input. When predicting the position, since we need to collect the representation of every possible slot at each step, for a whole sequence of length $n$ we would need on the order of $n^2$ mask positions in total, making deep aggregation practically too expensive in terms of space complexity (and/or time complexity once the parallelization capacity is used up). Thus, in our experiments, we choose the cheaper method, i.e. shallow aggregation. Empirically, the performance of shallow aggregation is comparable with that of deep aggregation.