Auto-regressive models are a popular choice for generating sequences of any kind, including audio van den Oord et al. (2016b), images van den Oord et al. (2016a), and text Sutskever et al. (2014); Cho et al. (2014). Such models work particularly well for text generation tasks such as summarization Liu et al. (2018), machine translation Sutskever et al. (2014), and dialog response generation Budzianowski et al. (2018) in the encoder-decoder Cho et al. (2014); Sutskever et al. (2014) setting. Here, the input text sequence is consumed by a neural network encoder, and the output text sequence is generated by a decoder left-to-right, one token (word, word-piece, or character) at a time. Such models are typically trained by teacher-forcing Williams and Zipser (1989), where ground-truth history is fed to the model as input, which at test time is replaced by the model's prediction.
Scheduled Sampling Bengio et al. (2015) aims to mitigate the discrepancy between train and test time in teacher-forcing by randomly replacing some tokens in the history with the model's prediction. More concretely, at a given time step in generating the output sequence, the model is conditioned either on the ground-truth token or on the model's prediction from the previous time-step, chosen with some probability. The probability of selecting the model-predicted token is gradually increased as training progresses. This procedure potentially allows the model to recover from its own errors, and Bengio et al. (2015) observe better empirical performance in natural language parsing, image captioning, and speech recognition compared to teacher-forced training.
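The per-step gold/sample choice can be sketched as follows. This is an illustrative sketch, not the paper's implementation; the `sample_token` toy model and all names are hypothetical:

```python
import random

def sequential_scheduled_sampling(gold, sample_token, p, rng):
    """Build conditioning tokens one step at a time: at each position the
    conditioning token is the model's own sample with probability p and
    the gold token otherwise."""
    mixed = []
    for t in range(len(gold)):
        sampled = sample_token(mixed)  # model conditions on the mixed history
        mixed.append(sampled if rng.random() < p else gold[t])
    return mixed

# Toy "model" that always predicts token 9, ignoring its prefix.
gold = [1, 2, 3, 4]
mixed = sequential_scheduled_sampling(gold, lambda prefix: 9, p=0.5,
                                      rng=random.Random(0))
# Every position is either the gold token or the model's sample.
assert all(m in (g, 9) for m, g in zip(mixed, gold))
```

Note that each step's sample depends on the mixed prefix built so far, which is what makes this procedure inherently sequential.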
A key bottleneck in training models with Scheduled Sampling is its inherently sequential nature. Unlike teacher-forcing, tokens must be processed one time-step at a time. The sequential procedure makes Scheduled Sampling impractical for training neural networks, particularly on problems involving long sequence generation. In this work, we describe a simple technique to parallelize Scheduled Sampling. Given an input example, we first generate an entire model prediction sequence in parallel by conditioning on ground-truth history (equivalent to forward-pass of teacher-forcing). Then, we employ a (parallel) mixing step where we generate a new sequence whose token at every time step is either from the model prediction or the ground-truth. Finally, we perform training as in teacher-forcing by conditioning on the sequence obtained from mixing. In Section 2.3 we show that by performing multiple passes of parallel prediction and mixing, we obtain a conditioning sequence that converges to a sample decode from the model.
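A single sample-then-mix pass can be sketched as below, again with a hypothetical toy model standing in for the network's parallel forward pass:

```python
import random

def one_pass_mix(gold, sample_all, p, rng):
    """One Parallel Scheduled Sampling pass: sample every timestep at once
    conditioned on the gold history (equivalent to the teacher-forcing
    forward pass), then mix sampled and gold tokens position-wise."""
    sampled = sample_all(gold)  # all positions predicted in one parallel pass
    return [s if rng.random() < p else g for s, g in zip(sampled, gold)]

# Toy "model" whose prediction at position t is gold[t] + 10.
sample_all = lambda history: [tok + 10 for tok in history]
conditioning = one_pass_mix([1, 2, 3, 4], sample_all, p=0.5,
                            rng=random.Random(0))
assert all(c in (g, g + 10) for c, g in zip(conditioning, [1, 2, 3, 4]))
```

Unlike the sequential variant, no position here waits on any other, so the whole pass maps directly onto a single batched forward pass.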
We evaluate our approach on three text generation problems: machine translation, summarization and dialog response generation. In comparison to teacher-forcing, we find that Parallel Scheduled Sampling leads to equivalent or better empirical performance on both summarization and dialog response generation while achieving comparable performance on machine translation. In addition, we find Parallel and Sequential Scheduled Sampling lead to similar performance with the former requiring as little as 0.3% of the latter’s training time. Finally, we discuss the impact of various hyper-parameters including mixing probability, number of passes, and the mixing probability schedule on the model performance. We will open-source our implementation of Parallel Scheduled Sampling with the next version of this manuscript.
We consider the task of conditional language modeling. The training set is given in terms of input-output sequences $(x, y)$, where $x$ is the input and $y$ is the desired target output. The target $y$ is a variable-length sequence of $T$ tokens, $y = (y_1, \ldots, y_T)$, whereas $x$ may be variable-length (as in translation) or fixed-length (as in image captioning). The goal is to learn a model that accurately predicts $y$ given $x$. We use $y_{1:t}$ to denote the sequence of tokens $(y_1, \ldots, y_t)$.
2.1 Teacher-Forcing and Decoding
Given an input $x$ and a target $y$, the log-probability of the target can be decomposed autoregressively:

$$\log p(y \mid x) = \sum_{t=1}^{T} \log p(y_t \mid y_{1:t-1}, x)$$
Neural language models such as RNNs Mikolov et al. (2010) and Transformer Vaswani et al. (2017) adopt this decomposition and learn to assign high likelihood to token $y_t$ given previous target tokens $y_{1:t-1}$ and input $x$ via a learned likelihood model $p_\theta(y_t \mid y_{1:t-1}, x)$.
Such models are typically trained with teacher-forcing Williams and Zipser (1989). In teacher-forcing, the log likelihood of the training set is directly maximized,

$$\max_\theta \sum_{(x, y)} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{1:t-1}, x) \quad (1)$$
Importantly, teacher-forcing conditions on gold target prefixes $y_{1:t-1}$, enabling backpropagation through all timesteps with a single pass of inference.
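Because every step conditions only on the gold prefix, the per-step loss terms are independent of the model's own outputs and can be evaluated in one pass. A minimal sketch with a hypothetical toy likelihood model:

```python
import math

def teacher_forcing_nll(logprob, y):
    """Teacher-forcing loss: negative log likelihood summed over timesteps,
    each step conditioned on the gold prefix y[:t]. No term depends on a
    model prediction, so all terms can be computed simultaneously."""
    return -sum(logprob(y[:t], y[t]) for t in range(len(y)))

# Toy model: uniform over a vocabulary of 4 tokens, regardless of prefix.
uniform = lambda prefix, token: math.log(1 / 4)
nll = teacher_forcing_nll(uniform, [0, 1, 2])
assert math.isclose(nll, 3 * math.log(4))
```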
At inference time, beam search or sample decoding is often used to generate a candidate target $\hat{y}$. In this regime, target tokens $\hat{y}_t$ are generated one at a time while conditioning on previously-generated tokens $\hat{y}_{1:t-1}$.
A potential failure mode for teacher-forcing-trained models is in conditioning on previously unobserved target prefixes $\hat{y}_{1:t-1}$. As the model has not conditioned on these prefixes at training time, it may generate bland, repetitive, or nonsensical candidate targets Holtzman et al. (2019).
2.2 Scheduled Sampling
Scheduled Sampling Bengio et al. (2015), hereafter Sequential Scheduled Sampling, is a training technique designed to bridge the gap between teacher-forcing and sample decoding. In its simplest form, Sequential Scheduled Sampling generates tokens $\tilde{y}_t$ and conditions on these target prefixes $\tilde{y}_{1:t-1}$ during training. Sequential Scheduled Sampling uses the same objective function as teacher-forcing (Equation 1), except the conditioning tokens $\tilde{y}_{1:t-1}$ are a random mixture of gold tokens $y_t$ and sampled tokens $\hat{y}_t \sim p_\theta(\cdot \mid \tilde{y}_{1:t-1}, x)$ instead of gold tokens $y_{1:t-1}$ alone. See Algorithm 1 for implementation.
As the mixing probability $p \to 0$, we condition on $y_{1:t-1}$ as in teacher-forcing, and as $p \to 1$, we condition on $\hat{y}_{1:t-1}$ as in sample decoding. Typically a schedule will be used to gradually increase $p$ over the course of training. As illustrated in Bengio et al. (2015), Scheduled Sampling leads to a performance improvement in a variety of language generation tasks.
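A schedule of this kind might look like the following linear ramp. The warm-up fraction, cap, and function name are illustrative assumptions, not the paper's exact schedule:

```python
def mixing_probability(step, warmup_steps, total_steps, p_max=0.5):
    """Linearly ramp the sampling probability from 0 (pure teacher-forcing)
    during warm-up up to p_max by the end of training."""
    if step < warmup_steps:
        return 0.0
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min(p_max, frac * p_max)

assert mixing_probability(0, 100, 200) == 0.0     # warm-up: teacher-forcing
assert mixing_probability(150, 100, 200) == 0.25  # halfway through the ramp
assert mixing_probability(200, 100, 200) == 0.5   # fully ramped
```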
In spite of its benefits, Sequential Scheduled Sampling is inherently a sequential algorithm: choosing conditioning token $\tilde{y}_t$ requires conditioning autoregressively on tokens $\tilde{y}_{1:t-1}$. While this is natural for sequential architectures such as RNNs and LSTMs, it is poorly suited to self-attending feed-forward models such as Transformer, where inference for multiple timesteps can be carried out simultaneously.
2.3 Parallel Scheduled Sampling
We propose a natural extension to Sequential Scheduled Sampling called Parallel Scheduled Sampling. Whereas Sequential Scheduled Sampling selects conditioning tokens one after another, we propose generating conditioning tokens for all timesteps in parallel over the course of one or more passes. While this technique requires strictly more operations than Sequential Scheduled Sampling, it is better suited to hardware accelerators such as GPUs and TPUs Jouppi et al. (2017). Moreover, we find in our experiments that only a modest number of passes is necessary for improving model performance.
Parallel Scheduled Sampling generates conditioning tokens for all timesteps simultaneously. The procedure consists of multiple passes, each pass consisting of parallel sampling and mixing steps (Figure 1). In the first pass, the algorithm conditions on gold tokens $y_{1:t-1}$, generating tokens $\hat{y}^1_t$ i.i.d. according to $\hat{y}^1_t \sim p_\theta(\cdot \mid y_{1:t-1}, x)$ for all timesteps $t$. Sampling tokens in the first pass is equivalent to the forward-pass of teacher-forcing. The sampled tokens $\hat{y}^1$ are mixed (in parallel) with gold tokens $y$ to produce conditioning tokens $\tilde{y}^1$ for the next pass.
We now describe the multiple-pass procedure. Let $\hat{y}^k$ and $\tilde{y}^k$ denote sampled and mixed tokens respectively on pass $k$. The mixed tokens from pass $k$, $\tilde{y}^k$, are used for conditioning on pass $k+1$ in place of gold tokens $y$. Finally, the loss is calculated as before, conditioning on the final mixture of gold and sampled tokens $\tilde{y}^K$. See Algorithm 2 for implementation.
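The multi-pass loop can be sketched as follows. This is a hypothetical sketch of the procedure's structure with a toy parallel sampler, not the paper's implementation:

```python
import random

def parallel_scheduled_sampling(gold, sample_all, p, num_passes, rng):
    """Repeat the parallel sample-and-mix step: each pass samples all
    timesteps at once conditioned on the previous pass's mixed sequence,
    then mixes the new samples with the gold tokens position-wise. The
    final mixed sequence is what the training loss conditions on."""
    mixed = list(gold)  # pass 1 conditions on gold, as in teacher-forcing
    for _ in range(num_passes):
        sampled = sample_all(mixed)                  # parallel sampling step
        mixed = [s if rng.random() < p else g
                 for s, g in zip(sampled, gold)]     # parallel mixing step
    return mixed

sample_all = lambda history: [tok + 10 for tok in history]
gold = [1, 2, 3]
# With p = 0 every pass keeps the gold tokens, recovering teacher-forcing.
assert parallel_scheduled_sampling(gold, sample_all, 0.0, 2,
                                   random.Random(0)) == gold
```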
Finally, we prove that by running the sampling and mixing steps for multiple passes as described in Algorithm 2, the final sample from Parallel Scheduled Sampling converges to a random sample decode from the model when the mixing probability $p = 1$ and the number of passes is at least the sequence length.
Consider a sequence of tokens $y_{1:t}$ of length $t$. Let the mixing probability $p = 1$ and the pass $k \geq t$ be fixed. Then the likelihood of $y_{1:t}$ under Parallel Scheduled Sampling's proposal distribution over conditioning tokens on pass $k$, $q^k(y_{1:t})$, is identical to random sample decoding's, $P(y_{1:t})$,

$$q^k(y_{1:t}) = P(y_{1:t}) \quad \text{for all } k \geq t.$$

(We drop conditioning on $x$ in the following for conciseness.)
We begin by establishing notation. Let $P(y_{1:t})$ be the likelihood of a sequence $y_{1:t}$ according to random sample decoding. Let $q^k(y_{1:t})$ be the likelihood of the same sequence according to Parallel Scheduled Sampling's proposal distribution on pass $k$.
The proof proceeds by induction. First we show that the proposal distribution for the first token matches random sampling's on the first pass, $q^1(y_1) = P(y_1)$. Then we show that if $q^k(y_{1:t}) = P(y_{1:t})$ holds for some pass $k \geq t$, it also holds for all later passes $k' \geq k$. Finally, we show that if the previous statement holds, it also holds for the first $t+1$ tokens on pass $k+1$. Thus, it follows that the proposal distribution matches random sampling's for the first $t$ tokens so long as $k \geq t$.
Base Case: Consider the proposal distribution for the first token on the first pass, $q^1(y_1)$. As $p = 1$, the first token is sampled from $p_\theta(y_1)$ by construction. Thus,

$$q^1(y_1) = p_\theta(y_1) = P(y_1).$$
Induction over $k$: Suppose that the proposal distribution matches random sample decoding's for tokens $y_{1:t}$ on some pass $k \geq t$. Then the equality also holds for the proposal distribution on pass $k+1$. This follows trivially, as tokens $y_{1:t}$ are “copied” from pass $k$ to pass $k+1$ and thus their likelihood is unchanged,

$$q^{k+1}(y_{1:t}) = q^k(y_{1:t}) = P(y_{1:t}).$$
Induction over $t$: Suppose that the proposal distribution matches random sample decoding's for the first $t$ tokens on pass $k$ for $k \geq t$; that is, $q^k(y_{1:t}) = P(y_{1:t})$. We show that the statement holds on pass $k+1$ for tokens $y_{1:t+1}$. First, recall that by construction the proposal distribution for token $y_{t+1}$ given previous tokens $y_{1:t}$ is the same as random sampling's when $k+1 > t$,

$$q^{k+1}(y_{t+1} \mid y_{1:t}) = p_\theta(y_{t+1} \mid y_{1:t}) = P(y_{t+1} \mid y_{1:t}).$$

Note that this only holds when $k+1 > t$. Then,

$$q^{k+1}(y_{1:t+1}) = q^{k+1}(y_{t+1} \mid y_{1:t}) \, q^{k+1}(y_{1:t}) = P(y_{t+1} \mid y_{1:t}) \, P(y_{1:t}) = P(y_{1:t+1}),$$

where we use the chain rule, induction over $k$ together with the inductive assumption for $q^{k+1}(y_{1:t})$, and the definition of $q^{k+1}(y_{t+1} \mid y_{1:t})$ when $k+1 > t$.
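The convergence claim can be checked numerically on a toy model. Here a deterministic next-token function stands in for random sampling so the comparison is exact; all names are illustrative:

```python
def next_token(prefix):
    """Deterministic toy model: the next token depends on the whole prefix."""
    return (7 * sum(prefix) + len(prefix)) % 11

def sequential_decode(length):
    """Standard one-token-at-a-time decoding."""
    out = []
    for _ in range(length):
        out.append(next_token(out))
    return out

def parallel_decode(length, num_passes):
    """Parallel Scheduled Sampling with mixing probability 1: every pass
    re-predicts all positions from the previous pass's sequence."""
    seq = [0] * length
    for _ in range(num_passes):
        seq = [next_token(seq[:t]) for t in range(length)]
    return seq

# After each pass, at least one more prefix position agrees with sequential
# decoding, so `length` passes suffice, matching the k >= t condition.
T = 8
assert parallel_decode(T, num_passes=T) == sequential_decode(T)
```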
3 Related Work
Professor forcing Lamb et al. (2016) has a similar motivation to Scheduled Sampling: a discriminator network is trained jointly with the generator to distinguish between the generator's hidden states produced by conditioning on ground-truth and those produced by conditioning on model predictions. The generator, apart from maximizing the likelihood of the data, is also trained to fool the discriminator Goodfellow et al. (2014). With this new objective, the dynamics of the generator should be the same whether conditioning on ground-truth or on model predictions. Our parallel sampling contribution is orthogonal to professor forcing and can potentially be applied in their framework. Collins and Roark (2004) use beam search, a sequential search procedure, during both training and testing time, and update the weights of the model using a variant of the Perceptron algorithm Rosenblatt (1958). Methods with the similar motivation of mitigating the discrepancy between train-time and test-time behavior have also been studied in the sequential decision making and reinforcement learning settings Daumé et al. (2009); Ross et al. (2011).
While our proposed technique is generally applicable to sequence generation tasks, following previous work Bengio et al. (2015) we focus on text generation tasks in our experiments. Concretely, we evaluate Parallel Scheduled Sampling on text summarization Liu et al. (2018), task-oriented dialog response generation Budzianowski et al. (2018), and machine translation Sutskever et al. (2014); Vaswani et al. (2017), and compare it to teacher-forced training. We compare our method to Sequential Scheduled Sampling only on the dialog task Budzianowski et al. (2018), as we find its runtime infeasible on larger tasks. We also conduct ablation studies on the dialog task. We use the Tensor2Tensor framework for all experiments Vaswani et al. (2018).
|Model Size|Max Length|Training Method|Decoding Method|ROUGE-2|ROUGE-L|
|Base|500|Parallel SS|Beam Search|25.19|34.76|
|Large|1500|Parallel SS|Beam Search|30.35|39.09|
Table 1: Performance on the summarization task using base and large Transformer models trained with teacher-forcing and Parallel Scheduled Sampling. We consider both beam search and greedy decoding. We adopt the widely-used ROUGE score as the evaluation metric (higher is better).
Liu et al. (2018) propose a multi-document summarization task, where the task is to generate the text of a Wikipedia article given its references and other related documents. The dataset has close to 1.9 million training examples, and 230,000 test examples. We use a Transformer seq2seq model for this task in two hyper-parameter settings: a base model with 60 million parameters and a large model with 210 million parameters. For the base model, we restrict the maximum length of input and output to be 500, while for the large model the maximum length is set to 1500.
Table 1 shows the results of training base and large Transformer models for the summarization task. The base and large models were trained for 250k steps and 500k steps respectively. We use teacher-forcing for the first 50% of training steps in Parallel Scheduled Sampling as warm-up steps. The mixing probability is set to 50% and we perform a single pass of sampling and mixing (Algorithm 2). With the base model, Parallel Scheduled Sampling obtains better performance than teacher-forcing with both beam search and greedy decoding while it performs better only with greedy decoding when the large model is used. Figure 2 shows a plot of held-out development set log perplexity as a function of training step for teacher-forcing and Parallel Scheduled Sampling-trained models.
|Teacher Forcing||Parallel Scheduled Sampling|
|Andrea Smith is an American academic, author, and activist. She is the founder of Indigenous Studies, an interdisciplinary interdisciplinary interdisciplinary research program that focuses on Indigenous feminism, Indigenous activism, and Indigenous feminism. Smith is the author of Conquest: The Making of Indigenous Women, a book on the history of Indigenous women, and Conquest: The Making of Indigenous Women in the 21st Century. She is also the founder of the Indigenous Studies Institute, a non-profit organization that promotes Indigenous feminism and Indigenous feminism. Smith is also the founder of the Indigenous Studies Institute, a non-profit organization that promotes Indigenous feminism and Indigenous feminism. <repeat>||Andrea Smith (born 1960) is an Australian academic and academic. She is a Professor of the Department of English at the University of Sydney, and a Fellow of the Australian Academy of Arts and Sciences.|
|Sylvia Michel (born 28 March 1941) is a Swiss Reformed minister. She was the first female church council president in Switzerland. She was elected head of the Swiss Protestant Church in Argovia in April 1980. She was the first female vice president of the church in Switzerland. Michel was born in Switzerland and studied at the seminary in Cameroon. She has been teaching at the theological college of the Bible Society since 1975. She has been a pastor in international ecumenical organizations. She is cited for her time as vice president of the Church of Jesus Christ of Madagascar to highlight her career. Michel was the first female church council president in Switzerland. She was elected head of the Swiss Protestant Church in Argovia in April 1980. In recognition of Sylvia Michel the Prize is given to women to encourage and promote women to assume leadership positions in their churches. Michel was Europe’s first female church council president when she was elected head of the Church of Canton of Aargau in 1980. In recognition of Sylvia Michel the pioneer church leader in celebration of skill so many women in the world support, sometimes sustain and lead their churches, the Sylvia Michel Prize is given to women who encourage and promote women to assume leadership positions in their churches. <repeat>||Sylvia Michel (born 26) is a Swiss Reformed church leader. She was the first female church council president in Switzerland. Michel was elected head of the Swiss Protestant Church in Argovia in April 1980. She is a member of the World Communion of Reformed Churches. Michel is a professor of theology at the University of Cameroon. Michel is married to the church leader Yvette Rabemila.|
Figure 3 shows sample decodes comparing models trained with teacher-forcing and Parallel Scheduled Sampling. We find that Parallel Scheduled Sampling may help mitigate the repetition and degeneration often found in text generated with greedy or beam search decoding Holtzman et al. (2019).
4.2 Dialog Response Generation
Table 2: Results from models trained with teacher-forcing, Sequential Scheduled Sampling, and Parallel Scheduled Sampling. We report mean BLEU and maximum BLEU over 5 random restarts for each configuration except Sequential Scheduled Sampling, for which we report a single run. We provide results by varying different hyperparameters for both variants of Scheduled Sampling. We also provide training steps per second for the different training algorithms.
We evaluate our method on the text response generation task using MultiWOZ Budzianowski et al. (2018), a task-oriented dialog dataset. Here, we consider the problem of mapping a conversation history consisting of alternating user and assistant turns to a single turn of assistant response. We use a Transformer model containing approximately one million parameters for this study, as the dataset is much smaller (approximately 100k training examples) than those in our other experiments. We truncate the length of the input and output to 512, and train all the models for 50k steps. As both model and dataset are small, we are able to empirically compare our method to Sequential Scheduled Sampling (such experiments are infeasible on larger models). Table 2 summarizes results for all experiments on the MultiWOZ dataset, and Figure 4 plots held-out development set perplexity vs. training step for models trained with teacher-forcing, Sequential Scheduled Sampling, and Parallel Scheduled Sampling.
Both Sequential Scheduled Sampling and Parallel Scheduled Sampling (with just one pass) perform about the same, both achieving better results than teacher-forced trained models. However, as can be seen in Table 2, Parallel Scheduled Sampling and teacher-forcing are both two orders of magnitude faster to train than Sequential Scheduled Sampling. A single pass of Parallel Scheduled Sampling is approximately 25% slower than teacher-forced training while producing the same benefits as Sequential Scheduled Sampling. Table 2 also shows the impact of mixing probability, number of passes, warm-up steps, and the mixing probability schedule Bengio et al. (2015) on model performance. Overall, we find a single pass with 50% gold/sampled mixing probability sufficient for improving performance.
4.3 Machine Translation
We evaluate our method on the WMT 2014 English-German task, which consists of approximately 4.5 million training sentences. We experiment with the large Transformer model that contains approximately 210 million parameters. We did not see performance improvements by using Parallel Scheduled Sampling. The model trained with teacher-forcing for 500k steps gets 28.74 BLEU. The same model trained with 250k warm-up steps using teacher-forcing and the next 250k steps trained with Parallel Scheduled Sampling, with mixing probability set to 50% and a single pass of sampling and mixing (Algorithm 2), obtains 28.57 BLEU. Hyper-parameter tuning of warm-up steps and mixing probability did not improve performance. We hypothesize the lack of performance improvement may be due to the fact that the summarization and dialog response generation tasks have much longer output sequences than machine translation, though further investigation is required.
We introduce a simple technique to parallelize Scheduled Sampling that allows Scheduled Sampling to be applied to training models with hundreds of millions of parameters on large datasets. The technique potentially mitigates the discrepancy between train and test time in autoregressive sequence generation models. We find that in most cases our technique leads to better empirical performance on summarization and dialog generation tasks compared to teacher-forced training. Our empirical results indicate that Parallel Scheduled Sampling can potentially improve the performance of autoregressive sequence generation models, particularly on tasks containing long sequences.
- Bengio et al.  Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. NeurIPS, 2015.
- Budzianowski et al.  Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. Multiwoz - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. EMNLP, 2018.
- Cho et al.  Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP, 2014.
- Collins and Roark  Michael Collins and Brian Roark. Incremental parsing with the perceptron algorithm. ACL, 2004.
- Daumé et al.  Hal Daumé, Iii, John Langford, and Daniel Marcu. Search-based structured prediction. Machine Learning, 2009.
- Goodfellow et al.  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NeurIPS, 2014.
- Holtzman et al.  Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv, 2019.
- Jouppi et al.  Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
- Lamb et al.  Alex M Lamb, Anirudh Goyal ALIAS PARTH GOYAL, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. NeurIPS, 2016.
- Liu et al.  Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating wikipedia by summarizing long sequences. ICLR, 2018.
- Mikolov et al.  Tomas Mikolov, Martin Karafiát, Lukás Burget, Jan Honza Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. INTERSPEECH, 2010.
- Rosenblatt  F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 1958.
- Ross et al.  Stephane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. AISTATS, 2011.
- Sutskever et al.  Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. NeurIPS, 2014.
- van den Oord et al. [2016a] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. ICML, 2016a.
- van den Oord et al. [2016b] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alexander Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. Arxiv, 2016b.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
- Vaswani et al.  Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. Tensor2tensor for neural machine translation. CoRR, 2018.
- Williams and Zipser  Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1989.