Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement

02/19/2018 ∙ by Jason Lee, et al.

We propose a conditional non-autoregressive neural sequence model based on iterative refinement. The proposed model is designed based on the principles of latent variable models and denoising autoencoders, and is generally applicable to any sequence generation task. We extensively evaluate the proposed model on machine translation (En-De and En-Ro) and image caption generation, and observe that it significantly speeds up decoding while maintaining the generation quality comparable to the autoregressive counterpart.




1 Introduction

Conditional neural sequence modeling has become a de facto standard in a variety of tasks (see, e.g., Cho et al., 2015, and references therein). Much of this recent success is built on top of successful autoregressive sequence modeling in which the probability of a target sequence is factorized as a product of conditional probabilities of next symbols given all the preceding ones. Despite its success, neural autoregressive modeling, which is often non-Markovian and nonlinear, has a weakness in decoding, i.e., finding the most likely sequence. Because exact decoding is intractable, we must resort to suboptimal approximate decoding, and due to its sequential nature, decoding cannot be easily parallelized and results in a large latency (see, e.g., Cho, 2016). This has motivated the recent investigation into non-autoregressive neural sequence modeling by Gu et al. (2017) in the context of machine translation and Oord et al. (2017) in the context of speech synthesis.

In this paper, we propose a non-autoregressive neural sequence model based on iterative refinement, which is generally applicable to any sequence generation task beyond machine translation. The proposed model can be viewed as both a latent variable model and a conditional denoising autoencoder. We thus propose a learning algorithm that is hybrid of (deterministic) lower-bound maximization and reconstruction error minimization. We further design an iterative inference strategy with an adaptive number of steps to minimize the generation latency without sacrificing the generation quality.

We extensively evaluate the proposed conditional non-autoregressive sequence model and compare it against the autoregressive counterpart, using the state-of-the-art Transformer (Vaswani et al., 2017), on machine translation and image caption generation. In the case of machine translation, the proposed deterministic non-autoregressive models decode faster than beam search from the autoregressive counterparts on both GPU and CPU, while maintaining 90% (IWSLT’16 En-De and WMT’16 En-Ro) and 80% (WMT’15 En-De) of translation quality. On image caption generation, we likewise observe faster decoding on both GPU and CPU, while maintaining 85% of caption quality.

We release our implementation, preprocessed datasets and pretrained models online at

2 Non-Autoregressive Sequence Modeling

Sequence modeling in deep learning has largely focused on autoregressive modeling of a sequence. That is, given a sequence $Y = (y_1, \ldots, y_T)$, we use some form of a neural network to parametrize the conditional distribution over each variable $y_t$ given all the preceding variables $y_{<t}$, i.e.,

$$\log p(y_t \mid y_{<t}) = f_\theta(y_{<t}),$$

where $f_\theta$ is for instance a recurrent neural network. This approach has become a de facto standard in language modeling (Mikolov et al., 2010). When this is augmented with an extra conditioning variable $X$, it becomes conditional sequence modeling, which serves as a basis on which many recent advances in, for instance, machine translation (Bahdanau et al., 2014; Sutskever et al., 2014; Kalchbrenner & Blunsom, 2013) and speech recognition (Chorowski et al., 2015; Chiu et al., 2017) have been made.

Despite the recent success, autoregressive sequence modeling has a weakness due to its nature of sequential processing. This weakness shows itself especially when we try to decode the most likely sequence from a trained model, i.e., $\hat{Y} = \arg\max_{Y} \log p_\theta(Y \mid X)$. There is no known polynomial-time algorithm for solving it exactly, and practitioners have relied on approximate decoding algorithms, such as greedy decoding, beam search, noisy parallel approximate decoding and continuous relaxation (see, e.g., Cho, 2016; Hoang et al., 2017). Among these, beam search has become the method of choice, due to its superior performance over greedy decoding, which however comes with a substantial computational overhead (Cho, 2016).

As a solution to this issue of slow decoding, two recent works have attempted non-autoregressive sequence modeling. Gu et al. (2017) have modified the recently proposed Transformer (Vaswani et al., 2017) for non-autoregressive machine translation, and Oord et al. (2017) have modified a convolutional network (Oord et al., 2016) for non-autoregressive modeling of waveform. Non-autoregressive modeling aims at factorizing the distribution over a target sequence given a source into a product of conditionally independent per-step distributions:

$$p(Y \mid X) = \prod_{t=1}^{T} p_\theta(y_t \mid X), \qquad (1)$$

breaking the dependency among the target variables across time. This break of dependency allows us to trivially find the most likely target sequence by taking $\hat{y}_t = \arg\max_{y_t} p_\theta(y_t \mid X)$ for each $t$. We effectively bypass the computational overhead and sub-optimality of decoding from an autoregressive sequence model.
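As a minimal sketch of why decoding becomes trivial under this factorization, the snippet below picks the most likely token at each position independently and checks by brute force that the result is also the jointly most likely sequence. The probability tables are made-up toy numbers and the function names are illustrative; they are not taken from the authors' implementation.

```python
import itertools
import math
from typing import List, Sequence


def decode_factorized(step_probs: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most likely token id at every position independently."""
    return [max(range(len(p)), key=lambda k: p[k]) for p in step_probs]


def sequence_log_prob(step_probs: Sequence[Sequence[float]],
                      tokens: Sequence[int]) -> float:
    """Log-probability of a full sequence under the factorized model."""
    return sum(math.log(p[tok]) for p, tok in zip(step_probs, tokens))


def brute_force_decode(step_probs: Sequence[Sequence[float]]) -> List[int]:
    """Enumerate every sequence and keep the best one (exponential cost)."""
    candidates = itertools.product(*[range(len(p)) for p in step_probs])
    best = max(candidates, key=lambda seq: sequence_log_prob(step_probs, seq))
    return list(best)


# Toy distributions p(y_t | X) for T = 3 positions over a 3-token vocabulary.
toy_probs = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
]
greedy = decode_factorized(toy_probs)
exact = brute_force_decode(toy_probs)
```

Because the joint probability is a product of independent per-position terms, maximizing each term separately maximizes the product, which is why `greedy` and `exact` agree here.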

Figure 1: Non-autoregressive sequence modeling eliminates the decoding gap, which may compensate for a potentially larger modeling gap.

Modeling Gap vs. Decoding Gap

This desirable property of non-autoregressive neural sequence modeling however also comes with a potential performance degradation (Kaiser & Bengio, 2016). This is because any potential dependency among target variables must be captured by a neural network that models the factorized conditionals in Eq. (1). That is, the potential modeling gap, which is the gap between the underlying, true model and the neural sequence model, could be larger with the non-autoregressive model than with the autoregressive one. On the other hand, this issue may be canceled out, because the decoding gap, which may be arbitrarily large with the autoregressive model and an approximate decoding algorithm, does not exist with the non-autoregressive model (that is, decoding is exact given a model). This has indeed been shown to be the case recently by Oord et al. (2017) and Gu et al. (2017). See Fig. 1 for a graphical illustration of this view.

3 Iterative Refinement for Deterministic Non-Autoregressive Sequence Modeling

3.1 Latent variable model

We propose a non-autoregressive neural sequence model. Similarly to two recent works (Oord et al., 2017; Gu et al., 2017), we introduce latent variables to implicitly capture the dependencies among target variables. We however remove any stochastic behavior by interpreting this latent variable model, introduced immediately below, as a process of iterative refinement.

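Our goal is to capture the dependencies among target symbols given a source sentence without auto-regression. We address this by introducing $L$-many intermediate random variables $Y^0, \ldots, Y^{L-1}$ and marginalizing them out:

$$p(Y \mid X) = \sum_{Y^0, \ldots, Y^{L-1}} \left( \prod_{t=1}^{T} p(y_t \mid Y^{L-1}, X) \right) \left( \prod_{t=1}^{T} p(y_t^{L-1} \mid Y^{L-2}, X) \right) \cdots \left( \prod_{t=1}^{T} p(y_t^0 \mid X) \right). \qquad (2)$$

Each product term inside the summation is modelled by a shared deep neural network that takes as input a source sentence $X$ and outputs the conditional distribution over the target vocabulary for each position $t$.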

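The marginalization in Eq. (2) is intractable. In order to avoid this issue, we consider its deterministic lower bound. To make it clearer, let us consider the case of a single intermediate variable $Y^0$. Then,

$$\log p(Y \mid X) = \log \sum_{Y^0} p(Y \mid Y^0, X)\, p(Y^0 \mid X) \geq \sum_{t=1}^{T} \log p(y_t \mid \hat{Y}^0, X) + \sum_{t=1}^{T} \log p(\hat{y}_t^0 \mid X),$$

where $\hat{y}_t^0 = \arg\max_{y} p(y_t^0 = y \mid X)$. The inequality holds because every term in the sum is non-negative, so the sum is at least as large as the single term evaluated at $\hat{Y}^0$.

Then, the entire lower bound can be written as:

$$\log p(Y \mid X) \geq \sum_{t=1}^{T} \log p(y_t \mid \hat{Y}^{L-1}, X) + \sum_{t=1}^{T} \log p(\hat{y}_t^{L-1} \mid \hat{Y}^{L-2}, X) + \cdots + \sum_{t=1}^{T} \log p(\hat{y}_t^0 \mid X), \qquad (3)$$

where $\hat{y}_t^l = \arg\max_{y} p(y_t^l = y \mid \hat{Y}^{l-1}, X)$.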
In this formulation, the intermediate random variables could be anonymous. We however constrain them to be of the same type as the output $Y$ in order to share the underlying parametrized neural network. This constraint allows us to view each conditional as a single step of refining a rough target sequence $Y^{l-1}$. The entire chain of conditionals is then an $(L+1)$-step iterative refinement. Furthermore, sharing the parameters across these refinement steps enables us to dynamically adapt the number of iterations per input $X$. This is important as it substantially reduces the amount of time required for decoding, as we see later in the experiments.
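The deterministic lower bound can be checked numerically on a toy case. The sketch below assumes a single intermediate variable with only two possible values and made-up probabilities; it compares the exact marginal with the bound obtained by keeping only the most likely intermediate sequence.

```python
import math

# Toy model with two candidate intermediate sequences, "a" and "b".
p_intermediate = {"a": 0.6, "b": 0.4}   # p(Y^0 | X), made-up numbers
p_target_given = {"a": 0.5, "b": 0.9}   # p(Y | Y^0, X) for one fixed target Y

# Exact marginal: sum over every intermediate sequence.
log_marginal = math.log(
    sum(p_intermediate[z] * p_target_given[z] for z in p_intermediate)
)

# Deterministic bound: keep only the most likely intermediate sequence.
z_hat = max(p_intermediate, key=p_intermediate.get)
log_lower_bound = math.log(p_target_given[z_hat]) + math.log(p_intermediate[z_hat])
```

Here the marginal is 0.66 while the single retained path contributes 0.30, so the bound holds but is not tight; the gap depends on how much probability mass the discarded paths carry.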


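For each training pair $(X, Y^*)$, we first compute the lower bound in Eq. (3) from the first conditional $p(Y^0 \mid X)$ to the final one $p(Y \mid \hat{Y}^{L-1}, X)$. The computation of each conditional is followed by

$$\hat{y}_t^l = \arg\max_{y} p_\theta(y_t^l = y \mid \hat{Y}^{l-1}, X).$$

We then maximize

$$J_{\text{LVM}}(\theta) = \sum_{l=0}^{L} \sum_{t=1}^{T} \log p_\theta(y_t^* \mid \hat{Y}^{l-1}, X), \qquad (4)$$

where $\theta$ is a set of parameters and $\hat{Y}^{-1}$ is an empty sequence.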

3.2 Denoising Autoencoder

The proposed approach could instead be viewed as learning a conditional denoising autoencoder, which is known to capture the gradient of the log-density (under certain assumptions). That is, we implicitly learn to find a direction in the output space that increases the underlying true, data-generating distribution $\log p(Y \mid X)$. Because the output space is discrete, much of the theoretical analysis by Alain & Bengio (2014) is not strictly applicable. We however find this view attractive as it serves as an alternative foundation for designing a learning algorithm.

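We start with a corruption process $C(Y \mid Y^*)$, which introduces noise to the correct output $Y^*$. Given the source $X$ and the reference translation $Y^*$, we sample $\tilde{Y} \sim C(Y \mid Y^*)$. This corrupted target then acts as an input to each conditional in Eq. (2). Then, the goal of learning is to maximize the log-probability of the original reference $Y^*$ given the corrupted version. That is, to maximize

$$J_{\text{DAE}}(\theta) = \sum_{t=1}^{T} \log p_\theta(y_t^* \mid \tilde{Y}, X). \qquad (5)$$

Once this objective is maximized, we can recursively perform maximum-a-posteriori inference, i.e.,

$$\hat{Y}^{l} = \arg\max_{Y} \log p_\theta(Y \mid \hat{Y}^{l-1}, X),$$

to find $\hat{Y}$ that (approximately) maximizes $\log p(Y \mid X)$.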

Corruption Process

There is little consensus on the best corruption process for a sequence, especially of discrete tokens. In this work, we use a corruption process proposed by Hill et al. (2016), which has recently become more widely adopted (see, e.g., Artetxe et al., 2017; Lample et al., 2017). Each $y_t^*$ in a reference target $Y^* = (y_1^*, \ldots, y_T^*)$ is corrupted with a probability $\beta \in [0, 1]$. If decided to corrupt, we either (1) replace $y_{t+1}^*$ with this token $y_t^*$, (2) replace $y_t^*$ with a token uniformly selected at random from a vocabulary of all unique tokens, or (3) swap $y_t^*$ and $y_{t+1}^*$. This is done sequentially from $y_1^*$ until $y_T^*$.
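The corruption process can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the three operations are read as "copy the current token over the next one", "replace the current token with a random vocabulary token" and "swap the current and next tokens"; the choice among them is assumed to be uniform; and at the last position, where there is no next token, only the random replacement is applied. The names `corrupt`, `beta` and `vocab` are hypothetical.

```python
import random
from typing import List, Sequence


def corrupt(reference: Sequence[str], vocab: Sequence[str], beta: float,
            rng: random.Random) -> List[str]:
    """Return a noisy copy of `reference`, visiting positions left to right."""
    tokens = list(reference)
    for t in range(len(tokens)):
        if rng.random() >= beta:
            continue  # leave this position untouched
        has_next = t + 1 < len(tokens)
        # Assumption: the three operations are equally likely; at the last
        # position only the random replacement is possible.
        op = rng.choice(("repeat", "replace", "swap")) if has_next else "replace"
        if op == "repeat":
            tokens[t + 1] = tokens[t]            # copy current token forward
        elif op == "replace":
            tokens[t] = rng.choice(list(vocab))  # uniform random token
        else:
            tokens[t], tokens[t + 1] = tokens[t + 1], tokens[t]
    return tokens


reference = ["he", "looked", "very", "happy", "."]
vocab = ["he", "she", "looked", "was", "very", "happy", "sad", "."]
noisy = corrupt(reference, vocab, beta=0.5, rng=random.Random(0))
```

None of the operations changes the sequence length, which matters here because the non-autoregressive decoder works with a fixed target length.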

3.3 Training

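Although it is possible to train the proposed non-autoregressive sequence model using either of the objective functions above ($J_{\text{LVM}}$ or $J_{\text{DAE}}$), we propose to stochastically mix these two objective functions. We do so by randomly replacing each term $\hat{Y}^{l-1}$ in Eq. (4) with $\tilde{Y}$ in Eq. (5). In other words,

$$J(\theta) = \sum_{l=0}^{L} \left( \alpha_l \sum_{t=1}^{T} \log p_\theta(y_t^* \mid \tilde{Y}^l, X) + (1 - \alpha_l) \sum_{t=1}^{T} \log p_\theta(y_t^* \mid \hat{Y}^{l-1}, X) \right), \qquad (6)$$

where $\tilde{Y}^l \sim C(Y \mid Y^*)$, and $\alpha_l$ is a sample from a Bernoulli distribution with the probability $p_{\text{DAE}}$. $p_{\text{DAE}}$ is a hyperparameter. As the first conditional $p(Y^0 \mid X)$ in Eq. (2) does not take as input any target $Y$, we set $\alpha_0 = 0$ always.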
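The stochastic mixing can be sketched in a few lines. The snippet assumes one Bernoulli flag per refinement step, with the first step always using the latent-variable term because it receives no target as input. The names `sample_mixing_flags`, `hybrid_objective` and `p_dae` are hypothetical, and the per-step log-probabilities are placeholder numbers standing in for model outputs.

```python
import random
from typing import List, Sequence


def sample_mixing_flags(num_steps: int, p_dae: float,
                        rng: random.Random) -> List[int]:
    """One flag per step: 1 = feed a corrupted reference, 0 = feed the
    previous prediction. The first step never takes a target as input."""
    flags = [0]
    for _ in range(num_steps - 1):
        flags.append(1 if rng.random() < p_dae else 0)
    return flags


def hybrid_objective(flags: Sequence[int],
                     dae_log_probs: Sequence[float],
                     lvm_log_probs: Sequence[float]) -> float:
    """Sum, over steps, of the term selected by each flag."""
    return sum(a * dae + (1 - a) * lvm
               for a, dae, lvm in zip(flags, dae_log_probs, lvm_log_probs))


flags = sample_mixing_flags(num_steps=4, p_dae=0.5, rng=random.Random(0))
# Placeholder per-step log-probabilities of the reference.
objective = hybrid_objective(flags,
                             dae_log_probs=[-9.0, -4.0, -3.0, -2.5],
                             lvm_log_probs=[-8.0, -5.0, -3.5, -3.0])
```

Setting `p_dae` to 0 recovers pure lower-bound training and setting it to 1 trains every step after the first as a denoising autoencoder.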


Gu et al. (2017), in the context of machine translation, and Oord et al. (2017), in the context of speech generation, have recently discovered that it is important to use knowledge distillation (Hinton et al., 2015; Kim & Rush, 2016) to successfully train a non-autoregressive sequence model. Following Gu et al. (2017), we also use knowledge distillation by replacing the reference target of each training example with a target generated from a well-trained autoregressive counterpart. Other than this replacement, the cost function in Sec. 3.3 and the model architecture remain unchanged.

Length Prediction

One minor difference between the autoregressive and non-autoregressive models is that the former naturally models the length of a target sequence without any arbitrary upper-bound, while the latter does not. It is hence necessary to separately model $p(T \mid X)$, where $T$ is the length of a target sequence. During training, we simply use the length of each reference target sequence.

3.4 Inference

Inference in the proposed approach is entirely deterministic. We start from the input $X$ and use the first conditional to generate the initial target sequence, i.e.,

$$\hat{y}_t^0 = \arg\max_{y_t} \log p(y_t^0 = y_t \mid X), \quad t = 1, \ldots, T.$$

From here on, we iteratively generate target sequences by

$$\hat{y}_t^l = \arg\max_{y_t} \log p(y_t^l = y_t \mid \hat{Y}^{l-1}, X), \quad t = 1, \ldots, T.$$

Because these conditionals, except for the initial one, are modeled by a single, shared neural network, this refinement can be performed for as many iterations as necessary, potentially more than the number of iterations used for training, until a predefined stopping criterion is met. A criterion can for instance be based on the amount of change in a target sequence after each iteration (i.e., a distance between $\hat{Y}^{l-1}$ and $\hat{Y}^{l}$ falling below a threshold), on the amount of change in the conditional log-probabilities (i.e., the difference between the log-probabilities of $\hat{Y}^{l-1}$ and $\hat{Y}^{l}$ falling below a threshold), or on the computational budget. In the experiments later, we observe that the first criterion is a reasonable choice for both machine translation and image caption generation.
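The inference procedure with the first stopping criterion can be sketched as a simple loop. The two decoders below are toy stand-ins (a "translation" that copies the source one token per step), the Hamming distance is only one possible measure of change, and all names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[str]


def hamming_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def iterative_refinement(
    source: Tokens,
    initial_decoder: Callable[[Tokens], Tokens],
    refining_decoder: Callable[[Tokens, Tokens], Tokens],
    max_steps: int,
    distance: Callable[[Sequence[str], Sequence[str]], float] = hamming_distance,
    tolerance: float = 0,
) -> Tuple[Tokens, int]:
    """Run the first conditional once, then refine until the output stops
    changing or the step budget is exhausted. Returns (output, steps used)."""
    current = initial_decoder(source)
    steps = 1
    while steps < max_steps:
        refined = refining_decoder(source, current)
        steps += 1
        converged = distance(current, refined) <= tolerance
        current = refined
        if converged:
            break
    return current, steps


def toy_initial(source: Tokens) -> Tokens:
    return ["?"] * len(source)          # rough first guess


def toy_refine(source: Tokens, previous: Tokens) -> Tokens:
    refined = list(previous)
    for i, (src, tok) in enumerate(zip(source, previous)):
        if src != tok:
            refined[i] = src            # fix the leftmost wrong position
            break
    return refined


output, used = iterative_refinement(["a", "b", "c"], toy_initial, toy_refine,
                                    max_steps=10)
```

With a budget of 10 steps the toy example stops after 5, the step at which the output no longer changes; a smaller `max_steps` trades quality for latency.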

Figure 2: We compose three transformer blocks (“Encoder”, “Decoder 1” and “Decoder 2”) to implement the proposed non-autoregressive sequence model. See Sec. 5 for more details.

4 Related Work

Non-Autoregressive Neural Machine Translation

Schwenk (2012) proposed a continuous-space translation model (CSTM) to estimate the conditional distribution over a target phrase given a source phrase, while dropping the conditional dependencies among target tokens, as in Eq. (1). The CSTM was found to improve the output of a machine translation system by reranking the $n$-best list. The evaluation was however limited to reranking and to short phrase pairs (up to 7 words on each side) only.

More recently, Kaiser & Bengio (2016) investigated a recurrent stack of convolutional gated recurrent units, called neural GPU (Kaiser & Sutskever, 2015), for machine translation. They evaluated both non-autoregressive and autoregressive approaches, and found that the non-autoregressive approach significantly lags behind the autoregressive variants. Their approach is similar in that a shared stack of a neural GPU is applied multiple times given an input sequence. It however differs from our approach in that each iteration does not output a refined version of the output from the previous iteration.

The recent paper by Gu et al. (2017) is most relevant to the proposed work. In order to capture dependencies among target variables, they introduced a sequence of discrete latent variables. Instead of (approximately) marginalizing them out, they use supervised learning to train an inference network for those latent variables, using an off-the-shelf word alignment tool (Dyer et al., 2013). This is unlike our approach, which does not require any extra supervision. To achieve the best result, Gu et al. (2017) stochastically sample the latent variables and rerank the corresponding target sequences with an external, autoregressive model. This is in stark contrast to the proposed approach, which is fully deterministic and does not rely on any extra reranking mechanism. Furthermore, our approach does not require additional finetuning with reverse KL-divergence.

Parallel WaveNet

Simultaneously with Gu et al. (2017), Oord et al. (2017) presented a successful, non-autoregressive sequence model for speech waveform. They use inverse autoregressive flow (IAF, Kingma et al., 2016) to map a sequence of independent random variables to a target sequence. In order to maximize the performance, they found it helpful to apply the IAF multiple times, similarly to our iterative refinement strategy. Their approach is however restricted to continuous target variables, while the proposed approach in principle could be applied to both discrete and continuous variables. Despite this, we evaluate our approach only on discrete target variables, focusing on natural language sentences.

Deliberation Network

The deliberation network, proposed recently by Xia et al. (2017), incorporates the idea of refinement into neural machine translation. This deliberation network consists of two autoregressive decoders. The second decoder takes into account the translation generated by the first decoder. Our approach significantly expands this by allowing many steps of refinement.


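| Model | Setting | IWSLT’16 En→De | IWSLT’16 De→En | IWSLT’16 GPU | IWSLT’16 CPU | WMT’16 En→Ro | WMT’16 Ro→En | WMT’16 GPU | WMT’16 CPU | WMT’15 En→De | WMT’15 De→En | WMT’15 GPU | WMT’15 CPU | COCO BLEU | COCO GPU | COCO CPU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AR | $b$ = 1 | 28.64 | 34.11 | 70.3 | 32.2 | 31.93 | 31.55 | 55.6 | 15.7 | 23.40 | 26.49 | 53.6 | 15.8 | 23.47 | 4.3 | 2.1 |
| AR | $b$ = 4 | 28.98 | 34.81 | 63.8 | 14.6 | 32.40 | 32.06 | 43.3 | 7.3 | 24.12 | 27.05 | 45.8 | 6.7 | 24.78 | 3.6 | 1.0 |
| Our Model | $i_{dec}$ = 1 | 22.20 | 27.68 | 573.0 | 213.2 | 24.45 | 25.73 | 694.2 | 98.6 | 12.65 | 14.84 | 536.5 | 101.2 | 20.12 | 17.1 | 8.9 |
| Our Model | $i_{dec}$ = 2 | 24.82 | 30.23 | 423.8 | 110.9 | 27.10 | 28.15 | 332.7 | 62.8 | 15.03 | 17.15 | 407.1 | 56.4 | 20.88 | 12.0 | 5.7 |
| Our Model | $i_{dec}$ = 5 | 26.58 | 31.85 | 189.7 | 52.8 | 28.86 | 29.72 | 194.4 | 29.0 | 17.53 | 20.02 | 173.4 | 27.1 | 21.12 | 6.2 | 2.8 |
| Our Model | $i_{dec}$ = 10 | 27.11 | 32.31 | 98.8 | 24.1 | 29.32 | 30.19 | 93.1 | 14.8 | 18.48 | 21.10 | 87.8 | 13.1 | 21.24 | 2.0 | 1.2 |
| Our Model | $i_{dec}$ = 20 | 26.74 | 32.54 | 40.2 | 12.1 | 29.49 | 30.41 | 51.8 | 7.2 | 19.13 | 21.69 | 36.3 | 5.8 | 21.24 | 1.5 | 0.5 |
| Our Model | Adaptive | 27.01 | 32.43 | 125.9 | 29.3 | 29.66 | 30.30 | 226.6 | 16.5 | 18.91 | 21.60 | 90.9 | 12.8 | 21.12 | 10.8 | 4.8 |

Table 1: Generation quality (BLEU) and speed: tokens/sec and images/sec for translation and image captioning respectively. Generation speed is measured assuming sentence-by-sentence generation (no parallelization across sentences) and reported on both GPU and CPU. On translation tasks, the speed was measured on En→{De,Ro}. AR stands for the autoregressive models. $b$ is the beam width. $i_{dec}$ is the number of refinement steps taken during decoding. Adaptive refers to the adaptive number of refinement steps.
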
Figure 3: The evolution of the translation quality on the development set of WMT’15 over the refinement iterations, up to 100.

5 Network Architecture

We use three transformer-based network blocks to implement the conditionals on the right side of Eq. (2). The first block (“Encoder”) encodes the input $X$, the second block (“Decoder 1”) models the first conditional $\log p(Y^0 \mid X)$, and the final block (“Decoder 2”) is shared across iterative refinement steps, modeling $\log p(Y^l \mid \hat{Y}^{l-1}, X)$. These blocks are depicted side-by-side in Fig. 2. The encoder is identical to that from the original Transformer (Vaswani et al., 2017). We however use the decoders from (Gu et al., 2017) with additional positional attention, and use the highway layer (Srivastava et al., 2015) instead of the residual layer (He et al., 2016) to stabilize learning.

Decoder 1 takes as input the original input $X$, padded or shortened according to the length of the corresponding reference target sequence. At each refinement step $l$, decoder 2 takes as input the predicted target sequence $\hat{Y}^{l-1}$ and the sequence of final activation vectors from the previous step. For each position $t$, the embedding vector of the target token and the activation vector are simply summed. In our preliminary experiments, we tried excluding the activation vector from the previous step, which underperformed the current setup.
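The way decoder 2 combines its two inputs can be illustrated with plain lists. This is a toy sketch: the embedding table, the activation values and the function name are made up, and a real implementation would operate on tensors.

```python
from typing import List, Sequence


def decoder2_inputs(token_ids: Sequence[int],
                    embedding_table: Sequence[Sequence[float]],
                    prev_activations: Sequence[Sequence[float]]) -> List[List[float]]:
    """For each position, add the embedding of the previously predicted token
    to the final activation vector of the previous refinement step."""
    inputs = []
    for token_id, activation in zip(token_ids, prev_activations):
        embedding = embedding_table[token_id]
        inputs.append([e + a for e, a in zip(embedding, activation)])
    return inputs


embedding_table = [[1.0, 0.0], [0.0, 1.0]]   # 2 tokens, dimension 2
predicted_ids = [1, 0]                        # tokens predicted at the previous step
prev_activations = [[0.5, 0.5], [1.0, 2.0]]   # previous step's final activations
step_inputs = decoder2_inputs(predicted_ids, embedding_table, prev_activations)
```

Summing keeps the input dimensionality unchanged, so the same decoder block can be applied at every refinement step.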

6 Experimental Settings

We evaluate the proposed approach on two distinct sequence modeling tasks: (1) machine translation and (2) image caption generation. We compare the proposed non-autoregressive model against the autoregressive counterpart both in terms of generation quality, measured in terms of BLEU (Papineni et al., 2002), and generation efficiency, measured in terms of (source) tokens and images per second for translation and image captioning respectively.

Machine Translation

We choose three translation datasets of different sizes: IWSLT’16 En-De, WMT’16 En-Ro and WMT’15 En-De, whose training sets consist of 196k, 610k and 4.5M sentence pairs, respectively. We tokenize all corpora using a script from Moses (Koehn et al., 2007) and segment each word into subword units using Byte Pair Encoding (BPE, Sennrich et al., 2016). We use 40k, 40k and 60k shared BPE tokens for IWSLT’16 En-De, WMT’16 En-Ro and WMT’15 En-De, respectively. For WMT’15 En-De, we use newstest-2013 and newstest-2014 as development and test sets. For WMT’16 En-Ro, we use newsdev-2016 and newstest-2016 as development and test sets. For IWSLT’16 En-De, we use test2013 for validation.

We closely follow the setting from (Gu et al., 2017). In the case of IWSLT’16 En-De, we use the small model. Due to the space constraint, we refer readers to (Vaswani et al., 2017; Gu et al., 2017) for more details on the hyperparameters that define one transformer block. For WMT’15 En-De and WMT’16 En-Ro, we use the base transformer from (Vaswani et al., 2017). In addition to the warm-up learning rate scheduling from (Vaswani et al., 2017), we also experimented with annealing the learning rate linearly. We use neither label smoothing nor averaging of multiple checkpointed models. These decisions were made based on the preliminary experiments. We train each model either on a single NVIDIA P40 (WMT’15 En-De and WMT’16 En-Ro) or on a single NVIDIA P100 (IWSLT’16 En-De) with each minibatch consisting of approximately 2,048 tokens.

Image Caption Generation: MS COCO

We use MS COCO (Lin et al., 2014) for image caption generation. We use the publicly available splits from previous image caption generation work (Karpathy & Li, 2015), consisting of 113,287 training images, 5k validation images and 5k test images. Each image is pre-transformed into a set of 49 512-dimensional feature vectors, using a ResNet-18 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009). The average of these 49 vectors is copied as many times as needed to match the length of the target sentence (reference during training and predicted during evaluation) to form the initial input to decoder 1. We use the base transformer from (Vaswani et al., 2017), except that the number of layers is set to 4 instead of 6. We train each model on a single NVIDIA 1080ti with each minibatch consisting of approximately 1,024 tokens.
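The construction of the initial decoder input for captioning amounts to a mean followed by a copy. The sketch below uses tiny made-up feature vectors in place of the 49 vectors of dimension 512, and the function name is hypothetical.

```python
from typing import List, Sequence


def initial_decoder_input(region_features: Sequence[Sequence[float]],
                          target_length: int) -> List[List[float]]:
    """Average the per-region feature vectors and repeat the mean once per
    target position."""
    num_regions = len(region_features)
    dim = len(region_features[0])
    mean = [sum(vec[d] for vec in region_features) / num_regions
            for d in range(dim)]
    return [list(mean) for _ in range(target_length)]


# Two toy "regions" of dimension 2 stand in for the 49 x 512 feature map.
toy_features = [[1.0, 2.0], [3.0, 4.0]]
decoder1_input = initial_decoder_input(toy_features, target_length=3)
```

Every position starts from the same vector, so any position-specific information at the first step has to come from the decoder itself.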

Target Length Prediction

We formulate the target length prediction as a classification problem of predicting the difference between the target and source lengths for translation and target length for image captioning. All the hidden vectors from the layers of the encoder are summed and fed to a softmax classifier after affine transformation. This length predictor is trained separately once the main non-autoregressive model has been trained and is used only during decoding.
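A length predictor of this kind can be sketched as follows. The snippet assumes that the hidden vectors are summed over both layers and positions, that class `k` corresponds to a length difference of `min_offset + k`, and that the weights are given; all names and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def predict_target_length(
    encoder_layers: Sequence[Sequence[Sequence[float]]],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    source_length: int,
    min_offset: int,
) -> Tuple[int, List[float]]:
    """Sum all encoder hidden vectors, apply an affine map and a softmax over
    length-difference classes, and return (predicted length, probabilities)."""
    dim = len(weights[0])
    pooled = [0.0] * dim
    for layer in encoder_layers:          # assumption: sum over layers
        for vec in layer:                 # and over source positions
            for d in range(dim):
                pooled[d] += vec[d]
    logits = [sum(w * h for w, h in zip(row, pooled)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)                    # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best_class = max(range(len(probs)), key=lambda k: probs[k])
    # Class k encodes a length difference of (min_offset + k).
    return max(1, source_length + min_offset + best_class), probs


toy_layers = [[[1.0, 0.0], [0.0, 1.0]]]           # 1 layer, 2 positions, dim 2
toy_weights = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
toy_bias = [0.0, 0.0, 0.0]
length, probs = predict_target_length(toy_layers, toy_weights, toy_bias,
                                      source_length=2, min_offset=-1)
```

Predicting a difference rather than an absolute length keeps the number of classes small for translation, where target and source lengths are usually close.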

Figure 4: The decoding latencies in terms of sec/sentence using different decoding algorithms on IWSLT’16 En-De. Decoding from the non-autoregressive model is largely constant with respect to the sentence length, while the latency of decoding from the autoregressive model (greedy decoding) increases linearly. Note that the y-axis is in the logarithmic scale.

Training and Inference

We use Adam (Kingma & Ba, 2014) as an optimizer and set the number of intermediate variables in Eq. (2) such that we run four steps of iterative refinement during training. The mixing probability of Sec. 3.3 is selected based on the validation set performance. After both the main non-autoregressive sequence model and target length predictor are trained, we decode by first predicting the target length and running iterative refinement steps until the outputs of consecutive iterations are the same (or the Jaccard distance between consecutive decoded sequences falls below a threshold). To assess the effectiveness of this adaptive scheme, we also test a fixed number of steps. Only in the case of machine translation, we remove any repetition by mapping multiple consecutive occurrences of a token into one.
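The two small utilities mentioned here, collapsing consecutive repetitions and measuring the Jaccard distance between consecutive outputs, can be sketched with the standard library. The set-based definition of the Jaccard distance over tokens is an assumption, since the text does not spell it out, and the function names are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence


def collapse_repeats(tokens: Sequence[str]) -> List[str]:
    """Map each run of identical consecutive tokens to a single token."""
    return [token for token, _ in groupby(tokens)]


def jaccard_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """1 - |A intersect B| / |A union B| over the sets of tokens.
    Assumption: token order and multiplicity are ignored."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


raw = ["the", "the", "news", "was", "was", "was", "good", "the"]
cleaned = collapse_repeats(raw)
change = jaccard_distance(["a", "b"], ["b", "c"])
```

Only adjacent duplicates are merged, so a token that legitimately reappears later in the sentence (the final "the" above) is kept.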

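| Model | En→De | De→En |
|---|---|---|
| AR ($b$ = 1) | 28.64 | 34.11 |
| AR ($b$ = 4) | 28.98 | 24.81 |

| Model | $p_{DAE}$ | distill | En→De rep | En→De no rep | De→En rep | De→En no rep |
|---|---|---|---|---|---|---|
| Our Models | 0 | | 14.62 | 18.03 | 16.70 | 21.18 |
| Our Models | 0 | | 17.42 | 21.08 | 19.84 | 24.25 |
| Our Models | 0 | | 19.22 | 22.65 | 22.15 | 25.24 |
| Our Models | 1 | | 19.83 | 22.29 | 24.00 | 26.57 |
| Our Models | 0.5 | | 20.91 | 23.65 | 24.05 | 28.18 |
| Our Models | 0.5 | ✓ | 26.17 | 27.11 | 31.92 | 32.59 |

Table 2: Ablation study on IWSLT’16 En-De development set. For our models we show BLEU scores both with and without repetitions in the output.
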
Src: seitdem habe ich sieben Häuser in der Nachbarschaft mit den Lichtern versorgt und sie funktionierenen wirklich gut .
Iter 1: and I ’ve been seven homes since in neighborhood with the lights and they ’re really functional .
Iter 2: and I ’ve been seven homes in the neighborhood with the lights , and they ’re a really functional .
Iter 4: and I ’ve been seven homes in neighborhood with the lights , and they ’re a really functional .
Iter 8: and I ’ve been providing seven homes in the neighborhood with the lights and they ’re a really functional .
Iter 20: and I ’ve been providing seven homes in the neighborhood with the lights , and they ’re a very good functional .
Ref: since now , I ’ve set up seven homes around my community , and they ’re really working .

Src: er sah sehr glücklich aus , was damals ziemlich ungewöhnlich war , da ihn die Nachrichten meistens deprimierten .
Iter 1: he looked very happy , which was pretty unusual the , because the news was were usually depressing .
Iter 2: he looked very happy , which was pretty unusual at the , because the news was s depressing .
Iter 4: he looked very happy , which was pretty unusual at the , because news was mostly depressing .
Iter 8: he looked very happy , which was pretty unusual at the time because the news was mostly depressing .
Iter 20: he looked very happy , which was pretty unusual at the time , because the news was mostly depressing .
Ref: there was a big smile on his face which was unusual then , because the news mostly depressed him .

Src: furchtlos zu sein heißt für mich , heute ehrlich zu sein .
Iter 1: to be , for me , to be honest today .
Iter 2: to be fearless , me , is to be honest today .
Iter 4: to be fearless for me , is to be honest today .
Iter 8: to be fearless for me , me to be honest today .
Iter 20: to be fearless for me , is to be honest today .
Ref: so today , for me , being fearless means being honest .

Table 3: Three sample De→En translations from the proposed non-autoregressive sequence model. Source sentences are from the development set of IWSLT’16. The first iteration corresponds to decoder 1, and from thereon, decoder 2 is repeatedly applied to refine the translation. Subsequences with changes across the refinement steps are underlined.

7 Results and Analysis

7.1 Quantitative Analysis

We make some important observations from the results in Table 1. First, the generation quality improves across all the tasks as we run more refinement steps, even beyond the number of iterations used during training. This supports our interpretation of the proposed approach as a conditional denoising autoencoder in Sec. 3.2. To further verify this, we run decoding on WMT’15 (both directions) up to 100 iterations. As shown in Fig. 3, the quality improves up to a certain number of iterations, but from thereon, stagnates and eventually drops. A similar behavior was observed earlier with another generative model using iterative refinement (Raiko et al., 2014). We leave the investigation of this behavior for future work.

Second, the generation efficiency decreases as more refinements are made. Together with the first observation, it suggests that the proposed approach allows us to make a smooth trade-off between the quality and speed. We further observe that the adaptive iteration scheme works well by achieving near-best generation quality with significantly lower computational overhead.

We also observe that the speedup in decoding from the proposed approach is much clearer on GPU than on CPU. This is a consequence of the highly parallel computation of the proposed non-autoregressive model, which is better suited to GPUs, showcasing the potential of using the non-autoregressive model with specialized hardware for parallel computation, such as Google’s TPUs (Jouppi et al., 2017).

The proposed approach however suffers from a degradation in generation quality compared to the autoregressive counterpart, as also observed by Gu et al. (2017). The quality degradation is most evident on WMT’15 En-De. WMT’15 En-De is clearly distinguished from the other datasets in two aspects. First, the average target length of the training set in WMT’15 En-De is longer than in IWSLT’16 En-De (20), WMT’16 En-Ro (26) and COCO (10). This may require unreasonably many refinement steps, or transformer blocks in each decoder, in order to capture long-term dependencies in a target sequence. Second, WMT’15 has many more training examples than the other datasets. It is not clear how these aspects negatively affect the non-autoregressive model, and we leave this analysis for future investigation.

Lastly, it is encouraging to observe that the proposed non-autoregressive model works well on image caption generation. This result confirms the generality of our approach beyond machine translation, unlike that by Gu et al. (2017) which was explicitly designed and tested for machine translation or by Oord et al. (2017) which was for speech synthesis.

Decoding Latency

To better understand the observed decoding speedup, we plot the average seconds per sentence in Fig. 4, measured on GPU while sequentially decoding one sentence at a time. As expected, decoding from the autoregressive model slows down linearly as the length of a sentence grows, while decoding from the proposed non-autoregressive model with a fixed number of iterations has largely constant latency. The adaptive scheme, described in Sec. 3.4, automatically increases the number of refinement steps as the length of a source sentence increases, suggesting that this scheme captures the amount of information in the input well. This increase in latency is however less severe than that of decoding from the autoregressive model.

Generated Caption
Iter 1: a yellow bus parked on parked in of parking road .
Iter 2: a yellow and black on parked in a parking lot .
Iter 3: a yellow and black bus parked in a parking lot .
Iter 4: a yellow and black bus parked in a parking lot .
Reference Captions
a tour bus is parked on the curb waiting
city bus parked on side of hotel in the rain .
bus parked under an awning next to brick sidewalk
a bus is parked on the curb in front of a building .
a double decked bus sits parked under an awning

Generated Caption
Iter 1: a woman standing on playing tennis on a tennis racquet .
Iter 2: a woman standing on a tennis court a tennis racquet .
Iter 3: a woman standing on a tennis court a a racquet .
Iter 4: a woman standing on a tennis court holding a racquet .
Reference Captions
a female tennis player in a black top playing tennis
a woman standing on a tennis court holding a racquet .
a female tennis player preparing to serve the ball .
a woman is holding a tennis racket on a court
a woman getting ready to reach for a tennis ball on the ground

Table 4: Two sample image captions from the proposed non-autoregressive sequence model. The images are from the development set of MS COCO. The first iteration is from decoder 1, while the subsequent ones are from decoder 2. Subsequences with changes across the refinement steps are underlined.

7.2 Ablation Study

We run ablative experiments on IWSLT’16 En-De to investigate the impact of different components in the proposed non-autoregressive sequence model. The results are presented in Table 2.


First, we observe that it is beneficial to use multiple iterations of refinement during training. By using four iterations (one step of decoder 1, followed by three steps of decoder 2), the BLEU score improved by approximately 1.5 points in both directions. We also notice that it is necessary to use the proposed hybrid learning strategy in Sec. 3.3 to maximize the improvement from more iterations during training (mixing probability 0 vs. 1 vs. 0.5 in Table 2). Lastly, knowledge distillation was found crucial to close the gap between the proposed deterministic non-autoregressive sequence model and its autoregressive counterpart, echoing the observations by Gu et al. (2017) and Oord et al. (2017).


As we noticed many instances of repetition in the model output in our preliminary experiments, we decided to remove any repeating, consecutive symbols as a simple post-processing routine. From Table 2, we see that removing such repeating, consecutive symbols improves the quality (approximately +1 BLEU). We did not observe this behavior with image caption generation. This suggests that the proposed iterative refinement is not enough to remove repetitions on its own. Further investigation and development are necessary to properly tackle this issue, which we leave for future work.

7.3 Qualitative Analysis

Machine Translation

In Table 3, we present three sample translations and their iterative refinement steps from the development set of IWSLT’16 (De→En). As expected, the sequence generated from the first iteration is mostly rough and it is iteratively refined over multiple steps. Inspecting the underlined sequences, we see that each iteration does not monotonically improve the translation, but overall modifies the translation towards the reference sentence. Missing words are added, while unnecessary words are dropped. For instance, see the second example. The second iteration removes the unnecessary “were”, and the fourth iteration inserts a new word “mostly”. The phrase “at the time” is gradually added one word at a time.

Image Caption Generation

Table 4 shows two examples of image caption generation from the proposed non-autoregressive sequence model. In this case, we observe that each iteration captures more and more details of the input image. In the first example (left), the bus was described only as a “yellow bus” in the first iteration, but the subsequent iterations refine it into “yellow and black bus”. Similarly, “road” is refined into “lot” by noticing the details such as parking lanes. We notice this behavior in the second example (right) as well. The first iteration does not specify the place in which “a woman” is “standing on”, which is fixed immediately in the second iteration: “standing on a tennis court”. In the final and fourth iteration, the proposed model captures the fact that the “woman” is “holding” a racquet.

8 Conclusion

Following on the exciting, recent success of non-autoregressive neural sequence modeling by Gu et al. (2017) and Oord et al. (2017), we proposed a deterministic non-autoregressive neural sequence model based on the idea of iterative refinement. We designed a learning algorithm specialized to the proposed approach by interpreting the entire model as a latent variable model and each refinement step as a denoising autoencoder.

We implemented our approach using the recently proposed Transformer, the state-of-the-art sequence-to-sequence model, and evaluated it on two tasks: machine translation and image caption generation. On both tasks, we were able to show that the proposed non-autoregressive model performs close to the autoregressive counterpart with a significant speedup in decoding. Qualitative analysis revealed that the proposed iterative refinement indeed refines a target sequence gradually over multiple steps.

Despite these promising results, we observed that the proposed non-autoregressive neural sequence model is outperformed by its autoregressive counterpart in terms of the quality of generated sequences. We believe the following directions should be pursued in the future to narrow this gap. First, the deterministic lower bound in Eq. (3) should be replaced with a tighter bound. Second, we should investigate other corruption processes to better understand the impact of this choice on the generation quality. Lastly, further work on sequence-to-sequence model architectures could yield better results in non-autoregressive sequence modeling.


We acknowledge support from AdeptMind, eBay, Tencent and NVIDIA. This work was partly supported by Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI) and Samsung Electronics (Improving Deep Learning using Latent Structure). We also thank Jiatao Gu for valuable feedback.