Hint-Based Training for Non-Autoregressive Machine Translation

Due to the unparallelizable nature of the autoregressive factorization, AutoRegressive Translation (ART) models have to generate tokens sequentially during decoding and thus suffer from high inference latency. Non-AutoRegressive Translation (NART) models were proposed to reduce the inference time, but so far achieve inferior translation accuracy. In this paper, we propose a novel approach that leverages hints from hidden states and word alignments to help the training of NART models. Our model achieves significant improvement over previous NART models on the WMT14 En-De and De-En datasets and is even comparable to a strong LSTM-based ART baseline while being one order of magnitude faster in inference.





1 Introduction

Neural machine translation has attracted much attention in recent years (Bahdanau et al., 2014, 2016; Kalchbrenner et al., 2016; Gehring et al., 2016). Given a sentence $x$ from the source language, the straightforward way to translate is to generate the target words $y = (y_1, \dots, y_{T_y})$ one by one from left to right. This is also known as AutoRegressive Translation (ART), in which the joint probability is decomposed into a chain of conditional probabilities:

$$P(y \mid x) = \prod_{t=1}^{T_y} P(y_t \mid y_{<t}, x), \tag{1}$$

where $y_{<t}$ denotes the target tokens before position $t$.

While the ART models have achieved great success in terms of translation quality, the time consumption during inference is still far from satisfactory. During training, the predictions at different positions can be estimated in parallel since the ground truth pair $(x, y)$ is exposed to the model. However, during inference, the model has to generate tokens sequentially as $y_{<t}$ must be inferred on the fly. Such autoregressive behavior becomes the bottleneck of the computational time (Wu et al., 2016).

In order to speed up the inference process, a line of work has begun to develop non-autoregressive translation models. These models break the autoregressive dependency by decomposing the joint probability with

$$P(y \mid x) = P(T_y \mid x) \cdot \prod_{t=1}^{T_y} P(y_t \mid x). \tag{2}$$

The loss of autoregressive dependency largely hurts the consistency of the output sentences, increases the difficulty of the learning process, and thus leads to low-quality translation. Previous works mainly focus on adding different components to the NART model to improve the expressiveness of the network structure and overcome the loss of autoregressive dependency (Gu et al., 2017; Lee et al., 2018; Kaiser et al., 2018). However, the computational overhead of the new components hurts the inference speed, contradicting the goal of NART models: to parallelize and speed up neural machine translation models.
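The contrast between the two decoding regimes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_ar_step` and `toy_nar_position` are hypothetical stand-ins for trained models, not part of the paper's system. The point is that the autoregressive loop must wait for each previous token, while the non-autoregressive calls are independent of one another and could run in parallel.

```python
from typing import Callable, List, Sequence


def decode_autoregressive(step_fn: Callable[[Sequence[str], List[str]], str],
                          src: Sequence[str], max_len: int, eos: str) -> List[str]:
    """Greedy left-to-right decoding: step t needs the prefix y_<t, so steps are sequential."""
    out: List[str] = []
    for _ in range(max_len):
        token = step_fn(src, out)      # depends on everything generated so far
        if token == eos:
            break
        out.append(token)
    return out


def decode_non_autoregressive(position_fn: Callable[[Sequence[str], int], str],
                              src: Sequence[str], target_len: int) -> List[str]:
    """Each position depends only on the source and its index, so calls are independent."""
    return [position_fn(src, t) for t in range(target_len)]


# Hypothetical toy "models" (assumptions for illustration only).
def toy_ar_step(src: Sequence[str], prefix: List[str]) -> str:
    return src[len(prefix)].upper() if len(prefix) < len(src) else "<eos>"


def toy_nar_position(src: Sequence[str], t: int) -> str:
    return src[min(t, len(src) - 1)].upper()
```

With a target length that overshoots the source, the toy non-autoregressive decoder repeats its last token, which loosely mirrors the repetitive outputs discussed later in the paper.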

To tackle this, we propose a novel hint-based method for NART model training. We first investigate the causes of the poor performance of the NART model. Compared with the ART model, we find that: (1) the positions where the NART model outputs incoherent tokens have very high hidden-state similarity; (2) the attention distributions of the NART model are more ambiguous than those of the ART model. Therefore, we design two kinds of hints from the hidden states and attention distributions of the ART model to help the training of the NART model. The experimental results show that our model achieves significant improvement over the NART baseline models and is even comparable to the strong ART baseline of Wu et al. (2016).

Figure 1: Case study: the above three figures visualize the hidden state cosine similarities of different models. The axes correspond to the generated target tokens. Each pixel $(i, j)$ shows the cosine similarity between the last-layer hidden states of the $i$-th and $j$-th generated tokens, where the diagonal pixel will always be 1.0.

2 Approach

In this section, we first describe the observations on the ART and NART models, and then discuss what kinds of information can be used as hints to help the training of the NART model. We follow the network structure in Vaswani et al. (2017), use a copy of the source sentence as decoder input, remove the attention masks in decoder self-attention layers and add a positional attention layer as suggested in Gu et al. (2017). We provide a visualization of ART and NART models we used in Figure 3 and a detailed description of the model structure in Appendix.

2.1 Observation: Illed States and Attentions

According to the case study in Gu et al. (2017), the translations of the NART models contain incoherent phrases (e.g. repetitive words) and miss meaningful tokens on the source side, while these patterns do not commonly appear in ART models. After some empirical study, we find two non-obvious facts that lead to this phenomenon.

First, we visualize the cosine similarities between decoder hidden states of a certain layer in both ART and NART models for sampled cases. Mathematically, for a set of hidden states $r_1, \dots, r_T$, the pairwise cosine similarity can be derived by $\cos_{ij} = \langle r_i, r_j \rangle / (\|r_i\| \cdot \|r_j\|)$. We then plot the heatmap of the resulting matrix $\cos$. A typical example is shown in Figure 1, where the cosine similarities in the NART model are larger than those of the ART model, indicating that the hidden states across positions in the NART model are “similar”. Positions with highly correlated hidden states tend to generate the same word and make the NART model output repetitive tokens, e.g., the yellow area on the top-left of Figure 1(b), while this does not happen in the ART model (Figure 1(a)). According to our statistics, 70% of the cosine similarities between hidden states in the ART model are less than 0.25, and 95% are less than 0.5.
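The similarity matrix described above is simple to compute. The following is a minimal sketch in plain Python, assuming hidden states are given as lists of non-zero float vectors; the helper `fraction_below` is a hypothetical name, and counting only off-diagonal pairs is an assumption about how statistics such as "70% are less than 0.25" might be gathered.

```python
import math
from typing import List


def cosine_similarity_matrix(states: List[List[float]]) -> List[List[float]]:
    """Return the matrix whose (i, j) entry is <r_i, r_j> / (||r_i|| * ||r_j||)."""
    norms = [math.sqrt(sum(v * v for v in r)) for r in states]  # assumes non-zero vectors
    n = len(states)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(states[i], states[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def fraction_below(sim: List[List[float]], threshold: float) -> float:
    """Fraction of distinct position pairs (i < j) with similarity below the threshold."""
    values = [sim[i][j] for i in range(len(sim)) for j in range(i + 1, len(sim))]
    return sum(v < threshold for v in values) / len(values)
```

For example, `fraction_below(cosine_similarity_matrix(states), 0.25)` gives the share of position pairs whose hidden states are nearly orthogonal.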

Figure 2: Case study: the above three figures visualize the encoder-decoder attention weights of different models. The x-axis and y-axis correspond to the source and generated target tokens respectively. The attention distribution is from a single head of the third layer encoder-decoder attention, which is the most informative one according to our observation. Each pixel $(i, j)$ shows the attention weight between the $i$-th source token and the $j$-th target token.

Second, we visualize the encoder-decoder attentions for sampled cases, shown in Figure 2. Good attentions between the source and target sentences are usually considered to lead to accurate translation, while poor ones may cause wrong output tokens (Bahdanau et al., 2014). In Figure 2, the attentions of the ART model cover almost all source tokens, while the attentions of the NART model do not cover “farm” but cover “morning” twice. This directly makes the translation result of the NART model worse. These phenomena inspire us to use the intermediate hidden information in the ART model to guide the learning process of the NART model.

2.2 Hints from the ART teacher Model

Our study motivates us to leverage the intermediate hidden information from an ART model to improve the NART model. We focus on how to define hints from a well-trained ART teacher model and use them to guide the training process of a NART student model. We study layer-to-layer hints and assume both the teacher and student models have an $M$-layer encoder and an $N$-layer decoder, despite the difference in stacked components.

Without loss of generality, we discuss our method on a given paired sentence $(x, y)$. In real experiments, losses are averaged over all training data. For the teacher model, we use $a^{tr}_{t,l,h}$ as the encoder-to-decoder attention distribution of the $h$-th head in the $l$-th decoder layer at position $t$, and use $r^{tr}_{t,l}$ as the output of the $l$-th decoder layer after the feed-forward network at position $t$. Correspondingly, $a^{st}_{t,l,h}$ and $r^{st}_{t,l}$ are used for the student model. We propose a hint-based training framework that contains two kinds of hints:

Hints from hidden states

The discrepancy of hidden states motivates us to use hidden states of the ART model as a hint for the learning process of the NART model. One straightforward method is to regularize the $\ell_1$ or $\ell_2$ distance between each pair of hidden states in ART and NART models. However, since the network components are quite different in ART and NART models, applying straightforward regression on hidden states hurts the learning process and fails. Therefore, we design a more implicit loss that helps the student avoid incoherent translation results by imitating the teacher at the hidden-state level:

$$\mathcal{L}_{hid} = \frac{2}{(T_y - 1)\, T_y\, N} \sum_{s=1}^{T_y - 1} \sum_{t=s+1}^{T_y} \sum_{l=1}^{N} \phi(d_{st}, d_{tr}),$$

where $d_{st} = \cos(r^{st}_{s,l}, r^{st}_{t,l})$, $d_{tr} = \cos(r^{tr}_{s,l}, r^{tr}_{t,l})$, and $\phi$ is a penalty function. In particular, we let

$$\phi(d_{st}, d_{tr}) = \begin{cases} -\log(1 - d_{st}), & \text{if } d_{st} \ge \gamma_{st} \text{ and } d_{tr} \le \gamma_{tr}, \\ 0, & \text{otherwise}, \end{cases}$$

where $-1 \le \gamma_{st}, \gamma_{tr} \le 1$ are two thresholds controlling whether to penalize or not. We design this loss since we only want to penalize hidden states that are highly similar in the NART model, but not similar in the ART model. We have tested several alternatives to $-\log(1 - d_{st})$, e.g., $\exp(d_{st})$, and find similar experimental results.
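A minimal sketch of this thresholded hidden-state penalty is given below in plain Python. It assumes the penalty $-\log(1 - d_{st})$ applied only when the student similarity is at least `gamma_st` and the teacher similarity is at most `gamma_tr`, averaged over position pairs and layers; the function names, the nested-list layout `layers[l][t]`, and the small clamp guarding against `log(0)` are assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def hidden_hint_loss(student_layers: List[List[Vector]],
                     teacher_layers: List[List[Vector]],
                     gamma_st: float, gamma_tr: float) -> float:
    """layers[l][t] is the hidden state of decoder layer l at target position t."""
    num_layers = len(student_layers)
    length = len(student_layers[0])
    if length < 2:
        return 0.0  # no position pairs to compare
    total = 0.0
    for l in range(num_layers):
        for s in range(length - 1):
            for t in range(s + 1, length):
                d_st = cosine(student_layers[l][s], student_layers[l][t])
                d_tr = cosine(teacher_layers[l][s], teacher_layers[l][t])
                # Penalize only pairs that are similar in the student but not in the teacher.
                if d_st >= gamma_st and d_tr <= gamma_tr:
                    total += -math.log(max(1.0 - d_st, 1e-12))  # clamp is an assumption
    return 2.0 * total / ((length - 1) * length * num_layers)
```

In a real training setup the same computation would be vectorized on the accelerator; the triple loop here only spells out which pairs contribute.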

Hints from word alignments

We observe that meaningful words in the source sentence are sometimes untranslated by the NART model, and the corresponding positions often suffer from ambiguous attention distributions. Therefore, we use the word alignment information from the ART model to help the training of the NART model.

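In particular, we minimize the KL-divergence between the per-head encoder-to-decoder attention distributions of the teacher and the student to encourage the student to have similar word alignments to the teacher model, i.e.,

$$\mathcal{L}_{align} = \frac{1}{T_y\, N\, H} \sum_{t=1}^{T_y} \sum_{l=1}^{N} \sum_{h=1}^{H} D_{KL}\left(a^{tr}_{t,l,h} \,\|\, a^{st}_{t,l,h}\right),$$

where $H$ is the number of attention heads.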
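The attention-matching term can be sketched as an average KL-divergence over layers, heads and target positions. The sketch below is plain Python and rests on assumptions: the direction KL(teacher || student), the nested-list layout `attn[l][h][t]` holding a distribution over source positions, and the small `eps` clamp on student probabilities are all illustrative choices rather than details confirmed by the paper.

```python
import math
from typing import List

Distribution = List[float]


def kl_divergence(p: Distribution, q: Distribution, eps: float = 1e-12) -> float:
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def alignment_hint_loss(teacher_attn: List[List[List[Distribution]]],
                        student_attn: List[List[List[Distribution]]]) -> float:
    """attn[l][h][t] is the attention over source tokens for layer l, head h, position t."""
    total, count = 0.0, 0
    for layer_t, layer_s in zip(teacher_attn, student_attn):
        for head_t, head_s in zip(layer_t, layer_s):
            for dist_t, dist_s in zip(head_t, head_s):
                total += kl_divergence(dist_t, dist_s)
                count += 1
    return total / count
```

The loss is zero when the student reproduces the teacher's attention exactly and grows as the student spreads mass over source tokens the teacher ignores.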

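Our final training loss $\mathcal{L}$ is a weighted sum of the two parts stated above and the negative log-likelihood loss $\mathcal{L}_{nll}$ defined on the bilingual sentence pair $(x, y)$, i.e.,

$$\mathcal{L} = \mathcal{L}_{nll} + \lambda \mathcal{L}_{hid} + \mu \mathcal{L}_{align},$$

where $\lambda$ and $\mu$ are hyperparameters controlling the weight of different loss terms.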

Figure 3: Hint-based training from ART model to NART model.

3 Experiments

3.1 Experimental Settings

The evaluation is on two widely used public machine translation datasets: IWSLT14 German-to-English (De-En) (Huang et al., 2017; Bahdanau et al., 2016) and WMT14 English-to-German (En-De) dataset (Wu et al., 2016; Gehring et al., 2017). To compare with previous works, we also reverse WMT14 English-to-German dataset and obtain WMT14 German-to-English dataset.

We pretrain Transformer (Vaswani et al., 2017) as the teacher model on each dataset, which achieves 33.26/27.30/31.29 in terms of BLEU (Papineni et al., 2002) in IWSLT14 De-En, WMT14 En-De and De-En test sets. The student model shares the same number of layers in encoder/decoder, size of hidden states/embeddings and number of heads as the teacher models (Figure 3). Following Gu et al. (2017); Kim and Rush (2016), we replace the target sentences by the decoded output of the teacher models.

Hyperparameters for hint-based learning (the two thresholds and the two loss weights) are determined to make the scales of the three loss components similar after initialization. We also employ label smoothing (Szegedy et al., 2016) in all experiments. We use the Adam optimizer and follow the setting in Vaswani et al. (2017). Models for the WMT14/IWSLT14 tasks are trained on 8/1 NVIDIA M40 GPUs respectively. The model is based on the open-sourced tensor2tensor (Vaswani et al., 2018); our open-source code can be found at https://github.com/zhuohan123/hint-nart. More settings can be found in the Appendix.

3.2 Inference

During training, $T_y$ does not need to be predicted as the target sentence is given. During testing, we have to predict the length of the target sentence for each source sentence. In many languages, the length of the target sentence can be roughly estimated from the length of the source sentence. We choose a simple method to avoid computational overhead, which uses the input length to determine the target sentence length: $T_y = T_x + C$, where $C$ is a constant bias determined by the average length difference between the source and target training sentences. We can also predict the target length ranging from $[(T_x + C) - B, (T_x + C) + B]$, where $B$ is the halfwidth. By doing this, we can obtain multiple translation results with different lengths. Note that we choose this method only to show the effectiveness of our proposed method; a more advanced length estimation method could be used to further improve the performance.

Once we have multiple translation results, we additionally use our ART teacher model to evaluate each result and select the one that achieves the highest probability. As the evaluation is fully parallelizable (since it is identical to the parallel training of the ART model), this rescoring operation will not hurt the non-autoregressive property of the NART model.
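The length heuristic and the teacher rescoring step can be sketched as follows. This is a simplified illustration in plain Python: `candidate_lengths` enumerates target lengths around the source length plus a constant bias, and `rescore` picks the candidate that a scoring function rates highest. The scoring callable stands in for the ART teacher's log-probability and is hypothetical, as is the choice to drop non-positive lengths.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def candidate_lengths(source_len: int, bias: int, halfwidth: int) -> List[int]:
    """Target lengths in [source_len + bias - halfwidth, source_len + bias + halfwidth]."""
    center = source_len + bias
    # Dropping lengths below 1 is an assumption; very short sources are not discussed.
    return [n for n in range(center - halfwidth, center + halfwidth + 1) if n >= 1]


def rescore(candidates: Sequence[T], teacher_log_prob: Callable[[T], float]) -> T:
    """Return the candidate translation to which the teacher assigns the highest score."""
    return max(candidates, key=teacher_log_prob)
```

With a halfwidth of 4 this yields 2 * 4 + 1 = 9 lengths, one NART decoding pass per length, and a single parallel teacher pass to score all of them.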

| Models | WMT14 En-De | WMT14 De-En | IWSLT14 De-En | Latency | Speedup |
|---|---|---|---|---|---|
| Autoregressive models | | | | | |
| LSTM-based S2S (Wu et al., 2016; Bahdanau et al., 2016) | 24.60 | / | 28.53 | / | / |
| ConvS2S (Gehring et al., 2017; Edunov et al., 2017) | 26.43 | / | 32.84 | / | / |
| Transformer (Vaswani et al., 2017) | 27.30 | 31.29 | 33.26 | 784 ms | 1.00 |
| Non-autoregressive models | | | | | |
| FT (Gu et al., 2017) | 17.69 | 20.62 | / | 39 ms | 15.6 |
| FT (rescoring 10 candidates) | 18.66 | 22.41 | / | 79 ms | 7.68 |
| FT (rescoring 100 candidates) | 19.17 | 23.20 | / | 257 ms | 2.36 |
| IR (Lee et al., 2018, adaptive refinement steps) | 21.54 | 25.43 | / | / | 2.39 |
| LT (Kaiser et al., 2018) | 19.8 | / | / | 105 ms | 5.78 |
| LT (rescoring 10 candidates) | 21.0 | / | / | / | / |
| LT (rescoring 100 candidates) | 22.5 | / | / | / | / |
| NART w/ hints | 21.11 | 25.24 | 25.55 | 26 ms | 30.2 |
| NART w/ hints ($B = 4$, 9 candidates) | 25.20 | 29.52 | 28.80 | 44 ms | 17.8 |

Table 1: Performance on WMT14 En-De, De-En and IWSLT14 De-En tasks. “/” means non-reportable.

3.3 Experimental Results

We compare our model with several baselines, including three ART models, the fertility based (FT) NART model (Gu et al., 2017), the deterministic iterative refinement based (IR) NART model (Lee et al., 2018), and the Latent Transformer (LT; Kaiser et al., 2018) which is not fully non-autoregressive by incorporating an autoregressive sub-module in the NART model architecture.

The results are shown in Table 1. Latencies for our ART teacher and NART models are measured on our own platform, while the other latencies are those reported by previous works. Note that the latencies may be affected by hardware settings and such absolute values are not fair for direct comparison, so we also list the speedup of each work compared to its own ART baseline. Across different datasets, our method achieves significant improvements over previous non-autoregressive models. Specifically, our method outperforms the fertility-based NART model by 6.54/7.11 BLEU points on the WMT En-De and De-En tasks in similar settings and achieves results comparable to the state-of-the-art LSTM-based model on the WMT En-De task. Furthermore, our model achieves a speedup of 30.2 (outputting a single sentence) or 17.8 (teacher rescoring) times over its ART counterpart. Note that our speedups significantly exceed those of all previous works because of the lighter design of our NART model, which contains no computationally expensive module added to improve expressiveness.

We also visualize the hidden state cosine similarities and attention distributions for the NART model with hint-based training, as shown in Figures 1(c) and 2(c). With hints from hidden states, the hidden-state similarities of the NART model decrease in general, and especially for the positions where the original NART model outputs incoherent phrases. The attention distribution of the NART model after hint-based training is more similar to that of the ART teacher model and less ambiguous compared to the NART model without hints.

According to our empirical analysis, the percentage of repetitive words drops from 8.3% to 6.5% with our proposed methods on the IWSLT14 De-En test set, which is a reduction of more than 20%. This shows that our proposed method effectively improves the quality of the translation outputs. We also provide several case studies in the Appendix.

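| | NART without hints | + hints from word alignments | + hints from word alignments and hidden states |
|---|---|---|---|
| BLEU | 23.08 | 24.76 | 25.55 |
| Long-sentence BLEU | 17.48 | 19.24 | 20.63 |

Table 2: Ablation studies on IWSLT14 De-En. Results are BLEU scores without teacher rescoring.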

Finally, we conduct an ablation study on the IWSLT14 De-En task. As shown in Table 2, the hints from word alignments provide an improvement of about 1.6 BLEU points, and the hints from hidden states improve the results by about 0.8 BLEU points. We also test these models on a subsampled set whose source sentence lengths are at least 40. Our model outperforms the baseline model by more than 3 BLEU points (20.63 vs. 17.48).

4 Conclusion

In this paper, we propose to use hints from a well-trained ART model to enhance the training of NART models. Our results on WMT14 En-De and De-En significantly outperform previous NART baselines, while inference is one order of magnitude faster than with ART models. In the future, we will focus on designing new architectures and training methods for NART models to achieve accuracy comparable to ART models.


This work is supported by National Key R&D Program of China (2018YFB1402600), NSFC (61573026) and BJNSF (L172037) and a grant from Microsoft Research Asia. We would like to thank the anonymous reviewers for their valuable comments on our paper.



Appendix A Related Works

A.1 AutoRegressive Translation

Given a sentence from the source language, the straightforward way to translate is to generate the words in the target language one by one from left to right. This is also known as the autoregressive factorization, in which the joint probability is decomposed into a chain of conditional probabilities, as in Eqn. (1). Deep neural networks are widely used to model such conditional probabilities based on the encoder-decoder framework. The encoder takes the source tokens $x$ as input and encodes them into a set of context states $c$. The decoder takes $c$ and the subsequence $y_{<t}$ as input and estimates $P(y_t \mid y_{<t}, c)$ according to some parametric function.

There are many design choices in the encoder-decoder framework based on different types of layers, e.g., recurrent neural network (RNN)-based (Bahdanau et al., 2014), convolutional neural network (CNN)-based (Gehring et al., 2017) and recent self-attention-based (Vaswani et al., 2017) approaches. We show a self-attention-based network (Transformer) in the left part of Figure 3. While the ART models have achieved great success in terms of translation quality, the time consumption during inference is still far from satisfactory. During training, the ground truth pair $(x, y)$ is exposed to the model, and thus the predictions at different positions can be estimated in parallel with CNN or self-attention networks. However, during inference, given a source sentence $x$, the decoder has to generate tokens sequentially, as the decoder inputs $y_{<t}$ must be inferred on the fly. Such autoregressive behavior becomes the bottleneck of the computational time (Wu et al., 2016).

A.2 Non-AutoRegressive Translation

In order to speed up the inference process, a line of work has begun to develop non-autoregressive translation models. These models follow the encoder-decoder framework and inherit the encoder structure from the autoregressive models. After the encoder generates the context states $c$, a separate module is used to predict the target sentence length $T_y$ and the decoder inputs $z$ by a parametric function of the source sentence, which is either deterministic or stochastic. The decoder then predicts $y$ based on the following probabilistic decomposition:

$$P(y \mid T_y, z, x) = \prod_{t=1}^{T_y} P(y_t \mid z, x).$$

Different configurations of $T_y$ and $z$ enable the decoder to produce different target sentences $y$ given the same input sentence $x$, which increases the output diversity of the translation models.

Previous works mainly pay attention to different design choices of $z$. Gu et al. (2017) introduce fertilities, corresponding to the number of target tokens occupied by each of the source tokens, and use a non-uniform copy of the encoder inputs as $z$ according to the fertility of each input token. The prediction of fertilities is done by a separate neural network-based module. Lee et al. (2018) define $z$ by a sequence of generated target sentences $y^{(0)}, y^{(1)}, \dots$, where each $y^{(i)}$ is a refinement of $y^{(i-1)}$. Kaiser et al. (2018) use a sequence of autoregressively generated discrete latent variables as inputs of the decoder.

While the expressiveness of $z$ is improved by these design choices, the computational overhead of generating $z$ hurts the inference speed of the NART models. Compared to the more than 15× speedup in Gu et al. (2017), which uses a relatively simple design of $z$, the speedup of Kaiser et al. (2018) is reduced to about 5×, and the speedup of Lee et al. (2018) is reduced to about 2×. This contradicts the design goal of NART models: to parallelize and speed up neural machine translation models.

A.3 Knowledge Distillation and Hint-Based Training

Knowledge Distillation (KD) was first proposed by Hinton et al. (2015) and trains a small student network from a large (possibly ensemble) teacher network. The training objective of the student network contains two parts. The first part is the standard classification loss, e.g., the cross-entropy loss defined on the student network and the training data. The second part is defined between the output distributions of the student network and the teacher network, e.g., using KL-divergence. Kim and Rush (2016) introduce the KD framework to neural machine translation models. They replace the ground truth target sentence with the sentence generated by a well-trained teacher model. Sentence-level KD has also proved helpful for non-autoregressive translation in multiple previous works (Gu et al., 2017; Lee et al., 2018).

However, knowledge distillation only uses the outputs of the teacher model, but ignores the rich hidden information inside a teacher model. Romero et al. (2014) introduced hint-based training to leverage the intermediate representations learned by the teacher model as hints to improve the training process and final performance of the student model. Hu et al. (2018) used the attention weights as hints to train a small student network for reading comprehension.

Appendix B Network Architecture

Encoder and decoder

Same as the ART model, the encoder of the NART model takes the embeddings of source tokens as inputs and generates a set of context vectors. Following Vaswani et al. (2017) and Gu et al. (2017), we also use positional embeddings to model the relative correlation between positions and add them to the word embeddings on both the source and target sides. The positional embedding is represented by sinusoidal functions of different frequencies to encode different positions. Specifically, the positional encoding is computed as $PE(p, i) = \sin\left(p / 10000^{i / d_{model}}\right)$ (for even $i$) or $PE(p, i) = \cos\left(p / 10000^{(i-1) / d_{model}}\right)$ (for odd $i$), where $p$ is the position index and $i$ is the dimension index of the embedding vector.

As discussed in the main paper, the NART model needs to predict the decoder inputs $z$ given length $T_y$ and source sentence $x$. We use a simple and efficient method to predict $z = (z_1, \dots, z_{T_y})$. Given source sentence $x = (x_1, \dots, x_{T_x})$ and target length $T_y$, we denote $e(x_i)$ as the embedding of $x_i$. We linearly combine the embeddings of all the source tokens to generate $z_j$ as follows:

$$z_j = \sum_{i=1}^{T_x} w_{ij} \cdot e(x_i),$$

where $w_{ij}$ is the normalized weight that controls the contribution of $e(x_i)$ to $z_j$ according to

$$w_{ij} = \frac{1}{Z(j)} \exp\left(-\frac{(j - j'(i))^2}{\tau}\right),$$

where $j'(i) = \frac{T_y}{T_x}\, i$ and $Z(j) = \sum_{i'=1}^{T_x} \exp\left(-(j - j'(i'))^2 / \tau\right)$. $\tau$ is a hyperparameter to control the “sharpness” of the weight distribution. We use $z$ to denote the output of this weighted average function, to be consistent with the non-autoregressive decomposition.
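The weighted copy of source embeddings into decoder inputs can be illustrated with a short plain-Python sketch. The exact weighting used here, a softmax over the negative squared distance between target position `j` and the rescaled source position `(target_len / src_len) * i`, divided by a sharpness parameter `tau`, is an assumption made for illustration, as are the function name and the 1-based position convention.

```python
import math
from typing import List

Vector = List[float]


def decoder_inputs(source_embeddings: List[Vector], target_len: int, tau: float) -> List[Vector]:
    """Build target_len decoder input vectors as weighted averages of source embeddings."""
    src_len = len(source_embeddings)
    dim = len(source_embeddings[0])
    outputs: List[Vector] = []
    for j in range(1, target_len + 1):
        # Score each source position by its distance to target position j (assumed form).
        scores = []
        for i in range(1, src_len + 1):
            aligned = target_len / src_len * i
            scores.append(-((j - aligned) ** 2) / tau)
        top = max(scores)                      # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]  # weights over source positions sum to 1
        outputs.append([sum(w * e[k] for w, e in zip(weights, source_embeddings))
                        for k in range(dim)])
    return outputs
```

A small `tau` makes each decoder input close to a copy of the nearest source embedding, while a large `tau` blends neighbouring source tokens more evenly.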

Three types of multi-head attention

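The ART and NART models share two types of multi-head attention: multi-head self-attention and multi-head encoder-to-decoder attention. The NART model additionally uses multi-head positional attention to model local word orders within the sentence (Vaswani et al., 2017; Gu et al., 2017). A general attention mechanism can be formulated as querying a dictionary with key-value pairs (Vaswani et al., 2017), e.g.,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{model}}}\right) \cdot V,$$

where $d_{model}$ is the dimension of hidden representations and $Q$ (Query), $K$ (Key), $V$ (Value) differ among the three types of attention. For self-attention, $Q$, $K$ and $V$ are hidden representations of the previous layer. For encoder-to-decoder attention, $Q$ is the hidden representations of the previous layer, whereas $K$ and $V$ are context vectors from the encoder. For positional attention, positional embeddings are used as $Q$ and $K$, and hidden representations of the previous layer are used as $V$. The multi-head variant of attention allows the model to jointly attend to information from different representation subspaces, and is defined as

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^{O}, \qquad \mathrm{head}_h = \mathrm{Attention}(Q W_h^{Q}, K W_h^{K}, V W_h^{V}),$$

where $W_h^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_h^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_h^{V} \in \mathbb{R}^{d_{model} \times d_v}$, and $W^{O} \in \mathbb{R}^{H d_v \times d_{model}}$ are projection parameter matrices, $H$ is the number of heads, and $d_k$ and $d_v$ are the numbers of dimensions.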
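The three attention types differ only in what is passed as queries, keys and values. The single-head sketch below, in plain Python, shows scaled dot-product attention and how the same routine could serve as self-attention, encoder-to-decoder attention, or positional attention. It is an illustration under assumptions: projections and multiple heads are omitted, and the scores are scaled by the square root of the key dimension.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention: each query returns a weighted average of the values."""
    scale = math.sqrt(len(keys[0]))
    outputs: List[Vector] = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

# Self-attention:           attention(hidden, hidden, hidden)
# Encoder-to-decoder:       attention(hidden, context, context)
# Positional attention:     attention(positions, positions, hidden)
```

Because positional attention takes its queries and keys from positional embeddings alone, its weights depend only on positions, while the values it mixes are the hidden states.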

In addition to multi-head attention, the encoder and decoder also contain fully connected feed-forward network (FFN) layers with ReLU activations, which are applied to each position separately and identically. Compositions of self-attention, encoder-to-decoder attention, positional attention, and position-wise feed-forward networks are stacked to form the encoder and decoder of the ART model and the NART model, with residual connections (He et al., 2016) and layer normalization (Ba et al., 2016).

Table 3: Cases on IWSLT14 De-En.

Appendix C Extra Experimental Settings

Dataset specifications

The training/validation/test sets of the IWSLT14 dataset (https://wit3.fbk.eu/) contain about 153K/7K/7K sentence pairs, respectively. The training set of the WMT14 dataset (http://www.statmt.org/wmt14/translation-task) contains 4.5M parallel sentence pairs. Newstest2014 is used as the test set, and Newstest2013 is used as the validation set. In both datasets, tokens are split into a 32000 word-piece dictionary (Wu et al., 2016) which is shared between the source and target languages.

Model specifications

For the WMT14 dataset, we use the default network architecture of the base Transformer model in Vaswani et al. (2017), which consists of a 6-layer encoder and a 6-layer decoder. The sizes of hidden states and embeddings are set to 512. For the IWSLT14 dataset, we use a smaller architecture, which consists of a 5-layer encoder and a 5-layer decoder. The sizes of hidden states and embeddings are set to 256 and the number of heads is set to 4.

Hyperparameter specifications

Hyperparameters (the two thresholds and the two loss weights) are determined to make the scales of the three loss components similar after initialization. Specifically, we use one setting for IWSLT14 De-En and another setting shared by WMT14 De-En and WMT14 En-De.

BLEU scores

We use the BLEU score (Papineni et al., 2002) as our evaluation measure. During inference, we set the length bias $C$ separately for the WMT14 En-De, De-En and IWSLT14 De-En datasets, according to the average lengths of different languages in the training sets. When using the teacher to rescore, we set the halfwidth $B = 4$ and thus have 9 candidates in total. We also evaluate the average per-sentence decoding latencies on one NVIDIA TITAN Xp GPU card by decoding the WMT14 En-De test set with batch size 1 for our ART teacher model and NART models, and calculate the speedup based on them.

Appendix D Case Study

We provide some case studies for the NART models with and without hints in Table 3. From the first case, we can see that the model without hints translates the meaning of “as far as I’m concerned” to a set of meaningless tokens. In the second case, the model without hints omits the phrase “the farm” and replaces it with a repetitive phrase “every morning”. In the third case, the model without hints mistakenly puts the word “uploaded” to the beginning of the sentence, whereas our model correctly translates the source sentence. In all cases, hint-based training helps the NART model to generate better target sentences.