Neural machine translation (NMT) models with the encoder-decoder framework (sutskever2014sequence; bahdanau2014neural) significantly outperform conventional statistical machine translation models (koehn2003statistical; koehn2007moses) in translation quality. Despite their success, state-of-the-art NMT models usually suffer from slow inference, which has become a bottleneck for applying NMT in real-world translation systems. The slow inference of NMT models is due to their autoregressive property, i.e., decoding the target sentence word by word according to the translation history.
Recently, gu2017non introduced non-autoregressive NMT (NAT), which decodes all target words simultaneously to break the bottleneck of autoregressive NMT (AT) models. To this end, NAT models (gu2017non; wei2019imitation; wang2019non; guo2019non; shao2019retrieving) usually copy the source word representations directly to the input of the decoder, instead of using previously predicted target word representations. Hence, the inference of different target words is independent, which enables parallel computation of the decoder in NAT models. NAT models could achieve - times speedup compared to AT models while maintaining considerable translation quality.
However, existing NAT systems ignore the dependencies among target words and generate all target words simultaneously, which makes the search space in the decoding procedure too large to be modeled well. Specifically, when decoding a target word, in order to determine which part of the source sentence it is translated from, NAT models need to search a large global hypothesis space to infer what is expressed by the preceding and following words in the translation. Consequently, the large decoding space makes NAT models generate translations conditioned on insufficient or inaccurate source information, leading to missing, repeated and even wrong translations. This problem is not severe for AT models because they only need to decode a target word in a small local hypothesis space conditioned on previously translated words.
In this paper, to address this issue, we propose a novel NAT framework named ReorderNAT, which explicitly models reordering information to guide the decoding of NAT. To be specific, as shown in Figure 1, ReorderNAT first reorders the source sentence into a pseudo-translation formed by source words but in the target language structure, and then translates the pseudo-translation into the target language to obtain the final translation. We further introduce two guiding decoding strategies that utilize the reordering information (i.e., the pseudo-translation) to guide the search direction in decoding. The first is deterministic guiding decoding, which first generates the most likely pseudo-translation and then generates the target sentence based on it. The second is non-deterministic guiding decoding, which utilizes the conditional distribution of the pseudo-translation as a latent variable to guide the decoding of the target sentence.
The search space in the decoding procedure of ReorderNAT is much smaller than the whole decoding space of the original NAT: (1) the decoding space of the reordering module in generating the pseudo-translation is limited to permutations of the source words; (2) with the guidance of the reordering information, what each target word expresses can be narrowed almost to the word of the pseudo-translation at the same position. Therefore, ReorderNAT can effectively reduce the decoding search space by introducing reordering information into NAT.
Experimental results on several widely-used public benchmarks show that our proposed ReorderNAT model achieves significant and consistent improvements over existing NAT models by explicitly modeling the reordering information to guide the decoding. Moreover, by introducing a simple but effective AT decoder to model reordering information, ReorderNAT immensely narrows the translation quality gap between AT and NAT models while maintaining a considerable speedup (nearly six times faster). We will release all source code and related resources of this work for further research.
Non-autoregressive neural machine translation (NAT) was first proposed by gu2017non to alleviate the slow decoding of autoregressive neural machine translation (AT) models; it generates all target words simultaneously by removing the dependencies among them. Formally, given a source sentence $X = \{x_1, \dots, x_n\}$ and a target sentence $Y = \{y_1, \dots, y_m\}$, NAT models the translation probability from $X$ to $Y$ as a product of conditionally independent target word probabilities:
$$P(Y \mid X) = \prod_{i=1}^{m} P(y_i \mid X) \tag{1}$$
Instead of utilizing the previous translation history, NAT models usually copy the sequence of source word representations as the input of the decoder. Hence, when translating a sentence, NAT models can predict each target word individually by maximum likelihood, breaking the dependency among the target words; the decoding procedure therefore runs in parallel and has very low translation latency.
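The independence assumption can be illustrated with a minimal sketch. It assumes the decoder has already produced one row of vocabulary scores per target position; the function names and the toy score matrix are hypothetical and serve only as an illustration, not as the implementation used in this work.

```python
import numpy as np

def nat_decode(logits):
    """Pick every target word independently: one argmax per target position."""
    # logits has shape (target_length, vocab_size); rows do not depend on each other.
    return np.argmax(logits, axis=-1)

def nat_log_prob(logits, target_ids):
    """log P(Y|X) as a sum of per-position log-probabilities (independence assumption)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerically stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(log_probs[np.arange(len(target_ids)), target_ids].sum())

# Toy scores for 3 target positions over a 3-word vocabulary (made-up numbers).
toy_logits = np.array([[2.0, 0.5, 0.1],
                       [0.2, 0.3, 3.0],
                       [0.1, 1.5, 0.4]])
prediction = nat_decode(toy_logits)
```

Because the per-position terms are independent, the position-wise argmax also maximizes the product over the whole sentence, which is what allows all positions to be decoded in a single parallel step.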
However, since NAT models discard the sequential dependencies among words in the target sentence, they suffer from potential performance degradation due to the explosion of the decoding search space. To be specific, when decoding a target word, the NAT model must figure out not only what target-side information the word describes but also what is expressed by the other target words. With this explosion of the decoding search space, NAT models cannot effectively learn the intricate translation patterns from source sentences to target sentences, which leads to inferior translation quality.
In this section, we introduce a novel NAT model named ReorderNAT, which aims to break the explosion of search space in the decoding procedure of NAT models.
As shown in Figure 1, ReorderNAT employs a reordering module to explicitly model the reordering information in the decoding. Formally, ReorderNAT first translates the source sentence $X$ into a pseudo-translation $Z = \{z_1, \dots, z_m\}$, which reorders the source sentence structure into that of the target language, and then translates the pseudo-translation into the target sentence $Y$. ReorderNAT models the overall translation probability as:
$$P(Y \mid X) = \sum_{Z} P(Z \mid X)\, P(Y \mid Z, X) \tag{2}$$
where $P(Z \mid X)$ is modeled by the reordering module and $P(Y \mid Z, X)$ is modeled by the decoder module. The encoder module in ReorderNAT is a multi-layer Transformer, which is the same as in the original NAT, and thus we do not introduce it in detail.
3.1.1 Reordering Module
The reordering module determines the source-side information of each target word by learning to translate the source sentence into the pseudo-translation. We propose two feasible implementations of the reordering module:
(1) NAT Reordering Module: Intuitively, the pseudo-translation probability can also be modeled as NAT:
$$P(Z \mid X) = \prod_{i=1}^{m} P(z_i \mid X) \tag{3}$$
where the reordering module is a one-layer Transformer.
(2) AT Reordering Module: Moreover, we find that AT models are more suitable for modeling the reordering information than NAT models, and even a light AT model with decoding speed similar to that of a large NAT model could achieve better performance in modeling reordering information. Hence, we also introduce a light AT model to model the pseudo-translation probability as:
$$P(Z \mid X) = \prod_{i=1}^{m} P(z_i \mid z_{<i}, X) \tag{4}$$
where $z_{<i}$ indicates the pseudo-translation history, and the reordering module is a one-layer recurrent neural network.
3.1.2 Decoder Module
The decoder module generates the target translation under the guidance of the pseudo-translation, and regards the translation of each word as NAT:
$$P(Y \mid Z, X) = \prod_{i=1}^{m} P(y_i \mid Z, X) \tag{5}$$
As shown in Figure 1, the encoder module and the decoder module can be viewed as a seq-to-seq model that translates the source sentence into the target sentence. Unlike the original NAT, the inputs of our decoder module are the embeddings of the pseudo-translation instead of copied embeddings of the source sentence, which guides the decoding direction.
3.2 Guiding Decoding Strategy
ReorderNAT explicitly models the reordering information of NAT and aims to utilize it to alleviate the issue of the explosive decoding search space of NAT. The remaining problem is how to perform decoding with the guidance of reordering information. We propose to utilize the pseudo-translation as a bridge to guide the decoding of the target sentence, which can be formulated as:
$$Y^{*} = \arg\max_{Y} P(Y \mid X) = \arg\max_{Y} \sum_{Z} P(Z \mid X)\, P(Y \mid Z, X) \tag{6}$$
It is intractable to obtain an exact solution for maximizing Eq. 6 due to the high time complexity. Inspired by the pre-ordering works in statistical machine translation, we propose a deterministic guiding decoding (DGD) strategy and a non-deterministic guiding decoding (NDGD) strategy to solve this problem.
The DGD strategy first generates the most probable pseudo-translation of the source sentence and then decodes the target translation conditioned on it:
$$Z^{*} = \arg\max_{Z} P(Z \mid X), \qquad Y^{*} = \arg\max_{Y} P(Y \mid Z^{*}, X) \tag{7}$$
The DGD approach is simple and effective, but it brings in some noise in the approximation.
Different from the DGD strategy, which uses a deterministic pseudo-translation as guidance, the NDGD strategy regards the probability distribution $Q$ of the pseudo-translation as a latent variable and models translation as generating the target sentence according to the latent variable $Q$, i.e., Eq. 6 is reformulated as:
$$Y^{*} = \arg\max_{Y} P(Y \mid Q, X) \tag{8}$$
where the probability distribution $Q$ is defined as:
$$Q(z_i) = \frac{\exp\left(s(z_i)/T\right)}{\sum_{z'} \exp\left(s(z')/T\right)} \tag{9}$$
Here $s(\cdot)$ is a score function of the pseudo-translation (the input of the softmax layer in the decoder) and $T$ is a temperature coefficient. Since the latent variable $Q$ can be viewed as a non-deterministic form of the pseudo-translation, the translation with the NDGD strategy is also guided by the pseudo-translation.
To be specific, as shown in Figure 1, the major difference between the DGD and NDGD strategies lies in the inputs of the decoder module (No. 2 dashed arrow): the DGD strategy directly uses the word embeddings of the generated pseudo-translation, whereas the NDGD strategy uses word embeddings weighted by the word probabilities of the pseudo-translation. The detailed architecture of the ReorderNAT model is introduced in Appendix A due to the space limit.
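The difference between the two decoder inputs can be sketched as follows. This is a minimal illustration under stated assumptions: the score matrix, the identity embedding table and the temperature value are made up for readability, and the function names are hypothetical rather than taken from a released implementation.

```python
import numpy as np

def softmax_with_temperature(scores, temperature):
    """Softmax over the last axis with scores divided by a temperature coefficient."""
    scaled = scores / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

def dgd_decoder_inputs(scores, embeddings):
    """DGD: embed the single most likely pseudo-translation word at each position."""
    return embeddings[np.argmax(scores, axis=-1)]

def ndgd_decoder_inputs(scores, embeddings, temperature):
    """NDGD: average the word embeddings, weighted by the word probabilities."""
    word_probs = softmax_with_temperature(scores, temperature)  # (length, vocab)
    return word_probs @ embeddings                               # (length, embed_dim)

# Toy reordering scores: 2 pseudo-translation positions, 3-word vocabulary.
scores = np.array([[4.0, 1.0, 0.0],
                   [0.5, 0.4, 3.0]])
toy_embeddings = np.eye(3)  # identity table, so each input row shows the word weights

hard = dgd_decoder_inputs(scores, toy_embeddings)
soft = ndgd_decoder_inputs(scores, toy_embeddings, temperature=1.0)  # arbitrary value
```

With a very small temperature the weighted input approaches the DGD input, while larger temperatures spread the weight over more candidate words, which is one way to read the role of the temperature coefficient.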
3.3 Decoding Search Space of ReorderNAT
In ReorderNAT, the decoding space of generating the pseudo-translation with the reordering module is much smaller than that of the whole translation in NAT, since the decoding vocabulary is limited to the words in the source sentence. Therefore, ReorderNAT can capture the reordering information more easily than the original NAT by explicitly modeling it with the pseudo-translation as internal supervision. Besides, the decoding search space of generating the final translation with the decoder module is also much smaller. The reason is that the search space of the $i$-th word $y_i$ of the final translation can be narrowed to the translation of $z_i$ to some extent, since $z_i$ is the $i$-th word in the pseudo-translation, which indicates the corresponding source information of $y_i$.
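A rough count may help make this argument concrete. It is only an illustrative upper bound under simplifying assumptions, not a result from the experiments: $|V|$ (target vocabulary size), $n$ (source length) and $m$ (pseudo-translation length) are introduced here for illustration, and every position is assumed to choose freely among its candidates.

```latex
\underbrace{|V|^{m}}_{\text{candidate outputs of an unconstrained NAT decoder}}
\qquad \text{vs.} \qquad
\underbrace{n^{m}}_{\text{candidate pseudo-translations built from source words}},
\qquad n \ll |V|
```

Under these assumptions each position of the reordering module chooses among at most $n$ source words rather than the full vocabulary, so the number of candidates shrinks by a factor of roughly $(|V|/n)^{m}$.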
In the training process, for each training sentence pair $(X, Y)$, we first generate its corresponding pseudo-translation $Z$: we use a word alignment tool to align each target word $y_i$ to a source word $x_{a_i}$ (we set the word alignment tool to link each target word to exactly one source word), and we set $z_i = x_{a_i}$. ReorderNAT is then optimized by maximizing a joint loss:
$$\mathcal{L} = \mathcal{L}_R + \mathcal{L}_T \tag{10}$$
where $\mathcal{L}_R$ and $\mathcal{L}_T$ indicate the reordering and translation losses respectively. Formally, for both the DGD and NDGD approaches, the reordering loss $\mathcal{L}_R$ is defined as:
$$\mathcal{L}_R = \sum_{(X, Z)} \log P(Z \mid X) \tag{11}$$
For the DGD approach, the translation loss is defined as an overall maximum likelihood of translating the pseudo-translation into the target sentence:
$$\mathcal{L}_T = \sum_{(X, Y, Z)} \log P(Y \mid Z, X) \tag{12}$$
For the NDGD approach, the translation loss is defined as an overall maximum likelihood of decoding the target sentence from the conditional probability of the pseudo-translation:
$$\mathcal{L}_T = \sum_{(X, Y)} \log P(Y \mid Q, X) \tag{13}$$
In particular, we use the trained model for the DGD approach to initialize the model for the NDGD approach, since if $Q$ is not well trained, $\mathcal{L}_T$ will converge very slowly.
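The construction of the pseudo-translation from word alignments, described at the start of this training procedure, can be sketched as below. The sketch assumes alignments in the `i-j` (source index, target index) format printed by fast_align; the helper name, the choice to keep the first link per target word, and the example sentence pair are hypothetical and only illustrate the idea.

```python
def build_pseudo_translation(source_tokens, target_length, alignment_line):
    """Return the pseudo-translation: for each target position, its aligned source word."""
    aligned_source = {}
    for pair in alignment_line.split():
        src_idx, tgt_idx = (int(k) for k in pair.split("-"))
        # Assumption: keep only the first link if a target word has several.
        aligned_source.setdefault(tgt_idx, src_idx)
    missing = [j for j in range(target_length) if j not in aligned_source]
    if missing:
        # The text assumes every target word is linked to exactly one source word.
        raise ValueError(f"target positions without a source link: {missing}")
    return [source_tokens[aligned_source[j]] for j in range(target_length)]

# Made-up German-English example with a hand-written alignment.
source = ["ich", "habe", "das", "Buch", "gelesen"]
target = ["i", "have", "read", "the", "book"]
pseudo = build_pseudo_translation(source, len(target), "0-0 1-1 4-2 2-3 3-4")
# pseudo == ["ich", "habe", "gelesen", "das", "Buch"]
```

The result keeps the source words but follows the target word order, which is the form of supervision the reordering module is trained on.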
The main experiments are conducted on three widely-used machine translation tasks: WMT14 En-De (M pairs), WMT16 En-Ro (k pairs) and IWSLT16 En-De (k pairs). For WMT14 En-De task, we take newstest-2013 and newstest-2014 as validation and test sets respectively. For WMT16 En-Ro task, we employ newsdev-2016 and newstest-2016 as validation and test sets respectively. For IWSLT16 En-De task, we use test2013 for validation. We also conduct our experiments on Chinese-English translation which differs more in language structure. The training set consists of M sentence pairs extracted from the LDC corpora. We choose NIST 2002 (MT02) dataset as our validation set, and NIST 2003 (MT03), 2004 (MT04), 2005 (MT05), 2006 (MT06) and 2008 (MT08) datasets as our test sets.
4.2 Experimental Settings
We use the fast_align tool (https://github.com/clab/fast_align) to generate the pseudo-translation in our experiments. We follow most of the model hyperparameter settings in (gu2017non; lee2018deterministic; wei2019imitation) for fair comparison. For IWSLT16 En-De, we use a -layer Transformer model (, , , ) and anneal the learning rate linearly (from to ) as in (lee2018deterministic). For WMT14 En-De, WMT16 En-Ro and Chinese-English translation, we use a -layer Transformer model (, , , ) and adopt the warm-up learning rate schedule (vaswani2017attention) with . For the GRU reordering module, we set it to have the same hidden size as the Transformer model in each dataset. We employ label smoothing of value and utilize sequence-level knowledge distillation (kim2016sequence) for all datasets.
In the experiments, we compare ReorderNAT (NAT) and ReorderNAT (AT) which utilize an NAT reordering module and an AT reordering module respectively with several baselines.
We select three models as our autoregressive baselines: (1) Transformer (vaswani2017attention), the hyperparameters are described in experimental settings. (2) Transformer, a lighter version of Transformer, of which decoder layer number is . (3) Transformer, which replaces the decoder of Transformer with GRU (cho2014learning).
We also include several typical NAT models as our baselines: (1) NAT-FT (gu2017non), which copies source inputs using fertilities as the decoder inputs and predicts the target words in parallel. (2) NAT-FT+NPD (gu2017non), an NAT-FT model which adopts noisy parallel decoding (NPD) during inference. We set the sample size of NPD to and . (3) NAT-IR (lee2018deterministic), which iteratively refines the translation for multiple times. We set the number of iterations to and . (4) NAT-REG (wang2019non), an NAT model using repeated translation and similarity regularizations. (5) NAT-FS (shao2019retrieving), which serializes the top decoder layer and generates the target sentence autoregressively. (6) imitate-NAT (wei2019imitation), which forces the NAT model to imitate an AT model during training. (7) imitate-NAT+LPD (wei2019imitation), an imitate-NAT model which adopts length parallel decoding.
4.4 Effect of Temperature Coefficient
|NAT-IR (iter = 1)||NAT-TM||13.91||16.77||24.45||25.73||22.20||8.90|
|NAT-IR (iter = 10)||NAT-TM||21.61||25.48||29.32||30.19||27.11||1.50|
|ReorderNAT (NAT) + LPD||NAT-TM||24.74||29.11||31.16||31.44||27.40|
The temperature coefficient $T$ controls the smoothness of the distribution $Q$ (see Eq. 9). As shown in Figure 2, we find that $T$ affects the BLEU scores on the IWSLT16 validation set to some extent. While decreases BLEU scores, improves translation quality significantly and consistently. However, increasing further to or results in worse translation quality compared to after training k steps. Hence, we set for the NDGD strategy in our experiments.
4.5 Effect of Guiding Decoding Strategy
We also investigate the effect of the two proposed guiding decoding strategies, DGD and NDGD, on the IWSLT16 validation set. In Table 1, we can find that the NDGD strategy performs better than the DGD strategy for both ReorderNAT (AT) and ReorderNAT (NAT), since the NDGD strategy can effectively reduce the information loss of the DGD strategy. However, we also find that the NDGD strategy does not bring much improvement for our best model with the AT reordering module. The reason is perhaps that the pseudo-translation generated by the AT reordering module is good enough, so it does not introduce much information loss into the whole translation. We use NDGD as the default decoding strategy in the following experiments.
4.6 Overall Results
We compare ReorderNAT (NAT) and ReorderNAT (AT), which utilize an NAT reordering module and an AT reordering module respectively, with all baseline models. All the results are shown in Table 2. From the table, we can find that:
(1) ReorderNAT (AT) achieves state-of-the-art performance on most of the benchmark datasets, which is even close to the AT model with smaller than BLEU gap ( vs. in WMT14’s De→En task, vs. in WMT16’s Ro→En task, vs. in IWSLT’s En→De task). It is also worth mentioning that although ReorderNAT utilizes a small AT model to better capture reordering information, it still maintains low translation latency (about speedup of ReorderNAT (NAT) and speedup of ReorderNAT (AT)). Compared to Transformer and Transformer, ReorderNAT (AT) uses a much smaller vocabulary in the AT reordering module, which is limited to the words in the source sentence and makes it faster.
(2) ReorderNAT (NAT) and ReorderNAT (NAT)+LPD also gain significant improvements over most existing NAT models, and even outperform the state-of-the-art NAT model imitate-NAT on WMT14 by explicitly modeling the reordering information. This verifies that the reordering information explicitly modeled by ReorderNAT can effectively guide its decoding direction.
(3) A small AT model with latency close to that of large NAT models can perform much better in modeling reordering information. On all benchmark datasets, ReorderNAT (AT) with a small AT GRU reordering module achieves much better translation quality than ReorderNAT with a large NAT model (- BLEU scores). Moreover, we find that the AT models Transformer and Transformer, with a one-layer AT Transformer or GRU for decoding, also outperform most existing NAT models and even outperform the state-of-the-art imitate-NAT model on WMT14, while maintaining acceptable latency ( and speedup respectively). The reason is that a major source of the performance degradation of NAT models compared to AT models is the difficulty of modeling the sentence structure difference between the source and target languages, i.e., reordering information, which is neglected by most existing NAT models but can be modeled well by the small AT decoder.
4.7 Results on Chinese-English Translation
To show the effectiveness of modeling reordering information in NAT, we compare ReorderNAT with baselines on Chinese-English translation, since the language structures of Chinese and English differ more than those of German and English (En-De). From Table 3, we can find that in Chinese-English translation, ReorderNAT (AT) achieves much larger improvements (6-7 BLEU scores) compared to ReorderNAT (NAT) and imitate-NAT. The reason is that the problem of the explosive decoding search space is more severe in Chinese-English translation, and ReorderNAT can effectively alleviate it.
4.8 Translation Quality over Sentence Lengths
Figure 3 shows the BLEU scores of translations generated by AT Transformer model (Transformer), our Reorder-NAT model without reordering module (NAT), our Reorder-NAT model with AT reordering module (Reorder-NAT (AT)) and with NAT reordering module (Reorder-NAT (NAT)) on the IWSLT16 validation set with respect to input sentence lengths. From the figure, we can observe that:
(1) The ReorderNAT (AT) model achieves significant improvement compared to the NAT model, and nearly comparable performance to AT Transformer model for all lengths. It verifies that the reordering information modeled by ReorderNAT could effectively reduce the decoding space and improve the translation quality of the model.
(2) Our ReorderNAT model achieves much better translation performance than the NAT model for sentences longer than words. The reason is that the size of the global hypothesis space for NAT’s decoding is correlated to the sentence length and therefore the problem of large decoding space is more serious for longer input sentences.