Autoregressive sequence models achieve great success in domains like machine translation and have been deployed in real applications Vaswani et al. (2017); Wu et al. (2016); Cho et al. (2014); Bahdanau et al. (2014); Gehring et al. (2017). However, these models suffer from high inference latency Vaswani et al. (2017); Wu et al. (2016), which is sometimes unaffordable for real-time industrial applications. This is mainly attributed to the autoregressive factorization of the models: considering a general conditional sequence generation framework, given a context sentence $x = (x_1, \ldots, x_T)$ and a target sentence $y = (y_1, \ldots, y_{T'})$, autoregressive sequence models are based on a chain of conditional probabilities with a left-to-right causal structure:
$$p(y \mid x) = \prod_{i=1}^{T'} p(y_i \mid y_{<i}, x), \tag{1}$$
where $y_{<i}$ represents the tokens before the $i$-th token of target $y$. See Figure 1(a) for the illustration of a state-of-the-art autoregressive sequence model, Transformer Vaswani et al. (2017). The autoregressive factorization makes the inference process hard to parallelize, as the results are generated token by token sequentially.
Recently, non-autoregressive sequence models Gu et al. (2017); Li et al. (2019); Wang et al. (2019); Lee et al. (2018) were proposed to alleviate the inference latency by removing the sequential dependencies within the target sentence. Those models also use the general encoder-decoder framework: the encoder takes the context sentence $x$ as input to generate contextual embedding and predict the target length $T'$, and the decoder uses a well-designed deterministic or stochastic input $z$ and the contextual embedding to predict each target token:
$$p(y \mid x) = p(T' \mid x) \cdot \prod_{i=1}^{T'} p(y_i \mid z, x). \tag{2}$$
The non-autoregressive sequence models take full advantage of parallelism and significantly improve the inference speed. However, they usually cannot get results as good as their autoregressive counterparts. As shown in Table 1, on the machine translation task, compared to AutoRegressive Translation (ART) models, Non-AutoRegressive Translation (NART) models suffer from a severe decoding inconsistency problem. In non-autoregressive sequence models, each token in the target sentence is generated independently. Thus the decoding consistency (e.g., word co-occurrence) cannot be guaranteed on the target side. The primary phenomenon that can be observed is the multimodality problem: the non-autoregressive models cannot properly model the highly multimodal distribution of target sequences Gu et al. (2017). For example, the English sentence “Thank you.” can have many correct German translations like “Danke.”, “Danke schön.”, or “Vielen Dank.”. In practice, this leads to inconsistent outputs such as “Danke Dank.” or “Vielen schön.”.
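The multimodality problem can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, a target distribution with two equally likely two-token translations; the function names are hypothetical. It fits the per-position marginals (all that a fully independent decoder can represent) and measures how much probability the factorized model assigns to mixed, inconsistent outputs.

```python
from itertools import product

# Toy target distribution for "Thank you." with two equally likely
# translations (an assumption made for illustration only).
modes = {("Danke", "schön"): 0.5, ("Vielen", "Dank"): 0.5}


def position_marginals(dist, length):
    """Per-position token marginals, which is all an independent decoder can fit."""
    marginals = [dict() for _ in range(length)]
    for sentence, prob in dist.items():
        for i, token in enumerate(sentence):
            marginals[i][token] = marginals[i].get(token, 0.0) + prob
    return marginals


def independent_distribution(marginals):
    """Sequence distribution obtained by multiplying the per-position marginals."""
    joint = {}
    for combo in product(*(m.items() for m in marginals)):
        tokens = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        joint[tokens] = prob
    return joint


marginals = position_marginals(modes, 2)
factorized = independent_distribution(marginals)

# Probability mass placed on sequences that are not valid translations,
# such as ("Danke", "Dank") or ("Vielen", "schön").
inconsistent_mass = sum(p for s, p in factorized.items() if s not in modes)
print(factorized)
print("mass on inconsistent outputs:", inconsistent_mass)
```

In this toy setting half of the probability mass ends up on sequences that mix the two modes, which is the kind of inconsistency that modeling adjacent-token dependencies is meant to remove.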
To tackle this problem, in this paper, we propose to incorporate a structured inference module in the non-autoregressive decoder to directly model the multimodal distribution of target sequences. Specifically, we regard sequence generation (e.g., machine translation) as a sequence labeling problem and propose to use linear-chain Conditional Random Fields (CRF) Lafferty et al. (2001) to model richer structural dependencies. By modeling the co-occurrence relationship between adjacent words, the CRF-based structured inference module can significantly improve decoding consistency on the target side. Different from the probability product form of Equation 2, the probability of the target sentence is globally normalized:
$$p(y \mid x) = p(T' \mid x) \cdot \operatorname{softmax}\left(\sum_{i=2}^{T'} \theta_{i-1,i}(y_{i-1}, y_i) \,\Big|\, x\right), \tag{3}$$
where $\theta_{i-1,i}$ is the pairwise potential for $y_{i-1}$ and $y_i$. Such a probability form could better model the multiple modes in target translations.
However, the label size (vocabulary size) used in typical sequence models is very large (e.g., 32k), which makes traditional CRFs intractable. Therefore, we design two effective approximation methods for the CRF: low-rank approximation and beam approximation. Moreover, to leverage the rich contextual information from the hidden states of the non-autoregressive decoder and to improve the expressive power of the structured inference module, we further propose a dynamic transition technique to model positional contexts in the CRF.
We evaluate the proposed end-to-end model on three widely used machine translation tasks: WMT14 English-to-German/German-to-English (En-De/De-En) tasks and IWSLT14 German-to-English task. Experimental results show that while losing little speed, our NART-CRF model could achieve significantly better translation performance than previous NART models on several tasks. In particular, for the WMT14 En-De and De-En tasks, our model obtains BLEU scores of 26.80 and 30.04, respectively, which largely outperform previous non-autoregressive baselines and are even comparable to the autoregressive counterparts.
2 Related Work
2.1 Non-autoregressive neural machine translation
Non-AutoRegressive neural machine Translation (NART) models aim to speed up the inference process for real-time machine translation Gu et al. (2017), but their performance is considerably worse than that of their ART counterparts. Most previous works attributed the poor performance to the unavoidable conditional independence when predicting each target token, and proposed various methods to solve this issue.
Some methods alleviated the multimodality phenomenon in vanilla NART training: Gu et al. (2017) introduced the sentence-level knowledge distillation Hinton et al. (2015); Kim and Rush (2016) to reduce the multimodality in the raw data; Wang et al. (2019) designed two auxiliary regularization terms in training; Li et al. (2019) proposed to leverage hints from the ART models to guide NART’s attention and hidden states. Our approach is orthogonal to these training techniques. Perhaps the most similar one to our approach is Libovický and Helcl (2018), which introduced the Connectionist Temporal Classification (CTC) loss in NART training. Both CTC and CRF can reduce the multimodality effect in training. However, CTC can only model a unimodal target distribution, while CRF can model a multimodal target distribution effectively.
Other methods attempted to model the multimodal target distribution by a well-designed decoder input $z$: Gu et al. (2017) introduced the concept of fertilities from statistical machine translation models Brown et al. (1993) into the NART models; Lee et al. (2018) used an iterative refinement process in the decoding process of their proposed model; Kaiser et al. (2018) and Roy et al. (2018) embedded an autoregressive sub-module that consists of discrete latent variables into their models. In comparison, our NART-CRF models use a simple design of decoder input $z$, but model a richer structural dependency for the decoder output.
2.2 Structured learning for machine translation
The idea of recasting the machine translation problem as a sequence labeling task can be traced back to Lavergne et al. (2011), where a CRF-based method was proposed for Statistical Machine Translation (SMT). They simplify the CRF training by (1) limiting the possible “labels” to those that are observed during training and (2) enforcing sparsity in the model. In comparison, our proposed low-rank approximation and beam approximation are more suitable for neural network models.
Structured prediction provides a declarative language for specifying prior knowledge and structural relationships in the data Kim et al. (2018). Our approach is also related to other works on structured neural sequence modeling. Tran et al. (2016) neuralized an unsupervised Hidden Markov Model (HMM). Kim et al. (2017) proposed to incorporate richer structural distribution for the attention mechanism. They both focus on the internal structural dependencies in their models, while in this paper, we directly model richer structural dependencies for the decoder output.
Finally, our work is also related to previous work on combining neural networks with CRF for sequence labeling. Collobert et al. (2011) proposed a unified neural network architecture for sequence labeling. Andor et al. (2016) proposed a globally normalized transition-based neural network on a task-specific transition system.
3 Fast Structured Decoding for Sequence Models
In this section, we describe the proposed model in the context of machine translation and use “source” and “context” interchangeably. The proposed NART-CRF model formulates non-autoregressive translation as a sequence labeling problem and uses Conditional Random Fields (CRF) to solve it. We first briefly introduce the Transformer-based NART architecture and then describe the CRF-based structured inference module. Figure 1(b) illustrates our NART-CRF model structure.
3.1 Transformer-based Non-autoregressive Translation Model
The model design follows the Transformer architecture Vaswani et al. (2017) with an additional positional attention layer proposed by Gu et al. (2017). We refer the readers to Vaswani et al. (2017); Gu et al. (2017); Vaswani et al. (2018) for more details about the model.
Non-autoregressive machine translation can also be formulated in an encoder-decoder framework Cho et al. (2014). As in the ART models, the encoder of NART models takes the embeddings of source tokens as inputs and generates the context representation. However, as shown in Equation 2, the NART decoder does not use the autoregressive factorization, but decodes each target token independently given the target length $T'$ and decoder input $z$.
ART and NART Transformer models share two types of multi-head attentions: multi-head self-attention and multi-head encoder-to-decoder attention. The NART model additionally uses multi-head positional attention to model local word orders within the sentence Gu et al. (2017). A general attention mechanism can be formulated as the weighted sum of the value vectors $V$ using query vectors $Q$ and key vectors $K$:
$$\mathrm{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^T}{\sqrt{d_{model}}}\right) \cdot V,$$
where $d_{model}$ represents the dimension of hidden representations. For self-attention, $Q$, $K$ and $V$ are hidden representations of the previous layer. For encoder-to-decoder attention, $Q$ refers to hidden representations of the previous layer, whereas $K$ and $V$ are context vectors from the encoder. For positional attention, positional embedding is used as $Q$ and $K$, and hidden representations of the previous layer are used as $V$.
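The attention variants described above differ only in which tensors play the roles of queries, keys and values. The following is a minimal single-head NumPy sketch of scaled dot-product attention, without the learned projections or multiple heads of a real Transformer; the random arrays stand in for hidden states, encoder outputs and positional embeddings and are assumptions for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def attention(Q, K, V):
    """Weighted sum of the rows of V, with weights softmax(Q K^T / sqrt(d_model))."""
    d_model = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_model), axis=-1)
    return weights @ V, weights


rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 7, 5, 8
H = rng.normal(size=(tgt_len, d_model))    # previous-layer decoder states
C = rng.normal(size=(src_len, d_model))    # encoder context vectors
P = rng.normal(size=(tgt_len, d_model))    # positional embeddings (stand-ins)

self_out, self_w = attention(H, H, H)      # self-attention: Q = K = V = H
cross_out, cross_w = attention(H, C, C)    # encoder-to-decoder: Q = H, K = V = C
pos_out, pos_w = attention(P, P, H)        # positional: Q = K = P, V = H
```

Each output has one row per target position; only the shape of the weight matrix changes between the three cases.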
3.2 Structured inference module
In this paper, we propose to incorporate a structured inference module in the decoder part to directly model multimodality in NART models. Figure 2 shows how a CRF-based structured inference module works. In principle, this module can be any structured prediction model such as Conditional Random Fields (CRF) Lafferty et al. (2001) or Maximum Entropy Markov Model (MEMM) McCallum et al. (2000). Here we focus on linear-chain CRF, which is the most widely applied model in the sequence labeling literature. In the context of machine translation, we use “label” and “token” (vocabulary) interchangeably for the decoder output.
Conditional random fields
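CRF is a framework for building probabilistic models to segment and label sequence data. Given the sequence data $x = (x_1, \ldots, x_n)$ and the corresponding label sequence $y = (y_1, \ldots, y_n)$, the likelihood of $y$ given $x$ is defined as:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{i=1}^{n} s(y_i, x, i) + \sum_{i=2}^{n} t(y_{i-1}, y_i, x, i)\right), \tag{6}$$
where $Z(x)$ is the normalizing factor, $s(y_i, x, i)$ is the label score of $y_i$ at the position $i$, and $t(y_{i-1}, y_i, x, i)$ is the transition score from $y_{i-1}$ to $y_i$. The CRF module can be end-to-end jointly trained with neural networks using the negative log-likelihood loss $\mathcal{L}_{\mathrm{CRF}} = -\log P(y \mid x)$. Note that when omitting the transition score $t(y_{i-1}, y_i, x, i)$, Equation 6 is the same as vanilla non-autoregressive models (Equation 2).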
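To make the linear-chain CRF likelihood concrete, the sketch below computes the negative log-likelihood with the forward recursion and checks the normalizing factor against brute-force enumeration on a tiny label set. It assumes a single position-independent transition matrix `M` and random scores; all names are illustrative and not taken from the authors' implementation.

```python
import itertools

import numpy as np


def logsumexp(a, axis):
    """Stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)


def path_score(s, M, y):
    """Unnormalized log-score: label scores plus transition scores along y."""
    score = sum(s[i, y[i]] for i in range(len(y)))
    score += sum(M[y[i - 1], y[i]] for i in range(1, len(y)))
    return score


def log_partition(s, M):
    """log Z(x) by the forward recursion, O(n * |V|^2) time."""
    alpha = s[0]                                   # shape (|V|,)
    for i in range(1, s.shape[0]):
        # Sum over the previous label a of alpha[a] + M[a, b], then add s[i, b].
        alpha = logsumexp(alpha[:, None] + M, axis=0) + s[i]
    return logsumexp(alpha, axis=0)


def crf_nll(s, M, y):
    """Negative log-likelihood -log P(y | x) of one label sequence."""
    return log_partition(s, M) - path_score(s, M, y)


rng = np.random.default_rng(0)
n, vocab = 4, 3                                    # tiny sizes so enumeration is cheap
s = rng.normal(size=(n, vocab))                    # label scores s(y_i, x, i)
M = rng.normal(size=(vocab, vocab))                # transition scores t(y_{i-1}, y_i)

all_scores = np.array(
    [path_score(s, M, p) for p in itertools.product(range(vocab), repeat=n)]
)
brute_log_z = logsumexp(all_scores, axis=0)        # sums over all |V|^n paths
print(float(log_partition(s, M)), float(brute_log_z))
print("NLL of path [0, 1, 2, 0]:", float(crf_nll(s, M, [0, 1, 2, 0])))
```

The forward recursion visits $n \cdot |V|^2$ label pairs instead of enumerating $|V|^n$ paths, which is tractable for a small label set but not for a 32k vocabulary.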
Incorporating CRF into NART model
For the label score, a linear transformation of the NART decoder’s output $h_i$: $s(y_i, x, i) = (h_i W + b)_{y_i}$ works well, where $W \in \mathbb{R}^{d_{model} \times |V|}$ and $b \in \mathbb{R}^{|V|}$ are the weights and bias of the linear transformation. However, for the transition score, naive methods require a $|V| \times |V|$ matrix $M$ to model $t(y_{i-1}, y_i, x, i) = M_{y_{i-1}, y_i}$. Also, according to the widely-used forward-backward algorithm Lafferty et al. (2001), the likelihood computation and decoding process requires $O(n|V|^2)$ complexity through dynamic programming Lafferty et al. (2001); Sutton et al. (2012); Collins (2013), which is infeasible for practical usage (e.g., a 32k vocabulary).
Low-rank approximation for transition matrix
A solution for the above issue is to use a low-rank matrix to approximate the full-rank transition matrix. In particular, we introduce two transition embeddings $E_1, E_2 \in \mathbb{R}^{|V| \times d_t}$ to approximate the transition matrix:
$$M = E_1 E_2^T,$$
where $d_t$ is the dimension of the transition embedding.
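A rough parameter count shows why the low-rank factorization matters. The sketch below uses the 32k vocabulary and the transition embedding size of 32 quoted in this paper for the arithmetic, and a much smaller random example (an assumption, to keep it fast) to show that individual transition scores and sub-blocks can be read off the two embeddings without materializing the full matrix.

```python
import numpy as np

# Sizes quoted in the paper: 32k word-piece vocabulary, transition embedding size 32.
vocab_size, d_t = 32_000, 32
full_params = vocab_size * vocab_size          # entries in a dense |V| x |V| matrix
low_rank_params = 2 * vocab_size * d_t         # entries in E1 and E2 together
print(full_params, low_rank_params, full_params // low_rank_params)

rng = np.random.default_rng(0)
small_vocab = 200                              # reduced size for the demo only
E1 = rng.normal(size=(small_vocab, d_t))
E2 = rng.normal(size=(small_vocab, d_t))


def transition_score(prev_token, next_token):
    """M[prev, next] = E1[prev] . E2[next], computed without building M."""
    return E1[prev_token] @ E2[next_token]


def transition_block(prev_tokens, next_tokens):
    """Sub-matrix of M restricted to the selected rows and columns."""
    return E1[prev_tokens] @ E2[next_tokens].T


M = E1 @ E2.T                                  # only feasible here because small_vocab is tiny
block = transition_block([1, 5, 9], [0, 2])
```

With these sizes the dense matrix would hold about a billion entries, while the two embeddings together hold about two million.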
Beam approximation for CRF
Low-rank approximation allows us to calculate the unnormalized term in Equation 6 efficiently. However, due to numerical accuracy issues (the transition is calculated in the log space; see https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/crf for the detailed implementation), both the normalizing factor $Z(x)$ and the decoding process require the full transition matrix, which is still unaffordable. Therefore, we further propose beam approximation to make CRF tractable for NART models.
In particular, for each position $i$, we heuristically truncate all $|V|$ candidates to a pre-defined beam size $k$. We keep the $k$ candidates with the highest label scores $s(\cdot, x, i)$ for each position $i$, and accordingly crop the transition matrix between each pair of positions $i-1$ and $i$. The forward-backward algorithm is then applied on the truncated beam to get either the normalizing factor or the decoding result. In this way, their time complexities are reduced from $O(n|V|^2)$ to $O(nk^2)$ (e.g., for the normalizing factor, instead of a sum over $|V|^n$ possible paths, we sum over $k^n$ paths in the beam). Besides, when calculating the normalizing factor for a training pair $(x, y)$, we explicitly include each $y_i$ in the beam to ensure that the approximated normalizing factor $Z(x)$ is larger than the unnormalized path score of $y$.
The intuition behind the beam approximation is that, for the normalizing factor, the sum of path scores in such a beam (the approximated $Z(x)$) accounts for most of the actual value of $Z(x)$, while it is also reasonable to assume that the beam includes each label of the best path $y^*$.
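The beam approximation can be sketched in a few functions: select the top-scoring tokens per position, crop the low-rank transition matrix to those tokens, and run the forward recursion or Viterbi on the cropped problem. The code below is a minimal NumPy sketch under stated assumptions (static low-rank transitions, random scores, toy sizes, and the gold token overwriting the lowest-scoring beam slot when it is missing); it is not the authors' implementation.

```python
import numpy as np


def build_beam(s, k, gold=None):
    """Keep the k highest-scoring token ids per position; returns shape (n, k)."""
    beam = np.argsort(-s, axis=1)[:, :k]
    if gold is not None:
        for i, token in enumerate(gold):
            if token not in beam[i]:
                beam[i, -1] = token            # overwrite the lowest-scoring slot
    return beam


def cropped_transitions(E1, E2, beam, i):
    """(k, k) block of M = E1 E2^T between beam slots at positions i-1 and i."""
    return E1[beam[i - 1]] @ E2[beam[i]].T


def beam_log_partition(s, E1, E2, beam):
    """Approximate log Z(x): forward recursion over the beam, O(n k^2)."""
    alpha = s[0, beam[0]]
    for i in range(1, s.shape[0]):
        scores = alpha[:, None] + cropped_transitions(E1, E2, beam, i)
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + s[i, beam[i]]
    m = alpha.max()
    return float(m + np.log(np.exp(alpha - m).sum()))


def beam_viterbi(s, E1, E2, beam):
    """Highest-scoring path inside the beam; returns (token ids, score)."""
    score = s[0, beam[0]]
    backptr = []
    for i in range(1, s.shape[0]):
        cand = score[:, None] + cropped_transitions(E1, E2, beam, i)
        backptr.append(cand.argmax(axis=0))    # best previous slot for each slot
        score = cand.max(axis=0) + s[i, beam[i]]
    slot = int(score.argmax())
    best = float(score[slot])
    slots = [slot]
    for bp in reversed(backptr):
        slot = int(bp[slot])
        slots.append(slot)
    slots.reverse()
    return [int(beam[i, j]) for i, j in enumerate(slots)], best


def path_score(s, E1, E2, y):
    """Unnormalized log-score of a full label sequence y."""
    total = sum(s[i, t] for i, t in enumerate(y))
    return float(total + sum(E1[a] @ E2[b] for a, b in zip(y, y[1:])))


rng = np.random.default_rng(0)
n, vocab, d_t, k = 6, 50, 4, 8                 # toy sizes, not the paper's settings
s = rng.normal(size=(n, vocab))
E1 = rng.normal(size=(vocab, d_t))
E2 = rng.normal(size=(vocab, d_t))
gold = [3, 14, 15, 9, 26, 5]                   # hypothetical reference token ids

train_beam = build_beam(s, k, gold=gold)       # training: gold tokens forced in
log_z = beam_log_partition(s, E1, E2, train_beam)
decoded, decoded_score = beam_viterbi(s, E1, E2, build_beam(s, k))
```

Because every gold token is kept in the training beam, the gold path is one of the summed paths, so the approximated log-normalizer cannot fall below the gold path score and the resulting loss stays non-negative.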
Dynamic CRF transition
In the traditional definition, the transition matrix $M$ is fixed for each position $i$. A dynamic transition matrix that depends on the positional context could improve the representation power of CRF. Here we use a simple but effective way to get a dynamic transition matrix by inserting a dynamic matrix between the product of transition embeddings $E_1$ and $E_2$:
$$M^i_{\mathrm{dynamic}} = f([h_{i-1}, h_i]), \qquad M^i = E_1 M^i_{\mathrm{dynamic}} E_2^T,$$
where $[h_{i-1}, h_i]$ is the concatenation of two adjacent decoder outputs, and $f: \mathbb{R}^{2 d_{model}} \rightarrow \mathbb{R}^{d_t \times d_t}$ is a two-layer Feed-Forward Network (FFN).
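The dynamic transition can be sketched as a small network that maps two adjacent decoder states to a $d_t \times d_t$ matrix placed between the two transition embeddings. In the sketch below the hidden width, the ReLU activation, the random weights and all names are assumptions for illustration; the text above only specifies a two-layer FFN over the concatenated states.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, d_t, hidden = 10, 16, 4, 32    # toy sizes (assumptions)

E1 = rng.normal(size=(vocab, d_t))
E2 = rng.normal(size=(vocab, d_t))

# Hypothetical two-layer FFN parameters; the ReLU hidden layer is an assumption.
W1 = rng.normal(scale=0.1, size=(2 * d_model, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, d_t * d_t))
b2 = np.zeros(d_t * d_t)


def dynamic_matrix(h_prev, h_cur):
    """Map the concatenation [h_{i-1}, h_i] to a d_t x d_t matrix."""
    x = np.concatenate([h_prev, h_cur])
    hid = np.maximum(x @ W1 + b1, 0.0)
    return (hid @ W2 + b2).reshape(d_t, d_t)


def dynamic_transition_block(prev_tokens, next_tokens, h_prev, h_cur):
    """E1[prev] @ M_dynamic @ E2[next]^T for the tokens kept in the beam."""
    return E1[prev_tokens] @ dynamic_matrix(h_prev, h_cur) @ E2[next_tokens].T


h = rng.normal(size=(2, d_model))              # two adjacent decoder outputs
prev_tokens = np.array([1, 4, 7])
next_tokens = np.array([0, 2, 9])
block = dynamic_transition_block(prev_tokens, next_tokens, h[0], h[1])
```

Only the small $d_t \times d_t$ matrix depends on the position, so the per-position cost stays low when combined with a cropped beam.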
Latency of CRF decoding
Unlike vanilla non-autoregressive decoding, the CRF decoding can no longer be fully parallelized. However, due to our beam approximation, the computation of linear-chain CRF is in theory still much faster than autoregressive decoding. As shown in Table 2, in practice, the overhead is only 8 to 14 ms.
Exact Decoding for Machine Translation
Besides fast decoding, another promise of this approach is that it provides an exact decoding framework for machine translation, while the de facto standard beam search algorithm for ART models cannot provide such a guarantee. The CRF-based structured inference module can solve the label bias problem Lafferty et al. (2001), while locally normalized models (e.g., those decoded with beam search) often have a very weak ability to revise earlier decisions Andor et al. (2016).
Joint training with vanilla non-autoregressive loss
In practice, we find that it is beneficial to include the original NART loss to help the training of the NART-CRF model. Therefore, our final training loss $\mathcal{L}$ is a weighted sum of the CRF negative log-likelihood loss (Equation 3) and the Non-AutoRegressive (NAR) negative log-likelihood loss (Equation 2):
$$\mathcal{L} = \mathcal{L}_{\mathrm{CRF}} + \lambda \mathcal{L}_{\mathrm{NAR}},$$
where $\lambda$ is the hyperparameter controlling the weight of different loss terms.
4 Experiments
4.1 Experimental settings
We use several widely adopted benchmark tasks to evaluate the effectiveness of our proposed models: IWSLT14 (https://wit3.fbk.eu/) German-to-English translation (IWSLT14 De-En) and WMT14 (http://statmt.org/wmt14/translation-task.html) English-to-German/German-to-English translation (WMT14 En-De/De-En). For the WMT14 dataset, we use Newstest2014 as test data and Newstest2013 as validation data. For each dataset, we split word tokens into subword units following Wu et al. (2016), forming a 32k word-piece vocabulary shared by source and target languages.
For the WMT14 dataset, we use the default network architecture of the original base Transformer Vaswani et al. (2017), which consists of a 6-layer encoder and a 6-layer decoder. The size of hidden states is set to 512. Considering that IWSLT14 is a relatively small dataset compared to WMT14, we use a smaller architecture for IWSLT14, which consists of a 5-layer encoder and a 5-layer decoder. The size of hidden states is set to 256, and the number of heads is set to 4. For all datasets, we set the size of transition embedding $d_t$ to 32 and the beam size $k$ of beam approximation to 64. The hyperparameter $\lambda$ is set to balance the scale of the two loss components.
Following previous works Gu et al. (2017), we use sequence-level knowledge distillation Kim and Rush (2016) during training. Specifically, we train our models on translations produced by a Transformer teacher model. It has been shown to be an effective way to alleviate the multimodality problem in training Gu et al. (2017).
Since the CRF-based structured inference module is not parallelizable in training, we initialize our NART-CRF models by warming up from their vanilla NART counterparts to speed up training. We use the Adam optimizer Kingma and Ba (2014) and employ label smoothing Szegedy et al. (2016) in all experiments. Models for WMT14/IWSLT14 tasks are trained on 4/1 NVIDIA P40 GPUs, respectively. We implement our models based on the open-sourced tensor2tensor library Vaswani et al. (2018).
During training, the target sentence is given, so we do not need to predict the target length $T'$. However, during inference, we have to predict the length of the target sentence for each source sentence. Specifically, in this paper, we use the simplest form of target length $T'$, which is a linear function of source length $T$ defined as $T' = T + C$, where $C$ is a constant bias term that can be set according to the overall length statistics of the training data. We also try different target lengths ranging from $(T + C) - B$ to $(T + C) + B$ and obtain multiple translation results with different lengths, where $B$ is the half-width, and then use the ART Transformer as the teacher model to select the best translation from multiple candidate translations during inference.
We set the constant bias term $C$ to 2, -2, 2 for WMT14 En-De, De-En and IWSLT14 De-En datasets respectively, according to the average lengths of different languages in the training sets. We set $B$ to 4/9 and get 9/19 candidate translations for each sentence. For each dataset, we evaluate our model performance with the BLEU score Papineni et al. (2002). Following previous works Gu et al. (2017); Lee et al. (2018); Guo et al. (2018); Wang et al. (2019), we evaluate the average per-sentence decoding latency on WMT14 En-De test sets with batch size 1 with a single NVIDIA Tesla P100 GPU for the Transformer model and the NART models to measure the speedup of our models. The latencies are obtained by taking the average of five runs.
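The length-candidate scheme can be written down in a few lines. The sketch below assumes the target length is the source length plus the constant bias, with candidates taken within the half-width on either side; the function name is hypothetical, and dropping non-positive lengths is an added assumption for very short sources.

```python
def candidate_lengths(source_length, bias, half_width):
    """Target-length candidates centred on source_length + bias."""
    center = source_length + bias
    # Non-positive lengths are dropped (an assumption, relevant only for tiny inputs).
    return [t for t in range(center - half_width, center + half_width + 1) if t > 0]


# A 20-token source with bias 2 and half-width 4 gives 9 candidates (18..26).
narrow = candidate_lengths(20, bias=2, half_width=4)
# The same source with bias -2 and half-width 9 gives 19 candidates (9..27).
wide = candidate_lengths(20, bias=-2, half_width=9)
print(len(narrow), narrow)
print(len(wide), wide)
```

Each candidate length yields one decoded translation, and the autoregressive teacher then rescores the candidates to pick the final output.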
4.3 Results and analysis
We evaluate three models described in Section 3: the Non-AutoRegressive Transformer baseline (NART), NART with static-transition Conditional Random Fields (NART-CRF), and NART with Dynamic-transition Conditional Random Fields (NART-DCRF). We follow common practice in previous works to make a fair comparison. Specifically, we use tokenized case-sensitive BLEU for WMT datasets and case-insensitive BLEU for IWSLT datasets. We also compare the proposed models with other ART or NART models, where the LSTM-based model Wu et al. (2016); Bahdanau et al. (2016), the CNN-based model Gehring et al. (2017); Edunov et al. (2017), and the Transformer Vaswani et al. (2017) are autoregressive models; the FerTility based (FT) NART model Gu et al. (2017), the deterministic Iterative Refinement (IR) model Lee et al. (2018), the Latent Transformer (LT) Kaiser et al. (2018), the NART model with Connectionist Temporal Classification (CTC) Libovický and Helcl (2018), the Enhanced Non-Autoregressive Transformer (ENAT) Guo et al. (2018), the Regularized Non-Autoregressive Transformer (NAT-REG) Wang et al. (2019), and Vector Quantized Variational AutoEncoders (VQ-VAE) Roy et al. (2018) are non-autoregressive models.
Table 2 shows the BLEU scores on different datasets and the inference latency of our models and the baselines. The proposed NART-CRF/NART-DCRF models achieve state-of-the-art performance with significant improvements over previously proposed non-autoregressive models across various datasets and even outperform two strong autoregressive models (LSTM-based and CNN-based) on the WMT En-De dataset.
Specifically, the NART-DCRF model outperforms the fertility-based NART model with 5.75/7.41 and 5.75/7.27 BLEU score improvements on WMT En-De and De-En tasks in similar settings, and outperforms our own NART baseline with 3.17/1.85/1.81 and 5.20/3.47/3.44 BLEU score improvements on WMT En-De and De-En tasks in the same settings. It is even comparable to its ART Transformer teacher model. To the best of our knowledge, it is the first time that the performance gap of ART and NART is narrowed to 0.61 BLEU on WMT En-De task. Apart from the translation accuracy, our NART-CRF/NART-DCRF model achieves a speedup of 11.1/10.4 (greedy decoding) or 4.45/4.39 (teacher rescoring) over the ART counterpart.
The proposed dynamic transition technique boosts the performance of the NART-CRF model by 0.12/0.03/0.12, 1.47/0.80/0.78, and 1.05/0.78/0.81 BLEU score on WMT En-De, De-En and IWSLT De-En tasks respectively. We can see that the gain is smaller on the En-De translation task. This may be due to language-specific properties of German and English.
| CRF beam size | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 |
|---|---|---|---|---|---|---|---|---|---|
| NART-CRF (rescoring 9) | 19.61 | 23.93 | 25.48 | 25.86 | 25.93 | 26.01 | 26.04 | 26.09 | 26.08 |
| NART-CRF (rescoring 19) | 20.02 | 25.00 | 26.28 | 26.56 | 26.57 | 26.65 | 26.68 | 26.71 | 26.66 |
An interesting question in our model design is how well the beam approximation fits the full CRF transition matrix. We conduct an ablation study of our NART-CRF model on the WMT En-De task and the results are shown in Table 3. The model is trained with CRF beam size $k = 64$ and evaluated with different CRF beam sizes and numbers of rescoring candidates. We can see that a beam size of 16 already provides a quite good approximation (25.93 versus 26.08 BLEU at beam size 256 with 9 rescoring candidates), as further increasing $k$ does not bring much gain. This validates the effectiveness of our proposed beam approximation technique.
5 Conclusion and Future Work
Non-autoregressive sequence models have achieved impressive inference speedup but suffer from the decoding inconsistency problem, and thus perform poorly compared to autoregressive sequence models. In this paper, we propose a novel framework to bridge the performance gap between non-autoregressive and autoregressive sequence models. Specifically, we use linear-chain Conditional Random Fields (CRF) to model the co-occurrence relationship between adjacent words during decoding. We design two effective approximation methods to tackle the issue of the large vocabulary size, and further propose a dynamic transition technique to model positional contexts in the CRF. The results significantly outperform previous non-autoregressive baselines on WMT14 En-De and De-En datasets and achieve comparable performance to the autoregressive counterparts.
In the future, we plan to utilize other existing techniques for our NART-CRF models to further bridge the gap between non-autoregressive and autoregressive sequence models. Besides, although the rescoring process is also parallelized, it severely increases the inference latency, as can be seen in Table 2. An additional module that can accurately predict the target length might be useful. As our major contribution in this paper is to model richer structural dependency in the non-autoregressive decoder, we leave this for future work.
References
- Andor et al. (2016) Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.
- Bahdanau et al. (2016) An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.
- Bahdanau et al. (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Brown et al. (1993) The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19 (2), pp. 263–311.
- Cho et al. (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Collins (2013) The forward-backward algorithm. Columbia University.
- Collobert et al. (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (Aug), pp. 2493–2537.
- Edunov et al. (2017) Classical structured prediction losses for sequence to sequence learning. arXiv preprint arXiv:1711.04956.
- Gehring et al. (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1243–1252.
- Gu et al. (2017) Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.
- Guo et al. (2018) Non-autoregressive neural machine translation with enhanced decoder input. arXiv preprint arXiv:1812.09664.
- Hinton et al. (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Kaiser et al. (2018) Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382.
- Kim et al. (2017) Structured attention networks. arXiv preprint arXiv:1702.00887.
- Kim and Rush (2016) Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
- Kim et al. (2018) A tutorial on deep latent variable models of natural language. arXiv preprint arXiv:1812.06834.
- Kingma and Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lafferty et al. (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289.
- Lavergne et al. (2011) From n-gram-based to CRF-based translation models. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 542–553.
- Lee et al. (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.
- Li et al. (2019) Hint-based training for non-autoregressive translation. arXiv preprint arXiv:1909.06708.
- Libovický and Helcl (2018) End-to-end non-autoregressive neural machine translation with connectionist temporal classification. arXiv preprint arXiv:1811.04719.
- McCallum et al. (2000) Maximum entropy Markov models for information extraction and segmentation. In ICML, pp. 591–598.
- Papineni et al. (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318.
- Roy et al. (2018) Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063.
- Sutton et al. (2012) An introduction to conditional random fields. Foundations and Trends® in Machine Learning 4 (4), pp. 267–373.
- Szegedy et al. (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
- Tran et al. (2016) Unsupervised neural hidden Markov models. arXiv preprint arXiv:1609.09007.
- Vaswani et al. (2018) Tensor2tensor for neural machine translation. arXiv preprint arXiv:1803.07416.
- Vaswani et al. (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Wang et al. (2019) Non-autoregressive machine translation with auxiliary regularization. arXiv preprint arXiv:1902.10245.
- Wu et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.