1 Introduction
Autoregressive sequence models achieve great success in domains like machine translation and have been deployed in real applications Vaswani et al. (2017); Wu et al. (2016); Cho et al. (2014); Bahdanau et al. (2014); Gehring et al. (2017). However, these models suffer from high inference latency Vaswani et al. (2017); Wu et al. (2016), which is sometimes unaffordable for real-time industrial applications. This is mainly attributed to the autoregressive factorization nature of the models: considering a general conditional sequence generation framework, given a context sentence x and a target sentence y = (y_1, ..., y_T), autoregressive sequence models are based on a chain of conditional probabilities with a left-to-right causal structure:
P(y | x) = ∏_{t=1}^{T} P(y_t | y_{<t}, x)    (1)
where y_{<t} represents the tokens before the t-th token of target y. See Figure 1(a) for an illustration of a state-of-the-art autoregressive sequence model, the Transformer Vaswani et al. (2017). The autoregressive factorization makes the inference process hard to parallelize, as the results are generated token by token, sequentially.
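To make the latency cost concrete, the sequential nature of Equation 1 can be sketched in a few lines of Python. The next_token_probs function below is a hypothetical toy stand-in for a real Transformer decoder step (its vocabulary and scores are illustrative assumptions), but the decoding loop has the defining property: step t cannot start before step t-1 has produced its token.

```python
import math

def next_token_probs(context, prefix):
    """Toy stand-in for one decoder step: returns P(y_t | y_{<t}, x) over a
    tiny vocabulary. The scoring rule is an illustrative assumption."""
    scores = {"danke": 2.0 if not prefix else 0.0,
              "schon": 0.5,
              "<eos>": 1.0 if prefix else 0.0}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

def greedy_autoregressive_decode(context, max_len=10):
    prefix = []
    for _ in range(max_len):  # T sequential steps; cannot be parallelized,
        # because each step conditions on the tokens produced so far.
        probs = next_token_probs(context, prefix)
        token = max(probs, key=probs.get)
        if token == "<eos>":
            break
        prefix.append(token)
    return prefix

print(greedy_autoregressive_decode("thank you ."))  # ['danke']
```

The non-autoregressive factorization of Equation 2 removes exactly this loop-carried dependency, which is what allows all positions to be decoded in parallel.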
Recently, non-autoregressive sequence models Gu et al. (2017); Li et al. (2019); Wang et al. (2019); Lee et al. (2018) were proposed to alleviate the inference latency by removing the sequential dependencies within the target sentence. These models also use the general encoder-decoder framework: the encoder takes the context sentence x as input to generate contextual embeddings and predict the target length T, and the decoder uses a well-designed deterministic or stochastic input z and the contextual embeddings to predict each target token:
P(y | x) = ∏_{t=1}^{T} P(y_t | z, x)    (2)
Non-autoregressive sequence models take full advantage of parallelism and significantly improve the inference speed. However, they usually cannot produce results as good as their autoregressive counterparts. As shown in Table 1, on the machine translation task, compared to AutoRegressive Translation (ART) models, Non-AutoRegressive Translation (NART) models suffer from a severe decoding inconsistency problem. In non-autoregressive sequence models, each token in the target sentence is generated independently, so decoding consistency (e.g., word co-occurrence) cannot be guaranteed on the target side. The primary phenomenon that can be observed is the multimodality problem: non-autoregressive models cannot properly model the highly multimodal distribution of target sequences Gu et al. (2017). For example, the English sentence "Thank you." has many correct German translations like "Danke.", "Danke schön.", or "Vielen Dank.". In practice, this leads to inconsistent outputs such as "Danke Dank." or "Vielen schön.".
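The multimodality problem can be reproduced with a toy calculation (the two reference translations and their equal probabilities are illustrative assumptions): if a conditionally independent decoder fits the position-wise marginals of two equally likely translations, an inconsistent mixture of the two modes receives exactly the same probability as either correct output under the independence assumption of Equation 2.

```python
from itertools import product

# Two equally likely reference translations (modes) of "Thank you.":
modes = [("danke", "schon"), ("vielen", "dank")]

# Position-wise marginals that a conditionally independent decoder would fit:
marginals = [{}, {}]
for mode in modes:
    for pos, word in enumerate(mode):
        marginals[pos][word] = marginals[pos].get(word, 0.0) + 1.0 / len(modes)

# Sequence probabilities under the independence assumption of Equation 2:
indep = {seq: marginals[0][seq[0]] * marginals[1][seq[1]]
         for seq in product(marginals[0], marginals[1])}

# The inconsistent mixture ("danke", "dank") is as likely (0.25) as either
# correct translation, so greedy/independent decoding cannot rule it out.
print(indep[("danke", "dank")], indep[("danke", "schon")])
```

Modeling the pairwise dependency between adjacent tokens, as the CRF below does, is precisely what lets the model assign low probability to such mixed outputs.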
To tackle this problem, in this paper we propose to incorporate a structured inference module into the non-autoregressive decoder to directly model the multimodal distribution of target sequences. Specifically, we regard sequence generation (e.g., machine translation) as a sequence labeling problem and propose to use linear-chain Conditional Random Fields (CRF) Lafferty et al. (2001) to model richer structural dependencies. By modeling the co-occurrence relationship between adjacent words, the CRF-based structured inference module can significantly improve decoding consistency on the target side. Different from the probability product form of Equation 2, the probability of the target sentence is globally normalized:
P(y | x) = (1 / Z(x)) · exp( Σ_{t=1}^{T} φ_t(y_t, x) + Σ_{t=2}^{T} φ(y_{t-1}, y_t) )    (3)
where φ_t(y_t, x) is the unary potential of y_t, φ(y_{t-1}, y_t) is the pairwise potential for y_{t-1} and y_t, and Z(x) is the normalizing factor of the global normalization. Such a probability form can better model the multiple modes in target translations.
However, the label size (i.e., the vocabulary size) used in typical sequence models is very large (e.g., 32k), which is intractable for traditional CRFs. Therefore, we design two effective approximation methods for the CRF: a low-rank approximation and a beam approximation. Moreover, to leverage the rich contextual information from the hidden states of the non-autoregressive decoder and to improve the expressive power of the structured inference module, we further propose a dynamic transition technique to model positional contexts in the CRF.
We evaluate the proposed end-to-end model on three widely used machine translation tasks: the WMT14 English-to-German/German-to-English (En-De/De-En) tasks and the IWSLT14 German-to-English (De-En) task. Experimental results show that, while losing little speed, our NART-CRF model achieves significantly better translation performance than previous NART models on several tasks. In particular, for the WMT14 En-De and De-En tasks, our model obtains BLEU scores of 26.80 and 30.04 respectively, which largely outperform previous non-autoregressive baselines and are even comparable to the autoregressive counterparts.
2 Related Work
2.1 Non-autoregressive neural machine translation
Non-AutoRegressive neural machine Translation (NART) models aim to speed up the inference process for real-time machine translation Gu et al. (2017), but their performance is considerably worse than that of their ART counterparts. Most previous works attributed the poor performance to the unavoidable conditional independence when predicting each target token, and proposed various methods to address this issue. Some methods alleviate the multimodality phenomenon in vanilla NART training: Gu et al. (2017) introduced sentence-level knowledge distillation Hinton et al. (2015); Kim and Rush (2016) to reduce the multimodality in the raw data; Wang et al. (2019) designed two auxiliary regularization terms for training; Li et al. (2019) proposed to leverage hints from the ART models to guide the NART model's attention and hidden states. Our approach is orthogonal to these training techniques. Perhaps the closest to our approach is Libovický and Helcl (2018), which introduced the Connectionist Temporal Classification (CTC) loss into NART training. Both CTC and CRF can reduce the multimodality effect in training; however, CTC can only model a unimodal target distribution, while CRF can model a multimodal target distribution effectively.
Other methods attempted to model the multimodal target distribution through a well-designed decoder input z: Gu et al. (2017) introduced the concept of fertilities from statistical machine translation models Brown et al. (1993) into NART models; Lee et al. (2018) used an iterative refinement process in the decoding of their proposed model; Kaiser et al. (2018) and Roy et al. (2018) embedded an autoregressive sub-module consisting of discrete latent variables into their models. In comparison, our NART-CRF models use a simple design for the decoder input z, but model richer structural dependencies for the decoder output.
2.2 Structured learning for machine translation
The idea of recasting the machine translation problem as a sequence labeling task can be traced back to Lavergne et al. (2011), where a CRF-based method was proposed for Statistical Machine Translation (SMT). They simplified CRF training by (1) limiting the possible "labels" to those observed during training and (2) enforcing sparsity in the model. In comparison, our proposed low-rank approximation and beam approximation are more suitable for neural network models.
Structured prediction provides a declarative language for specifying prior knowledge and structural relationships in the data Kim et al. (2018). Our approach is also related to other works on structured neural sequence modeling: Tran et al. (2016) neuralized an unsupervised Hidden Markov Model (HMM), and Kim et al. (2017) proposed to incorporate richer structured distributions into the attention mechanism. Both focus on the internal structural dependencies of their models, while in this paper we directly model richer structural dependencies for the decoder output. Finally, our work is related to previous work on combining neural networks with CRFs for sequence labeling: Collobert et al. (2011) proposed a unified neural network architecture for sequence labeling, and Andor et al. (2016) proposed a globally normalized transition-based neural network on a task-specific transition system.
3 Fast Structured Decoding for Sequence Models
In this section, we describe the proposed model in the context of machine translation and use "source" and "context" interchangeably. The proposed NART-CRF model formulates non-autoregressive translation as a sequence labeling problem and uses Conditional Random Fields (CRF) to solve it. We first briefly introduce the Transformer-based NART architecture and then describe the CRF-based structured inference module. Figure 1(b) illustrates our NART-CRF model structure.
3.1 Transformer-based Non-autoregressive Translation Model
The model design follows the Transformer architecture Vaswani et al. (2017) with an additional positional attention layer proposed by Gu et al. (2017). We refer the readers to Vaswani et al. (2017); Gu et al. (2017); Vaswani et al. (2018) for more details about the model.
Encoder-decoder framework
Non-autoregressive machine translation can also be formulated in an encoder-decoder framework Cho et al. (2014). As in the ART models, the encoder of a NART model takes the embeddings of source tokens as inputs and generates the context representation. However, as shown in Equation 2, the NART decoder does not use the autoregressive factorization, but decodes each target token independently given the target length T and the decoder input z.
Multi-head attention
ART and NART Transformer models share two types of multi-head attention: multi-head self-attention and multi-head encoder-to-decoder attention. The NART model additionally uses multi-head positional attention to model local word orders within the sentence Gu et al. (2017). A general attention mechanism can be formulated as the weighted sum of the value vectors V using query vectors Q and key vectors K:

Attention(Q, K, V) = softmax( Q K^T / √d_k ) · V    (4)
where d_k represents the dimension of the hidden representations. For self-attention, Q, K, and V are hidden representations of the previous layer. For encoder-to-decoder attention, Q refers to hidden representations of the previous layer, whereas K and V are context vectors from the encoder. For positional attention, positional embeddings are used as Q and K, and hidden representations of the previous layer are used as V. The position-wise Feed-Forward Network (FFN) is applied after the multi-head attention in both the encoder and the decoder. It consists of a two-layer linear transformation with ReLU activation:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2    (5)
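As a minimal illustration, the two building blocks above (Equations 4 and 5) can be sketched in pure Python. This is a single-head, unbatched sketch for clarity, not the actual multi-head implementation; the helper names are ours.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention (Equation 4) on lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # q . k / sqrt(d_k) for every key, then a softmax over the scores.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN (Equation 5): max(0, x W1 + b1) W2 + b2."""
    h = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
         for j in range(len(b1))]
    return [sum(hi * W2[i][j] for i, hi in enumerate(h)) + b2[j]
            for j in range(len(b2))]
```

For self-attention one would call attention(H, H, H) on the previous layer's hidden states H; for encoder-to-decoder attention Q comes from the decoder while K and V come from the encoder; for positional attention Q and K are positional embeddings and V is the previous layer's hidden states.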
3.2 Structured inference module
In this paper, we propose to incorporate a structured inference module into the decoder to directly model multimodality in NART models. Figure 2 shows how a CRF-based structured inference module works. In principle, this module can be any structured prediction model, such as Conditional Random Fields (CRF) Lafferty et al. (2001) or the Maximum Entropy Markov Model (MEMM) McCallum et al. (2000). Here we focus on the linear-chain CRF, which is the most widely applied model in the sequence labeling literature. In the context of machine translation, we use "label" and "token" (vocabulary) interchangeably for the decoder output.
Conditional random fields
A CRF is a framework for building probabilistic models to segment and label sequence data. Given the sequence data x = (x_1, ..., x_n) and the corresponding label sequence y = (y_1, ..., y_n), the likelihood of y given x is defined as:
P(y | x) = (1 / Z(x)) · exp( Σ_{i=1}^{n} s(y_i, x, i) + Σ_{i=2}^{n} t(y_{i-1}, y_i, x) )    (6)
where Z(x) is the normalizing factor, s(y_i, x, i) is the label score of y_i at position i, and t(y_{i-1}, y_i, x) is the transition score from y_{i-1} to y_i. The CRF module can be jointly trained end-to-end with neural networks using the negative log-likelihood loss L_CRF = -log P(y|x). Note that when the transition score t(·) is omitted, Equation 6 is the same as in vanilla non-autoregressive models (Equation 2).
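The likelihood of Equation 6 and the forward computation of its normalizing factor can be sketched as follows. This is a minimal pure-Python version for a small label set (variable names are ours); real implementations vectorize the same recursion in log space.

```python
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def crf_neg_log_likelihood(label_scores, transition, y):
    """Negative log-likelihood of Equation 6 for one sequence.

    label_scores[i][v] -- label score s(v, x, i) of token v at position i
    transition[u][v]   -- transition score t(u, v, x) from token u to token v
    y                  -- gold label sequence
    """
    n, num_labels = len(label_scores), len(label_scores[0])
    # Unnormalized score of the gold path (the exponent in Equation 6).
    gold = sum(label_scores[i][y[i]] for i in range(n)) \
         + sum(transition[y[i - 1]][y[i]] for i in range(1, n))
    # Forward algorithm: alpha[v] = log-sum of scores of all prefixes ending in v.
    alpha = list(label_scores[0])
    for i in range(1, n):
        alpha = [log_sum_exp([alpha[u] + transition[u][v]
                              for u in range(num_labels)])
                 + label_scores[i][v]
                 for v in range(num_labels)]
    log_Z = log_sum_exp(alpha)  # log of the normalizing factor Z(x)
    return log_Z - gold
```

The inner loop over all (u, v) label pairs is what gives the O(n·|V|^2) cost: at a 32k vocabulary each position touches roughly 10^9 pairs, which motivates the approximations in the following subsections.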
Incorporating CRF into NART model
For the label score, a linear transformation s(y_i, x, i) = (W h_i + b)_{y_i} of the NART decoder's output h_i works well, where W and b are the weight and bias of the linear transformation. However, for the transition score, naive methods require a |V| × |V| matrix to model t(y_{i-1}, y_i, x), where |V| is the vocabulary size. Moreover, according to the widely used forward-backward algorithm Lafferty et al. (2001), the likelihood computation and the decoding process require O(n · |V|^2) complexity through dynamic programming Lafferty et al. (2001); Sutton et al. (2012); Collins (2013), which is infeasible for practical usage (e.g., a 32k vocabulary).
Low-rank approximation for transition matrix
A solution to the above issue is to use a low-rank matrix to approximate the full-rank transition matrix. In particular, we introduce two transition embeddings E_1, E_2 ∈ R^{|V| × d_t} to approximate the transition matrix:
M = E_1 E_2^T    (7)
where d_t is the dimension of the transition embedding.
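A sketch of the low-rank parameterization of Equation 7; the toy sizes and the Gaussian initialization are illustrative assumptions.

```python
import random

V, d_t = 1000, 32  # toy vocabulary size and transition-embedding dimension

# Two transition embeddings E1, E2 in R^{V x d_t}; the full V x V transition
# matrix M = E1 E2^T is never materialized.
E1 = [[random.gauss(0.0, 0.02) for _ in range(d_t)] for _ in range(V)]
E2 = [[random.gauss(0.0, 0.02) for _ in range(d_t)] for _ in range(V)]

def transition_score(u, v):
    """t(u, v) = E1[u] . E2[v], i.e. entry (u, v) of M = E1 E2^T, in O(d_t)."""
    return sum(a * b for a, b in zip(E1[u], E2[v]))
```

This stores 2·|V|·d_t parameters instead of |V|^2 (at |V| = 32k and d_t = 32, roughly 2 million instead of about 10^9), and any single entry of M can still be computed on demand in O(d_t).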
Beam approximation for CRF
The low-rank approximation allows us to calculate the unnormalized term in Equation 6 efficiently. However, due to numerical accuracy issues (the transition is calculated in log space; see https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/crf for a detailed implementation), both the normalizing factor Z(x) and the decoding process still require the full transition matrix, which is unaffordable. Therefore, we further propose the beam approximation to make the CRF tractable for NART models.
In particular, for each position i, we heuristically truncate the |V| candidates to a pre-defined beam size k: we keep the k candidates with the highest label scores for each position i and accordingly crop the transition matrix between each pair of adjacent positions. The forward-backward algorithm is then applied on the truncated beam to obtain either the normalizing factor or the decoding result. In this way, the time complexity is reduced from O(n · |V|^2) to O(n · k^2) (e.g., for the normalizing factor, instead of a sum over all |V|^n possible paths, we sum over the k^n paths in the beam). Besides, when calculating the normalizing factor for a training pair (x, y), we explicitly include each y_i in the beam to ensure that the approximated normalizing factor is larger than the unnormalized path score of y. The intuition behind the beam approximation is that, for the normalizing factor, the sum of path scores in such a beam (the approximated Z(x)) is able to dominate the actual value of Z(x), while it is also reasonable to assume that the beam includes each label of the best path.
Dynamic CRF transition
In the traditional definition, the transition matrix M is fixed for all positions i. A dynamic transition matrix that depends on the positional context could improve the representation power of the CRF. Here we use a simple but effective way to obtain a dynamic transition matrix: a dynamic matrix is inserted into the product of the transition embeddings E_1 and E_2:
W_i = reshape( f([h_{i-1}, h_i]) ) ∈ R^{d_t × d_t}    (8)

M_i = E_1 W_i E_2^T    (9)

t(y_{i-1}, y_i, x, i) = M_i[y_{i-1}, y_i]    (10)
where [h_{i-1}, h_i] is the concatenation of two adjacent decoder outputs, and f is a two-layer Feed-Forward Network (FFN) that produces the entries of the dynamic matrix.
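A sketch of the dynamic transition under our reading of the text; the exact shape of f, the reshaping convention, and the toy dimensions are assumptions.

```python
import random

d_h, d_t = 8, 4  # toy decoder hidden size and transition-embedding dimension
random.seed(0)

def linear(x, W, b):
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

# Two-layer FFN f: maps [h_{i-1}; h_i] (size 2 * d_h) to d_t * d_t entries,
# which are reshaped into the dynamic matrix W_i.
hidden = 16
W1 = [[random.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(2 * d_h)]
b1 = [0.0] * hidden
W2 = [[random.gauss(0.0, 0.1) for _ in range(d_t * d_t)] for _ in range(hidden)]
b2 = [0.0] * (d_t * d_t)

def dynamic_transition_matrix(h_prev, h_cur):
    """W_i = reshape(f([h_{i-1}, h_i])) in R^{d_t x d_t}."""
    h = [max(0.0, v) for v in linear(h_prev + h_cur, W1, b1)]  # ReLU layer
    flat = linear(h, W2, b2)
    return [flat[r * d_t:(r + 1) * d_t] for r in range(d_t)]

def dynamic_transition_score(E1_u, E2_v, W_i):
    """Position-dependent transition score E1[u] W_i E2[v]^T."""
    tmp = [sum(E1_u[a] * W_i[a][j] for a in range(d_t)) for j in range(d_t)]
    return sum(tmp[j] * E2_v[j] for j in range(d_t))
```

Because W_i is only d_t × d_t (e.g., 32 × 32), producing a fresh transition matrix per position adds little cost, while letting adjacent-token compatibility depend on the decoder's contextual states rather than being shared across the whole sentence.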
Latency of CRF decoding
Unlike vanilla non-autoregressive decoding, CRF decoding can no longer be fully parallelized. However, thanks to the beam approximation, the computation of the linear-chain CRF is in theory still much faster than autoregressive decoding. As shown in Table 2, in practice the overhead is only 8-14 ms.
Exact Decoding for Machine Translation
Despite fast decoding, another promise of this approach is that it provides an exact decoding framework for machine translation, while the de facto standard beam search algorithm for ART models cannot provide such a guarantee. The CRF-based structured inference module can also solve the label bias problem Lafferty et al. (2001), whereas locally normalized models (e.g., autoregressive models decoded with beam search) often have a very weak ability to revise earlier decisions Andor et al. (2016).
Joint training with vanilla nonautoregressive loss
In practice, we find it beneficial to include the original NART loss to help the training of the NART-CRF model. Therefore, our final training loss L is a weighted sum of the CRF negative log-likelihood loss L_CRF (from Equation 3) and the Non-AutoRegressive (NAR) negative log-likelihood loss L_NAR (from Equation 2):
L = L_CRF + λ · L_NAR    (11)
where λ is the hyperparameter controlling the weight of the two loss terms.
4 Experiments
4.1 Experimental settings
We use several widely adopted benchmark tasks to evaluate the effectiveness of our proposed models: IWSLT14 German-to-English translation (IWSLT14 De-En, https://wit3.fbk.eu/) and WMT14 English-to-German/German-to-English translation (WMT14 En-De/De-En, http://statmt.org/wmt14/translation-task.html). For the WMT14 dataset, we use Newstest2014 as test data and Newstest2013 as validation data. For each dataset, we split word tokens into subword units following Wu et al. (2016), forming a 32k wordpiece vocabulary shared by the source and target languages.
For the WMT14 dataset, we use the default network architecture of the original base Transformer Vaswani et al. (2017), which consists of a 6-layer encoder and a 6-layer decoder; the size of the hidden states is set to 512. Considering that IWSLT14 is a relatively smaller dataset compared to WMT14, we use a smaller architecture for IWSLT14, which consists of a 5-layer encoder and a 5-layer decoder; the size of the hidden states is set to 256, and the number of heads is set to 4. For all datasets, we set the size of the transition embedding d_t to 32 and the beam size k of the beam approximation to 64. The hyperparameter λ is set to balance the scale of the two loss components.
Following previous works Gu et al. (2017), we use sequencelevel knowledge distillation Kim and Rush (2016) during training. Specifically, we train our models on translations produced by a Transformer teacher model. It has been shown to be an effective way to alleviate the multimodality problem in training Gu et al. (2017).
Since the CRF-based structured inference module is not parallelizable in training, we initialize our NART-CRF models by warming up from their vanilla NART counterparts to speed up training. We use the Adam optimizer Kingma and Ba (2014) and employ label smoothing Szegedy et al. (2016) in all experiments. Models for the WMT14/IWSLT14 tasks are trained on 4/1 NVIDIA P40 GPUs, respectively. We implement our models based on the open-sourced tensor2tensor library Vaswani et al. (2018).
4.2 Inference
During training, the target sentence is given, so we do not need to predict the target length T. However, during inference, we have to predict the length of the target sentence for each source sentence. Specifically, in this paper we use the simplest form of target length T: a linear function of the source length T_src, defined as T = T_src + C, where C is a constant bias term that can be set according to the overall length statistics of the training data. We also try different target lengths ranging from T_src + C - B to T_src + C + B, where B is the half-width, obtain multiple translation results with different lengths, and then use the ART Transformer teacher model to select the best translation from the multiple candidates during inference.
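The candidate-length scheme above can be sketched in a few lines (the function name is ours):

```python
def candidate_lengths(source_length, C, B):
    """Candidate target lengths T_src + C - B .. T_src + C + B,
    i.e. up to 2B + 1 candidates, dropping non-positive lengths."""
    center = source_length + C
    return [t for t in range(center - B, center + B + 1) if t > 0]

# E.g., with bias C = 2 and half-width B = 4, a 10-token source yields
# 9 candidate lengths centered at 12:
print(candidate_lengths(10, 2, 4))  # [8, 9, 10, 11, 12, 13, 14, 15, 16]
```

One translation is decoded per candidate length, and the teacher model rescores all of them in parallel to pick the final output.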
We set the constant bias term C to 2, 2, and 2 for the WMT14 En-De, WMT14 De-En, and IWSLT14 De-En datasets respectively, according to the average lengths of the different languages in the training sets, and set B to 4/9, yielding 9/19 candidate translations for each sentence. For each dataset, we evaluate model performance with the BLEU score Papineni et al. (2002). Following previous works Gu et al. (2017); Lee et al. (2018); Guo et al. (2018); Wang et al. (2019), we evaluate the average per-sentence decoding latency on the WMT14 En-De test set with batch size 1 on a single NVIDIA Tesla P100 GPU, for both the Transformer model and the NART models, to measure the speedup of our models. The latencies are obtained by averaging over five runs.
4.3 Results and analysis
We evaluate three models described in Section 3 (following common practice in previous works, we use tokenized case-sensitive BLEU for the WMT datasets and case-insensitive BLEU for the IWSLT dataset to make a fair comparison): the Non-AutoRegressive Transformer baseline (NART), NART with static-transition Conditional Random Fields (NART-CRF), and NART with Dynamic-transition Conditional Random Fields (NART-DCRF). We also compare the proposed models with other ART or NART models: the LSTM-based model Wu et al. (2016); Bahdanau et al. (2016), the CNN-based model Gehring et al. (2017); Edunov et al. (2017), and the Transformer Vaswani et al. (2017) are autoregressive models; the FerTility-based (FT) NART model Gu et al. (2017), the deterministic Iterative Refinement (IR) model Lee et al. (2018), the Latent Transformer (LT) Kaiser et al. (2018), the NART model with Connectionist Temporal Classification (CTC) Libovický and Helcl (2018), the Enhanced Non-Autoregressive Transformer (ENAT) Guo et al. (2018), the Regularized Non-Autoregressive Transformer (NAT-REG) Wang et al. (2019), and Vector Quantized Variational AutoEncoders (VQ-VAE) Roy et al. (2018) are non-autoregressive models.
Table 2 shows the BLEU scores on different datasets and the inference latency of our models and the baselines. The proposed NART-CRF/NART-DCRF models achieve state-of-the-art performance with significant improvements over previously proposed non-autoregressive models across various datasets, and even outperform two strong autoregressive models (LSTM-based and CNN-based) on the WMT En-De dataset.
Specifically, the NART-DCRF model outperforms the fertility-based NART model by 5.75/7.41 and 5.75/7.27 BLEU on the WMT En-De and De-En tasks in similar settings, and outperforms our own NART baseline by 3.17/1.85/1.81 and 5.20/3.47/3.44 BLEU on the WMT En-De and De-En tasks in the same settings. It is even comparable to its ART Transformer teacher model. To the best of our knowledge, this is the first time that the performance gap between ART and NART models has been narrowed to 0.61 BLEU on the WMT En-De task. Apart from translation accuracy, our NART-CRF/NART-DCRF models achieve a speedup of 11.1x/10.4x (greedy decoding) or 4.45x/4.39x (teacher rescoring) over the ART counterpart.
The proposed dynamic transition technique boosts the performance of the NART-CRF model by 0.12/0.03/0.12, 1.47/0.80/0.78, and 1.05/0.78/0.81 BLEU on the WMT En-De, WMT De-En, and IWSLT De-En tasks respectively. We can see that the gain is smaller on the En-De translation task; this may be due to language-specific properties of German and English.
Table 3: Effect of the CRF beam size on BLEU (WMT14 En-De).

CRF beam size           | 1     | 2     | 4     | 8     | 16    | 32    | 64    | 128   | 256
NART-CRF                | 15.10 | 20.67 | 22.54 | 23.04 | 23.22 | 23.26 | 23.32 | 23.33 | 23.38
NART-CRF (rescoring 9)  | 19.61 | 23.93 | 25.48 | 25.86 | 25.93 | 26.01 | 26.04 | 26.09 | 26.08
NART-CRF (rescoring 19) | 20.02 | 25.00 | 26.28 | 26.56 | 26.57 | 26.65 | 26.68 | 26.71 | 26.66
An interesting question in our model design is how well the beam approximation fits the full CRF transition matrix. We conduct an ablation study of our NART-CRF model on the WMT En-De task; the results are shown in Table 3. The model is trained with CRF beam size k = 64 and evaluated with different CRF beam sizes and numbers of rescoring candidates. We can see that a moderate beam size (e.g., 16) already provides a quite good approximation, as further increasing the beam size brings little additional gain. This validates the effectiveness of our proposed beam approximation technique.
5 Conclusion and Future Work
Non-autoregressive sequence models have achieved impressive inference speedups but suffer from the decoding inconsistency problem, and thus perform poorly compared to autoregressive sequence models. In this paper, we propose a novel framework to bridge the performance gap between non-autoregressive and autoregressive sequence models. Specifically, we use linear-chain Conditional Random Fields (CRF) to model the co-occurrence relationship between adjacent words during decoding. We design two effective approximation methods to tackle the issue of the large vocabulary size, and further propose a dynamic transition technique to model positional contexts in the CRF. The resulting models significantly outperform previous non-autoregressive baselines on the WMT14 En-De and De-En datasets and achieve performance comparable to their autoregressive counterparts.
In the future, we plan to combine other existing techniques with our NART-CRF models to further bridge the gap between non-autoregressive and autoregressive sequence models. Besides, although the rescoring process is parallelized, it considerably increases the inference latency, as can be seen in Table 2; an additional module that can accurately predict the target length might remove the need for rescoring. As our major contribution in this paper is to model richer structural dependencies in the non-autoregressive decoder, we leave this for future work.
References
 [1] Andor et al. (2016) Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042.
 [2] Bahdanau et al. (2016) An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.
 [3] Bahdanau et al. (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
 [4] Brown et al. (1993) The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19(2), pp. 263-311.
 [5] Cho et al. (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
 [6] Collins (2013) The forward-backward algorithm. Lecture notes, Columbia University.
 [7] Collobert et al. (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug), pp. 2493-2537.
 [8] Edunov et al. (2017) Classical structured prediction losses for sequence to sequence learning. arXiv preprint arXiv:1711.04956.
 [9] Gehring et al. (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1243-1252.
 [10] Gu et al. (2017) Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.
 [11] Guo et al. (2018) Non-autoregressive neural machine translation with enhanced decoder input. arXiv preprint arXiv:1812.09664.
 [12] Hinton et al. (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
 [13] Kaiser et al. (2018) Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382.
 [14] Kim et al. (2017) Structured attention networks. arXiv preprint arXiv:1702.00887.
 [15] Kim and Rush (2016) Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
 [16] Kim et al. (2018) A tutorial on deep latent variable models of natural language. arXiv preprint arXiv:1812.06834.
 [17] Kingma and Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [18] Lafferty et al. (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282-289.
 [19] Lavergne et al. (2011) From n-gram-based to CRF-based translation models. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 542-553.
 [20] Lee et al. (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.
 [21] Li et al. (2019) Hint-based training for non-autoregressive translation. arXiv preprint arXiv:1909.06708.
 [22] Libovický and Helcl (2018) End-to-end non-autoregressive neural machine translation with connectionist temporal classification. arXiv preprint arXiv:1811.04719.
 [23] McCallum et al. (2000) Maximum entropy Markov models for information extraction and segmentation. In ICML, pp. 591-598.
 [24] Papineni et al. (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318.
 [25] Roy et al. (2018) Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063.
 [26] Sutton et al. (2012) An introduction to conditional random fields. Foundations and Trends in Machine Learning 4(4), pp. 267-373.
 [27] Szegedy et al. (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826.
 [28] Tran et al. (2016) Unsupervised neural hidden Markov models. arXiv preprint arXiv:1609.09007.
 [29] Vaswani et al. (2018) Tensor2Tensor for neural machine translation. arXiv preprint arXiv:1803.07416.
 [30] Vaswani et al. (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008.
 [31] Wang et al. (2019) Non-autoregressive machine translation with auxiliary regularization. arXiv preprint arXiv:1902.10245.
 [32] Wu et al. (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.