1 Introduction
The performance of non-autoregressive neural machine translation (NAT) systems, which predict tokens in the target language independently of each other conditioned on the source sentence, has been improving steadily in recent years (Lee et al., 2018; Ghazvininejad et al., 2019; Ma et al., 2019). One common ingredient in getting non-autoregressive systems to perform well is to train them on a corpus of distilled translations (Kim and Rush, 2016). This distilled corpus consists of source sentences paired with the translations produced by a pretrained autoregressive "teacher" system.

As an alternative to training non-autoregressive translation systems on distilled corpora, we instead propose to train them to minimize the energy defined by a pretrained autoregressive teacher model. That is, we view non-autoregressive machine translation systems as inference networks (Tu and Gimpel, 2018, 2019; Tu et al., 2019) trained to minimize the teacher's energy. This provides the non-autoregressive model with additional information related to the energy of the teacher, rather than just the approximate minimizers of the teacher's energy appearing in a distilled corpus.
In order to train inference networks to minimize an energy function, the energy must be differentiable with respect to the inference network output. We describe several approaches for relaxing the autoregressive teacher's energy to make it amenable to minimization with an inference network, and compare them empirically. We experiment with two non-autoregressive inference network architectures, one based on bidirectional RNNs and the other based on the transformer model of Ghazvininejad et al. (2019).
In experiments on the IWSLT 2014 DE-EN and WMT 2016 RO-EN datasets, we show that training to minimize the teacher's energy significantly outperforms training with distilled outputs. Our approach, which we call ENGINE (ENerGy-based Inference NEtworks), achieves state-of-the-art results for non-autoregressive translation on these datasets, approaching the results of the autoregressive teachers. Our hope is that ENGINE will enable energy-based models to be applied more broadly for non-autoregressive generation in the future.
2 Related Work
Non-autoregressive neural machine translation began with the work of Gu et al. (2018), who found that using knowledge distillation (Hinton et al., 2015), and in particular sequence-level distilled outputs (Kim and Rush, 2016), improved performance. Subsequent work has continued to narrow the gap between non-autoregressive and autoregressive translation, including multi-iteration refinements (Lee et al., 2018; Ghazvininejad et al., 2019; Saharia et al., 2020) and rescoring with autoregressive models (Kaiser et al., 2018; Ma et al., 2019; Sun et al., 2019). Recently, Ghazvininejad et al. (2020) and Saharia et al. (2020) proposed aligned cross-entropy and latent alignment models, achieving the highest BLEU scores among non-autoregressive models that use neither refinement nor rescoring. In our work, we propose training inference networks with autoregressive energies and outperform the best purely non-autoregressive methods.
Another related approach trains an “actor” network to manipulate the hidden state of an autoregressive neural MT system (Gu et al., 2017; Chen et al., 2018; Zhou et al., 2020) in order to bias it toward outputs with better BLEU scores. This work modifies the original pretrained network rather than using it to define an energy for training an inference network.
Energy-based models have had limited application in text generation due to the computational challenges involved in learning and inference in extremely large search spaces. The use of inference networks to output approximate minimizers of a loss function is popular in variational inference (Kingma and Welling, 2013; Rezende et al., 2014) and, more recently, in structured prediction (Tu and Gimpel, 2018, 2019; Tu et al., 2019).

3 Energy-Based Inference Networks for Non-Autoregressive NMT
Most neural machine translation (NMT) systems model the conditional distribution $p_\Theta(y \mid x)$ of a target sequence $y = \langle y_1, \ldots, y_T \rangle$ given a source sequence $x = \langle x_1, \ldots, x_{T_x} \rangle$, where each $y_t$ comes from a vocabulary $\mathcal{V}$, $y_T$ is $\langle \mathrm{eos} \rangle$, and $y_0$ is $\langle \mathrm{bos} \rangle$. It is common in NMT to define this conditional distribution using an "autoregressive" factorization (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017):

$$\log p_\Theta(y \mid x) = \sum_{t=1}^{|y|} \log p_\Theta(y_t \mid y_0, \ldots, y_{t-1}, x)$$
This model can be viewed as an energy-based model (LeCun et al., 2006) by defining the energy function $E_\Theta(x, y) = -\log p_\Theta(y \mid x)$. Given trained parameters $\Theta$, test-time inference seeks to find the translation for a given source sentence $x$ with the lowest energy: $\hat{y} = \arg\min_y E_\Theta(x, y)$.
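As a toy numerical illustration (not the paper's implementation), the energy of a candidate translation under a fixed set of per-step teacher distributions can be computed and minimized by enumeration. The vocabulary, distributions, and candidates below are invented for the example; a real teacher's per-step distributions depend on the previous tokens and the source sentence.

```python
import numpy as np

# Toy energy of a candidate translation under fixed per-step "teacher"
# distributions: E(x, y) = -sum_t log p(y_t | y_<t, x).  In reality each
# p(. | y_<t, x) comes from the AR decoder and depends on the prefix.
def energy(step_dists, y):
    return -sum(np.log(step_dists[t][y[t]]) for t in range(len(y)))

# Per-step teacher distributions for a 2-token output over a 3-word vocabulary.
step_dists = [np.array([0.7, 0.2, 0.1]),
              np.array([0.1, 0.8, 0.1])]

# Exact inference = pick the candidate with the lowest energy.
candidates = [(0, 1), (1, 1), (0, 0)]
energies = [energy(step_dists, y) for y in candidates]
best = candidates[int(np.argmin(energies))]
print(best)  # (0, 1), the minimum-energy translation
```

Enumerating candidates is only feasible in this toy setting; for real vocabularies and lengths the argmin is a combinatorial search, which is exactly what the inference network is trained to approximate.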
Finding the translation that minimizes the energy involves combinatorial search. In this paper, we train inference networks to perform this search approximately. The idea of this approach is to replace the test-time combinatorial search typically employed in structured prediction with the output of a network trained to produce approximately optimal predictions (Tu and Gimpel, 2018, 2019). More formally, we define an inference network $\mathbf{A}_\Psi$ which maps an input $x$ to a translation $y$ and is trained with the goal that $\mathbf{A}_\Psi(x) \approx \arg\min_y E_\Theta(x, y)$.
Specifically, we train the inference network parameters $\Psi$ as follows (assuming $\Theta$ is pretrained and fixed):

$$\hat{\Psi} = \arg\min_\Psi \sum_{\langle x, y \rangle \in \mathcal{D}} E_\Theta(x, \mathbf{A}_\Psi(x)) \qquad (1)$$

where $\mathcal{D}$ is a training set of sentence pairs. The network architecture of $\mathbf{A}_\Psi$ can be different from the architectures used in the energy function. In this paper, we combine an autoregressive energy function with a non-autoregressive inference network. By doing so, we seek to combine the effectiveness of the autoregressive energy with the fast inference speed of a non-autoregressive network.
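To make the training objective concrete, here is a deliberately tiny sketch in which the "inference network" is just a table of free logit vectors, the teacher's per-step distributions are fixed toy values (a real AR teacher conditions on previous tokens and the source), and the relaxed energy is minimized by plain gradient descent. All names and values are illustrative, not the paper's code.

```python
import numpy as np

# Minimal sketch of energy-minimization training with a toy,
# position-independent "teacher": the relaxed energy is
# E(z) = sum_t -softmax(z_t) . log p_t, which is fully differentiable.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def energy_and_grad(logits, teacher_dists):
    cost = [-np.log(p) for p in teacher_dists]   # c_t = -log p_t
    E, grads = 0.0, []
    for z, c in zip(logits, cost):
        s = softmax(z)
        E += s @ c
        # Gradient of s.c via the softmax Jacobian: s * (c - s.c)
        grads.append(s * (c - s @ c))
    return E, grads

teacher = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
logits = [np.zeros(3), np.zeros(3)]              # "inference network" parameters
E0, _ = energy_and_grad(logits, teacher)
for _ in range(200):                             # plain gradient descent
    E, g = energy_and_grad(logits, teacher)
    logits = [z - 0.5 * gz for z, gz in zip(logits, g)]
E1, _ = energy_and_grad(logits, teacher)
print(E1 < E0)                                   # energy decreases during training
print([int(np.argmax(z)) for z in logits])       # argmaxes match the teacher modes
```

In the paper the logits come from a non-autoregressive network and the teacher is a real AR decoder, so the gradients flow through both the network and the relaxation operators described next; this sketch isolates only the outer energy-minimization loop.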
3.1 Energies for Inference Network Training
In order to allow for gradient-based optimization of the inference network parameters $\Psi$, we now define a more general family of energy functions for NMT. First, we change the representation of the translation $y$ in the energy, redefining it as a sequence of distributions over words instead of a sequence of words.
In particular, we consider the generalized energy

$$E_\Theta(x, y) = \sum_{t=1}^{|y|} e_t(x, y) \qquad (2)$$

where

$$e_t(x, y) = -\sum_{j=1}^{|\mathcal{V}|} y_t(j)\, \log p_\Theta(j \mid y_0, \ldots, y_{t-1}, x). \qquad (3)$$

We use the notation $y_t(j)$ above, with each $y_t$ a distribution over $\mathcal{V}$, to indicate that we may need the full distribution over words. Note that by replacing the $y_t$ with one-hot distributions we recover the original energy.
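A small numpy sketch of the generalized energy with invented teacher distributions, checking that one-hot rows recover the original negative log-probability energy:

```python
import numpy as np

# Generalized energy (Eqs. 2-3) with toy, fixed teacher distributions:
# each position t contributes e_t = -sum_j ybar_t(j) log p(j | ..., x),
# where ybar_t is a full distribution over the vocabulary.
def generalized_energy(ybar, teacher_dists):
    return sum(-(yt @ np.log(pt)) for yt, pt in zip(ybar, teacher_dists))

teacher = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]

# With one-hot rows, the generalized energy equals the original energy
# -sum_t log p(y_t | ...):
onehot = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
E_onehot = generalized_energy(onehot, teacher)
E_orig = -(np.log(0.7) + np.log(0.8))
print(np.isclose(E_onehot, E_orig))  # True

# A soft ybar (e.g. softmax outputs of an inference network) is also valid:
soft = [np.array([0.6, 0.3, 0.1]), np.array([0.2, 0.7, 0.1])]
print(generalized_energy(soft, teacher))
```

As in the earlier sketch, a real teacher's distribution at step t would also depend on the (relaxed) previous outputs fed into its decoder, which this toy version omits.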
In order to train an inference network to minimize this energy, we simply need a network architecture that can produce a sequence of word distributions, which is satisfied by recent non-autoregressive NMT models (Ghazvininejad et al., 2019). However, because the distributions involved in the original energy are one-hot, it may be advantageous for the inference network, too, to output distributions that are one-hot or approximately so. We will accordingly view inference networks as producing a sequence of logit vectors $z_1, \ldots, z_T$, and we will consider two operators, $\mathbf{O}_{\mathrm{inf}}$ and $\mathbf{O}_{\mathrm{out}}$, that will be used to map these logits into distributions for use in the energy. Figure 1 provides an overview of our approach, including this generalized energy function, the inference network, and the two operators. We describe choices for these operators in the next section.
Table 1: Choices for the operators, shown for a generic $\mathbf{O}$ applied to a logit vector $z$, with the Jacobian approximations used during learning ($g$ denotes Gumbel noise):

  SX:  $q = \mathrm{softmax}(z)$; exact Jacobian
  STL: $q = \mathrm{onehot}(\arg\max(z))$; $\partial q/\partial z \approx I$
  SG:  $q = \mathrm{onehot}(\arg\max(z+g))$; $\partial q/\partial z \approx \partial\,\mathrm{softmax}(z+g)/\partial z$
  ST:  $q = \mathrm{onehot}(\arg\max(z))$; $\partial q/\partial z \approx \partial\,\mathrm{softmax}(z)/\partial z$
  GX:  $q = \mathrm{softmax}(z+g)$; exact Jacobian
3.2 Choices for Operators
We now consider ways of defining the two operators that govern the interface between the inference network and the energy function. As shown in Figure 1, we seek an operator $\mathbf{O}_{\mathrm{inf}}$ to modulate the way that logits output by the inference network are fed to the decoder input slots in the energy function, and an operator $\mathbf{O}_{\mathrm{out}}$ to determine how the logits are used to compute the log probability of a word in the energy. Explicitly, then, we rewrite each local energy term (Eq. 3) as

$$e_t(x, z) = -\sum_{j=1}^{|\mathcal{V}|} \mathbf{O}_{\mathrm{out}}(z_t)(j)\, \log p_\Theta(j \mid \mathbf{O}_{\mathrm{inf}}(z_0), \ldots, \mathbf{O}_{\mathrm{inf}}(z_{t-1}), x)$$

which our inference networks will minimize with respect to the $z_t$.
The choices we consider for $\mathbf{O}_{\mathrm{inf}}$ and $\mathbf{O}_{\mathrm{out}}$, which we present generically for an operator $\mathbf{O}$ and logit vector $z$, are shown in Table 1 and described in more detail below. Some of these operations are not differentiable, and so the Jacobian matrix $\partial \mathbf{O}(z)/\partial z$ must be approximated during learning; we show the approximations we use in Table 1 as well.
We consider five choices for each $\mathbf{O}$:

SX: softmax. Here $q = \mathrm{softmax}(z)$; no Jacobian approximation is necessary.

STL: straight-through logits. Here $q = \mathrm{onehot}(\arg\max(z))$. $\partial q/\partial z$ is approximated by the identity matrix (see Bengio et al. (2013)).
SG: straight-through Gumbel-Softmax. Here $q = \mathrm{onehot}(\arg\max(z + g))$, where $g$ is Gumbel noise.^2 $\partial q/\partial z$ is approximated with $\partial\,\mathrm{softmax}(z + g)/\partial z$ (Jang et al., 2016).

^2 $g_j = -\log(-\log(u_j))$ and $u_j \sim \mathrm{Uniform}(0, 1)$.

ST: straight-through. This setting is identical to SG with $g = 0$ (see Bengio et al. (2013)).

GX: Gumbel-Softmax. Here $q = \mathrm{softmax}(z + g)$, where again $g$ is Gumbel noise; no Jacobian approximation is necessary.
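The five operators can be sketched in numpy as follows (forward passes only; the Jacobian approximations above matter only for backpropagation, which this sketch omits). All values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def onehot(i, n):
    q = np.zeros(n)
    q[i] = 1.0
    return q

def gumbel(shape):
    # g_j = -log(-log(u_j)), u_j ~ Uniform(0, 1)
    u = rng.uniform(size=shape)
    return -np.log(-np.log(u))

# Forward passes of the five operators:
def SX(z):  return softmax(z)                    # no approximation needed
def STL(z): return onehot(np.argmax(z), len(z))  # backward: identity
def SG(z):  zg = z + gumbel(z.shape); return onehot(np.argmax(zg), len(zg))
                                                 # backward: d softmax(z+g)/dz
def ST(z):  return onehot(np.argmax(z), len(z))  # SG with g = 0
def GX(z):  return softmax(z + gumbel(z.shape))  # no approximation needed

z = np.array([2.0, 0.5, -1.0])
for op in (SX, STL, SG, ST, GX):
    q = op(z)
    assert np.isclose(q.sum(), 1.0) and (q >= 0).all()  # each output is a distribution
print(STL(z))  # [1. 0. 0.]
```

In a full implementation these would be written in an autodiff framework so the stated backward approximations can be substituted for the true (zero or undefined) Jacobians of argmax.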


4 Experimental Setup
4.1 Datasets
4.2 Autoregressive Energies
We consider two architectures for the pretrained autoregressive (AR) energy function. The first is an autoregressive sequence-to-sequence (seq2seq) model with attention (Luong et al., 2015). The encoder is a two-layer BiLSTM with 512 units in each direction, the decoder is a two-layer LSTM with 768 units, and the word embedding size is 512. The second is an autoregressive transformer model (Vaswani et al., 2017), where both the encoder and decoder have 6 layers, 8 attention heads per layer, model dimension 512, and hidden dimension 2048.
4.3 Inference Network Architectures
We choose two different architectures: a BiLSTM "tagger" (a two-layer BiLSTM followed by a fully-connected layer) and a conditional masked language model (CMLM; Ghazvininejad et al., 2019), a transformer with 6 layers per stack, 8 attention heads per layer, model dimension 512, and hidden dimension 2048. Both architectures require the target sequence length in advance; methods for handling length are discussed in Sec. 4.5. For baselines, we train these inference network architectures as non-autoregressive models using the standard per-position cross-entropy loss. For faster inference network training, we initialize inference networks with the baselines trained with cross-entropy loss in our experiments.
The baseline CMLMs use the partial masking strategy described by Ghazvininejad et al. (2019). This involves using some masked input tokens and some provided input tokens during training. At test time, multiple iterations ("refinement iterations") can be used for improved results (Ghazvininejad et al., 2019). Each iteration uses partially-masked input from the preceding iteration. We consider the use of multiple refinement iterations for both the CMLM baseline and the CMLM inference network.^3

^3 The CMLM inference network is trained according to Eq. 1 with full masking (no partial masking as in the CMLM baseline). However, since the CMLM inference network is initialized using the CMLM baseline, which is trained using partial masking, the CMLM inference network is still compatible with refinement iterations at test time.
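The refinement loop can be sketched as follows; `toy_predict` is a stand-in for a real CMLM (which would condition on the unmasked tokens and the source sentence), and the mask-fraction schedule is one simple choice for illustration, not necessarily the exact one used by Ghazvininejad et al. (2019).

```python
import numpy as np

# Control-flow sketch of mask-predict refinement: at each iteration,
# re-mask the lowest-confidence positions and re-predict them conditioned
# on the rest.
MASK = -1

def refine(predict_fn, length, iterations):
    tokens = np.full(length, MASK)   # iteration 1: everything is masked
    probs = np.zeros(length)
    for it in range(iterations):
        mask = tokens == MASK
        new_tokens, new_probs = predict_fn(tokens)
        tokens = np.where(mask, new_tokens, tokens)  # fill only masked slots
        probs = np.where(mask, new_probs, probs)
        if it + 1 < iterations:
            # Linearly decaying mask fraction; re-mask the least confident.
            n_mask = int(length * (iterations - it - 1) / iterations)
            if n_mask > 0:
                lowest = np.argsort(probs)[:n_mask]
                tokens[lowest] = MASK
    return tokens

def toy_predict(tokens):
    # Stand-in for a CMLM: fixed per-position distributions over 2 words.
    dists = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
    return dists.argmax(axis=1), dists.max(axis=1)

print(refine(toy_predict, 3, iterations=3))  # [0 1 1]
```

Because `toy_predict` ignores its input, the refinement here only re-confirms the same tokens; with a real conditional model, later iterations can revise low-confidence predictions using the higher-confidence context.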
4.4 Hyperparameters
For inference network training, the batch size is 1024 tokens. We train with the Adam optimizer (Kingma and Ba, 2015) and tune the learning rate. For regularization, we use L2 weight decay with rate 0.01, and dropout with rate 0.1. We train all models for 30 epochs. For the baselines, we train the models with the local cross-entropy loss and do early stopping based on the BLEU score on the dev set. For the inference network, we train the model to minimize the energy (Eq. 1) and do early stopping based on the energy on the dev set.

4.5 Predicting Target Sequence Lengths
Non-autoregressive models often need a target sequence length in advance (Lee et al., 2018). We report results both with oracle lengths and with a simple method of predicting the length. We follow Ghazvininejad et al. (2019) in predicting the length of the translation using a representation of the source sequence from the encoder. The length loss is added to the cross-entropy loss for the target sequence. During decoding, we select the length candidates with the highest probabilities, decode with the different lengths in parallel, and return the translation with the highest average of the log probabilities of its tokens.
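The final selection step can be sketched as follows, with invented candidate translations and token probabilities:

```python
import numpy as np

# Sketch of the length-selection step: decode in parallel with several
# candidate target lengths, then keep the translation whose tokens have
# the highest average log-probability.  Candidates below are toy values.
def pick_by_avg_logprob(candidates):
    """candidates: list of (tokens, per-token log-probs); returns best tokens."""
    scores = [np.mean(logps) for _, logps in candidates]
    return candidates[int(np.argmax(scores))][0]

candidates = [
    (["a", "cat"],               np.log([0.5, 0.4])),
    (["a", "cat", "sat"],        np.log([0.6, 0.5, 0.55])),
    (["the", "cat", "sat", "."], np.log([0.3, 0.5, 0.5, 0.2])),
]
print(pick_by_avg_logprob(candidates))  # ['a', 'cat', 'sat']
```

Averaging (rather than summing) the log probabilities avoids systematically favoring shorter candidates, which is why longer-but-confident outputs can win here.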
5 Results
Table 3: BLEU scores with 1 and 10 refinement iterations.

                 IWSLT14 DE-EN       WMT16 RO-EN
                 1 iter.  10 iter.   1 iter.  10 iter.
  CMLM           28.11    33.39      28.20    33.31
  ENGINE         31.99    33.17      33.16    34.04
Table 4: BLEU scores on IWSLT14 DE-EN and WMT16 RO-EN.

                                                  IWSLT14   WMT16
                                                  DE-EN     RO-EN
  Autoregressive (Transformer)
    Greedy Decoding                               33.00     33.33
    Beam Search                                   34.11     34.07
  Non-autoregressive
    Iterative Refinement (Lee et al., 2018)       --        25.73
    NAT with Fertility (Gu et al., 2018)          --        29.06
    CTC (Libovický and Helcl, 2018)               --        24.71
    FlowSeq (Ma et al., 2019)                     27.55     30.44
    CMLM (Ghazvininejad et al., 2019)             28.25     28.20
    Bag-of-ngrams-based loss (Shao et al., 2020)  --        29.29
    AXE CMLM (Ghazvininejad et al., 2020)         --        31.54
    Imputer-based model (Saharia et al., 2020)    --        31.7
    ENGINE (ours)                                 31.99     33.16
Effect of choices for $\mathbf{O}_{\mathrm{inf}}$ and $\mathbf{O}_{\mathrm{out}}$.
Table 2 compares various choices for the operators $\mathbf{O}_{\mathrm{inf}}$ and $\mathbf{O}_{\mathrm{out}}$. For subsequent experiments, we choose the setting that feeds the whole distribution into the energy function ($\mathbf{O}_{\mathrm{inf}}$ = SX) and computes the loss with straight-through ($\mathbf{O}_{\mathrm{out}}$ = ST). Using Gumbel noise in $\mathbf{O}_{\mathrm{out}}$ has only minimal effect, and rarely helps. Using ST instead also speeds up training by avoiding the noise sampling step.
Training with distilled outputs vs. training with energy.
We compared training non-autoregressive models using the references, using the distilled outputs, and as inference networks (ENGINE) on both datasets. Table 5 in the Appendix shows the results when using BiLSTM inference networks and seq2seq AR energies. The inference networks improve over training with the references by 11.27 BLEU on DE-EN and 12.22 BLEU on RO-EN. In addition, inference networks consistently improve over non-autoregressive networks trained on the distilled outputs.
Impact of refinement iterations.
Ghazvininejad et al. (2019) show improvements with multiple refinement iterations. Table 3 shows refinement results of CMLM and ENGINE. Both improve with multiple iterations, though the improvement is much larger with CMLM. However, even with 10 iterations, ENGINE is comparable to CMLM on DE-EN and outperforms it on RO-EN.
Comparison to other NAT models.
Table 4 shows 1-iteration results on the two datasets. To the best of our knowledge, ENGINE achieves state-of-the-art NAT performance: 31.99 on IWSLT14 DE-EN and 33.16 on WMT16 RO-EN. In addition, ENGINE achieves performance comparable to that of the autoregressive NMT model.
6 Conclusion
We proposed a new method to train non-autoregressive neural machine translation systems by minimizing pretrained energy functions with inference networks. In the future, we seek to expand upon energy-based translation using our method.
Acknowledgments
We would like to thank Graham Neubig for helpful discussions and the reviewers for insightful comments. This research was supported in part by an Amazon Research Award to K. Gimpel.
References
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of International Conference on Learning Representations (ICLR).
 Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

 Chen et al. (2018) Yun Chen, Victor O.K. Li, Kyunghyun Cho, and Samuel Bowman. 2018. A stable and effective learning strategy for trainable greedy decoding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 380–390, Brussels, Belgium. Association for Computational Linguistics.
 Ghazvininejad et al. (2020) Marjan Ghazvininejad, Vladimir Karpukhin, Luke Zettlemoyer, and Omer Levy. 2020. Aligned cross entropy for non-autoregressive machine translation. arXiv preprint arXiv:2004.01655.
 Ghazvininejad et al. (2019) Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-Predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6111–6120, Hong Kong, China. Association for Computational Linguistics.
 Gu et al. (2018) Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In Proceedings of International Conference on Learning Representations (ICLR).
 Gu et al. (2017) Jiatao Gu, Kyunghyun Cho, and Victor O.K. Li. 2017. Trainable greedy decoding for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1968–1978, Copenhagen, Denmark. Association for Computational Linguistics.
 Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with Gumbel-softmax. In Proceedings of International Conference on Learning Representations (ICLR).

 Kaiser et al. (2018) Lukasz Kaiser, Samy Bengio, Aurko Roy, Ashish Vaswani, Niki Parmar, Jakob Uszkoreit, and Noam Shazeer. 2018. Fast decoding in sequence models using discrete latent variables. In International Conference on Machine Learning, pages 2395–2404.
 Kim and Rush (2016) Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.
 Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings.
 Kingma and Welling (2013) Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. In Proceedings of International Conference on Learning Representations (ICLR).
 LeCun et al. (2006) Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu-Jie Huang. 2006. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press.
 Lee et al. (2018) Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182, Brussels, Belgium. Association for Computational Linguistics.
 Libovický and Helcl (2018) Jindřich Libovický and Jindřich Helcl. 2018. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3016–3021, Brussels, Belgium. Association for Computational Linguistics.
 Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.
 Ma et al. (2019) Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. 2019. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4281–4291, Hong Kong, China. Association for Computational Linguistics.

 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286.
 Saharia et al. (2020) Chitwan Saharia, William Chan, Saurabh Saxena, and Mohammad Norouzi. 2020. Non-autoregressive machine translation with latent alignments. arXiv preprint arXiv:2004.07437.
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
 Shao et al. (2020) Chenze Shao, Jinchao Zhang, Yang Feng, Fandong Meng, and Jie Zhou. 2020. Minimizing the bag-of-ngrams difference for non-autoregressive neural machine translation. In AAAI.
 Sun et al. (2019) Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, and Zhihong Deng. 2019. Fast structured decoding for sequence models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 3016–3026. Curran Associates, Inc.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.
 Tu and Gimpel (2018) Lifu Tu and Kevin Gimpel. 2018. Learning approximate inference networks for structured prediction. In Proceedings of International Conference on Learning Representations (ICLR).
 Tu and Gimpel (2019) Lifu Tu and Kevin Gimpel. 2019. Benchmarking approximate inference methods for neural structured prediction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3313–3324, Minneapolis, Minnesota. Association for Computational Linguistics.
 Tu et al. (2019) Lifu Tu, Richard Yuanzhe Pang, and Kevin Gimpel. 2019. Improving joint training of inference networks and structured prediction energy networks. arXiv preprint arXiv:1911.02891.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
 Zhou et al. (2020) Chunting Zhou, Jiatao Gu, and Graham Neubig. 2020. Understanding knowledge distillation in non-autoregressive machine translation. In International Conference on Learning Representations (ICLR).
Appendix A Appendix
A.1 Training with distilled outputs vs. training with energy
In order to compare ENGINE with training on distilled outputs, we train BiLSTM models in three ways: "baseline", which is trained with the human-written reference translations; "distill", which is trained with the distilled outputs (generated using the autoregressive models); and "ENGINE", our method, which trains the BiLSTM as an inference network to minimize the pretrained seq2seq autoregressive energy. Oracle lengths are used for decoding. Table 5 shows test results for both datasets, showing significant gains of ENGINE over the baseline and distill methods. Although the results shown here are lower than the transformer results, the trend is clear.
Table 5: Test results with BiLSTM inference networks and seq2seq AR energies (oracle lengths).

             IWSLT14 DE-EN           WMT16 RO-EN
             Energy (↓)  BLEU (↑)    Energy (↓)  BLEU (↑)
  baseline   153.54      8.28        175.94      9.47
  distill    112.36      14.58       205.71      5.76
  ENGINE     51.98       19.55       64.03       21.69
A.2 Analysis of Translation Results
Source: 
seful onu a solicitat din nou tuturor partilor , inclusiv consiliului de securitate onu divizat sa se unifice si sa sustina negocierile pentru a gasi o solutie politica . 
Reference : 
the u.n. chief again urged all parties , including the divided u.n. security council , to unite and support inclusive negotiations to find a political solution . 
CMLM : 
the un chief again again urged all parties , including the divided un security council to unify and support negotiations in order to find a political solution . 
ENGINE : 
the un chief has again urged all parties , including the divided un security council to unify and support negotiations in order to find a political solution . 
Source: 
adevarul este ca a rupt o racheta atunci cand a pierdut din cauza ca a acuzat crampe in us , insa nu este primul jucator care rupe o racheta din frustrare fata de el insusi si il cunosc pe thanasi suficient de bine incat sa stiu ca nu s @@ ar mandri cu asta . 
Reference : 
he did break a racquet when he lost when he cramped in the us , but he 's not the first player to break a racquet out of frustration with himself , and i know thanasi well enough to know he wouldn 't be proud of that . 
CMLM : 
the truth is that it has broken a rocket when it lost because accused crcrpe in the us , but it is not the first player to break rocket rocket rocket frustration frustration himself himself and i know thanthanasi enough enough know know he would not be proud of that . 
ENGINE : 
the truth is that it broke a rocket when it lost because he accused crpe in the us , but it is not the first player to break a rocket from frustration with himself and i know thanasi well well enough to know he would not be proud of it . 
Source: 
realizatorii studiului mai transmit ca " romanii simt nevoie de ceva mai multa aventura in viata lor ( 24 % ) , urmat de afectiune ( 21 % ) , bani ( 21 % ) , siguranta ( 20 % ) , nou ( 19 % ) , sex ( 19 % ) , respect 18 % , incredere 17 % , placere 17 % , conectare 17 % , cunoastere 16 % , protectie 14 % , importanta 14 % , invatare 12 % , libertate 11 % , autocunoastere 10 % si control 7 % " . 
Reference : 
the study 's conductors transmit that " romanians feel the need for a little more adventure in their lives ( 24 % ) , followed by affection ( 21 % ) , money ( 21 % ) , safety ( 20 % ) , new things ( 19 % ) , sex ( 19 % ) respect 18 % , confidence 17 % , pleasure 17 % , connection 17 % , knowledge 16 % , protection 14 % , importance 14 % , learning 12 % , freedom 11 % , self @@ awareness 10 % and control 7 % . " 
CMLM : 
survey survey makers say that ' romanians romanians some something adventadventure ure their lives 24 24 % ) followed followed by % % % % % , ( 21 % % ), safety ( % % % ), new19% % ), ), 19 % % % ), respect 18 % % % % % % % % , , % % % % % % % , , % , 14 % , 12 % % 
ENGINE : 
realisation of the survey say that ' romanians feel a slightly more adventure in their lives ( 24 % ) followed by aff% ( 21 % ) , money ( 21 % ), safety ( 20 % ) , new 19 % ) , sex ( 19 % ) , respect 18 % , confidence 17 % , 17 % , connecting 17 % , knowledge % % , 14 % , 14 % , 12 % % 
In Table 6, we present randomly chosen translation outputs from WMT16 RO-EN. For each Romanian source sentence, we show the reference from the dataset, the translation from CMLM, and the translation from ENGINE. We observe that, without refinement iterations, CMLM performs well on shorter source sentences but still tends to generate repeated tokens. ENGINE, on the other hand, generates much better translations with fewer repeated tokens.