ENGINE: Energy-Based Inference Networks for Non-Autoregressive Machine Translation

We propose to train a non-autoregressive machine translation model to minimize the energy defined by a pretrained autoregressive model. In particular, we view our non-autoregressive translation system as an inference network (Tu and Gimpel, 2018) trained to minimize the autoregressive teacher energy. This contrasts with the popular approach of training a non-autoregressive model on a distilled corpus consisting of the beam-searched outputs of such a teacher model. Our approach, which we call ENGINE (ENerGy-based Inference NEtworks), achieves state-of-the-art non-autoregressive results on the IWSLT 2014 DE-EN and WMT 2016 RO-EN datasets, approaching the performance of autoregressive models.


1 Introduction

The performance of non-autoregressive neural machine translation (NAT) systems, which predict tokens in the target language independently of each other conditioned on the source sentence, has been improving steadily in recent years (Lee et al., 2018; Ghazvininejad et al., 2019; Ma et al., 2019). One common ingredient in getting non-autoregressive systems to perform well is to train them on a corpus of distilled translations (Kim and Rush, 2016). This distilled corpus consists of source sentences paired with the translations produced by a pretrained autoregressive “teacher” system.

As an alternative to training non-autoregressive translation systems on distilled corpora, we instead propose to train them to minimize the energy defined by a pretrained autoregressive teacher model. That is, we view non-autoregressive machine translation systems as inference networks (Tu and Gimpel, 2018, 2019; Tu et al., 2019) trained to minimize the teacher’s energy. This provides the non-autoregressive model with additional information related to the energy of the teacher, rather than just the approximate minimizers of the teacher’s energy appearing in a distilled corpus.

In order to train inference networks to minimize an energy function, the energy must be differentiable with respect to the inference network output. We describe several approaches for relaxing the autoregressive teacher’s energy to make it amenable to minimization with an inference network, and compare them empirically. We experiment with two non-autoregressive inference network architectures, one based on bidirectional RNNs and the other based on the transformer model of Ghazvininejad et al. (2019).

In experiments on the IWSLT 2014 DE-EN and WMT 2016 RO-EN datasets, we show that training to minimize the teacher’s energy significantly outperforms training with distilled outputs. Our approach, which we call ENGINE (ENerGy-based Inference NEtworks), achieves state-of-the-art results for non-autoregressive translation on these datasets, approaching the results of the autoregressive teachers. Our hope is that ENGINE will enable energy-based models to be applied more broadly for non-autoregressive generation in the future.

2 Related Work

Non-autoregressive neural machine translation began with the work of Gu et al. (2018), who found that using knowledge distillation (Hinton et al., 2015), and in particular sequence-level distilled outputs (Kim and Rush, 2016), improved performance. Subsequent work has continued to narrow the gap between non-autoregressive and autoregressive translation, including multi-iteration refinements (Lee et al., 2018; Ghazvininejad et al., 2019; Saharia et al., 2020) and rescoring with autoregressive models (Kaiser et al., 2018; Ma et al., 2019; Sun et al., 2019). Recently, Ghazvininejad et al. (2020) and Saharia et al. (2020) proposed aligned cross entropy and latent alignment models, which achieve the highest BLEU scores of all non-autoregressive models without refinement or rescoring. In our work, we propose training inference networks with autoregressive energy and outperform the best purely non-autoregressive methods.

Another related approach trains an “actor” network to manipulate the hidden state of an autoregressive neural MT system (Gu et al., 2017; Chen et al., 2018; Zhou et al., 2020) in order to bias it toward outputs with better BLEU scores. This work modifies the original pretrained network rather than using it to define an energy for training an inference network.

Energy-based models have had limited application in text generation due to the computational challenges involved in learning and inference in extremely large search spaces. The use of inference networks to output approximate minimizers of a loss function is popular in variational inference (Kingma and Welling, 2013; Rezende et al., 2014), and, more recently, in structured prediction (Tu and Gimpel, 2018, 2019; Tu et al., 2019).

3 Energy-Based Inference Networks for Non-Autoregressive NMT

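Most neural machine translation (NMT) systems model the conditional distribution $p_\Theta(\mathbf{y} \mid \mathbf{x})$ of a target sequence $\mathbf{y} = \langle y_1, y_2, \ldots, y_T \rangle$ given a source sequence $\mathbf{x} = \langle x_1, x_2, \ldots, x_{T_s} \rangle$, where each $y_t$ comes from a vocabulary $\mathcal{V}$, $y_T$ is $\langle\text{eos}\rangle$, and $y_0$ is $\langle\text{bos}\rangle$. It is common in NMT to define this conditional distribution using an “autoregressive” factorization (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017):

$$\log p_\Theta(\mathbf{y} \mid \mathbf{x}) = \sum_{t=1}^{|\mathbf{y}|} \log p_\Theta(y_t \mid y_{0:t-1}, \mathbf{x})$$

This model can be viewed as an energy-based model (LeCun et al., 2006) by defining the energy function $E_\Theta(\mathbf{x}, \mathbf{y}) = -\log p_\Theta(\mathbf{y} \mid \mathbf{x})$. Given trained parameters $\Theta$, test time inference seeks to find the translation for a given source sentence $\mathbf{x}$ with the lowest energy: $\hat{\mathbf{y}} = \arg\min_{\mathbf{y}} E_\Theta(\mathbf{x}, \mathbf{y})$.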
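To make the energy view concrete, the following is a minimal sketch, not the paper's implementation. It assumes a hypothetical toy teacher (`NEXT_PROBS`) whose next-token probabilities depend only on the previous token and ignore the source sentence; `energy` and `brute_force_argmin` are illustrative names. It computes the energy as a negative sum of log-probabilities and finds the lowest-energy sequence by exhaustive enumeration, which is the combinatorial search a real system cannot afford.

```python
import itertools
import math

# Hypothetical toy teacher: p(next | previous token). A real teacher conditions
# on the whole prefix y_{0:t-1} and on the source sentence x.
NEXT_PROBS = {
    "<bos>": {"a": 0.7, "b": 0.2, "<eos>": 0.1},
    "a": {"a": 0.1, "b": 0.6, "<eos>": 0.3},
    "b": {"a": 0.2, "b": 0.1, "<eos>": 0.7},
}


def energy(tokens):
    """Return -sum_t log p(y_t | y_{t-1}) for tokens y_1..y_T, with y_0 = <bos>."""
    prev, total = "<bos>", 0.0
    for tok in tokens:
        total -= math.log(NEXT_PROBS[prev][tok])
        prev = tok
    return total


def brute_force_argmin(max_len):
    """Enumerate all sequences of at most max_len tokens that end in <eos>."""
    candidates = []
    for n in range(max_len):
        for body in itertools.product(["a", "b"], repeat=n):
            candidates.append(body + ("<eos>",))
    return min(candidates, key=energy)


best = brute_force_argmin(4)
print(best, round(energy(best), 3))
```

The number of candidates doubles with every extra position even in this two-word toy, which is why exact minimization is replaced by beam search or, in this paper, by a trained inference network.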

Finding the translation that minimizes the energy involves combinatorial search. In this paper, we train inference networks to perform this search approximately. The idea of this approach is to replace the test time combinatorial search typically employed in structured prediction with the output of a network trained to produce approximately optimal predictions (Tu and Gimpel, 2018, 2019). More formally, we define an inference network $A_\Psi$ which maps an input $\mathbf{x}$ to a translation $\mathbf{y}$ and is trained with the goal that $A_\Psi(\mathbf{x}) \approx \arg\min_{\mathbf{y}} E_\Theta(\mathbf{x}, \mathbf{y})$.

Specifically, we train the inference network parameters $\Psi$ as follows (assuming $\Theta$ is pretrained and fixed):

$$\hat{\Psi} = \arg\min_{\Psi} \sum_{\langle \mathbf{x}, \mathbf{y} \rangle \in \mathcal{D}} E_\Theta(\mathbf{x}, A_\Psi(\mathbf{x})) \tag{1}$$

where $\mathcal{D}$ is a training set of sentence pairs. The network architecture of $A_\Psi$ can be different from the architectures used in the energy function. In this paper, we combine an autoregressive energy function with a non-autoregressive inference network. By doing so, we seek to combine the effectiveness of the autoregressive energy with the fast inference speed of a non-autoregressive network.

3.1 Energies for Inference Network Training

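Figure 1: The ENGINE framework trains a non-autoregressive inference network $A_\Psi$ to produce translations with low energy under a pretrained autoregressive energy $E_\Theta$.

In order to allow for gradient-based optimization of the inference network parameters $\Psi$, we now define a more general family of energy functions for NMT. First, we change the representation of the translation $\mathbf{y}$ in the energy, redefining $\mathbf{y} = \langle \mathbf{y}_0, \ldots, \mathbf{y}_{|\mathbf{y}|} \rangle$ as a sequence of distributions over words instead of a sequence of words.

In particular, we consider the generalized energy

$$E_\Theta(\mathbf{x}, \mathbf{y}) = \sum_{t=1}^{|\mathbf{y}|} e_t(\mathbf{x}, \mathbf{y}) \tag{2}$$

where

$$e_t(\mathbf{x}, \mathbf{y}) = -\mathbf{y}_t^\top \log p_\Theta(\cdot \mid \mathbf{y}_0, \mathbf{y}_1, \ldots, \mathbf{y}_{t-1}, \mathbf{x}) \tag{3}$$

We use the $\cdot$ notation in $p_\Theta(\cdot \mid \ldots)$ above to indicate that we may need the full distribution over words. Note that by replacing the $\mathbf{y}_t$ with one-hot distributions we recover the original energy.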
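The claim that one-hot distributions recover the original energy can be checked on a small example. The sketch below is illustrative only: it assumes a hypothetical three-word teacher (`TRANSITIONS`) that conditions only on the previous position and accepts a distribution as input by mixing its rows, whereas a real decoder would consume a weighted average of word embeddings and condition on the whole prefix and the source.

```python
import math

# Hypothetical toy teacher over a 3-word vocabulary; row v is p(. | previous word = v).
# Word 0 doubles as <bos> purely to keep the example small.
TRANSITIONS = [
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.5, 0.3, 0.2],
]


def next_distribution(prev_dist):
    """Feed a distribution over the previous word by mixing the rows of TRANSITIONS."""
    return [sum(prev_dist[v] * TRANSITIONS[v][w] for v in range(3)) for w in range(3)]


def generalized_energy(dists, bos=(1.0, 0.0, 0.0)):
    """Sum over t of -y_t^T log p(. | y_{t-1}), where every y_t is a distribution."""
    total, prev = 0.0, list(bos)
    for y_t in dists:
        p = next_distribution(prev)
        total -= sum(y_t[w] * math.log(p[w]) for w in range(3))
        prev = y_t
    return total


def one_hot(i, n=3):
    return [1.0 if j == i else 0.0 for j in range(n)]


# One-hot inputs reduce to the ordinary negative log-likelihood of the word sequence 1, 2.
hard = generalized_energy([one_hot(1), one_hot(2)])
discrete = -(math.log(TRANSITIONS[0][1]) + math.log(TRANSITIONS[1][2]))
print(abs(hard - discrete) < 1e-9)

# Soft inputs are also valid arguments, which is what makes gradient-based training possible.
print(generalized_energy([[0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]))
```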

In order to train an inference network to minimize this energy, we simply need a network architecture that can produce a sequence of word distributions, which is satisfied by recent non-autoregressive NMT models (Ghazvininejad et al., 2019). However, because the distributions involved in the original energy are one-hot, it may be advantageous for the inference network too to output distributions that are one-hot or approximately so. We will accordingly view inference networks as producing a sequence of logit vectors $\mathbf{z}_t \in \mathbb{R}^{|\mathcal{V}|}$, and we will consider two operators $O_1$ and $O_2$ that will be used to map these logits into distributions for use in the energy. Figure 1 provides an overview of our approach, including this generalized energy function, the inference network, and the two operators $O_1$ and $O_2$. We describe choices for these operators in the next section.

Table 1: Let $O(\mathbf{z})$ be the result of applying an $O_1$ or $O_2$ operation to logits $\mathbf{z}$ output by the inference network. Also let $\tilde{\mathbf{z}} = \mathbf{z} + \mathbf{g}$, where $\mathbf{g}$ is Gumbel noise, $\mathbf{q} = \mathrm{softmax}(\mathbf{z})$, and $\tilde{\mathbf{q}} = \mathrm{softmax}(\tilde{\mathbf{z}})$. We show the Jacobian (approximation) $\partial O(\mathbf{z}) / \partial \mathbf{z}$ we use when computing $\frac{\partial \ell_{\text{loss}}}{\partial \mathbf{z}} = \frac{\partial \ell_{\text{loss}}}{\partial O(\mathbf{z})} \frac{\partial O(\mathbf{z})}{\partial \mathbf{z}}$, for each $O(\mathbf{z})$ considered.

3.2 Choices for Operators

We now consider ways of defining the two operators that govern the interface between the inference network and the energy function. As shown in Figure 1, we seek an operator $O_1$ to modulate the way that logits $\mathbf{z}_t$ output by the inference network are fed to the decoder input slots in the energy function, and an operator $O_2$ to determine how the distribution $p_\Theta(\cdot \mid \ldots)$ is used to compute the log probability of a word in $\mathbf{y}$. Explicitly, then, we rewrite each local energy term (Eq. 3) as

$$e_t(\mathbf{x}, \mathbf{y}) = -O_2(\mathbf{z}_t)^\top \log p_\Theta(\cdot \mid O_1(\mathbf{z}_0), O_1(\mathbf{z}_1), \ldots, O_1(\mathbf{z}_{t-1}), \mathbf{x}) \tag{4}$$

which our inference networks will minimize with respect to the $\mathbf{z}_t$.

The choices we consider for $O_1$ and $O_2$, which we present generically for operator $O$ and logit vector $\mathbf{z}$, are shown in Table 1, and described in more detail below. Some of these operations are not differentiable, and so the Jacobian matrix $\partial O(\mathbf{z}) / \partial \mathbf{z}$ must be approximated during learning; we show the approximations we use in Table 1 as well.
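As a worked case of a differentiable operator, consider the softmax, for which the Jacobian has a standard closed form. Writing $\mathbf{q} = \mathrm{softmax}(\mathbf{z})$, $\delta_{ij}$ for the Kronecker delta, and $\ell$ for the training loss (notation assumed here for illustration), the chain rule gives:

$$\frac{\partial q_i}{\partial z_j} = q_i \, (\delta_{ij} - q_j), \qquad \frac{\partial \ell}{\partial \mathbf{z}} = \frac{\partial \ell}{\partial O(\mathbf{z})} \, \frac{\partial O(\mathbf{z})}{\partial \mathbf{z}}$$

A hard operator such as a one-hot argmax is piecewise constant in $\mathbf{z}$, so its true Jacobian is zero wherever it is defined and carries no learning signal. A straight-through estimator keeps the hard value in the forward pass and substitutes a surrogate matrix for the second factor in the backward pass.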

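We consider five choices for each $O$:

  • SX: softmax. Here $O(\mathbf{z}) = \mathrm{softmax}(\mathbf{z})$; no Jacobian approximation is necessary.

  • STL: straight-through logits. Here $O(\mathbf{z}) = \mathrm{onehot}(\arg\max_i \mathbf{z})$. $\partial O(\mathbf{z}) / \partial \mathbf{z}$ is approximated by the identity matrix $I$ (see Bengio et al. (2013)).

  • SG: straight-through Gumbel-Softmax. Here $O(\mathbf{z}) = \mathrm{onehot}(\arg\max_i \mathrm{softmax}(\mathbf{z} + \mathbf{g}))$, where $g_i$ is Gumbel noise ($g_i = -\log(-\log(u_i))$ and $u_i \sim \mathrm{Uniform}(0, 1)$). $\partial O(\mathbf{z}) / \partial \mathbf{z}$ is approximated with the Jacobian of $\mathrm{softmax}(\mathbf{z} + \mathbf{g})$ (Jang et al., 2016).

  • ST: straight-through. This setting is identical to SG with $\mathbf{g} = \mathbf{0}$ (see Bengio et al. (2013)).

  • GX: Gumbel-Softmax. Here $O(\mathbf{z}) = \mathrm{softmax}(\mathbf{z} + \mathbf{g})$, where again $g_i$ is Gumbel noise; no Jacobian approximation is necessary.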
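The forward values of the five operators can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' code: `apply_operator` and its helpers are hypothetical names, and only the forward pass is shown. The Jacobian approximations listed above apply to the backward pass and would need an automatic differentiation framework.

```python
import math
import random


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def onehot_argmax(z):
    k = max(range(len(z)), key=lambda i: z[i])
    return [1.0 if i == k else 0.0 for i in range(len(z))]


def gumbel_noise(n, rng):
    # g_i = -log(-log(u_i)) with u_i ~ Uniform(0, 1); clamp u away from 0 to avoid log(0).
    return [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in range(n)]


def apply_operator(name, z, rng=random):
    """Forward value O(z) for the operator names used in the list above."""
    if name == "SX":
        return softmax(z)
    if name == "STL":
        return onehot_argmax(z)
    if name == "ST":
        # Same forward value as STL; the two differ only in the backward surrogate.
        return onehot_argmax(softmax(z))
    if name in ("SG", "GX"):
        z_tilde = [zi + gi for zi, gi in zip(z, gumbel_noise(len(z), rng))]
        return onehot_argmax(z_tilde) if name == "SG" else softmax(z_tilde)
    raise ValueError(f"unknown operator: {name}")


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
for name in ("SX", "STL", "SG", "ST", "GX"):
    print(name, apply_operator(name, logits, rng))
```

SX and GX return dense distributions, while STL, SG, and ST return one-hot vectors. Only SG and GX are stochastic, so repeated calls with the same logits can select different words.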

(a) seq2seq AR energy, BiLSTM inference networks

O1 \ O2   SX           STL          SG           ST           GX
SX        55 (20.2)    256 (0)      56 (19.6)    55 (20.1)    55 (19.6)
STL       97 (14.8)    164 (8.2)    94 (13.7)    95 (14.6)    190 (0)
SG        82 (15.2)    206 (0)      81 (14.7)    82 (15.0)    83 (13.5)
ST        81 (14.7)    170 (0)      81 (14.4)    80 (14.3)    83 (13.7)
GX        53 (19.8)    201 (0)      56 (18.3)    54 (19.6)    55 (19.4)

(b) transformer AR energy, CMLM inference networks

O1 \ O2   SX           STL          SG           ST           GX
SX        80 (31.7)    133 (27.8)   81 (31.5)    80 (31.7)    81 (31.6)
STL       186 (25.3)   133 (27.8)   95 (20.0)    97 (30.1)    180 (26.0)
SG        98 (30.1)    133 (27.8)   95 (30.1)    97 (30.0)    97 (29.8)
ST        98 (30.2)    133 (27.8)   95 (30.0)    97 (30.1)    97 (30.0)
GX        81 (31.5)    133 (27.8)   81 (31.2)    81 (31.5)    81 (31.4)

Table 2: Comparison of operator choices in terms of energies (BLEU scores) on the IWSLT14 DE-EN dev set with two energy/inference network combinations. Oracle lengths are used for decoding. $O_1$ is the operation for feeding inference network outputs into the decoder input slots in the energy. $O_2$ is the operation for computing the energy on the output. Each row corresponds to the same $O_1$, and each column corresponds to the same $O_2$.

4 Experimental Setup

4.1 Datasets

We evaluate our methods on two datasets: IWSLT14 German (DE) → English (EN) and WMT16 Romanian (RO) → English (EN). All data are tokenized and then segmented into subword units using byte-pair encoding (Sennrich et al., 2016). We use the data provided by Lee et al. (2018) for RO-EN.

4.2 Autoregressive Energies

We consider two architectures for the pretrained autoregressive (AR) energy function. The first is an autoregressive sequence-to-sequence (seq2seq) model with attention (Luong et al., 2015). The encoder is a two-layer BiLSTM with 512 units in each direction, the decoder is a two-layer LSTM with 768 units, and the word embedding size is 512. The second is an autoregressive transformer model (Vaswani et al., 2017), where both the encoder and decoder have 6 layers, 8 attention heads per layer, model dimension 512, and hidden dimension 2048.

4.3 Inference Network Architectures

We choose two different architectures: a BiLSTM “tagger” (a 2-layer BiLSTM followed by a fully-connected layer) and a conditional masked language model (CMLM; Ghazvininejad et al., 2019), a transformer with 6 layers per stack, 8 attention heads per layer, model dimension 512, and hidden dimension 2048. Both architectures require the target sequence length in advance; methods for handling length are discussed in Sec. 4.5. For baselines, we train these inference network architectures as non-autoregressive models using the standard per-position cross-entropy loss. For faster inference network training, we initialize inference networks with the baselines trained with cross-entropy loss in our experiments.

The baseline CMLMs use the partial masking strategy described by Ghazvininejad et al. (2019). This involves using some masked input tokens and some provided input tokens during training. At test time, multiple iterations (“refinement iterations”) can be used for improved results (Ghazvininejad et al., 2019). Each iteration uses partially-masked input from the preceding iteration. We consider the use of multiple refinement iterations for both the CMLM baseline and the CMLM inference network. (The CMLM inference network is trained according to Eq. 1 with full masking, not the partial masking used for the CMLM baseline. However, since the CMLM inference network is initialized using the CMLM baseline, which is trained using partial masking, the CMLM inference network is still compatible with refinement iterations at test time.)

4.4 Hyperparameters

For inference network training, the batch size is 1024 tokens. We train with the Adam optimizer (Kingma and Ba, 2015) and tune the learning rate. For regularization, we use L2 weight decay with rate 0.01, and dropout with rate 0.1. We train all models for 30 epochs. For the baselines, we train the models with local cross entropy loss and do early stopping based on the BLEU score on the dev set. For the inference network, we train the model to minimize the energy (Eq. 1) and do early stopping based on the energy on the dev set.

4.5 Predicting Target Sequence Lengths

Non-autoregressive models often need a target sequence length in advance (Lee et al., 2018). We report results both with oracle lengths and with a simple method of predicting them. We follow Ghazvininejad et al. (2019) in predicting the length of the translation using a representation of the source sequence from the encoder. The length loss is added to the cross-entropy loss for the target sequence. During decoding, we select the top $k$ length candidates with the highest probabilities, decode with the different lengths in parallel, and return the translation with the highest average of log probabilities of its tokens.
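The length-candidate procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: `top_k_lengths`, `pick_translation`, and the `decode` callable are hypothetical names, and the default `k=3` is an assumption chosen to match the length beam size reported for ENGINE in Table 3. Candidates are decoded sequentially here, whereas the paper decodes them in parallel.

```python
def top_k_lengths(length_probs, k):
    """Return the k most probable target lengths from a {length: probability} dict."""
    return sorted(length_probs, key=length_probs.get, reverse=True)[:k]


def pick_translation(length_probs, decode, k=3):
    """Decode once per candidate length; keep the output with the best average token log-prob.

    `decode(length)` is a hypothetical callable returning (tokens, token_log_probs).
    """
    best_score, best_tokens = None, None
    for length in top_k_lengths(length_probs, k):
        tokens, log_probs = decode(length)
        score = sum(log_probs) / len(log_probs)  # average, so longer outputs are not penalized
        if best_score is None or score > best_score:
            best_score, best_tokens = score, tokens
    return best_tokens


# Toy stand-in for a non-autoregressive decoder with made-up token log-probabilities.
FAKE_OUTPUTS = {
    3: (["a", "b", "<eos>"], [-1.0, -1.0, -1.0]),
    4: (["a", "b", "c", "<eos>"], [-0.2, -0.2, -0.2, -0.2]),
    5: (["a", "b", "c", "c", "<eos>"], [-0.5, -0.5, -0.5, -0.5, -0.5]),
}
length_probs = {3: 0.5, 4: 0.3, 5: 0.15, 6: 0.05}
print(pick_translation(length_probs, FAKE_OUTPUTS.__getitem__))
```

Averaging rather than summing the token log-probabilities keeps the comparison fair across candidates of different lengths, since a plain sum would systematically favor shorter outputs.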

5 Results

               IWSLT14 DE-EN       WMT16 RO-EN
# iterations   1        10         1        10
CMLM           28.11    33.39      28.20    33.31
ENGINE         31.99    33.17      33.16    34.04

Table 3: Test BLEU scores of non-autoregressive models using no refinement (# iterations = 1) and using refinement (# iterations = 10). Note that the # iterations = 1 results are purely non-autoregressive. ENGINE uses a CMLM as the inference network architecture and the transformer AR energy. The length beam size is 5 for CMLM and 3 for ENGINE.
Method                                          IWSLT14 DE-EN   WMT16 RO-EN
Autoregressive (Transformer)
  Greedy Decoding                               33.00           33.33
  Beam Search                                   34.11           34.07

Iterative Refinement (Lee et al., 2018)         -               25.73
NAT with Fertility (Gu et al., 2018)            -               29.06
CTC (Libovický and Helcl, 2018)                 -               24.71
FlowSeq (Ma et al., 2019)                       27.55           30.44

CMLM (Ghazvininejad et al., 2019)               28.25           28.20
Bag-of-ngrams-based loss (Shao et al., 2020)    -               29.29
AXE CMLM (Ghazvininejad et al., 2020)           -               31.54
Imputer-based model (Saharia et al., 2020)      -               31.7
ENGINE (ours)                                   31.99           33.16

Table 4: BLEU scores on two datasets for several non-autoregressive methods. The inference network architecture is the CMLM. For methods that permit multiple refinement iterations (CMLM, AXE CMLM, ENGINE), one decoding iteration is used (meaning the methods are purely non-autoregressive). Results are from the corresponding papers.

Effect of choices for $O_1$ and $O_2$.

Table 2 compares various choices for the operations $O_1$ and $O_2$. For subsequent experiments, we choose the setting that feeds the whole distribution into the energy function ($O_1$ = SX) and computes the loss with straight-through ($O_2$ = ST). Using Gumbel noise in $O_2$ has only minimal effect, and rarely helps. Using ST instead also speeds up training by avoiding the noise sampling step.

Training with distilled outputs vs. training with energy.

We compared training non-autoregressive models using the references, distilled outputs, and as inference networks on both datasets. Table 5 in the Appendix shows the results when using BiLSTM inference networks and seq2seq AR energies. The inference networks improve over training with the references by 11.27 BLEU on DE-EN and 12.22 BLEU on RO-EN. In addition, inference networks consistently improve over non-autoregressive networks trained on the distilled outputs.

Impact of refinement iterations.

Ghazvininejad et al. (2019) show improvements with multiple refinement iterations. Table 3 shows refinement results of CMLM and ENGINE. Both improve with multiple iterations, though the improvement is much larger with CMLM. However, even with 10 iterations, ENGINE is comparable to CMLM on DE-EN and outperforms it on RO-EN.

Comparison to other NAT models.

Table 4 shows 1-iteration results on two datasets. To the best of our knowledge, ENGINE achieves state-of-the-art NAT performance: 31.99 on IWSLT14 DE-EN and 33.16 on WMT16 RO-EN. In addition, ENGINE achieves comparable performance with the autoregressive NMT model.

6 Conclusion

We proposed a new method to train non-autoregressive neural machine translation systems via minimizing pretrained energy functions with inference networks. In the future, we seek to expand upon energy-based translation using our method.


We would like to thank Graham Neubig for helpful discussions and the reviewers for insightful comments. This research was supported in part by an Amazon Research Award to K. Gimpel.


Appendix A

A.1 Training with distilled outputs vs. training with energy

In order to compare ENGINE with training on distilled outputs, we train BiLSTM models in three ways: “baseline” which is trained with the human-written reference translations, “distill” which is trained with the distilled outputs (generated using the autoregressive models), and “ENGINE”, our method which trains the BiLSTM as an inference network to minimize the pretrained seq2seq autoregressive energy. Oracle lengths are used for decoding. Table 5 shows test results for both datasets, showing significant gains of ENGINE over the baseline and distill methods. Although the results shown here are lower than the transformer results, the trend is clearly indicated.

            IWSLT14 DE-EN             WMT16 RO-EN
            Energy (↓)   BLEU (↑)     Energy (↓)   BLEU (↑)
baseline    153.54       8.28         175.94       9.47
distill     112.36       14.58        205.71       5.76
ENGINE      51.98        19.55        64.03        21.69

Table 5: Test results of non-autoregressive models when training with the references (“baseline”), distilled outputs (“distill”), and energy (“ENGINE”). Oracle lengths are used for decoding. Here, ENGINE uses BiLSTM inference networks and pretrained seq2seq AR energies. ENGINE outperforms training on both the references and a pseudocorpus.

A.2 Analysis of Translation Results

Source :
seful onu a solicitat din nou tuturor partilor , inclusiv consiliului de securitate onu divizat sa se unifice si sa sustina negocierile pentru a gasi o solutie politica .
Reference :
the u.n. chief again urged all parties , including the divided u.n. security council , to unite and support inclusive negotiations to find a political solution .
CMLM :
the un chief again again urged all parties , including the divided un security council to unify and support negotiations in order to find a political solution .
ENGINE :
the un chief has again urged all parties , including the divided un security council to unify and support negotiations in order to find a political solution .

Source :
adevarul este ca a rupt o racheta atunci cand a pierdut din cauza ca a acuzat crampe in us , insa nu este primul jucator care rupe o racheta din frustrare fata de el insusi si il cunosc pe thanasi suficient de bine incat sa stiu ca nu s @-@ ar mandri cu asta .
Reference :
he did break a racquet when he lost when he cramped in the us , but he 's not the first player to break a racquet out of frustration with himself , and i know thanasi well enough to know he wouldn 't be proud of that .
CMLM :
the truth is that it has broken a rocket when it lost because accused crcrpe in the us , but it is not the first player to break rocket rocket rocket frustration frustration himself himself and i know thanthanasi enough enough know know he would not be proud of that .
ENGINE :
the truth is that it broke a rocket when it lost because he accused crpe in the us , but it is not the first player to break a rocket from frustration with himself and i know thanasi well well enough to know he would not be proud of it .

Source :
realizatorii studiului mai transmit ca " romanii simt nevoie de ceva mai multa aventura in viata lor ( 24 % ) , urmat de afectiune ( 21 % ) , bani ( 21 % ) , siguranta ( 20 % ) , nou ( 19 % ) , sex ( 19 % ) , respect 18 % , incredere 17 % , placere 17 % , conectare 17 % , cunoastere 16 % , protectie 14 % , importanta 14 % , invatare 12 % , libertate 11 % , autocunoastere 10 % si control 7 % " .
Reference :
the study 's conductors transmit that " romanians feel the need for a little more adventure in their lives ( 24 % ) , followed by affection ( 21 % ) , money ( 21 % ) , safety ( 20 % ) , new things ( 19 % ) , sex ( 19 % ) respect 18 % , confidence 17 % , pleasure 17 % , connection 17 % , knowledge 16 % , protection 14 % , importance 14 % , learning 12 % , freedom 11 % , self @-@ awareness 10 % and control 7 % . "
CMLM :
survey survey makers say that ' romanians romanians some something adventadventure ure their lives 24 24 % ) followed followed by % % % % % , ( 21 % % ), safety ( % % % ), new19% % ), ), 19 % % % ), respect 18 % % % % % % % % , , % % % % % % % , , % , 14 % , 12 % %
ENGINE :
realisation of the survey say that ' romanians feel a slightly more adventure in their lives ( 24 % ) followed by aff% ( 21 % ) , money ( 21 % ), safety ( 20 % ) , new 19 % ) , sex ( 19 % ) , respect 18 % , confidence 17 % , 17 % , connecting 17 % , knowledge % % , 14 % , 14 % , 12 % %
Table 6: Examples of translation outputs from ENGINE and CMLM on WMT16 RO-EN without refinement iterations.

In Table 6, we present randomly chosen translation outputs from WMT16 RO-EN. For each Romanian sentence, we show the reference from the dataset, the translation from CMLM, and the translation from ENGINE. We observe that, without refinement iterations, CMLM performs well for shorter source sentences. However, it still tends to generate repeated tokens. ENGINE, on the other hand, generates much better translations with fewer repeated tokens.