Towards Decoding as Continuous Optimization in Neural Machine Translation

01/11/2017 ∙ by Cong Duy Vu Hoang, et al. ∙ Monash University, The University of Melbourne

We propose a novel decoding approach for neural machine translation (NMT) based on continuous optimisation. We convert decoding - basically a discrete optimization problem - into a continuous optimization problem. The resulting constrained continuous optimisation problem is then tackled using gradient-based methods. Our powerful decoding framework enables decoding intractable models such as the intersection of left-to-right and right-to-left (bidirectional) as well as source-to-target and target-to-source (bilingual) NMT models. Our empirical results show that our decoding framework is effective, and leads to substantial improvements in translations generated from the intersected models where the typical greedy or beam search is not feasible. We also compare our framework against reranking, and analyse its advantages and disadvantages.




1 Introduction

Sequence to sequence learning with neural networks (Graves, 2013; Sutskever et al., 2014; Lipton et al., 2015) is typically associated with two phases: training and decoding (a.k.a. inference). Model parameters are learned by optimising the training objective, so that the model generalises well when the unknown test data is decoded. Most of the literature has focused on developing better training paradigms or network architectures, while the decoding problem is arguably under-investigated. Conventional heuristic-based approaches for approximate inference include greedy, beam, and stochastic search. Greedy and beam search have been shown empirically to be adequate for many sequence to sequence tasks, and are the standard methods for decoding in NMT.

However, these approximate inference approaches have several drawbacks. Firstly, due to sequential decoding of symbols of the target sequence, the inter-dependencies among the target symbols are not fully exploited. For example, when decoding the words of the target sentence in a left-to-right manner, the right context is not exploited leading potentially to inferior performance (see Watanabe and Sumita (2002a) who apply this idea in traditional statistical MT). Secondly, it is not trivial to apply greedy or beam search to decode in NMT models involving global features or constraints, e.g., intersecting left-to-right and right-to-left models which do not follow the same generation order. These global constraints capture different aspects and can be highly useful in producing better and more diverse translations.

We introduce a novel decoding framework (§3) that effectively relaxes this discrete optimisation problem into a continuous optimisation problem. This is akin to the linear programming relaxation approach for approximate inference in graphical models with discrete random variables, where exact inference is NP-hard (Sontag, 2010; Belanger and McCallum, 2016). Our continuous optimisation problems are challenging due to the non-linearity and non-convexity of the relaxed decoding objective. We make use of stochastic gradient descent (SGD) and exponentiated gradient (EG) algorithms, which are mainly used for training in the literature, for decoding based on our relaxation approach. Our decoding framework is powerful and flexible, as it enables us to decode with global constraints involving the intersection of multiple NMT models (§4). We present experimental results on Chinese-English and German-English translation tasks, confirming the effectiveness of our relaxed optimisation method for decoding (§5).

2 Neural Machine Translation

We briefly review the attentional neural translation model proposed by Bahdanau et al. (2015) as a sequence-to-sequence neural model onto which we will apply our decoding framework.

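In neural machine translation (NMT), the probability of the target sentence $y$ given a source sentence $x$ is written as:

$$P_\Theta(y \mid x) = \prod_{i=1}^{|y|} P_\Theta(y_i \mid y_{<i}, x), \qquad y_i \mid y_{<i}, x \sim \mathrm{softmax}\big(f(\Theta, y_{<i}, x)\big) \tag{1}$$

where $f$ is a non-linear function of the previously generated sequence of words $y_{<i}$, the source sentence $x$, and the model parameters $\Theta$. In this paper, we realise $f$ as follows:

$$f(\Theta, y_{<i}, x) = W_o \cdot \mathrm{MLP}\big(c_i, E_T^{y_{i-1}}, g_i\big) + b_o, \qquad g_i = \mathrm{RNN}_{dec}\big(c_i, E_T^{y_{i-1}}, g_{i-1}\big)$$

where MLP is a single hidden layer neural network with tanh activation function, $E_T^{y_{i-1}}$ is the embedding of the target word $y_{i-1}$ in the embedding matrix $E_T \in \mathbb{R}^{n_e \times |V_T|}$ of the target language vocabulary $V_T$, and $n_e$ is the embedding dimension. The state $g_i$ of the decoder RNN is a function of $y_{i-1}$, its previous state $g_{i-1}$, and the context $c_i = \sum_{j=1}^{|x|} \alpha_{ij} h_j$, which summarises the parts of the source sentence that are attended to, where

$$\alpha_i = \mathrm{softmax}(e_i), \qquad e_{ij} = \mathrm{MLP}(g_{i-1}, h_j), \qquad h_j = \mathrm{biRNN}_{enc}\big(E_S^{x_j}, \overrightarrow{h}_{j-1}, \overleftarrow{h}_{j+1}\big)$$

In the above, $\overrightarrow{h}_j$ and $\overleftarrow{h}_j$ are the states of the left-to-right and right-to-left RNNs encoding the source sentence, $E_S^{x_j}$ is the embedding of the source word $x_j$ in the embedding matrix $E_S \in \mathbb{R}^{n'_e \times |V_S|}$ of the source language vocabulary $V_S$, and $n'_e$ is the embedding dimension.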

Given a bilingual corpus $\mathcal{D}$, the model parameters are learned by maximizing the (regularised) conditional log-likelihood:

$$\Theta^{*} := \arg\max_{\Theta} \sum_{(x, y) \in \mathcal{D}} \log P_\Theta(y \mid x) \tag{2}$$

The model parameters $\Theta$ include the weight matrix $W_o \in \mathbb{R}^{|V_T| \times n_H}$ and the bias $b_o \in \mathbb{R}^{|V_T|}$, with $n_H$ denoting the hidden dimension size and $|V_T|$ the target vocabulary size, as well as the RNN encoder / decoder parameters, word embedding matrices, and those of the attention mechanism. The model is trained end-to-end by optimising the training objective using stochastic gradient descent (SGD) or its variants. In this paper, however, we are interested in the decoding problem, which is outlined in the next section.

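1: For all $i$: initialise $\hat{y}_i^{0} \in \Delta_{|V_T|}$
2: for $t = 1, \ldots, \text{MaxIter}$ do    ▷ $Q(\cdot)$ is defined as eqn (6)
3:     For all $i, w$: calculate $\nabla_{i,w}^{t-1} = \partial Q(\hat{y}_1^{t-1}, \ldots, \hat{y}_\ell^{t-1}) / \partial \hat{y}_i(w)$ using backpropagation
4:     For all $i, w$: update $\hat{y}_i^{t}(w) \propto \hat{y}_i^{t-1}(w) \cdot \exp(-\eta \nabla_{i,w}^{t-1})$    ▷ $\eta$ is the step size
Algorithm 1 The EG Algorithm for Decoding by Optimisation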

3 Decoding as Continuous Optimisation

In decoding, we are interested in finding the highest probability translation for a given source sentence:

$$y^{*} = \arg\max_{y \in \mathcal{Y}_x} P_\Theta(y \mid x) \tag{3}$$

where $\mathcal{Y}_x$ is the space of possible translations for the source sentence $x$. In general, searching for the highest probability translation is intractable due to the long-range dependency terms in eqn (1), which prevent the use of dynamic programming for efficient search in this space of possible translations, which is exponentially large with respect to the input length $|x|$.

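We now formulate this discrete optimisation problem as a continuous one, and then use standard algorithms for continuous optimisation for decoding. Let us assume that the maximum length of a possible translation for a source sentence is known and denote it by $\ell$. The best translation for a given source sentence solves the following optimisation problem:

$$\arg\min_{y_1, \ldots, y_\ell} -\sum_{i=1}^{\ell} \log P_\Theta(y_i \mid y_{<i}, x) \quad \text{s.t. } \forall i \in \{1, \ldots, \ell\}: y_i \in V_T \tag{4}$$

Equivalently, we can re-write the above discrete optimisation problem as follows:

$$\arg\min_{\tilde{y}_1, \ldots, \tilde{y}_\ell} -\sum_{i=1}^{\ell} \tilde{y}_i \cdot \log \mathrm{softmax}\big(f(\Theta, \tilde{y}_{<i}, x)\big) \quad \text{s.t. } \forall i \in \{1, \ldots, \ell\}: \tilde{y}_i \in \mathbb{I}^{|V_T|} \tag{5}$$

where the $\tilde{y}_i$ are vectors using the one-hot representation of the target words, denoted $\mathbb{I}^{|V_T|}$.

We now convert the optimisation problem (5) to a continuous one by dropping the integrality constraints and requiring the variables to take values from the probability simplex:

$$\arg\min_{\hat{y}_1, \ldots, \hat{y}_\ell} -\sum_{i=1}^{\ell} \hat{y}_i \cdot \log \mathrm{softmax}\big(f(\Theta, \hat{y}_{<i}, x)\big) \quad \text{s.t. } \forall i \in \{1, \ldots, \ell\}: \hat{y}_i \in \Delta_{|V_T|} \tag{6}$$

where $\Delta_{|V_T|}$ is the $|V_T|$-dimensional probability simplex, i.e., $\{\hat{y}_i \in [0, 1]^{|V_T|} : \|\hat{y}_i\|_1 = 1\}$. Intuitively, this amounts to replacing the embedding $E_T^{y_i}$ with the expected embedding of target language words, $\mathbb{E}_{\hat{y}_i(w)}[E_T^{w}]$, under the distribution $\hat{y}_i$ in the NMT model.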
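The effect of the relaxation on the embedding lookup can be illustrated with a small NumPy sketch. The matrix `E_T`, its dimensions, and the helper `embed` are made-up toy values and names, not taken from the paper; the sketch only shows that a one-hot vector selects a single embedding column, whereas a point on the simplex yields the expected embedding.

```python
import numpy as np

# Toy embedding matrix with n_e = 2 rows and |V_T| = 3 columns (hypothetical values).
E_T = np.array([[1.0, 0.0, 2.0],
                [0.0, 1.0, 2.0]])

def embed(y, E=E_T):
    """Return E @ y: a single column for a one-hot y, an expected embedding otherwise."""
    y = np.asarray(y, dtype=float)
    if np.any(y < 0) or not np.isclose(y.sum(), 1.0):
        raise ValueError("y must lie on the probability simplex")
    return E @ y

one_hot = np.array([0.0, 0.0, 1.0])     # discrete choice: word index 2
relaxed = np.array([0.5, 0.25, 0.25])   # relaxed choice: a distribution over words

print(embed(one_hot))   # column 2 of E_T
print(embed(relaxed))   # probability-weighted average of the columns
```

Because the lookup is a matrix-vector product in both cases, the relaxed model remains differentiable with respect to the entries of the distribution.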

After solving the above constrained continuous optimisation problem, there is no guarantee that the resulting solution will consist of one-hot vectors corresponding to target language words. Instead, it will contain a distribution over the target language vocabulary for each random variable of interest in prediction, so we need a technique to round this fractional solution. Our method is to put all of the probability mass on the word with the highest probability for each $\hat{y}_i$ (if there are multiple words with the same highest probability mass, we choose one of them arbitrarily). We leave exploration of more elaborate projection techniques to future work.
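The rounding step described above can be sketched as follows. The function name `round_to_one_hot` and the toy matrix are illustrative assumptions; ties are broken by taking the first maximal index, which is one arbitrary but deterministic choice.

```python
import numpy as np

def round_to_one_hot(y_hat):
    """Map each row of y_hat (positions x vocabulary) to a one-hot row.

    np.argmax returns the first maximal index, so ties are broken
    deterministically in favour of the lowest word index.
    """
    y_hat = np.asarray(y_hat, dtype=float)
    best = np.argmax(y_hat, axis=1)
    one_hot = np.zeros_like(y_hat)
    one_hot[np.arange(y_hat.shape[0]), best] = 1.0
    return best, one_hot

# Toy relaxed solution for a 2-word translation over a 3-word vocabulary.
y_hat = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.5, 0.0]])   # the second row contains a tie
words, one_hot = round_to_one_hot(y_hat)
print(words)
print(one_hot)
```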

In the context of graphical models, the above relaxation technique gives rise to linear programming for approximate inference (Sontag, 2010; Belanger and McCallum, 2016). However, our decoding problem is much harder due to the non-linearity and non-convexity of the objective function operating on high dimensional space for deep models. We now turn our attention to optimisation algorithms to effectively solve the decoding optimisation problem.

3.1 Exponentiated Gradient (EG)

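Exponentiated gradient (Kivinen and Warmuth, 1997) is an elegant algorithm for solving optimisation problems involving simplex constraints. Recall our constrained optimisation problem:

$$\arg\min_{\hat{y}_1, \ldots, \hat{y}_\ell} Q(\hat{y}_1, \ldots, \hat{y}_\ell) \quad \text{s.t. } \forall i \in \{1, \ldots, \ell\}: \hat{y}_i \in \Delta_{|V_T|}$$

where $Q(\hat{y}_1, \ldots, \hat{y}_\ell)$ is defined as

$$Q(\hat{y}_1, \ldots, \hat{y}_\ell) = -\sum_{i=1}^{\ell} \hat{y}_i \cdot \log \mathrm{softmax}\big(f(\Theta, \hat{y}_{<i}, x)\big)$$

EG is an iterative algorithm, which updates each distribution $\hat{y}_i^{t}$ in the current time-step $t$ based on the distributions of the previous time-step as follows:

$$\forall w \in V_T: \quad \hat{y}_i^{t}(w) = \frac{1}{Z_i^{t}} \hat{y}_i^{t-1}(w) \exp\big(-\eta \nabla_{i,w}^{t-1}\big), \qquad \nabla_{i,w}^{t-1} = \frac{\partial Q(\hat{y}_1^{t-1}, \ldots, \hat{y}_\ell^{t-1})}{\partial \hat{y}_i(w)}$$

where $\eta$ is the step size, and $Z_i^{t}$ is the normalisation constant

$$Z_i^{t} = \sum_{w \in V_T} \hat{y}_i^{t-1}(w) \exp\big(-\eta \nabla_{i,w}^{t-1}\big)$$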

The partial derivatives $\nabla_{i,w}$ are calculated using the back-propagation algorithm, treating the $\hat{y}_i$'s as parameters and the original parameters of the model $\Theta$ as constants. Adapting EG to our decoding problem leads to Algorithm 1. It can be shown that the EG algorithm is a gradient descent algorithm for minimising the following objective function subject to the simplex constraints (with $\gamma$ weighting the entropy term):

$$Q(\hat{y}_1, \ldots, \hat{y}_\ell) - \gamma \sum_{i=1}^{\ell} \sum_{w \in V_T} \hat{y}_i(w) \log \frac{1}{\hat{y}_i(w)}$$

In other words, the algorithm looks for the maximum entropy solution which also maximizes the log likelihood under the model. There are intriguing parallels with the maximum entropy formulation of log-linear models (Berger et al., 1996). In our setting, the entropy term acts as a prior which discourages overly-confident estimates without sufficient evidence.
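As a minimal sketch of the multiplicative update, the following NumPy code runs EG on a toy problem: each probability is multiplied by the exponential of the negative scaled gradient and then renormalised. The linear objective, the `cost` matrix, the step size, and the iteration count are all assumptions chosen for illustration; in the actual decoder the gradient would come from back-propagation through the NMT model.

```python
import numpy as np

def eg_step(y_hat, grad, eta):
    """One exponentiated-gradient step; every row of y_hat stays on the simplex."""
    scaled = eta * grad
    # Shifting each row by its minimum multiplies the row by a constant that the
    # normalisation cancels; it only guards against overflow in exp.
    scaled = scaled - scaled.min(axis=1, keepdims=True)
    unnorm = y_hat * np.exp(-scaled)
    return unnorm / unnorm.sum(axis=1, keepdims=True)   # division by Z

def toy_grad(y_hat, cost):
    """Gradient of the toy objective Q(y) = sum_i cost[i] . y[i] (an assumption)."""
    return cost

cost = np.array([[3.0, 1.0, 2.0],
                 [0.5, 4.0, 4.0]])
y_hat = np.full_like(cost, 1.0 / cost.shape[1])   # uniform initialisation
for _ in range(50):
    y_hat = eg_step(y_hat, toy_grad(y_hat, cost), eta=0.5)

print(y_hat.argmax(axis=1))   # index of the cheapest word at each position
```

No projection step is needed: the multiplicative form keeps the entries non-negative and the normalisation keeps each row summing to one.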

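1: For all $i$: initialise $\hat{r}_i^{0}$
2: for $t = 1, \ldots, \text{MaxIter}$ do    ▷ $Q(\cdot)$ is defined in eqn (6) and $\hat{y}_i = \mathrm{softmax}(\hat{r}_i)$
3:     For all $i, w$: calculate $\partial Q^{t-1} / \partial \hat{r}_i(w)$ using backpropagation
4:     For all $i, w$: update $\hat{r}_i^{t}(w) = \hat{r}_i^{t-1}(w) - \eta \, \partial Q^{t-1} / \partial \hat{r}_i(w)$    ▷ $\eta$ is the step size
Algorithm 2 The SGD Algorithm for Decoding by Optimisation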

3.2 Stochastic Gradient Descent (SGD)

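To be able to apply SGD to our optimisation problem, we need to make sure that the simplex constraints are kept intact. One way to achieve this is by changing the optimisation variables from $\hat{y}_i$ to $\hat{r}_i$ through the softmax transformation, i.e., $\hat{y}_i = \mathrm{softmax}(\hat{r}_i)$. The resulting unconstrained optimisation problem then becomes

$$\arg\min_{\hat{r}_1, \ldots, \hat{r}_\ell} -\sum_{i=1}^{\ell} \mathrm{softmax}(\hat{r}_i) \cdot \log \mathrm{softmax}\big(f(\Theta, \hat{y}_{<i}, x)\big)$$

where $E_T^{y_i}$ is replaced with the expected embedding of the target words under the distribution resulting from $\mathrm{softmax}(\hat{r}_i)$ in the model.

To apply SGD updates, we need the gradient of the objective function with respect to the new variables $\hat{r}_i$, which can be derived with the back-propagation algorithm based on the chain rule:

$$\frac{\partial Q}{\partial \hat{r}_i(w)} = \sum_{w' \in V_T} \frac{\partial Q}{\partial \hat{y}_i(w')} \frac{\partial \hat{y}_i(w')}{\partial \hat{r}_i(w)}$$

The resulting SGD algorithm is summarized in Algorithm 2.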
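The chain rule through the softmax can be written in closed form, which the following sketch uses for a single position. The linear toy objective with cost vector `c`, the step size, and the function names are assumptions for illustration; in practice an automatic differentiation library would compute this gradient.

```python
import numpy as np

def softmax(r):
    z = np.exp(r - r.max())   # subtract the max for numerical stability
    return z / z.sum()

def grad_wrt_logits(r, grad_y):
    """Chain rule: dQ/dr(w) = sum_w' dQ/dy(w') * dy(w')/dr(w).

    For y = softmax(r), dy(w')/dr(w) = y(w') * (1[w == w'] - y(w)),
    so the sum collapses to y(w) * (grad_y(w) - y . grad_y).
    """
    y = softmax(r)
    return y * (grad_y - np.dot(y, grad_y))

def sgd_step(r, grad_y, eta):
    """Additive update on the unconstrained variables r."""
    return r - eta * grad_wrt_logits(r, grad_y)

# Toy objective Q(y) = c . y (an assumption); its gradient w.r.t. y is c.
c = np.array([3.0, 1.0, 2.0])
r = np.zeros(3)
for _ in range(200):
    r = sgd_step(r, c, eta=0.5)
y = softmax(r)
print(y.argmax())   # index of the cheapest word
```

Since the softmax output always lies on the simplex, the additive update on `r` never violates the constraints.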

4 Decoding in Extended NMT

Our decoding framework allows us to effectively and flexibly add additional global factors over the output symbols during inference. This enables decoding with richer global models, for which there is no effective means of greedy decoding or beam search. We outline several such models, and their corresponding relaxed objective functions for optimisation-based decoding.

Bidirectional Ensemble.

Standard NMT generates the translation in a left-to-right manner, conditioning each target word on its left context. However, the joint probability of the translation can be decomposed in a myriad of different orders; one compelling alternative would be to condition each target word on its right context, i.e., generating the target sentence from right-to-left. We would not expect a right-to-left model to outperform a left-to-right one, as the left-to-right ordering reflects the natural temporal order of spoken language. Nevertheless, the right-to-left model is likely to provide a complementary signal in translation, as it brings different biases and makes prediction errors that are largely independent of those of the left-to-right model. For this reason, we propose to use both models, and seek translations that have high probability according to both models (this mirrors work on bidirectional decoding in classical statistical machine translation by Watanabe and Sumita (2002b)). Decoding under the ensemble of these models leads to an intractable search problem, not well suited to traditional greedy or beam search algorithms, which require a fixed generation order of the target words. This ensemble decoding problem can be formulated simply in our linear relaxation approach, using the following objective function:

$$-\alpha \log P_{\Theta_{\rightarrow}}(y \mid x) - (1 - \alpha) \log P_{\Theta_{\leftarrow}}(y \mid x)$$

where $\alpha$ is an interpolation hyper-parameter, which we set to 0.5; and $\Theta_{\rightarrow}$ and $\Theta_{\leftarrow}$ are the pre-trained left-to-right and right-to-left models, respectively. This bidirectional agreement may also lead to improvement in translation diversity, as shown by Li and Jurafsky (2016) in a re-ranking evaluation.

Bilingual Ensemble.

Another source of complementary information is the translation direction, that is, forward translation from the source to the target language, and reverse translation in the target to source direction. The desire now is to find a translation which is good under both the forward and reverse translation models. This is inspired by the direct and reverse feature functions commonly used in classical discriminative SMT (Och and Ney, 2002), which have been shown to offer some complementary benefits (although see Lopez and Resnik (2006)). More specifically, we decode for the best translation in the intersection of the source-to-target and target-to-source models by minimizing the following objective function:

$$-\alpha \log P_{\Theta_{s \rightarrow t}}(y \mid x) - (1 - \alpha) \log P_{\Theta_{t \rightarrow s}}(x \mid y)$$

where $\alpha$ is an interpolation hyper-parameter to be fine-tuned; and $\Theta_{s \rightarrow t}$ and $\Theta_{t \rightarrow s}$ are the pre-trained source-to-target and target-to-source models, respectively. Decoding for the best translation under the above objective function leads to an intractable search problem, as the reverse model is global over the target language, meaning there is no obvious means of search with a greedy algorithm or the like.
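Both ensemble objectives interpolate two negative log-likelihoods, so their gradients with respect to the relaxed variables are the same interpolation of the per-model gradients. The sketch below makes this explicit; the function names and the numeric scores are hypothetical, and in a real system each cost and gradient would come from a forward and backward pass through the corresponding pre-trained model.

```python
import numpy as np

def ensemble_cost(nll_a, nll_b, alpha=0.5):
    """Interpolate two negative log-likelihoods: alpha * nll_a + (1 - alpha) * nll_b."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * nll_a + (1.0 - alpha) * nll_b

def ensemble_grad(grad_a, grad_b, alpha=0.5):
    """Differentiation is linear, so the gradients interpolate in the same way."""
    return alpha * np.asarray(grad_a, dtype=float) + (1.0 - alpha) * np.asarray(grad_b, dtype=float)

# Hypothetical scores and gradients for one candidate under two models.
cost = ensemble_cost(nll_a=4.0, nll_b=6.0, alpha=0.5)
grad = ensemble_grad([1.0, -1.0], [3.0, 1.0], alpha=0.5)
print(cost, grad)
```

In the bilingual case the second model scores the source given the relaxed target, so its gradient reaches the relaxed variables through that model's inputs rather than its outputs.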


There are two important considerations: how best to initialise the relaxed optimisation in the above settings, and how best to choose the step size. As the relaxed optimisation problem is, in general, non-convex, finding a plausible initialisation is likely to be important for avoiding poor local optima. Furthermore, a proper step size is key to the success of the EG-based and SGD-based optimisation algorithms, and there is no obvious method for choosing its value. We may also adaptively change the step size using (scheduled) annealing or via line search. We return to these considerations in the experimental evaluation.

| | # tokens | # types | # sents |
|---|---|---|---|
| BTEC zh→en | | | |
| train | 422k / 454k | 3k / 3k | 44,016 |
| dev | 10k / 10k | 1k / 1k | 1,006 |
| test | 5k / 5k | 1k / 1k | 506 |
| TED Talks de→en | | | |
| train | 4m / 4m | 26k / 19k | 194,181 |
| dev-test2010 | 33k / 35k | 4k / 3k | 1,565 |
| test2014 | 26k / 27k | 4k / 3k | 1,305 |
| WMT 2016 de→en | | | |
| train | 107m / 108m | 90k / 78k | 4m |
| dev-test2013&14 | 154k / 152k | 20k / 13k | 6,003 |
| test2015 | 54k / 54k | 10k / 8k | 2,169 |

Table 1: Statistics of the training and evaluation sets; tokens and types are presented for both source/target languages.

5 Experiments

5.1 Setup


We conducted our experiments on datasets with different scales, translating Chinese→English using the BTEC corpus, and German→English using the IWSLT 2015 TED Talks corpus (Cettolo et al., 2014) and the WMT 2016 corpus. The statistics of the datasets can be found in Table 1.

NMT Models.

We implemented our continuous-optimisation based decoding method on top of the Mantidae toolkit (Cohn et al., 2016), using the dynet deep learning library (Neubig et al., 2017). All neural network models were configured with 512 input embedding and hidden layer dimensions, and 256 alignment dimension, with 1 and 2 hidden layers in the source and target, respectively. We used an LSTM recurrent structure (Hochreiter and Schmidhuber, 1997) for both source and target RNN sequences. For vocabulary sizes, we chose a word frequency cut-off of 5 for creating the vocabularies for all datasets. For the large-scale WMT dataset, we applied the byte-pair encoding (BPE) method (Sennrich et al., 2016) so that the neural MT system can tackle the unknown word problem (Luong et al., 2015); with this BPE method, the OOV rates of the tune and test sets are lower than 1%.

For training our neural models, the best perplexity score on the development set is used for early stopping, which usually occurs after 5-8 epochs.

Figure 1: Analysis of the effects of initialisation states (uniform vs. greedy vs. beam), step size annealing, and the momentum mechanism on BTEC zh→en translation. EG-400: EG algorithm with step size $\eta = 400$ (otherwise $\eta = 50$); EG-MOM: EG algorithm with momentum.

Evaluation Metrics.

We evaluated in terms of search error, measured using the model score of the inferred solution (either continuous or discrete), as well as the end translation quality, measured with case-insensitive BLEU (Papineni et al., 2002). The continuous cost is the model cost of the relaxed solution; the discrete model score has the same formulation, albeit using the discrete rounded solution (see §3). Note that the cost can be used as a tool for selecting the best inference solution, as well as for assessing convergence, as we illustrate below.

5.2 Results and Analysis

Initialisation and Step Size.

As our relaxed optimisation problems are not convex, local optima are likely to be a problem. We test this empirically, focusing on the effect that initialisation and step size, $\eta$, have on the inference quality.

For plausible initialisation states, we evaluate different strategies: uniform, in which the relaxed variables $\hat{y}_i$ are initialised to the uniform distribution $\frac{1}{|V_T|}$; and greedy or beam, whereby the $\hat{y}_i$ are initialised based on an already good solution produced by a baseline decoder with greedy (gdec) or beam (bdec) search. Instead of using the Viterbi outputs as a one-hot representation, we initialise to the probability prediction vectors (here, the EG algorithm uses the softmax-normalised vectors whereas the SGD algorithm uses the pre-softmax ones), which serves to limit the attraction of the initialisation condition, which is likely to be a local (but not global) optimum.
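The two families of initialisation can be sketched as below. The function names, shapes, and the example scores are illustrative assumptions; in practice the per-step prediction vectors would be read off the baseline greedy or beam decoder.

```python
import numpy as np

def uniform_init(length, vocab_size):
    """Every position starts as the uniform distribution over the vocabulary."""
    return np.full((length, vocab_size), 1.0 / vocab_size)

def init_from_decoder(step_scores):
    """Start from a baseline decoder's per-step prediction vectors.

    The rows are renormalised defensively so that each one sums to one,
    instead of being collapsed to a one-hot vector.
    """
    p = np.asarray(step_scores, dtype=float)
    return p / p.sum(axis=1, keepdims=True)

y_uniform = uniform_init(length=2, vocab_size=4)
y_decoder = init_from_decoder([[2.0, 1.0, 1.0],
                               [0.0, 3.0, 1.0]])
print(y_uniform)
print(y_decoder)
```

Keeping the full prediction vectors leaves some probability mass on competing words, which gives the optimiser room to move away from the baseline output.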

Figure 1 illustrates the effect of initialisation on the EG algorithm, in terms of search error (left and middle) and translation quality (right), as we vary the number of iterations of inference. There is clear evidence of non-convexity: all initialisation methods can be seen to converge using all three measures, however they arrive at highly different solutions. Uniform initialisation is clearly not a viable approach, while greedy and beam initialisation both yield much better results. The best initialisation, beam, outperforms both greedy and beam decoding in terms of BLEU.

Note that the EG algorithm has fairly slow convergence, requiring at least 100 iterations, irrespective of the initialisation. To overcome this, we use momentum (Qian, 1999) to accelerate the convergence, replacing the gradient term in Algorithm 1 with a weighted moving average of past gradients, controlled by a momentum term. The EG with momentum (EG-MOM) converges after fewer iterations (about 35), and results in marginally better BLEU scores. The momentum technique is usually used for SGD involving additive updates; it is interesting to see it also works in EG with multiplicative updates.
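One plausible way to combine momentum with the multiplicative update is sketched below. The exact recurrence (a decayed running sum of scaled gradients), the momentum value 0.9, the step size, and the toy linear objective are all assumptions made for illustration rather than the configuration used in the experiments.

```python
import numpy as np

def eg_momentum_step(y_hat, velocity, grad, eta, gamma):
    """EG step in which the raw gradient is replaced by a moving average."""
    velocity = gamma * velocity + eta * grad   # weighted average of past gradients
    # Row-wise shift for numerical stability; the normalisation cancels it.
    shifted = velocity - velocity.min(axis=1, keepdims=True)
    unnorm = y_hat * np.exp(-shifted)
    return unnorm / unnorm.sum(axis=1, keepdims=True), velocity

cost = np.array([[3.0, 1.0, 2.0]])          # toy objective Q(y) = cost . y
y_hat = np.full_like(cost, 1.0 / 3.0)       # uniform initialisation
velocity = np.zeros_like(cost)
for _ in range(20):
    y_hat, velocity = eg_momentum_step(y_hat, velocity, cost, eta=0.1, gamma=0.9)

print(y_hat.argmax(axis=1))   # index of the cheapest word
```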

The step size, $\eta$, is another important hyper-parameter for gradient based search. We tune the step size using line search on the development set. Figure 1 illustrates the effect of changing the step size from 50 to 400 (compare EG and EG-400 with uniform), which results in a marked difference of about 10 BLEU points, underlining the importance of tuning this value. We found that EG with momentum was less reliant on the step size; we use this setting hereafter.

Continuous vs Discrete Costs.

Another important question is whether the assumption behind continuous relaxation is valid, i.e., if we optimise a continuous cost to solve a discrete problem, do we improve the discrete output? Although the continuous cost diminishes with inference iterations (Figure 1 centre), and appears to converge to an optimum, it is not clear whether this corresponds to a better discrete output (note that the BLEU scores do show improvements in Figure 1). Figure 2 illustrates the relation between the two cost measures, showing that in almost all cases the discrete and continuous costs are identical. Linear relaxation effectively fails only for a handful of cases, where the nearest discrete solution is significantly worse than it would appear using the continuous cost.

Figure 2: Comparing discrete vs continuous costs on BTEC zh→en translation, using the EG algorithm with momentum. Each point corresponds to a sentence.

EG vs SGD.

Both the EG and SGD algorithms are iterative methods for solving the relaxed optimisation problem with simplex constraints. We measure empirically their difference in terms of quality of inference and speed of convergence, as illustrated in Figure 3. Observe that SGD requires 150 iterations for convergence, whereas EG requires many fewer (50). This concurs with previous work on learning structured prediction models with EG (Globerson et al., 2007). Further, the EG algorithm consistently produces better results in terms of both model cost and BLEU.

Figure 3: Analysis of convergence and performance comparing the SOFTMAX and EG algorithms on BTEC zh→en translation. Both algorithms use momentum and step size 50.

Table 2: The BLEU evaluation results with the EG algorithm against 100-best reranking on the WMT evaluation dataset. Rows: filtered rerank; EGdec w/ beam init; full rerank; EGdec w/ rerank init.

EG vs Reranking.

Reranking is an alternative method for integrating global factors into existing NMT systems. We compare our EG decoding algorithm against the reranking approach with the bidirectional factor, where the N-best outputs of a left-to-right decoder are re-scored with a forced decoder operating in a right-to-left fashion. The results are shown in Table 2. Our EG algorithm initialised with the reranked output achieves the best BLEU score. We also compare reranking with the EG algorithm initialised with the beam decoder, where for direct comparison we filter out sentences with length greater than that of the beam output in the k-best lists. These results show that the EG algorithm is capable of effectively exploiting the search space.

As opposed to re-ranking, our approach does not need a pipeline, e.g. to produce and score k-best lists, to tune the weights, etc. The run time complexity of our approach is comparable with that of reranking: our model needs repeated application of the NMT global factors to navigate through the search space, whereas re-ranking needs to use the underlying NMT model to generate the k-best list and compute the global NMT scores. Note that our method swaps some sparse one-hot vector operations for dense ones and requires a back-propagation pass, both of which are relatively cheap on GPUs. Overall, our expectation is that for k large enough to yield improvements in BLEU, our relaxed decoding is potentially faster.

Computational Efficiency.

We also quantify the computational efficiency of the proposed decoding approach. Benchmarking on a Titan X GPU for decoding 506 sentences of BTEC zh→en, it takes 0.02s/sentence for greedy, 0.07s/sentence for beam 5, 0.11s/sentence for beam 10, and 3.1s/sentence for our relaxed EG decoding (with an average of 35 EG iterations). More concretely, our relaxed EG decoding includes: 0.94s/sentence for the forward step, 2.09s/sentence for the backward step, and 0.01s/sentence for the update and additional steps. The backward step is thus the most computationally expensive step, limiting the practical applicability of the proposed decoding approach. Addressing this important issue is left for future research.


Table 3: The BLEU evaluation results across evaluation datasets for EG algorithm variants against the baselines; bold: statistically significantly better than the best greedy or beam baseline, : best performance on dataset.

Main Results.

Table 3 shows our experimental results across all datasets, evaluating the EG algorithm and its variants (given the aforementioned analysis and space constraints, we report results for the EG algorithm only). For the EG algorithm with greedy initialisation (top), we see small but consistent improvements in terms of BLEU. Beam initialisation led to overall higher BLEU scores, again demonstrating a similar pattern of improvements over the initialisation values, albeit of a lower magnitude.

Next we evaluate the capability of our inference method with extended NMT models, where approximate algorithms such as greedy or beam search are infeasible. With the bidirectional ensemble, we obtained statistically significant BLEU score improvements compared to the unidirectional models, for either greedy or beam initialisation. This is interesting in the sense that the unidirectional right-to-left model always performs worse than the left-to-right model. However, our method with the bidirectional ensemble is capable of combining their strengths in a unified setting. For the bilingual ensemble, we see similar effects, with BLEU score improvements in most cases, albeit of a lower magnitude than for the bidirectional one. This is likely to be due to a disparity with the training condition for the models, which were learned independently of one another.

Overall, decoding in extended NMT models leads to performance improvements compared to the baselines. This is one of the main findings in this work, and augurs well for its extension to other global model variants.

Figure 4: Translation examples generated by the models.

6 Related Work

Decoding (inference) for neural models is an important task; however, there is limited research in this space perhaps due to the challenging nature of this task, with only a few works exploring some extensions to improve upon them. The most widely-used inference methods include sampling (Cho, 2016), greedy and beam search (Sutskever et al., 2014; Bahdanau et al., 2015, inter alia), and reranking (Birch, 2016; Li and Jurafsky, 2016).

Cho (2016) proposed to perturb the neural model by injecting noise in the hidden transition function of the conditional recurrent neural language model during greedy or beam search, and to execute multiple parallel decoding runs. This strategy can improve over greedy and beam search; however, it is not clear how, when and where noise should be injected to be beneficial. Recently, Wiseman and Rush (2016) proposed beam search optimisation while training neural models, where the model parameters are updated in case the gold standard falls outside of the beam. This exposes the model to its past incorrect predicted labels, hence making the training more robust. This is orthogonal to our approach, where we focus on the decoding problem with a pre-trained model.

Reranking has also been proposed as a means of global model combination: Birch (2016) and Li and Jurafsky (2016) re-rank the left-to-right decoded translations based on the scores of a right-to-left model, leading to more diverse translations. Relatedly, Li et al. (2016) learn to adjust the beam diversity with reinforcement learning.

Perhaps most relevant is Snelleman (2016), performed concurrently to this work, who also proposed an inference method for NMT using linear relaxation. Snelleman’s method was similar to our SGD approach, however he did not manage to outperform beam search baselines with an encoder-decoder. In contrast we go much further, proposing the EG algorithm, which we show works much more effectively than SGD, and demonstrate how this can be applied to inference in an attentional encoder-decoder. Moreover, we demonstrate the utility of related optimisation for inference over global ensembles of models, resulting in consistent improvements in search error and end translation quality.

Recently, relaxation techniques have been applied to deep models for training and inference in text classification (Belanger and McCallum, 2016; Belanger et al., 2017), and fully differentiable training of sequence-to-sequence models with scheduled-sampling (Goyal et al., 2017). Our work has applied the relaxation technique specifically for decoding in NMT models.

7 Conclusions

This work presents the first attempt at formulating decoding in NMT as a continuous optimisation problem. The core idea is to drop the integrality (i.e. one-hot vector) constraint from the prediction variables and allow them to have soft assignments within the probability simplex while minimising the loss function produced by the neural model. We have provided two optimisation algorithms, exponentiated gradient (EG) and stochastic gradient descent (SGD), for solving the resulting constrained optimisation problem, where our findings show the effectiveness of EG compared to SGD. Thanks to our framework, we have been able to decode when intersecting left-to-right and right-to-left as well as source-to-target and target-to-source NMT models. Our results show that our decoding framework is effective and leads to substantial improvements in translations generated from the intersected models (some comparative translation examples are included in Figure 4), where the typical greedy or beam search algorithms are not applicable.

This work raises several compelling possibilities which we intend to address in future work, including improving the decoding speed, integrating additional constraints such as word coverage and fertility into decoding (these constraints have only been used for training in previous works (Cohn et al., 2016; Mi et al., 2016)), and applying our method to other intractable structured prediction problems such as parsing.


We thank the reviewers for valuable feedback and discussions. Cong Duy Vu Hoang is supported by Australian Government Research Training Program Scholarships at the University of Melbourne, Australia. Trevor Cohn is supported by the ARC Future Fellowship. This work is partly supported by an ARC DP grant to Trevor Cohn and Gholamreza Haffari.