Single-Queue Decoding for Neural Machine Translation

07/06/2017 ∙ by Raphael Shu, et al. ∙ The University of Tokyo

Neural machine translation models rely on the beam search algorithm for decoding. In practice, we found that the quality of hypotheses in the search space is negatively affected by the fixed beam size. To mitigate this problem, we store all hypotheses in a single priority queue and use a universal score function for hypothesis selection. The proposed algorithm is more flexible, as discarded hypotheses can be revisited in a later step. We further design a penalty function to punish hypotheses that tend to produce a final translation much longer or shorter than expected. Despite its simplicity, we show that the proposed decoding algorithm is able to select higher-quality hypotheses and improve translation performance.


1 Introduction

Machine translation models composed of end-to-end neural networks Sutskever et al. (2014); Bahdanau et al. (2014); Shazeer et al. (2017); Gehring et al. (2017) are starting to become mainstream. Essentially, neural machine translation (NMT) models define a probability distribution $p(y \mid x) = \prod_t p(y_t \mid y_{<t}, x)$ to generate translations, where $x$ is a source sentence and $y_{<t}$ is the history of emitted words used to predict the next word $y_t$. One can think of an NMT model as a neural language model with the source sentence included in the context.

As the search space of possible outputs is incredibly large, we can only afford to explore a limited number of candidates. In practice, we use the beam search algorithm to generate output sequences Graves (2012); Sutskever et al. (2014). The algorithm limits the search space by considering only a fixed number of hypotheses (i.e., partial translations) in each step, and predicting next words only for the selected hypotheses.

However, we found that the strict limit on hypothesis selection negatively affects the quality of the search space. Since the number of active hypotheses is fixed, the algorithm must give up some existing hypotheses to explore new possible decoding paths, even though the discarded hypotheses are not "hopeless". The problem has a similar flavor to the exploration-exploitation dilemma.

In this work, we extend beam search to introduce more flexibility in hypothesis selection. We manage all discarded hypotheses in a single priority queue so that they can be selected later when necessary. The extended algorithm is guided by a universal score function, which is capable of evaluating hypotheses of different lengths. To encourage the algorithm to select hypotheses that can potentially result in good final translations, we design a length matching penalty that penalizes hypotheses likely to produce an incorrect number of words in the final translation. Experiments show that the proposed algorithm is able to improve the quality of the search space and thus results in better translation performance.

2 Related Work

To improve the performance of beam search, a basic technique is length-normalization, which simply divides the log-probability by the number of words. As far as we know, it was first clearly described in Graves (2012) in the context of recurrent neural networks. We also apply length-normalization in the proposed algorithm.

To improve the quality of the score function in beam search, Wiseman and Rush (2016) propose to run beam search in the forward pass of training and then apply a new objective function to ensure that the gold output does not fall outside the beam. An alternative approach is to correct the scores with reinforcement learning Li et al. (2017). This work focuses on fixing the limited search space of beam search rather than the score function.

Hu et al. (2015) also describe a priority queue, but with a different mechanism and purpose. The priority queue in their work contains the top-1 hypotheses from different hypothesis stacks, and in each step only one hypothesis from the queue is allowed to be considered. Their purpose is to use the priority queue to speed up beam search at the cost of some performance degradation, which differs from this work.

3 Deficiency of Beam Search

Figure 1: Average scores (log-probability) of the hypotheses at each position of the selected hypothesis list in each step; the list is sorted in descending order of score. The data is collected by decoding 1k sentences with a beam size of 5. The proposed single-queue decoding is shown to select better alternative hypotheses than beam search.

Beam search finds a hypothesis that maximizes the log probability $\log p(y \mid x)$, given an input sentence $x$. In each step, a fixed number of hypotheses are considered by the algorithm, and the NMT model predicts the probabilities of the next output token for each hypothesis. Suppose the fixed number (the beam size) is $B$ and the vocabulary size is $V$. Then, theoretically, we can obtain a maximum of $B \times V$ new hypotheses. Beam search then keeps the $B$ hypotheses with the highest log probabilities. Thus, the hypotheses considered in each step have exactly the same length (i.e., number of tokens). The algorithm ends when $B$ finished translations are collected.
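To make the selection rule concrete, here is a minimal Python sketch of one beam-search expansion step. It is illustrative only; the hypothesis representation and helper names are our assumptions, not the authors' code:

```python
import heapq

def beam_step(hyps, log_probs, beam_size):
    """One beam-search expansion step (illustrative sketch only).

    hyps:      list of B dicts {"tokens": [...], "score": float}
    log_probs: B x V matrix (list of lists) of next-token log
               probabilities predicted by the NMT model
    """
    candidates = []
    for hyp, row in zip(hyps, log_probs):
        for token, logp in enumerate(row):
            # Score of an extended hypothesis: accumulated log probability.
            candidates.append({"tokens": hyp["tokens"] + [token],
                               "score": hyp["score"] + logp})
    # Keep only the B best candidates; the rest are discarded for good,
    # which is exactly the rigidity the paper sets out to relax.
    return heapq.nlargest(beam_size, candidates, key=lambda h: h["score"])
```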

Since the beam size is fixed, when the algorithm attempts to explore multiple new decoding paths for a hypothesis, it has to discard some existing decoding paths. However, the new decoding paths may turn out to have low scores in the near future. Because the discarded hypotheses cannot be revisited, beam search has to continue searching along low-quality decoding paths. As a result, some hypotheses in the beam may have much lower quality. Fig. 1 shows the average log-probabilities (normalized by length) of the hypotheses at each position of the beam, which indicates that the quality of the alternative hypotheses is significantly worse than that of the best hypothesis. One can also think of this problem as a result of the limited search space, where the path from the "BOS" token to the "EOS" token is limited in width.

4 Single-Queue Decoding

In this section, we introduce an extended version of beam search, which maintains a single priority queue containing all possible hypotheses. We refer to this algorithm as single-queue decoding in this paper. In contrast to standard beam search, which only considers hypotheses of the same length in each step, the proposed algorithm can select arbitrary hypotheses that differ in length.

By allowing hypotheses of different lengths to be mixed, the proposed algorithm is able to "regret" its decisions and explore a discarded hypothesis if the front (i.e., longer) hypotheses have worse scores.

4.1 Main Algorithm

The pseudo code of the single-queue decoding algorithm is shown in Alg. 1. Let $B$ be the beam size and $T$ be the maximum number of steps. Similar to standard beam search, we run the decoding algorithm until $B$ finished translations are collected. In the worst case, the algorithm runs for a maximum of $T$ steps. All collected hypotheses are maintained in a priority queue $Q$.

Initialize:
      beam size B; empty hypothesis queue Q; max steps T
for t = 1, ..., T do
      A ← select B best unfinished hyps in Q
      Remove hyps in A from Q
      C ← decode A to get B × B new hyps with highest local scores
      Evaluate hyps in C with Eq. 1
      Mark hyps ending with an EOS token as finished
      N ← select B best hyps in C
      Merge N into Q
      if Q contains B finished hyps then
            break
ŷ ← best finished hyp in Q
output ŷ
Algorithm 1: Single-queue decoding

The proposed decoding algorithm relies on a universal score function $S(y_{<t}, x)$ to evaluate a hypothesis $y_{<t}$. In each step, the $B$ hypotheses with the highest scores are removed from the queue, and next words are predicted for them. We collect only the new hypotheses with top local scores: specifically, for a hypothesis $y_{<t}$ (note that $y_{<t}$ is a short form of $y_1, \dots, y_{t-1}$, a partial translation with $t-1$ words), we keep the $B$ predictions with the highest probabilities. This simple filtering avoids the huge computational cost the score function would otherwise incur. We mark all hypotheses ending with an "EOS" token as finished, so that they will not be decoded further. Then, we evaluate the new hypotheses with the universal score function and retain the top $B$ hypotheses in a list. Finally, the list is merged into the queue $Q$. Note that if we keep only $B$ hypotheses in $Q$, the algorithm produces the same result as beam search.
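As an illustration of the main loop, the following Python sketch realizes Alg. 1 with a binary heap. The `model.expand` interface and the hypothesis dictionaries are hypothetical stand-ins for the authors' Theano implementation:

```python
import heapq
import itertools

def single_queue_decode(model, beam_size, max_steps, score_fn):
    """Single-queue decoding (a sketch under assumed interfaces).

    model.expand(hyps, k): assumed method returning, for each input
        hypothesis, its k best one-token extensions (dicts with
        "tokens" and accumulated "logp").
    score_fn(hyp): the universal score function of Eq. 1.
    """
    tie = itertools.count()  # tie-breaker so heapq never compares dicts
    start = {"tokens": ["BOS"], "logp": 0.0, "finished": False}
    queue = [(-score_fn(start), next(tie), start)]
    for _ in range(max_steps):
        # Select the B best *unfinished* hypotheses and remove them from Q;
        # finished hypotheses encountered on the way are pushed back.
        selected, put_back = [], []
        while queue and len(selected) < beam_size:
            item = heapq.heappop(queue)
            (put_back if item[2]["finished"] else selected).append(item)
        for item in put_back:
            heapq.heappush(queue, item)
        if not selected:
            break
        # Expand, then keep only the B best new hypotheses (by Eq. 1).
        new_hyps = model.expand([it[2] for it in selected], k=beam_size)
        scored = [(score_fn(h), h) for h in new_hyps]
        for score, hyp in heapq.nlargest(beam_size, scored, key=lambda p: p[0]):
            hyp["finished"] = hyp["tokens"][-1] == "EOS"
            heapq.heappush(queue, (-score, next(tie), hyp))
        # Stop once B finished translations are in the queue.
        if sum(h["finished"] for _, _, h in queue) >= beam_size:
            break
    done = [(s, h) for s, _, h in queue if h["finished"]]
    # The queue stores negated scores, so the minimum is the best hypothesis.
    return min(done, key=lambda p: p[0])[1] if done else None
```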

4.2 Universal Score Function

The universal score function for evaluating a hypothesis $y_{<t}$ in the queue has the following form:

$$S(y_{<t}, x) = \frac{\log p(y_{<t} \mid x)}{|y_{<t}|^{\alpha}} + \mathrm{pg}(y_{<t}) + \mathrm{lmp}(y_{<t}, x) \qquad (1)$$

The first part of the equation is the log probability with length-normalization, where $\alpha$ is a hyperparameter playing a similar role to the length penalty defined in Wu et al. (2016). The second part of Eq. 1 is a progress penalty, which encourages the algorithm to select longer hypotheses:

$$\mathrm{pg}(y_{<t}) = \beta \cdot |y_{<t}|^{\gamma} \qquad (2)$$

where $\beta$ and $\gamma$ are weights that control the strength of this function. The progress penalty is crucial to single-queue decoding, as it ensures that the decoding algorithm is progressing in general.

The last part of Eq. 1 is a length matching penalty, which will be described in Section 4.3.
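A minimal sketch of the score computation, assuming the reconstructed forms of Eqs. 1 and 2 above; the LMP term is plugged in as a precomputed value:

```python
def universal_score(logp, length, alpha, beta, gamma, lmp=0.0):
    """Universal score function (Eq. 1, as reconstructed above).

    logp:   accumulated log probability of the hypothesis
    length: current number of tokens in the hypothesis
    lmp:    length matching penalty term (Section 4.3), 0 by default
    """
    normalized = logp / (length ** alpha)   # length-normalized log prob
    progress = beta * (length ** gamma)     # progress penalty (Eq. 2)
    return normalized + progress + lmp
```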

4.3 Length Matching Penalty

The standard beam search evaluates hypotheses solely based on the emission probabilities, which may result in a final translation much longer or shorter than expected. To compensate for this deficiency, we design a length matching penalty. The intuition is to correct the scores by predicting a Gaussian distribution over the correct translation length and penalizing all hypotheses that tend to produce much longer or shorter translations.

Let the first state of the backward encoder in a standard NMT model Bahdanau et al. (2014) be $\overleftarrow{h}_1$. We predict the mean and variance of the distribution of the correct translation length using a simple neural network $f$:

$$\mu_x = f(\overleftarrow{h}_1)_1 \qquad (3)$$

$$\sigma_x = f(\overleftarrow{h}_1)_2 \qquad (4)$$

where $f(\overleftarrow{h}_1)$ is a two-dimensional vector. To predict the distribution of the final length for a hypothesis $y_{<t}$, we use a tiny LSTM Hochreiter and Schmidhuber (1997) followed by a simple neural net $g$:

$$s_t = \mathrm{LSTM}(E(y_{t-1}), s_{t-1}) \qquad (5)$$

$$\mu_t = g(s_t)_1 \qquad (6)$$

$$\sigma_t = g(s_t)_2 \qquad (7)$$

where $E(\cdot)$ represents the embeddings of tokens. We train the parameters of $f$, the LSTM, and $g$ with the NMT parameters fixed. Let $L^*$ be the length of the gold output and $\hat{L}$ be the length of a sampled output obtained by greedy decoding; the loss function is defined as

$$\mathcal{L} = -\log \mathcal{N}(L^*; \mu_x, \sigma_x^2) - \sum_{t=1}^{\hat{L}} \log \mathcal{N}(\hat{L}; \mu_t, \sigma_t^2) \qquad (8)$$

where $\mathcal{N}$ is a Gaussian distribution with the specified mean and variance.
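For concreteness, here is a sketch of the two length predictors and the Gaussian negative log-likelihood used in Eq. 8, written in PyTorch rather than the paper's Theano; the layer sizes and the log-sigma parameterization are our assumptions:

```python
import math
import torch
import torch.nn as nn

class LengthPredictor(nn.Module):
    """Gaussian length predictors of Eqs. 3-7 (reconstructed sketch)."""

    def __init__(self, enc_dim, emb_dim, vocab_size, hidden=128):
        super().__init__()
        # f: predicts (mu_x, log sigma_x) from the backward encoder state.
        self.f = nn.Sequential(nn.Linear(enc_dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, 2))
        # Tiny LSTM + g: per-step prediction of (mu_t, log sigma_t).
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.g = nn.Linear(hidden, 2)

    def forward(self, h_back_1, tokens):
        mu_x, log_sigma_x = self.f(h_back_1).unbind(-1)    # Eqs. 3-4
        states, _ = self.lstm(self.emb(tokens))            # Eq. 5
        mu_t, log_sigma_t = self.g(states).unbind(-1)      # Eqs. 6-7
        # Predicting log sigma keeps the standard deviation positive.
        return mu_x, log_sigma_x.exp(), mu_t, log_sigma_t.exp()

def gaussian_nll(length, mu, sigma):
    """-log N(length; mu, sigma^2), used for both terms of Eq. 8."""
    return (torch.log(sigma) + (length - mu) ** 2 / (2 * sigma ** 2)
            + 0.5 * math.log(2 * math.pi))
```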

Finally, the length matching penalty of a hypothesis is given by

$$\mathrm{lmp}(y_{<t}, x) = -\eta \cdot \mathbb{1}\big[\mathrm{LMS}(y_{<t}, x) > \tau\big] \cdot \mathrm{LMS}(y_{<t}, x) \qquad (9)$$

where $\mathbb{1}[\cdot]$ is an indicator, and $\eta$ and $\tau$ are hyperparameters. $\mathrm{LMS}(\cdot)$ computes a length matching score:

$$\mathrm{LMS}(y_{<t}, x) = -\int \mathcal{N}(l; \mu_t, \sigma_t^2) \log \mathcal{N}(l; \mu_x, \sigma_x^2)\, dl \qquad (10)$$

The LMS function outputs a large value when the final length tends to differ from the correct length. Note that Eq. 10 is the cross-entropy of two Gaussians, which can be deterministically computed as $\log(\sqrt{2\pi}\,\sigma_x) + \frac{\sigma_t^2 + (\mu_t - \mu_x)^2}{2\sigma_x^2}$.
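Because Eq. 10 has a closed form, the penalty is cheap to compute. A small sketch follows, using the reconstructed indicator form of Eq. 9, so the exact use of η and τ is an assumption:

```python
import math

def length_matching_score(mu_t, sigma_t, mu_x, sigma_x):
    """Closed-form cross-entropy of N(mu_t, sigma_t^2) w.r.t. N(mu_x, sigma_x^2)."""
    return (math.log(math.sqrt(2 * math.pi) * sigma_x)
            + (sigma_t ** 2 + (mu_t - mu_x) ** 2) / (2 * sigma_x ** 2))

def length_matching_penalty(mu_t, sigma_t, mu_x, sigma_x, eta, tau):
    lms = length_matching_score(mu_t, sigma_t, mu_x, sigma_x)
    # Only penalize hypotheses whose predicted final length clearly
    # mismatches the predicted correct length (indicator in Eq. 9).
    return -eta * lms if lms > tau else 0.0
```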

5 Experiments

5.1 Experimental Settings

We evaluate the proposed decoding algorithm with an off-the-shelf NMT model Bahdanau et al. (2014). The embeddings and LSTM layers have 1000 hidden units. We evaluate the algorithms on the ASPEC English-Japanese translation dataset Nakazawa et al. (2016), keeping an 80k vocabulary on the English side and a 40k vocabulary on the Japanese side. The NMT model is trained with Adam Kingma and Ba (2014) for 6 epochs. We report BLEU scores based on a standard post-processing procedure (we use Kytea to re-tokenize results; details can be found at http://lotus.kuee.kyoto-u.ac.jp/WAT/).

The additional network components for computing the length matching penalty ($f$, the LSTM, and $g$) have 128 hidden units in our experiments. They are trained with Adam for 2 epochs with the NMT parameters fixed.

The hyperparameters of the decoding algorithms are tuned by Bayesian optimization Snoek et al. (2012) on a small validation set composed of 500 sentences. We focus on evaluating the algorithms with a small beam size, which is more useful in a production system. We allow the decoding algorithms to run for a maximum of 150 steps.

5.2 Evaluation Results

The main results, obtained with a beam size of 5, are shown in Table 1. They show that the proposed single-queue decoding (SQD) algorithm significantly improves the quality of translations. With the length matching penalty (LMP), SQD outperforms beam search with length-normalization by 1.14 BLEU on the test set. Without the progress penalty (PG), the scores are much worse.

Since SQD computes hypotheses in batch mode in each step, just like beam search, the computational cost inside the loop of Alg. 1 remains the same. The factor affecting the overall computational cost is the actual number of decoding steps. To confirm that SQD does not improve performance by significantly increasing the number of steps, we also report the average number of steps and the decoding time for translating one sentence in the right-most columns. We can see that the average number of steps SQD takes remains close to that of beam search. Note that our implementation of LMP is not fully optimized; there is room for further speed improvement.

                          BLEU(%)           #steps   time (ms)
                          valid    test
vanilla beam search       29.61    32.87    30.3     199
 w/ length-norm           37.16    34.29    30.3     208
SQD w/o PG, LMP           38.09    34.62    36.1     238
SQD w/ PG                 38.50    35.03    33.8     225
SQD w/ PG, LMP            38.93    35.43    35.0     260

Table 1: Evaluation results on the ASPEC En-Ja task with a beam size of 5.

In order to gain insight into how SQD improves performance, we plot the average log-probability of the selected hypotheses in Fig. 1. We can see that SQD improves the quality of the hypotheses other than the first one in the beam, which indicates that the proposed algorithm can rescue high-quality hypotheses from the queue that were previously discarded.

6 Conclusion

In this paper, we extend beam search with a single hypothesis queue, which allows discarded hypotheses to be revisited and thus makes decoding more flexible. We design a length matching penalty to further help the proposed algorithm select hypotheses that can potentially produce a final translation of the correct length.

Although the proposed algorithm does not cause a speed issue, it requires a block of GPU memory for storing the decoder states of discarded hypotheses, which has a shape of $T \times B \times D$, where $T$ is the maximum number of steps, $B$ is the beam size, and $D$ is the dimension of the LSTM states. The increased memory usage does not cause a problem unless $T$ and $B$ are both large; for example, with $T = 150$, $B = 5$, $D = 1000$, and 32-bit floats, the buffer takes only about 3 MB per sentence.

The proposed algorithm is still compatible with other techniques, such as the threshold-based pruning method Freitag and Al-Onaizan (2017), reinforcement learning based scoring Li et al. (2017), reducing Softmax computation Hu et al. (2015); L’Hostis et al. (2016) and diverse decoding Li et al. (2016); Li and Jurafsky (2016).


Appendix A Supplemental Materials

In this work, we focus on testing the performance of our proposed algorithm with a small beam size. Theoretically, one can alleviate the problem of the limited search space by using a very large beam size; however, the increased computational cost makes this impractical in a production system. As supplemental data, we also report experimental results with different beam sizes in Table 2.

                           Test BLEU(%)
                           BS=5     BS=8     BS=12
vanilla beam search        32.87    32.91    32.67
 w/ length-normalization   34.29    34.82    35.05
SQD w/ PG                  35.03    35.44    35.65
SQD w/ PG, LMP             35.43    35.54    35.75

Table 2: Evaluation results on the ASPEC En-Ja task with different beam sizes (BS).

Our implementation is based on Theano. We utilize the Python package "bayes_opt" for Bayesian optimization, applying the default acquisition function "ucb" with a κ value of 5. The hyperparameters of the length matching penalty are searched independently from the others. We first explore 20 initial points, then optimize for another 20 iterations. The source code of this work, along with a toolkit that allows one to apply single-queue decoding and test new penalty functions for any encoder-decoder model, will be open-sourced at https://github.com/zomux/nmtdec.
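For illustration, here is a hypothetical setup of this search using the 2017-era bayes_opt API; the objective function and the bound ranges are placeholders, not the actual search space:

```python
from bayes_opt import BayesianOptimization

def validation_bleu(alpha, beta, gamma):
    """Placeholder objective: decode the 500-sentence validation set
    with the given hyperparameters and return its BLEU score."""
    raise NotImplementedError

optimizer = BayesianOptimization(
    f=validation_bleu,
    pbounds={"alpha": (0.0, 2.0), "beta": (0.0, 1.0), "gamma": (0.0, 2.0)},  # assumed ranges
)
# 20 random initial points, then 20 iterations of UCB with kappa = 5.
optimizer.maximize(init_points=20, n_iter=20, acq="ucb", kappa=5)
```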