Later-stage Minimum Bayes-Risk Decoding for Neural Machine Translation

04/11/2017 ∙ by Raphael Shu, et al. ∙ The University of Tokyo

For a long time, sequence generation models have relied on the beam search algorithm to produce output sequences. However, the quality of beam search degrades when the model becomes over-confident about a suboptimal prediction. In this paper, we propose to perform minimum Bayes-risk (MBR) decoding for some extra steps at a later stage. To speed up MBR decoding, we compute the Bayes risks on GPU in batch mode. In our experiments, we found that MBR reranking works only with a large beam size. Later-stage MBR decoding is shown to outperform simple MBR reranking in machine translation tasks.



Code repository: GPU-based BLEU Computation

1 Introduction

Recently, neural-based sequence generation models have achieved state-of-the-art performance in machine translation (Wu et al., 2016). Neural machine translation (NMT) models normally utilize recurrent neural networks (RNNs) as decoders to generate output tokens sequentially. In the decoding phase, it is common practice to use the beam search algorithm to find a candidate translation that approximately maximizes the posterior probability.

Although beam search finds much better candidates than greedy decoding, it still sometimes produces inappropriate outputs. Szegedy et al. (2016) have shown that a neural network can become too confident in a suboptimal prediction. In beam search, assigning high probability to a suboptimal prediction in one step may trigger a chain reaction that generates unnatural output sequences.

To improve beam search, various approaches have been explored recently, either by enhancing the scoring method (Li et al., 2016) or by using reinforcement learning (Li et al., 2017; Gu et al., 2017). In this work, we apply minimum Bayes-risk (MBR) decoding (Kumar and Byrne, 2004) to guide the decoding algorithm of NMT toward a better candidate. This approach exploits the similarity between candidate translations instead of predicting the quality of each single candidate.

In preliminary experiments, we found that simply reranking the results of NMT by their Bayes risks works only with a large beam size. However, we face two major difficulties when integrating MBR reranking into beam search.

Firstly, we found that performing MBR reranking simultaneously with beam search does not work: the reranking shrinks the diversity of the candidate space and thus traps the decoding algorithm in a suboptimal search space. Moreover, since the probability of a partial translation cannot represent its final confidence, the Bayes-risk values are inaccurate in the early steps.

Secondly, as MBR reranking requires computing pairwise discrepancy values to obtain the Bayes risks of candidates, a CPU-based implementation is excessively time-consuming.

In this paper, we propose to perform MBR decoding at a later stage to search for a “refined” candidate with low Bayes risk. To speed up MBR decoding, we designed two approaches to compute the Bayes risks on GPU, which are shown to be much faster than a standard implementation. The main contributions of this paper are twofold:

  1. We found that MBR reranking works only with a large beam size. In contrast, performing MBR decoding at a later stage proves effective regardless of the choice of beam size and outperforms simple MBR reranking.

  2. We found that the computation of Bayes risks can be greatly accelerated by computing the discrepancy matrix on GPU in batch mode.

2 Related Work

MBR decoding is widely applied in SMT (Kumar and Byrne, 2004; González-Rubio et al., 2011; Duh et al., 2011) and has been found to improve translation quality (Ehling et al., 2007). Recently, Stahlberg et al. (2016) utilized the translation lattice of SMT to guide NMT decoding with the MBR decision rule, which is shown to be better than simply rescoring the N-best results of SMT. A drawback of this approach is that it requires an SMT system to be available and to decode simultaneously with the NMT model.

Recently, several studies have proposed to enhance beam search with reinforcement learning. Li et al. (2017) utilize a simplified version of the actor-critic model to decode for arbitrary decoding objectives; the scoring function is modified to be an interpolation of the log probability and the output of the value function (or Q function).

Gu et al. (2017) extend the noisy, parallel approximate decoding (NPAD) algorithm (Cho, 2016) by adjusting the hidden states of the recurrent network with an agent trained with the deterministic policy gradient algorithm (Silver et al., 2014). These approaches predict the quality of each single candidate, whereas MBR reranking considers the relations among multiple candidates. Therefore, they can be combined with our model to further improve the quality of output sequences.

3 Minimum Bayes-Risk Decoding

MBR decoding is a technique for finding a candidate with the least expected loss (Bickel and Doksum, 1977). Following previous work in SMT (Kumar and Byrne, 2004), given an evidence space E, the Bayes risk of a candidate y is computed by:

    R(y) = Σ_{y′ ∈ E} Δ(y, y′) p(y′ | x)    (1)

The term Δ(y, y′) gives the discrepancy between two candidates, which in machine translation is normally derived from BLEU. In this paper, we use smoothed BLEU (Lin and Och, 2004), and the probability p(y′ | x) is calculated by a softmax over the average log probabilities of all candidates in E given by an NMT model. Intuitively, a candidate gets a low Bayes risk if it is similar to the candidates in the evidence space.
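As a concrete illustration, the risk of Eq. 1 can be sketched in a few lines of Python. The BLEU implementation below is a simplified add-one-smoothed sentence BLEU (up to bigrams) standing in for the smoothed BLEU of Lin and Och (2004), and the discrepancy is taken as 1 − BLEU; both choices are our assumptions for illustration, not the authors' exact code.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def smoothed_bleu(hyp, ref, max_n=2):
    """Simplified add-one-smoothed sentence BLEU, a stand-in for
    the smoothed BLEU of Lin and Och (2004)."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
        match = sum(min(c, r[g]) for g, c in h.items())
        total = sum(h.values())
        log_prec += math.log((match + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def bayes_risks(candidates, avg_logprobs):
    """Eq. 1: R(y) = sum_{y' in E} Delta(y, y') p(y'|x), with the
    discrepancy Delta(y, y') = 1 - BLEU(y, y') and p given by a
    softmax over the average log probabilities."""
    probs = softmax(avg_logprobs)
    return [sum(p * (1.0 - smoothed_bleu(y, y2))
                for y2, p in zip(candidates, probs))
            for y in candidates]
```

A candidate that is dissimilar to the rest of the evidence space receives the highest risk, matching the intuition stated above.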

4 Later-stage MBR Decoding

In this section, we propose a simple decoding strategy, which searches for low-risk hypotheses after finishing beam search. The basic idea is to utilize the results of beam search as an evidence space to guide the later-stage MBR decoding.

As the hypotheses discarded by beam search provide good starting points for finding low-risk hypotheses, we begin the later-stage decoding from a selection of discarded hypotheses rather than from scratch. (Each hypothesis is a tuple of the average log probability, the candidate tokens, and the last computed hidden state of the decoder LSTM.) To do this, in each step of beam search, we save the B discarded hypotheses that fall outside the beam to a hypothesis list H, where B is the beam size. After beam search, we select the top B finished hypotheses (a hypothesis is finished when an “EOS” token is reached) and save them to an evidence space E.

1:  Input: B = beam size, T = time budget
2:  H ← hypotheses discarded by beam search, E ← finished candidates
3:  for t = 1 … T do
4:      sort H by Eq. 2 in descending order
5:      pop the B hypotheses with the highest scores from H
6:      decode them for one step to get new hypotheses
7:      push the finished new hypotheses to E
8:      push the unfinished new hypotheses back to H
9:  perform MBR reranking for E
Algorithm 1 Later-stage MBR decoding

After collecting the discarded hypotheses in H and the evidences in E from beam search, we perform the later-stage MBR decoding for an extra T steps to recover low-risk hypotheses. In our experiments, T is fixed to the number of input words.

In each extra step, we sort the hypothesis list H with a score function, which is described later in Section 4.1. (As computing the scores for all hypotheses in H is time-consuming, in practice we only rerank the top hypotheses with the highest average log probabilities.) Then we pick the B hypotheses with the highest scores from H and generate the next words for them, resulting in new hypotheses. If a hypothesis is finished, we add it to the evidence space E; otherwise, it is put back into the hypothesis list H.

Finally, after performing the later-stage MBR decoding for T steps, we select the candidate translation with the lowest risk in E as the output. The complete algorithm is summarized in Alg. 1.
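The procedure above can be sketched compactly, with the decoder step, score function (Eq. 2), and risk function (Eq. 1) passed in as callables; all helper names and the hypothesis representation are illustrative, not the authors' implementation.

```python
def later_stage_mbr(discarded, finished, extend, score, risk, B, T):
    """Sketch of Alg. 1. `discarded` is the hypothesis list H saved
    during beam search, `finished` the evidence space E, `extend(h)`
    one decoder step returning (hypothesis, is_finished) pairs,
    `score(h, E)` the score of Eq. 2, and `risk(h, E)` Eq. 1."""
    H, E = list(discarded), list(finished)
    for t in range(T):                        # T extra decoding steps
        H.sort(key=lambda h: score(h, E), reverse=True)
        selected, H = H[:B], H[B:]            # pop the B best hypotheses
        for h in selected:
            for new, is_finished in extend(h):
                (E if is_finished else H).append(new)
    # final MBR selection over the evidence space E
    return min(E, key=lambda h: risk(h, E)) if E else None
```

With a toy `extend` and scalar hypotheses this runs end to end, which makes the control flow of the algorithm easy to verify.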

4.1 Score Function for Hypothesis Selection

In the later-stage MBR decoding, we want to guide the algorithm toward hypotheses with low Bayes risk under the evidence space E. To achieve this, in each extra step we select the hypotheses in H to decode according to a score function computed for each hypothesis:

    score(h) = p̄(h) − α R(h) + β pen(h)    (2)

where the first part of the equation, p̄(h), is the average log probability of hypothesis h, and the risk term R(h) is computed by Eq. 1.

However, reranking with the risk term alone over-penalizes short hypotheses, as they are necessarily dissimilar to the finished hypotheses in E. We therefore add a length-penalty term pen(h) to encourage the selection of short hypotheses, proportional to the number of remaining steps T − t.

In the last step of MBR decoding, when t = T, pen(h) becomes zero, so the hypotheses are selected only by their confidence scores and risks.

4.2 Fast Computation of Bayes Risk

Unfortunately, computing the Bayes risks for N candidates requires evaluating the discrepancy function Δ a total of N × N times. This computation is excessively time-consuming in a CPU-based implementation, whose bottleneck is computing the following discrepancy matrix:

    D_{ij} = Δ(y_i, y_j),  1 ≤ i, j ≤ N
To tackle this problem, we designed two approaches to compute the discrepancy matrix efficiently on GPU: (1) computing BLEU values in batches with a sophisticated GPU-based implementation, and (2) approximating the discrepancy values with a neural network. The advantage of the approximation approach is that the implementation is independent of the chosen discrepancy criterion. In practice, we found that reranking with approximated discrepancies performs as well as a standard reranker.

For the GPU-based BLEU computation, the trick is to construct a count vector containing the counts of all unique n-grams, from which the n-gram matches are computed. The implementation details are described in the supplementary material.

The neural-based approach approximates the true discrepancy values with a simple LSTM over candidate pairs, which can naturally be computed in batches. The LSTM reads the one-hot embeddings of the tokens of a candidate pair and is trained to regress the true discrepancy value (training details are given in the supplementary material).

4.3 Dynamically Adjusting Weights

In this section, we turn our attention to the score function in Eq. 2. Similar to Li et al. (2016), we apply the REINFORCE algorithm (Williams, 1992) to learn the optimal weights (α and β) for each input. The difference is that we do not discretize the weights to obtain a finite action space. Instead, we directly apply REINFORCE to learn a Gaussian policy with continuous actions. The merit of this approach is that we do not need to determine the effective range of each weight value beforehand.

We use the last state of the backward encoder LSTM in the NMT model to represent an input sequence, which is denoted by s. The stochastic policy is defined as:

    π(a | s) = ∏_{i=1}^{K} N(a_i ; μ_i(s), σ_i(s))

where K is the number of actions, which equals 2 in our case. The function approximators μ_i(s) and σ_i(s) are implemented with simple two-layer neural networks. At training time, we sample actions from the distribution defined by the policy π, whereas at test time the actions are computed deterministically as a_i = μ_i(s).
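A toy sketch of the Gaussian policy: for clarity the mean and standard deviation are produced by a single linear layer here (the paper uses two-layer networks), with an exponential to keep σ positive; the weight shapes and names are illustrative.

```python
import math
import random

def gaussian_policy(state_feats, W_mu, W_sigma, train=True, rng=random):
    """Continuous-action Gaussian policy: linear maps of the encoder
    state s produce the mean and (via exp, to stay positive) the
    standard deviation of each action. Actions are sampled at
    training time and taken deterministically (a_i = mu_i) at test
    time."""
    mu = [sum(w * x for w, x in zip(row, state_feats)) for row in W_mu]
    sigma = [math.exp(sum(w * x for w, x in zip(row, state_feats)))
             for row in W_sigma]
    if train:
        return [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    return mu  # deterministic test-time action
```

With K = 2 actions this yields exactly the two weights α and β of Eq. 2 per input sentence.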

5 Experiments

We evaluate our proposed decoding strategy on an English-Japanese translation task, using the ASPEC parallel corpus (Nakazawa et al., 2016). The corpus contains 3M training pairs and 1,812 sentence pairs in the test set. We tokenize the sentences with “tokenizer.perl” on the English side and Kytea (Neubig et al., 2011) on the Japanese side. The vocabulary sizes are cropped to 80k and 40k respectively. In our experiments, we trained an NMT model with a standard architecture (Bahdanau et al., 2014), which has 1000 units for both the embeddings and the LSTMs.

5.1 Fast Bayes-Risk Computation

In this section, we compare the speed of the GPU-based Bayes-risk rerankers and a standard reranker. A comparison of reranking speed is shown in Fig. 1.

Figure 1: A comparison of reranking speed (time per sentence) with different approaches to computing Bayes risks. N is the number of candidates.

For each input sentence, we rerank a list of N candidate translations and report the average reranking time per sentence. The results show that both GPU-based approaches are much faster than a standard CPU-based reranker, and are thus capable of being integrated into beam search. Remarkably, the average reranking time of the GPU-based approaches grows only linearly in N, rather than quadratically.

In our experiments, reranking candidates with approximated discrepancy values is found to perform as well as a normal reranker, as shown in the middle rows of Table 1. The training details are provided in the supplementary material.

5.2 Later-stage MBR Decoding

In Table 1, we compare different decoding strategies under various beam sizes. The BLEU scores are reported following a standard post-processing procedure. (We produce the BLEU scores with the Kytea tokenizer.)

                             B=5     B=20    B=100
standard beam search (BS)    34.58   35.20   35.24
BS + MBR rerank              34.56   35.35   35.65
BS + MBR rerank (approx)     34.57   35.34   35.70
BS + LaterMBR                34.96   35.67   35.87

Table 1: Evaluation results of different decoding strategies. B is the beam size.

We found that increasing the beam size to a large number does not by itself improve the evaluation scores. In contrast, MBR reranking improves the scores when a large beam size is used, but is less effective with a small beam size. Later-stage MBR decoding outperforms simple MBR reranking in all beam-size settings.

Additionally, we found that the number of candidates in the evidence space largely affects the effectiveness of MBR reranking. In our experiments, the number of evidences is fixed to the beam size; using more evidences degrades the quality of the selected candidates.

6 Conclusion and Future Work

In this paper, we propose a simple decoding strategy that searches for a hypothesis with the lowest Bayes risk at a later stage, which outperforms simple MBR reranking. We compute the Bayes risks on GPU to speed up the step-wise MBR decoding.

Interestingly, we found that simple MBR reranking is especially effective with a large beam size. Without MBR reranking, further increasing the beam size does not yield significant gains.

For future work, we intend to construct a better evidence space with an alternative neural network, in order to benefit the later-stage MBR decoding phase.


Appendix A Supplemental Materials

A.1 A Note on the Standard Implementation for Computing Bayes Risks

Since all the terms in the summation of Eq. 1 are non-negative, we stop computing the risk of a candidate once the partial sum already exceeds the lowest risk among the candidates computed so far. Even with this early-stopping technique, the standard MBR reranker still runs very slowly, as shown in Fig. 1 of the paper.
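The early-stopping trick can be sketched as follows, with the discrepancy function and the current best risk passed in; the hypothesis representation and helper names are illustrative.

```python
def risk_with_early_stop(y, evidence, probs, delta, best_so_far):
    """Bayes-risk computation with early stopping: because every term
    p(y') * Delta(y, y') is non-negative, the partial sum can only
    grow, so a candidate is abandoned as soon as it exceeds the
    lowest risk found so far. `delta` is the discrepancy of Eq. 1."""
    total = 0.0
    for y2, p in zip(evidence, probs):
        total += p * delta(y, y2)
        if total > best_so_far:  # cannot beat the current best
            return None          # signal: candidate pruned
    return total
```

Returning None for pruned candidates lets the caller skip them without finishing the summation.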

A.2 GPU-based BLEU Computation

For the particular discrepancy function based on BLEU, we found that a matrix of BLEU values can be calculated efficiently on GPU. The trick is to build a V-dimensional count vector c for each candidate, where V is the number of unique n-grams in the candidate space, containing the count of each of those n-grams in the candidate. Another vector r is used to indicate the rank (order n) of each n-gram.

For example, let the set of n-grams be {“a”, “in”, “park”, “ball”, “in a”, “a park”}. Then for the sentence “a park in a park”, c will be the 6-dimensional vector (2, 1, 2, 0, 1, 2), whereas r will be the vector (1, 1, 1, 1, 2, 2).

The BLEU score of a candidate pair can then be computed from their count vectors: the clipped n-gram matches are given by the element-wise minimum of the two count vectors, and the rank vector r groups them into per-order precisions. In this way, a whole BLEU matrix can be obtained in one shot from an input count matrix. In practice, we use smoothed BLEU, which simply adds 1 to both the numerator and the denominator of each precision fraction (Eq. 11).
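The count-vector construction and the clipped-match step can be sketched in plain Python (on GPU the same element-wise minimum is taken over whole count matrices in one batched operation); function names are illustrative.

```python
from collections import Counter

def extract_ngrams(tokens, max_n):
    """All n-grams of order 1..max_n in a token list."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def count_vectors(candidates, max_n=2):
    """Build the count vector c for each candidate over all unique
    n-grams of the candidate space, plus the rank vector r giving
    each n-gram's order n, as in the example above."""
    vocab, seen = [], set()
    for cand in candidates:
        for g in extract_ngrams(cand, max_n):
            if g not in seen:
                seen.add(g)
                vocab.append(g)
    ranks = [len(g) for g in vocab]
    vecs = []
    for cand in candidates:
        counts = Counter(extract_ngrams(cand, max_n))
        vecs.append([counts[g] for g in vocab])
    return vocab, ranks, vecs

def clipped_matches(c_a, c_b):
    """Element-wise minimum of two count vectors: the clipped n-gram
    matches used by BLEU."""
    return [min(x, y) for x, y in zip(c_a, c_b)]
```

Stacking the count vectors of all candidates into a matrix reduces the pairwise match computation to one batched minimum followed by per-rank sums.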

A.3 Training Details of the Neural-based Bayes-risk Approximation

To learn a neural-based estimator that approximates the discrepancy matrix, we collect 100K candidate pairs by decoding the source sentences in the training data. For each input sentence, a new ID is assigned to each unique token in the candidate space to reduce the vocabulary size. For example, “A cat eats” and “A dog eats” will be converted to (1, 2, 3) and (1, 4, 3) respectively.

The model is trained with an MSE loss to predict Δ(y, y′) given a candidate pair. We use the Adam optimizer (Kingma and Ba, 2014) and train the model for 50 epochs with a fixed learning rate. In practice, we scale the discrepancy scores by a constant factor before training.

A.4 A Note on Dynamic Weight Adjusting

The models for predicting μ and σ are both simple two-layer neural networks with a shared hidden layer of 100 units, followed by a tanh nonlinearity.

The gradient for updating the policy parameters θ by gradient ascent is given by the REINFORCE rule with a baseline:

    ∇_θ J = E_{a ∼ π_θ} [ (r(a, x) − b(x)) ∇_θ log π_θ(a | x) ]    (13)

where r(a, x) is the reward for taking action a given input x, and b(x) is a baseline computed by another simple two-layer neural network. In practice, for a Gaussian policy the gradients in Eq. 13 can be simplified: the gradient with respect to each mean μ_i(x) reduces to (r(a, x) − b(x)) (a_i − μ_i(x)) / σ_i(x)².
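For a factorized Gaussian policy, the gradient of (r − b) log π with respect to each mean μ_i simplifies to (r − b)(a_i − μ_i)/σ_i², a standard identity; the sketch below illustrates that simplified form with hypothetical names.

```python
def reinforce_grad_mu(action, mu, sigma, reward, baseline):
    """REINFORCE with a baseline for a Gaussian policy: the gradient
    of (r - b) * log N(a; mu, sigma^2) with respect to each mean mu_i
    is (r - b) * (a_i - mu_i) / sigma_i^2."""
    adv = reward - baseline  # advantage: reward minus baseline
    return [adv * (a - m) / (s * s) for a, m, s in zip(action, mu, sigma)]
```

The baseline only shifts the advantage; it leaves the expected gradient unbiased while reducing its variance.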