Top-Rank Enhanced Listwise Optimization for Statistical Machine Translation

Pairwise ranking methods are the basis of many widely used discriminative training approaches for structure prediction problems in natural language processing (NLP). Decomposing the problem of ranking hypotheses into pairwise comparisons enables simple and efficient solutions. However, neglecting the global ordering of the hypothesis list may hinder learning. We propose a listwise learning framework for structure prediction problems such as machine translation. Our framework directly models the entire translation list's ordering to learn parameters which may better fit the given listwise samples. Furthermore, we propose top-rank enhanced loss functions, which are more sensitive to ranking errors at higher positions. Experiments on a large-scale Chinese-English translation task show that both our listwise learning framework and top-rank enhanced listwise losses lead to significant improvements in translation quality.



1 Introduction

Discriminative training methods for structured prediction in natural language processing (NLP) aim to estimate the parameters of a model that assigns a score to each hypothesis in the (possibly very large) search space. For example, in statistical machine translation (SMT), the model assigns a score to each possible translation, and in syntactic parsing, the function assigns a score to each possible syntactic tree. Ideally, the model should assign scores that rank hypotheses according to their true quality. In this paper, we consider the problem of discriminative training for SMT.

Traditional SMT systems use log-linear models with only about a dozen features, such as translation probabilities and language model probabilities Yamada and Knight (2001); Koehn et al. (2003); Chiang (2005); Liu et al. (2006). These models can be tuned by minimum error rate training (MERT) Och (2003), which directly optimizes BLEU using coordinate ascent combined with a global line search.

To enable training of modern SMT systems, which can have thousands of features or more, many research efforts have been made towards scalable discriminative training methods Chiang et al. (2008); Hopkins and May (2011); Bazrafshan et al. (2012). Most of these methods either define loss functions that push the model to correctly compare pairs of hypotheses, or use approximate optimization methods that effectively do the same. For practical reasons, only a subset of the pairs is considered; these pairs are selected by either sampling Hopkins and May (2011) or heuristic methods Watanabe et al. (2007); Chiang et al. (2008).

But this pairwise approach neglects the global ordering of the list of hypotheses, which can make it harder to learn good parameter values. Inspired by research in information retrieval (IR) Cao et al. (2007); Xia et al. (2008), we propose to directly model the ordering of the whole translation list, instead of decomposing it into translation pairs.

Previous research has tried to integrate listwise methods into SMT, but almost all of it focuses on the reranking task, which aims to rescore the fixed translation lists generated by a baseline system. These approaches either use listwise methods to train the reranking model Li et al. (2013); Niehues et al. (2015) or replace the pointwise ranking function, i.e. the log-linear model, with a listwise ranking function by introducing listwise features Zhang et al. (2016). In this paper, we focus on listwise approaches that can learn better discriminative models for SMT. We present a listwise learning framework for tuning translation systems that uses two listwise ranking objectives originally developed for IR, ListNet Cao et al. (2007) and ListMLE Xia et al. (2008). But unlike standard IR problems, structured prediction problems usually have a huge search space, and at each training iteration, the list of search results can vary. The usual strategy is to form the union of all lists of search results, but this can lead to a “patchy” list that doesn’t represent the full search space well. Listwise approaches are always based on the permutation probability distribution over the list, so modeling the distribution over a “patchy” list, whose elements were generated by different parameters, will hurt their performance. To address this issue, we design an instance-aggregating method: instead of treating the data as a fixed-size set of lists that each grow over time as new translations are added at each iteration, we treat the data as a growing set of lists; each time a sentence is translated, the $k$-best list of translations is added as a new list.

We also extend standard listwise training by considering the importance of different instances in the list. Based on the intuition that instances at the top may be more important for ranking, we propose top-rank enhanced loss functions, which incorporate a position-dependent cost that penalizes errors occurring at the top of the list more strongly.

We conduct large-scale Chinese-to-English translation experiments showing that our top-rank enhanced listwise learning methods significantly outperform other tuning methods with high dimensional feature sets. Additionally, even with a small basic feature set, our methods still obtain better results than MERT.

2 Background

2.1 Log-linear models

In this paper, we assume a log-linear model, which defines a scoring function on target translation hypotheses $e$, given a source sentence $f$:

$$\mathrm{score}(e \mid f) = \mathbf{w} \cdot \mathbf{h}(e, f)$$

where $\mathbf{h}(e, f)$ is the feature vector and $\mathbf{w}$ is the feature weight vector.

The process of training an SMT system includes both learning the sub-models, which are included in the feature vector $\mathbf{h}$, and learning the weight vector $\mathbf{w}$.

Then the decoding of SMT systems can be formulated as a search for the translation $\hat{e}$ with the highest model score:

$$\hat{e} = \arg\max_{e \in \mathcal{E}} \mathrm{score}(e \mid f)$$

where $\mathcal{E}$ is the set of all reachable hypotheses.
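As a minimal sketch of the scoring and decoding rule above, the following Python snippet scores a handful of candidates with a dot product and picks the highest-scoring one. The candidate list, the feature values and the helper names (`score`, `decode_from_list`) are hypothetical and chosen for illustration; a real decoder searches a far larger space with beam search rather than enumerating candidates.

```python
import numpy as np


def score(w, h):
    """Log-linear model score: dot product of weight and feature vectors."""
    return float(np.dot(w, h))


def decode_from_list(w, candidates):
    """Return the (translation, features) pair with the highest model score.

    `candidates` is a hypothetical list of (translation, feature_vector)
    pairs standing in for the set of reachable hypotheses.
    """
    return max(candidates, key=lambda c: score(w, c[1]))


# Toy weights and candidates (illustrative values only).
w = np.array([0.5, -1.0, 2.0])
candidates = [
    ("the cat sat", np.array([1.0, 0.2, 0.1])),
    ("a cat sat", np.array([0.8, 0.1, 0.4])),
    ("cat the sat", np.array([0.3, 0.9, 0.0])),
]
best, best_features = decode_from_list(w, candidates)
print(best, score(w, best_features))
```

Tuning changes only `w`; the feature vectors are produced by the sub-models and stay fixed for a given hypothesis.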

2.2 SMT Features

In this paper, we use a hierarchical phrase-based translation system Chiang (2005). For convenient comparison, we divide the features of SMT into the following three sets.

Basic Features: The basic features are those commonly used in hierarchical phrase-based translation systems, including a language model, four translation model features, word, phrase and rule penalties, and penalties for unknown words, the glue rule and null translations.

Extended Features: Inspired by Chen et al. (2013), we manually group the parallel training data into 15 sets, according to their genre and origin. The translation models trained on each set are used as separate features. We also add an indicator feature for each individual training set to mark where the translation rule comes from. The extended features provide an additional 60 translation model features and 16 indicator features, which are too many to be tuned with MERT.

Figure 1: An example of word-phrase features for a phrase translation. The and represent the -th in the source phrase and -th word in the target phrase, respectively.

Sparse Features: We use word-phrase pair features as our sparse features, which reflect the word-phrase correspondence in a hierarchical phrase Watanabe et al. (2007). Figure 1 illustrates an example of word-phrase pair features for a phrase translation pair; the corresponding word-phrase pair features will be fired for the translation rule with the given word alignment. In practice, these features only fire when all the source and target words in the feature are among the top 100 most frequent words.

2.3 Tuning via Pairwise Ranking

The beam search strategy used in SMT decoding makes it convenient to obtain a $k$-best translation list for each source sentence. Given a set of source sentences and their corresponding translation lists, the tuning problem can be regarded as a ranking task. Many recently proposed SMT tuning methods are based on the pairwise ranking framework Chiang et al. (2008); Hopkins and May (2011); Bazrafshan et al. (2012).

Pairwise ranking optimization (PRO) Hopkins and May (2011) is a commonly used tuning method. The idea of PRO is to sample pairs $(e, e')$ from the $k$-best list, and train a linear binary classifier to predict whether $\mathrm{eval}(e) > \mathrm{eval}(e')$ or $\mathrm{eval}(e) < \mathrm{eval}(e')$, where eval is an extrinsic metric like BLEU. In this paper, we use sentence-level BLEU with add-one smoothing Lin and Och (2004).

The method gets a comparable BLEU score to MERT and MIRA Chiang et al. (2008), and scales well on large feature sets. Other pairwise ranking methods employ similar procedures.
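The pairwise procedure can be sketched roughly as follows. This is an illustration under stated assumptions, not the PRO implementation: the sample count, the `min_diff` threshold and the toy data are hypothetical, and a simple perceptron stands in for the binary classifier (the experiments in this paper use MegaM instead).

```python
import random

import numpy as np


def sample_pro_examples(features, metric, n_samples=200, min_diff=0.05, seed=0):
    """Turn one k-best list into binary classification examples.

    features: (k, d) array of feature vectors.
    metric:   length-k array of sentence-level scores (e.g. smoothed BLEU).
    Each kept pair (a, b) yields h_a - h_b labelled +1 if a is better,
    plus the mirrored example labelled -1.
    """
    rng = random.Random(seed)
    k = len(metric)
    X, y = [], []
    for _ in range(n_samples):
        a, b = rng.randrange(k), rng.randrange(k)
        if abs(metric[a] - metric[b]) < min_diff:
            continue  # skip pairs whose quality is nearly indistinguishable
        label = 1 if metric[a] > metric[b] else -1
        diff = features[a] - features[b]
        X.extend([diff, -diff])
        y.extend([label, -label])
    return np.array(X), np.array(y)


def train_perceptron(X, y, epochs=20):
    """Stand-in linear classifier (assumption: not the classifier used by PRO)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:
                w += y_i * x_i
    return w


# Toy list in which quality follows the first feature (illustrative only).
features = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]])
metric = np.array([0.1, 0.3, 0.5, 0.7])
X, y = sample_pro_examples(features, metric)
w = train_perceptron(X, y)
print(w)
```

Note that each example only encodes which of two hypotheses is better; the position of either hypothesis in the full list is not visible to the classifier.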

3 Listwise Learning Framework

Although ranking methods have shown their effectiveness in tuning for SMT systems Hopkins and May (2011); Watanabe (2012); Dreyer and Dong (2015), most proposed ranking approaches view tuning as pairwise ranking. These approaches decompose the ranking of the hypothesis list into pairs, which might limit the training method’s ability to learn better parameters. To preserve the ranking information, we first formulate training as an instance of the listwise ranking problem. Then we propose a learning method based on the iterative learning framework of SMT tuning and further investigate the top-rank enhanced losses.

3.1 Training Objectives

3.1.1 The Permutation Probability Model

In order to directly model the translation list, we first introduce a probabilistic model proposed by Guiver and Snelson (2009). A ranking of a list of $k$ translations can be thought of as a function $\pi$ from $\{1, \dots, k\}$ to translations, where each $\pi(j)$ is the $j$-th translation candidate in the ranking. A scoring function $g$ (which could be either the model score, score, or the BLEU score, eval) induces a probability distribution over rankings:

$$P_g(\pi) = \prod_{j=1}^{k} \frac{\exp\big(g(\pi(j))\big)}{\sum_{t=j}^{k} \exp\big(g(\pi(t))\big)}$$
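To make the permutation probability concrete, here is a small sketch that assumes the exponential (Plackett-Luce) form of the model: at each position, the next item is chosen with softmax probability among the items not yet placed. The function name and the toy scores are hypothetical.

```python
import itertools

import numpy as np


def permutation_log_prob(g, ranking):
    """Log-probability of `ranking` under the Plackett-Luce model.

    g:       length-k sequence of scores, one per translation.
    ranking: sequence of indices into g, best first.
    """
    s = np.asarray(g, dtype=float)[list(ranking)]
    log_p = 0.0
    for j in range(len(s)):
        tail = s[j:]
        m = tail.max()
        log_norm = m + np.log(np.exp(tail - m).sum())  # stable log-sum-exp
        log_p += s[j] - log_norm
    return log_p


# Enumerate all 3! rankings of a toy list (feasible only for tiny k).
g = [2.0, 1.0, 0.0]
probs = {
    p: float(np.exp(permutation_log_prob(g, p)))
    for p in itertools.permutations(range(3))
}
for p, value in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(p, round(value, 4))
```

The probabilities over all rankings sum to one, and the ranking that sorts the items by decreasing score is the most probable one.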


3.1.2 Loss Functions

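Based on the probabilistic model above, the loss function can be defined as the difference between the distributions over rankings induced by eval and by score, which we write as $P_{\mathrm{eval}}$ and $P_{\mathrm{score}}$. Thus, we introduce the following two standard listwise losses.

ListNet: The ListNet loss is the cross entropy between the distributions calculated from eval and score, respectively, over all permutations $\Omega_k$ of the list:

$$L_{\mathrm{Net}} = -\sum_{\pi \in \Omega_k} P_{\mathrm{eval}}(\pi) \log P_{\mathrm{score}}(\pi)$$

Due to the exponential number of permutations, Cao et al. (2007) propose a top-one loss instead. Given the functions eval and score, the top-one loss is defined as:

$$L_{\mathrm{Net}} = -\sum_{j=1}^{k} P'_{\mathrm{eval}}(e_j) \log P'_{\mathrm{score}}(e_j), \qquad P'_g(e_j) = \frac{\exp\big(g(e_j)\big)}{\sum_{t=1}^{k} \exp\big(g(e_t)\big)}$$

where $e_j$ is the $j$-th element in the $k$-best list, and $P'_g(e_j)$ is the probability that translation $e_j$ is ranked at the top by the function $g$.

ListMLE: The ListMLE loss is the negative log-likelihood of the permutation probability of the correct ranking $\pi_{\mathrm{eval}}$, calculated according to eval Xia et al. (2008):

$$L_{\mathrm{MLE}} = -\log P_{\mathrm{score}}(\pi_{\mathrm{eval}}) = -\sum_{j=1}^{k} \log \frac{\exp\big(\mathrm{score}(\pi_{\mathrm{eval}}(j))\big)}{\sum_{t=j}^{k} \exp\big(\mathrm{score}(\pi_{\mathrm{eval}}(t))\big)}$$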
The training objective, which we want to minimize, is simply the total loss over all the lists in the tuning set.
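The two losses can be sketched for a single list as below. This is an illustrative implementation that assumes the exponential form of the top-one and permutation probabilities; the function names and the toy scores are hypothetical.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log of the softmax distribution."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    return x - (m + np.log(np.exp(x - m).sum()))


def listnet_top_one_loss(model_scores, eval_scores):
    """Cross entropy between the top-one distributions of eval and the model."""
    p_eval = np.exp(log_softmax(eval_scores))
    return float(-(p_eval * log_softmax(model_scores)).sum())


def listmle_loss(model_scores, eval_scores):
    """Negative log-likelihood of the eval-sorted ranking under the model."""
    order = np.argsort(-np.asarray(eval_scores, dtype=float), kind="stable")
    s = np.asarray(model_scores, dtype=float)[order]
    # At position j, the item is chosen among the items ranked j or lower.
    return float(-sum(log_softmax(s[j:])[0] for j in range(len(s))))


# Toy list: `good` agrees with the metric ordering, `bad` reverses it.
eval_scores = [0.6, 0.3, 0.1]
good = [3.0, 2.0, 1.0]
bad = [1.0, 2.0, 3.0]
print(listnet_top_one_loss(good, eval_scores), listnet_top_one_loss(bad, eval_scores))
print(listmle_loss(good, eval_scores), listmle_loss(bad, eval_scores))
```

Both losses are lower for the model scores that order the list the same way as the metric. The top-one ListNet loss needs only one softmax per list, whereas ListMLE computes one normalizer per position.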

3.2 Training with Instance Aggregating

Because there can be exponentially many possible translations of a sentence, it’s only feasible to rank the $k$ best translations rather than all of them; because the feature weights change at each iteration, we have a different $k$-best list to rank at each iteration. This is different from standard ranking problems, in which the training instances stay the same at each iteration.

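Algorithm 1 MERT-like tuning algorithm
Require: Training sentences $\{f_i\}$, maximum number of iterations $T$, randomly initialized model parameters $\mathbf{w}_0$.
for $t = 1$ to $T$ do
    for each source sentence $f_i$ do
        Decode $f_i$ with $\mathbf{w}_{t-1}$ to obtain its $k$-best list $E_i^t$
        Add $E_i^t$ to the training instances $D$
    end for
    Training: $\mathbf{w}_t \leftarrow \mathrm{Optimize}(D, \mathbf{w}_{t-1})$
end for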

Many previous tuning methods address this problem by merging the $k$-best list at the current iteration with the $k$-best lists at all previous iterations into a single list Hopkins and May (2011). We call this $k$-best merging. More formally, if $E_i^t$ is the $k$-best list of source sentence $f_i$ at iteration $t$, then at each iteration $t$, the model is trained on the set of lists:

$$D_t^{\mathrm{merge}} = \Big\{ \bigcup_{t'=1}^{t} E_i^{t'} \;\Big|\; i = 1, \dots, N \Big\}$$

where $N$ is the number of tuning sentences. Each source sentence $f_i$ contributes only one training sample, which becomes a better and better approximation to the full hypothesis set of $f_i$ as more iterations pass.

Unlike previous tuning methods, our tuning method focuses on the distribution over permutations of the whole list. Moreover, unlike with listwise optimization methods used in IR, the $k$-best list produced for a source sentence at one iteration can differ dramatically from the $k$-best list produced at the next iteration. Merging $k$-best lists across iterations, each of which represents only a tiny fraction of the full search space, will lead to a “patchy” list that may hurt the learning performance of the listwise optimization algorithms.

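To address this challenge, we propose instance aggregating: instead of merging $k$-best lists across different iterations, we view the translation lists from different iterations as individual training instances. Writing $E_i^{t'}$ for the $k$-best list of the $i$-th of $N$ source sentences at iteration $t'$, the training set at iteration $t$ is:

$$D_t^{\mathrm{agg}} = \big\{ E_i^{t'} \;\big|\; i = 1, \dots, N;\; t' = 1, \dots, t \big\}$$

With this method, each source sentence has $t$ training instances at the $t$-th training iteration. In this way, we avoid “patchy” lists and obtain a better set of instances for tuning.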
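The difference between the two ways of collecting training data can be sketched with plain Python containers. The helper names, the duplicate handling in the merging variant, and the toy hypotheses are assumptions made for illustration.

```python
from collections import defaultdict


def kbest_merge(pool, sent_id, kbest):
    """k-best merging: one growing list per sentence.

    Assumption: hypotheses already in the pool are not added again.
    """
    for hyp in kbest:
        if hyp not in pool[sent_id]:
            pool[sent_id].append(hyp)


def instance_aggregate(instances, sent_id, kbest):
    """Instance aggregating: every decoded list becomes its own instance."""
    instances.append((sent_id, list(kbest)))


merged = defaultdict(list)
aggregated = []

# Hypothetical 2-best lists for sentence 0 at two tuning iterations.
kbest_per_iteration = [["a b", "a c"], ["a c", "d e"]]
for kbest in kbest_per_iteration:
    kbest_merge(merged, 0, kbest)
    instance_aggregate(aggregated, 0, kbest)

print(dict(merged))   # one list mixing hypotheses from both iterations
print(aggregated)     # two lists, each produced by a single set of weights
```

After two iterations the merged pool holds a single list of three hypotheses for the sentence, while the aggregated set holds two separate lists of two hypotheses each, so every list a listwise loss sees was generated by one set of parameters.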

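Algorithm 2 Listwise Optimization Algorithm
Require: Training instances $D$, model parameters $\mathbf{w}$, maximum number of epochs $J$, batch size $b$, number of batches $B$.
for epoch $= 1$ to $J$ do
    for batch $= 1$ to $B$ do
        Sample a minibatch of $b$ lists from $D$ without replacement
        Calculate the loss function on the minibatch
        Calculate the gradient of the loss with respect to $\mathbf{w}$
        Update $\mathbf{w}$ using the gradient
    end for
end for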

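The above instance aggregating method can be used in a MERT-like iterative tuning algorithm as shown in Algorithm 1, which can be easily integrated into current open source systems. The two standard listwise losses can be easily optimized using gradient-based methods (Algorithm 2); both losses are convex in the weights for a fixed set of training lists, so the optimization at each iteration has a global optimum that gradient-based methods can reach. Writing $\mathbf{h}(e)$ for the feature vector of translation $e$, $P'_g(e_j)$ for the probability that $e_j$ is ranked at the top by scoring function $g$, and $\pi$ for the ranking of the list according to eval, the gradients of ListNet and ListMLE with respect to the parameters for a single sentence are:

$$\frac{\partial L_{\mathrm{Net}}}{\partial \mathbf{w}} = \sum_{j=1}^{k} \big( P'_{\mathrm{score}}(e_j) - P'_{\mathrm{eval}}(e_j) \big)\, \mathbf{h}(e_j)$$

$$\frac{\partial L_{\mathrm{MLE}}}{\partial \mathbf{w}} = -\sum_{j=1}^{k} \left( \mathbf{h}(\pi(j)) - \frac{\sum_{t=j}^{k} \exp\big(\mathrm{score}(\pi(t))\big)\, \mathbf{h}(\pi(t))}{\sum_{t=j}^{k} \exp\big(\mathrm{score}(\pi(t))\big)} \right)$$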

For optimization, we use mini-batch stochastic gradient descent (SGD) together with the AdaDelta algorithm Zeiler (2012) to adaptively set the learning rate.
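A compact sketch of the inner optimization loop is shown below for the ListMLE loss with a linear model. It is a sketch under stated assumptions, not the system's implementation: the gradient is derived from the exponential form of the permutation probability, the AdaDelta constants `rho=0.95` and `eps=1e-6` are the defaults suggested by Zeiler (2012), and the data, function names and epoch count are hypothetical.

```python
import numpy as np


def listmle_loss_and_grad(w, H, eval_scores):
    """ListMLE loss and gradient for one list under a linear model.

    H: (k, d) feature matrix; eval_scores: length-k metric scores.
    """
    order = np.argsort(-np.asarray(eval_scores, dtype=float), kind="stable")
    Hs = H[order]                       # rows sorted by the reference ranking
    s = Hs @ w
    loss, grad = 0.0, np.zeros_like(w)
    for j in range(len(s)):
        tail = s[j:] - s[j:].max()      # shift for numerical stability
        p = np.exp(tail) / np.exp(tail).sum()
        loss -= np.log(p[0])
        grad -= Hs[j] - p @ Hs[j:]      # observed minus expected features
    return loss, grad


def train(lists, dim, epochs=100, batch_size=2, rho=0.95, eps=1e-6, seed=0):
    """Mini-batch SGD with AdaDelta step sizes."""
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    eg2, edx2 = np.zeros(dim), np.zeros(dim)    # running averages
    for _ in range(epochs):
        perm = rng.permutation(len(lists))      # sampling without replacement
        for start in range(0, len(lists), batch_size):
            batch = [lists[i] for i in perm[start:start + batch_size]]
            grad = sum(listmle_loss_and_grad(w, H, e)[1] for H, e in batch) / len(batch)
            eg2 = rho * eg2 + (1 - rho) * grad ** 2
            dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad
            edx2 = rho * edx2 + (1 - rho) * dx ** 2
            w = w + dx
    return w


# Two toy lists whose quality follows the first feature (illustrative only).
lists = [
    (np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 0.5]]), np.array([1.0, 0.5, 0.0])),
    (np.array([[0.9, 1.0], [0.4, 0.0], [0.1, 0.2]]), np.array([0.9, 0.4, 0.1])),
]
w0 = np.zeros(2)
loss_before = sum(listmle_loss_and_grad(w0, H, e)[0] for H, e in lists)
w = train(lists, dim=2)
loss_after = sum(listmle_loss_and_grad(w, H, e)[0] for H, e in lists)
print(loss_before, loss_after, w)
```

AdaDelta needs no hand-set learning rate, which is convenient here because the training set changes from one outer iteration to the next.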

4 Top-Rank Enhanced Losses

In evaluating an SMT system, one naturally cares much more about the top-ranked results than the lower-ranked results. We therefore believe that getting the ranking right at the top of a list is more relevant for tuning, and that we should pay more attention to the top-ranked translations instead of forcing the model to rank the entire list correctly.

Position-dependent Attention: To do this, we assign a higher cost to ranking errors that occur at the top and a lower cost to errors at the bottom. To make the cost sensitive to position, we define it as:


where $j$ is the position in the ranking and $k$ is the size of the list.

Based on this cost function, we propose simple top-rank enhanced listwise losses as extensions of both the ListNet loss and the ListMLE loss. The loss functions are defined as follows:

Along similar lines, Xia et al. (2008) also proposed a top-$n$ ranking method, which assumes that only the correct ranking of the top $n$ hypotheses is useful. Compared to our top-rank enhanced losses, it may be too harsh to discard information about the rest of the ordering altogether; our method retains the whole ordering but weights it by position.
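The contrast between position weighting and top-$n$ truncation can be illustrated with a position-weighted ListMLE loss. The cost used below, a weight that decays linearly from the first to the last position and sums to one, is a hypothetical choice for illustration and is not claimed to be the cost defined in this paper; the function names and toy scores are likewise assumptions.

```python
import numpy as np


def linear_position_cost(k):
    """Hypothetical cost: decays linearly from position 1 to k, sums to 1."""
    raw = np.arange(k, 0, -1, dtype=float)      # k, k-1, ..., 1
    return raw / raw.sum()


def top_n_cost(k, n):
    """Top-n ListMLE as a special case: unit weight on the first n positions."""
    c = np.zeros(k)
    c[:n] = 1.0
    return c


def weighted_listmle_loss(model_scores, eval_scores, cost):
    """ListMLE with a per-position weight on each log-probability term."""
    order = np.argsort(-np.asarray(eval_scores, dtype=float), kind="stable")
    s = np.asarray(model_scores, dtype=float)[order]
    loss = 0.0
    for j in range(len(s)):
        tail = s[j:]
        m = tail.max()
        log_norm = m + np.log(np.exp(tail - m).sum())
        loss -= cost[j] * (s[j] - log_norm)
    return float(loss)


eval_scores = [0.9, 0.7, 0.5, 0.3]
perfect = [4.0, 3.0, 2.0, 1.0]
top_swapped = [3.0, 4.0, 2.0, 1.0]       # error at the top of the list
bottom_swapped = [4.0, 3.0, 1.0, 2.0]    # error at the bottom of the list

uniform = np.ones(4)
position = linear_position_cost(4)
top2 = top_n_cost(4, 2)
for name, cost in [("uniform", uniform), ("position", position), ("top-2", top2)]:
    print(name, [round(weighted_listmle_loss(m, eval_scores, cost), 4)
                 for m in (perfect, top_swapped, bottom_swapped)])
```

On this toy list, the unweighted loss penalizes the bottom swap more than the top swap, the position-weighted loss reverses that preference, and the top-2 loss cannot tell the bottom swap apart from the perfect ordering at all.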

5 Experiments and Results

5.1 Data and Preparation

We conduct experiments on a large-scale Chinese-English translation task. The parallel data comes from LDC corpora (LDC2002E18, LDC2003E14, LDC2004E12, LDC2004T08, LDC2005T10 and LDC2007T09), which consist of 8.2 million sentence pairs. Monolingual data includes the Xinhua portion of the Gigaword corpus. We use the NIST MT03 evaluation test data as the development set, and MT02, MT04 and MT05 as the test sets.

| Data | Usage | Sents. |
|---|---|---|
| LDC | TM train | 8,260,093 |
| Gigaword | LM train | 14,684,074 |
| MT03 | train | 919 |
| MT02 | test | 878 |
| MT04 | test | 1,788 |
| MT05 | test | 1,082 |

Table 1: Experimental data and statistics.

The Chinese side of the corpora is word-segmented using ICTCLAS. Word alignments of the parallel data are learned by running GIZA++ Och and Ney (2003) in both directions and refined with the “grow-diag-final-and” method. We train a 5-gram language model on the monolingual data with modified Kneser-Ney smoothing Chen and Goodman (1999). Throughout the experiments, our translation system is an in-house implementation of the hierarchical phrase-based translation system Chiang (2005). The translation quality is evaluated by 4-gram case-insensitive BLEU Papineni et al. (2002). Statistical significance testing between systems is conducted by bootstrap re-sampling as implemented by Clark et al. (2011).
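For readers unfamiliar with bootstrap re-sampling, the general idea can be sketched as follows. This is a simplified illustration, not the procedure of the toolkit used in the experiments: it averages hypothetical per-sentence scores, whereas a faithful test would recompute corpus-level BLEU on each resampled test set; the function name and data are assumptions.

```python
import numpy as np


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    scores_a, scores_b: per-sentence scores of the two systems on the same
    sentences (a simplification of corpus-level BLEU).
    """
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(a)
    wins = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)    # sample sentences with replacement
        if a[idx].mean() > b[idx].mean():
            wins += 1
    return wins / n_resamples


# Hypothetical scores: system A is better on every sentence.
scores_b = np.linspace(0.2, 0.6, 50)
scores_a = scores_b + 0.1
print(paired_bootstrap(scores_a, scores_b))
```

A win fraction close to 1 suggests that the difference is unlikely to be an artifact of the particular test sentences; a fraction near 0.5 suggests the two systems are not distinguishable on this data.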

5.2 Tuning Settings

We build baselines for the extended and sparse feature sets with two different tuning methods. First, we tune with PRO Hopkins and May (2011). As reported by Cherry and Foster (2012), it’s hard to find a setting that performs well in general. We use the MegaM version Daumé III (2004) with 30 iterations for the basic feature set and 100 iterations for the extended and sparse feature sets. Second, we run k-best batch MIRA (KB-MIRA), which shows results comparable to the online version of MIRA Cherry and Foster (2012); Green et al. (2013). In our experiments, we run KB-MIRA with the standard settings in Moses. For the basic feature set, the baseline is tuned with MERT Och (2003).

For all our listwise tuning methods, we set the batch size to 10. In our experiments, we could not find a number of epochs that performs well in general, so we set it to 100 for ListMLE with basic features, 200 for ListMLE with extended and sparse features, and 300 for ListNet. These values are set to achieve the best performance on the development set.

We set the beam size to 20 throughout our experiments unless otherwise noted. Following Clark et al. (2011), we run the same training procedure 3 times and present the average results for stability. All tuning methods are executed for 40 iterations of the outer loop and return the weights that achieve the best development BLEU scores. For all tuning methods on the sparse feature set, we use the weight vector tuned by PRO on the extended feature set as initial weights.

5.3 Experiments of Listwise Learning Framework

We first investigate the effectiveness of our instance aggregating training procedure. The results are presented in Table 2, which compares training with instance aggregating and with $k$-best merging. As the results suggest, instance aggregating improves the performance of both listwise tuning approaches. For the rest of this paper, we use instance aggregating as the standard setting for listwise tuning approaches.

| Methods | MT02 | MT04 | MT05 | AVG |
|---|---|---|---|---|
| Net | 40.36 | 38.30 | 37.93 | 38.86 (+0.00) |
| ListNet | 40.75 | 38.69 | 38.31 | 39.25 (+0.39) |
| MLE | 39.82 | 37.88 | 37.65 | 38.45 (+0.00) |
| ListMLE | 40.40 | 38.21 | 38.04 | 38.88 (+0.43) |

Table 2: Comparison of instance aggregating and $k$-best merging on the extended feature set. (Net and MLE denote ListNet and ListMLE with $k$-best merging, respectively.)

| Method | Extended MT02 | Extended MT04 | Extended MT05 | Extended AVG | Sparse MT02 | Sparse MT04 | Sparse MT05 | Sparse AVG |
|---|---|---|---|---|---|---|---|---|
| PRO | 40.30 | 38.12 | 37.69 | 38.70 (+0.00) | 40.63 | 38.46 | 38.24 | 39.11 (+0.00) |
| KB-MIRA | 40.48 | 37.71 | 37.37 | 38.52 (-0.18) | 40.67 | 38.48 | 38.21 | 39.12 (+0.01) |
| ListNet | 40.75 | 38.69 | 38.31 | 39.25 (+0.55) | 40.91 | 38.77 | 38.42 | 39.37 (+0.26) |
| ListMLE | 40.40 | 38.21 | 38.04 | 38.88 (+0.18) | 40.63 | 38.68 | 38.24 | 39.18 (+0.07) |
| ListMLE-T5 | 41.02 | 38.84 | 38.79 | 39.55 (+0.85) | 41.12 | 38.91 | 38.89 | 39.64 (+0.53) |
| ListMLE-TE | 41.15 | 39.01 | 39.16 | 39.77 (+1.07) | 41.25 | 39.00 | 39.27 | 39.84 (+0.73) |

Table 3: BLEU4 in percentage for comparing baseline systems and systems with listwise losses. , marks results that are significantly better than the baseline system with and . (ListMLE-T5 and ListMLE-TE refer to top-5 ListMLE and our top-rank enhanced ListMLE, respectively.)

Figure 2: Effect of different $n$ for top-$n$ ListMLE. We investigate the effect on the extended feature set.

To verify the performance of our proposed listwise learning framework, we first compare systems with standard listwise losses to the baseline systems. The first four rows in Table 3 show the results. ListNet outperforms PRO by 0.55 BLEU and 0.26 BLEU on the extended and sparse feature sets, respectively. We attribute this mainly to the structured order information that listwise methods can exploit when the complete translation list is taken as a training instance.

We also observe that ListMLE achieves only a modest improvement compared to ListNet. We think the objective function of standard ListMLE, which forces the whole list to be ranked in the correct order, is too hard to satisfy. ListNet mainly benefits from its top-one permutation probability, which is concerned only with which object is ranked first.

Figure 3: Listwise losses vs. BLEU for (a) top-5 ListMLE and (b) top-rank enhanced ListMLE.

5.4 Effect of Top-rank Enhanced Losses

To verify our assumption that the correct rank in the top portion of a list is more informative, we conduct this set of experiments. Figure 2 shows the results of top-$n$ ListMLE with different $n$. Compared to ListMLE in Table 2, we find that top-$n$ ListMLE can yield significant improvements, which suggests that the top ranks are more important. We observe an improvement on all test sets as $n$ increases from 1 to 5, but the results drop when we increase $n$ further. This indicates that correct ranking at the top of the list is more informative, and that forcing the model to treat the bottom of the list as equally important sacrifices its ability to guide better search.

In Table 3, top-5 ListMLE, which only aims to rank the top five translations correctly, outperforms the baseline and standard ListMLE. With our position-dependent attention, the top-rank enhanced ListMLE achieves a further improvement over the baseline system (+1.07 and +0.73 on the extended and sparse feature sets, respectively) and the best performance overall.

The top-$n$ loss might be too loose as an approximation of BLEU. Compared to top-$n$ ListMLE, our top-rank enhanced ListMLE can make use of the different portions of the list by assigning them different weights. To verify this claim, we further examine the learning processes of the two losses. For simplicity, the experiment is conducted on a translation list generated by random parameters. The results are shown in Figure 3. We can see that our top-rank enhanced loss almost completely inversely correlates with BLEU after iteration 70. In contrast, after iteration 150, although the top-5 loss is still decreasing, BLEU starts to drop.

| Methods | MT02 | MT04 | MT05 | AVG |
|---|---|---|---|---|
| PRO | 40.90 | 38.84 | 38.64 | 39.46 (+0.00) |
| KB-MIRA | 41.09 | 38.49 | 38.62 | 39.40 (-0.06) |
| ListNet | 41.49 | 39.25 | 39.17 | 39.97 (+0.51) |
| ListMLE-T5 | 41.26 | 39.63 | 39.32 | 40.07 (+0.61) |
| ListMLE-TE | 41.85 | 39.96 | 39.88 | 40.56 (+1.10) |

Table 4: Comparison of baselines and listwise approaches with a larger k-best list on the extended feature set.

Due to the high computational cost of ListNet, we apply the top-rank enhancement only to ListMLE in this paper. Our preliminary experiments indicate that the performance of ListNet can be further improved with a top-2 loss. We think our top-rank enhanced method is also useful for ListNet, but because of its computational demands this needs further investigation.

5.5 Impact of the Size of Candidate Lists

Because our listwise tuning methods directly model the order of the translation list, the choice of the translation list size clearly has an impact on them. A larger candidate list may make more information available during tuning. To verify that our tuning methods can handle a larger translation list, we increase $k$ from 20 to 100. The comparison results are shown in Table 4. With a larger $k$, our tuning methods still perform better than the baselines. For ListNet and top-5 ListMLE, we observe that the improvements over the baseline are smaller than with size 20. These results show that the loss of order information caused by directly dropping the bottom of the list is aggravated by a larger list size. However, our top-rank enhanced method still obtains a slightly larger gain than with size 20 and a significant improvement over the baseline of 1.1 BLEU. This indicates that our top-rank enhanced method is more stable and can still effectively exploit the larger translation list.

5.6 Performance on Basic Feature Set

Because high-dimensional feature sets have proven effective, recent work pays more attention to that scenario. Although previous discriminative tuning methods can effectively handle high-dimensional feature sets, MERT is still the dominant tuning method for basic features. Here, we investigate how well our top-rank enhanced tuning methods handle the basic feature set. Table 5 summarizes the comparison results. First, we observe that ListNet and ListMLE perform comparably with MERT. With our top-rank enhanced method, we obtain better performance than MERT by 0.25 BLEU. These results show that our top-rank enhanced tuning method can learn more information from the translation list even with a basic feature set.

| Methods | MT02 | MT04 | MT05 | AVG |
|---|---|---|---|---|
| MERT | 37.72 | 37.13 | 36.77 | 37.21 (+0.00) |
| PRO | 37.85 | 37.21 | 36.68 | 37.24 (+0.03) |
| KB-MIRA | 37.97 | 37.28 | 36.58 | 37.28 (+0.07) |
| ListNet | 37.71 | 37.47 | 36.78 | 37.32 (+0.11) |
| ListMLE | 37.54 | 37.54 | 36.65 | 37.24 (+0.03) |
| ListMLE-T5 | 37.90 | 37.32 | 36.84 | 37.35 (+0.14) |
| ListMLE-TE | 38.03 | 37.49 | 36.85 | 37.46 (+0.25) |

Table 5: Comparison of baseline and listwise approaches on the basic feature set.

6 Related Work

The ranking problem is well studied in the IR community. Many methods have been proposed, including pointwise Nallapati (2004), pairwise Herbrich et al. (1999); Burges et al. (2005) and listwise Cao et al. (2007); Xia et al. (2008) algorithms. Experimental results show that listwise methods generally deliver better performance than pointwise and pairwise methods Liu (2010).

Most NLP research treats ranking as an extra step applied to the output of search Charniak and Johnson (2005); Collins and Terry Koo (2005); Duh (2008). In SMT research, listwise approaches have also been employed for reranking tasks. For example, Li et al. (2013) utilized two listwise approaches to rerank the translation outputs and achieved the best segment-level correlation with human judgments. Niehues et al. (2015) employed ListNet to rescore the $k$-best translations, which significantly outperforms MERT, KB-MIRA and PRO. Zhang et al. (2016) viewed the log-linear model as a pointwise ranking function and turned it into a listwise ranking function by introducing listwise features, outperforming the log-linear model. Compared to these efforts, our method takes a further step by integrating listwise ranking methods into the iterative training.

Some research also uses ranking methods for tuning to guide better search. In SMT, previous attempts at using ranking as a tuning method usually perform pairwise comparisons on a subset of translation pairs Chiang et al. (2008); Hopkins and May (2011); Watanabe (2012); Bazrafshan et al. (2012); Guzmán et al. (2015). Dreyer and Dong (2015) even took all translation pairs of the $k$-best list as training instances, which only obtained results comparable to PRO with a more complicated implementation. In this paper, we model the entire list as a whole unit, and propose training objectives that are sensitive to different parts of the list.

7 Conclusion

In this paper, we propose a listwise learning framework for statistical machine translation. In order to adapt listwise approaches, we use an iterative training framework in which instances from different iterations are aggregated into the training set. To emphasize the top of the list, we further propose top-rank enhanced listwise learning losses. Compared to previous efforts in SMT tuning, our method directly models the order information of the complete translation list. Experiments show that our method leads to significant improvements in translation quality across different feature sets and beam sizes.

Our current work focuses on the traditional SMT task. For future work, it will be interesting to integrate our methods into modern neural machine translation systems or other structure prediction problems. It may also be interesting to explore more methods within the listwise tuning framework, such as different ways to enhance the top of the translation list directly with respect to a given evaluation metric.


The authors would like to thank the anonymous reviewers for their valuable comments. This work is supported by the National Science Foundation of China (No. 61672277, 61300158 and 61472183). Part of Huadong Chen’s contribution was made while visiting University of Notre Dame. His visit was supported by the joint PhD program of China Scholarship Council.