Since Och [och2003minimum] proposed minimum error rate training (MERT) to exactly optimize objective evaluation measures, MERT has become a standard model tuning technique in statistical machine translation (SMT). Although improvements to its search algorithm have made MERT perform better [macherey2008lattice, cer2008regularization, galley2011optimal, moore2008random], it does not work well when there are many features. (The regularized MERT of Galley et al. [galley2013regularized] seems promising, at the cost of model complexity.) As a result, margin infused relaxed algorithms (MIRA) dominate in this case [mcdonald2005online, watanabe2007online, chiang2008online, tan2013corpus, cherry2012batch].
In SMT, MIRA variants consider margin losses related to sentence-level BLEU. However, since BLEU is not decomposable over sentences, these algorithms use heuristics to compute the exact losses, e.g., a pseudo-document [chiang2008online] or a document-level loss [tan2013corpus].
Other recent successful approaches to large-scale feature tuning include forced-decoding-based [yu2013max] and classification-based [hopkins2011tuning] methods.
We aim to provide a simpler tuning method for large-scale features than MIRA. Our motivation derives from an observation on MERT. As MERT considers the quality of only the top hypothesis set, there might be more than one set of parameters that have similar top-1 performance in tuning but very different top-N hypotheses. Empirically, we expect an ideal model to benefit the whole N-best list. That is, better hypotheses should be assigned higher ranks, and this might decrease the error risk of the top-1 result on unseen data.
Plackett [plackett1975analysis] offered an easy-to-understand theory for modeling a permutation. An N-best list is assumed to be generated by sampling without replacement. The $i$th hypothesis to be sampled depends only on those ranked after it, instead of on the whole list. This model also supports a partial permutation, which accounts for the top $k$ positions in a list regardless of the rest. When $k$ is 1, this model reduces to standard conditional probabilistic training, whose dual problem is actually maximum-entropy-based training [och2002discriminative]. Although Och [och2003minimum] substituted direct error optimization for maximum-entropy-based training, probabilistic models correlate well with BLEU when features are rich enough. A similar claim also appears in [zhu2001kernel]. This also makes the new method applicable to large-scale features.
2 Plackett-Luce Model
Plackett-Luce was first proposed to predict the ranks of horses in gambling [plackett1975analysis]. Let $\mathbf{r} = (r_1, r_2, \ldots, r_N)$ be $N$ horses with a probability distribution $\mathcal{P}$ on their abilities to win a game, where $p(r_j)$ is the probability that horse $r_j$ wins. A rank $\pi = (\pi_1, \pi_2, \ldots, \pi_{|\pi|})$ of horses can be understood as a generative procedure, where $\pi_j$ denotes the index of the horse in the $j$th position.
In the 1st position, there are $N$ horses as candidates, each of which has a probability $p(r_j)$ to be selected. Regarding the rank $\pi$, the probability of generating the champion is $p(r_{\pi_1})$. Then the horse $r_{\pi_1}$ is removed from the candidate pool.
In the 2nd position, there are only $N-1$ horses, and their probabilities to be selected become $p(r_j)/Z_2$, where $Z_2 = 1 - p(r_{\pi_1})$ is the normalization. Then the runner-up in the rank $\pi$, the $\pi_2$th horse, is chosen with probability $p(r_{\pi_2})/Z_2$. For consistency, we also use the notation $Z_1$ when selecting the champion, though $Z_1$ equals 1 trivially.
This procedure iterates to the last rank in $\pi$. The key idea of the Plackett-Luce model is that the choice in the $j$th position of a rank depends only on the candidates not chosen at previous stages. The probability of generating a rank $\pi$ is given as follows:
$$p(\pi) = \prod_{j=1}^{|\pi|} \frac{p(r_{\pi_j})}{Z_j}, \qquad Z_j = 1 - \sum_{t=1}^{j-1} p(r_{\pi_t}) \qquad (1)$$
We offer a toy example (Table 1) to demonstrate this procedure.
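The generative procedure can also be sketched in a few lines of Python. This is an illustration only: the function name `pl_probability` and the three-horse distribution are our own assumptions and are not the values of Table 1.

```python
from itertools import permutations

def pl_probability(ranking, probs):
    """Probability of a (possibly partial) ranking under Plackett-Luce.

    ranking: indices of the chosen horses, best first (length <= len(probs)).
    probs:   win probability of every candidate, assumed to sum to 1.
    """
    remaining = sum(probs)                # normaliser of the 1st position
    result = 1.0
    for idx in ranking:
        result *= probs[idx] / remaining  # chance of picking idx from the pool
        remaining -= probs[idx]           # remove the chosen horse from the pool
    return result

# Hypothetical toy distribution over three horses (not the paper's Table 1).
probs = [0.5, 0.3, 0.2]
full = pl_probability([0, 1, 2], probs)   # 0.5/1.0 * 0.3/0.5 * 0.2/0.2
top2 = pl_probability([0, 1], probs)      # only champion and runner-up
total = sum(pl_probability(p, probs) for p in permutations(range(3)))
```

With three horses the last position is fully determined, so `full` and `top2` coincide, and `total` sums the probabilities of all six complete rankings.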
Theorem 1. The permutation probabilities $p(\pi)$ form a probability distribution over a set of permutations $\Omega_\pi$. That is, for each $\pi \in \Omega_\pi$, we have $p(\pi) > 0$, and $\sum_{\pi \in \Omega_\pi} p(\pi) = 1$.
We have to note that $\Omega_\pi$ is not necessarily required to consist of completely ranked permutations, in theory or in practice, since gamblers might be interested in only the champion and runner-up, and thus $|\pi| \le N$. In experiments, we examine the effect of different permutation lengths, with the systems termed PL($k$), where $k = |\pi|$.
Theorem 2. Let $\pi$ and $\pi'$ be any two permutations that differ only in two positions $s$ and $t$, $s < t$, with $\pi_s = \pi'_t$ and $\pi_t = \pi'_s$. If $p(r_{\pi_s}) > p(r_{\pi_t})$, then $p(\pi) > p(\pi')$.
In other words, exchanging two positions in a permutation where the horse more likely to win is not ranked before the other would lead to an increase of the permutation probability.
This suggests that the ground-truth permutation, in which horses are ranked in decreasing order of their probabilities, has the maximum permutation probability under a given distribution. In SMT, we are motivated to optimize parameters to maximize the likelihood of the ground-truth permutation of an N-best list of hypotheses.
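The claim about the ground-truth permutation can be checked by brute force on a small example. The sketch below is only a sanity check under assumed values: the four probabilities are hypothetical, and `pl_probability` is our own helper, not code from the cited work.

```python
from itertools import permutations

def pl_probability(ranking, probs):
    """Plackett-Luce probability of a ranking (indices, best first)."""
    remaining, result = sum(probs), 1.0
    for idx in ranking:
        result *= probs[idx] / remaining
        remaining -= probs[idx]
    return result

probs = [0.1, 0.4, 0.2, 0.3]  # hypothetical winning probabilities

# Ground truth: horses sorted by decreasing probability.
ground_truth = tuple(sorted(range(len(probs)), key=lambda i: -probs[i]))

# Exhaustive search over all complete rankings.
best = max(permutations(range(len(probs))),
           key=lambda pi: pl_probability(pi, probs))

# Swapping the first two positions puts the weaker horse ahead.
swapped = (ground_truth[1], ground_truth[0]) + ground_truth[2:]
```

Exhaustive enumeration is only feasible for tiny lists; it is used here purely to illustrate the two properties stated in the text.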
Due to the limitation of space, see [plackett1975analysis, cao2007learning] for the proofs of the theorems.
3 Plackett-Luce Model in Statistical Machine Translation
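In SMT, let $\mathbf{f} = (f_1, f_2, \ldots)$ denote source sentences, and $\mathbf{e} = (\{e_{1,1}, \ldots\}, \{e_{2,1}, \ldots\}, \ldots)$ denote target hypotheses. A set of features is defined on both the source and target side. We refer to $h(e_{i,*})$ as the feature vector of a hypothesis from the $i$th source sentence, and its score from a ranking function is defined as the inner product $h(e_{i,*})^\top w$ of the weight vector $w$ and the feature vector.
We first follow the popular exponential style to define a parameterized probability distribution over a list of hypotheses:
$$p(e_{i,j}) = \frac{\exp\{h(e_{i,j})^\top w\}}{\sum_{t} \exp\{h(e_{i,t})^\top w\}}$$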
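The ground-truth permutation of an N-best list is simply obtained by ranking the hypotheses by their sentence-level BLEU. Here we only concentrate on their relative ranks, which are straightforward to compute in practice, e.g., with add-1 smoothing. Let $\pi^*_i$ be the ground-truth permutation of hypotheses from the $i$th source sentence. Our optimization objective is to maximize the log-likelihood of the ground-truth permutations, penalized using a zero-mean and unit-variance Gaussian prior. This results in the following objective and gradient:
$$L = \sum_i \log p(\pi^*_i) - \frac{1}{2} w^\top w$$
$$\frac{\partial L}{\partial w} = \sum_i \sum_{j} \Big( h(e_{i,\pi^*_{i,j}}) - \sum_{t \ge j} h(e_{i,\pi^*_{i,t}}) \, \frac{p(e_{i,\pi^*_{i,t}})}{Z_{i,j}} \Big) - w$$
where $Z_{i,j}$ is defined as the $Z_j$ in Formula (1) of the $i$th source sentence, $j$ ranges over the positions of $\pi^*_i$, and $t \ge j$ ranges over the hypotheses not yet chosen before position $j$.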
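The objective and gradient can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the experiments: the function name, the argument layout (feature vectors already sorted into ground-truth order) and the toy data are our assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pl_objective_and_gradient(w, lists, k):
    """Penalised PL(k) log-likelihood and its gradient.

    lists: one entry per source sentence; each entry holds the feature
           vectors of its hypotheses, sorted best-first by sentence BLEU.
    k:     number of top positions that contribute to the likelihood.
    """
    value = -0.5 * dot(w, w)          # zero-mean, unit-variance Gaussian prior
    grad = [-wi for wi in w]
    for feats in lists:
        scores = [dot(h, w) for h in feats]
        for j in range(min(k, len(feats))):
            pool = range(j, len(feats))          # hypotheses not yet chosen
            m = max(scores[t] for t in pool)     # shift for numerical stability
            exps = {t: math.exp(scores[t] - m) for t in pool}
            z = sum(exps.values())
            value += scores[j] - (m + math.log(z))
            for d in range(len(w)):
                expected = sum(feats[t][d] * exps[t] for t in pool) / z
                grad[d] += feats[j][d] - expected
    return value, grad

# Hypothetical toy data: two sentences with 3 and 2 hypotheses, 2 features.
lists = [[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
         [[0.2, 0.1], [0.9, 0.3]]]
value, grad = pl_objective_and_gradient([0.0, 0.0], lists, k=2)
```

At $w = 0$ every remaining hypothesis is equally likely, so the log-likelihood of the toy data is $-(\log 3 + \log 2 + \log 2)$; this gives a quick check of the code.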
The log-likelihood function is smooth, differentiable, and concave in the weight vector $w$, so any local maximum is also a global maximum. Iteratively selecting one parameter in $w$ for tuning in a line-search style (or MERT style) also converges to the global maximum [bertsekas1999nonlinear]. In practice, we use the faster limited-memory BFGS (L-BFGS) algorithm [byrd1995limited].
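The practical consequence of concavity is that the optimizer's starting point does not matter. The sketch below illustrates this with plain gradient ascent, which we use here as a simple stand-in for L-BFGS; the step size, iteration count and toy data are assumptions chosen for the example.

```python
import math

def objective_and_gradient(w, lists):
    """Full-permutation PL log-likelihood with a unit Gaussian prior."""
    value = -0.5 * sum(x * x for x in w)
    grad = [-x for x in w]
    for feats in lists:                       # feats sorted best-first
        scores = [sum(a * b for a, b in zip(h, w)) for h in feats]
        for j in range(len(feats)):
            exps = [math.exp(s) for s in scores[j:]]
            z = sum(exps)
            value += scores[j] - math.log(z)
            for d in range(len(w)):
                expected = sum(h[d] * e for h, e in zip(feats[j:], exps)) / z
                grad[d] += feats[j][d] - expected
    return value, grad

def gradient_ascent(w, lists, step=0.1, iterations=2000):
    """Plain gradient ascent (a stand-in for L-BFGS, not L-BFGS itself)."""
    for _ in range(iterations):
        _, grad = objective_and_gradient(w, lists)
        w = [x + step * g for x, g in zip(w, grad)]
    return w

# Hypothetical toy data: two sentences, hypotheses sorted best-first.
lists = [[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
         [[0.9, 0.3], [0.2, 0.1]]]
w_a = gradient_ascent([0.0, 0.0], lists)
w_b = gradient_ascent([3.0, -3.0], lists)    # very different starting point
```

Both runs end at the same weight vector up to numerical precision. A quasi-Newton method such as L-BFGS would reach it in far fewer iterations on realistic feature sets.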
N-best Hypotheses Resample
The log-likelihood of a Plackett-Luce model is not a strict upper bound of the BLEU score; however, it correlates well with BLEU in the case of rich features. The concept of “rich” is qualitative and hard to define across applications. We empirically provide a formula to measure the richness in the scenario of machine translation.
$$r = \frac{\text{number of features}}{\text{average size of N-best lists}} \qquad (5)$$
The greater $r$ is, the richer the features. In practice, we find that a rough threshold for $r$ is 5.
In engineering, the size of an N-best list with unique hypotheses is usually less than several thousand. This suggests that, if the features number in the thousands or more, the Plackett-Luce model is quite suitable. Otherwise, we could reduce the size of the N-best lists by sampling to make $r$ exceed the threshold.
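As a small worked example, the sketch below computes the richness measure and the largest list size that still meets the threshold. It assumes that $r$ is the ratio of the number of features to the average N-best list size; the function names are ours, and the feature counts are those of the two experimental settings described later (228 and 7491 features).

```python
import math

def richness(num_features, avg_nbest_size):
    """Richness r, assumed to be features divided by average list size."""
    return num_features / avg_nbest_size

def max_sample_size(num_features, threshold=5.0):
    """Largest list size that keeps r at or above the threshold."""
    return math.floor(num_features / threshold)

r_tuning = richness(228, 30)       # 8 basic + 220 group features, 30 samples
r_rerank = richness(7491, 300)     # reranking setting, about 300 hypotheses
limit = max_sample_size(228)       # upper bound on the sample size for tuning
```

Under this reading, sampling 30 hypotheses with 228 features gives $r = 7.6$, and the reranking setting gives $r \approx 25$, both above the threshold of 5.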
There may be other efficient sampling methods; here we adopt a simple one. If we want $m$ samples from a list of hypotheses $\mathbf{e}$, we first take some of the best hypotheses and some of the worst hypotheses by their sentence-level BLEU. Second, we sample the remaining hypotheses from the distribution $p(e_i) \propto \exp\{h(e_i)^\top w\}$, where $w$ is the initial weight vector from the last iteration.
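The two-step sampling can be sketched as below. Several details are assumptions made for illustration: the counts `n_best` and `n_worst` are hypothetical parameters, the remaining hypotheses are drawn without replacement, and the toy BLEU values and scores are invented.

```python
import math
import random

def resample(bleu, scores, m, n_best, n_worst, rng):
    """Return m hypothesis indices: BLEU extremes first, the rest by score.

    bleu, scores: parallel lists (sentence-level BLEU, model score).
    n_best, n_worst: hypothetical counts of BLEU-best and BLEU-worst
                     hypotheses that are always kept.
    """
    if n_best + n_worst > m:
        raise ValueError("n_best + n_worst must not exceed m")
    order = sorted(range(len(bleu)), key=lambda i: bleu[i], reverse=True)
    if m >= len(order):
        return order                       # nothing to drop
    chosen = order[:n_best] + order[len(order) - n_worst:]
    pool = order[n_best:len(order) - n_worst]
    top = max(scores[i] for i in pool)     # shift scores for stability
    weights = [math.exp(scores[i] - top) for i in pool]
    while len(chosen) < m:
        pick = rng.choices(range(len(pool)), weights=weights)[0]
        chosen.append(pool.pop(pick))      # without replacement (assumption)
        weights.pop(pick)
    return chosen

rng = random.Random(0)                     # fixed seed for reproducibility
bleu = [0.05 * i for i in range(10)]       # toy sentence-level BLEU values
scores = [0.1 * i for i in range(10)]      # toy model scores
picked = resample(bleu, scores, m=6, n_best=2, n_worst=2, rng=rng)
```

Keeping both ends of the BLEU range ensures that the sampled list still contains clearly good and clearly bad hypotheses, while the middle of the list is filled according to the current model.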
We compare our method with MERT and MIRA (the MIRA implementation is from the open-source Moses toolkit [koehn2007moses]) in two tasks: iterative training and N-best list reranking. We do not list PRO [hopkins2011tuning] as a baseline, as Cherry et al. [cherry2012batch] have compared PRO with MIRA and MERT extensively.
In the first task, we align the FBIS data (about 230K sentence pairs) with GIZA++, and train a 4-gram language model on the Xinhua portion of the Gigaword corpus. A hierarchical phrase-based (HPB) model [chiang2007hierarchical] is tuned on NIST MT 2002, and tested on MT 2004 and 2005. The features are the eight basic ones [chiang2007hierarchical] plus 220 extra group features. We design feature templates that group grammars by the length of the source side and target side, (feat-type, $a \le$ src-side $\le b$, $c \le$ tgt-side $\le d$), where feat-type denotes any of the relative frequency, reversed relative frequency, lexical probability and reversed lexical probability, and $[a, b]$, $[c, d]$ enumerate all possible subranges of $[1, 10]$, as the maximum length on both sides of a hierarchical grammar is limited to 10. There are $4 \times 55 = 220$ extra group features.
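The feature count can be verified with a short enumeration. The sketch only checks the arithmetic: the feature-type labels are illustrative names of our own, and the code does not model how source and target ranges are paired within a template.

```python
# Illustrative labels for the four feature types named in the text.
FEAT_TYPES = ["rel_freq", "rev_rel_freq", "lex_prob", "rev_lex_prob"]

def subranges(lo, hi):
    """All integer ranges [a, b] with lo <= a <= b <= hi."""
    return [(a, b) for a in range(lo, hi + 1) for b in range(a, hi + 1)]

ranges = subranges(1, 10)                     # length ranges on one side
num_group_features = len(FEAT_TYPES) * len(ranges)
```

There are $10 \cdot 11 / 2 = 55$ subranges of $[1, 10]$, which with four feature types gives the 220 group features.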
In the second task, we rerank an N-best list from an HPB system with 7491 features from a third party. The system uses six million parallel sentence pairs available to the DARPA BOLT Chinese-English task. This system includes 51 dense features (translation probabilities, provenance features, etc.) and up to 7440 sparse features (mostly lexical and fertility-based). The language model is a 6-gram model trained on 10 billion words, including the English side of our parallel corpora plus other corpora such as Gigaword (LDC2011T07) and Google News. For the tuning and test sets, we use 1275 and 1239 sentences respectively from the LDC2010E30 corpus.
4.1 Plackett-Luce Model for SMT Tuning
We conduct a full training of machine translation models. By default, the decoder is invoked at most 40 times, and each time it outputs hypotheses that are combined with those from previous iterations and sent to the tuning algorithms.
In obtaining the ground-truth permutations, there are many ties with the same sentence-level BLEU, and we just take one randomly. In this section, all systems have only around two hundred features; hence, in Plackett-Luce based training, we sample 30 hypotheses from the accumulated N-best list in each round of training.
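One way to build a ground-truth permutation with random tie-breaking is sketched below. We read "take one randomly" as ordering tied hypotheses at random; this reading, the function name and the toy BLEU values are assumptions.

```python
import random

def ground_truth_permutation(bleus, rng):
    """Indices sorted by decreasing sentence-level BLEU; ties broken at random."""
    keyed = [(-b, rng.random(), i) for i, b in enumerate(bleus)]
    return [i for _, _, i in sorted(keyed)]

rng = random.Random(1)                 # fixed seed for reproducibility
bleus = [0.3, 0.5, 0.3, 0.1]           # hypotheses 0 and 2 are tied
perm = ground_truth_permutation(bleus, rng)
```

Hypothesis 1 always comes first and hypothesis 3 last, while the order of the tied hypotheses 0 and 2 depends on the random draw.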
All results are shown in Table 2. None of the PL($k$) systems performs as well as MERT or MIRA on the development data. This may be because PL($k$) systems do not optimize BLEU directly and because the features here are relatively few compared to the size of the N-best lists (empirical Formula 5). However, PL($k$) systems are better than MERT in testing. PL($k$) systems consider the quality of hypotheses from the 2nd to the $k$th, which we conjecture plays a role similar to the margin of an SVM in classification. Interestingly, MIRA performs best in training and still performs quite well in testing.
The PL(1) system is equivalent to a max-entropy-based algorithm [och2002discriminative], whose dual problem is actually maximizing the conditional probability of one oracle hypothesis. When we increase $k$, the performance improves at first; after reaching a maximum (see Table 2), it decreases slowly. Our explanation is that, when features are rich enough, higher BLEU scores can be fitted easily, so longer ground-truth permutations include more useful information.
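To make the PL(1) case concrete, the objective for permutations of length 1 can be written out as below. The derivation assumes the exponential distribution over hypotheses and the Gaussian prior of Section 3; the symbols $h(e_{i,t})$ (feature vector of the $t$th hypothesis of sentence $i$), $w$ (weight vector) and $\pi^*_{i,1}$ (index of the hypothesis ranked first by sentence-level BLEU) are the notation assumed for this sketch.

```latex
\begin{align*}
L_{\mathrm{PL}(1)}
  &= \sum_i \log p\bigl(e_{i,\pi^*_{i,1}}\bigr) - \tfrac{1}{2}\, w^\top w \\
  &= \sum_i \Bigl( h\bigl(e_{i,\pi^*_{i,1}}\bigr)^\top w
       - \log \sum_t \exp\bigl\{ h(e_{i,t})^\top w \bigr\} \Bigr)
     - \tfrac{1}{2}\, w^\top w
\end{align*}
```

This is the regularized conditional log-likelihood of the single oracle hypothesis per sentence, which is why PL(1) coincides with maximum-entropy training.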
4.2 Plackett-Luce Model for SMT Reranking
After de-duplication, the N-best list has an average size of around 300, with 7491 features. According to Formula 5, this setting is well suited to the Plackett-Luce model. Results are shown in Figure 1. We observe some interesting phenomena.
First, the Plackett-Luce models boost the training BLEU substantially, by up to 2.5 points over MIRA. This supports our assumption that richer features benefit BLEU, even though the models are optimized towards a different objective.
Second, the over-fitting problem of the Plackett-Luce models PL($k$) is alleviated with moderately large $k$. In PL(1), the over-fitting is quite obvious: the portion in which the curve exceeds MIRA is the smallest among all $k$, and its converged performance is below the baseline. When $k$ is not smaller than 5, the curves are almost entirely above the MIRA line. After 500 L-BFGS iterations, their performance is no lower than the baseline, though only by a small margin.
This experiment shows that, with large-scale features, the Plackett-Luce model correlates with the BLEU score very well and alleviates over-fitting to some degree.