Phrase-based statistical machine translation (PBSMT) [williams2016] has proven to be an effective approach to the task of machine translation. Even though in recent years neural systems have attracted most of the attention from industry and academia, a number of recent works show that statistical approaches may still provide relevant results in hybrid, unsupervised or low-resource scenarios [artetxe, byrne]. In PBSMT systems, the source-language (SL) sentence is split into non-overlapping word sequences (known as phrases) and a translation into the target language (TL) is chosen for each phrase from a phrase table
of bilingual phrase pairs extracted from a parallel corpus. A decoder efficiently chooses the splitting points and the corresponding TL equivalents by using the information provided by a set of features, which usually include probabilities provided by translation models as well as a language model. In spite of their performance, PBSMT systems are affected by some limitations due to the local strategy they follow. More specifically, they tend to overlook long-range dependencies, which may negatively affect translation quality. Additionally, phrase reordering is usually implemented by means of distortion heuristics and lexicalised reordering models or an operation-sequence model [durrani15_osm], which cannot cope with the multiple structural divergences that are often necessary to translate between most language pairs.
Tree-based models address these issues by relying on a recursive representation of sentences that allows for gaps between words. Among these models, hierarchical phrase-based statistical machine translation (HSMT) [chiang2005, Chiang-CL-2007] has gained a lot of attention due to its relative simplicity and the fact that it requires no linguistic knowledge. HSMT infers synchronous context-free grammars (SCFGs); remarkably, it infers grammars with a single non-terminal (strictly speaking, two non-terminals are used, since an additional non-terminal, used in the glue rules, serves as the initial symbol of the grammar [chiang2005, Chiang-CL-2007]). Chiang-CL-2007 also sets some additional restrictions on the extracted rules in order to contain the combinatorial explosion in the number of rules, thus reducing decoding complexity.
This paper shows that the main limitation introduced in standard HSMT, namely, the use of a single non-terminal, can be overcome to improve translation quality. Our strategy differs from standard HSMT in that a different non-terminal is used for every possible gap when extracting the initial set of rules; this may easily result in millions of non-terminals, which are then merged following an equivalence criterion, thus reducing the initial number of non-terminals by several orders of magnitude. Our experiments show that the resulting set of non-terminals correctly captures the contextual information needed to obtain a statistically significant improvement over the BLEU score of the standard HSMT approach. However, additional research has to be carried out so that our method scales to large training corpora.
Section 2 discusses related work. Section 3 then introduces the formalism of synchronous context-free grammars and the standard procedure to obtain them from parallel corpora and use them in HSMT. Section 4 presents our proposal for rule inference, including our criterion for variable equivalence and refinement. The experimental set-up and the results of our experiments are presented in Section 5. Finally, the paper ends with some discussion and conclusions.
2 Related work
Probabilistic context-free grammars (PCFGs) have been traditionally used to model the generation of monolingual languages [Charniak-1994]. The refinement of PCFGs has been addressed before in a number of papers. For example, Matsuzaki-ACL-2005 use a set of latent symbols to annotate (and therefore specialize) the non-terminals in a PCFG. Every non-terminal in the grammar is split into a number of annotated non-terminals, and the probabilities of the new productions are estimated to maximize the likelihood of the training set of strings.
In the approach for the refinement of PCFGs followed by Matsuzaki-ACL-2005, the number of latent annotations must be small (ranging from 1 to 16 in their experiments) since the training time and memory grow very quickly with it. The input trees are also binarized before the estimation of the parameters to avoid an exponential growth in the number of productions contributing to every sum of probabilities. The method outperforms non-lexicalized approaches in terms of parsing accuracy measured over the Wall Street Journal (WSJ) portion of the Penn Treebank [Marcus-CL-1993, Marcus-HTL-1994].
In contrast, Petrov-ACL-2006 apply a more sophisticated procedure for the specialization of the non-terminals in a PCFG. Instead of generating all possible annotations, their hierarchical splitting starts with the nearly one hundred tags in the WSJ corpus (binarization of the parse trees is also applied here) and leads to about one thousand symbols with a significant reduction in parsing error rate. The procedure iteratively splits every symbol into two sub-symbols and performs expectation-maximization optimization to estimate the probabilities of the productions. This method allows for a deeper recursive partition of some symbols (up to 6 times in the experiments with the WSJ corpus, as reported by the authors). Their results showed that significant improvements in parsing accuracy can be achieved with grammars which are more compact than those obtained by Matsuzaki-ACL-2005. Parsing with the enhanced grammar can be accelerated with a coarse-to-fine scheme [Petrov-HLTC-2007] which prunes the items analyzed in the chart according to the estimations given by a simpler grammar where non-terminals are clustered into a smaller number of classes.
While strings are usually not enough to identify a particular PCFG, there exist methods which guarantee convergence to the true grammar when structural information is available. For example, Pereira-ACL-1992 apply expectation-maximization optimization to the identification of PCFGs from partially bracketed samples (sentences with parentheses marking the constituent boundaries which the analysis should respect). The probabilistic tree grammars obtained after the optimization are non-deterministic (in the sense that multiple states are reachable as the result of processing a subtree bottom-up) and their trees must be binarized, a procedure that needs some linguistic guidance in order to generate meaningful probabilistic models.
The identification of probabilistic grammars has also been addressed through state-merging techniques. For example, such techniques have been applied to the case of regular grammars modelling string languages by Stolcke-NIPS-1992 and Carrasco-ITA-1999. In these methods, the initial grammar has one state or non-terminal per string prefix in the sample and the procedure then looks for the optimal partition of the set of non-terminals, an approach similar to the one we follow in this work. Convergence is guaranteed by keeping a frontier of states with all strings whose prefixes have already been examined.
Experimental work on the identification of regular string languages from sparse data [Lang-ICGI-1998] has shown the importance of exploring first the nodes about which more information is available (a procedure known as evidence-driven merging). This result suggests that, instead of using straightforward breadth-first search (the simplest order which guarantees convergence), one should consider more elaborate orders in which those pairs of subtrees in the frontier with a higher number of observations are considered earlier.
Generalizing the merging techniques for string languages, Carrasco-ML-2001 define a procedure which identifies any regular tree language by comparing pairs of non-terminals: each initial non-terminal corresponds to a subtree in the sample and they are compared in a breadth-first mode and merged when they are found to be compatible.
As mentioned in the introduction, chiang2005 extended the phrase-based approach to statistical machine translation to tree-based translation by adapting the rule extraction algorithm to learn phrase pairs with gaps [Koehn2010], and also presented a decoding method based on chart parsing, thus obtaining a working hierarchical phrase-based statistical machine translation system. Chiang-CL-2007 then refined the method by introducing cube pruning to improve the efficiency of the chart decoder. Other authors [MaillettedeBuyWenniger2015, vilar2010, mylonakis2011] have introduced additional improvements to the original rule learning procedure.
3 Synchronous context-free grammars in machine translation
Transduction grammars [Lewis-ACM-1968] have been used in SMT to model the hierarchical nature of translation which is implicit in the alignments between words in a bitext [Wu-CL-1997, Chiang-CL-2007]. A bitext or bilingual text consists of a finite sequence of sentence pairs in which the second component of each pair, the target sentence, is a translation of the first component, the source sentence.
A word-level alignment annotates every sentence pair $(s, t)$ with a subset of $\{1, \ldots, |s|\} \times \{1, \ldots, |t|\}$, where $|s|$ and $|t|$ are the sentence lengths, and provides those pairs of word positions in the source and target sentences which can be linked together according to a translation model.
A transduction grammar consists of a finite set of non-terminals $N$; two sets of terminals $\Sigma$ and $\Delta$, which here consist of segments of words contained in source and target sentences, respectively; an initial symbol $S \in N$; a finite set of production rules $R$ (rules or productions, for short); and a set of one-to-one mappings, one per rule. For every rule $X \to (\alpha, \beta)$, its mapping couples every instance of a non-terminal in $\alpha$ with an instance of the same non-terminal in $\beta$; therefore, $\alpha$ and $\beta$ must have an identical number and type of non-terminals. A production in a transduction grammar will be written in the following as $X \to (\alpha, \beta)$ and its left and right components as $\alpha$ and $\beta$, respectively.
The synchronous context-free grammars introduced by Chiang-CL-2007 are transduction grammars whose productions have some restrictions:
there is a single productive non-terminal $X$, in addition to the initial non-terminal $S$, that is, $N = \{S, X\}$;
there is an upper limit of 10 words on the length of the bilingual phrases on either side;
there is an upper limit of 2 on the number of non-terminals that can appear in a production;
the source side of a production rule can have at most 5 terminals and non-terminals;
the productions cannot contain contiguous non-terminals on the source side and they must include some lexical content (terminals); and
two glue rules are defined to start derivations: $S \to (S^1 X^2, S^1 X^2)$ and $S \to (X^1, X^1)$.
The transduction grammars employed by Chiang-CL-2007 restrict terminals to be in the set of bilingual phrase pairs obtained with the same extraction algorithm used in phrase-based SMT [Koehn2010, sect. 5.2.2]: a bilingual phrase pair or biphrase is a pair $(u, v)$ which is consistent with the word alignments provided by a statistical aligner for the bitext. (Essentially, $(u, v)$ is a bilingual phrase pair if $u$ and $v$ are segments of the source and target sentences, respectively; no word in $u$ is aligned to a word outside $v$, and vice versa; and at least one word in $u$ is aligned to a word in $v$.) The procedure also applies the following extraction rules:
For every phrase pair $(u, v)$, add a production $X \to (u, v)$ to $R$.
For every production $X \to (\gamma_1 u \gamma_2, \delta_1 v \delta_2)$ in $R$ such that $(u, v)$ is a phrase pair, add the production $X \to (\gamma_1 X^k \gamma_2, \delta_1 X^k \delta_2)$ to $R$, extending the one-to-one mapping with a new link between the inserted pair of non-terminals.
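The consistency condition on phrase pairs can be sketched as follows. This is a minimal illustration of the standard extraction algorithm, not the code used in the paper; the function name and the alignment format (a set of 0-based position links) are our own choices, and the default length limit follows the 10-word restriction mentioned above.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=10):
    """src, tgt: token lists; alignment: set of (i, j) 0-based links."""
    pairs = []
    n = len(src)
    for i1 in range(n):
        for i2 in range(i1, min(n, i1 + max_len)):
            # target positions linked to the source span [i1, i2]
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue  # at least one alignment link is required
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # consistency: no word in [j1, j2] may link outside [i1, i2]
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append((tuple(src[i1:i2 + 1]), tuple(tgt[j1:j2 + 1])))
    return pairs
```

For a monotonic three-word alignment, this yields one consistent pair per contiguous source span.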
The previous procedure ends up generating production rules such as the following Chinese–English rule:

$X \to (\text{yu } X^1 \text{ you } X^2,\ \text{have } X^2 \text{ with } X^1)$

where the numbers in the superindexes are not used to represent different non-terminals but the coupling between non-terminals resulting from the corresponding one-to-one mapping.
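The coupling between non-terminals can be represented explicitly by tagging each non-terminal occurrence with its index. The following sketch is illustrative only; the class and field names are our own, not taken from any toolkit.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SyncRule:
    lhs: str
    src: tuple   # mix of terminal strings and (symbol, coupling_index) pairs
    tgt: tuple

    def is_well_formed(self):
        """Both sides must use exactly the same coupled non-terminals."""
        def coupled(side):
            return sorted(x for x in side if isinstance(x, tuple))
        return coupled(self.src) == coupled(self.tgt)
```

A rule is well-formed exactly when the multisets of indexed non-terminals on both sides coincide, regardless of their order.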
Each rule is given a probabilistic score. In order to find the most probable translation of an input sentence according to the grammar model, chart parsing is used at decoding time [Chiang-CL-2007].
4 A new method for grammar induction
Our strategy differs from the one by Chiang-CL-2007 introduced in the previous section in that a different non-terminal is used for every possible gap when extracting the initial set of rules; this may easily result in millions of non-terminals, which are then merged following an equivalence criterion, thus reducing the initial number of non-terminals by several orders of magnitude. Consequently, the following sections present the algorithm for the extraction of production rules (Section 4.1), the criterion for considering two non-terminals equivalent (Section 4.2) and the merging methods which join non-terminals based on the equivalence criterion (Section 4.3).
4.1 Extraction of production rules
The extraction phase assigns a different non-terminal to every production as follows (compare with the strategy proposed by Chiang-CL-2007 and described in Section 3):
Start with the initial set of non-terminals $N = \{S\}$, $S$ being the initial symbol, and an empty set of production rules $R = \emptyset$.
For every phrase pair $(u, v)$, add a new non-terminal $X_i$ to $N$, $i$ being the current size of $N$, and the new production $X_i \to (u, v)$ to $R$. If $(u, v)$ is a sentence pair, then also add $S \to (X_i^1, X_i^1)$ to $R$. In contrast to Chiang-CL-2007, the length of the phrase pairs used is not constrained. This results in a large number of production rules as well as in a large number of non-terminals, but it is necessary in order to be able to reproduce each sentence pair in the training corpus.
For every production $X_i \to (\gamma_1 u \gamma_2, \delta_1 v \delta_2)$ such that $(u, v)$ is a phrase pair and there is a production $X_j \to (u, v)$, add the production $X_i \to (\gamma_1 X_j^k \gamma_2, \delta_1 X_j^k \delta_2)$ to $R$. Note that the subscript in the left-hand side is not changed. Note also that for every phrase pair $(u, v)$ there is only one non-terminal, $X_j$, from which $(u, v)$ can be derived, either by means of the immediate rule $X_j \to (u, v)$ or through a number of derivation steps starting with some other rule having $X_j$ in its left-hand side.
Finally, the count $c(X_i)$ of every non-terminal is computed as done by Chiang-CL-2007, that is, $c(X_i)$ is the number of occurrences in the training corpus of the phrase pair $(u, v)$ generated by $X_i$. In order to generate the count of the production rule $X_i \to (u, v)$, the count $c(X_i)$ is equally distributed among all the productions generating that phrase pair, that is, between $X_i \to (u, v)$ and the other productions with $X_i$ in the left-hand side added in step 3.
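The extraction steps above can be sketched as follows for the monotonic case. This is a toy illustration: it assigns one fresh non-terminal per phrase pair and performs a single, string-level substitution per embedded pair, whereas the full procedure must respect the word alignments; the function names are ours.

```python
def find_sub(seq, sub):
    """Index of contiguous subsequence sub in seq, or None."""
    for i in range(len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return None

def build_rules(phrase_pairs):
    nt = {}            # phrase pair -> its unique non-terminal X_i
    rules = set()
    for p in phrase_pairs:
        nt[p] = f"X{len(nt) + 1}"
        rules.add((nt[p], p))          # immediate rule X_i -> (u, v)
    for (u, v) in phrase_pairs:
        lhs = nt[(u, v)]
        for (u2, v2) in phrase_pairs:
            if (u2, v2) == (u, v):
                continue
            iu, iv = find_sub(u, u2), find_sub(v, v2)
            if iu is not None and iv is not None:
                gap = nt[(u2, v2)]     # the embedded pair's non-terminal
                gu = u[:iu] + (gap,) + u[iu + len(u2):]
                gv = v[:iv] + (gap,) + v[iv + len(v2):]
                rules.add((lhs, (gu, gv)))   # X_i -> (.. X_j .., .. X_j ..)
    return nt, rules
```

Note that the left-hand side of a gapped rule keeps the non-terminal of the enclosing phrase pair, as in step 3 above.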
Figure 1 shows the resulting production rules and their fractional counts after applying our extraction procedure to the German–English sentence pairs (“das neue Haus”, “the new house”) and (“das Haus”, “the house”), assuming a monotonic alignment in which the $i$-th word of one sentence is aligned with the $i$-th word of the other sentence. Following Chiang-CL-2007, rules with no lexical content have not been considered.
[Figure 1: the production rules extracted from the two sentence pairs, with their fractional counts, pairing German and English segments such as (das neue Haus, the new house), (das neue, the new), (neue Haus, new house), (das Haus, the house), (neue, new) and (Haus, house), both as purely lexical rules and as rules in which an inner phrase pair is replaced by its coupled non-terminal.]
4.2 Determining equivalent non-terminals
As will be presented in Section 4.3, our method will merge pairs of equivalent non-terminals. We will denote with $X_i \equiv X_j$ the fact that $X_i$ and $X_j$ are equivalent and, thus, must be merged. Then, $X_i \equiv X_j$ implies that all pairs of productions which, for some $\gamma_1, \gamma_2$ in $\Sigma^*$ and $\delta_1, \delta_2$ in $\Delta^*$, have the form $X_k \to (\gamma_1 X_i \gamma_2, \delta_1 X_i \delta_2)$ and $X_l \to (\gamma_1 X_j \gamma_2, \delta_1 X_j \delta_2)$ (approximately) satisfy the following equality

$\frac{c(X_k \to (\gamma_1 X_i \gamma_2, \delta_1 X_i \delta_2))}{c(X_i)} = \frac{c(X_l \to (\gamma_1 X_j \gamma_2, \delta_1 X_j \delta_2))}{c(X_j)} \quad (1)$

and, recursively, $X_k \equiv X_l$. The approximate matching between the above quotients is probabilistic in nature and must therefore be defined in terms of a stochastic test, such as the Hoeffding-JASA-1963 bound. When using this test for proportion comparison, two proportions $f_1/n_1$ and $f_2/n_2$ are not statistically different if:

$\left| \frac{f_1}{n_1} - \frac{f_2}{n_2} \right| < \sqrt{\frac{1}{2} \ln \frac{2}{\alpha}} \left( \frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}} \right)$

where $\alpha$ is the confidence level.
Following sebban, we use a Fisher exact test as a back-off test in the experiments when the number of observations is small. Note that the possible use of $\alpha$ in the proportion test is twofold: on the one hand, we may set $\alpha$ to a fixed value in order to test whether two proportions are statistically different; on the other hand, we may compute the value of $\alpha$ for which the test changes from true to false and use this value as a continuous measure of non-terminal dissimilarity: after isolating $\alpha$ in the Hoeffding test equation and removing terms which are constant across different evaluations, we easily arrive at a function that provides the dissimilarity of two non-terminals in a particular comparison context based on their respective counts:

$D = \frac{\left| \frac{f_1}{n_1} - \frac{f_2}{n_2} \right|}{\frac{1}{\sqrt{n_1}} + \frac{1}{\sqrt{n_2}}} \quad (2)$
The dissimilarity of two non-terminals is therefore obtained as the maximum value of $D$ over all the contexts representing the different tests performed upon them as described in the previous algorithm.
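In code, the proportion test and the derived dissimilarity can be sketched as follows. This assumes the ALERGIA-style form of the Hoeffding bound given above; the function names are ours, and the Fisher back-off for small samples is omitted.

```python
import math

def compatible(f1, n1, f2, n2, alpha=0.05):
    """True if proportions f1/n1 and f2/n2 are not statistically different
    under the Hoeffding bound with confidence level alpha."""
    gap = abs(f1 / n1 - f2 / n2)
    bound = (math.sqrt(0.5 * math.log(2.0 / alpha))
             * (1 / math.sqrt(n1) + 1 / math.sqrt(n2)))
    return gap < bound

def dissimilarity(f1, n1, f2, n2):
    """Constant-free score: the larger it is, the stricter the confidence
    level at which the test would still reject equivalence."""
    return abs(f1 / n1 - f2 / n2) / (1 / math.sqrt(n1) + 1 / math.sqrt(n2))
```

Close proportions on small samples pass the test, while clearly different proportions on large samples fail it and receive a large dissimilarity.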
4.2.1 Example of equivalence computation
In order to gain some insight into the meaning of equivalence between non-terminals, let us consider the comparison between the non-terminal generating (neue Haus, new house), say $X_i$, and the one generating (Haus, house), say $X_j$, in the example in Figure 1; they are plausible candidates for equivalence since they both generate a noun phrase pair. Their fractional numbers of occurrences are given by $c(X_i)$ and $c(X_j)$. Note that $c(X_j)$ receives contributions from both sentence pairs.
Non-terminal $X_i$ appears in a single production, with right-hand side (das $X_i$, the $X_i$), and therefore equivalence implies that a production with right-hand side (das $X_j$, the $X_j$) must be in $R$ with a weight which is consistent with Equation (1). Indeed, such a production is in $R$, and both productions appear with relative frequencies, with respect to $c(X_i)$ and $c(X_j)$, which must be similar for the test to succeed.
Furthermore, the left-hand sides of both productions must also be equivalent (since they only appear in $S$-productions, this is trivially true).
The quotients above reflect that the word pair (Haus, house) has been observed in two different contexts: after the biword (das, the) and also following (neue, new); in contrast, the phrase pair (neue Haus, new house) has only been observed after (das, the). This asymmetry will lead to different values of the relative frequencies. Of course, one cannot expect to draw definitive conclusions from such a tiny bitext (clearly, a much larger sample would be needed to extract reliable estimates), but the example illustrates how the frequency estimates provide hints to differentiate between equivalent non-terminal pairs and those which, in contrast, should remain distinct.
Once productions with $X_i$ on the right-hand side have been checked, those with $X_j$ should be checked although, by the symmetry of the test, only those that have no corresponding $X_i$-production need to be considered. In our example, all these tests succeed and, thus, $X_i \equiv X_j$.
Note that, if done carefully, the recursion is always finite because the comparison between non-terminals is consistent with the depth of the subtree associated with each non-terminal.
An efficient algorithm has to be carefully designed in order to avoid duplicate calls when the equivalence between and is tested (since equivalence is a symmetric and transitive relation).
4.3 Merging variables
Given the equivalence criterion presented in the previous section, an algorithm for non-terminal merging can be run in order to group equivalent non-terminals and reduce their initial number. We have evaluated two different algorithms: an adaptation of the Blue-Fringe algorithm and a $k$-medoids clustering algorithm.
4.3.1 The Blue-Fringe algorithm
The following description, based on that by Lang-ICGI-1998 of the procedure proposed by Juille-AAAI-1998, adapts the Blue-Fringe algorithm for the identification of regular string languages to the case of transduction grammars.
The procedure splits the set of non-terminals into three subsets: red (the kernel of mutually non-equivalent non-terminals), blue (the frontier being explored) and white (the subset of non-terminals with pending classification). At every iteration, a non-terminal is removed from the frontier and either becomes a new member of the red set or is merged with an equivalent red non-terminal. As will be seen immediately, after every addition or merge, some white non-terminals can be moved to the blue frontier.
In order to avoid a possible infinite recursion in the equivalence tests, non-terminals in the red set must not produce, through derivation, any non-terminal outside it. This condition can be guaranteed if a non-terminal can only enter the blue frontier when all the content on the right-hand side of its productions is either lexical or already red: this implies that every non-terminal derivable from a red non-terminal is red as well.
Initially, leaves (non-terminals producing only irreducible phrase pairs) are red; non-terminals whose productions contain only lexical content and red non-terminals are blue; all other non-terminals are white. Our method merges pairs of equivalent non-terminals by comparing those with a higher number of observations first [Lang-ICGI-1998]. The policy described by Juille-AAAI-1998 performs the following actions while there are still blue non-terminals:
Evaluate all red–blue merges.
If there exists a blue non-terminal that cannot be merged with any red non-terminal (meaning that no equivalent non-terminal is found), promote one of the shallowest such blue non-terminals to red (ties are broken at random).
Otherwise (if no blue non-terminal can be promoted), perform the red–blue merge with the highest score. The score here is based on the fractional number of occurrences of the corresponding non-terminals.
Then, some white non-terminals are moved to the frontier as stated before.
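The policy above can be sketched schematically as follows. The `NT` class, the `equivalent` test and the `promotable` predicate are simplified stand-ins for the statistical machinery described in the text, and the bookkeeping (depths, counts, merge links) is deliberately minimal.

```python
from dataclasses import dataclass

@dataclass
class NT:
    name: str
    depth: int          # depth of the associated subtree
    count: float        # fractional number of occurrences
    merged_into: "NT" = None

def blue_fringe(red, blue, white, equivalent, promotable):
    while blue:
        # 1. evaluate all red-blue merges
        merges = [(r, b) for b in blue for r in red if equivalent(r, b)]
        mergeable = {id(b) for (_, b) in merges}
        stranded = [b for b in blue if id(b) not in mergeable]
        if stranded:
            # 2. promote a shallowest unmergeable blue non-terminal to red
            b = min(stranded, key=lambda x: x.depth)
            blue.remove(b)
            red.append(b)
        else:
            # 3. perform the merge involving the most observations
            r, b = max(merges, key=lambda rb: rb[1].count)
            blue.remove(b)
            b.merged_into = r
        # 4. move newly eligible white non-terminals to the frontier
        for w in list(white):
            if promotable(w, red):
                white.remove(w)
                blue.append(w)
    return red
```

A real implementation must also maintain the derivation constraint on the red set described above and break promotion ties at random.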
4.3.2 $k$-medoids clustering
The $k$-medoids algorithm is a clustering algorithm similar to the well-known $k$-means, but with the particularity that the center of each cluster (the medoid) is a point in the data, which is specially relevant in our case since the representative of each cluster must be an existing non-terminal in the grammar. Unlike the Blue-Fringe algorithm, which attains a different number of final non-terminals depending on the confidence level $\alpha$, the parameter set a priori in this case is the number $k$ of final clusters (i.e., non-terminals). When following this merging approach, the dissimilarity between non-terminals in Equation (2) is used to compute the required distances.
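A minimal $k$-medoids over a precomputed dissimilarity matrix can look as follows. This is a sketch: the paper does not specify the exact variant, initialisation or tie-breaking, so those choices (random initial medoids, most-central-member update) are ours.

```python
import random

def assign(dist, medoids):
    """Map each medoid to the list of points closest to it."""
    clusters = {m: [] for m in medoids}
    for p in range(len(dist)):
        clusters[min(medoids, key=lambda m: dist[p][m])].append(p)
    return clusters

def k_medoids(dist, k, iters=100, seed=0):
    """dist: symmetric matrix of pairwise dissimilarities (Equation 2)."""
    medoids = random.Random(seed).sample(range(len(dist)), k)
    for _ in range(iters):
        clusters = assign(dist, medoids)
        # the new medoid of each cluster is its most central member
        new = [min(ms, key=lambda c: sum(dist[c][q] for q in ms))
               for ms in clusters.values()]
        if sorted(new) == sorted(medoids):
            break
        medoids = new
    return medoids, assign(dist, medoids)
```

In the setting of the paper, each point would be one of the most frequent non-terminals and the remaining non-terminals would afterwards be attached to their nearest medoid.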
5 Experimental setup
We trained and evaluated our grammar induction procedure on the English–Spanish language pair using a small fraction of the EMEA corpus (http://opus.nlpl.eu/EMEA.php). Table 2 provides additional information about the corpora used in the experiments.
In order to have a manageable initial set of non-terminals and productions and make the problem computationally affordable, we limited the sentences included in the training corpus to a maximum of 20 words. In addition, instead of using the words themselves when defining the initial set of non-terminals, we used word classes. In particular, we used 10 word classes obtained by running mkcls (http://www.statmt.org/moses/giza/mkcls.html) [och99] for 5 iterations. As a result, the initial set contains 852,423 non-terminals, and the number of productions, not including those involving the initial non-terminal $S$, is 2,055,902.
Corpus   # Sentences   # Words (en/es)
train    73,372        519,763 / 556,453
dev       2,000         22,410 / 25,219
test      3,000         33,281 / 37,492
All the experiments were carried out with the free/open-source SMT system Moses [koehn2007], release 2.1.1. GIZA++ [ochney2003] was used for computing word alignments. KenLM was used to train a 5-gram language model on a monolingual corpus made of Europarl v7 (http://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz), News Commentary v8 (http://www.statmt.org/wmt13/training-monolingual-nc-v8.tgz) and the EMEA sentences in the training corpus; in total, the corpus used for training the language model consists of 3,423,702 Spanish sentences. The weights of the different feature functions were optimised by means of minimum error rate training [och2003]. The parallel corpora were tokenised and truecased before training, as were the development and test sets.
Table 3 reports the BLEU scores obtained on the test set when running the Blue-Fringe algorithm for different values of $\alpha$. The number of non-terminals in the inferred grammar is also reported. The performance of the baseline hierarchical phrase-based system [chiang2005] is 0.5818. The difference in performance is statistically significant for the figures in bold according to paired bootstrap resampling [koehn04:_statis].
[Table 3: number of final non-terminals and BLEU score on the test set for each value of $\alpha$.]
Table 4 shows the BLEU scores obtained on the test set when the $k$-medoids clustering algorithm is run over the most frequent non-terminals (# clustered non-terminals) to obtain a pre-set number of clusters, that is, of non-terminals in the inferred grammar; the remaining non-terminals are then added to the nearest cluster after the algorithm finishes. The table reports results for 2, 3 and 4 clusters; although we tried more clusters, these are the numbers of clusters for which we got the best results. The difference in performance with the baseline is statistically significant for all the figures reported in the table according to paired bootstrap resampling [koehn04:_statis].
The results obtained when the number of non-terminals over which the $k$-medoids algorithm is run is set to 250 and the number of clusters to obtain is set to 3 are better than those obtained with the Blue-Fringe algorithm and better than the results achieved by the baseline. The $k$-medoids approach allows us to get a 3% improvement in BLEU with just three non-terminals, in contrast with the thousand non-terminals obtained by the Blue-Fringe algorithm.
[Table 4: number of clustered non-terminals, number of final non-terminals and BLEU score on the test set.]
6 Discussion and conclusions
This work extends the well-known algorithm for rule extraction in hierarchical statistical machine translation originally proposed by Chiang-CL-2007. Our proposal allows for more than one non-terminal in the resulting synchronous context-free grammar, thus incorporating a specialisation of the resulting non-terminals. Our method works by initially creating a different non-terminal for every possible gap when extracting the initial set of rules; this may easily result in millions of non-terminals, which are then merged following a novel equivalence criterion for non-terminals. Two merging strategies are proposed: one inspired by the Blue-Fringe algorithm, which joins non-terminals on a one-by-one basis, and another one that performs a $k$-medoids clustering over a reduced set of non-terminals. A statistically significant improvement in BLEU with respect to the original method is obtained with both merging strategies.
For the experiments we used a small parallel corpus and restricted the length of the parallel sentences in the training corpus to 20 words. This was necessary in order to be able to run the merging algorithm in a reasonable amount of time. Recall that, in contrast to Chiang-CL-2007, we do not restrict the length of the phrase pairs, so that the parallel sentences in the training corpus can be reproduced; otherwise, long-range reorderings happening near the root of the parse tree of a sentence would not be possible. We tried different methods for filtering the rule table before applying the approach described in this paper, so as to be able to use larger corpora and longer sentences, but without success. A deeper exploration of potential optimizations is necessary.
Work supported by the Spanish government through project EFFORTUNE (TIN2015-69632-R). The authors thank Mikel L. Forcada for his helpful comments.