1 Introduction
Phrase-based statistical machine translation (PBSMT) [williams2016] has proven to be an effective approach to the task of machine translation. Even though in recent years neural systems have received most of the attention from industry and academia, a number of recent works show that statistical approaches may still provide relevant results in hybrid, unsupervised or low-resource scenarios [artetxe, byrne]. In PBSMT systems, the source-language (SL) sentence is split into non-overlapping word sequences (known as phrases) and a translation into the target language (TL) is chosen for each phrase from a phrase table of bilingual phrase pairs extracted from a parallel corpus. A decoder efficiently chooses the splitting points and the corresponding TL equivalents by using the information provided by a set of features, which usually include probabilities provided by translation models as well as a language model. In spite of their performance, PBSMT systems are affected by some limitations due to the local strategy they follow. More specifically, they tend to overlook long-range dependencies, which may negatively affect translation quality. Additionally, phrase reordering is usually implemented by means of distortion heuristics and lexicalised reordering models or an operation-sequence model
[durrani15_osm] that cannot cope with the multiple structural divergences that are often necessary to translate between most language pairs. Tree-based models address these issues by relying on a recursive representation of sentences that allows for gaps between words. Among these models, hierarchical phrase-based statistical machine translation (HSMT) [chiang2005, ChiangCL2007] has gained a lot of attention due to its relative simplicity and the lack of need for linguistic knowledge. HSMT infers synchronous context-free grammars (SCFG); remarkably, HSMT infers grammars with a single nonterminal.^1 ChiangCL2007 also sets some additional restrictions on the extracted rules in order to contain the combinatorial explosion in the number of rules, thus reducing decoding complexity.

^1 Strictly speaking, two nonterminals are used, since an additional nonterminal (used in the glue rules) is set as the initial symbol of the grammar [chiang2005, ChiangCL2007].
This paper shows that the main limitation introduced in standard HSMT, namely, the use of a single nonterminal, can be overcome to improve translation quality. Our strategy differs from standard HSMT in that a different nonterminal is used for every possible gap when extracting the initial set of rules; this may easily result in millions of nonterminals, which are then merged following an equivalence criterion, thus reducing the initial number of nonterminals by several orders of magnitude. Our experiments show that the resulting set of nonterminals correctly captures the contextual information needed to obtain a statistically significant improvement in BLEU over the standard HSMT approach. However, further research has to be carried out so that our method scales up to large training corpora.
Section 2 discusses related work. Section 3 then introduces the formalism of synchronous context-free grammars, the standard procedure to obtain them from parallel corpora, and their use in HSMT. Section 4 presents our proposal for rule inference, including our criterion for variable equivalence and the merging strategies. The experimental setup and the results of our experiments are presented in Section 5. Finally, the paper ends with some discussion and conclusions.
2 Related work
Probabilistic context-free grammars (PCFGs) have traditionally been used to model the generation of monolingual languages [Charniak1994]. The refinement of probabilistic context-free grammars has been addressed before in a number of papers. For example, MatsuzakiACL2005 use a set of latent symbols to annotate —and therefore specialize— nonterminals in a probabilistic context-free grammar. Every nonterminal in the grammar is split into a number of annotated nonterminals and the probabilities of the new productions are estimated to maximize the likelihood of the training set of strings.
In the approach for the refinement of PCFGs followed by MatsuzakiACL2005, the number of latent annotations per nonterminal must be small (ranging from 1 to 16 in the experiments) since the training time and memory grow very fast with it. The input trees are also binarized before the estimation of the parameters to avoid an exponential growth in the number of productions contributing to every sum of probabilities. The method outperforms non-lexicalized approaches in terms of parsing accuracy measured over the Wall Street Journal (WSJ) portion of the Penn treebank
[MarcusCL1993, MarcusHTL1994]. In contrast, PetrovACL2006 apply a more sophisticated procedure for the specialization of the nonterminals in a PCFG. Instead of generating all possible annotations, their hierarchical splitting starts with the nearly one hundred tags in the WSJ corpus (binarization of the parse trees is also applied here) and leads to about one thousand symbols with a significant reduction in parsing error rate. The procedure iteratively splits every symbol into two subsymbols and performs expectation-maximization optimization to estimate the probabilities of the productions. This method allows for a deeper recursive partition of some symbols (up to 6 times in the experiments with the WSJ corpus, as reported by the authors). Their results showed that significant improvements in parsing accuracy can be achieved with grammars which are more compact than those obtained by MatsuzakiACL2005. Parsing with the enhanced grammar can be accelerated with a coarse-to-fine scheme
[PetrovHLTC2007] which prunes the items analyzed in the chart according to the estimations given by a simpler grammar whose nonterminals are clustered into a smaller number of classes. While strings alone are usually not enough to identify a particular PCFG, there exist methods which guarantee convergence to the true grammar when structural information is available. For example, PereiraACL1992 apply expectation-maximization optimization to the identification of PCFGs from partially bracketed samples (sentences with parentheses marking the constituent boundaries which the analysis should respect). The probabilistic tree grammars obtained after the optimization are nondeterministic (in the sense that multiple states are reachable as the result of processing a subtree bottom-up) and their trees must be binarized, a procedure that needs some linguistic guidance in order to generate meaningful probabilistic models.
The identification of probabilistic grammars has also been addressed through state-merging techniques. For example, such techniques have been applied to the case of regular grammars modelling string languages by StolckeNIPS1992 and CarrascoITA1999. In these methods, the initial grammar has one state or nonterminal per string prefix in the sample and the procedure then looks for the optimal partition of the set of nonterminals, an approach similar to the one we follow in this work. Convergence is guaranteed by keeping a frontier of states covering all strings whose prefixes have already been examined.
Experimental work on the identification of regular string languages from sparse data [LangICGI1998] has shown the importance of exploring first the nodes about which more information is available (a procedure known as evidence-driven merging). This result suggests that, instead of using the straightforward breadth-first search (the simplest order which guarantees convergence), one should consider more elaborate orders in which those pairs of subtrees in the frontier with a higher number of observations are considered earlier.
Generalizing the merging techniques for string languages, CarrascoML2001 define a procedure which identifies any regular tree language by comparing pairs of nonterminals: each initial nonterminal corresponds to a subtree in the sample, and nonterminals are compared in a breadth-first manner and merged when they are found to be compatible.
As already mentioned in the introduction, chiang2005 extended the phrase-based approach in statistical machine translation to tree-based translation by adapting the rule extraction algorithm to learn phrase pairs with gaps [Koehn2010], and also presented a decoding method based on chart parsing, thus obtaining a working hierarchical phrase-based statistical machine translation system. ChiangCL2007 then refined the method by introducing cube pruning to improve the efficiency of the chart decoder. Other authors [MaillettedeBuyWenniger2015, vilar2010, mylonakis2011] have introduced additional improvements to the rule learning procedure in the original proposal.
3 Synchronous context-free grammars in machine translation
Transduction grammars [LewisACM1968] have been used in SMT to model the hierarchical nature of translation which is implicit in the alignments between the words in a bitext [WuCL1997, ChiangCL2007]. A bitext or bilingual text consists of a finite sequence of sentence pairs $(s, t)$, where every second component $t$, the target sentence, is the translation of the first component $s$ in the pair, the source sentence.
A word-level alignment annotates every sentence pair $(s, t)$ with a subset of $\{1, \ldots, |s|\} \times \{1, \ldots, |t|\}$ —where $|s|$ and $|t|$ are the sentence lengths— and provides those pairs of word positions in the source and target sentence which can be linked together according to a translation model.
A transduction grammar consists of a finite set of nonterminals $N$, two sets of terminals $\Sigma$ and $\Delta$ —here, $\Sigma$ and $\Delta$ consist of segments of words contained in source and target sentences, respectively—, an initial symbol $S \in N$, a finite set $R$ of production rules (rules or productions, for short) and a set of one-to-one mappings $\sim_r$ with $r \in R$. For every rule $r$, $\sim_r$ couples every instance of a nonterminal in the source side and an instance of the same nonterminal in the target side —therefore, both sides must have an identical number and type of nonterminals. A production in a transduction grammar will be written in the following as $A \to \langle \alpha, \beta \rangle$ and its left and right components as $\alpha$ and $\beta$, respectively.
The synchronous context-free grammars introduced by ChiangCL2007 are transduction grammars whose productions have some restrictions:

there is a single productive nonterminal $X$ in addition to the initial nonterminal $S$, that is, $N = \{S, X\}$;

there is an upper limit of 10 words on the length of the bilingual phrases on either side;

there is an upper limit of 2 on the number of nonterminals that can appear in a production;

the source side of a production rule can contain at most 5 terminals and nonterminals;

the productions cannot contain contiguous nonterminals on the source side and they must include some lexical content (terminals); and

two glue rules are defined to start derivations: $S \to \langle S^1 X^2, S^1 X^2 \rangle$ and $S \to \langle X^1, X^1 \rangle$.
The transduction grammars employed by ChiangCL2007 restrict terminals to be in the set of bilingual phrase pairs obtained with the same extraction algorithm used in phrase-based SMT [Koehn2010, sect. 5.2.2]: a bilingual phrase pair or biphrase is a pair $(u, v)$ which is consistent with the word alignments provided by a statistical aligner for the bitext.^2 The procedure also applies the following extraction rules:

^2 Essentially, a pair $(u, v)$ is a bilingual phrase pair if: $u$ and $v$ are segments of the source and target sentence, respectively; no word in $u$ is aligned to a word outside $v$ and vice versa; and at least one word in $u$ is aligned to a word in $v$.

For every phrase pair $(u, v)$, add a production $X \to \langle u, v \rangle$ to $R$.

For every production $X \to \langle \alpha, \beta \rangle$ such that $\alpha = \alpha_1 u \alpha_2$ and $\beta = \beta_1 v \beta_2$ with $(u, v)$ a phrase pair, add the production $X \to \langle \alpha_1 X \alpha_2, \beta_1 X \beta_2 \rangle$ to $R$ —with $\sim$ extended with a new link between the inserted pair of nonterminals.
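The biphrase consistency condition described in the footnote above can be sketched as follows. This is an illustrative Python sketch; representing spans as half-open `(start, end)` intervals over word positions is a choice made here, not notation from the paper:

```python
def is_consistent(src_span, tgt_span, alignment):
    """Return True if the candidate biphrase delimited by src_span and
    tgt_span is consistent with the word alignment of the sentence pair.

    src_span, tgt_span: half-open (start, end) ranges of word positions.
    alignment: set of (i, j) links between source and target positions.
    """
    has_internal_link = False
    for (i, j) in alignment:
        src_in = src_span[0] <= i < src_span[1]
        tgt_in = tgt_span[0] <= j < tgt_span[1]
        if src_in != tgt_in:      # a link crosses the phrase boundary
            return False
        if src_in and tgt_in:     # at least one internal link is required
            has_internal_link = True
    return has_internal_link
```

Every candidate span pair of a sentence pair would be filtered through this check before being added to the set of biphrases.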
The previous procedure ends up generating production rules such as the following English–Chinese rule:
where the superscript numbers are not used to represent different nonterminals but the coupling between nonterminals resulting from the corresponding one-to-one mapping.
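The second extraction rule —replacing an inner biphrase on both sides of a rule with a coupled nonterminal— can be sketched as follows. The tuple representation and the marker names `X1`, `X2`, … are illustrative choices, not the paper's notation:

```python
def add_gap(rule_src, rule_tgt, sub_src, sub_tgt, index):
    """Replace one occurrence of the biphrase (sub_src, sub_tgt) on both
    sides of a rule with a coupled nonterminal, rendered here as 'X<index>'.

    All sides are tuples of tokens; None is returned when the biphrase
    does not occur on both sides.
    """
    def replace(side, phrase, marker):
        for start in range(len(side) - len(phrase) + 1):
            if side[start:start + len(phrase)] == phrase:
                return side[:start] + (marker,) + side[start + len(phrase):]
        return None  # the phrase is not a contiguous subsequence of the side

    marker = "X%d" % index
    new_src = replace(rule_src, sub_src, marker)
    new_tgt = replace(rule_tgt, sub_tgt, marker)
    if new_src is None or new_tgt is None:
        return None
    return new_src, new_tgt
```

For example, carving the biphrase (neue, new) out of the rule for (das neue Haus, the new house) yields a rule with one coupled gap on each side.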
Each rule is given a probabilistic score. In order to find the most probable translation of an input sentence according to the grammar model, chart parsing is used at decoding time [ChiangCL2007].
4 A new method for grammar induction
Our strategy differs from the one by ChiangCL2007 introduced in the previous section in that a different nonterminal is used for every possible gap when extracting the initial set of rules; this may easily result in millions of nonterminals, which are then merged following an equivalence criterion, thus reducing the initial number of nonterminals by several orders of magnitude. Consequently, the following sections present the algorithm for the extraction of production rules (Section 4.1), the criterion for considering two nonterminals equivalent (Section 4.2) and the merging methods which join nonterminals based on the equivalence criterion (Section 4.3).
4.1 Extraction of production rules
The extraction phase assigns a different nonterminal to every production as follows (compare with the strategy proposed by ChiangCL2007 and described in Section 3):

Start with the initial set of nonterminals $N = \{S\}$ —$S$ being the initial symbol— and an empty set of production rules $R = \emptyset$.

For every phrase pair $(u, v)$, add a new nonterminal $A_{n+1}$ to $N$ —$n$ being the current size of $N$— and the new production $A_{n+1} \to \langle u, v \rangle$ to $R$. If $(u, v)$ is a sentence pair, then also add $S \to \langle A_{n+1}^1, A_{n+1}^1 \rangle$ to $R$. In contrast to ChiangCL2007, the length of the phrase pairs used is not constrained. This results in a large number of production rules as well as in a large number of nonterminals, but it is necessary in order to be able to reproduce each sentence pair in the training corpus.

For every production $A_i \to \langle \alpha_1 u \alpha_2, \beta_1 v \beta_2 \rangle$ such that $(u, v)$ is a phrase pair and there is a production $A_j \to \langle u, v \rangle$, add the production $A_i \to \langle \alpha_1 A_j \alpha_2, \beta_1 A_j \beta_2 \rangle$ to $R$. Note that the subscript in the left-hand side is not changed, that is, it remains $A_i$. Note also that for every phrase pair $(u, v)$ there is only one nonterminal $A_j$ that can be derived to obtain $(u, v)$, either by means of the immediate rule $A_j \to \langle u, v \rangle$ or through a number of derivations starting with some other rule having $A_j$ in the left-hand side.

Finally, the count of every nonterminal is computed as done by ChiangCL2007, that is, $c(A_j)$ is the number of occurrences in the training corpus of the phrase pair $(u, v)$ generated by $A_j$. In order to generate the counts of the production rules, $c(A_j)$ is equally distributed among all the productions generating that phrase pair, that is, between the immediate rule $A_j \to \langle u, v \rangle$ and the other productions with $A_j$ in the left-hand side added in step 3.
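Steps 1 and 2 of the extraction above can be sketched as follows; the nonterminal names `X1`, `X2`, … and the data representation are illustrative, and the gap-introduction step 3 and the count estimation of step 4 are omitted:

```python
from itertools import count

def extract_rules(phrase_pairs):
    """Sketch of steps 1 and 2 of the extraction in Section 4.1: every
    distinct phrase pair receives its own fresh nonterminal together with
    an immediate rule producing it.

    phrase_pairs: iterable of (source, target) tuples of tokens, assumed
    consistent with the word alignment.
    """
    fresh = count(1)
    nonterminal_of = {}   # phrase pair -> its unique nonterminal
    rules = []            # list of (left-hand side, right-hand side) rules
    for pair in phrase_pairs:
        if pair not in nonterminal_of:
            nonterminal_of[pair] = "X%d" % next(fresh)
            rules.append((nonterminal_of[pair], pair))
    return nonterminal_of, rules
```

Step 3 would then rewrite the right-hand sides of these rules, introducing each phrase pair's unique nonterminal wherever the pair occurs as an inner segment, and step 4 would distribute the observed counts among the resulting rules.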
Figure 1 shows the resulting production rules and their fractional counts after applying our extraction procedure to the German–English sentence pairs (“das neue Haus”, “the new house”) and (“das Haus”, “the house”), assuming a monotonic alignment in which the $i$-th word of one sentence is aligned with the $i$-th word of the other sentence. Following ChiangCL2007, rules with no lexical content have not been considered.
[Figure 1: the production rules extracted from the two example sentence pairs, together with their fractional counts —from the full-sentence rules for (das neue Haus, the new house) and (das Haus, the house) down to single-word rules such as (das, the), (neue, new) and (Haus, house), plus the gapped variants introduced in step 3; the nonterminal symbols and count values are not recoverable here.]
4.2 Determining equivalent nonterminals
As will be presented in Section 4.3, our method will merge pairs of equivalent nonterminals. We will denote with $A_i \equiv A_j$ the fact that $A_i$ and $A_j$ are equivalent and, thus, must be merged. Then, $A_i \equiv A_j$ implies that for all pairs of productions $r$ and $r'$ which, for some contexts $\alpha_1$ and $\alpha_2$ in $\Sigma^*$ and $\beta_1$ and $\beta_2$ in $\Delta^*$, have the form
$$r\colon\ A_i \to \langle \alpha_1 A_k \alpha_2, \beta_1 A_k \beta_2 \rangle$$
and
$$r'\colon\ A_j \to \langle \alpha_1 A_l \alpha_2, \beta_1 A_l \beta_2 \rangle$$
the following equality is (approximately) satisfied
$$\frac{c(r)}{c(A_i)} \approx \frac{c(r')}{c(A_j)} \qquad (1)$$
and, recursively, $A_k \equiv A_l$. The approximate matching between the above quotients is probabilistic in nature and must therefore be defined in terms of a stochastic test, such as the HoeffdingJASA1963 bound. When using this test for proportion comparison, two proportions $p_1 = n_1/N_1$ and $p_2 = n_2/N_2$ are not statistically different if:
$$|p_1 - p_2| < \sqrt{\frac{1}{2N_1}\ln\frac{2}{\alpha}} + \sqrt{\frac{1}{2N_2}\ln\frac{2}{\alpha}}$$
where $\alpha$ is the confidence level.^3

^3 Following sebban, we use a Fisher exact test as a back-off in the experiments when the number of observations is small.
Note that the possible use of $\alpha$ in the proportion test is twofold: on the one hand, we may set $\alpha$ to a fixed value in order to test whether two proportions are statistically different; on the other hand, we may compute the value of $\alpha$ for which the test changes from true to false and use this value as a continuous measure of nonterminal dissimilarity. After isolating $\alpha$ in the Hoeffding test equation and removing terms which are constant across different evaluations, we arrive at a function that provides the dissimilarity of two nonterminals in a particular comparison context based on their respective counts:
$$d(p_1, p_2) = \frac{|p_1 - p_2|}{\frac{1}{\sqrt{N_1}} + \frac{1}{\sqrt{N_2}}} \qquad (2)$$
The dissimilarity of two variables is therefore obtained as the maximum value of this measure over all the contexts representing the different tests performed on the variables as described in the previous algorithm.
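The proportion test and the derived dissimilarity can be sketched as follows. Since the exact constants of Equation (2) are not fully recoverable from the text, this implementation should be read as an approximation up to constant factors:

```python
import math

def proportions_differ(n1, N1, n2, N2, alpha):
    """Hoeffding-bound test for two proportions n1/N1 and n2/N2: returns
    True when they are statistically different at confidence level alpha
    (the exact form of the bound used in the paper is not fully shown)."""
    bound = (math.sqrt(math.log(2.0 / alpha) / (2.0 * N1))
             + math.sqrt(math.log(2.0 / alpha) / (2.0 * N2)))
    return abs(n1 / N1 - n2 / N2) > bound

def dissimilarity(n1, N1, n2, N2):
    """Continuous dissimilarity obtained by solving the bound for alpha
    and dropping constant factors, in the spirit of Equation (2):
    larger values indicate more dissimilar counts."""
    gap = abs(n1 / N1 - n2 / N2)
    return gap / (1.0 / math.sqrt(N1) + 1.0 / math.sqrt(N2))
```

Two wildly different proportions such as 1/100 and 99/100 fail the test (and get a high dissimilarity), while 50/100 versus 51/100 passes it.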
4.2.1 Example of equivalence computation
In order to gain some insight into the meaning of equivalence between nonterminals, let us now consider, for instance, the comparison between and in the example in Table 1, which can be considered plausible candidates for equivalence since they both generate a noun phrase pair. The fractional numbers of occurrences for and are given by and . Note that receives contributions from both sentence pairs.
Nonterminal is only in —with weight — and therefore, equivalence implies that a production with right-hand side (das , the ) must be in with a weight which is consistent with Equation (1). Indeed, is in with weight : if both productions appear, their relative frequencies must be similar. In this case,
while
Furthermore, the left-hand sides, and , must also be equivalent.^4

^4 Since they only appear in productions, this is trivially true.
The quotients above reflect that the word pair (Haus, house) has been observed in two different contexts: after the biword (das, the) and also following (neue, new); in contrast, the phrase pair (neue Haus, new house) has been only observed after (das, the). This asymmetry will lead to different values in the relative frequencies. Of course, one cannot expect to draw definitive conclusions from such a tiny bitext —clearly, a much larger sample will be needed to extract reliable estimates— but the example illustrates how the frequency estimates provide hints to differentiate between equivalent nonterminal pairs and those which, in contrast, should remain distinct.
Once productions with on the right-hand side have been checked, those with should be checked although, by the symmetry of the test, only those that have no corresponding production. In our example, the tests will be
and
and, thus, .
Note that, if done carefully, recursion is always finite because the comparison between nonterminals is consistent with the depth of the subtree associated with each nonterminal.
An efficient algorithm has to be carefully designed in order to avoid duplicate calls when the equivalence between and is tested (since equivalence is a symmetric and transitive relation).
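One way to exploit the symmetry and transitivity of the equivalence relation is to maintain a union-find structure over nonterminals, so that the stochastic test is never repeated for pairs whose members have already been joined; this data structure is a suggested implementation device, not part of the original method:

```python
class MergeSet:
    """Union-find over nonterminals: once two nonterminals are merged they
    share a representative, so the (symmetric, transitive) equivalence test
    never needs to be repeated for pairs that are already together."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def merge(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra
```

Before running the potentially expensive stochastic test on a pair, one would first check whether `find` already returns the same representative for both members, and cache test outcomes keyed on the (unordered) pair of representatives.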
4.3 Merging variables
Given the equivalence criterion presented in the previous section, an algorithm for nonterminal merging can be run in order to group equivalent nonterminals and reduce their initial number. We have evaluated two different algorithms: an adaptation of the Blue-Fringe algorithm and a k-medoids clustering algorithm.
4.3.1 The Blue-Fringe algorithm
The following description, based on that by LangICGI1998 of the procedure proposed by JuilleAAAI1998, adapts the Blue-Fringe algorithm for the identification of regular string languages to the case of transduction grammars.
The procedure splits the set of nonterminals into three subsets: red (the kernel of mutually non-equivalent nonterminals), blue (the frontier being explored) and white (the subset of nonterminals with pending classification). At every iteration a nonterminal is removed from the frontier and either it becomes a new member of the red set or it is merged with an equivalent red nonterminal (and, thus, removed from the grammar). As will be seen immediately, after every addition or merge, some white nonterminals can be moved to the frontier.
In order to avoid a possible infinite recursion in equivalence tests, red nonterminals must not produce any blue nonterminal through derivation. This condition can be guaranteed if a nonterminal can only enter the frontier when all the content on the right-hand side of its productions is either lexical or already red: this implies that any production whose left-hand side is red contains only red nonterminals on its right-hand side.
Initially, leaves (nonterminals producing only irreducible phrase pairs) are red; nonterminals having a production whose right-hand side contains only lexical content and red nonterminals are blue; all other nonterminals are white. Our method merges pairs of equivalent nonterminals by comparing those with a higher number of observations first [LangICGI1998]. The policy described by JuilleAAAI1998 performs the following actions while there are still blue nonterminals:

Evaluate all red–blue merges.

If there exists a blue nonterminal that cannot be merged with any red nonterminal (meaning that no equivalent nonterminal is found), promote one of the shallowest such blue nonterminals to red (ties are broken at random).

Otherwise (if no blue nonterminal can be promoted), perform the red–blue merge with the highest score. The score here is based on the fractional number of occurrences of the corresponding nonterminals.
Then, some white nonterminals are moved to the frontier as stated before.
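The loop described above can be sketched as follows. The helper functions are placeholders for the stochastic equivalence test, the count-based merge score and the promotion order, and the white-set bookkeeping is only indicated in a comment:

```python
def blue_fringe(initial_blue, equivalent, score, promote_key):
    """Sketch of the adapted Blue-Fringe loop of Section 4.3.1.

    equivalent(r, b): the stochastic equivalence test between a red and a
    blue nonterminal; score(r, b): the merge score based on fractional
    counts; promote_key(b): orders promotion candidates (shallowest first).
    White-set handling and the grammar update on each merge are omitted."""
    red, blue = set(), set(initial_blue)
    merges = []
    while blue:
        mergeable = [(score(r, b), r, b)
                     for b in blue for r in red if equivalent(r, b)]
        stuck = [b for b in blue
                 if not any(equivalent(r, b) for r in red)]
        if stuck:
            b = min(stuck, key=promote_key)   # promote a shallowest blue
            blue.remove(b)
            red.add(b)
        else:
            _, r, b = max(mergeable)          # best-scoring red-blue merge
            blue.remove(b)
            merges.append((b, r))
        # here, white nonterminals whose productions now contain only
        # lexical content and red nonterminals would move into `blue`
    return red, merges
```

With an empty red set every blue nonterminal is unmergeable, so the first iteration always promotes one nonterminal to red, exactly as in the string-language version of the algorithm.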
4.3.2 k-medoids clustering
The k-medoids algorithm is a clustering algorithm similar to the well-known k-means, but with the particularity that the centre of each cluster (the medoid) is a point in the data, which is especially relevant in our case since the representative of each cluster must be an existing nonterminal in the grammar. Unlike the Blue-Fringe algorithm, which attains a different number of final nonterminals depending on the confidence level, the parameter set a priori in this case is the number of final clusters (i.e. nonterminals) k. Note that when following this merging approach, the dissimilarity between nonterminals in Equation (2) is used to compute the required distances.
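A minimal k-medoids sketch follows, assuming the dissimilarity of Equation (2) is supplied as the distance function; the classical PAM-style swap search is simplified here to an alternating assign/update loop:

```python
import random

def k_medoids(points, k, dist, iterations=20, seed=0):
    """Minimal k-medoids sketch: alternate between assigning points to
    their nearest medoid and recomputing each cluster's medoid as the
    member with minimal total distance to the rest of the cluster."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    for _ in range(iterations):
        # assign every point to its nearest medoid
        clusters = {m: [] for m in medoids}
        for p in points:
            clusters[min(medoids, key=lambda m: dist(p, m))].append(p)
        # each cluster's new medoid minimises the total intra-cluster distance
        new_medoids = [min(ms, key=lambda c: sum(dist(c, q) for q in ms))
                       for ms in clusters.values() if ms]
        if set(new_medoids) == set(medoids):
            break   # assignment is stable: the clustering has converged
        medoids = new_medoids
    return medoids
```

In the experiments, `dist` would be the dissimilarity of Equation (2) evaluated on nonterminal counts, the algorithm would be run over the most frequent nonterminals only, and the remaining nonterminals would be attached to their nearest medoid afterwards.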
5 Experimental setup
We trained and evaluated our grammar induction procedure on the English–Spanish language pair using a small fraction of the EMEA corpus.^5 Table 2 provides additional information about the corpora used in the experiments.

^5 http://opus.nlpl.eu/EMEA.php
In order to have a manageable initial set of nonterminals and productions and make the problem computationally affordable, we limited the sentences included in the training corpus to a maximum of 20 words. In addition, instead of using the words themselves when defining the initial set of nonterminals, we used word classes. In particular, we used 10 word classes obtained by running mkcls^6 [och99] for 5 iterations. As a result, the initial set contains 852,423 nonterminals and the number of productions, not including those involving the initial nonterminal, is 2,055,902.

^6 http://www.statmt.org/moses/giza/mkcls.html
Corpus  # Sentences  # Words (en/es) 
train  73,372  519,763 / 556,453 
dev  2,000  22,410 / 25,219 
test  3,000  33,281 / 37,492 
All the experiments were carried out with the free/opensource SMT system Moses
[koehn2007], release 2.1.1. GIZA++ [ochney2003] was used for computing word alignments. KenLM was used to train a 5-gram language model on a monolingual corpus made of Europarl v7,^7 News Commentary v8^8 and the EMEA sentences in the training corpus; in total, the corpus used for training the language model consists of 3,423,702 Spanish sentences. The weights of the different feature functions were optimised by means of minimum error rate training [och2003]. The parallel corpora were tokenised and truecased before training, as were the development and test sets.

^7 http://www.statmt.org/wmt13/training-monolingual-europarl-v7.tgz
^8 http://www.statmt.org/wmt13/training-monolingual-nc-v8.tgz

6 Results
Table 3 reports the BLEU scores obtained on the test set when running the Blue-Fringe algorithm for different values of the confidence level. The number of final nonterminals in the inferred grammar is also reported. The best results are obtained with . The performance of the baseline hierarchical phrase-based system [chiang2005] is 0.5818. The difference in performance is statistically significant for the figures in bold according to paired bootstrap resampling [koehn04:_statis] with .
# final nonterminals  BLEU  
4,434  0.5838  
2,346  0.5868  
1,606  0.5859  
1,261  0.5855  
1,074  0.5866 
Table 4 shows the BLEU scores obtained on the test set when the k-medoids clustering algorithm is run over the most frequent nonterminals (# clustered nonterminals) to obtain a fixed number of clusters, that is, of nonterminals in the inferred grammar; the remaining nonterminals are then added to the nearest cluster after the algorithm finishes. The table reports results for 2, 3 and 4 clusters; we also tried larger numbers of clusters, but these values yielded the best results. The difference in performance with the baseline is statistically significant for all the figures reported in the table according to paired bootstrap resampling [koehn04:_statis] with .
The results obtained when the number of nonterminals over which k-medoids is run is set to 250 and the number of clusters is set to 3 are better than those obtained with the Blue-Fringe algorithm and better than the results achieved by the baseline. k-medoids allows us to obtain a 3% relative improvement in BLEU with just three nonterminals, in contrast with the more than one thousand nonterminals obtained by the Blue-Fringe algorithm.
# clustered  # final nonterminals  
nonterminals  2  3  4 
125  0.5975  0.5991  0.5952 
250  0.5986  0.6000  0.5942 
500  0.5946  0.5972  0.5936 
1000  0.5950  0.5998  0.5939 
2500  0.5985  0.5984  0.5964 
5000  0.5947  0.5991  0.5947 
7 Conclusions
This work extends the well-known algorithm for rule extraction in hierarchical statistical machine translation originally proposed by ChiangCL2007. Our proposal allows for more than one nonterminal in the resulting synchronous context-free grammar, thus incorporating a specialisation of the resulting nonterminals. Our method works by initially creating a different nonterminal for every possible gap when extracting the initial set of rules; this may easily result in millions of nonterminals, which are then merged following a novel equivalence criterion for nonterminals. Two merging strategies are proposed: one inspired by the Blue-Fringe algorithm that joins nonterminals on a one-by-one basis, and another that performs k-medoids clustering over a reduced set of nonterminals. A statistically significant improvement in BLEU with respect to the original method is obtained with both merging criteria.
For the experiments we used a small parallel corpus and restricted the length of the parallel sentences in the training corpus to 20 words. This was necessary in order to be able to run the merging algorithms in reasonable time. Recall that, in contrast to ChiangCL2007, we do not restrict the length of the phrase pairs, so as to be able to reproduce the parallel sentences in the training corpus; otherwise, long-range reorderings happening near the root of the parse tree of a sentence would not be possible. We tried different methods for filtering the rule table before applying the approach described in this paper, so as to be able to use larger corpora and longer sentences, but without success. A deeper exploration of potential optimisations is necessary.
Acknowledgements
Work supported by the Spanish government through project EFFORTUNE (TIN201569632R). The authors thank Mikel L. Forcada for his helpful comments.