Another way to encourage the beam decoder to produce certain words in the output is to explicitly reward n-grams provided by an SMT system Stahlberg et al. (2017) or language model Gulcehre et al. (2017), or to modify the vocabulary distribution of the decoder with suggestions from a terminology Chatterjee et al. (2017). While providing lexical guidance to the decoder, these methods do not strictly enforce a terminology. Strict enforcement is a requisite, however, for companies wanting to ensure that brand-related information is rendered correctly and consistently when translating web content or manuals, and it is often more important than translation quality alone. Although domain adaptation and guided decoding can help to reduce errors in these use cases, they do not provide reliable solutions.
Another recent line of work strictly enforces a given set of words in the output Anderson et al. (2017); Hokamp and Liu (2017); Crego et al. (2016). Anderson et al. address the task of image captioning with constrained beam search where constraints are given by image tags and constraint permutations are encoded in a finite-state acceptor (FSA). Hokamp and Liu propose grid beam search to enforce target-side constraints for domain adaptation via terminology. However, since there is no correspondence between constraints and the source words they cover, correct constraint placement is not guaranteed and the corresponding source words may be translated more than once. Crego et al. replace entities with special tags that remain unchanged during translation and are replaced in a post-processing step using attention weights. Given good alignments, this method can translate entities correctly but it requires training data with entity tags and excludes the entities from model scoring.
We address decoding with constraints to produce translations that respect the terminologies of corporate customers while maintaining the high quality of unconstrained translations. To this end, we apply the constrained beam search of Anderson et al. to machine translation and propose to employ alignment information between target-side constraints and their corresponding source words. The lack of explicit alignments in NMT systems poses an extra challenge compared to statistical MT where alignments are given by translation rules. We address the problem of constraint placement by expanding constraints when the NMT model is attending to the correct source span. We also reduce output duplication by masking covered constraints in the NMT attention model.
2 Constrained Beam Search
A naive approach to decoding with constraints would be to use a large beam size and select from the set of complete hypotheses the best that satisfies all constraints. However, this is infeasible in practice because it would require searching a potentially very large space to ensure that even hypotheses with low model score due to the inclusion of a constraint would be part of the set of outputs. A better strategy is to force the decoder to produce hypotheses that satisfy the constraints regardless of their score and thus guide the decoder into the right area of the search space. We follow Anderson et al. (2017) in organizing our beam search into multiple stacks corresponding to subsets of satisfied constraints as defined by FSA states.
2.1 Finite-state Acceptors for Constraints
Before decoding, we build an FSA defining the constrained target language for an input sentence. It contains all permutations of constraints interleaved with loops over the remaining vocabulary.
Phrase Constraints: Constraints consisting of multiple tokens are encoded by one state per token. We refer to states within a phrase as intermediate states and restrict their outgoing vocabulary to the next token in the phrase.
Alternative Constraints: Synonyms of constraints can be defined as alternatives and encoded as different arcs connecting the same states. When alternatives consist of multiple tokens, the alternative paths will contain intermediate states.
Figure 1 shows an FSA with constraints $C_1$ and $C_2$, where $C_1$ is a phrase (yielding intermediate states) and $C_2$ consists of two single-token alternatives. Both permutations $C_1 C_2$ and $C_2 C_1$ lead to the final state, in which both constraints are satisfied.
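The acceptor described above can be sketched without materializing every arc. The following is a minimal illustration, not the implementation used in the experiments: the state encoding (a set of satisfied constraint ids plus an optional position inside a phrase) and all function names are assumptions made for this example, and it assumes that no two alternatives start with the same token.

```python
# Hypothetical state encoding: (ids of satisfied constraints, position inside a phrase or None).
START = (frozenset(), None)


def _advance(done, cid, aid, pos, alt):
    """State reached after consuming `pos` tokens of alternative `aid` of constraint `cid`."""
    if pos == len(alt):
        return (done | {cid}, None)      # constraint completed
    return (done, (cid, aid, pos))       # intermediate state inside a phrase


def constraint_arcs(state, constraints):
    """Map each constraint token that may leave `state` to its target state."""
    done, partial = state
    if partial is not None:              # intermediate state: only the next phrase token
        cid, aid, pos = partial
        alt = constraints[cid][aid]
        return {alt[pos]: _advance(done, cid, aid, pos + 1, alt)}
    arcs = {}
    for cid, alternatives in enumerate(constraints):
        if cid in done:
            continue
        for aid, alt in enumerate(alternatives):
            arcs[alt[0]] = _advance(done, cid, aid, 1, alt)
    return arcs


def run(tokens, constraints):
    """Follow `tokens` through the acceptor; return the reached state or None if rejected."""
    state = START
    for tok in tokens:
        arcs = constraint_arcs(state, constraints)
        if tok in arcs:
            state = arcs[tok]
        elif state[1] is not None:       # no vocabulary loop inside a phrase
            return None
    return state


def is_final(state, constraints):
    return state is not None and state[1] is None and len(state[0]) == len(constraints)


def reachable_states(constraints):
    seen, todo = set(), [START]
    while todo:
        state = todo.pop()
        if state not in seen:
            seen.add(state)
            todo.extend(constraint_arcs(state, constraints).values())
    return seen


# One phrase constraint and one constraint with two single-token alternatives.
CONSTRAINTS = [[("New", "York")], [("cup",), ("trophy",)]]
```

With these toy constraints, both orders of the two constraints reach the final state, and the phrase contributes one intermediate state per permutation.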
2.2 Multi-Stack Decoding
When extending a hypothesis to satisfy a constraint which is not among the top-$k$ vocabulary items in the current beam, the overall likelihood may drop and the hypothesis may be pruned in subsequent steps. To prevent this, the extended hypothesis is placed on a new stack along with other hypotheses that satisfy the same set of constraints. Each stack maps to an acceptor state, which helps to keep track of the permitted extensions for hypotheses on this stack. The stack where a hypothesis should be placed is found by following the appropriate arc leaving the current acceptor state. The stack mapping to the final state is used to generate complete hypotheses. At each time step, all stacks are pruned to the beam size, and therefore the actual beam size for constrained decoding depends on the number of acceptor states.
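A single decoding step with one stack per acceptor state can be sketched as follows. This is a simplified illustration restricted to single-token constraints, so that an acceptor state is just the set of satisfied constraints. The function names, the toy distribution standing in for the NMT model, and the data layout are assumptions made for this example.

```python
import heapq
import math
from collections import defaultdict


def toy_log_probs(tokens):
    """Hypothetical stand-in for the NMT model: a fixed next-token distribution."""
    return {"a": math.log(0.6), "b": math.log(0.3), "rare": math.log(0.1)}


def decode_step(stacks, constraints, log_probs, beam_size):
    """Expand every hypothesis once, then prune each stack to `beam_size`.

    `stacks` maps an acceptor state (frozenset of satisfied single-token
    constraints) to a list of (score, tokens) hypotheses.
    """
    new_stacks = defaultdict(list)
    for state, hyps in stacks.items():
        for score, tokens in hyps:
            dist = log_probs(tokens)
            top = heapq.nlargest(beam_size, dist, key=dist.get)
            # Unsatisfied constraints are expanded even if they are not in the top-k.
            forced = [c for c in constraints if c not in state]
            for tok in set(top) | set(forced):
                # Following a constraint arc moves the hypothesis to another stack.
                nxt = state | {tok} if tok in forced else state
                new_stacks[nxt].append((score + dist[tok], tokens + [tok]))
    return {state: heapq.nlargest(beam_size, hyps, key=lambda h: h[0])
            for state, hyps in new_stacks.items()}
```

Even with a beam size of 1, the hypothesis containing the low-probability constraint token survives here, because it competes only with hypotheses on its own stack.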
2.3 Decoding with Attentions
Since an acceptor encoding $C$ single-token constraints has $2^C$ states, the constrained search of Anderson et al. (2017) can be inefficient for large numbers of constraints. In particular, all unsatisfied constraints are expanded at each time step, which increases decoding complexity from $O(Nk)$ for normal beam search, with output length $N$ and beam size $k$, to $O(Nk2^C)$. Hokamp and Liu (2017) organize their grid beam search into beams that group hypotheses with the same number of constraints, thus their decoding time is $O(NkC)$. However, this means that different constraints will compete for completion of the same hypothesis and their placement is determined locally. We assume that a target-side constraint can come with an aligned source phrase, which is encoded as a span in the source sentence and stored with the acceptor arc label as a half-open interval of source positions, for example Budget [4,5).
Because the attention weights in attention-based decoders function as soft alignments from the target to the source sentence Alkhouli and Ney (2017), we use them to decide at which position a constraint should be inserted in the output. At each time step in a hypothesis, we determine the source position with the maximum attention. If it falls into a constrained source span and this span matches an outgoing arc in the current acceptor state, we extend the current hypothesis with the arc label. Thus, the outgoing arcs in non-intermediate states are active or inactive depending on the current attentions. This reduces the complexity from $O(Nk2^C)$ to $O(NkC)$ in the number of constraints $C$ by ignoring all but one constraint permutation and, in practice, disabling vocabulary loops saves extra time.
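The activation rule for constraint arcs can be illustrated with a short sketch. The function name and the representation of arcs as (label, span) pairs are assumptions for this example; spans are half-open intervals of source positions, as in the constraints of the example table.

```python
def active_constraint(attention, arcs):
    """Return the label of the constraint arc whose source span contains the
    position with maximum attention, or None if no arc is active.

    `attention` holds one weight per source position for the current time step;
    `arcs` lists the outgoing constraint arcs as (label, (start, end)) pairs.
    """
    peak = max(range(len(attention)), key=attention.__getitem__)
    for label, (start, end) in arcs:
        if start <= peak < end:
            return label
    return None
```

For the constraints cup [1,2) and chance [5,6), the decoder would be extended with cup only while the attention peaks on source position 1, and with chance only while it peaks on position 5.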
State-specific Attention Mechanism: Once a constraint has been completed, we need to ensure that its source span will not be translated again. We force the decoder to respect covered constraints by masking their spans during all future expansions of the hypothesis. This is done by zeroing out the attention weights on covered positions to exclude them from the context vector computed by the attention mechanism.
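The masking step can be sketched in a few lines. This is an illustration under stated assumptions: the function name and the plain-list representation of encoder states are hypothetical, and the optional renormalization of the remaining weights is an addition for this example that the text does not prescribe.

```python
def masked_context(attention, encoder_states, covered_spans, renormalize=False):
    """Zero the attention on covered source spans and recompute the context vector.

    `attention` has one weight per source position, `encoder_states` one vector
    per source position, and `covered_spans` lists half-open (start, end) spans
    of constraints that have already been completed.
    """
    masked = list(attention)
    for start, end in covered_spans:
        for j in range(start, end):
            masked[j] = 0.0
    if renormalize:  # assumption: not part of the description above
        total = sum(masked)
        if total > 0:
            masked = [w / total for w in masked]
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(masked, encoder_states))
               for d in range(dim)]
    return masked, context
```

Because the covered positions contribute nothing to the context vector, the decoder receives no signal from source words whose translation is already part of the hypothesis.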
Implications: Constrained decoding with aligned source phrases relies on the quality of the source-target pairs. Over- and under-translation can occur as a result of incomplete source or target phrases in the terminology.
Special Cases: Monitoring the source position with the maximum attention is a relatively strict criterion for deciding where a constraint should be placed in the output. It turns out that, depending on the language pair, the decoder may produce translations of neighbouring source tokens when attending to a constrained source span (for example, an article before a noun when the constrained source span includes just the noun). The strict requirement of only producing constraint tokens can be relaxed to accommodate such cases, for example by allowing extra tokens before or after a constraint while the decoder attends to its source span.
Conversely, the decoder may never place the maximum attention on a constraint span which can lead to empty translations. Relaxing this requirement using thresholding on the attention weights to determine positions with secondary attention can help in those cases.
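One possible reading of the thresholding relaxation is sketched below. The threshold value, the function names, and the choice to visit positions in order of decreasing attention are assumptions for this example rather than details given in the text.

```python
def attended_positions(attention, threshold):
    """Position with maximum attention, followed by positions with secondary
    attention, i.e. those whose weight is at least `threshold`."""
    ranked = sorted(range(len(attention)), key=lambda j: attention[j], reverse=True)
    return [j for rank, j in enumerate(ranked)
            if rank == 0 or attention[j] >= threshold]


def relaxed_constraint(attention, arcs, threshold=0.2):
    """Return the first constraint arc whose source span contains an attended
    position, or None. `arcs` is a list of (label, (start, end)) pairs."""
    for j in attended_positions(attention, threshold):
        for label, (start, end) in arcs:
            if start <= j < end:
                return label
    return None
```

With attention weights [0.5, 0.3, 0.1, 0.1] and a constraint on span [1,2), the strict criterion would never activate the constraint, whereas a threshold of 0.2 does.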
3 Experimental Setup
We build attention-based neural machine translation models Bahdanau et al. (2015) using the Blocks implementation of van Merriënboer et al. (2015) for English-German and English-Chinese translation in both directions. We combine three models per language pair as ensembles and further combine the NMT systems with n-grams extracted from SMT lattices using Lattice minimum Bayes-risk as described by Stahlberg et al. (2017), referred to as Lnmt. We decode with a beam size of 12 and length normalization Johnson et al. (2017) and back off to constrained decoding without attentions when decoding with attentions fails (this usually applies to less than 2% of the inputs). We report lowercase Bleu using mteval-v13.pl.
Our models are trained on the data provided for the 2017 Conference on Machine Translation (WMT17) Bojar et al. (2017). We tokenize and truecase the English-German data and apply compound splitting when the source language is German. The training data for the NMT systems is augmented with backtranslation data Sennrich et al. (2016). For English-Chinese, we tokenize and lowercase the data. We apply byte-pair encoding Sennrich et al. (2017) to all data.
3.2 Terminology Constraints
We run experiments with two types of constraints to evaluate our constrained decoder.
Gold Constraints: For each input sentence, we extract up to two tokens from the reference which were not produced by the baseline system, favouring rarer words. This aims at testing the performance in a setup where users may provide corrections to the NMT output which are to be incorporated into the translation. These reference tokens may consist of one or more subwords. Similarly, we extract phrases of up to five subwords surrounding a reference token missing from the baseline output. We do not have access to aligned source words for gold constraints.
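The selection of single-token gold constraints can be sketched as follows. The function name, the frequency table used to rank rarer words first, and the toy sentences are assumptions for this example; the subword and phrase variants described above are not covered.

```python
def gold_constraints(reference, baseline, freq, max_constraints=2):
    """Pick up to `max_constraints` reference tokens missing from the baseline
    output, preferring tokens with lower corpus frequency.

    `freq` is a hypothetical token-frequency table; unseen tokens count as 0.
    """
    produced = set(baseline)
    missing = []
    for tok in reference:
        if tok not in produced and tok not in missing:
            missing.append(tok)
    missing.sort(key=lambda t: freq.get(t, 0))  # rarer words first
    return missing[:max_constraints]


# Made-up reference and baseline output for illustration.
reference = "the cup was the only chance to win something .".split()
baseline = "the trophy was the only way to win something .".split()
constraints = gold_constraints(reference, baseline, {"cup": 50, "chance": 200})
```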
Dictionary Constraints: We automatically extract bilingual dictionary entries using terms and phrases from the reference translations as candidates in order to ensure that the entries are relevant for the inputs. In a real setup, the dictionaries would be provided by customers and would be expected to contain correct translations without ambiguity. We apply a filter of English stop words and verbs to the candidates and look them up in a pruned phrase table to find likely pairs, resulting in entries as shown below (our dictionaries are available on request):
|The Wall Street Journal||The Wall Street Journal|
|Dead Sea||Tote Meer / Toten Meer|
For evaluation purposes, we ensure that dictionary entries match the reference when applying them to an input sentence.
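Applying a dictionary to an input sentence, including the check against the reference, might look like the sketch below. The dictionary layout (source token tuples mapped to lists of target alternatives), the function names, and the toy reference are assumptions for this example.

```python
def contains(seq, sub):
    """True if the token list `sub` occurs contiguously in the token list `seq`."""
    n = len(sub)
    return any(seq[k:k + n] == sub for k in range(len(seq) - n + 1))


def apply_dictionary(source, reference, dictionary, max_matches=2):
    """Return up to `max_matches` constraints as (target phrase, (start, end)).

    A dictionary entry is used only if its source side occurs in `source` and
    one of its target alternatives occurs in `reference`.
    """
    matches = []
    for i in range(len(source)):
        for j in range(i + 1, len(source) + 1):
            for target in dictionary.get(tuple(source[i:j]), []):
                if contains(reference, list(target)):
                    matches.append((" ".join(target), (i, j)))
    return matches[:max_matches]


source = ("And it often costs over a hundred dollars to obtain "
          "the required identity card .").split()
# Made-up reference for illustration.
reference = "um den erforderlichen Ausweis zu erhalten .".split()
dictionary = {("identity", "card"): [("Ausweis",)]}
matches = apply_dictionary(source, reference, dictionary)
```

The resulting span is the half-open interval of source positions covered by the dictionary match, which is the information stored with the acceptor arc.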
4.1 Results with Gold Constraints
|eng-ger-wmt17||Example 1||Example 2|
|Source||It already has the budget …||And it often costs over a hundred dollars to obtain the required identity card.|
|Constraints||Budget [4,5)||Ausweis [12,14)|
|Lnmt||Es hat bereits den Haushalt…||Und es kostet oft mehr als hundert Dollar, um die erforderliche Personalausweis zu erhalten.|
|+ dictionary (v1)||Das Budget hat bereits den Haushalt…||Und es kostet oft mehr als hundert Dollar, um den Ausweis zu erhalten, um die erforderliche Personalausweis zu erhalten.|
|+ dictionary (v2)||Es verfügt bereits über das Budget…||Und es kostet oft mehr als hundert Dollar, um den gewünschten Ausweis zu erhalten.|
|ger-eng-wmt17||Example 3||Example 4|
|Source||Der Pokal war die einzige Möglichkeit , etwas zu gewinnen .||Aber es ist keine typische Zeichensprache – sagt sie . Edmund hat einige Zeichen alleine erfunden .|
|Constraints||cup [1,2), chance [5,6)||sign / signs [13,14)|
|Lnmt||The trophy was the only way to win something.||But it’s not a typical sign language – says, Edmund invented some characters alone.|
|+ dictionary (v1)||The cup was the only way to get something to win a chance.||But it’s not a typical sign language – says, Edmund invented some characters alone.|
|+ dictionary (v2)||The cup was the only chance to win something.||But it is not a typical sign language – she says, Edmund invented some signs alone.|
Decoding with gold constraints yields large Bleu gains over Lnmt for all language pairs. However, the length ratio on the dev set increases significantly. Inspecting the output reveals that this is often caused by constraints being translated more than once, which can lead to whole passages being retranslated. Phrase constraints seem to integrate better into the output than single-token constraints, which may be due to the longer gold context being fed back to the NMT state.
4.2 Results with Dictionary Constraints
Decoding with up to two dictionary constraints per sentence yields gains of up to 3 Bleu. The gains are limited partly because we do not control whether Lnmt already produced the constraint tokens and because not all sentences have dictionary matches. The length ratios are better than in the gold experiments, which we attribute to our filtering of tokens such as verbs, which tend to influence the general word order more than nouns, for example.
Decoding with or without attentions yields similar Bleu scores overall and a consistent improvement for English-German. Note that decoding with attentions is sensitive to errors in the automatically extracted dictionary entries.
Output Duplication: The first three examples in Table 2 show English↔German translations where decoding without attentions has generated both the target side of the constraint and the translation preferred by the NMT system. When using the attentions, each constraint is only translated once.
Constraint Placement: The fourth example demonstrates the importance of tying constraints to source words. Decoding without attentions fails to translate Zeichen as signs because the alternative sign already appears in the translation of Zeichensprache as sign language. When using the attentions, signs is generated at the correct position in the output.
4.3 Output length ratio and repetitions
To back up our hypothesis that increases in length ratio are related to output duplication, Table 1(a) column rep shows the number of repeated character 7-grams within a sentence of the dev set, ignoring stop words and overlapping n-grams. This confirms that constrained decoding with attentions reduces the number of repeated n-grams in the output. While this does not account for alignments to the source or capture duplicated translations with unrelated surface forms, it provides evidence that the outputs are not just shorter than for decoding without attentions but in fact contain fewer repetitions and likely fewer duplicated translations.
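The repetition count can be approximated with a short function. This is one possible reading of the measure, not the script used for the reported numbers: the function name, the way stop words are removed before joining tokens, and the rule of jumping past a repeated n-gram to avoid counting overlapping ones are assumptions for this example.

```python
def repeated_char_ngrams(tokens, n=7, stop_words=frozenset()):
    """Count character n-grams that occur again later in the same sentence.

    Stop words are dropped before the remaining tokens are joined with spaces.
    After a repeat is found, the scan jumps ahead by `n` characters so that
    overlapping n-grams of the same repeated passage are not counted again.
    """
    text = " ".join(t for t in tokens if t.lower() not in stop_words)
    seen = set()
    repeats = 0
    i = 0
    while i + n <= len(text):
        gram = text[i:i + n]
        if gram in seen:
            repeats += 1
            i += n
        else:
            seen.add(gram)
            i += 1
    return repeats
```

A sentence in which a constraint and its surrounding context are translated twice yields at least one repeated 7-gram under this count, while a sentence without duplicated passages yields none.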
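|Configuration||Bleu (up to 2 constraints)||Speed ratio (up to 2)||Bleu (up to 3 constraints)||Speed ratio (up to 3)||Bleu (up to 4 constraints)||Speed ratio (up to 4)|
|+ dict (v1)||28.2||0.20||28.4||0.14||28.5||0.11|
|+ dict (v2)||27.8||0.69||28.0||0.66||28.1||0.59|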
4.4 Comparison of decoding speeds
To evaluate the speed of constrained decoding with and without attentions, we decode newstest2017 on a single GPU using our English-German production system Iglesias et al. (2018), which in comparison to the systems described in Section 3 uses a beam size of 4 and an early pruning strategy similar to that described in Johnson et al. (2017), amongst other differences. About 89% of the sentences have at least one dictionary match and we allow up to two, three or four matches per sentence. Because the constraints result from dictionary application, the number of constraints per sentence varies and not all sentences contain the maximum number of constraints.
Table 3 reports Bleu and speed ratios for different decoding configurations. Rows two and three confirm that the reduced computational complexity of our approach yields faster decoding speeds than the approach of Anderson et al. (2017) while incurring a small decrease in Bleu. Moreover, it compares favourably for larger numbers of constraints per sentence: v2* is 3.5x faster than v1 for up to two constraints per sentence ($C=2$) and more than 5x faster for up to four ($C=4$). Relaxing the restrictions of decoding with attentions improves the Bleu scores but increases runtime. However, the slowest v2 configuration is still faster than v1. The optimal trade-off between quality and speed is likely to differ for each language pair.
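The relative speed-ups quoted above can be checked against the speed ratios in the table rows. The sketch below assumes that the rows labelled "+ dict (v1)" and "+ dict (v2)" correspond to the v1 and v2* configurations and that the columns alternate between Bleu and speed ratio for up to two, three and four constraints; the variable and function names are hypothetical.

```python
# Speed ratios copied from the table rows, keyed by the maximum number of constraints.
speed_ratio = {
    "v1": {2: 0.20, 3: 0.14, 4: 0.11},
    "v2": {2: 0.69, 3: 0.66, 4: 0.59},
}


def relative_speedup(max_constraints):
    """How many times faster decoding with attentions is than decoding without."""
    return speed_ratio["v2"][max_constraints] / speed_ratio["v1"][max_constraints]


for c in (2, 3, 4):
    print(f"up to {c} constraints: {relative_speedup(c):.2f}x")
```

Under these assumptions the ratios come out at roughly 3.5 for up to two constraints and above 5 for up to four, in line with the figures in the text.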
We have presented our approach to NMT decoding with terminology constraints using decoder attentions which enables reduced output duplication and better constraint placement compared to existing methods. Our results on four language pairs demonstrate that terminology constraints as provided by customers can be respected during NMT decoding while maintaining the overall translation quality. At the same time, empirical results confirm that our improvements in computational complexity translate into faster decoding speeds. Future work includes the application of our approach to more recent architectures such as Vaswani et al. (2017) which will involve extracting attentions from multiple decoding layers and attention heads.
This work was partially supported by U.K. EPSRC grant EP/L027623/1.
- Alkhouli and Ney (2017) Tamer Alkhouli and Hermann Ney. 2017. Biasing Attention-Based Recurrent Neural Networks Using External Alignment Information. In Proceedings of the Conference on Machine Translation (WMT), Volume 1: Research Papers. Association for Computational Linguistics, pages 108–117.
- Anderson et al. (2017) Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2017. Guided Open Vocabulary Image Captioning with Constrained Beam Search. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 936–945.
- Bahdanau et al. (2015) Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR 2015.
- Bojar et al. (2017) Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 Conference on Machine Translation (WMT17). In Proceedings of the Conference on Machine Translation (WMT), Volume 2: Shared Task Papers. Association for Computational Linguistics, pages 169–214.
- Chatterjee et al. (2017) Rajen Chatterjee, Matteo Negri, Marco Turchi, Marcello Federico, Lucia Specia, and Frédéric Blain. 2017. Guiding Neural Machine Translation Decoding with External Knowledge. In Proceedings of the Conference on Machine Translation (WMT), Volume 1: Research Papers. Association for Computational Linguistics, pages 157–168.
- Crego et al. (2016) Josep Crego, Jungi Kim, Guillaume Klein, Anabel Rebollo, Kathy Yang, Jean Senellart, Egor Akhanov, Patrice Brunelle, Aurélien Coquard, Yongchao Deng, Satoshi Enoue, Chiyo Geiss, Joshua Johanson, Ardas Khalsa, Raoum Khiari, Byeongil Ko, Catherine Kobus, Jean Lorieux, Leidiana Martins, Dang-Chuan Nguyen, Alexandra Priori, Thomas Riccardi, Natalia Segal, Christophe Servan, Cyril Tiquet, Bo Wang, Jin Yang, Dakun Zhang, Jing Zhou, and Peter Zoldan. 2016. SYSTRAN’s Pure Neural Machine Translation Systems. In Arxiv preprint.
- Gulcehre et al. (2017) Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, and Yoshua Bengio. 2017. On integrating a language model into neural machine translation. Computer Speech & Language 45:137–148.
- Hokamp and Liu (2017) Chris Hokamp and Qun Liu. 2017. Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 1535–1546.
- Iglesias et al. (2018) Gonzalo Iglesias, William Tambellini, Adrià de Gispert, Eva Hasler, and Bill Byrne. 2018. Accelerating NMT Batched Beam Decoding with LMBR Posteriors for Deployment. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Industry Track). Association for Computational Linguistics.
- Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics 5:339–351.
- Luong and Manning (2015) Minh-Thang Luong and Christopher D Manning. 2015. Stanford Neural Machine Translation Systems for Spoken Language Domains. In Proceedings of the 12th International Workshop on Spoken Language Translation. pages 76–79.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. pages 86–96.
- Sennrich et al. (2017) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2017. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 1715–1725.
- Stahlberg et al. (2017) Felix Stahlberg, Adrià de Gispert, Eva Hasler, and Bill Byrne. 2017. Neural Machine Translation by Minimising the Bayes-risk with Respect to Syntactic Translation Lattices. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, pages 362–368.
- van Merriënboer et al. (2015) Bart van Merriënboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. 2015. Blocks and Fuel: Frameworks for deep learning. In Proceedings of ICLR 2015. arXiv preprint.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, Curran Associates, Inc., pages 5998–6008.