1 Many-to-One Translations
Languages differ in what meaning distinctions they must mark explicitly. As such, translations risk mapping from a form in one language to a more ambiguous form in another. For example, the definite (1) and indefinite (2) both translate (under Fairseq and Google Translate) to (3) in French, which is ambiguous in definiteness.
(1) [English sentence with a definite noun phrase]
(2) [English sentence with an indefinite noun phrase]
(3) [French translation of both (1) and (2), ambiguous in definiteness]

Survey
To evaluate the nature of this problem, we explored a corpus of 500 pairs of distinct English sentences which map to a single German sentence (German being the evaluation language in section 2.3). The corpus was generated by selecting short sentences from the Brown corpus (Kučera and Francis, 1967), translating them to German, and taking the best two candidate translations back into English, keeping a pair when those two candidates themselves translate to a single German sentence. We identify a number of common causes for the many-to-one maps. Two frequent types of verbal distinction lost when translating to German are tense (54 pairs, e.g. “…others {were, have been} introduced.”) and modality (16 pairs, e.g. “…prospects for this year {could, might} be better.”), where German “können” can express both epistemic and ability modality, distinguished in English with “might” and “could” respectively. Owing to English’s large vocabulary, lexical differences in verb (31 pairs, e.g. “arise” vs. “emerge”), noun (56 pairs, e.g. “mystery” vs. “secret”), adjective (47 pairs, e.g. “unaffected” vs. “untouched”) or deictic/pronoun (32 pairs, usually “this” vs. “that”) are also common. A large number of the pairs differ instead either orthographically or in other ways that do not correspond to a clear semantic distinction (e.g. “She had {taken, made} a decision.”).
Our approach
While languages differ in what distinctions they are required to express, all are usually capable of expressing any given distinction when desired. As such, meaning loss of the kind discussed above is, in theory, avoidable. To this end, we propose a method to reduce meaning loss by applying the Rational Speech Acts (RSA) model of an informative speaker to translation. RSA has been used to model natural language pragmatics (Goodman and Frank, 2016), and recent work has shown its applicability to image captioning (Andreas and Klein, 2016; Vedantam et al., 2017; Mao et al., 2016), another sequence-generation NLP task. Here we use RSA to define a translator, in terms of a pretrained neural translation model, which reduces many-to-one mappings and consequently meaning loss. We introduce a strategy for performing inference efficiently with this model in the setting of translation, and show gains in cycle-consistency as a result. (Formally, say that a pair of functions $f: A \to B$ and $g: B \to A$ is cycle-consistent if $g \circ f$ is the identity function on $A$. If $f$ is not one-to-one, then $(f, g)$ is not cycle-consistent. Note however that when $A$ and $B$ are infinite, the converse does not hold: even if $f$ and $g$ are both one-to-one, $(f, g)$ need not be cycle-consistent, since one-to-one maps between infinite sets are not necessarily bijective.) Moreover, we obtain improvements in translation quality (BLEU score), demonstrating that the goal of meaning preservation directly yields improved translations.
English input | $S_0$ translation | $S_1$ translation
---|---|---
He is wearing glasses. | Er trägt eine Brille. | Er trägt jetzt eine Brille.
He wears glasses. | Er trägt eine Brille. | Er hat eine Brille.
2 Meaning Preservation as Informativity
In the RSA framework, the informative speaker model $S_1$ is given a state $s$ and chooses an utterance $u$ to convey $s$ to $S_1$'s own model of a listener $L$. For translation, the state space $W$ is a set of source language sentences (sequences of words in the source language), while the utterance space $U$ is a set of target language sentences. $S_1$'s informative behavior discourages the many-to-one maps that a non-informative translation model might allow.
Model
BiLSTMs with attention (Bahdanau et al., 2014), and more recently CNNs (Gehring et al., 2016) and entirely attention-based models (Vaswani et al., 2017), constitute the state-of-the-art architectures in neural machine translation. All of these systems, once trained end-to-end on aligned data, can be viewed as a conditional distribution $S_0^{wd}(wd \mid s, c)$, for a word $wd$ in the target language, a source language sentence $s$, and a partial sentence $c$ in the target language. (We use $S_0^{wd}$ and $S_0$ respectively to distinguish word-level and sentence-level speaker models.) $S_0$ yields a distribution over full sentences, built up from word-level decisions (Python list indexing conventions are used, and “+” means concatenation of a list with an element or another list):

(4)  $S_0(u \mid s, c) = \prod_{i=0}^{|u|-1} S_0^{wd}(u[i] \mid s, c + u[:i])$

$S_0(\cdot \mid s, c)$ returns a distribution over continuations of $c$ into full target language sentences. (In what follows, we omit $c$ when it is empty, so that $S_0(u \mid s)$ is the probability of sentence $u$ given $s$.) To obtain a sentence from $S_0$ given a source language sentence $s$, one can greedily choose the highest probability word from $S_0^{wd}$ at each timestep, or explore a beam of possible candidates. We implement $S_0$ (in terms of which all our other models are defined) using Fairseq's publicly available pretrained Transformer models (https://github.com/pytorch/fairseq) for English-German and English-French, and for German-English we train a CNN using Fairseq.

2.1 Explicit Distractors
We first describe a model for the simple case where a source language sentence needs to be distinguished from a presupplied distractor (as in the pair shown in (1) and (2)). We use this model as a stepping stone to one which requires an input sentence in the source language only, and no distractors. We begin by defining a listener $L$, which receives a target language sentence $u$ and infers which sentence $s$, from a presupplied set $W$ (such as the pair (1) and (2)), would have resulted in the pretrained neural model $S_0$ producing $u$:
(5)  $L(s \mid u) \propto S_0(u \mid s)$, normalized over the distractor set $W$
This allows the informative speaker model $S_1$ to be defined in terms of $L$, where $U$ is the set of all possible target language sentences and $\alpha$ is a hyperparameter of $S_1$ controlling the strength of the informativity preference:
(6)  $S_1(u \mid s) \propto S_0(u \mid s)\, L(s \mid u)^{\alpha}$, normalized over $u \in U$
The key property of this model is that, for a distractor set $W = \{s, s'\}$, when translating $s$, $S_1$ prefers translations of $s$ that are unlikely to be good translations of $s'$. So for pairs like (1) and (2), $S_1$ is compelled to produce a translation for the former that reflects its difference from the latter, and vice versa.
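To make (6) concrete, the following sketch reranks a fixed set of candidate translations (standing in for $U$, e.g. the output of a beam search from $S_0$) by the unnormalized speaker score. The wrapper `s0_log_prob(u, s)`, assumed to return $\log S_0(u \mid s)$ from a pretrained model, is a hypothetical placeholder, not part of the paper's implementation.

```python
import math

def listener_log_prob(s, u, distractors, s0_log_prob):
    """Listener of (5): log L(s | u), normalized over the distractor set W."""
    scores = {w: s0_log_prob(u, w) for w in distractors}      # log S0(u | w)
    log_norm = math.log(sum(math.exp(v) for v in scores.values()))
    return scores[s] - log_norm

def s1_rerank(s, distractors, candidates, s0_log_prob, alpha=1.0):
    """Speaker of (6), restricted to a candidate set standing in for U:
    rank each u by log S0(u | s) + alpha * log L(s | u)."""
    scored = [(s0_log_prob(u, s)
               + alpha * listener_log_prob(s, u, distractors, s0_log_prob), u)
              for u in candidates]
    return [u for _, u in sorted(scored, reverse=True)]
```

Since only the ranking of candidates matters, the normalization over $U$ in (6) can be dropped in such a sketch.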
Inference
Since $U$ is an infinite set, exactly computing the most probable utterance under $S_1$ is intractable, because of the normalizing term in (6). Andreas and Klein (2016) and Mao et al. (2016) perform approximate inference by sampling the subset of $U$ produced by a beam search from $S_0$. Vedantam et al. (2017) and Cohn-Gordon et al. (2018) employ a different method, using an incremental model $S_1^{inc}$ as an approximation of $S_1$ on which inference can be tractably performed.
$S_1^{inc}$ considers informativity not over the whole set $U$ of utterances, but instead at each decision of the next word in the target language sentence. For this reason, the incremental method avoids the lack of beam diversity encountered when subsampling from $U$, which becomes especially severe when producing long sequences, as is often the case in translation. $S_1^{inc}$ is defined as the product of informative word-level decisions, specified by $S_1^{wd}$ and in turn $L^{wd}$, which are defined analogously to (6) and (5):
(7)  $S_1^{inc}(u \mid s) = \prod_{i=0}^{|u|-1} S_1^{wd}(u[i] \mid s, u[:i])$

(8)  $S_1^{wd}(wd \mid s, c) \propto S_0^{wd}(wd \mid s, c)\, L^{wd}(s \mid wd, c)^{\alpha}$, normalized over the target language vocabulary

(9)  $L^{wd}(s \mid wd, c) \propto S_0^{wd}(wd \mid s, c)$, normalized over $s \in W$
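The word-level decision in (8) and (9) can be sketched as follows. The function `s0_next_word_log_probs(src, prefix)`, assumed to return a dictionary mapping each vocabulary item to $\log S_0^{wd}(wd \mid src, prefix)$, is a hypothetical wrapper around the pretrained model.

```python
import math

def listener_wd_log_prob(s, wd, c, distractors, s0_next_word_log_probs):
    """Word-level listener of (9): log L_wd(s | wd, c), normalized over W."""
    scores = {w: s0_next_word_log_probs(w, c)[wd] for w in distractors}
    log_norm = math.log(sum(math.exp(v) for v in scores.values()))
    return scores[s] - log_norm

def s1_wd_step(s, c, distractors, s0_next_word_log_probs, alpha=1.0):
    """Word-level speaker of (8): pick the next word wd maximizing
    log S0_wd(wd | s, c) + alpha * log L_wd(s | wd, c)."""
    base = s0_next_word_log_probs(s, c)
    return max(base,
               key=lambda wd: base[wd]
               + alpha * listener_wd_log_prob(s, wd, c, distractors,
                                              s0_next_word_log_probs))
```

Repeating `s1_wd_step` and appending the chosen word to `c` until an end-of-sentence symbol is produced unrolls the incremental speaker of (7).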
Examples
The pair of sentences shown in the table above (“He is wearing glasses.” vs. “He wears glasses.”) is collapsed to a single German sentence by $S_0$, but receives two distinct translations under the informative model.
2.2 Avoiding Explicit Distractors
While $S_1$ can disambiguate between pairs of sentences, it has two shortcomings. First, it requires one (or more) distractors to be provided, so translation is no longer fully automatic. Second, because the distractor set $W$ consists of only a pair (or finite set) of sentences, $S_1$ only cares about being informative up to the goal of distinguishing between these sentences. Intuitively, total meaning preservation is achieved by a translation which distinguishes the source sentence from every other sentence in the source language that differs in meaning.
Both of these problems can be addressed by introducing a new “cyclic” speaker $S_1^{cyc}$, which reasons not about the listener $L$ but about a pretrained translation model from the target language back to the source language, written $S_0^{t \to s}$:
(10)  $S_1^{cyc}(u \mid s) \propto S_0(u \mid s)\, S_0^{t \to s}(s \mid u)^{\alpha}$, normalized over $u \in U$
$S_1^{cyc}$ is like $S_1$, but its goal is to produce a translation which allows a listener model (now $S_0^{t \to s}$) to infer the original sentence, not among a small set of presupplied possibilities, but among all source language sentences. As such, an optimal translation of $s$ under $S_1^{cyc}$ has high probability of being generated by $S_0$ and high probability of being back-translated to $s$ by $S_0^{t \to s}$. (Unlike back-translation used to augment data during training (Sennrich et al., 2015), our model uses pretrained translators.)
Incremental Model
Exact inference is again intractable, though as with $S_1$, it is possible to approximate $S_1^{cyc}$ by subsampling from $U$. This is very close to the approach taken by Li et al. (2016), who find that reranking a set of outputs by the probability of recovering the input “dramatically decreases the rate of dull and generic responses” in a question-answering task. However, because the subsample is small relative to $U$, they use this method in conjunction with a diversity-increasing decoding algorithm.
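A minimal sketch of this kind of reranking approximation to (10), with hypothetical wrappers `forward_beam`, `forward_log_prob`, and `backward_log_prob` standing in for the pretrained source-to-target and target-to-source models:

```python
def s1_cyc_rerank(s, forward_beam, forward_log_prob, backward_log_prob, alpha=1.0):
    """Approximate the cyclic speaker of (10) on a beam subsample of U:
    score each candidate u by log S0(u | s) + alpha * log S0_{t->s}(s | u)."""
    candidates = forward_beam(s)                        # beam outputs of S0(. | s)
    scored = [(forward_log_prob(u, s) + alpha * backward_log_prob(s, u), u)
              for u in candidates]
    return [u for _, u in sorted(scored, reverse=True)]
```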
As in the case with explicit distractors, we instead opt for an incremental model, now $S_1^{cyc\text{-}inc}$, which approximates $S_1^{cyc}$. The definition of its word-level speaker $S_1^{cyc\text{-}wd}$ (12) is more complex than the incremental model with explicit distractors, since $S_0^{t \to s}$ must receive complete sentences, rather than the partial ones that $S_0^{wd}$ conditions on. As such, we need to marginalize over continuations of partial sentences in the target language (11):
(11)  $L^{cyc}(s \mid c + [wd]) = \sum_{u'} S_0(u' \mid s, c + [wd])\; S_0^{t \to s}(s \mid c + [wd] + u')$, where the sum ranges over continuations $u'$ of $c + [wd]$ into full target language sentences

(12)  $S_1^{cyc\text{-}wd}(wd \mid s, c) \propto S_0^{wd}(wd \mid s, c)\; L^{cyc}(s \mid c + [wd])^{\alpha}$, normalized over the target language vocabulary
Since the sum over continuations of $c + [wd]$ in (11) is intractable to compute exactly, we approximate it by a single continuation, obtained by greedily unrolling $S_0$. The following pseudocode resembles the Python code implementing $S_1^{cyc\text{-}inc}$ (note the use of Python indexing conventions and Numpy broadcasting). In practice, we fix WIDTH=2.
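A minimal Python sketch of this procedure, under our reading of the text above (the helpers `s0_topk_next_words`, `s0_greedy_complete`, and `backward_log_prob`, which would wrap the pretrained forward and backward models, are assumptions rather than the original listing):

```python
EOS = "</s>"
WIDTH = 2  # number of candidate next words rescored at each step

def s1_cyc_inc_translate(s, s0_topk_next_words, s0_greedy_complete,
                         backward_log_prob, alpha=1.0, max_len=100):
    """Build the translation word by word. At each step, the WIDTH most
    probable next words under S0_wd are each greedily completed into a full
    sentence (the single-continuation approximation to the sum in (11));
    each completion is scored by the backward model's log-probability of the
    source, and the next word is chosen by the combined score of (12)."""
    c = []  # partial target-language sentence (a list of words)
    for _ in range(max_len):
        # [(word, log S0_wd(word | s, c)), ...] for the top WIDTH words
        candidates = s0_topk_next_words(s, c, WIDTH)
        best_word, best_score = None, float("-inf")
        for wd, log_s0_wd in candidates:
            completion = s0_greedy_complete(s, c + [wd])  # full sentence extending c + [wd]
            cyc_score = backward_log_prob(s, completion)  # log S0_{t->s}(s | completion)
            score = log_s0_wd + alpha * cyc_score
            if score > best_score:
                best_word, best_score = wd, score
        c.append(best_word)
        if best_word == EOS:
            break
    return c
```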
2.3 Evaluating the Informative Translator
Our objective is to improve meaning preservation without detracting from translation quality in other regards (e.g. grammaticality). We conduct our evaluations on English to German translation.
We use cycle-consistency as a measure of meaning preservation, since the ability to recover the original sentence requires meaning distinctions not to be collapsed. In evaluating cycle-consistency, it is important to use a target-source translation mechanism separate from the one used to define $S_1^{cyc\text{-}inc}$. Otherwise, the system has access to the model which evaluates it and may improve cycle-consistency without producing meaningful target language sentences. For this reason, we translate German sentences (produced by $S_0$ or $S_1^{cyc\text{-}inc}$) back to English with Google Translate. To measure cycle-consistency, we use the BLEU metric (implemented with sacreBLEU (Post, 2018)), with the original sentence as the reference.
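As an illustration, the cycle-consistency score can be computed as below; `translate_en_de` (the system under evaluation) and `back_translate_de_en` (the independent German-to-English system) are placeholder functions.

```python
import sacrebleu

def cycle_consistency_bleu(english_sentences, translate_en_de, back_translate_de_en):
    """Round-trip the English sentences through German and score the result
    against the originals with corpus-level BLEU (sacreBLEU)."""
    german = [translate_en_de(s) for s in english_sentences]
    round_trip = [back_translate_de_en(g) for g in german]
    return sacrebleu.corpus_bleu(round_trip, [english_sentences]).score
```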
To further ensure that translation quality is not compromised by $S_1^{cyc\text{-}inc}$, we evaluate BLEU scores of the German sentences it produces.
We perform both evaluations (cycle-consistency and translation) on 750 sentences of the 2018 English-German WMT News test set (http://www.statmt.org/wmt18/translation-task.html); our implementation of $S_1^{cyc\text{-}inc}$ was not efficient, and we could not evaluate on more sentences for reasons of time. We use greedy unrolling in all models (using beam search is a goal for future work). For $\alpha$ (which represents the trade-off between informativity and translation quality) we use a value obtained by tuning on validation data.
Results
As shown in Table 1, $S_1^{cyc\text{-}inc}$ improves over $S_0$ not only in cycle-consistency, but in translation quality as well. This suggests that the goal of preserving information, in the sense defined by $S_1^{cyc}$ and approximated by $S_1^{cyc\text{-}inc}$, is important for translation quality.
Model | Cycle-consistency (BLEU) | Translation (BLEU)
---|---|---
$S_0$ | 43.35 | 37.42
$S_1^{cyc\text{-}inc}$ | 47.34 | 38.29
3 Conclusions
We identify a shortcoming of state-of-the-art translation systems and show that a version of the RSA framework's informative speaker $S_1$, adapted to the domain of translation, alleviates this problem in a way which improves not only cycle-consistency but translation quality as well. The success of $S_1^{cyc\text{-}inc}$ on two fairly similar languages suggests the possibility of larger improvements when translating between languages with larger-scale differences in what information is obligatorily represented, such as evidentiality or formality marking.
References
- Andreas and Klein (2016) Jacob Andreas and Dan Klein. 2016. Reasoning about pragmatics with neural listeners and speakers. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1173–1182. Association for Computational Linguistics.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Cohn-Gordon et al. (2018) Reuben Cohn-Gordon, Noah Goodman, and Christopher Potts. 2018. Pragmatically informative image captioning with character-level inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 439–443. Association for Computational Linguistics.
- Gehring et al. (2016) Jonas Gehring, Michael Auli, David Grangier, and Yann N Dauphin. 2016. A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344.
- Goodman and Frank (2016) Noah D Goodman and Michael C Frank. 2016. Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11):818–829.
- Kučera and Francis (1967) Henry Kučera and Winthrop Nelson Francis. 1967. Computational analysis of present-day American English. Dartmouth Publishing Group.
- Li et al. (2016) Jiwei Li, Will Monroe, and Dan Jurafsky. 2016. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562.
- Mao et al. (2016) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20.
- Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771.
- Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
- Vedantam et al. (2017) Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik. 2017. Context-aware captions from context-agnostic supervision. In Computer Vision and Pattern Recognition (CVPR), volume 3.