
Lost in Machine Translation: A Method to Reduce Meaning Loss

by   Reuben Cohn-Gordon, et al.

A desideratum of high-quality translation systems is that they preserve meaning, in the sense that two sentences with different meanings should not translate to one and the same sentence in another language. However, state-of-the-art systems often fail in this regard, particularly in cases where the source and target languages partition the "meaning space" in different ways. For instance, "I cut my finger." and "I cut my finger off." describe different states of the world but are translated to French (by both Fairseq and Google Translate) as "Je me suis coupé le doigt.", which is ambiguous as to whether the finger is detached. More generally, translation systems are typically many-to-one (non-injective) functions from source to target language, which in many cases results in important distinctions in meaning being lost in translation. Building on Bayesian models of informative utterance production, we present a method to define a less ambiguous translation system in terms of an underlying pre-trained neural sequence-to-sequence model. This method increases injectivity, resulting in greater preservation of meaning as measured by improvement in cycle-consistency, without impeding translation quality (measured by BLEU score).





1 Many-to-One Translations

Languages differ in what meaning distinctions they must mark explicitly. As such, translations risk mapping from a form in one language to a more ambiguous form in another. For example, the definite (1) and indefinite (2) both translate (under Fairseq and Google Translate) to (3) in French, which is ambiguous in definiteness.

Figure 1: State-of-the-art neural image captioner loses a meaning distinction which informative model preserves.


To evaluate the nature of this problem, we explored a corpus [1] of 500 pairs of distinct English sentences which map to a single German one (German being the evaluation language in section 2.3). We identify a number of common causes for the many-to-one maps. Two frequent types of verbal distinction lost when translating to German are tense (54 pairs, e.g. “…others {were, have been} introduced.”) and modality (16 pairs, e.g. “…prospects for this year {could, might} be better.”), where German “können” can express both epistemic and ability modality, distinguished in English with “might” and “could” respectively. Owing to English’s large vocabulary, lexical differences in verb (31 pairs, e.g. “arise” vs. “emerge”), noun (56 pairs, e.g. “mystery” vs. “secret”), adjective (47 pairs, e.g. “unaffected” vs. “untouched”) or deictic/pronoun (32 pairs, usually “this” vs. “that”) are also common. A large number of the pairs differ instead either orthographically, or in other ways that do not correspond to a clear semantic distinction (e.g. “She had {taken, made} a decision.”).

[1] Generated by selecting short sentences from the Brown corpus (Kučera and Francis, 1967), translating them to German, and taking the best two candidate translations back into English, if these two themselves translate to a single German sentence.
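The selection criterion described above (distinct English sentences that share one translation) can be illustrated with a small sketch. The lookup-table translator and the helper name `many_to_one_pairs` are hypothetical stand-ins introduced here; the actual corpus was built by round-trip translation through a neural system, which this sketch does not reproduce.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy translator: a lookup table standing in for a real system.
TOY_EN_TO_FR = {
    "I cut my finger.": "Je me suis coupé le doigt.",
    "I cut my finger off.": "Je me suis coupé le doigt.",
    "I cut my hair.": "Je me suis coupé les cheveux.",
}


def many_to_one_pairs(sentences, translate):
    """Return all pairs of distinct source sentences with the same translation."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[translate(sentence)].append(sentence)
    # Every pair inside a group is collapsed by the translator.
    return [pair for group in groups.values() for pair in combinations(group, 2)]


pairs = many_to_one_pairs(list(TOY_EN_TO_FR), TOY_EN_TO_FR.get)
```

Here `pairs` contains exactly the two "finger" sentences, since they are the only ones mapped to a single French sentence by the toy table.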

Our approach

While languages differ in what distinctions they are required to express, all are usually capable of expressing any given distinction when desired. As such, meaning loss of the kind discussed above is, in theory, avoidable. To this end, we propose a method to reduce meaning loss by applying the Rational Speech Acts (RSA) model of an informative speaker to translation. RSA has been used to model natural language pragmatics (Goodman and Frank, 2016), and recent work has shown its applicability to image captioning (Andreas and Klein, 2016; Vedantam et al., 2017; Mao et al., 2016), another sequence-generation NLP task. Here we use RSA to define a translator which reduces many-to-one mappings, and consequently meaning loss, in terms of a pretrained neural translation model. We introduce a strategy for performing inference efficiently with this model in the setting of translation, and show gains in cycle-consistency [2] as a result. Moreover, we obtain improvements in translation quality (BLEU score), suggesting that the goal of meaning preservation can also yield improved translations.

[2] Formally, say that a pair of functions $f: A \to B$, $g: B \to A$ is cycle-consistent if $g \circ f = \mathrm{id}$, the identity function. If $f$ is not one-to-one, then $(f, g)$ is not cycle-consistent. (Note however that when $A$ and $B$ are infinite, the converse does not hold: even if $f$ and $g$ are both one-to-one, $(f, g)$ need not be cycle-consistent, since one-to-one maps between infinite sets are not necessarily bijective.)
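As a minimal sketch of the cycle-consistency notion on a finite set of sentences: the lookup tables below are hypothetical stand-ins for a forward translator and a back-translator, and the pairing of English and German sentences is an assumption made for illustration.

```python
def is_cycle_consistent(f, g, domain):
    """True if g(f(x)) == x for every x in the finite domain."""
    return all(g(f(x)) == x for x in domain)


SENTENCES = ["He is wearing glasses.", "He wears glasses."]

# A many-to-one forward map: both sentences collapse to one German sentence,
# so no back-translator can recover both originals.
collapsing = {s: "Er trägt eine Brille." for s in SENTENCES}
back_collapsing = {"Er trägt eine Brille.": "He wears glasses."}

# A one-to-one forward map (assumed pairing) with its inverse as back-translator.
injective = {
    "He is wearing glasses.": "Er trägt jetzt eine Brille.",
    "He wears glasses.": "Er hat eine Brille.",
}
back_injective = {german: english for english, german in injective.items()}

collapsed_ok = is_cycle_consistent(collapsing.get, back_collapsing.get, SENTENCES)
injective_ok = is_cycle_consistent(injective.get, back_injective.get, SENTENCES)
```

With these tables `collapsed_ok` is `False` and `injective_ok` is `True`, matching the observation that a many-to-one forward map rules out cycle-consistency.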

Source (English): He is wearing glasses. | He wears glasses.
$S_0^{\text{SNT}}$ (German): Er trägt eine Brille. | Er trägt eine Brille.
$S_1^{\text{SNT-IP}}$ (German): Er trägt jetzt eine Brille. | Er hat eine Brille.
Figure 2: Similar to Figure 1, $S_0^{\text{SNT}}$ collapses two English sentences into a single German one, whereas $S_1^{\text{SNT-IP}}$ distinguishes the two in German.

2 Meaning Preservation as Informativity

In the RSA framework, the informative speaker model $S_1$ is given a state $w \in W$, and chooses an utterance $u \in U$ to convey $w$ to $S_1$’s own model of a listener. For translation, the state space $W$ is a set of source language sentences (sequences of words in the language), while $U$ is a set of target language sentences. $S_1$’s informative behavior discourages many-to-one maps that a non-informative translation model $S_0$ might allow.


BiLSTMs with attention (Bahdanau et al., 2014), and more recently CNNs (Gehring et al., 2016) and entirely attention-based models (Vaswani et al., 2017), constitute the state-of-the-art architectures in neural machine translation. All of these systems, once trained end-to-end on aligned data, can be viewed as a conditional distribution $S_0^{\text{WD}}(wd \mid w, c)$ [3], for a word $wd$ in the target language, a source language sentence $w$, and a partial sentence $c$ in the target language. $S_0^{\text{WD}}$ yields a distribution $S_0^{\text{SNT}}$ over full sentences [4]:

$$S_0^{\text{SNT}}(u \mid w, c) = \prod_{t} S_0^{\text{WD}}(u[t] \mid w, c + u[:t]) \tag{4}$$

$S_0^{\text{SNT}}$ returns a distribution over continuations of $c$ into full target language sentences [5]. To obtain a sentence from $S_0^{\text{SNT}}$ given a source language sentence $w$, one can greedily choose the highest probability word from $S_0^{\text{WD}}$ at each timestep, or explore a beam of possible candidates. We implement $S_0^{\text{WD}}$ (in terms of which all our other models are defined) using Fairseq’s publicly available pretrained Transformer models for English-German and English-French, and for German-English train a CNN using Fairseq.

[3] We use WD and SNT respectively to distinguish word and sentence level speaker models.
[4] Python list indexing conventions are used; “+” means concatenation of list to element or list.
[5] In what follows, we omit $c$ when it is empty, so that $S_0^{\text{SNT}}(u \mid w)$ is the probability of sentence $u$ given $w$.
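The relationship between a word-level model and the sentence-level distribution, together with greedy unrolling, can be sketched as follows. The table `TOY_TABLE` and the function names are hypothetical; a real word-level model would be a neural network such as the pretrained Fairseq models, not a lookup table.

```python
import math

EOS = "</s>"

# Hypothetical toy word-level model: (source, partial target) -> next-word distribution.
TOY_TABLE = {
    ("hello world", ()): {"hallo": 0.9, "welt": 0.1},
    ("hello world", ("hallo",)): {"welt": 0.8, EOS: 0.2},
    ("hello world", ("hallo", "welt")): {EOS: 1.0},
}


def s0_wd(src, c):
    """Next-word distribution given the source and the partial target c."""
    return TOY_TABLE[(src, tuple(c))]


def s0_snt_logprob(u, src, c=()):
    """Sentence log-probability as a sum of word-level log-probabilities."""
    c = list(c)
    return sum(math.log(s0_wd(src, c + list(u[:t]))[u[t]]) for t in range(len(u)))


def greedy_unroll(src, c=(), max_len=20):
    """Repeatedly append the most probable next word until EOS or max_len."""
    out = list(c)
    for _ in range(max_len):
        dist = s0_wd(src, out)
        word = max(dist, key=dist.get)
        out.append(word)
        if word == EOS:
            break
    return out


best = greedy_unroll("hello world")
prob = math.exp(s0_snt_logprob(best, "hello world"))
```

In this toy setting the greedy output is `["hallo", "welt", "</s>"]` and its sentence probability is the product 0.9 × 0.8 × 1.0 = 0.72.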

2.1 Explicit Distractors

We first describe a model for the simple case where a source language sentence needs to be distinguished from a presupplied distractor (as in the pairs shown in figures (1) and (2)). We use this model as a stepping stone to one which requires an input sentence in the source language only, and no distractors. We begin by defining a listener $L_1^{\text{SNT}}$, which receives a target language sentence $u$ and infers which sentence $w \in W$ (a presupplied set such as the pair (1) and (2)) would have resulted in the pretrained neural model $S_0^{\text{SNT}}$ producing $u$:

$$L_1^{\text{SNT}}(w \mid u) = \frac{S_0^{\text{SNT}}(u \mid w)}{\sum_{w' \in W} S_0^{\text{SNT}}(u \mid w')} \tag{5}$$

This allows the informative speaker model $S_1^{\text{SNT-GP}}$ to be defined in terms of $L_1^{\text{SNT}}$, where $U$ is the set of all possible target language sentences [7]:

$$S_1^{\text{SNT-GP}}(u \mid w) = \frac{S_0^{\text{SNT}}(u \mid w)\, L_1^{\text{SNT}}(w \mid u)^{\alpha}}{\sum_{u' \in U} S_0^{\text{SNT}}(u' \mid w)\, L_1^{\text{SNT}}(w \mid u')^{\alpha}} \tag{6}$$

[7] $\alpha$ is a hyperparameter of $S_1^{\text{SNT-GP}}$; as it increases, the model cares more about being informative and less about producing a reasonable translation.

The key property of this model is that, for $W = \{A, B\}$, when translating $A$, $S_1^{\text{SNT-GP}}$ prefers translations of $A$ that are unlikely to be good translations of $B$. So for pairs like (1) and (2), $S_1^{\text{SNT-GP}}$ is compelled to produce a translation for the former that reflects its difference from the latter, and vice versa.
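A small numerical sketch of this key property, assuming a listener obtained from the base translator by Bayes' rule with a uniform prior over the two source sentences, and an informative speaker that reweights the base translator by the listener raised to a power `alpha`. The probabilities are made up, and restricting the target sentences to three hand-picked candidates is a simplification for illustration only.

```python
A, B = "He is wearing glasses.", "He wears glasses."

# Made-up base translator probabilities S0[w][u] over three candidate translations.
S0 = {
    A: {"Er trägt eine Brille.": 0.6, "Er trägt jetzt eine Brille.": 0.4, "Er hat eine Brille.": 0.0},
    B: {"Er trägt eine Brille.": 0.6, "Er trägt jetzt eine Brille.": 0.0, "Er hat eine Brille.": 0.4},
}
STATES = [A, B]
CANDIDATES = list(S0[A])


def listener(w, u):
    """Probability that source w produced u, relative to all sources in STATES."""
    total = sum(S0[other][u] for other in STATES)
    return S0[w][u] / total


def speaker(w, alpha=1.0):
    """Base probability times listener probability to the power alpha, normalized."""
    unnorm = {u: S0[w][u] * listener(w, u) ** alpha for u in CANDIDATES}
    z = sum(unnorm.values())
    return {u: p / z for u, p in unnorm.items()}


def best(dist):
    return max(dist, key=dist.get)


baseline_choice = best(S0[A])
informative_choice = best(speaker(A))
```

The base translator picks the shared sentence for both sources, while the informative speaker moves each source to the candidate that the other source is unlikely to produce.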


Since $U$ is an infinite set, exactly computing the most probable utterance under $S_1^{\text{SNT-GP}}(\cdot \mid w)$ is intractable, because of the normalizing term. Andreas and Klein (2016) and Mao et al. (2016) perform approximate inference by sampling the subset of $U$ produced by a beam search from $S_0^{\text{SNT}}$. Vedantam et al. (2017) and Cohn-Gordon et al. (2018) employ a different method, using an incremental model $S_1^{\text{SNT-IP}}$ as an approximation of $S_1^{\text{SNT-GP}}$ on which inference can be tractably performed.

$S_1^{\text{SNT-IP}}$ considers informativity not over the whole set of utterances, but instead at each decision of the next word in the target language sentence. For this reason, the incremental method avoids the problem of lack of beam diversity encountered when subsampling from $U$, which becomes especially bad when producing long sequences, as is often the case in translation. $S_1^{\text{SNT-IP}}$ is defined as the product of informative decisions, specified by $S_1^{\text{WD}}$ and in turn $L_1^{\text{WD}}$, which are defined analogously to (6) and (5).

$$L_1^{\text{WD}}(w \mid wd, c) \propto S_0^{\text{WD}}(wd \mid w, c) \tag{7}$$

$$S_1^{\text{WD}}(wd \mid w, c) \propto S_0^{\text{WD}}(wd \mid w, c)\, L_1^{\text{WD}}(w \mid wd, c)^{\alpha} \tag{8}$$

$$S_1^{\text{SNT-IP}}(u \mid w, c) = \prod_{t} S_1^{\text{WD}}(u[t] \mid w, c + u[:t]) \tag{9}$$

$S_1^{\text{SNT-IP}}$ is able to avoid many-to-one mappings by choosing more informative translations. For instance, its translation of (1) is “Ces animaux courent vite” (These animals run fast.). See figures (1) and (2) for other examples of many-to-one mappings under $S_0^{\text{SNT}}$ avoided by $S_1^{\text{SNT-IP}}$.
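The incremental idea, being informative at each next-word decision rather than over whole sentences, can be sketched with a toy word-level model. Everything below is hypothetical: the source labels `"A"` and `"B"`, the made-up probabilities, and the helper names. The listener is assumed to use a uniform prior over the source sentence and its distractors.

```python
EOS = "</s>"


def s0_wd(src, c):
    """Hypothetical toy word-level translator with made-up probabilities."""
    c = tuple(c)
    if c == ():
        return {"A": {"les": 0.6, "ces": 0.4}, "B": {"les": 0.9, "ces": 0.1}}[src]
    if len(c) == 1:
        return {"animaux": 1.0}
    return {EOS: 1.0}


def s1_wd(src, c, distractors, alpha=1.0):
    """Word-level informative speaker: base probability times listener^alpha."""
    states = [src] + list(distractors)
    unnorm = {}
    for word, p in s0_wd(src, c).items():
        if p == 0.0:
            continue
        total = sum(s0_wd(w, c).get(word, 0.0) for w in states)
        listener = p / total  # how strongly this word points back to src
        unnorm[word] = p * listener ** alpha
    z = sum(unnorm.values())
    return {word: p / z for word, p in unnorm.items()}


def unroll(step, src, max_len=10):
    """Greedy decoding with an arbitrary next-word distribution `step`."""
    out = []
    for _ in range(max_len):
        dist = step(src, out)
        word = max(dist, key=dist.get)
        out.append(word)
        if word == EOS:
            break
    return out


s0_out = unroll(s0_wd, "A")
s1_out = unroll(lambda src, c: s1_wd(src, c, distractors=["B"]), "A")
```

In this toy example the base model starts with "les" for both sources, whereas the informative model starts with "ces" for source `"A"`, because "ces" is much less likely under the distractor.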

2.2 Avoiding Explicit Distractors

While $S_1^{\text{SNT-IP}}$ can disambiguate between pairs of sentences, it has two shortcomings. First, it requires one (or more) distractors to be provided, so translation is no longer fully automatic. Second, because the distractor set $W$ consists of only a pair (or finite set) of sentences, $S_1^{\text{SNT-IP}}$ only cares about being informative up to the goal of distinguishing between these sentences. Intuitively, total meaning preservation is achieved by a translation which distinguishes the source sentence from every other sentence in the source language which differs in meaning.

Both of these problems can be addressed by introducing a new “cyclic” model $S_1^{\text{SNT-CGP}}$ which reasons not about $L_1^{\text{SNT}}$ but about a pretrained translation model from target language to source language, $L_0^{\text{SNT}}$.

$$S_1^{\text{SNT-CGP}}(u \mid w) = \frac{S_0^{\text{SNT}}(u \mid w)\, L_0^{\text{SNT}}(w \mid u)^{\alpha}}{\sum_{u' \in U} S_0^{\text{SNT}}(u' \mid w)\, L_0^{\text{SNT}}(w \mid u')^{\alpha}} \tag{10}$$

$S_1^{\text{SNT-CGP}}$ is like $S_1^{\text{SNT-GP}}$, but its goal is to produce a translation which allows a listener model (now $L_0^{\text{SNT}}$) to infer the original sentence, not among a small set of presupplied possibilities, but among all source language sentences. As such, an optimal translation $u$ of $w$ under $S_1^{\text{SNT-CGP}}$ has high probability of being generated by $S_0^{\text{SNT}}$ and high probability of being back-translated [8] to $w$ by $L_0^{\text{SNT}}$.

[8] Unlike back-translation to augment data during training (Sennrich et al., 2015), our model uses pretrained translators.
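The objective of the cyclic model, high forward probability together with high back-translation probability, can be sketched as a reranking of a finite candidate list. Restricting the candidates to a short list is a simplifying assumption, and the probability tables and function names below are made up for illustration rather than taken from any real translator.

```python
import math

SRC = "He is wearing glasses."

# Made-up forward probabilities of each candidate given SRC.
FORWARD = {"Er trägt eine Brille.": 0.6, "Er trägt jetzt eine Brille.": 0.3}
# Made-up backward probabilities of recovering SRC from each candidate.
BACKWARD = {"Er trägt eine Brille.": 0.4, "Er trägt jetzt eine Brille.": 0.9}


def forward_logprob(u, src):
    return math.log(FORWARD[u])


def backward_logprob(src, u):
    return math.log(BACKWARD[u])


def rerank(src, candidates, alpha=1.0):
    """Pick the candidate maximizing log forward + alpha * log backward."""
    return max(
        candidates,
        key=lambda u: forward_logprob(u, src) + alpha * backward_logprob(src, u),
    )


baseline = rerank(SRC, list(FORWARD), alpha=0.0)
informative = rerank(SRC, list(FORWARD), alpha=1.0)
```

With `alpha=0.0` the choice reduces to the forward model alone; with `alpha=1.0` the candidate that back-translates more reliably wins (0.3 × 0.9 versus 0.6 × 0.4).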

Incremental Model

Exact inference is again intractable, though as with $S_1^{\text{SNT-GP}}$, it is possible to approximate by subsampling from $S_0^{\text{SNT}}$. This is very close to the approach taken by Li et al. (2016), who find that reranking a set of outputs by probability of recovering input “dramatically decreases the rate of dull and generic responses” in a question-answering task. However, because the subsample is small relative to $U$, they use this method in conjunction with a diversity-increasing decoding algorithm.

As in the case with explicit distractors, we instead opt for an incremental model, now $S_1^{\text{SNT-CIP}}$, which approximates $S_1^{\text{SNT-CGP}}$. The definition of $S_1^{\text{SNT-CIP}}$ (12) is more complex than the incremental model with explicit distractors ($S_1^{\text{SNT-IP}}$) since $L_0^{\text{SNT}}$ must receive complete sentences, rather than partial ones like $L_1^{\text{WD}}$. As such, we need to marginalize over continuations $k$ of partial sentences in the target language:

$$S_1^{\text{WD-C}}(wd \mid w, c) \propto S_0^{\text{WD}}(wd \mid w, c) \left( \sum_{k} L_0^{\text{SNT}}(w \mid c + wd + k)\, S_0^{\text{SNT}}(k \mid w, c + wd) \right)^{\alpha} \tag{11}$$

$$S_1^{\text{SNT-CIP}}(u \mid w, c) = \prod_{t} S_1^{\text{WD-C}}(u[t] \mid w, c + u[:t]) \tag{12}$$

Since the sum over continuations of $c$ in (11) is intractable to compute exactly, we approximate it by a single continuation, obtained by greedily unrolling $S_0^{\text{SNT}}$. The following pseudocode resembles the Python code [9] implementing $S_1^{\text{WD-C}}$. In practice, we fix WIDTH=2:

[9] Note the use of Python indexing conventions, and Numpy (numerical Python) broadcasting.

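def S1_WD_C_fwd(src, c=[]):
  # Pseudocode: S0_WD, S0_SNT_GREEDY and L0_SNT stand for the pretrained models.
  # The loop body is an assumption, reconstructed from the greedy approximation described in the text.
  support, logprobs = S0_WD.fwd(src, c)  # candidate next words, most probable first
  scores = []
  for i, wd in enumerate(support[:WIDTH]):
    u = S0_SNT_GREEDY.fwd(src, c + [wd])  # one greedy continuation to a full sentence
    scores.append(logprobs[i] + ALPHA * L0_SNT.logprob(src, u))
  next_word = support[np.argmax(scores)]
  return next_word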
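A self-contained toy version of this next-word step, under stated assumptions: the forward and backward "models" are hypothetical lookup functions with made-up probabilities, `ALPHA = 1.0` is an arbitrary illustrative value, and the score is left unnormalized because only the argmax is needed. It is a sketch of the greedy single-continuation approximation, not the authors' implementation.

```python
import math

EOS = "</s>"
WIDTH = 2
ALPHA = 1.0  # arbitrary illustrative value, not the tuned one


def s0_wd(src, c):
    """Toy forward word-level model (made-up numbers)."""
    c = tuple(c)
    if c == ():
        return {"les": 0.6, "ces": 0.4}
    if len(c) == 1:
        return {"animaux": 1.0}
    return {EOS: 1.0}


def l0_snt_logprob(src, u):
    """Toy backward model: log-probability of recovering src from full sentence u."""
    return math.log(0.9 if u[0] == "ces" else 0.4)


def greedy_unroll(src, c, max_len=10):
    """Extend the partial sentence c to a full one with the forward model."""
    out = list(c)
    while (not out or out[-1] != EOS) and len(out) < max_len:
        dist = s0_wd(src, out)
        out.append(max(dist, key=dist.get))
    return out


def s1_wd_c(src, c=()):
    """Choose the next word among the top WIDTH forward candidates."""
    c = list(c)
    dist = s0_wd(src, c)
    support = sorted(dist, key=dist.get, reverse=True)[:WIDTH]
    scores = {}
    for word in support:
        full = greedy_unroll(src, c + [word])  # single greedy continuation
        scores[word] = math.log(dist[word]) + ALPHA * l0_snt_logprob(src, full)
    return max(scores, key=scores.get)


def translate(src, max_len=10):
    out = []
    while (not out or out[-1] != EOS) and len(out) < max_len:
        out.append(s1_wd_c(src, out))
    return out


result = translate("the animals run fast")
```

Here the first step compares 0.6 × 0.4 for "les" against 0.4 × 0.9 for "ces", so the informative step picks "ces" even though the forward model alone prefers "les".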

2.3 Evaluating the Informative Translator

Our objective is to improve meaning preservation without detracting from translation quality in other regards (e.g. grammaticality). We conduct our evaluations on English to German translation.

We use cycle-consistency as a measure of meaning preservation, since the ability to recover the original sentence requires meaning distinctions not to be collapsed. In evaluating cycle-consistency, it is important to use a target-source translation mechanism separate from the one used to define $S_1^{\text{SNT-CIP}}$. Otherwise, the system has access to the model which evaluates it and may improve cycle-consistency without producing meaningful target language sentences. For this reason, we translate German sentences (produced by $S_0^{\text{SNT}}$ or $S_1^{\text{SNT-CIP}}$) back to English with Google Translate. To measure cycle-consistency, we use the BLEU metric (implemented with sacreBLEU (Post, 2018)), with the original sentence as the reference.

To further ensure that translation quality is not compromised by $S_1^{\text{SNT-CIP}}$, we evaluate BLEU scores of the German sentences it produces.

We perform both evaluations (cycle-consistency and translation) on 750 sentences [10] of the 2018 English-German WMT News test-set. We use greedy unrolling in all models (using beam search is a goal for future work). For $\alpha$ (which represents the trade-off between informativity and translation quality) we use a value obtained by tuning on validation data.

[10] Our implementation of $S_1^{\text{SNT-CIP}}$ was not efficient, and we could not evaluate on more sentences for reasons of time.


As shown in table (1), $S_1^{\text{SNT-CIP}}$ improves over $S_0^{\text{SNT}}$ not only in cycle-consistency, but in translation quality as well. This suggests that the goal of preserving information, in the sense defined by $S_1^{\text{SNT-CGP}}$ and approximated by $S_1^{\text{SNT-CIP}}$, is important for translation quality.

Model                      Cycle    Translate
$S_0^{\text{SNT}}$         43.35    37.42
$S_1^{\text{SNT-CIP}}$     47.34    38.29
Table 1: BLEU score on cycle-consistency (Cycle) and translation (Translate) for WMT, across baseline and informative models. Greedy unrolling and the tuned value of $\alpha$ are used.

3 Conclusions

We identify a shortcoming of state-of-the-art translation systems and show that a version of the RSA framework’s informative speaker $S_1$, adapted to the domain of translation, alleviates this problem in a way which improves not only cycle-consistency but translation quality as well. The success of $S_1^{\text{SNT-CIP}}$ on two fairly similar languages suggests the possibility of larger improvements when translating between languages in which larger scale differences exist in what information is obligatorily represented, such as evidentiality or formality marking.