
When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion

by   Elena Voita, et al.

Though machine translation errors caused by the lack of context beyond one sentence have long been acknowledged, the development of context-aware NMT systems is hampered by several problems. Firstly, standard metrics are not sensitive to improvements in consistency in document-level translations. Secondly, previous work on context-aware NMT assumed that the sentence-aligned parallel data consisted of complete documents while in most practical scenarios such document-level data constitutes only a fraction of the available parallel data. To address the first issue, we perform a human study on an English-Russian subtitles dataset and identify deixis, ellipsis and lexical cohesion as three main sources of inconsistency. We then create test sets targeting these phenomena. To address the second shortcoming, we consider a set-up in which a much larger amount of sentence-level data is available compared to that aligned at the document level. We introduce a model that is suitable for this scenario and demonstrate major gains over a context-agnostic baseline on our new benchmarks without sacrificing performance as measured with BLEU.





Code Repositories


This is a repository with the data and code for the ACL 2019 paper "When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion"


1 Introduction

With the recent rapid progress of neural machine translation (NMT), translation mistakes and inconsistencies due to the lack of extra-sentential context are becoming more and more noticeable among otherwise adequate translations produced by standard context-agnostic NMT systems Läubli et al. (2018). Though this problem has recently triggered a lot of attention to context-aware translation Jean et al. (2017a); Wang et al. (2017); Tiedemann and Scherrer (2017); Bawden et al. (2018); Voita et al. (2018); Maruf and Haffari (2018); Agrawal et al. (2018); Miculicich et al. (2018), the progress and wide-spread adoption of the new paradigm is hampered by several important problems. Firstly, it is highly non-trivial to design metrics which would reliably trace the progress and guide model design. Standard machine translation metrics (e.g., BLEU) do not appear appropriate as they do not sufficiently differentiate between consistent and inconsistent translations Wong and Kit (2012).[1] For example, if multiple translations of a name are possible, forcing consistency is essentially as likely to make all occurrences of the name match the reference translation as making them all different from the reference. Secondly, most previous work on context-aware NMT has made the assumption that all the bilingual data is available at the document level. However, isolated parallel sentences are a lot easier to acquire and hence only a fraction of the parallel data will be at the document level in any practical scenario. In other words, a context-aware model trained only on document-level parallel data is highly unlikely to outperform a context-agnostic model estimated from a much larger sentence-level parallel corpus. This work aims to address both these shortcomings.

[1] We use the term ‘inconsistency’ to refer to any violations causing good translations of isolated sentences not to work together, independently of which linguistic phenomena (e.g., ellipsis or lexical cohesion) impose the violated constraints.

A context-agnostic NMT system would often produce plausible translations of isolated sentences; however, when put together in a document, these translations end up being inconsistent with each other. We investigate which linguistic phenomena cause the inconsistencies using the OpenSubtitles Lison et al. (2018) corpus for the English-Russian language pair. We identify deixis, ellipsis and lexical cohesion as three main sources of the violations, together amounting to about 80% of the cases. We create test sets focusing specifically on the three identified phenomena (6000 examples in total).

We show that by using a limited amount of document-level parallel data, we can already achieve substantial improvements on these benchmarks without negatively affecting performance as measured with BLEU. Our approach is inspired by the Deliberation Networks Xia et al. (2017). In our method, the initial translation produced by a baseline context-agnostic model is refined by a context-aware system which is trained on a small document-level subset of parallel data.

The key contributions are as follows:

  • we analyze which phenomena cause context-agnostic translations to be inconsistent with each other;

  • we create test sets specifically addressing the most frequent phenomena;

  • we consider a novel and realistic set-up where a much larger amount of sentence-level data is available compared to that aligned at the document level;

  • we introduce a model suitable for this scenario, and demonstrate that it is effective on our new benchmarks without sacrificing performance as measured with BLEU.

2 Analysis

We begin with a human study, in which we:

  1. identify cases when good sentence-level translations are not good when placed in context of each other,

  2. categorize these examples according to the phenomena leading to a discrepancy in translations of consecutive sentences.

The test sets introduced in Section 3 will then target the most frequent phenomena.

2.1 Human annotation

To find what makes good context-agnostic translations incorrect when placed in context of each other, we start with pairs of consecutive sentences. We gather data with context from the publicly available OpenSubtitles2018 corpus Lison et al. (2018) for English and Russian. We train a context-agnostic Transformer on 6m sentence pairs. Then we translate 2000 pairs of consecutive sentences using this model. For more details on model training and data preprocessing see Section 5.3.

Then we use human annotation to assess the adequacy of the translations without context and in the context of each other. The whole process is two-stage:

  1. sentence-level evaluation: we ask if the translation of a given sentence is good,

  2. evaluation in context: for pairs of consecutive good translations according to the first stage, we ask if the translations are good in context of each other.

In the first stage, the annotators are instructed to mark as “good” translations which (i) are fluent sentences in the target language (in our case, Russian) (ii) can be reasonable translations of a source sentence in some context.

For the second stage we only consider pairs of sentences with good sentence-level translations. The annotators are instructed to mark translations as bad in context of each other only if there is no other possible interpretation or additional context which could have made them appropriate. This was done to get more robust results, avoiding the influence of personal preferences of the annotators (for example, for using formal or informal speech), and excluding ambiguous cases that can only be resolved with additional context.

                         both good
all      one/both bad    bad pair    good pair
2000     211             140         1649
100%     11%             7%          82%
Table 1: Human annotation statistics of pairs of consecutive translations.

The statistics of answers are provided in Table 1. We find that our annotators labelled 82% of sentence pairs as good translations. In 11% of cases, at least one translation was considered bad at the sentence level, and in another 7%, the sentences were considered individually good, but bad in context of each other. This indicates that in our setting, a substantial proportion of translation errors are only recognized as such in context.

2.2 Types of phenomena

From the results of the human annotation, we take all instances of consecutive sentences with good translations which become incorrect when placed in the context of each other. For each, we identify the language phenomenon which caused a discrepancy. The results are provided in Table 2.

Below we discuss these types of phenomena, as well as the translation problems they cause, in more detail. In the scope of the current work, we concentrate only on the three most frequent phenomena.

type of phenomena    frequency
deixis               37%
ellipsis             29%
lexical cohesion     14%
ambiguity             9%
anaphora              6%
other                 5%
Table 2: Types of phenomena causing discrepancy in context-agnostic translation of consecutive sentences when placed in the context of each other.

2.2.1 Deixis

In this category, we group several types of deictic words or phrases, i.e. referential expressions whose denotation depends on context. This includes personal deixis (“I”, “you”), place deixis (“here”, “there”), and discourse deixis, where parts of the discourse are referenced (“that’s a good question.”). Most errors in our annotated corpus are related to person deixis, specifically gender marking in the Russian translation, and the T-V distinction between informal and formal you (Latin “tu” and “vos”).

type of discrepancy           frequency
T-V distinction               67%
speaker/addressee gender:
    same speaker              22%
    different speaker          9%
other                          2%
Table 3: Types of discrepancy in context-agnostic translation caused by deixis (excluding anaphora).

In many cases, even when having access to neighboring sentences, one cannot make a confident decision about which of the forms should be used, as there are no obvious markers pointing to one form or another (e.g., for the T-V distinction, words such as “officer”, “mister” for formal and “honey”, “dude” for informal). However, when pronouns refer to the same person, the pronouns, as well as verbs that agree with them, should be translated using the same form. See Figure 1(a) for an example translation that violates T-V consistency. Figure 1(b) shows an example of inconsistent first person gender (marked on the verb) although the speaker is clearly the same.

Anaphora are a form of deixis that has received a lot of attention in MT research, both from the perspective of modelling (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Jean et al., 2017b; Bawden et al., 2018; Voita et al., 2018, among others) and targeted evaluation Hardmeier et al. (2015); Guillou and Hardmeier (2016); Müller et al. (2018). We therefore list anaphora errors separately and do not focus on them further.

Figure 1: Examples of violation of (a) T-V form consistency, (b) speaker gender consistency.
In color: (a) red – V-form, blue – T-form; (b) red – feminine, blue – masculine.

2.2.2 Ellipsis

Ellipsis is the omission from a clause of one or more words that are nevertheless understood in the context of the remaining elements.

In machine translation, elliptical constructions in the source language pose a problem if the target language does not allow the same types of ellipsis (requiring the elided material to be predicted from context), or if the elided material affects the syntax of the sentence; for example, the grammatical function of a noun phrase and thus its inflection in Russian may depend on the elided verb (Figure 2(a)), or the verb inflection may depend on the elided subject. Our analysis focuses on ellipses that can only be understood and translated with context beyond the sentence-level. This has not been studied extensively in MT research.[2]

[2] Exceptions include Yamamoto and Sumita (1998), and work on the related phenomenon of pronoun dropping (Russo et al., 2012; Wang et al., 2016; Rios and Tuggener, 2017).

We classified ellipsis examples which lead to errors in sentence-level translations by the type of error they cause. Results are provided in Table 4.

type of discrepancy          frequency
wrong morphological form     66%
wrong verb (VP-ellipsis)     20%
other error                  14%
Table 4: Types of discrepancy in context-agnostic translation caused by ellipsis.

Figure 2: Examples of discrepancies caused by ellipsis. (a) wrong morphological form, incorrectly marking the noun phrase as a subject. (b) correct meaning is “see”, but MT produces хотели (“want”).

It can be seen that the most frequent problems related to ellipsis that we find in our annotated corpus are wrong morphological forms, followed by wrongly predicted verbs in case of verb phrase ellipsis in English, which does not exist in Russian, thus requiring the prediction of the verb in the Russian translation (Figure 2(b)).

2.2.3 Lexical cohesion

Lexical cohesion has been studied previously in MT (Tiedemann, 2010; Gong et al., 2011; Wong and Kit, 2012; Kuang et al., 2018; Miculicich et al., 2018, among others).

There are various cohesion devices Morris and Hirst (1991), and a good translation should exhibit lexical cohesion beyond the sentence level. We focus on repetition with two frequent cases in our annotated corpus being reiteration of named entities (Figure 3(a)) and reiteration of more general phrase types for emphasis (Figure 3(b)) or in clarification questions.

Figure 3: Examples of lack of lexical cohesion in MT. (a) Name translation inconsistency. (b) Inconsistent translation. Using either of the highlighted translations consistently would be good.

3 Test Sets

For the most frequent phenomena from the above analysis we create test sets for targeted evaluation.

Each test set contains contrastive examples. It is specifically designed to test the ability of a system to adapt to contextual information and handle the phenomenon under consideration. Each test instance consists of a true example (sequence of sentences and their reference translation from the data) and several contrastive translations which differ from the true one only in the considered aspect. All contrastive translations we use are correct plausible translations at a sentence level, and only context reveals the errors we introduce. All the test sets are guaranteed to have the necessary context in the provided sequence of 3 sentences. The system is asked to score each candidate example, and we compute the system accuracy as the proportion of times the true translation is preferred over the contrastive ones.
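The scoring protocol described above can be illustrated with a minimal sketch. The instance layout (keys `source`, `true`, `contrastive`), the function name `contrastive_accuracy` and the two toy scorers are assumptions made for illustration; in practice the scorer would be a model's log-probability of the target given the source, and ties are counted here as errors, which the text does not specify.

```python
from typing import Callable, Dict, List, Sequence


def contrastive_accuracy(
    instances: Sequence[Dict[str, list]],
    score: Callable[[List[str], List[str]], float],
) -> float:
    """Fraction of instances where the true translation outscores every contrastive one."""
    correct = 0
    for inst in instances:
        source = inst["source"]
        true_score = score(source, inst["true"])
        # Assumption: ties count as errors, so the true translation
        # must be strictly preferred over each contrastive variant.
        if all(true_score > score(source, alt) for alt in inst["contrastive"]):
            correct += 1
    return correct / len(instances)


# Toy data: each "sentence" is a single placeholder token.
toy_instances = [
    {"source": ["s1", "s2", "s3"], "true": ["x", "x", "x"], "contrastive": [["x", "x", "y"]]},
    {"source": ["s1", "s2", "s3"], "true": ["y", "y", "y"], "contrastive": [["y", "y", "x"]]},
]


def consistency_scorer(source: List[str], target: List[str]) -> float:
    # Hypothetical scorer that penalizes the number of distinct target tokens.
    return -float(len(set(target)))


def last_sentence_scorer(source: List[str], target: List[str]) -> float:
    # Hypothetical scorer that only looks at the final sentence.
    return 1.0 if target[-1] == "x" else 0.0


consistent_acc = contrastive_accuracy(toy_instances, consistency_scorer)
last_only_acc = contrastive_accuracy(toy_instances, last_sentence_scorer)
```

On this toy data the scorer that rewards consistency prefers the true translation in both instances, while the scorer that ignores context is right in only one of the two.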

Test set statistics are shown in Table 5. All test sets and scoring scripts will be made freely available at the time of publication.

3.1 Deixis

From Table 3, we see that the most frequent error category related to deixis in our annotated corpus is the inconsistency of T-V forms when translating second person pronouns. The test set we construct for this category tests the ability of a machine translation system to produce translations with consistent level of politeness.

We semi-automatically identify sets of consecutive sentences with consistent politeness markers on pronouns and verbs (but without nominal markers such as “Mr.” or “officer”) and switch T and V forms.[4] This gives us two sets of translations for each example, one consistently informal (T), and one consistently formal (V). For each, we create an inconsistent contrastive example by switching the formality of the last sentence. The symmetry of the test set ensures that any context-agnostic model has 50% accuracy on the test set.

[4] A detailed description of all these steps is provided in the supplementary material.
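The symmetry argument can be checked on a toy example. The sketch below assumes that a context-agnostic model scores a group of sentences as a sum of independent per-sentence scores; the function names and placeholder sentences are hypothetical, not taken from the released scripts.

```python
from typing import Callable, List, Tuple

Group = List[str]  # translations of a group of consecutive sentences


def build_deixis_pairs(t_group: Group, v_group: Group) -> List[Tuple[Group, Group]]:
    """Return (true, contrastive) pairs for the informal and formal variants."""
    if len(t_group) != len(v_group) or not t_group:
        raise ValueError("both variants must cover the same non-empty sentence group")
    # Contrastive examples switch the formality of the last sentence only.
    t_contrastive = t_group[:-1] + [v_group[-1]]
    v_contrastive = v_group[:-1] + [t_group[-1]]
    return [(list(t_group), t_contrastive), (list(v_group), v_contrastive)]


def context_agnostic_accuracy(
    pairs: List[Tuple[Group, Group]],
    sentence_score: Callable[[str], int],
) -> float:
    """Accuracy of a scorer that rates every sentence independently of its neighbours."""
    def group_score(group: Group) -> int:
        return sum(sentence_score(s) for s in group)

    wins = sum(group_score(true) > group_score(alt) for true, alt in pairs)
    return wins / len(pairs)


# Placeholder sentences: "t*" stands for an informal and "v*" for a formal variant.
pairs = build_deixis_pairs(["t1", "t2", "t3"], ["v1", "v2", "v3"])
acc_prefers_t = context_agnostic_accuracy(pairs, lambda s: 1 if s.startswith("t") else 0)
acc_prefers_v = context_agnostic_accuracy(pairs, lambda s: 1 if s.startswith("v") else 0)
```

The scores of the shared context sentences cancel, so the two comparisons reduce to the same last-sentence preference with opposite signs: whichever form the scorer favours, it wins one pair and loses the other.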

                              latest relevant context
                    total     1st      2nd      3rd
deixis              3000      1000     1000     1000
lex. cohesion       2000      855      630      515
ellipsis (infl.)    500
ellipsis (VP)       500
Table 5: Size of test sets: total number of test instances and with regard to the latest context sentence with politeness indication or with the named entity under consideration. For ellipsis, we distinguish whether the model has to predict the correct NP inflection or the correct verb sense (VP ellipsis).

3.2 Ellipsis

From Table 4, we see that the two most frequent types of ambiguity caused by the presence of an elliptical structure are different in nature, hence we construct individual test sets for each of them.

Ambiguity of the first type comes from the inability to predict a morphological form of some words. We manually gather examples with such structures in a source sentence and change the morphological inflection of the relevant target phrase to create a contrastive translation. Specifically, we focus on noun phrases where the verb is elided, and the ambiguity lies in how the NP is inflected.

The second type we evaluate is verb phrase ellipsis. Mostly these are sentences with an auxiliary verb “do” and an omitted main verb. We manually gather such examples and replace the translation of the verb, which is only present on the target side, with other verbs with a different meaning but the same inflection. Verbs which are used to construct such contrastive translations are the top-10 lemmas of translations of the verb “do” which we get from the lexical table of Moses Koehn et al. (2007) induced from the training data.

3.3 Lexical cohesion

Lexical cohesion can be established for various types of phrases and can involve reiteration or other semantic relations. In the scope of the current work, we focus on the reiteration of entities, since these tend to be non-coincidental, and can be easily detected and transformed.

We identify named entities with alternative translations into Russian, find passages where they are translated consistently, and create contrastive test examples by switching the translation of some instances of the named entity. For more details, please refer to the appendix.

4 Model and Setting

4.1 Setting

Previous work on context-aware neural machine translation used data where all training instances have context. This setting limits the set of available training sets one can use: in a typical scenario, we have a lot of sentence-level parallel data and only a small fraction of document-level data. Since machine translation quality depends heavily on the amount of training data, training a context-aware model is counterproductive if this leads to ignoring the majority of available sentence-level data and sacrificing general quality. We will also show that a naive approach to combining sentence-level and document-level data leads to a drop in performance.

In this work, we argue that it is important to consider an asymmetric setting where the amount of available document-level data is much smaller than that of sentence-level data, and propose an approach specifically targeting this scenario.

4.2 Model

Figure 4: Model architecture

We introduce a two-pass framework: first, the sentence is translated with a context-agnostic model, and then this translation is refined using context of several previous sentences, including both source sentences and their translations. We expect this architecture to be suitable in the proposed setting: the baseline context-agnostic model can be trained on a large amount of sentence-level data, and the second-pass model can be estimated on a smaller subset of parallel data which includes context. As the first-pass translation is produced by a strong model, we expect no loss in general performance when training the second part on a smaller dataset.

The model is close in spirit to the Deliberation networks Xia et al. (2017). The first part of the model is a context-agnostic model (we refer to it as the base model), and the second one is a context-aware decoder (CADec) which refines context-agnostic translations using context. The base model is trained on sentence-level data and then fixed. It is used only to sample context-agnostic translations and to get vector representations of the source and translated sentences. CADec is trained only on data with context.

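Let $D_{sent} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the sentence-level data with $N$ paired sentences and $D_{doc} = \{(x_j, y_j, c_j)\}_{j=1}^{M}$ denote the document-level data, where $(x_j, y_j)$ are the source and target sides of a sentence to be translated and $c_j$ are several preceding sentences along with their translations.

Base model For the baseline context-agnostic model we use the original Transformer-base Vaswani et al. (2017), trained to maximize the sentence-level log-likelihood $\frac{1}{N} \sum_{(x_i, y_i) \in D_{sent}} \log P(y_i \mid x_i, \theta_B)$.

Context-aware decoder (CADec) The context-aware decoder is trained to correct translations given by the base model using contextual information. Namely, we maximize the following document-level log-likelihood:

$$\frac{1}{M} \sum_{(x_j, y_j) \in D_{doc}} \log \mathbb{E}_{y_j^B \sim P(y \mid x_j, \theta_B)} P(y_j \mid x_j, y_j^B, c_j, \theta_C),$$

where $y_j^B$ is sampled from $P(y \mid x_j, \theta_B)$.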

CADec is composed of a stack of identical layers and is similar to the decoder of the original Transformer. It has a masked self-attention layer and attention to encoder outputs, and additionally each layer has a block attending over decoder outputs (Figure 4). We use the states from the last layer of the base model’s encoder of the current source sentence and all context sentences as input to the first multi-head attention. For the second multi-head attention we input both last states of the base decoder and the target-side token embedding layer; this is done for translations of the source and also all context sentences. All sentence representations are produced by the base model. To encode the relative position of each sentence, we concatenate both the encoder and decoder states with one-hot vectors representing their position (0 for the source sentence, 1 for the immediately preceding one, etc). These distance embeddings are shown in blue in Figure 4.
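The relative-position encoding described above can be sketched in a few lines. This is a plain-Python illustration with lists standing in for tensors; the function names, the fixed one-hot size (maximum number of context sentences plus one) and the toy state values are assumptions, not details confirmed by the text.

```python
from typing import List

Vector = List[float]


def one_hot(index: int, size: int) -> Vector:
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def add_distance_features(
    sentence_states: List[List[Vector]],
    max_context: int = 3,
) -> List[Vector]:
    """Concatenate every token state with a one-hot code of its sentence distance.

    sentence_states[d] holds the token states of the sentence at distance d:
    0 is the current sentence, 1 the immediately preceding one, and so on.
    """
    size = max_context + 1  # assumed width of the distance code
    if len(sentence_states) > size:
        raise ValueError("more sentences than the distance code can represent")
    memory: List[Vector] = []
    for distance, states in enumerate(sentence_states):
        tag = one_hot(distance, size)
        for state in states:
            memory.append(state + tag)  # list concatenation: [state ; one-hot]
    return memory


# Toy states: a two-token current sentence and a one-token previous sentence.
toy_states = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]]
toy_memory = add_distance_features(toy_states)
```

Flattening all sentences into one sequence of tagged states lets a single attention block attend over the current sentence and its context at once while still telling them apart.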

5 Experiments

5.1 Training

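At training time, we use reference translations as translations of the previous sentences. For the current sentence, we either sample a translation from the base model or use a corrupted version of the reference translation. We propose to stochastically mix objectives corresponding to these versions:

$$\frac{1}{M} \sum_{(x_j, y_j) \in D_{doc}} \log \left[ b_j \cdot P(y_j \mid x_j, \tilde{y}_j, c_j, \theta_C) + (1 - b_j) \cdot P(y_j \mid x_j, y_j^B, c_j, \theta_C) \right],$$

where $y_j^B$ is the translation sampled from the base model, $\tilde{y}_j$ is a corrupted version of the reference translation and $b_j \in \{0, 1\}$ is drawn from a Bernoulli distribution with parameter $p$; $p = 0.5$ in our experiments. Reference translations are corrupted by replacing 20% of their tokens with random tokens.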
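A minimal sketch of how the first-pass input for one training example might be chosen is given below. Because the mixing variable is binary, the objective effectively selects one of the two inputs per example. The function names, the default mixing probability, and the choice to replace an exact 20% of positions (rather than corrupting each token independently) are assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def corrupt(tokens: Sequence[str], vocab: Sequence[str], rate: float = 0.2,
            rng: random.Random = random.Random(0)) -> List[str]:
    """Replace a fraction `rate` of positions with random vocabulary tokens."""
    out = list(tokens)
    n_replace = round(rate * len(out))
    # Sample distinct positions; a random token may coincide with the original one.
    for pos in rng.sample(range(len(out)), n_replace):
        out[pos] = rng.choice(vocab)
    return out


def first_pass_input(reference: Sequence[str],
                     sample_from_base: Callable[[], List[str]],
                     vocab: Sequence[str],
                     p: float = 0.5,
                     rng: random.Random = random.Random(0)) -> List[str]:
    """With probability p use a corrupted reference, otherwise a base-model sample."""
    if rng.random() < p:
        return corrupt(reference, vocab, rng=rng)
    return sample_from_base()


# Toy usage with placeholder tokens; the vocabulary is disjoint from the reference
# so that every replacement is visible.
demo_rng = random.Random(0)
reference = [f"r{i}" for i in range(10)]
noise_vocab = ["<a>", "<b>"]
corrupted = corrupt(reference, noise_vocab, rng=demo_rng)
n_changed = sum(a != b for a, b in zip(reference, corrupted))
always_base = first_pass_input(reference, lambda: ["draft"], noise_vocab, p=0.0, rng=demo_rng)
always_noisy = first_pass_input(reference, lambda: ["draft"], noise_vocab, p=1.0, rng=demo_rng)
```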

5.2 Inference

As input to CADec for the current sentence, we use the translation produced by the base model. Target sides of the previous sentences are produced by our two-stage approach. We use beam search with a beam of 4 for all models.
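The inference procedure can be summarized as a loop over the sentences of a document. In the sketch below, `base_translate` and `cadec_refine` are hypothetical callables standing in for the two models (beam search would live inside them), and the context window of 3 follows the setting used in this work.

```python
from typing import Callable, List


def translate_document(
    source_sentences: List[str],
    base_translate: Callable[[str], str],
    cadec_refine: Callable[[str, str, List[str], List[str]], str],
    max_context: int = 3,
) -> List[str]:
    """Two-pass translation: draft each sentence, then refine it using context."""
    sources_so_far: List[str] = []
    targets_so_far: List[str] = []
    for src in source_sentences:
        draft = base_translate(src)  # first pass, context-agnostic
        start = max(0, len(sources_so_far) - max_context)
        ctx_src = sources_so_far[start:]
        # Context targets are the refined outputs of earlier steps,
        # not first-pass drafts and not references.
        ctx_tgt = targets_so_far[start:]
        final = cadec_refine(src, draft, ctx_src, ctx_tgt)
        sources_so_far.append(src)
        targets_so_far.append(final)
    return targets_so_far


# Stub models that make the data flow visible.
recorded_contexts = []


def stub_refine(src: str, draft: str, ctx_src: List[str], ctx_tgt: List[str]) -> str:
    recorded_contexts.append((list(ctx_src), list(ctx_tgt)))
    return draft + "*"


demo_output = translate_document(["a", "b", "c", "d", "e"], str.upper, stub_refine)
```

With these stubs, the fifth sentence is refined with the three preceding sources and their already refined translations as context.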

5.3 Data and setting

We use the publicly available OpenSubtitles2018 corpus Lison et al. (2018) for English and Russian. As described in detail in the appendix, we apply data cleaning after which only a fraction of data has context of several previous sentences. We use up to 3 context sentences in this work. We randomly choose 6 million training instances from the resulting data. We randomly choose two subsets of 10k instances for development and testing and construct our contrastive test sets from 400k held-out instances from movies not encountered in training.[5]

[5] The resulting data sets are available here:

The hyperparameters, preprocessing and training details are provided in the supplementary material.

6 Results

We evaluate in two different ways: using BLEU for general quality and the proposed contrastive test sets for consistency. We show that models indistinguishable with BLEU can be very different in terms of consistency.

We consider two baselines. The context-agnostic baseline is Transformer-base trained on all sentence-level data. Recall that it is also used as the base model in our 2-stage approach. As the context-aware baseline, we use a simple concatenation model. It is trained on 6m sentence pairs, including 1.5m having 3 context sentences.

We randomly choose 500 out of 2000 examples from the lexical cohesion set and 500 out of 3000 from the deixis test set for validation and leave the rest for final testing. We compute BLEU on the development set as well as scores on lexical cohesion and deixis development sets. We use convergence in both metrics to decide when to stop training. The importance of using both criteria is discussed in Section 6.3. After the convergence, we average 5 checkpoints and report scores on the final test sets.
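The stopping rule and checkpoint averaging can be sketched as follows. The text does not define how convergence is detected, so the window and tolerance below are assumptions, as are the function names and the representation of a checkpoint as a dictionary of flat parameter lists.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]


def has_converged(history: List[float], window: int = 3, tol: float = 0.1) -> bool:
    """Assumed criterion: the metric moved by less than `tol` over the last `window` steps."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return max(recent) - min(recent) < tol


def should_stop(bleu_history: List[float], consistency_history: List[float]) -> bool:
    """Stop only when both BLEU and the consistency score have flattened out."""
    return has_converged(bleu_history) and has_converged(consistency_history)


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Element-wise mean of parameters across checkpoints with identical shapes."""
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        vectors = [ckpt[name] for ckpt in checkpoints]
        averaged[name] = [sum(vals) / len(vals) for vals in zip(*vectors)]
    return averaged


# Toy histories: BLEU is flat while the consistency score is still rising.
stop_early = should_stop([32.30, 32.31, 32.35, 32.32], [50.0, 52.0, 55.0, 58.0])
stop_late = should_stop([32.30, 32.31, 32.35, 32.32], [58.00, 58.02, 58.05, 58.03])
toy_average = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
```

Requiring both histories to flatten prevents training from ending while consistency is still improving, which is the situation discussed in Section 6.3.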

6.1 General results

BLEU scores for our model and the baselines are given in Table 6.[6] For context-aware models, all sentences in a group were translated, and then only the current sentence is evaluated. We also report BLEU for the context-agnostic baseline trained only on the 1.5m dataset to show how the performance is influenced by the amount of data.

[6] We use bootstrap resampling Riezler and Maxwell (2005) for significance testing.
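For readers unfamiliar with bootstrap significance testing, a generic paired bootstrap sketch is shown below. It is not necessarily the exact procedure of the cited work: the function names, the number of resamples and the exact-match metric (standing in for corpus-level BLEU) are all assumptions.

```python
import random
from typing import Callable, List, Sequence


def paired_bootstrap(
    hyps_a: Sequence[str],
    hyps_b: Sequence[str],
    refs: Sequence[str],
    metric: Callable[[List[str], List[str]], float],
    n_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resampled test sets on which system A scores strictly higher than B."""
    if not (len(hyps_a) == len(hyps_b) == len(refs)) or not refs:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        # Resample sentence indices with replacement; both systems share the sample.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score_a = metric([hyps_a[i] for i in idx], sample_refs)
        score_b = metric([hyps_b[i] for i in idx], sample_refs)
        wins += score_a > score_b
    return wins / n_samples


def exact_match(hyps: List[str], refs: List[str]) -> float:
    # Toy corpus-level metric used only to keep the sketch self-contained.
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)


toy_refs = ["a", "b", "c", "d"]
clear_win = paired_bootstrap(toy_refs, ["x"] * 4, toy_refs, exact_match, n_samples=200)
no_difference = paired_bootstrap(toy_refs, toy_refs, toy_refs, exact_match, n_samples=200)
```

A win fraction close to 1 suggests that A is reliably better than B, whereas a fraction near 0.5 indicates that the two systems cannot be separated on this test set.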

model              BLEU
baseline (1.5m)    29.10
baseline (6m)      32.40
concat             31.56
CADec              32.38
Table 6: BLEU scores. CADec trained with $p = 0.5$. Scores for CADec are not statistically different from the baseline (6m).

Figure 5: BLEU and lexical cohesion accuracy on the development sets during CADec training.

We observe that our model is no worse in BLEU than the baseline despite the second-pass model being trained only on a fraction of the data. In contrast, the concatenation baseline, trained on a mixture of data with and without context, is about 1 BLEU below the context-agnostic baseline and our model when using all 3 context sentences.

6.2 Consistency results

                      latest relevant context
            total     1st      2nd      3rd
deixis
baseline    50.0      50.0     50.0     50.0
concat      83.5      88.8     85.6     76.4
CADec       81.6      84.6     84.4     75.9
lexical cohesion
baseline    45.9      46.1     45.9     45.4
concat      47.5      48.6     46.7     46.7
CADec       58.1      63.2     52.0     56.7
Table 7: Accuracy for deixis and lexical cohesion.

            ellipsis (infl.)    ellipsis (VP)
baseline    53.0                28.4
concat      76.2                76.6
CADec       72.2                80.0
Table 8: Accuracy on ellipsis test set.
p         BLEU     deixis    lex. c.    ellipsis
0         32.34    84.1      48.7       65 / 75
0.25      32.31    83.3      52.4       67 / 78
0.5       32.38    81.6      58.1       72 / 80
0.75      32.45    80.0      65.0       70 / 80
Table 9: Results for different probabilities $p$ of using corrupted reference at training time. BLEU for 3 context sentences. For ellipsis, we show inflection/VP scores.

Scores on the deixis, cohesion and ellipsis test sets are provided in Tables 7 and 8. For all tasks, we observe a large improvement from using context. For deixis, the concatenation model (concat) and CADec improve over the baseline by 33.5 and 31.6 percentage points, respectively. On the lexical cohesion test set, CADec shows a large improvement over the context-agnostic baseline (12.2 percentage points), while concat performs similarly to the baseline. For ellipsis, both models improve substantially over the baseline (by 19-51 percentage points), with concat stronger for inflection tasks and CADec stronger for VP-ellipsis.

The proposed test sets let us distinguish models which are otherwise identical in terms of BLEU: the performance of the baseline and CADec is the same when measured with BLEU, but very different in terms of handling contextual phenomena.

6.3 Context-aware stopping criteria

Figure 5 shows that for context-aware models, BLEU is not sufficient as a criterion for stopping: even when a model has converged in terms of BLEU, it continues to improve in terms of consistency. For CADec trained with $p = 0.5$, the BLEU score has stabilized after 40k batches, but the lexical cohesion score continues to grow.

6.4 Ablation: using corrupted reference

Results for different values of $p$ are given in Table 9. All models have about the same BLEU, not statistically significantly different from the baseline, but they are quite different in terms of incorporating context. The denoising positively influences almost all tasks except for deixis, yielding the largest improvement on lexical cohesion.

7 Additional Related Work

In concurrent work, Xiong et al. (2018) also propose a two-pass context-aware translation model inspired by deliberation networks. However, while they consider a symmetric data scenario where all available training data has document-level context, and train all components jointly on this data, we focus on an asymmetric scenario where we have a large amount of sentence-level data, used to train our first-pass model, and a smaller amount of document-level data, used to train our second-pass decoder, keeping the first-pass model fixed.

Automatic evaluation of the discourse phenomena we consider is challenging. For lexical cohesion, Wong and Kit (2012) count the ratio between the number of repeated and lexically similar content words over the total number of content words in a target document. However, Guillou (2013) and Carpuat and Simard (2012) find that translations generated by a machine translation system tend to be similarly or more lexically consistent, as measured by a similar metric, than human ones. This holds even for sentence-level systems, where the increased consistency is not due to improved cohesion but is accidental: Ott et al. (2018) show that beam search introduces a bias towards frequent words, which could be one factor explaining this finding. A higher repetition rate therefore does not mean that a translation system is in fact more cohesive, and we find that even our baseline is more repetitive than the human reference.
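To make the repetition-based measure concrete, here is a deliberately simplified sketch. It counts only exact repetitions and uses a stopword list as a stand-in for content-word detection, whereas the metric discussed above also counts lexically similar words; the function name and the toy data are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Set


def repetition_ratio(document: Iterable[List[str]], stopwords: Set[str]) -> float:
    """Share of content-word tokens whose word form occurs more than once in the document."""
    content = [
        word.lower()
        for sentence in document
        for word in sentence
        if word.lower() not in stopwords
    ]
    if not content:
        return 0.0
    counts = Counter(content)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(content)


toy_document = [["the", "cat", "sat"], ["the", "cat", "ran"]]
toy_ratio = repetition_ratio(toy_document, stopwords={"the"})
```

In the toy document, "cat" accounts for two of the four content tokens, giving a ratio of 0.5. A system biased towards frequent words would raise this number without being any more cohesive, which is the caveat raised above.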

8 Conclusions

We analyze which phenomena cause otherwise good context-agnostic translations to be inconsistent when placed in the context of each other. Our human study on an En-Ru dataset identifies deixis, ellipsis and lexical cohesion as three main sources of inconsistency. We create test sets focusing specifically on the identified phenomena.

We consider a novel and realistic set-up where a much larger amount of sentence-level data is available compared to that aligned at the document level and introduce a model suitable for this scenario. We show that our model effectively handles contextual phenomena without sacrificing general quality as measured with BLEU despite using only a small amount of document-level data, while a naive approach to combining sentence-level and document-level data leads to a drop in performance. We show that the proposed test sets allow us to distinguish models (even though identical in BLEU) in terms of their consistency. To build context-aware machine translation systems, such targeted test sets should prove useful for validation, early stopping and model selection.


Acknowledgments

We would like to thank the anonymous reviewers for their comments and Ekaterina Enikeeva for the help with initial phenomena classification. The authors also thank the Yandex Machine Translation team for helpful discussions and inspiration. Ivan Titov acknowledges support of the European Research Council (ERC StG BroadSem 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518). Rico Sennrich acknowledges support from the Swiss National Science Foundation (105212_169888), the European Union’s Horizon 2020 research and innovation programme (grant agreement no 825460), and the Royal Society (NAF\R1\180122).


Appendix A Protocols for test sets

In this section we describe the process of constructing the test suites.

A.1 Deixis

The English second person pronoun “you” may have three different interpretations that matter when translating into Russian: the second person singular informal (T form), the second person singular formal (V form) and the second person plural (there is no T-V distinction for the plural form of second person pronouns).

Morphological forms for the second person singular V form and the second person plural pronoun are the same. Therefore, to automatically identify examples in the second person polite form, we look for morphological forms corresponding to second person plural pronouns.

To derive morphological tags for Russian, we use the publicly available pymorphy2 Korobov (2015).

Below, all the steps performed to obtain the test suite are described in detail.

A.1.1 Automatic identification of politeness

For each sentence we try to automatically find indications of the use of the T or V form. The presence of the following words and morphological forms is taken as an indication of T/V usage:

  1. second person singular or plural pronoun,

  2. verb in a form corresponding to second person singular/plural pronoun,

  3. verbs in imperative form,

  4. possessive forms of second person pronouns.

For items 1-3 we used morphological tags predicted by pymorphy2; for item 4 we used hand-crafted lists of forms of second person pronouns, because pymorphy2 fails to identify them.
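
The four indicators above can be sketched as a small rule-based check. This is only an illustration under stated assumptions, not the authors' code: each token is assumed to be already mapped to a set of OpenCorpora-style grammemes of the kind pymorphy2 produces (`NPRO`, `VERB`, `2per`, `sing`, `plur`, `impr`). The `TOY_TAGS` table is a hypothetical stand-in for the analyzer output, and `POSSESSIVE_T` / `POSSESSIVE_V` are hypothetical, incomplete stand-ins for the hand-crafted lists.

```python
# Hypothetical stand-in for morphological analysis: token -> set of grammemes.
TOY_TAGS = {
    "ты": {"NPRO", "2per", "sing"},
    "вы": {"NPRO", "2per", "plur"},
    "знаешь": {"VERB", "2per", "sing"},
    "знаете": {"VERB", "2per", "plur"},
    "иди": {"VERB", "impr", "sing"},
    "идите": {"VERB", "impr", "plur"},
}

# Hypothetical hand-crafted possessive lists (item 4); only a few forms shown.
POSSESSIVE_T = {"твой", "твоя", "твоё", "твои", "твоего", "твоей"}
POSSESSIVE_V = {"ваш", "ваша", "ваше", "ваши", "вашего", "вашей"}


def politeness_indicators(tokens, tags=TOY_TAGS):
    """Return the set of forms ('T', 'V_or_plural') signalled by the tokens."""
    found = set()
    for token in tokens:
        token = token.lower()
        # Item 4: possessive forms come from the hand-crafted lists.
        if token in POSSESSIVE_T:
            found.add("T")
        elif token in POSSESSIVE_V:
            found.add("V_or_plural")
        # Items 1-3: pronouns, finite verbs and imperatives come from the tags.
        grammemes = tags.get(token, set())
        is_pronoun = "NPRO" in grammemes and "2per" in grammemes
        is_verb = "VERB" in grammemes and bool({"2per", "impr"} & grammemes)
        if is_pronoun or is_verb:
            if "sing" in grammemes:
                found.add("T")
            elif "plur" in grammemes:
                found.add("V_or_plural")
    return found
```

The label `V_or_plural` is deliberate: plural morphology alone cannot separate the V form from a genuine plural, which is why a manual filtering step is still needed.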

A.1.2 Human postprocessing of identification of politeness

After the examples containing indications of T/V usage are extracted automatically, we manually filter out examples where

  1. second person plural form corresponds to plural pronoun, not V form,

  2. there is a clear indication of politeness.

The first rule is needed as morphological forms for second person plural and second person singular V form pronouns and related verbs are the same, and there is no simple and reliable way to distinguish these two automatically.

The second rule excludes cases where only one level of politeness is appropriate given the relation between the speaker and the listener. For the polite form, such markers include “Mr.”, “Mrs.”, “officer”, “your honour” and “sir”. For the impolite form, these include terms denoting family relationship (“mom”, “dad”), terms of endearment (“honey”, “sweetie”) and words like “dude” and “pal”.

A.1.3 Automatic change of politeness

To construct contrastive examples that test the ability of a system to produce translations with a consistent level of politeness, we have to produce an alternative translation by switching the formality of the reference translation. First, we do it automatically:

  1. change the grammatical number of second person pronouns, verbs, imperative verbs,

  2. change the grammatical number of possessive pronouns.

For the first transformation we use pymorphy2; for the second we use manual lists of possessive second person pronouns, because pymorphy2 cannot change them automatically.

A.1.4 Human postprocessing of automatic change of politeness

We manually correct the translations from the previous step. The automatic change of politeness makes mistakes because of:

  1. ambiguity arising when imperative and indicative verb forms are the same,

  2. inability of pymorphy2 to inflect some verb forms to the singular (e.g., past tense verbs),

  3. presence of related adjectives, which have to agree with the pronoun,

  4. ambiguity arising when a plural form of a pronoun may have different singular forms.

A.1.5 Human annotation: are both polite and impolite versions appropriate?

After the four previous steps, we have text fragments of several consecutive sentences with a consistent level of politeness. Each fragment uses second person singular pronouns, either T form or V form, without nominal markers indicating which of the forms is the only one appropriate. For each fragment we have both the original version and the version with the switched formality.

To control for appropriateness of both levels of politeness in the context of a whole text fragment we conduct a human annotation. Namely, humans are given both versions of the same text fragment corresponding to different levels of politeness, and asked if these versions are natural. The answers they can pick are the following:

  1. both appropriate,

  2. polite version is not appropriate,

  3. impolite version is not appropriate,

  4. both versions are bad.

The annotators are not given any specific guidelines, and are asked to answer according to their intuition as native speakers of the language (Russian).

Only a small fraction of examples (4%) have one version that is not appropriate, or not as natural as the other. Cases where annotators judged both versions to be bad come from mistakes in the target translations: OpenSubtitles data is not perfect, and some target sides are not reasonable sentences in Russian. These account for 1.5% of all examples.

A.2 Lexical cohesion

The process of creating the lexical cohesion test set consists of several stages:

  1. find passages where named entities are translated consistently,

  2. identify alternative translations for these named entities from the lexical table of Moses Koehn et al. (2007) induced from the training data,

  3. construct alternative translations of each example by switching the translation of instances of the named entity,

  4. for each example construct several test instances.

A.2.1 Identification of examples with consistent translations

We look for infrequent words that are translated consistently in a text fragment. Since the target language has rich morphology, to verify that translations are the same we have to use lemmas of the translations. More precisely, we

  1. train Berkeley aligner on about 6.5m sentence pairs from both training and held-out data,

  2. find lemmas of all words in the reference translations in the held-out data using pymorphy2,

  3. find words in the source which are not in the 5000 most frequent words in our vocabulary whose translations have the same lemma.
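
The last step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the word alignments and lemmatization have already been applied, so that a text fragment is given as a list of hypothetical `(source_word, target_lemma)` pairs. The requirement that a word occurs at least twice in the fragment (`min_occurrences`) is also an assumption made for the example.

```python
from collections import defaultdict


def consistently_translated(aligned_pairs, frequent_words, min_occurrences=2):
    """Map each infrequent source word to its lemma if it is always translated alike.

    aligned_pairs: (source_word, target_lemma) pairs from one text fragment.
    frequent_words: lowercased source words to skip (e.g. the 5000 most frequent).
    """
    lemmas_by_word = defaultdict(list)
    for source_word, target_lemma in aligned_pairs:
        key = source_word.lower()
        if key not in frequent_words:
            lemmas_by_word[key].append(target_lemma)
    return {
        word: lemmas[0]
        for word, lemmas in lemmas_by_word.items()
        # Keep repeated words whose aligned translations share a single lemma.
        if len(lemmas) >= min_occurrences and len(set(lemmas)) == 1
    }
```

Comparing lemmas rather than surface forms is what lets differently inflected occurrences of the same Russian word count as one consistent translation.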

A.2.2 Finding alternative translations

For the words under consideration, we find alternative translations which would be (i) equally appropriate in the context of the remaining sentence and text fragment and (ii) possible for the model to produce. To address the first point, we focus on named entities, and we assume that all translations of a given named entity seen in the training data are appropriate. To address the second point, we choose alternative translations from the reference translations encountered in the training data, and pick only those with a probability of at least 10%.

The sequence of actions is as follows:

  1. train Moses on the training data (6m sentence pairs),

  2. for each word under consideration (from A.2.1), get possible translations from the lexical table of Moses,

  3. group possible translations by their lemma using pymorphy2,

  4. if a lemma has a probability of at least 10%, we consider it a possible translation for the word under consideration,

  5. leave only examples with the word under consideration having several alternative translations.

After that, more than 90% of examples are translations of named entities (incl. names of geographical objects). We manually filter the examples with named entities.
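
Steps 2-5 of the list above can be sketched as a small filter over lexical table entries. This is an illustration under assumptions rather than the authors' implementation: the entries for one source word are given as hypothetical `(target_form, probability)` pairs, `lemmatize` is any callable that returns a lemma (pymorphy2 in the paper, a toy dictionary here), and the probability of a lemma is assumed to be the sum over its surface forms, which the text does not state explicitly.

```python
from collections import defaultdict


def alternative_lemmas(lex_entries, lemmatize, threshold=0.10):
    """Return {lemma: probability} for lemmas that pass the threshold.

    An empty dict is returned when fewer than two lemmas survive, because the
    word then has no alternative translation to contrast with.
    """
    prob_by_lemma = defaultdict(float)
    for target_form, probability in lex_entries:
        # Assumption: probability mass of inflected forms is pooled per lemma.
        prob_by_lemma[lemmatize(target_form)] += probability
    kept = {lemma: p for lemma, p in prob_by_lemma.items() if p >= threshold}
    return kept if len(kept) >= 2 else {}


# Toy lemmatizer standing in for a morphological analyzer.
TOY_LEMMAS = {"купер": "купер", "купера": "купер",
              "каупер": "каупер", "каупера": "каупер", "бочар": "бочар"}
entries = [("купер", 0.5), ("купера", 0.2), ("каупер", 0.15),
           ("каупера", 0.1), ("бочар", 0.05)]
alternatives = alternative_lemmas(entries, TOY_LEMMAS.get)
```

In this toy example two lemmas survive the threshold and the rare third one is dropped.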

A.2.3 Constructing a test set

From the two previous steps, we have examples with named entities in context and source sentences and several alternative translations for each named entity. Then we

  1. construct alternative translations of each example by switching the translation of instances of the named entity; since the target language has rich morphology, we do it manually,

  2. for each example, construct several test instances. For each version of the translation of a named entity, we use this translation in the context, and vary the translation of the entity in the current sentence to create one consistent and one or more inconsistent (contrastive) translations.
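
The second step can be sketched as below. This is a minimal illustration with a hypothetical data layout: because the alternative translations are written manually, the sketch takes ready-made target texts, one context and one current sentence per alternative translation of the entity, keyed by an arbitrary alternative name.

```python
def build_instances(context_by_alt, current_by_alt):
    """Build one test instance per alternative translation used in the context.

    context_by_alt: alternative name -> target-side context using that alternative.
    current_by_alt: alternative name -> target-side current sentence using it.
    """
    instances = []
    for alt, context in context_by_alt.items():
        instances.append({
            "context": context,
            # Consistent: the same alternative as in the context.
            "correct": current_by_alt[alt],
            # Contrastive: every other alternative in the current sentence.
            "contrastive": [text for other, text in current_by_alt.items()
                            if other != alt],
        })
    return instances
```

With k alternatives for an entity, this yields k instances, each with one consistent translation and k - 1 contrastive ones.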

Appendix B Experimental setup

B.1 Data preprocessing

We use the publicly available OpenSubtitles2018 corpus Lison et al. (2018) for English and Russian. To reduce noise in the data, we keep only sentence pairs whose source and target subtitle frames have a sufficiently high relative time overlap. As context, we take the previous sentence if its timestamp differs from the current one by no more than 7 seconds. Each long group of consecutive sentences is split into fragments of 4 sentences, with the first 3 sentences treated as context. For CADec we also include the two shorter groups, which have less context, as training examples. We do not add these two groups with less context for the concatenation model, because in preliminary experiments, this performed worse both in terms of BLEU and consistency as measured on our test sets.
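
The grouping and splitting described above can be sketched as follows. This is an illustration under several assumptions, not the authors' preprocessing script: each sentence carries a single timestamp in seconds, 4-sentence fragments are taken with a sliding window of stride 1, and the shorter groups used for CADec are taken to be the first 2 and the first 3 sentences of a group. The function and parameter names are hypothetical.

```python
def group_by_time(sentences, max_gap=7.0):
    """Split (timestamp_seconds, text) items into groups of consecutive sentences.

    A new group starts whenever the gap to the previous sentence exceeds max_gap.
    """
    groups, current, prev_time = [], [], None
    for time, text in sentences:
        if prev_time is not None and time - prev_time > max_gap:
            groups.append(current)
            current = []
        current.append(text)
        prev_time = time
    if current:
        groups.append(current)
    return groups


def fragments(group, size=4, include_short=False):
    """Return fragments of `size` sentences; the last sentence is the one to translate."""
    result = []
    if include_short:
        # Assumed form of the groups with less context: prefixes of length 2 and 3.
        result += [group[:k] for k in range(2, min(size, len(group) + 1))]
    # Assumed sliding window with stride 1 over the group.
    result += [group[i:i + size] for i in range(len(group) - size + 1)]
    return result
```

Under these assumptions, the concatenation model would be trained on `fragments(group)` and CADec on `fragments(group, include_short=True)`.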

We use the tokenization provided by the corpus and use multi-bleu.perl on lowercased data to compute the BLEU score. We use beam search with a beam of 4 for both the base model and CADec.

Sentences were encoded using byte-pair encoding Sennrich et al. (2016), with source and target vocabularies of about 32000 tokens. Translation pairs were batched together by approximate sequence length. For the Transformer models (baselines and concatenation) each training batch contained a set of translation pairs containing approximately 16000 source tokens (this can be reached by using several GPUs or by accumulating the gradients for several batches and then making an update). It has been shown that the Transformer’s performance depends heavily on the batch size Popel and Bojar (2018), and we chose a large batch size to ensure that models show their best performance. For CADec, we use a batch size that contains approximately the same number of translation instances as the baseline models.
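
Batching by approximate sequence length under a source-token budget can be sketched as below. This is a simplified illustration, not the batching code used in the experiments: it sorts all pairs by source length and greedily fills batches, and it counts actual tokens rather than padded positions. The function name and the `max_src_tokens` parameter are hypothetical.

```python
def batch_by_tokens(pairs, max_src_tokens=16000):
    """Group (src_tokens, tgt_tokens) pairs into batches of similar source length.

    Each batch holds at most max_src_tokens source tokens, unless a single
    pair alone exceeds the budget, in which case it forms its own batch.
    """
    batches, current, count = [], [], 0
    for src, tgt in sorted(pairs, key=lambda pair: len(pair[0])):
        if current and count + len(src) > max_src_tokens:
            batches.append(current)
            current, count = [], 0
        current.append((src, tgt))
        count += len(src)
    if current:
        batches.append(current)
    return batches
```

When a batch of this size does not fit on one device, the same effective batch can be obtained by splitting it into smaller chunks and accumulating gradients before each update.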

B.2 Model parameters

We follow the setup of the Transformer base model Vaswani et al. (2017). More precisely, the number of layers in the base encoder, base decoder and CADec is $N = 6$. We employ $h = 8$ parallel attention layers, or heads. The dimensionality of input and output is $d_{model} = 512$, and the inner layer of the feed-forward networks has dimensionality $d_{ff} = 2048$.

We use regularization as described in Vaswani et al. (2017).

B.3 Optimizer

The optimizer we use is the same as in Vaswani et al. (2017). We use the Adam optimizer Kingma and Ba (2015) with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. We vary the learning rate over the course of training, according to the formula:

We use , for the models trained on 6m data (baseline (6m) and concatenation) and for the models trained on 1.5m data (baseline (1.5m) and CADec).