Context in Neural Machine Translation: A Review of Models and Evaluations

01/25/2019 ∙ by Andrei Popescu-Belis, et al. ∙ HEIG-VD

This review paper discusses how context has been used in neural machine translation (NMT) in the past two years (2017-2018). Starting with a brief retrospect on the rapid evolution of NMT models, the paper then reviews studies that evaluate NMT output from various perspectives, with emphasis on those analyzing limitations of the translation of contextual phenomena. In a subsequent version, the paper will present the main methods that were proposed to leverage context for improving translation quality, and will distinguish methods that aim to improve the translation of specific phenomena from those that consider a wider unstructured context.




1 Looking Back on the Past Two Years

Neural network architectures have become mainstream for machine translation (MT) in the past three years (2016–2018). This paradigm shift took considerably less time than the previous one, from rule-based to phrase-based statistical MT models. Neural machine translation (NMT) was adopted thanks to its superior performance, despite its higher computational cost (which has been mitigated by optimized hardware and software) and its need for very large training datasets (which has been addressed through back-translation of monolingual data and character-level translation as back-off). The NMT revolution is apparent in the surge in the number of related scientific publications since 2017, as well as in the increased attention MT receives from the general media, often related to visible improvements in the quality of online MT systems.

While much remains to be done, especially for low-resource language pairs or for specific domains, the quality of the most favorable cases such as English-French or German-English news translation has reached unprecedented levels, leading to claims that it achieves human parity. A remaining bottleneck, however, is the capacity to leverage contextual features when translating entire texts, especially when this is vital for correct translation. Taking textual context into account means modeling long-range dependencies between words, phrases, or sentences, which are typically studied by linguistics under the topics of discourse and pragmatics. (In this review, ‘context’ refers to the sentences of a document being translated, and not to extra-textual context such as associated images. Multimodal MT is an active research problem, but is outside our present scope.) When it comes to translation, the capacity to model context may improve certain translation decisions, e.g. by favoring a better lexical choice thanks to document-level topical information, or by constraining pronoun choice thanks to knowledge about antecedents.

This review paper puts into perspective the significant number of studies devoted in 2017 and 2018 to improving the use of context in NMT and to measuring these improvements. We start with a brief recap of the mainstream neural models and toolkits that have revolutionized MT (Section 2). We then organize our perspective based on the observation that most MT studies design and implement models, run them on data, and apply evaluation metrics to obtain scores, i.e. Models + Data + Metrics = Results. Novelty claims are generally made about one or more of the left-hand side terms, claiming improved results in comparison to previous ones.

Existing MT models can be tested on new metrics and/or datasets, to highlight previously unobserved properties of these models. Therefore, in Section 3, we review evaluation studies of NMT, which either apply existing metrics (going beyond n-gram matching) or devise new ones. We discuss these studies by increasing order of complexity of the evaluated aspects: first grammatical ones, and then semantic and discourse-level ones, including word sense disambiguation (WSD) and pronoun translation.

Most often however, new models are tested on existing data and metrics, to enable controlled comparisons with competing models. In an upcoming version of this paper, we will discuss new NMT models that extend the context span considered during translation. We will distinguish those that use unstructured text spans from those that perform structured analyses requiring context, in particular lexical disambiguation and anaphora resolution.

2 Neural MT Models and Toolkits

2.1 Mainstream Models

Early attempts to use neural networks in MT aimed to replace n-gram language models with neural network ones (Bengio et al., 2003; Schwenk et al., 2006). Later, feed-forward neural networks were used to enhance phrase-based systems by rescoring the translation probability of phrases (Devlin et al., 2014). Variable-length input was accommodated by using recurrent neural networks (RNNs), which offered a principled way to represent sequences thanks to hidden states. One of the first “continuous” models, i.e. not using explicit memories of aligned phrases, was proposed by Kalchbrenner and Blunsom (2013), with RNNs for the target language model, and a convolutional source sentence model (or an n-gram one). To address the vanishing gradient problem with RNNs, long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997) were used in sequence-to-sequence models (Sutskever et al., 2014), and further simplified as gated recurrent units (GRU) (Cho et al., 2014; Chung et al., 2014). Such units allowed the networks to capture longer-term dependencies between words thanks to specialized gates enabling them to remember vs. forget past inputs.

Such sequence-to-sequence models were applied to MT with an encoder and a decoder RNN (Cho et al., 2014), but had serious difficulties in representing long sentences as a single vector (Pouget-Abadie et al., 2014), although using bi-directional RNNs and concatenating their representations for each word could partly address this limitation. The key innovation, however, was the attention mechanism introduced by Bahdanau et al. (2015), which allows the decoder to select at each step which part of the source sentence is most useful to consider for predicting the next word. (This paper was first posted on arXiv in September 2014, while the one by Cho et al. (2014) was posted in June of the same year.) Attention is a context vector (a weighted sum over all hidden states of the encoder) that can be seen as modeling the alignment between input and output positions. The efficiency of the model was further improved, with small effects on translation quality (Luong et al., 2015; Wiseman and Rush, 2016). The proposal for distinguishing local vs. global attention models by Luong et al. (2015) has yet to be incorporated in mainstream models.
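The description of attention as a weighted sum over encoder hidden states can be illustrated with a minimal sketch. The names `softmax` and `attention_context` are ours, and the dot-product scoring is an assumption made for brevity: Bahdanau et al. (2015) compute the scores with a small feed-forward network instead.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    Scoring is a plain dot product (an assumption of this sketch).
    """
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: weighted sum of the encoder hidden states.
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Three toy encoder states (one per source word) and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_context([2.0, 0.0], encoder_states)
```

The weights play the role of a soft alignment: here the first and third source positions receive equal and larger weights than the second, because they match the decoder state better.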

The demonstration that NMT with attention-based encoder-decoder RNNs outperformed phrase-based SMT came at the 2016 news translation task of the WMT evaluations (Bojar et al., 2016). The system presented by the University of Edinburgh (Sennrich et al., 2016c) obtained the highest ranking thanks particularly to two additional improvements of the generic model. The first one was to use back-translation of monolingual target data from a state-of-the-art phrase-based SMT engine to increase the amount of parallel data available for training (Sennrich et al., 2016a). The second one was to use byte-pair encoding, allowing translation of character n-grams and thus overcoming the limited vocabulary of the encoder and decoder embeddings (Sennrich et al., 2016b). Low-level linguistic labels were shown to bring small additional benefits to translation quality (Sennrich and Haddow, 2016). The Edinburgh system was soon afterward open-sourced under the name of Nematus (Junczys-Dowmunt et al., 2016).
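To make the byte-pair encoding idea more concrete, here is a minimal sketch of how merge operations can be learned from word frequencies, in the spirit of Sennrich et al. (2016b). The function names and the toy word counts are illustrative assumptions, not the code of the Edinburgh system.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word so that the chosen pair becomes one symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to num_merges merge operations from a word-frequency dict."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the first pair counted
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Hypothetical toy corpus statistics.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_counts, 3)
```

On this toy input the three merges build the frequent suffix "est" (with its end-of-word marker) into a single symbol, so that "newest" is segmented as n, e, w, est. Rare or unseen words can then still be represented as sequences of known subword units.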

Research and commercial MT systems alike were quick to adopt NMT, starting with the best-resourced language pairs, such as English vs. other European languages and Chinese. Around the end of 2016, online MT offered by Bing, DeepL, Google or Systran was powered by deeper and deeper RNNs (as far as information is available). In the case of DeepL, although little information about the systems is published, its visible quality (see the beginning of Section 3.2 for an estimate) could be partly explained by the use of the high-quality Linguee parallel data.

An interesting development has been the claims for “bridging the gap between human and machine translation” from the Google NMT team in September 2016 on EN/FR and EN/DE (Wu et al., 2016), and for “achieving human parity on news translation” from the Microsoft NMT team in March 2018 on EN/ZH (Hassan et al., 2018). These claims have attracted attention from the media, but have also been disputed by deeper evaluations (see Section 3.2).

RNNs with attention allow top performance to be reached, but at the price of a large computational cost. For instance, the largest Google NMT system from 2016 (Wu et al., 2016), with its 8 encoder and decoder layers of 1,024 LSTM nodes each, required training on 96 NVIDIA K80 GPUs for 6 days, in spite of massive parallelization (e.g. running each layer on a separate GPU). A more promising approach to decrease computational complexity is the use of convolutional neural networks for sequence-to-sequence modeling, as proposed by Gehring et al. (2017) in the ConvS2S model from Facebook AI Research. This model outperformed Wu et al.’s system on WMT 2014 EN/DE and EN/FR translation “at an order of magnitude faster speed, both on GPU and CPU”. Posted in May 2017, the model was outperformed the next month by the Transformer (Vaswani et al., 2017).

The Transformer NMT model (Vaswani et al., 2017) removes sequential dependencies (recurrence) in the encoder and decoder networks, as well as the need for convolutions, and relies instead on self-attention networks combined with positional encodings. For instance, the encoder is composed of six pairs of 512-dimensional layers; in each pair, the first layer implements multi-head self-attention, while the second is a fully-connected feed-forward layer. In the decoder, an additional layer in each pair implements multi-head attention over the encoder’s output. As a result, training on GPUs can be fully parallelized, thus substantially reducing training time, and the model slightly outperforms RNN models.
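Because self-attention by itself does not see word order, the Transformer adds a positional encoding to each input embedding. The sketch below computes the sinusoidal variant described by Vaswani et al. (2017); the function name and the small sizes used in the call are our own choices for illustration.

```python
import math

def positional_encoding(num_positions, d_model=512):
    """Sinusoidal encodings: one d_model-dimensional vector per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same wavelength.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# A tiny table: 4 positions, 8 dimensions (the real model uses 512).
pe = positional_encoding(num_positions=4, d_model=8)
```

Each row is added to the embedding of the word at that position, so that identical words at different positions receive different input vectors.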

For these reasons, the Transformer was quickly adopted by the research community: it was used by virtually all systems for the WMT 2018 news task (Bojar et al., 2018). The model is now implemented in most NMT toolkits (see next section). While the Transformer remains the state of the art at the time of writing, several of its authors have shown that RNN architectures could be improved beyond the Transformer using some of its insights, and that hybrid architectures based on RNN, CNN and the Transformer pushed the scores on WMT’14 EN/DE and EN/FR datasets even higher (Chen et al., 2018). (Furthermore, inspiration from the Transformer can even be found in the BERT language modeling technique (Bidirectional Encoder Representations from Transformers), also from Google, which reached new state-of-the-art results on 11 NLP benchmarks, including question answering or inference (Devlin et al., 2018).) A deeper attention model for MT was presented at the end of 2018, filtering attention from lower to higher levels over five layers (Zhang et al., 2018), with encouraging results.

The findings of the WMT 2018 news translation task (Bojar et al., 2018) confirmed the merits of the Transformer, though certain improvements in the architecture allowed a late-coming submission from Facebook (Ott et al., 2018) to be ranked first on EN/DE. This system trained the Transformer using reduced numeric precision, thus accelerating “training by nearly 5x on a single 8-GPU machine”. Results were further improved by the team using advances in back-translation to generate synthetic source sentences (Edunov et al., 2018), with training sets reaching hundreds of millions of sentences; the system also achieved state-of-the-art performance on WMT ’14 EN/DE test sets.

2.2 NMT Toolkits

The above NMT models are often available as open-source implementations in MT toolkits, which are built upon general-purpose machine learning frameworks supporting neural networks. Machine learning frameworks are evolving at a rapid pace and so do NMT implementations.


Among the most recent changes in the ML ecosystem, one can cite the merger of Caffe2 into PyTorch, the growth of PyTorch itself in comparison to the earlier LuaTorch, the end of Theano development by the University of Montreal, and the integration of Keras into the core of TensorFlow.

Most NMT toolkits are now built over TensorFlow and Torch (Lua or Python), though others are built from scratch or over other frameworks.

The NMT toolkits most frequently used in research studies, including submissions to shared tasks, are the following ones (in alphabetical order).

3 How Good is NMT? Fine-grained Evaluation Studies

Most proponents of novel NMT models evaluate them using the BLEU metric (Papineni et al., 2002) on parallel data from the WMT conferences. While this is a fairly common and accepted procedure, the significance of small increases in BLEU (‘significance’ meaning here ‘importance’ or ‘relevance’, and not statistical significance, which is often duly tested; Koehn, 2004) is not entirely clear, especially in terms of perceived quality, given the multiplicity of quality aspects that are actually relevant to users (Hovy et al., 2002). Human rating of quality, e.g. through direct assessment (Graham et al., 2017), is generally carried out only yearly, at dedicated evaluation campaigns such as WMT or IWSLT, often without delving into specific quality attributes any further.
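As a reminder of what BLEU actually measures, the following sketch computes a sentence-level, single-reference, unsmoothed variant: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. This is a simplification for illustration, since BLEU as defined by Papineni et al. (2002) is computed over a whole corpus; the function names and example sentences are ours.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed single-reference BLEU on whitespace tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes BLEU
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: hypotheses shorter than the reference are penalized.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
perfect = sentence_bleu("the cat sat on the mat", reference)
one_word_off = sentence_bleu("the cat sat on a mat", reference)
degenerate = sentence_bleu("the the the", reference)
```

A single substituted word already removes several matching bigrams, trigrams and 4-grams, which is one reason why small BLEU differences are hard to relate to perceived quality.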

Therefore, a rich set of evaluation studies have attempted to shed light on the various improvements brought by NMT, often compared to SMT. These studies applied existing metrics, or devised new ones, using new or existing data sets, to assess fine-grained quality aspects of NMT output from various systems.

The studies presented in this section evaluate existing NMT systems, without proposing new techniques to address the observed limitations (such proposals are discussed in Section 4 below). We organize this section according to the linguistic complexity of the quality aspects (or attributes) of NMT output, from words to texts. After a preliminary discussion of two studies using BLEU in various conditions, we consider evaluations of morphology, the lexicon, verb phrases, and word order (Section 3.2). We then turn to evaluations of contextual factors, from semantic properties including word sense disambiguation and lexical choice (3.3.1), to discourse-related or document-level quality aspects, in particular the translation of pronouns (3.3.2). (Studies of other system qualities such as efficiency, adaptability, or usability have been comparatively less frequent and are not included here. Properties such as the ability to handle multilingual, as opposed to bilingual, models and to perform zero-shot translation have been examined, e.g. by Lakew et al. (2018).)

3.1 Studies Using BLEU

Claims about performance can be thoroughly analyzed even when BLEU is used as a metric. Recently, Toral et al. (2018) re-examined the claim for human parity on EN/ZH translation from the Microsoft NMT team (Hassan et al., 2018) by inspecting the test sets more closely. One finding is that a significant portion of the data was originally written in English, so the system’s Chinese source was “translationese”, i.e. influenced by the target language. (This is additional empirical evidence for the need for properly constructed directional corpora, e.g. extracted from Europarl with additional speaker information; Cartoni and Meyer, 2012.) If evaluation is restricted to source sentences originally written in Chinese, then human parity is not reached. Moreover, many reference translations have fluency problems, or contain grammatical errors or mistranslated nouns. The authors also confirm the finding of Läubli et al. (2018) that professional human judges, who also have higher inter-rater agreement, still find a gap between human translations and NMT.

Koehn and Knowles (2017) analyze evaluation results of NMT and PBSMT with the BLEU metric and make several observations: the quality of NMT decreases quickly out of the training domain, and with long sentences; the attention model is not necessarily a true alignment model, and the beam search leads to acceptable results for narrow beams only. The authors infer six challenges for NMT, but discourse-level metrics point to additional challenges, in particular related to the use of context (see Section 3.3).

3.2 Grammatical and Lexical Qualities of NMT

Shortly after the NMT approach became state-of-the-art, several finer-grained evaluations than those based on BLEU were applied to it. These studies differ widely in the granularity of error classifications, and in how error metrics are applied and on what data, as we now discuss.

3.2.1 Human and Automatic Ratings

One of the first detailed analyses of the output of NMT (encoder-decoder RNNs with attention) in comparison with SMT was presented by Bentivogli et al. (2016). Using high-quality post-edits by professional translators on system outputs from the IWSLT EN/DE 2015 task, errors were automatically detected and classified according to several categories: morphological errors (correct lemma but wrong form), lexical errors (wrong lemma), and word order errors. The latter type was further analyzed according to POS and dependency tags; but lexical errors were not further subcategorized (see Section 3.3.1 for such attempts).

The comparisons showed that NMT made about 20% fewer lexical or morphological mistakes than SMT, and up to 50% fewer word-order errors (especially on verb placement, which is essential in German), thus demonstrating better flexibility than SMT for language pairs with different word orders. However, NMT sometimes failed to translate all source words, such as negations, which is detrimental to adequacy and difficult to spot by the user. The authors also found that the Translation Error Rate (TER) of all systems increased similarly with sentence length, and NMT outperformed PBSMT, though by a smaller margin on longer sentences. An extended version of the study (Bentivogli et al., 2018) added an analysis of IWSLT EN/FR data which confirmed the above conclusions, and found that NMT had a better capacity to reorder nouns in EN/FR translation than PBSMT, but made more errors on proper nouns.
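The edit-based measures used in such post-editing studies can be approximated with a word-level edit distance between the MT output and its post-edited version. The sketch below is a simplification and not TER itself, since TER additionally allows block shifts; the function names and the example sentences are illustrative assumptions.

```python
def edit_distance(hyp_tokens, ref_tokens):
    """Word-level Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(ref_tokens) + 1))
    for i, hyp_word in enumerate(hyp_tokens, start=1):
        current = [i]
        for j, ref_word in enumerate(ref_tokens, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(previous[j] + 1,          # delete hyp_word
                               current[j - 1] + 1,       # insert ref_word
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]

def edit_rate(hypothesis, post_edit):
    """Edits needed to reach the post-edit, per post-edited word."""
    hyp, ref = hypothesis.split(), post_edit.split()
    return edit_distance(hyp, ref) / len(ref)

# An omitted negation costs a single edit, although it reverses the meaning.
rate = edit_rate("he did come yesterday", "he did not come yesterday")
```

The example also shows why omissions such as dropped negations are worrying: they weigh very little in edit-based scores while being highly detrimental to adequacy.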

Popović (2017) performed error analysis on 267 EN/DE and 204 DE/EN sentences from WMT 2016 News Test, and compared the output of an NMT system (Sennrich et al., 2016c) with one from PBSMT (based on Moses), both obtained from WMT submissions. A variety of grammatical aspects were evaluated, showing that morphology (particularly word forms), English noun collocations, word order, and fluency are better for NMT than PBSMT. Still, the tested RNN-based NMT system had difficulties with polysemous English source words, and with English continuous verb tenses (on the target side). In an extended version, Popović (2018) confirmed these conclusions, and added analyses of English-Serbian translation on 267 sentences.

In an early comparison of PBSMT to NMT, Castilho et al. (2017b) required professional translators to post-edit MT output, namely 100 English sentences from MOOCs, translated into German, Portuguese, Russian and Greek. Translators ranked outputs from the two systems, and counted the time and number of operations used for post-editing; fluency and adequacy were also rated. Specific error annotation was performed as well, dividing errors into several classes: inflectional morphology, word order, and omission / addition / mistranslation. NMT globally outperformed PBSMT on these metrics, except for omission and mistranslation. It did not, however, clearly outperform PBSMT on post-editing time, as NMT errors were more difficult to grasp, although fewer sentences needed correction. These findings were confirmed in a subsequent article (Castilho et al., 2017a) which added two use cases beyond MOOCs: EN/DE translation of product ads, and ZH/EN patent translation. NMT thus appeared superior in fluency, but superiority in adequacy or post-editing effort was not observed. The use of NMT as an assistance tool for professional translators thus appeared uncertain. (In the meantime, the switch of virtually all online MT offerings to NMT tends to indicate a consensus on the advantages of NMT for web translation.)

Given the multiplicity of translation directions and domains that can be tested, it is no surprise that other studies followed suit. Toral and Sánchez-Cartagena (2017) evaluated PBSMT and NMT submissions to WMT 2016 for 9 translation directions (EN to/from CS, DE, FI, RO, RU, except FI/EN) and confirmed that NMT is more fluent (measured with an edit distance) and has better inflected forms, but struggles with sentences longer than 40 words. In a larger journal article submitted in August 2017, Klubička et al. (2018) apply a multidimensional quality metric (MQM) and study the statistical significance of differences between MT systems, for English to Croatian, a morphologically rich language. MQM is applied by two human raters over 100 sentences, with outputs from 3 systems, with a large taxonomy of error types, such as word order, agreement, spelling, along with omission and mistranslation. The authors found that their best system (NMT Nematus) reduced the errors of their weakest one (PBSMT Moses) by about 50%, and was especially better for long-distance grammatical agreement.

Burchardt et al. (2017) created a large test suite of around 5000 EN/DE segments to evaluate MT output for 120 phenomena grouped in 15 categories (e.g. ‘ambiguity’, ‘function words’, or ‘long-distance dependency’). They used about 800 items for a comparison of 7 MT systems (rule-based, PBSMT, or neural), and reached somewhat surprising conclusions, likely because scores were micro-averaged across categories of test sentences with very different sizes (e.g. 529 out of 777 test items concerned verb tense / aspect / mood). Had macro-averaging been used, the likely winner would have been the Google NMT system (Wu et al., 2016), which performed best on most error categories.
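The effect of the averaging method can be shown with a small numerical sketch. Only the 529-out-of-777 category size comes from the text above; the two systems, the other category sizes and all the counts of correct items are invented for illustration.

```python
def micro_average(results):
    """Pool all items: large categories dominate the score."""
    correct = sum(c for c, _ in results.values())
    total = sum(n for _, n in results.values())
    return correct / total

def macro_average(results):
    """Average per-category accuracies: every category counts equally."""
    return sum(c / n for c, n in results.values()) / len(results)

# Hypothetical (correct, total) counts per category for two systems.
system_a = {
    "verb tense/aspect/mood": (500, 529),
    "ambiguity": (20, 124),
    "function words": (30, 124),
}
system_b = {
    "verb tense/aspect/mood": (380, 529),
    "ambiguity": (80, 124),
    "function words": (80, 124),
}

micro_a, micro_b = micro_average(system_a), micro_average(system_b)
macro_a, macro_b = macro_average(system_a), macro_average(system_b)
```

With these made-up counts, system A wins under micro-averaging because it excels on the one very large category, while system B wins under macro-averaging because it is better on most categories.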

Another error taxonomy was proposed by Esperanca-Rodier et al. (2017) and was applied to PBSMT and NMT output over the BTEC-corpus, to compare reference-based metrics with explicit error annotation and study the translators’ perception of the output. Again, NMT outperformed PBSMT, albeit slightly. Similarly, Brussel et al. (2018) compared the outputs of online systems on EN/NL translation. NMT was found to be particularly fluent, although omissions remained a problem, and made fewer WSD errors but more mistranslations, which may be harder to post-edit.

3.2.2 Contrastive Pairs and Challenge Sets

Evaluation methods based on contrastive pairs require access to the probability estimates of pairs of source and target sentences from the evaluated system. These probabilities are easy to obtain from NMT systems that are not used as black boxes, but impossible to get from online systems. Moreover, these methods do not guarantee that if a system correctly scores two candidate target sentences, then it can also find the correct translation using beam search when only the source is given.
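The scoring procedure behind contrastive evaluation can be sketched as follows. The `score` callable stands for the evaluated system's log-probability of a target given a source; here it is replaced by a lookup table of made-up values, and all names and sentences are hypothetical.

```python
def contrastive_accuracy(score, items):
    """Share of items where the reference outscores every contrastive variant.

    `score(source, target)` is a hypothetical callable returning the
    model's log-probability of `target` given `source`.
    """
    hits = 0
    for source, reference, variants in items:
        ref_score = score(source, reference)
        if all(ref_score > score(source, v) for v in variants):
            hits += 1
    return hits / len(items)

# Toy stand-in for an NMT scorer: made-up log-probabilities.
TOY_LOGPROBS = {
    ("she is not coming", "sie kommt nicht"): -2.1,
    ("she is not coming", "sie kommt"): -3.4,
    ("the dogs bark", "die Hunde bellen"): -1.8,
    ("the dogs bark", "die Hunde bellt"): -1.5,
}

def toy_score(source, target):
    return TOY_LOGPROBS[(source, target)]

items = [
    ("she is not coming", "sie kommt nicht", ["sie kommt"]),
    ("the dogs bark", "die Hunde bellen", ["die Hunde bellt"]),
]
accuracy = contrastive_accuracy(toy_score, items)
```

Note that nothing is decoded in this procedure: the system only ranks given candidates, which is why a good contrastive score does not guarantee that beam search would produce the correct translation.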

Sennrich (2017) designed LingEval97, a test set of 97k contrastive pairs, built from reference EN/DE translations from WMT. A reference translation can be modified in five different ways to generate an incorrect counterpart, using editing rules to automate the process for a large set. Incorrect sentences are generated by (1) changing the gender of a singular determiner; (2) changing the number of a verb; (3) changing a verb particle; (4) changing aspects of sentence polarity, e.g. inserting or deleting a negation particle; (5) swapping characters in unseen names. The main findings are that character-based NMT systems (RNN-based) are better than byte-pair encoding ones on type 5 errors, but worse on types 1, 2 and 3. As for polarity, while spurious insertions of negations are well detected by all studied systems, spurious deletions are less well detected, echoing the fact that negations are sometimes omitted in NMT output.

LingEval97 was recently reused in a comparison of RNN, CNN and Transformer architectures by Tang et al. (2018a), along with the ContraWSD set presented below for a semantic evaluation. Performance on LingEval97 appeared to be quite similar across architectures, with RNNs being particularly competitive for modeling long-distance agreement between subjects and verbs (in fact, detecting wrong agreements).

Isabelle et al. (2017) proposed a linguistically-motivated test suite, or more exactly a challenge set, as the sentences are not accompanied by a reference translation: instead, human judges are required to evaluate whether each challenge sentence was translated correctly or not. The application cost remains moderate due to the small number of sentences: 108 for EN/FR translation. The sentences are divided into three categories: morpho-syntactic (including agreement and subjunctive mood), lexico-syntactic (including double-object and manner-of-movement verbs), and syntactic (e.g. yes-no and tag questions, and placement of clitic pronouns). At the end of 2016, the challenge set was applied to PBSMT and NMT (Sennrich et al., 2016c; Wu et al., 2016). Later on, it was also applied to the online DeepL Translator, showing a 50% error reduction with respect to the best NMT system of the initial article.

3.3 Evaluation of Semantic and Discourse Phenomena in NMT Output

Categorizing MT errors as ‘semantic’ or ‘discourse’ is not always clear-cut, as it often involves a hypothesis on the cause of an error. For instance, is outputting a wrong pronoun a morpho-syntactic or a discourse error? If only its gender is wrong, then this may be attributed to ignorance of its antecedent, whereas if the case is wrong (subject vs. object), then the error can be considered as grammatical. In this section, we present NMT evaluation studies focusing on errors that can be attributed to insufficient knowledge or modeling of semantic and discourse properties, which often require considering a context made of multiple sentences. (This contrasts with the local view of context adopted e.g. by Knowles and Koehn (2018), where context actually means the left bigram context.)

We group studies into three categories: those dealing with lexical choice (including WSD and lexical coherence), those dealing with referential phenomena (anaphora and coreference), and finally those dealing with discourse structure, though no study among the latter deals with NMT.

3.3.1 Evaluation of Lexical Choice: WSD and non-WSD Errors


Word ambiguity is often cited as an obvious difficulty for translation. In reality, “ambiguity” is a complex notion, and we will focus in this section on content words (open class). Let us suppose ideally that a word $w$ may convey one or more language-independent senses $s_1, s_2, \ldots$ as listed for instance in WordNet, and that a given occurrence of $w$ conveys only one sense at a time. Let us now consider independently a word $w$ in the source language, and three words $t_1$ to $t_3$ in the target language, with the following senses: $w: \{s_1, s_2\}$, $t_1: \{s_1\}$, $t_2: \{s_2\}$, $t_3: \{s_2, s_3\}$.

All these words except $t_1$ and $t_2$ can be said to be ambiguous, as they may convey more than one sense. However, for translation, only the ambiguity of $w$ actually matters. If an occurrence of $w$ conveys sense $s_1$ but is translated with $t_2$ (which cannot convey this sense), this is called a word sense disambiguation (WSD) error, regardless of how $t_2$ was actually found, i.e. whether or not WSD was explicitly performed on the source side. If, however, the occurrence of $w$ conveys sense $s_2$, then both $t_2$ and $t_3$ can be used, in principle. Then, regardless of what a reference translation may contain, using one of $t_2$ or $t_3$ cannot be a WSD error. (Note that if a word may convey only one sense, there is no potential for a source-side WSD error. Note also that translating a word by a word that can convey all its senses does not oblige the system to perform source-side WSD.)

This representation does not account for the additional constraints that may distinguish between translations by $t_2$ and $t_3$ when $w$ conveys sense $s_2$, and which may lead to non-WSD lexical errors. For instance, it may happen that $s_2$ is an infrequent sense of $t_3$, or, if they are verbs, $t_2$ and $t_3$ may have different sub-categorization frames. If a previous occurrence of $w$ was translated by $t_2$, it may be the case that word repetition is necessary for cohesion, or for understanding a repeated reference, ruling out a subsequent translation by $t_3$. Other constraints may come from collocations (MWEs) or terminology. Some of these factors are mere preferences, while others are strong constraints, leading to genuine mistakes if not respected. Therefore, non-WSD lexical errors may violate cohesion, coherence, sense frequency distributions, collocations, terminology, or grammatical constraints.

For instance, the test set designed by Bawden et al. (2018) for ambiguous source words (like the source word in the example above) equates WSD errors to coherence errors, because they generally make the output incoherent. Conversely, non-WSD errors are equated to cohesion errors, although the authors concede that “these types are not mutually exclusive and the distinction is not always so clear.” While cohesion errors per se can in principle be counted automatically (Wong and Kit, 2012), WSD errors, as well as non-WSD errors not related to cohesion (e.g. due to collocations or terminology), are more difficult to spot without human intervention. One solution is the recent trend, though not without remote ancestors (King and Falkedal, 1990), to use contrastive pairs containing ambiguous source words, which we discuss hereafter.

Test suites with contrastive pairs.

Based on the same principle as LingEval97 (Sennrich, 2017) mentioned above, ContraWSD is a set of contrastive pairs intended to evaluate the capacity of an MT system to translate the correct sense of a polysemous word in context (Rios Gonzales et al., 2017). About 80 word senses were selected automatically, by observing target-side variation, for each of the DE/EN and DE/FR pairs. For each sense, 90 sentences are available on average, and for each reference translation, an average of 3.5 and 2.2 wrong translations, respectively, are generated by replacing the target word with other observed translations of the word. As with LingEval97, a system can be tested with ContraWSD only if it can output the probability of a source/target sentence pair, which excludes black box systems, and a good answer is a case where the system ranks a correct translation higher than an incorrect one, given the source. The authors found that a baseline NMT system (Nematus, Sennrich et al., 2017) reached about 70% accuracy, compared to 93–95% for a human. The sense-aware systems proposed in their study remained in the same accuracy range on average (see below), but scored higher when disambiguating frequent words.

This approach was pursued and proposed as a supplementary test suite at WMT18 (Rios et al., 2018), where it was formulated as a classic translation task, with a test set of 3,249 DE/EN sentence pairs (from several corpora on OPUS) which contained ambiguous German words identified in ContraWSD. The scoring is automatic in most cases, by observing the presence of a known correct vs. incorrect translation of each polysemous source word. All systems submitted to the WMT18 news translation task (Bojar et al., 2018) were also evaluated for WSD, and compared to certain 2016 systems, finding that accuracy of the best system progressed from 81% to 93%, and that the correlation with BLEU scores was strong but not perfect.
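The automatic part of such a scoring scheme can be sketched as a simple lookup of known correct and incorrect translations in the MT output. The function name, the word lists and the example outputs below are hypothetical; cases where neither or both kinds of translation are found are left undecided, in line with scoring being automatic only in most cases.

```python
def wsd_label(output, correct_words, wrong_words):
    """Classify one MT output for one ambiguous source word."""
    tokens = {t.strip(".,;:!?\"'") for t in output.lower().split()}
    has_correct = bool(tokens & correct_words)
    has_wrong = bool(tokens & wrong_words)
    if has_correct and not has_wrong:
        return "correct"
    if has_wrong and not has_correct:
        return "wrong"
    return "undecided"  # neither or both found: needs manual checking

# Hypothetical item: German 'Schlange' used in the sense of a waiting line.
correct_words = {"queue", "line"}
wrong_words = {"snake", "serpent"}
outputs = [
    "The queue in front of the cinema was long.",
    "The snake in front of the cinema was long.",
    "The crowd in front of the cinema was large.",
]
labels = [wsd_label(o, correct_words, wrong_words) for o in outputs]
```

Unlike contrastive scoring, this procedure works on the actual output of a system, so it can also be applied to black-box online systems.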

Another contrastive test set was made available in November 2017 by Bawden et al. (2018, Section 2.1) to assess lexical choice in English/French translation, but also pronoun choice (see next section). The set is thus composed of two equally sized subsets, each consisting of ‘blocks’ based on modified movie subtitles. There are 100 blocks testing lexical choice capabilities (WSD and non-WSD) (see also Bawden, 2018, Section 7.1). Formally, let us denote a block as $(c_1, c_2, s, c'_1, c'_2, t_1, t_2)$. Each block is based on a source sentence $s$ containing a polysemous word $w$. Two different source sentences $c_1$ and $c_2$ are provided as context, i.e. preceding $s$. Their role is to modify the sense conveyed by the occurrence of $w$ in sentence $s$. For each context, the block provides a correct translation of $s$ ($t_1$ in the first case, $t_2$ in the second case), along with an incorrect one ($t_2$ and respectively $t_1$). The reference translations of the context sentences ($c'_1$ and $c'_2$) are also included. Because the source sentence is kept constant for the two contexts, a non-contextual system would provide the same answer for both contexts (i.e. same ranking of true/false candidates) and obtain 50% accuracy. Among the 100 blocks provided by Bawden et al. (2018), some are designed to test WSD capabilities, and include contexts such that $c_1$ indicates that $w$ conveys sense $\sigma_1$ (with the notations above), so the correct translation is $t_1$ and the incorrect one is $t_2$. Then, context $c_2$ indicates that $w$ conveys sense $\sigma_2$, and reverses the correctness of translations. Other blocks test non-WSD related lexical choices, which rely more on the target contexts $c'_1$ and $c'_2$ for deciding which translation is correct, e.g. the need to repeat the same word.
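The 50% figure for a non-contextual system follows directly from the block design, as the following Python sketch illustrates. The `Block` class, both toy scorers and the example sentences are hypothetical and are not taken from the test set; ties between candidate scores are counted as errors here, which is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# (context sentence, source sentence, candidate translation) -> score
Scorer = Callable[[str, str, str], float]


@dataclass(frozen=True)
class Block:
    context_1: str
    context_2: str
    source: str
    translation_1: str  # correct after context_1, incorrect after context_2
    translation_2: str  # correct after context_2, incorrect after context_1


def block_accuracy(blocks: Sequence[Block], score: Scorer) -> float:
    """Accuracy over both contexts of every block (two decisions per block)."""
    hits = 0
    for b in blocks:
        hits += score(b.context_1, b.source, b.translation_1) > score(
            b.context_1, b.source, b.translation_2)
        hits += score(b.context_2, b.source, b.translation_2) > score(
            b.context_2, b.source, b.translation_1)
    return hits / (2 * len(blocks))


def context_blind(context: str, source: str, target: str) -> float:
    # Ignores the context entirely: here it simply prefers shorter outputs.
    return -float(len(target))


def context_aware(context: str, source: str, target: str) -> float:
    # Toy cue lookup standing in for a document-level model.
    cues = {"season": "printemps", "mattress": "ressort"}
    return float(any(cue in context.lower() and word in target
                     for cue, word in cues.items()))


blocks = [
    Block(
        context_1="What a beautiful season!",
        context_2="This mattress is worn out.",
        source="The spring is lovely.",
        translation_1="Le printemps est agréable.",
        translation_2="Le ressort est agréable.",
    )
]

blind_accuracy = block_accuracy(blocks, context_blind)
aware_accuracy = block_accuracy(blocks, context_aware)
print(blind_accuracy, aware_accuracy)
```

Whatever fixed preference a context-blind scorer has, it is right for exactly one of the two contexts of a block, so any accuracy above 50% must come from using the context.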

Exploring attention to context for NMT of polysemous words.

A quantitative analysis of the WSD capacities of NMT (encoder-decoder RNN with attention) was provided by Liu, Lu, and Neubig (2018) in August 2017, who opted for straightforward criteria to identify polysemous words and assess their translations. The number of senses of each EN source word (for EN/DE, EN/FR and EN/ZH NMT) was extracted from the online Cambridge Dictionary, and correct translation meant identity to the reference. Further on, to demonstrate the benefits of their proposed NMT improvements (see below), they restricted the list of polysemous words to a list of 171 English homographs found on Wikipedia.

The findings presented by Marvin and Koehn (2018) may explain why the capabilities of baseline NMT systems (RNN-based built with OpenNMT-py for EN/FR translation over Europarl and NewsComments) for WSD remain quite limited. They examined the representations of occurrences of polysemous words at various levels of the NMT encoding layers. Specifically, the tests involved 426 sentences and four polysemous words (right, like, last, and case), and showed that the encoded context seems insufficient to enable WSD in most cases.

This is confirmed by Tang et al. (2018b) who directly looked at how attention is distributed when translating polysemous words from ContraWSD. They compared RNN encoder-decoder with the Transformer model, with two ways to compute translation accuracy on polysemous words: either by comparing directly with a word-aligned reference, or by scoring the contrastive pairs as in (Rios Gonzales et al., 2017). In both cases, the Transformer clearly outperforms the RNN, though performance appears to be lower with reference-based scoring. The main findings are that attention weights are more concentrated on the “ambiguous noun itself rather than context tokens” and that “attention is not the main mechanism used by NMT models to incorporate contextual information for WSD.”

ContraWSD was again put to use by Tang et al. (2018a) for a quantitative evaluation of WSD for DE/EN and DE/FR translation. The comparison of RNNs, CNNs and Transformer showed that the latter is significantly better than the other two, likely because the network “connects distant words via shorter network paths”, but no further explanation or analysis on WSD is provided.

3.3.2 Evaluation of Pronouns and Coreference

A revival of the interest in improving discourse-level phenomena in MT has led since 2010 to several initiatives and studies to improve the evaluation of pronoun translation, i.e. to make it more accurate but also more efficient, and if possible, to automate it. With the advent of NMT, the new architectures have been submitted to the same tests and compared with PBSMT.

ParCor is a parallel EN/DE corpus first annotated with anaphoric relations, and then with coreference ones (Guillou et al., 2014; Lapshinova-Koltunski et al., 2018). It includes TED talks and EU Bookshop publications. The annotation pays special attention to the status of pronouns, and distinguishes several cases of referential vs. non referential uses. Using similar annotation guidelines, the authors designed the PROTEST test suite, which contains 250 pronouns along with their reference translations (Guillou and Hardmeier, 2016). Identity between a candidate and reference pronoun translation is scored automatically, but differences are submitted to human judgment. Indeed, depending on the pronoun systems of the source and target languages (often EN/FR and EN/DE in these evaluations), but also and crucially on the lexical choice for a pronoun’s antecedent, a variety of translations can be acceptable for pronouns. This limits the accuracy of automatic reference-based metrics such as APT (Miculicich Werlen and Popescu-Belis, 2017), as recently discussed by Guillou and Hardmeier (2018), and requires alternative strategies when evaluations must be quick, large-scale and cost-effective, e.g. for pronoun-oriented shared tasks.

Several shared tasks have been organized to assess the quality of pronoun translation, but due to evaluation difficulties, protocols have evolved from year to year. Two main approaches have been tried: (1) evaluate the accuracy of pronoun translation, though this cannot be done automatically with sufficient confidence for a shared task, and requires some form of human evaluation; (2) evaluate the accuracy of pronoun prediction given the source text and a lemmatized version of the reference translation with deleted pronouns, which can be done (semi-)automatically. (Footnote 14: Lemmatization prevents non-MT strategies like powerful language models from attaining high scores, as happened at the 2015 DiscoMT shared task (Hardmeier et al., 2015).) Both approaches were tried at the DiscoMT 2015 shared task (Hardmeier et al., 2015), but only the second one was continued in the following years (Guillou et al., 2016; Loáiciga et al., 2017).

At WMT 2018, pronoun translation was evaluated for all 16 systems participating in the EN/DE news translation task using an external test suite (Guillou et al., 2018) in the PROTEST style, with 200 occurrences of it and they on the source side. These pronouns have multiple possible translations into German. Evaluation was semi-automatic, with candidates matching the reference (1,150) being ‘approved’ and the others being submitted to human judges (2,050). Seven out of 15 systems (all NMT) correctly translate more than 145 pronouns out of 200, with the best one reaching 157 (Microsoft’s Marian; Junczys-Dowmunt et al., 2018). Pronoun accuracy is highly correlated with BLEU (correlation of 0.91) and APT (0.89). Event references reach 81% accuracy and pleonastic it 93% on average. Intra-sentential anaphoric occurrences of it are better translated than inter-sentential ones (58% vs. 45%).
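The semi-automatic protocol amounts to a triage step, sketched below in Python. The function `triage` and the toy pronoun lists are hypothetical, and exact case-insensitive string identity is an assumed matching rule. The last lines only check that the counts reported above are mutually consistent.

```python
from typing import List, Sequence, Tuple


def triage(candidates: Sequence[str], references: Sequence[str]) -> Tuple[List[int], List[int]]:
    """Split pronoun translations into auto-approved and human-review indices.

    A candidate identical to the reference pronoun is approved automatically;
    any other candidate may still be acceptable, so it goes to a human judge.
    """
    approved: List[int] = []
    needs_human: List[int] = []
    for index, (candidate, reference) in enumerate(zip(candidates, references)):
        if candidate.strip().lower() == reference.strip().lower():
            approved.append(index)
        else:
            needs_human.append(index)
    return approved, needs_human


approved, needs_human = triage(["es", "sie", "Er"], ["es", "er", "er"])
print(approved, needs_human)

# Consistency check on the reported figures: 16 systems x 200 pronouns.
total_judgments = 16 * 200
auto_share = 1150 / total_judgments
print(total_judgments, round(auto_share, 3))
```

With these figures, roughly 36% of the pronoun translations were approved automatically, so almost two thirds still required human judgment.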

A similar method using PROTEST was applied by the same authors to EN/FR MT with PBSMT and NMT systems (Hardmeier and Guillou, 2018), with 250 occurrences of it and they. The best system, which is the Transformer-based context-aware system from Voita et al. (2018), translated 199 pronouns correctly, while the average over the 9 tested systems is 160 (64%). The study shows that the context-aware system is highly accurate on pleonastic (non-referential) pronouns (27 out of 30) and intra-sentential anaphoric it and they (35/40 and 21/25) but still struggles with inter-sentential ones (15/30 and 11/25).

The evaluation approach adopted by Voita, Serdyukov, Sennrich, and Titov (2018) for their context-aware NMT architecture is quite exemplary. Their goal is to demonstrate improvement of pronoun translation, and this is evaluated without the use of specific test suites or contrastive pairs (see below). The authors use Stanford’s CoreNLP coreference resolution system (footnote 15: an unspecified component of CoreNLP (Manning et al., 2014), possibly the deterministic one (Lee et al., 2013)) to identify sentences with pronouns that have their antecedent in the previous sentence; for such sentences, BLEU improves more (+1.2) than on average. Moreover, for sentences with it and a feminine antecedent, BLEU increases by 4.8 points. The attention weights of the system are compared to CoreNLP results in two ways. The first identifies the token that receives maximal attention when the pronoun is translated. This token coincides with the antecedent found by CoreNLP more often (+6 points) than for baseline methods (random, first, or last noun). The second evaluation has human raters identify the actual antecedent in 500 sentences with it where more than one candidate antecedent exists in the previous sentence. Here, CoreNLP is correct in only 79% of the cases, while using NMT attention is 72% correct, well above the best heuristic at 54%. These arguments strongly indicate that NMT learns to perform inter-sentential anaphora resolution to some extent. (Footnote 16: This analysis of performance in situations that are genuinely ambiguous is reminiscent of Linzen, Dupoux, and Goldberg (2016), who assess the capacity of LSTM networks to model syntactic dependencies such as noun-verb agreement.)
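The attention-based comparison can be sketched in Python as follows: the context token with maximal attention is taken as the predicted antecedent and compared with a gold antecedent position, alongside a simple "last noun" heuristic. All names and numbers below are hypothetical toy data, not the authors' code or results.

```python
from typing import Sequence


def max_attention_index(weights: Sequence[float]) -> int:
    """Position of the context token that receives the most attention."""
    return max(range(len(weights)), key=weights.__getitem__)


def agreement_rate(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Share of sentences where the predicted antecedent position is the gold one."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predictions and gold labels must be non-empty and aligned")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# One row per test sentence: attention over the tokens of the previous
# sentence at the step where the pronoun is translated (toy values).
attention_rows = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
]
noun_positions = [[0, 1], [0, 2], [1, 2], [1, 2]]  # candidate antecedents
gold_antecedents = [1, 0, 2, 1]                    # e.g. from human raters

attention_prediction = [max_attention_index(row) for row in attention_rows]
last_noun_prediction = [positions[-1] for positions in noun_positions]

attention_agreement = agreement_rate(attention_prediction, gold_antecedents)
last_noun_agreement = agreement_rate(last_noun_prediction, gold_antecedents)
print(attention_agreement, last_noun_agreement)
```

The same `agreement_rate` function applies whether the gold positions come from a coreference resolver or from human raters, which mirrors the two comparisons described above.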

Moving away from evaluations performed on test suites with reference translations, as well as from those requiring coreference resolution, contrastive pairs have also been designed for pronoun translation. The above-mentioned test set by Bawden, Sennrich, Birch, and Haddow (2018) also contains 100 blocks that aim to test the translation of personal and possessive pronouns. As for WSD, the context and source sentences are kept constant (e.g. with a pronoun it), but four alternative translations of the context sentence are generated, varying the translation of the antecedent: (a) reference translation; (b) possible translation with the opposite gender; (c) and (d), inaccurate translations, feminine and masculine. For each translation of the context sentence, the contrastive pair differs only in the translation of it, with a masculine vs. feminine French pronoun (il or elle). In situations (c) and (d), the system is expected to prefer the “contextually correct” translation, agreeing with the gender of the inaccurate translation of the antecedent. The best system designed by Bawden et al. (2018) achieves 72.5% accuracy versus 50% for a non-contextual NMT system.

Finally, a much larger but less structured set of contrastive pairs for pronouns has been presented by Müller, Rios, Voita, and Sennrich (2018). The EN/DE pairs contain only source sentences occurring in OpenSubtitles, without editing of the context sentences. The key to ensuring high-quality automatic data selection is to focus on the English source pronoun it and its possible German translations er, sie or es, with several constraints: automatic anaphora resolution on both the EN and DE sides (with CoreNLP (Manning et al., 2014; coreference component unspecified) and CorZu (Tuggener, 2016)) must find an antecedent; the antecedents on the EN and DE sides must be word aligned (with fast-align (Dyer et al., 2013)); and the EN/DE pronouns must also be aligned. With these constraints in mind, the set includes the source and target sentences containing it and its translation, and as much context as needed before the sentence. To generate the wrong alternative in the contrastive pair, the correct translation is randomly replaced with one of the two incorrect ones. The set contains 12,000 occurrences of it, with 4,000 for each possible translation; most antecedents (58%) are in the previous sentence. Using these contrastive pairs, the authors find that context-aware models outperform baselines by up to 20 percentage points, especially on the sentences where the antecedent is in the preceding sentence, while BLEU scores are only marginally improved.
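The generation of the wrong alternative can be illustrated with a short Python sketch. It assumes a whitespace-tokenized German sentence and a known pronoun position (in the actual data this comes from word alignment); the function `make_contrastive` is hypothetical and leaves the rest of the sentence, including verb agreement, untouched.

```python
import random

GERMAN_IT_PRONOUNS = ("er", "sie", "es")  # possible translations of English "it"


def make_contrastive(target: str, pronoun_index: int, rng: random.Random) -> str:
    """Replace the correct pronoun with one of the two other pronouns, at random."""
    tokens = target.split()
    original = tokens[pronoun_index]
    correct = original.lower()
    if correct not in GERMAN_IT_PRONOUNS:
        raise ValueError(f"token {original!r} is not one of {GERMAN_IT_PRONOUNS}")
    wrong = rng.choice([p for p in GERMAN_IT_PRONOUNS if p != correct])
    if original[0].isupper():
        wrong = wrong.capitalize()  # keep sentence-initial capitalization
    tokens[pronoun_index] = wrong
    return " ".join(tokens)


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
reference = "Er ist kaputt ."
contrastive = make_contrastive(reference, 0, rng)
print(reference, "|", contrastive)
```

A model is then credited when it scores the reference higher than the generated alternative, as with the other contrastive test sets.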

3.3.3 Evaluation of Discourse Structure

Several metrics have been proposed to assess the ability to correctly translate discourse structure, but none of the studies applied them to NMT systems, as they pre-dated their advent. Discourse structure results from argumentation relations between sentences, often made explicit through discourse connectives. Although strategies to improve connective translation by PBSMT systems have been designed (Meyer and Popescu-Belis, 2012; Meyer et al., 2015), along with metrics to assess the improvements (Hajlaoui and Popescu-Belis, 2013), they have not been recently applied to NMT systems.

Similarly, metrics involving discourse structure (sentence-level RST parse trees) such as DiscoTK-party (Joty et al., 2017) have been shown to correlate positively with human judgments (for PBSMT). However, this study mostly refers to data from the WMT 2014 shared task on (meta-)evaluation of metrics, which did not include any NMT output at that time. Sim Smith and Specia (2018) designed a discourse-aware MT evaluation metric that compares embeddings of source and target connectives, which is validated on EN/FR MT outputs from 2014 and earlier, accompanied by human ratings.

A manual analysis of discourse phenomena in SMT, with quality estimation as the background objective, was presented by Scarton and Specia (2015). Other taxonomies of discourse-related errors, applied by manual analysts, have been inspired by contrastive linguistics at the discourse level; they allow cross-lingual contrasts to be compared in human and machine translation, and conclude that NMT is superior (Lapshinova-Koltunski and Hardmeier, 2017; Šoštarić et al., 2018).

Indirect evidence on the capability of NMT to translate inter-sentential dependencies comes from the recent study by Läubli, Sennrich, and Volk (2018), reassessing Hassan et al.’s claim that the Bing Translator achieved human parity on ZH/EN news translation. Without examining detailed quality attributes such as word order, lexical choice, or pronouns, the authors asked human judges to rate translations at the text level rather than the sentence level, and they showed that when entire texts are considered by professional translators, the difference between human and NMT translations becomes statistically significant. One can therefore infer that there are perceptible imperfections in the NMT translation of text-level properties such as cohesion and coherence.

3.4 Synthesis

When a new system is presented in a publication, its authors cannot be expected to apply a large array of existing metrics. Evaluation studies that deepen the analysis of MT output are thus welcome. As reviewed in this section, studies from 2017–2018 have revealed significant improvements in output quality brought by NMT models, confirmed from a variety of perspectives:

  • type of metric: automated (e.g. using the TER distance to a reference translation) vs. human (e.g. judges who may post-edit, or compare, or rate absolutely one or more translations, with or without knowledge of the source language);

  • type of comparison: absolute score or comparative score (often pitting NMT against SMT);

  • type of system output: 1-best, $n$-best, or probabilities over contrastive pairs (which require access to a system’s internals);

  • test data: large corpora from WMT, excerpts from them, domain-specific data, or test suites aimed at one or more linguistic phenomena.

One of the main observations of this review is the rather large number of assessments of document-level quality, which frequently support the need for discourse-aware MT. However, these studies also indicate that significant progress remains to be made on several discourse-level phenomena: lexical coherence, anaphora resolution, and discourse structure.

4 Increasing Context Spans in NMT

In a subsequent version of the paper, this section will review studies from 2017–2018 that attempted to improve the translation of discourse-level phenomena, and/or attempted to use larger spans of context when translating. The section will be divided into three parts: NMT systems using wider contexts; NMT models for improving WSD and lexical choice; and the processing of discourse-level phenomena, particularly pronouns, in NMT.


The author is grateful to the Swiss National Science Foundation (SNSF) for support through the DOMAT project (n. 175693): On-demand Knowledge for Document-level Machine Translation.


  • Bahdanau et al. [2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 2015.
  • Bawden [2018] Rachel Bawden. Contextual machine translation of dialogue: going beyond the sentence. PhD thesis, Université Paris-Saclay, LIMSI-CNRS, Orsay, France, November 2018.
  • Bawden et al. [2018] Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304–1313. Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1118.
  • Bengio et al. [2003] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.
  • Bentivogli et al. [2016] Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. Neural versus phrase-based machine translation quality: a case study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 257–267. Association for Computational Linguistics, 2016. doi: 10.18653/v1/D16-1025.
  • Bentivogli et al. [2018] Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. Neural versus phrase-based MT quality: An in-depth analysis on english-german and english-french. Computer Speech & Language, 49:52–70, 2018. ISSN 0885-2308.
  • Bojar et al. [2016] Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198. Association for Computational Linguistics, 2016. doi: 10.18653/v1/W16-2301.
  • Bojar et al. [2018] Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz. Findings of the 2018 conference on machine translation (wmt18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303. Association for Computational Linguistics, 2018.
  • Brussel et al. [2018] Laura Van Brussel, Arda Tezcan, and Lieve Macken. A fine-grained error analysis of NMT, SMT and RBMT output for english-to-dutch. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018. European Language Resources Association (ELRA).
  • Burchardt et al. [2017] Aljoscha Burchardt, Vivien Macketanz, Jon Dehdari, Georg Heigold, Jan-Thorsten Peter, and Philip Williams. A linguistic evaluation of rule-based, phrase-based, and neural MT engines. The Prague Bulletin of Mathematical Linguistics, 108(1):159–170, 2017.
  • Cartoni and Meyer [2012] Bruno Cartoni and Thomas Meyer. Extracting directional and comparable corpora from a multilingual corpus for translation studies. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012. European Language Resources Association (ELRA).
  • Castilho et al. [2017a] Sheila Castilho, Joss Moorkens, Federico Gaspari, Iacer Calixto, John Tinsley, and Andy Way. Is neural machine translation the new state of the art? The Prague Bulletin of Mathematical Linguistics, 108(1):109–120, 2017a.
  • Castilho et al. [2017b] Sheila Castilho, Joss Moorkens, Federico Gaspari, Rico Sennrich, Vilelmini Sosoni, Panayota Georgakopoulou, Pintu Lohar, Andy Way, Antonio Valerio Miceli Barone, and Maria Gialama. A comparative quality evaluation of PBSMT and NMT using professional translators. In Proceedings of Machine Translation Summit XVI, Nagoya, Japan, 2017b.
  • Chen et al. [2018] Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86. Association for Computational Linguistics, 2018.
  • Cho et al. [2014] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. Association for Computational Linguistics, 2014. doi: 10.3115/v1/D14-1179.
  • Chung et al. [2014] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Deep Learning and Representation Learning Workshop, Montreal, QC, Canada, 2014.
  • Devlin et al. [2014] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. Fast and robust neural network joint models for statistical machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370–1380. Association for Computational Linguistics, 2014. doi: 10.3115/v1/P14-1129.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
  • Dyer et al. [2013] Chris Dyer, Victor Chahuneau, and Noah A. Smith. A simple, fast, and effective reparameterization of ibm model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648. Association for Computational Linguistics, 2013.
  • Edunov et al. [2018] Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500. Association for Computational Linguistics, 2018.
  • Esperanca-Rodier et al. [2017] Emmanuelle Esperanca-Rodier, Caroline Rossi, Alexandre Bérard, and Laurent Besacier. Evaluation of NMT and SMT: A study on uses and perceptions. In Proceedings of the 39th Conference on Translating and the Computer, pages 11–24. Asling, 2017.
  • Gehring et al. [2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1243–1252, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • Graham et al. [2017] Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. Can machine translation systems be evaluated by the crowd alone. Natural Language Engineering, 23(1):3–30, 2017.
  • Guillou and Hardmeier [2016] Liane Guillou and Christian Hardmeier. PROTEST: A test suite for evaluating pronouns in machine translation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May 2016. European Language Resources Association (ELRA). ISBN 978-2-9517408-9-1.
  • Guillou and Hardmeier [2018] Liane Guillou and Christian Hardmeier. Automatic reference-based evaluation of pronoun translation misses the point. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4797–4802. Association for Computational Linguistics, 2018.
  • Guillou et al. [2014] Liane Guillou, Christian Hardmeier, Aaron Smith, Jörg Tiedemann, and Bonnie Webber. ParCor 1.0: A parallel pronoun-coreference corpus to support statistical MT. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014), Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA).
  • Guillou et al. [2016] Liane Guillou, Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, Mauro Cettolo, Bonnie Webber, and Andrei Popescu-Belis. Findings of the 2016 WMT shared task on cross-lingual pronoun prediction. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 525–542. Association for Computational Linguistics, 2016. doi: 10.18653/v1/W16-2345.
  • Guillou et al. [2018] Liane Guillou, Christian Hardmeier, Ekaterina Lapshinova-Koltunski, and Sharid Loáiciga. A pronoun test suite evaluation of the english–german mt systems at wmt 2018. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 570–577. Association for Computational Linguistics, 2018.
  • Hajlaoui and Popescu-Belis [2013] Najeh Hajlaoui and Andrei Popescu-Belis. Assessing the accuracy of discourse connective translations: Validation of an automatic metric. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 236–247. Springer, 2013.
  • Hardmeier and Guillou [2018] Christian Hardmeier and Liane Guillou. Pronoun translation in english-french machine translation: An analysis of error types. CoRR, abs/1808.10196, 2018.
  • Hardmeier et al. [2015] Christian Hardmeier, Preslav Nakov, Sara Stymne, Jörg Tiedemann, Yannick Versley, and Mauro Cettolo. Pronoun-focused MT and cross-lingual pronoun prediction: Findings of the 2015 DiscoMT shared task on pronoun translation. In Proceedings of the Second Workshop on Discourse in Machine Translation, pages 1–16. Association for Computational Linguistics, 2015. doi: 10.18653/v1/W15-2501.
  • Hassan et al. [2018] Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. Achieving human parity on automatic chinese to english news translation. CoRR, abs/1803.05567, 2018.
  • Helcl and Libovickỳ [2017] Jindřich Helcl and Jindřich Libovickỳ. Neural monkey: An open-source tool for sequence learning. The Prague Bulletin of Mathematical Linguistics, 107(1):5–17, 2017.
  • Hieber et al. [2017] Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. Sockeye: A toolkit for neural machine translation. arXiv preprint arXiv:1712.05690, 2017.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Hovy et al. [2002] Eduard Hovy, Margaret King, and Andrei Popescu-Belis. Principles of context-based machine translation evaluation. Machine Translation, 17(1):43–75, 2002.
  • Isabelle et al. [2017] Pierre Isabelle, Colin Cherry, and George Foster. A challenge set approach to evaluating machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2486–2496. Association for Computational Linguistics, 2017. doi: 10.18653/v1/D17-1263.
  • Joty et al. [2017] Shafiq Joty, Francisco Guzmán, Lluís Màrquez, and Preslav Nakov. Discourse structure in machine translation evaluation. Computational Linguistics, 43(4):683–722, 2017. doi: 10.1162/COLI_a_00298.
  • Junczys-Dowmunt et al. [2016] Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. Is neural machine translation ready for deployment? A case study on 30 translation directions. In Proceedings of IWSLT (13th International Workshop on Spoken Language Technology), Seattle, WA, USA, 2016.
  • Junczys-Dowmunt et al. [2018] Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. Marian: Fast neural machine translation in c++. In Proceedings of ACL 2018, System Demonstrations, pages 116–121. Association for Computational Linguistics, 2018.
  • Kalchbrenner and Blunsom [2013] Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709. Association for Computational Linguistics, 2013.
  • King and Falkedal [1990] Margaret King and Kirsten Falkedal. Using test suites in evaluation of machine translation systems. In Proceedings of the 13th International Conference on Computational Linguistics (COLING), volume 2, Helsinki, Finland, 1990.
  • Klein et al. [2017] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72. Association for Computational Linguistics, 2017.
  • Klubička et al. [2018] Filip Klubička, Antonio Toral, and Víctor M. Sánchez-Cartagena. Quantitative fine-grained human evaluation of machine translation systems: a case study on English to Croatian. Machine Translation, 32(3):195–215, Sep 2018. doi: 10.1007/s10590-018-9214-x.
  • Knowles and Koehn [2018] Rebecca Knowles and Philipp Koehn. Context and copying in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3034–3041. Association for Computational Linguistics, 2018.
  • Koehn [2004] Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 2004.
  • Koehn and Knowles [2017] Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39. Association for Computational Linguistics, 2017. doi: 10.18653/v1/W17-3204.
  • Lakew et al. [2018] Surafel Melaku Lakew, Mauro Cettolo, and Marcello Federico. A comparison of transformer and recurrent neural networks on multilingual neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 641–652. Association for Computational Linguistics, 2018.
  • Lapshinova-Koltunski and Hardmeier [2017] Ekaterina Lapshinova-Koltunski and Christian Hardmeier. Discovery of discourse-related language contrasts through alignment discrepancies in english-german translation. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 73–81. Association for Computational Linguistics, 2017. doi: 10.18653/v1/W17-4810.
  • Lapshinova-Koltunski et al. [2018] Ekaterina Lapshinova-Koltunski, Christian Hardmeier, and Pauline Krielke. ParCorFull: a parallel corpus annotated with full coreference. In Proceedings of 11th Language Resources and Evaluation Conference, Miyazaki, Japan, 2018.
  • Läubli et al. [2018] Samuel Läubli, Rico Sennrich, and Martin Volk. Has machine translation achieved human parity? A case for document-level evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796. Association for Computational Linguistics, 2018.
  • Lee et al. [2013] Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, 39(4), 2013. doi: 10.1162/COLI_a_00152.
  • Linzen et al. [2016] Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535, 2016.
  • Liu et al. [2018] Frederick Liu, Han Lu, and Graham Neubig. Handling homographs in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1336–1345. Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-1121.
  • Loáiciga et al. [2017] Sharid Loáiciga, Sara Stymne, Preslav Nakov, Christian Hardmeier, Jörg Tiedemann, Mauro Cettolo, and Yannick Versley. Findings of the 2017 DiscoMT shared task on cross-lingual pronoun prediction. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 1–16. Association for Computational Linguistics, 2017. doi: 10.18653/v1/W17-4801.
  • Luong et al. [2015] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421. Association for Computational Linguistics, 2015. doi: 10.18653/v1/D15-1166.
  • Manning et al. [2014] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60. Association for Computational Linguistics, 2014. doi: 10.3115/v1/P14-5010.
  • Marvin and Koehn [2018] Rebecca Marvin and Philipp Koehn. Exploring word sense disambiguation abilities of neural machine translation systems (non-archival extended abstract). In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 125–131. Association for Machine Translation in the Americas, 2018.
  • Meyer and Popescu-Belis [2012] Thomas Meyer and Andrei Popescu-Belis. Using sense-labeled discourse connectives for statistical machine translation. In Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), pages 129–138. Association for Computational Linguistics, 2012.
  • Meyer et al. [2015] Thomas Meyer, Najeh Hajlaoui, and Andrei Popescu-Belis. Disambiguating discourse connectives for statistical machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(7):1184–1197, 2015.
  • Miculicich Werlen and Popescu-Belis [2017] Lesly Miculicich Werlen and Andrei Popescu-Belis. Validation of an automatic metric for the accuracy of pronoun translation (APT). In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 17–25. Association for Computational Linguistics, 2017. doi: 10.18653/v1/W17-4802.
  • Müller et al. [2018] Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 61–72. Association for Computational Linguistics, 2018.
  • Ott et al. [2018] Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 1–9. Association for Computational Linguistics, 2018.
  • Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 2002.
  • Popović [2017] Maja Popović. Comparing language related issues for NMT and PBMT between German and English. The Prague Bulletin of Mathematical Linguistics, 108(1):209–220, 2017.
  • Popović [2018] Maja Popović. Language-related issues for NMT and PBMT for English–German and English–Serbian. Machine Translation, pages 1–17, 2018.
  • Pouget-Abadie et al. [2014] Jean Pouget-Abadie, Dzmitry Bahdanau, Bart van Merrienboer, Kyunghyun Cho, and Yoshua Bengio. Overcoming the curse of sentence length for neural machine translation using automatic segmentation. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 78–85. Association for Computational Linguistics, 2014. doi: 10.3115/v1/W14-4009.
  • Rios et al. [2018] Annette Rios, Mathias Müller, and Rico Sennrich. The word sense disambiguation test suite at WMT18. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 588–596. Association for Computational Linguistics, 2018.
  • Rios Gonzales et al. [2017] Annette Rios Gonzales, Laura Mascarell, and Rico Sennrich. Improving word sense disambiguation in neural machine translation with sense embeddings. In Proceedings of the Second Conference on Machine Translation, pages 11–19. Association for Computational Linguistics, 2017. doi: 10.18653/v1/W17-4702.
  • Scarton and Specia [2015] Carolina Scarton and Lucia Specia. A quantitative analysis of discourse phenomena in machine translation. Discours, 2015. doi: 10.4000/discours.9047.
  • Schwenk et al. [2006] Holger Schwenk, Daniel Dechelotte, and Jean-Luc Gauvain. Continuous space language models for statistical machine translation. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 723–730. Association for Computational Linguistics, 2006.
  • Sennrich [2017] Rico Sennrich. How grammatical is character-level neural machine translation? Assessing MT quality with contrastive translation pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 376–382. Association for Computational Linguistics, 2017.
  • Sennrich and Haddow [2016] Rico Sennrich and Barry Haddow. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers, pages 83–91. Association for Computational Linguistics, 2016. doi: 10.18653/v1/W16-2209.
  • Sennrich et al. [2016a] Rico Sennrich, Barry Haddow, and Alexandra Birch. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96. Association for Computational Linguistics, 2016a. doi: 10.18653/v1/P16-1009.
  • Sennrich et al. [2016b] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics, 2016b. doi: 10.18653/v1/P16-1162.
  • Sennrich et al. [2016c] Rico Sennrich, Barry Haddow, and Alexandra Birch. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 371–376. Association for Computational Linguistics, 2016c. doi: 10.18653/v1/W16-2323.
  • Sennrich et al. [2017] Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68. Association for Computational Linguistics, 2017.
  • Sim Smith and Specia [2018] Karin Sim Smith and Lucia Specia. Assessing crosslingual discourse relations in machine translation. arXiv preprint arXiv:1810.03148, 2018.
  • Šoštarić et al. [2018] Margita Šoštarić, Christian Hardmeier, and Sara Stymne. Discourse-related language contrasts in English-Croatian human and machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 36–48. Association for Computational Linguistics, 2018.
  • Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112, 2014. URL -with-neural-networks.pdf.
  • Tang et al. [2018a] Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4263–4272. Association for Computational Linguistics, 2018a.
  • Tang et al. [2018b] Gongbo Tang, Rico Sennrich, and Joakim Nivre. An analysis of attention mechanisms: The case of word sense disambiguation in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 26–35. Association for Computational Linguistics, 2018b.
  • Toral and Sánchez-Cartagena [2017] Antonio Toral and Víctor M. Sánchez-Cartagena. A multifaceted evaluation of neural versus phrase-based machine translation for 9 language directions. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1063–1073, Valencia, Spain, April 2017. Association for Computational Linguistics.
  • Toral et al. [2018] Antonio Toral, Sheila Castilho, Ke Hu, and Andy Way. Attaining the unattainable? Reassessing claims of human parity in neural machine translation. In Proceedings of the Third Conference on Machine Translation (WMT). Association for Computational Linguistics, 2018.
  • Tuggener [2016] Don Tuggener. Incremental Coreference Resolution for German. PhD thesis, University of Zurich, Zurich, Switzerland, 2016.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
  • Vaswani et al. [2018] Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. Tensor2tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pages 193–199. Association for Machine Translation in the Americas, 2018.
  • Voita et al. [2018] Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274. Association for Computational Linguistics, 2018.
  • Wiseman and Rush [2016] Sam Wiseman and Alexander M. Rush. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306. Association for Computational Linguistics, 2016. doi: 10.18653/v1/D16-1137.
  • Wong and Kit [2012] Billy T. M. Wong and Chunyu Kit. Extending machine translation evaluation metrics with lexical cohesion to document level. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1060–1068. Association for Computational Linguistics, 2012.
  • Wu et al. [2016] Yonghui Wu et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
  • Zhang et al. [2018] Biao Zhang, Deyi Xiong, and Jinsong Su. Neural machine translation with deep attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, October 2018. doi: 10.1109/TPAMI.2018.2876404.