1 Introduction
The quality of NMT (Neural Machine Translation) models has been improving over the years and is narrowing the gap to human translation performance (Hassan et al., 2018). Until recently, most MT research has focused on translating and evaluating sentences in isolation, ignoring the context in which these sentences occur. Simplifying the translation task this way has its advantages: data sets are easier to create, models are computationally more efficient and human evaluations are faster (with full document context, annotation time per task increases by 68% according to Grundkiewicz et al. (2021)). While initial work failed to show significant differences in standard metrics (Tiedemann and Scherrer, 2017), the impact of ignoring context has been investigated more closely in recent years (Yin et al., 2021b). Targeted testing has shown poor performance on discourse-related phenomena (Müller et al., 2018; Bawden et al., 2018; Voita et al., 2019a; Jwalapuram et al., 2020b; Maruf et al., 2019b; Li et al., 2020) (see Table 3 for examples). Furthermore, without context, human evaluation fails to expose all translation errors and leads to premature conclusions about achieving human parity (Läubli et al., 2018). It is thus important to start addressing the MT task in a formulation that is closer to its true complexity and bridges the gap to the real communication needs of users.
This paper tackles the problem of context-aware translation by revisiting a straightforward multi-sentence translation approach which is considered a baseline in the literature. Our comprehensive experiments show that, by leveraging deeper transformer models in combination with knowledge distillation methods, this baseline becomes an effective and robust alternative to the specialized architectures proposed in the literature. The paper's contributions are:
- We show that multi-sentence translation can benefit from increased-capacity transformer models and that deeper models are better at learning contextual dependencies than wider models.
- We further show that distilled models can learn contextual dependencies from larger models, while reducing computational cost and increasing robustness to input length variations.
- Finally, results on four language pairs confirm that the approach achieves high performance for both contextual and single-segment translation tasks.
2 Multi-segment translation models
Throughout this paper, we implement context-aware translation models as multi-segment models, as initially proposed in Tiedemann and Scherrer (2017) and further used in Fernandes et al. (2021); Lopes et al. (2020) among others.
| Input | Output |
|---|---|
| <start> Fire? <sep> Well, put it out, why don't you? <end> | <start> Ein Feuer? <sep> Na dann löscht er doch! <end> |
| <start> Well, put it out, why don't you? <end> | <start> Na dann löscht er doch! <end> |

Table 1: Example of multi-segment and single-segment training data points (EN-DE).
Multi-segment data points
We use document-level parallel data which is transformed to contain concatenated, multi-segment input. Specifically, we restrict this work to two consecutive sentences. The source and target sides are concatenated using a special delimiter token and added to the training data. While not strictly a requirement, the special token allows the extraction of the context-aware translation for the second, target sentence. Prior context-aware architectures can be categorized by their use of context: source-side, target-side or both. As it generates both sentence translations jointly, the multi-segment approach takes advantage of both source- and target-side context at train time. However, it does not use the context reference translation during inference and multi-segment input is simply translated as a continuous output sequence.
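For illustration, constructing a multi-segment data point and extracting the context-aware translation of the target sentence can be sketched as follows (a minimal sketch; the marker tokens follow Table 1, and `translate` stands in for any trained multi-segment model):

```python
# Minimal sketch of multi-segment input construction and output extraction.
# The marker tokens follow Table 1; `translate` is a placeholder for a trained model.

START, SEP, END = "<start>", "<sep>", "<end>"

def make_multi_segment(context_src: str, current_src: str) -> str:
    """Concatenate the context sentence and the current sentence into one input."""
    return f"{START} {context_src} {SEP} {current_src} {END}"

def extract_current_translation(model_output: str) -> str:
    """Keep only the translation of the second (current) segment.

    The context translation produced before the separator is discarded;
    it is never compared against a reference at inference time.
    """
    text = model_output.replace(START, "").replace(END, "").strip()
    # The separator token marks the boundary between context and current segment.
    return text.split(SEP)[-1].strip()

# Example usage with a hypothetical `translate` function:
# src = make_multi_segment("Fire?", "Well, put it out, why don't you?")
# out = translate(src)   # e.g. "<start> Ein Feuer? <sep> Na dann löscht er doch! <end>"
# print(extract_current_translation(out))   # "Na dann löscht er doch!"
```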
Training data
We aim to create single translation models which can perform both translation in-context and in isolation. For this reason, we start from a training set including context for each parallel sentence and create a duplicate of it by removing the context information. All the contextual models (Ctx) are trained on this joint single- and multi-segment data, while the sentence-level baselines (Bl) use only single sentences. Note that although the data size varies between Bl and Ctx models, the data is effectively identical and all the models are trained using the same stopping criteria, thus conferring no special advantage to any of the models. Table 1 exemplifies the training data.
3 Experimental setup
We perform experiments on four language arcs: English to German (EN-DE), English to French (EN-FR), English to Russian (EN-RU) and Chinese to English (ZH-EN).
3.1 Training
We use the WMT2019 data set for EN-DE, Open Subtitles 2018 for EN-FR and EN-RU and the UN Parallel Corpus V1.0 for ZH-EN, all four containing document-level data. The data sets vary in size from 4M segments for EN-DE to 17.4M for ZH-EN (see Appendix A, Table 13 for details). Development data consists of the News task 2019 development set for DE, IWSLT 2019 for FR and newstest2019 for RU and ZH. In all conditions the development data mirrors the training data, meaning that it is duplicated to contain both multi- and single-segment data for contextual models, and original and distilled data for distillation experiments. In preliminary experiments we found this to play an important role.
Models use the Transformer architecture (Vaswani et al., 2017a). We start with a baseline architecture of 6:2 encoder:decoder layers and a feed-forward width of 2048, which subsequent experiments increase in decoder depth and feed-forward width respectively. Training is done with Sockeye (Domhan et al., 2020). See Appendix A for a complete list of training parameters.
3.2 Testing
We measure performance of contextual models using both targeted and non-targeted testing.
Non-targeted tests consist of contextual, document-level data which is not selected to focus on discourse phenomena. For EN-DE we use the test set splits made available in Maruf et al. (2019a): TED (2.3k segments), News-Commentary (3k) and Europarl (5.1k). We use IWSLT15 (1k) (Cettolo et al., 2012) for EN-FR, WMT newstest2020 (4k) (Barrault et al., 2020) for EN-RU and WMT newstest2020 (2k) (Barrault et al., 2020) for ZH-EN. While contextual models may improve performance on these data sets, previous work suggests that the effects are minimal in high-resource scenarios with strong sentence-level baselines (Lopes et al., 2020).
Targeted tests have been developed in order to evaluate performance on discourse phenomena. Table 2 lists the test sets used in this paper (while highly relevant, the data created by Yin et al. (2021a) had not been released at the time of writing). These test sets contain contrastive translation pairs, consisting of a correct human-generated translation and a variant of it where a pronoun, or another linguistic unit of interest, is swapped with an incorrect one. Table 3 shows examples from these data sets.
To complement the accuracy of contrastive evaluations, we also use the targeted test sets and their references to measure standard translation metrics.
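Contrastive accuracy is computed by scoring the correct and the contrastive translation under the model and counting how often the correct one is preferred. A minimal sketch, where `score(source, translation)` is a placeholder for the model's (length-normalized) log-probability:

```python
def contrastive_accuracy(examples, score):
    """examples: iterable of (source, reference, contrastive) triples.
    score(source, translation) -> model log-probability of the translation.
    """
    correct = 0
    for source, reference, contrastive in examples:
        # The model "passes" an example when it prefers the human reference
        # over the minimally edited, incorrect variant.
        if score(source, reference) > score(source, contrastive):
            correct += 1
    return correct / len(examples)
```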
| LP | Type | Size | Source |
|---|---|---|---|
| EN-DE | Anaphora | 12,000 | Müller et al. (2018) |
| EN-FR | Anaphora | 12,000 | Lopes et al. (2020) |
| EN-RU | Deixis | 3,000 | Voita et al. (2019b) |
| EN-RU | Lex-coh | 2,000 | Voita et al. (2019b) |
| EN-RU | Ellipsis-vp | 500 | Voita et al. (2019b) |
| EN-RU | Ellipsis-infl | 500 | Voita et al. (2019b) |
| ZH-EN | Anaphora | 500 | Jwalapuram et al. (2019) |

Table 2: Targeted test sets.
| LP | | |
|---|---|---|
| DE | Src | I forgot to confide it to you. |
| | Ctx | What's your plan? |
| | Ctx-tgt | Was hast du vor? |
| | Ref | Ich vergaß, es euch zu vertraun. |
| | Contr | Ich vergaß, sie euch zu vertraun. |
| FR | Src | And where's it coming from? |
| | Ctx | A sort of mist. |
| | Ctx-tgt | Une sorte de brume. |
| | Ref | Et elle vient d'où ? |
| | Contr | Et il vient d'où ? |
| RU | Src | Identity theft. |
| | Ctx | And I solved another crime. |
| | Ctx-tgt | И этим решил еще одно преступление. |
| | Ref | Кражу. |
| | Contr | Кража. |
| ZH | Src | 情况就是这样 |
| | Ctx | 斐济人就好像生来就是打 7人制橄榄球的, 而英国队仍是初出茅庐 |
| | Ctx-tgt | It was as if Fiji had been born to play 7s, while GB are still learning the trade. |
| | Ref | Which is pretty much how it is. |
| | Contr | That is pretty much how it is. |

Table 3: Examples of contrastive test data. Src: source; Ctx: source-side context; Ctx-tgt: target-side context; Ref: correct reference; Contr: contrastive (incorrect) variant.
4 Context-aware translation results
We begin our experiments by confirming that concatenated models are indeed able to model context dependencies (Section 4.1). We follow by testing the hypothesis that larger models are better suited for learning the more complex contextual training data (Section 4.2). In order to avoid over-fitting, we use EN-DE as a development language and subsequently test identical settings on FR, RU and ZH in Section 4.3.
4.1 Multi-segment models
For all four language arcs, Ctx models use 6 encoder layers and 2 decoder layers (44M parameters) and are trained using both segments in isolation as well as concatenated context segments. At inference, DE, FR and ZH models use one preceding context sentence, matching the training. However, over 60% of the targeted RU data exhibits longer dependencies, of up to 3 previous segments. For this reason, targeted EN-RU testing concatenates all three context sentences. Baseline (Bl) models use the same original training data and model architecture, this time trained and used to translate one segment at a time.
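A sketch of this inference procedure for a variable number of preceding context sentences (the marker tokens follow Table 1; `translate` is a placeholder for decoding with the trained model):

```python
def translate_in_context(doc_sentences, index, translate, n_context=1,
                         start="<start>", sep="<sep>", end="<end>"):
    """Translate doc_sentences[index] using up to n_context preceding sentences."""
    context = doc_sentences[max(0, index - n_context):index]
    segments = context + [doc_sentences[index]]
    src = f"{start} " + f" {sep} ".join(segments) + f" {end}"
    out = translate(src)
    # Only the translation of the last segment is kept; the context
    # translations generated before the final separator are discarded.
    return out.replace(start, "").replace(end, "").split(sep)[-1].strip()
```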
Results are shown in Table 4. As observed in previous work, concatenated models are considerably better than their context-ignoring counterparts, particularly on targeted test sets. In contrastive testing, accuracy increases by 20-30% absolute in all languages, with the exception of the lexical cohesion data set in RU and the anaphora data set in ZH.
For non-targeted testing, Ctx models significantly outperform the Bl models in 4 out of the 6 test sets. This differs from previous work, where contextual models using the concatenation approach are reported to degrade BLEU scores: Tiedemann and Scherrer (2017) measure a 0.6 BLEU drop, Voita et al. (2019b) show a 0.84 drop for RU, Lopes et al. (2020) report a drop of 1.2, and Junczys-Dowmunt (2019a) shows a BLEU degradation of 1.5. These results indicate that our approach of training the contextual model with both contextual and non-contextual data alleviates the issue of quality degradation.
| Arc | Metric | Test set | Targeted | Bl | Ctx |
|---|---|---|---|---|---|
| DE | BLEU | TED | | 19.9 | 22.4 |
| | BLEU | News | | 26.1 | 29.5 |
| | BLEU | Europarl | | 29.3 | 31.5 |
| | BLEU | ContraPro | ✓ | 20.1 | 21.1 |
| | Acc | ContraPro | ✓ | 0.50 | 0.70 |
| FR | BLEU | IWSLT | | 40.0 | 39.7 |
| | BLEU | LCPT | ✓ | 27.9 | 32.5 |
| | Acc | LCPT | ✓ | 0.74 | 0.87 |
| | Acc | Anaphora | ✓ | 0.50 | 0.72 |
| RU | BLEU | WMT20 | | 13.6 | 14.6 |
| | Acc | Deixis | ✓ | 0.50 | 0.83 |
| | Acc | Lex-coh | ✓ | 0.46 | 0.47 |
| | Acc | Ellipsis-vp | ✓ | 0.20 | 0.60 |
| | Acc | Ellipsis-infl | ✓ | 0.52 | 0.68 |
| ZH | BLEU | WMT20 | | 21.2 | 21.4 |
| | Acc | Eval-anaphora | ✓ | 0.58 | 0.61 |

Table 4: Baseline (Bl) vs. contextual (Ctx) models on non-targeted and targeted test sets. LCPT: Large-contrastive-pronoun-testset-EN-FR.
4.2 Increasing model capacity
As the multi-segment models are trained on data exhibiting longer dependencies, we investigate the hypothesis that increased model capacity is needed to learn the more complex data distribution.
Starting with the baseline model used in the previous experiments (Ctx), we investigate two ways to increase its capacity: increasing the depth of the decoder or increasing the width of the feed-forward layers. We test four increased model capacities by incrementally adding 2 decoder layers to the base model (deep models). For each deep model, we also create an equivalent wide model containing the same number of parameters. This leads to parameter counts ranging from 44M (the baseline setting) to 76M. We leave all other settings unchanged. Table 14 in the Appendix details these architectures.
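As a rough sanity check of how deep and wide variants are matched in capacity, the per-layer parameter counts of a Transformer with d_model = 512 (the Transformer-base value, assumed here; embeddings, biases and layer norms are ignored since they are shared across configurations) can be estimated as follows:

```python
D_MODEL = 512  # assumed Transformer-base hidden size

def attn_params(d=D_MODEL):
    # Q, K, V and output projections of one attention block.
    return 4 * d * d

def ff_params(ff_width, d=D_MODEL):
    # Two linear layers of the position-wise feed-forward block.
    return 2 * d * ff_width

def layer_params(ff_width, is_decoder, d=D_MODEL):
    # Decoder layers carry an extra cross-attention block.
    n_attn = 2 if is_decoder else 1
    return n_attn * attn_params(d) + ff_params(ff_width, d)

def extra_params(enc, dec, ff_width, base_enc=6, base_dec=2, base_ff=2048):
    base = base_enc * layer_params(base_ff, False) + base_dec * layer_params(base_ff, True)
    new = enc * layer_params(ff_width, False) + dec * layer_params(ff_width, True)
    return (new - base) / 1e6  # millions of parameters added over the 44M baseline

# Deep-52 (6:4, FF 2048) and Wide-52 (6:2, FF 3072) add roughly the same parameters:
print(round(extra_params(6, 4, 2048), 1), round(extra_params(6, 2, 3072), 1))  # 8.4 8.4
```

Under these assumptions, both the Deep-52 and Wide-52 variants add roughly 8M parameters over the 44M baseline, consistent with Table 14.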
Figure 1: Non-targeted (BLEU) and targeted (ContraPro BLEU and accuracy) results for deep and wide models of increasing capacity.
Non-targeted and targeted testing results are shown in Figure 1. Results show that larger-capacity models deliver increased performance across the board. In both cases most of the performance gain comes from the 52M-capacity model: +2.2 BLEU in non-targeted testing and +0.4 BLEU/+5% accuracy on the ContraPro pronoun test. However, targeted metrics show further improvements with increased depth, up to a +8% absolute accuracy gain with the 6:8 encoder:decoder configuration. While deeper and wider models perform similarly in non-targeted testing, deeper models are clearly superior on the pronoun translation task: wider models improve accuracy from 70% to a maximum of 73%, while deep models achieve 78%.
| Metric | Test set | Targeted | Ctx used | Bl | Ctx | Ctx-Deep | Ctx-Wide |
|---|---|---|---|---|---|---|---|
| BLEU | TED | | | 19.9 | 22.1 | 24.3 | 24.3 |
| | TED | | ✓ | - | 22.4 | 24.4 | 24.4 |
| | News | | | 26.1 | 29.7 | 31.3 | 31.3 |
| | News | | ✓ | - | 29.5 | 31.8 | 31.5 |
| | Europarl | | | 29.3 | 31.5 | 34.4 | 34.5 |
| | Europarl | | ✓ | - | 31.5 | 34.4 | 34.7 |
| BLEU | ContraPro | ✓ | | 20.1 | 19.1 | 21.2 | 20.6 |
| | ContraPro | ✓ | ✓ | - | 20.4 | 22.9 | 22.3 |
| Acc | ContraPro | ✓ | | 0.49 | 0.50 | 0.51 | 0.50 |
| | ContraPro | ✓ | ✓ | - | 0.70 | 0.78 | 0.73 |

Table 5: EN-DE results for baseline, standard, deep and wide contextual models, tested with and without context at inference time (Ctx used).
As noted in Section 4.1, Ctx models do not perform worse than sentence-level baselines on any of the contextual data sets, most likely due to the joint single- and multi-segment training regime. Next we investigate whether this training leads to "multi-task" models that maintain baseline performance when used without context. These experiments use the previously trained models, and contextual models are tested with and without context at inference time. We contrast baseline single-segment models of standard capacity (Bl), similar multi-segment models (Ctx), and the best Deep and Wide models as previously determined (capacity 68M and 76M respectively).
Results are shown in Table 5. Interestingly, in non-targeted testing, multi-segment models used without context approach the optimal performance. Therefore the improvements due to the use of context are smaller than indicated by Section 4.1: +0.5 and +0.2 BLEU in News and negligible or non-existent in the other domains (a related observation was made in Lopes et al. (2020), where it is shown that, if strong baselines are used, no contextual model tested brings any improvements over the context-ignoring baselines on the IWSLT sets for DE and FR). Specifically, standard-capacity Ctx models outperform Bl ones by +2/+3 BLEU points when used without context. Note that these only differ in the use of training data: Ctx duplicates the training data by concatenating two adjacent segments.
In targeted tests, the multi-segment models that ignore context do not outperform the baselines. This confirms that ContraPro is a good benchmark for isolating context-aware phenomena, as the task cannot be solved with a strong baseline.
4.3 Results on FR, RU and ZH translation
In this section we investigate whether EN-DE results carry over to the FR, RU and ZH translation tasks, by testing the optimal DE configurations without any additional language- or task-specific tuning. We test the baseline model (single-segment, standard capacity) against the best contextual model as determined on DE, the multi-segment model using 6 encoder and 8 decoder layers.
| Arc | Metric | Test set | Targeted | Ctx used | Bl | Ctx-Deep |
|---|---|---|---|---|---|---|
| FR | BLEU | IWSLT | | | 40.0 | 40.8 |
| | | IWSLT | | ✓ | - | 40.0 |
| | BLEU | LCPT | ✓ | | 27.9 | 31.5 |
| | | LCPT | ✓ | ✓ | - | 32.3 |
| | Acc | LCPT | ✓ | | 0.74 | 0.79 |
| | | LCPT | ✓ | ✓ | - | 0.90 |
| RU | BLEU | WMT20 | | | 13.6 | 18.9 |
| | | WMT20 | | ✓ | - | 16.5 |
| | Acc | deixis | ✓ | | 0.50 | 0.51 |
| | | deixis | ✓ | ✓ | - | 0.85 |
| | Acc | lex-coh | ✓ | | 0.46 | 0.46 |
| | | lex-coh | ✓ | ✓ | - | 0.48 |
| | Acc | ellipsis-vp | ✓ | | 0.20 | 0.25 |
| | | ellipsis-vp | ✓ | ✓ | - | 0.73 |
| | Acc | ellipsis-infl | ✓ | | 0.52 | 0.55 |
| | | ellipsis-infl | ✓ | ✓ | - | 0.80 |
| ZH | BLEU | WMT20 | | | 21.2 | 22.1 |
| | | WMT20 | | ✓ | - | 22.4 |
| | Acc | Eval-anaphora | ✓ | | 0.58 | 0.60 |
| | | Eval-anaphora | ✓ | ✓ | - | 0.62 |

Table 6: Baseline vs. deep contextual models on FR, RU and ZH, tested with and without context at inference time (Ctx used).
Table 6 shows the results. The DE results carry over to FR, RU and ZH to a large extent. Except for ZH, on non-targeted testing the best performance is obtained by the deep multi-segment models used without context. On further analysis of the non-targeted EN-RU test set, where the drop is of over 2 BLEU points, we observed that the length divergence between training and test segments is significant; quality drops dramatically when the segment length deviates by more than 4 standard deviations (SD) from the training median (see Appendix C for a detailed analysis of segment length variation).

5 Distillation
| | Model | Non-targeted BLEU | Targeted BLEU | Targeted Acc |
|---|---|---|---|---|
| EN-DE | Ctx-Deep | 32.0 | 22.9 | 0.78 |
| | Student | 31.4 | 23.1 | 0.73 |
| | Ctx | 29.5 | 20.1 | 0.70 |
| EN-FR | Ctx-Deep | 40.0 | 32.3 | 0.90 |
| | Student | 40.4 | 32.1 | 0.88 |
| | Ctx | 39.7 | 32.5 | 0.87 |
| EN-RU (deixis) | Ctx-Deep | 16.5 | - | 0.85 |
| | Student | 16.2 | - | 0.84 |
| | Ctx | 14.6 | - | 0.83 |
| EN-RU (lex-coh) | Ctx-Deep | - | - | 0.48 |
| | Student | - | - | 0.46 |
| | Ctx | - | - | 0.46 |
| EN-RU (ellipsis-vp) | Ctx-Deep | - | - | 0.73 |
| | Student | - | - | 0.66 |
| | Ctx | - | - | 0.60 |
| EN-RU (ellipsis-infl) | Ctx-Deep | - | - | 0.80 |
| | Student | - | - | 0.69 |
| | Ctx | - | - | 0.68 |
| ZH-EN | Ctx-Deep | 22.3 | - | 0.62 |
| | Student | 22.2 | - | 0.59 |
| | Ctx | 21.4 | - | 0.60 |

Table 7: Teacher (Ctx-Deep), distilled student and standard-capacity contextual (Ctx) models on non-targeted and targeted test sets.
The multi-segment models with increased capacity are computationally less efficient than standard single-segment models, with deeper decoders and longer outputs impacting latency and wider models impacting memory.
This section investigates the effectiveness of Knowledge Distillation (KD) in compressing contextual models to the original model capacity. We employ sequence-level KD as proposed by Kim and Rush (2016). Specifically, the Deep-Ctx models are the teachers used to translate the training/development data, and students are trained on data containing both references and teacher output, as recommended in Gordon and Duh (2019) among others. Students with 6:2 encoder:decoder layers and 44M parameters are trained with the loss

$$\mathcal{L} = -\sum_{t=1}^{T}\sum_{k=1}^{|V|} \mathbb{1}\{\hat{y}_t = k\}\,\log p(y_t = k \mid \hat{y}_{<t}, x),$$

where $\hat{y}$ is the teacher prediction, $T$ is the target length and $|V|$ is the size of the vocabulary.
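A minimal sketch of the sequence-level KD data preparation described above (`teacher_translate` is a placeholder for decoding with the Deep-Ctx teacher; the student is then trained on the returned pairs with the usual cross-entropy loss):

```python
def build_kd_training_data(sources, references, teacher_translate):
    """Sequence-level KD data: keep the human references and add teacher outputs.

    Returns a list of (source, target) pairs containing both the original
    parallel data and the distilled (teacher-generated) data.
    """
    teacher_targets = [teacher_translate(src) for src in sources]
    pairs = list(zip(sources, references))          # original parallel data
    pairs += list(zip(sources, teacher_targets))    # distilled data
    return pairs
```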
5.1 Results
Results in Table 7 show KD can indeed enhance a smaller model’s performance on discourse phenomena. Overall, the performance of the distilled models lies between that of standard capacity models and that of deep models.
We observe that, unexpectedly, on several test sets the student model is better than the teacher. We analyze one such test set, EN-DE WMT19, where teacher and student achieve 30.6 and 31.1 BLEU respectively. We observed that some of this data is paragraph-level rather than sentence-level, again leading to a train-test mismatch. We analyzed performance on the test set against the input length distribution seen in the training data. Results (Table 8) show that the student outperforms the teacher when the input length is above 2 standard deviations from the median length.
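The length-bucket analysis can be sketched as follows (a sketch using sacrebleu for scoring; `m` and `sd` denote the median and standard deviation of training source lengths, and whitespace tokenization is an assumption made for illustration):

```python
import sacrebleu

def bleu_by_length_bucket(sources, hypotheses, references, m, sd):
    """Group test sentences by how far their source length is from the
    training median (in units of training SD) and score each bucket."""
    buckets = {"(0,m]": [], "(m,m+2sd]": [], "(m+2sd,m+4sd]": [], "(m+4sd,inf)": []}
    for src, hyp, ref in zip(sources, hypotheses, references):
        length = len(src.split())
        if length <= m:
            key = "(0,m]"
        elif length <= m + 2 * sd:
            key = "(m,m+2sd]"
        elif length <= m + 4 * sd:
            key = "(m+2sd,m+4sd]"
        else:
            key = "(m+4sd,inf)"
        buckets[key].append((hyp, ref))
    return {key: sacrebleu.corpus_bleu([h for h, _ in pairs],
                                       [[r for _, r in pairs]]).score
            for key, pairs in buckets.items() if pairs}
```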
We hypothesize that student translations are more robust to variations and less context-sensitive, due to the simpler data distribution the student is trained on (Zhou et al. (2020) indeed show that distilled data is less complex under a complexity metric based on cross-entropy across word alignments). While we leave further exploration to future work, we perform a simple experiment to measure the variation observed in translation when context is used or ignored. We measure this irrespective of quality, as the percentage of translations that change at all when context is used. Table 9 shows that student models are indeed less context-sensitive. This is observed on the targeted test set, ContraPro, where translation changes are expected; the gap is even more pronounced on non-targeted test sets, where context-dependence is not part of the data set design.
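Measuring this context-sensitivity amounts to a sentence-by-sentence comparison of the outputs produced with and without the preceding context; a minimal sketch:

```python
def context_sensitivity(outputs_with_context, outputs_without_context):
    """Percentage of test sentences whose translation changes at all
    when the preceding context is added to the input."""
    changed = sum(with_ctx.strip() != without_ctx.strip()
                  for with_ctx, without_ctx in zip(outputs_with_context,
                                                   outputs_without_context))
    return 100.0 * changed / len(outputs_with_context)
```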
| Len. range | Ctx-Deep | Student |
|---|---|---|
| (0, m) | 22.7 | 23.0 |
| (m, m+2*SD] | 32.3 | 31.8 |
| (m+2*SD, m+4*SD] | 32.2 | 32.7 |
| (m+4*SD, ) | 25.1 | 28.1 |

Table 8: EN-DE WMT19 BLEU by source length bucket (m: median training segment length; SD: standard deviation).
| Test set | Ctx-Deep | Student |
|---|---|---|
| TED | 57.0% | 47.0% |
| News | 60.5% | 49.7% |
| Europarl | 53.6% | 42.2% |
| ContraPro | 65.0% | 57.5% |

Table 9: Percentage of translations that change when context is used.
6 Human evaluation
The scoring accuracy metric used with targeted contrastive references does not measure whether the system can actually generate the correct pronoun. In a recent analysis, Vamvas and Sennrich (2021) show that scoring human-written contrastive references may lead to false positives, in particular for distilled NMT models which have not been exposed to the real data distribution during training.
To address this limitation, we complement the automatic metric with human evaluation of the EN-DE teacher and student models. We sampled 250 examples from ContraPro, selecting samples where the antecedent is in the previous sentence. Translators were shown the two consecutive source sentences and their translations. They were asked to rate the quality of both sentences, the context and the target, on a scale of 1 to 6 (with increments of 0.2) and to mark whether the anaphoric pronoun "it" was correctly translated in the target sentence. Each translator performed two tasks: the first task was to compare the context-ignoring baseline (Bl) and the Ctx-Deep model; the second task was to compare the same baseline with the student model (Student). With this setup, we grounded the evaluation by showing the baseline translations in both tasks.
The inter-annotator agreement on ranking the baseline and teacher models with respect to generic quality of the target sentence (the ranking is derived from the quality scores) is good, at 0.55 Krippendorff's Alpha (Hayes and Krippendorff, 2007). On assessing the correctness of pronoun translation, the agreement is very high for the teacher at 0.86 and high for the student at 0.66. (We believe annotator fatigue contributed to the decrease in agreement, as translators first judged the teacher outputs and later the student. For the baseline model, which was judged twice, translators agree with themselves 86% of the time.)
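For reference, agreement values of this kind can be computed with the krippendorff Python package (an assumption on tooling; the toy reliability matrix below is purely illustrative, with rows as annotators, columns as items and np.nan for missing judgments):

```python
import numpy as np
import krippendorff

# Rows are annotators, columns are evaluated items; each cell holds the
# annotator's ranking decision (e.g. 1 = teacher better, 0 = tie, -1 = baseline better).
reliability_data = np.array([
    [1, 1, 0, -1, 1, np.nan],
    [1, 0, 0, -1, 1, 1],
])
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```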
We report the pronoun translation accuracy and the generic quality scores (averaged across annotators and sentences) for the target sentences in Table 10. These results show that Ctx-Deep is significantly more accurate at translating ambiguous pronouns than the baseline (71.8% vs 28.1%) and at the same time achieves better generic quality scores (+8% relative improvement). The student, which has the same capacity and architecture as the baseline, performs better than the automatic accuracy metric suggested: it retains most of the improved accuracy (61.6% vs 71.8%) and quality of the teacher model (+7% vs +8%).
Table 11 shows an example of translation where the student model disambiguates the anaphoric pronoun "it" and in addition makes better word choices than the baseline: the two occurrences of the verb "jump" ("springt") are correctly translated in both the context and the target sentence.
| Model | Targeted accuracy (%) | Δ | Quality score | Δ |
|---|---|---|---|---|
| Bl | 28.1 | - | 4.43 | - |
| Ctx-Deep | 71.8 | 42.0 | 4.81 | 8% |
| Student | 61.6 | 31.8 | 4.76 | 7% |

Table 10: Human evaluation of pronoun translation accuracy and generic quality.
| | Context sentence | Targeted sentence | Avg. scores |
|---|---|---|---|
| Src | This wounded bird is jumping all over the place out there. | And every time it jumps. It gives us more data that we can use. | |
| Bl | Dieser verwundete Vogel zieht sich dort über den gesamten Ort hinaus. | Und jedes Mal, wenn es springt, gibt es mehr Daten, die wir verwenden können. | 4.0 / 4.5 |
| Ctx-Student | Dieser verwundete Vogel springt dort draußen. | Und jedes Mal, wenn er springt, gibt er uns mehr Daten, die wir verwenden können. | 4.4 / 5.4 |

Table 11: Example from the human evaluation.
7 Comparison to previous results
Comparisons to previous work are not straightforward due to the variability in training data used, number of model parameters, hyper-parameter tuning and other train-time parameters.
Reported results as well as the replicated results below show that we compare favorably to previous work, despite not using language-specific tuning or techniques. The best EN-DE ContraPro accuracy using parallel data reported in Huo et al. (2020) is 0.83 on a model trained with 22.4M segments, which is 0.05 higher than our Ctx-Deep model, trained with 4M segments. On the same test set Fernandes et al. (2021) shows an accuracy score of 0.66, Lopes et al. (2020) reports a maximum performance of 0.71 while Maruf et al. (2019a) reports a maximum of 0.73/0.69 with an offline and an online model respectively.
On EN-FR, Yin et al. (2021b) report an accuracy of 0.91, to our knowledge the best reported result on this test set. We reach 0.90/0.86 accuracy on this test with the Ctx-Deep and the student model respectively. On EN-RU, Voita et al. (2019b) also obtain the best results when using concatenation models; however, on average these results are lower than the Ctx-Deep ones. Voita et al. (2019a) introduce DocRepair, a two-pass method which obtains optimal results but has considerable computational drawbacks compared to our proposal. DocRepair is tested on EN-RU and achieves 0.92 on deixis, 0.75/0.86 on ellipsis and an impressive 0.80 on lexical cohesion.
We were able to perform a side-by-side comparison of our proposed approach with that of Voita et al. (2019b) on EN-RU, due to the availability of their implementation. We reproduce the results of the CADec model by using the same EN-RU training data set and configuration as in the original paper (Voita et al., 2019b). The training data contains 7.5 million parallel segments, out of which 1.5 million are contextual. For comparison, we train a contextual model with 6:6 encoder:decoder layers, totaling the same number of parameters as CADec. During testing, the source segment is translated using its preceding context and the context translation is subsequently stripped from the output.
The results are shown in Table 12. We report results for both non-targeted and targeted test sets. Ctx-6:6 shows high quality, comparable to the state-of-the-art CADec model on three of the test sets, and better on two of them (ellipsis-vp and ellipsis-infl).
| Arc | Metric | Test set | Targeted | CADec | Ctx-6:6 |
|---|---|---|---|---|---|
| RU | BLEU | CADec test | | 30.1 | 30.4 |
| | Acc | Deixis | ✓ | 0.81 | 0.80 |
| | Acc | Lex-coh | ✓ | 0.47 | 0.48 |
| | Acc | Ellipsis-vp | ✓ | 0.69 | 0.73 |
| | Acc | Ellipsis-infl | ✓ | 0.56 | 0.75 |

Table 12: Side-by-side comparison of CADec and our Ctx-6:6 model on EN-RU.
8 Related work
8.1 Document translation
A straightforward way to include context, proposed by Tiedemann and Scherrer (2017), is to train a standard NMT model on pseudo-document parallel data obtained by concatenating two or more consecutive sentences. As large-scale document-level parallel data is not widely available, prior work explored data augmentation: augmenting the training data with back-translated monolingual documents (Junczys-Dowmunt, 2019b) and leveraging monolingual data to train a document-level repair system (Voita et al., 2019a).
To scale to larger context beyond the previous or next sentence, prior work proposed changes to the architecture to improve how context is compressed and attended to by the encoder and/or decoder: multi-encoder architectures (Bawden et al., 2018), hierarchical or sparse attention over context sentences (Maruf et al., 2019a; Miculicich et al., 2018; Bao et al., 2021), and incorporating diverse pretrained context representations that combine local and document-level context (Zhu et al., 2020; Donato et al., 2021). However, recent work (Fernandes et al., 2021) has shown that scaling up to longer contexts brings diminishing returns or can even hurt performance due to instability in training attention and positional embeddings (Nguyen et al., 2021).
8.2 Large models and Knowledge distillation
Prior work has shown that increasing model capacity in MT is beneficial, particularly when increasing the size or the diversity of the training data, such as for multilingual models. With respect to depth versus width, Kaplan et al. (2020) do an extensive comparison of model capacity for neural language models and find that increasing width and depth while maintaining capacity gives similar results.
Knowledge distillation (KD) (Hinton et al., 2015) has been introduced as a way to compress larger models, or ensembles thereof, into smaller, more computationally efficient models that reach similar performance. For sequence-to-sequence models, Kim and Rush (2016) introduced sequence-level knowledge distillation for improved MT. Subsequently, knowledge distillation has proved beneficial for non-autoregressive MT. In general, distillation is thought to be effective in low-data regimes, small data sets, domain adaptation and transfer learning (e.g. Currey et al. (2020)), which makes it particularly suitable for document-level translation, where parallel data is a bottleneck. A significant body of work has been devoted to understanding why distillation works (Gordon and Duh, 2019; Xu et al., 2021; Zhou et al., 2020, among others). While our work does not focus on investigating why distillation works, we do contribute the observation that students prove to be very robust and outperform teachers on out-of-distribution data when input length is considered.
9 Conclusion
In this paper we address the task of contextual translation by using multi-segment Transformer models. We show that we can successfully push the limits of this approach to achieve robust and high performance across several languages and benchmarks, without any language or task-specific tuning. This is achieved by training models to perform both contextual and single-segment translation, which has the added benefit of improving single-segment translation quality. We also show that – with fixed data conditions and model capacity – deeper models are superior to wider models in modeling contextual dependencies between pronouns and their antecedents.
We further show that the increased computational burden can be mitigated through distillation. Finally, we observe that distilled models are more robust than their teachers on long input, which opens a new direction for improving MT models through distillation.
In this paper we have kept the training data fixed and have not investigated any data manipulations that could lead to improved performance. In particular, our results indicate that standard document-level parallel data sets (such as the non-targeted sets used in this paper) exhibit a limited amount of discourse phenomena. In this light, multi-segment models trained on similar data may not learn to pay sufficient "attention" to context. In future work, we plan to investigate whether parallel data can be improved by measuring and controlling context-sensitivity.
10 Ethical Considerations
In this work we used professional translators to evaluate the quality and accuracy of our models on publicly available data sets. The professional translators were recruited by a language service provider and were compensated according to industry standards.
11 Limitations
Every approach has its limitations, and ours is no exception. Our contextual model adopts a simple approach of concatenating the previous sentence to the current sentence. While this improves the contextual model's performance significantly, we have not experimented with the effect of context size on model performance and have used the standard context length from the literature.
Also, when comparing with previous results, we have reproduced a state-of-the-art approach (CADec) on a standard data set for the EN-RU arc. For the remaining language pairs and test sets, we have cited previously reported results without reproducing them.
Finally, the targeted test sets used are limited with respect to the discourse phenomena they explore: anaphoric pronouns, lexical cohesion and verb forms. We have not investigated whether our approach impacts other discourse phenomena or whether it affects biases in translation.
References
- Bao et al. (2021) Guangsheng Bao, Yue Zhang, Zhiyang Teng, Boxing Chen, and Weihua Luo. 2021. G-transformer for document-level machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3442–3455, Online. Association for Computational Linguistics.
- Barrault et al. (2020) Loïc Barrault, Magdalena Biesialska, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Matthias Huck, Eric Joanis, Tom Kocmi, Philipp Koehn, Chi-kiu Lo, Nikola Ljubešić, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Santanu Pal, Matt Post, and Marcos Zampieri. 2020. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55, Online. Association for Computational Linguistics.
- Bawden et al. (2018) Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304–1313, New Orleans, Louisiana. Association for Computational Linguistics.
- Cettolo et al. (2012) Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the 16th Annual conference of the European Association for Machine Translation, pages 261–268, Trento, Italy. European Association for Machine Translation.
- Currey et al. (2020) Anna Currey, Prashant Mathur, and Georgiana Dinu. 2020. Distilling multiple domains for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4500–4511, Online. Association for Computational Linguistics.
- Domhan et al. (2020) Tobias Domhan, Michael Denkowski, David Vilar, Xing Niu, Felix Hieber, and Kenneth Heafield. 2020. The sockeye 2 neural machine translation toolkit at AMTA 2020. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 110–115, Virtual. Association for Machine Translation in the Americas.
- Donato et al. (2021) Domenic Donato, Lei Yu, and Chris Dyer. 2021. Diverse pretrained context encodings improve document translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1299–1311, Online. Association for Computational Linguistics.
- Fernandes et al. (2021) Patrick Fernandes, Kayo Yin, Graham Neubig, and André F. T. Martins. 2021. Measuring and increasing context usage in context-aware machine translation. In Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), Virtual.
- Gordon and Duh (2019) Mitchell A. Gordon and Kevin Duh. 2019. Explaining sequence-level knowledge distillation as data-augmentation for neural machine translation.
- Grundkiewicz et al. (2021) Roman Grundkiewicz, Marcin Junczys-Dowmunt, Christian Federmann, and Tom Kocmi. 2021. On user interfaces for large-scale document-level human evaluation of machine translation outputs. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pages 97–106, Online. Association for Computational Linguistics.
- Hassan et al. (2018) Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving human parity on automatic chinese to english news translation. CoRR, abs/1803.05567.
- Hayes and Krippendorff (2007) Andrew Hayes and Klaus Krippendorff. 2007. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1:77–89.
- Hinton et al. (2015) Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. CoRR, abs/1503.02531.
- Huo et al. (2020) Jingjing Huo, Christian Herold, Yingbo Gao, Leonard Dahlmann, Shahram Khadivi, and Hermann Ney. 2020. Diving deep into context-aware neural machine translation. CoRR, abs/2010.09482.
- Junczys-Dowmunt (2019a) Marcin Junczys-Dowmunt. 2019a. Microsoft translator at wmt 2019: Towards large-scale document-level neural machine translation. arXiv preprint arXiv:1907.06170.
- Junczys-Dowmunt (2019b) Marcin Junczys-Dowmunt. 2019b. Microsoft translator at WMT 2019: Towards large-scale document-level neural machine translation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 225–233, Florence, Italy. Association for Computational Linguistics.
- Jwalapuram et al. (2020a) Prathyusha Jwalapuram, Shafiq Joty, and Youlin Shen. 2020a. Pronoun-targeted fine-tuning for nmt with hybrid losses. arXiv preprint arXiv:2010.07638.
- Jwalapuram et al. (2019) Prathyusha Jwalapuram, Shafiq Joty, Irina Temnikova, and Preslav Nakov. 2019. Evaluating pronominal anaphora in machine translation: An evaluation measure and a test suite. arXiv preprint arXiv:1909.00131.
- Jwalapuram et al. (2020b) Prathyusha Jwalapuram, Barbara Rychalska, Shafiq Joty, and Dominika Basaj. 2020b. Can your context-aware mt system pass the dip benchmark tests? : Evaluation benchmarks for discourse phenomena in machine translation.
- Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. CoRR, abs/2001.08361.
- Kim and Rush (2016) Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.
- Läubli et al. (2018) Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has machine translation achieved human parity? a case for document-level evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796, Brussels, Belgium. Association for Computational Linguistics.
- Li et al. (2020) Bei Li, Hui Liu, Ziyang Wang, Yufan Jiang, Tong Xiao, Jingbo Zhu, Tongran Liu, and Changliang Li. 2020. Does multi-encoder help? A case study on context-aware neural machine translation. CoRR, abs/2005.03393.
- Lopes et al. (2020) António Lopes, M. Amin Farajian, Rachel Bawden, Michael Zhang, and André F. T. Martins. 2020. Document-level neural MT: A systematic comparison. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 225–234, Lisboa, Portugal. European Association for Machine Translation.
- Maruf et al. (2019a) Sameen Maruf, André F. T. Martins, and Gholamreza Haffari. 2019a. Selective attention for context-aware neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3092–3102, Minneapolis, Minnesota. Association for Computational Linguistics.
- Maruf et al. (2019b) Sameen Maruf, Fahimeh Saleh, and Gholamreza Haffari. 2019b. A survey on document-level machine translation: Methods and evaluation. CoRR, abs/1912.08494.
- Miculicich et al. (2018) Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947–2954, Brussels, Belgium. Association for Computational Linguistics.
- Müller et al. (2018) Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. 2018. A Large-Scale Test Set for the Evaluation of Context-Aware Pronoun Translation in Neural Machine Translation. In WMT 2018, Brussels, Belgium. Association for Computational Linguistics.
- Nguyen et al. (2021) Toan Q. Nguyen, Kenton Murray, and David Chiang. 2021. Data augmentation by concatenation for low-resource translation: A mystery and a solution. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 287–293, Bangkok, Thailand (online). Association for Computational Linguistics.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
- Tiedemann and Scherrer (2017) Jörg Tiedemann and Yves Scherrer. 2017. Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 82–92, Copenhagen, Denmark. Association for Computational Linguistics.
- Vamvas and Sennrich (2021) Jannis Vamvas and Rico Sennrich. 2021. On the limits of minimal pairs in contrastive evaluation. In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 58–68, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Vaswani et al. (2017a) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. CoRR, abs/1706.03762.
- Vaswani et al. (2017b) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Voita et al. (2019a) Elena Voita, Rico Sennrich, and Ivan Titov. 2019a. Context-aware monolingual repair for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 877–886, Hong Kong, China. Association for Computational Linguistics.
- Voita et al. (2019b) Elena Voita, Rico Sennrich, and Ivan Titov. 2019b. When a good translation is wrong in context: Context-aware machine translation improves on deixis, ellipsis, and lexical cohesion. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1198–1212, Florence, Italy. Association for Computational Linguistics.
- Xu et al. (2021) Weijia Xu, Shuming Ma, Dongdong Zhang, and Marine Carpuat. 2021. How does distilled data complexity impact the quality and confidence of non-autoregressive machine translation? In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4392–4400, Online. Association for Computational Linguistics.
- Yin et al. (2021a) Kayo Yin, Patrick Fernandes, André F. T. Martins, and Graham Neubig. 2021a. When does translation require context? a data-driven, multilingual exploration.
- Yin et al. (2021b) Kayo Yin, Patrick Fernandes, Danish Pruthi, Aditi Chaudhary, André F. T. Martins, and Graham Neubig. 2021b. Do context-aware translation models pay the right attention? CoRR, abs/2105.06977.
- Zhou et al. (2020) Chunting Zhou, Jiatao Gu, and Graham Neubig. 2020. Understanding knowledge distillation in non-autoregressive machine translation. In International Conference on Learning Representations.
- Zhu et al. (2020) Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. 2020. Incorporating BERT into neural machine translation. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Appendix A Experimental Settings
NMT models were built using the Transformer-base architecture (Vaswani et al., 2017b). The source embeddings, target embeddings, and the output layer’s weight matrix are tied. Training is done on 8 GPUs with Sockeye 2’s large batch training.
Parameters for training a 6:8 encoder:decoder model (same parameters are used for the other encoder:decoder configurations):
All data except Chinese (ZH) is tokenized using the Python Moses tokenizer at https://github.com/alvations/sacremoses. Chinese is tokenized using the Jieba tokenizer at https://github.com/fxsjy/jieba. Words were segmented using BPE (Sennrich et al., 2016) with 32K operations. Source and target subwords shared the same vocabulary, and training segments longer than 95 tokens were removed.
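A sketch of this tokenization step using the sacremoses and jieba packages (the per-language handling shown here is illustrative; BPE is applied afterwards with subword-nmt and 32K merge operations):

```python
from sacremoses import MosesTokenizer
import jieba

moses_en = MosesTokenizer(lang="en")
moses_de = MosesTokenizer(lang="de")

def tokenize(text: str, lang: str) -> str:
    # Chinese is segmented with Jieba; all other languages use the Moses tokenizer.
    if lang == "zh":
        return " ".join(jieba.cut(text))
    tokenizer = moses_en if lang == "en" else moses_de
    return tokenizer.tokenize(text, return_str=True)

print(tokenize("Well, put it out, why don't you?", "en"))
print(tokenize("情况就是这样", "zh"))
```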
| LP | #segments | Data source |
|---|---|---|
| EN-DE | 4M | WMT, 2019 |
| EN-FR | 9.8M | Open Subtitles 2018 |
| EN-RU | 8.7M | Open Subtitles 2018 |
| ZH-EN | 17.4M | UN Parallel Corpus V1.0 |

Table 13: Training data sizes and sources.
Appendix B Multi-segment translation models
Our contextual model is a multi-segment model. The document-level parallel data is transformed to contain concatenated, multi-segment input for our models. Specifically, we concatenate two consecutive segments, both source and target, to create a new data point. Training a model with just concatenated inputs would render the model useless for translating isolated segments, while training a model with just isolated segments would not be useful for contextual translation. Thus we aim to create a single translation model which can perform both translation in-context and in isolation. We do this by duplicating the training data so that it contains both the concatenated (in-context) and the isolated version of each segment. See Algorithm 1 for the pseudo-code of this concatenation process.
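A minimal sketch of this process, assuming documents are given as lists of (source, target) sentence pairs and using the marker tokens from Table 1:

```python
START, SEP, END = "<start>", "<sep>", "<end>"

def build_training_data(documents):
    """documents: list of documents, each a list of (source, target) sentence pairs.

    Returns the joint training set: every sentence pair appears once in isolation
    and once concatenated with its preceding sentence (when one exists).
    """
    data = []
    for doc in documents:
        for i, (src, tgt) in enumerate(doc):
            # Isolated (single-segment) data point.
            data.append((f"{START} {src} {END}", f"{START} {tgt} {END}"))
            if i > 0:
                prev_src, prev_tgt = doc[i - 1]
                # Multi-segment data point: previous sentence as context.
                data.append((f"{START} {prev_src} {SEP} {src} {END}",
                             f"{START} {prev_tgt} {SEP} {tgt} {END}"))
    return data
```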
Appendix C Results
Comparison to previous results
The best EN-DE ContraPro accuracy using parallel data reported in Fernandes et al. (2021) is 0.66, obtained with a concatenated input method. Among all the methods tested, this is only outperformed by a model that additionally performs pre-training on a large monolingual corpus (0.80 accuracy, 0.02 higher than our Ctx-Deep model). On the same test set Lopes et al. (2020) reports a maximum performance of 0.71 while Maruf et al. (2019a) reports a maximum of 0.73/0.69 with an offline and an online model respectively.
On EN-FR Lopes et al. (2020) report an accuracy of 0.83, to our knowledge the best reported results on this test set. We reach 0.90/0.86 accuracy on this test with the Ctx-Deep and the student model respectively. On EN-RU Voita et al. (2019b) also obtains the best results when using concatenation models, however on average these results are lower than the Ctx-Deep ones. Voita et al. (2019a) introduces DocRepair, a two-pass method which obtains optimal results but has considerable computational drawbacks compared to our proposal. DocRepair is tested on EN-RU and achieves 0.92 on deixis, 0.75/0.86 on ellipsis and an impressive 0.80 on lexical cohesion.
| Model | Enc. blocks | Dec. blocks | FF size | Total param. |
|---|---|---|---|---|
| Ctx | 6 | 2 | 2048 | 44M |
| Deep-52 | 6 | 4 | 2048 | 52M |
| Wide-52 | 6 | 2 | 3072 | 52M |
| Deep-60 | 6 | 6 | 2048 | 60M |
| Wide-60 | 6 | 2 | 4096 | 60M |
| Deep-68 | 6 | 8 | 2048 | 68M |
| Wide-68 | 6 | 2 | 5120 | 68M |
| Deep-76 | 6 | 10 | 2048 | 76M |
| Wide-76 | 6 | 2 | 6144 | 76M |

Table 14: Deep and wide model architectures and their parameter counts.
C.1 Segment Length Analysis
As shown in Table 6, on non-targeted test sets the best performance is obtained by the deep multi-segment models used without context. For the English (EN) to Russian (RU) arc, the results show a significant difference in performance between translation done with and without context.

We analyze the non-targeted test set with respect to the segment length of the training data. The median segment length of the training data is 15, with an SD of 9. We further split the test set into 4 partitions with respect to this training data median and SD. Table 15 shows the performance of each partition when translation is done with and without context. We see that when the test segment length deviates by more than 4 SD, the performance of translation with context degrades significantly, and the majority of test segments fall into this partition. We think this length deviation between training and test sets is the reason behind the performance drop between with- and without-context translation.
| Len. range | #Segments | BLEU (no context) | BLEU (with context) |
|---|---|---|---|
| (0, m) | 22 | 5.7 | 6.5 |
| (m, m+2*SD] | 351 | 17.3 | 19.1 |
| (m+2*SD, m+4*SD] | 652 | 20.3 | 21.5 |
| (m+4*SD, ) | 3077 | 19.6 | 16.7 |

Table 15: EN-RU non-targeted BLEU by source length partition (m: training median; SD: standard deviation).