
A baseline revisited: Pushing the limits of multi-segment models for context-aware translation

This paper addresses the task of contextual translation using multi-segment models. Specifically, we show that increasing model capacity further pushes the limits of this approach and that deeper models are better suited to capturing context dependencies. Furthermore, improvements observed with larger models can be transferred to smaller models using knowledge distillation. Our experiments show that this approach achieves competitive performance across several languages and benchmarks, without additional language-specific tuning or task-specific architectures.



1 Introduction

The quality of NMT (Neural Machine Translation) models has been improving over the years and is narrowing the gap to human translation performance (Hassan et al., 2018). Until recently, most MT research has focused on translating and evaluating sentences in isolation, ignoring the context in which these sentences occur. Simplifying the translation task this way has its advantages: data sets are easier to create, models are computationally more efficient and human evaluations are faster (with full document context, annotation time per task increases by 68% according to Grundkiewicz et al. (2021)).

While initial work failed to show significant differences in standard metrics (Tiedemann and Scherrer, 2017), the impact of ignoring context has been investigated more closely in recent years (Yin et al., 2021b). Targeted testing has shown poor performance on discourse-related phenomena (Müller et al., 2018; Bawden et al., 2018; Voita et al., 2019a; Jwalapuram et al., 2020b; Maruf et al., 2019b; Li et al., 2020); see Table 3 for examples. Furthermore, without context, human evaluation fails to expose all translation errors and leads to rushed conclusions on achieving human parity (Läubli et al., 2018). It is thus important to start addressing the MT task in a formulation that is closer to its true complexity and bridges the gap to the real communication needs of the users.

This paper tackles the problem of context-aware translation by re-visiting a straightforward multi-sentence translation approach which is considered a baseline in the literature. Our comprehensive experiments show that by leveraging deeper transformer models in combination with knowledge distillation methods, this baseline leads to an effective and robust alternative to specialized architectures proposed in the literature. The paper’s contributions are:

  • We show that multi-sentence translation can benefit from increased-capacity transformer models and that deeper models are better at learning contextual dependencies than wider models.

  • We further show that distilled models can learn contextual dependencies from larger models, while reducing computational cost and increasing robustness to input length variations.

  • Finally, results on four language pairs confirm that the approach achieves high performance for both contextual and single-segment translation tasks.

2 Multi-segment translation models

Throughout this paper, we implement context-aware translation models as multi-segment models, as initially proposed in Tiedemann and Scherrer (2017) and further used in Fernandes et al. (2021); Lopes et al. (2020) among others.

| Input | Output |
| --- | --- |
| <start> Fire? <sep> Well, put it out, why don’t you? <end> | <start> Ein Feuer? <sep> Na dann löscht er doch! <end> |
| <start> Well, put it out, why don’t you? <end> | <start> Na dann löscht er doch! <end> |

Table 1: Parallel training data contains both segments in isolation and concatenated segments. The example is illustrative and taken from the EN-DE anaphora test set (Müller et al., 2018). At inference time, only the translation of the target segment (the segment following <sep>, or the whole output for single-segment input) is used.

Multi-segment data points

We use document-level parallel data which is transformed to contain concatenated, multi-segment input. Specifically, we restrict this work to two consecutive sentences. The source and target sides are concatenated using a special delimiter token and added to the training. While not strictly a requirement, the special token allows the extraction of the context-aware translation for the second, target sentence. Prior context-aware architectures can be categorized with respect to the use of context as using: source-side, target-side or both. As it generates both sentence translations jointly, the multi-segment approach takes advantage of both source- and target-side context at train-time. However, it does not use the context reference translation during inference and multi-segment input is simply translated as a continuous output sequence.

Training data

We aim to create single translation models which can perform both translation in-context and in isolation. For this reason, we start from a training set including context for each parallel sentence and create a duplicate of it by removing the context information. All the contextual models (Ctx) are trained on this joint single- and multi-segment data, while the sentence-level baselines (Bl) use only single sentences. Note that although the data size varies between Bl and Ctx models, the data is effectively identical and all the models are trained using the same stopping criteria, thus conferring no special advantage to any of the models. Table 1 exemplifies the training data.
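The data construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_training_pairs`, the exact spelling of the delimiter token and the whitespace joining are assumptions made for the example.

```python
from typing import List, Tuple

SEP = "<sep>"  # assumed spelling of the special delimiter token


def build_training_pairs(doc: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Build joint single- and multi-segment training pairs for one document.

    `doc` is an ordered list of (source, target) sentence pairs.
    """
    # Every sentence pair is kept in isolation (the "duplicate without context").
    pairs = list(doc)
    # Each sentence that has a predecessor also appears concatenated with it.
    for (prev_src, prev_tgt), (src, tgt) in zip(doc, doc[1:]):
        pairs.append((f"{prev_src} {SEP} {src}", f"{prev_tgt} {SEP} {tgt}"))
    return pairs


doc = [
    ("Fire?", "Ein Feuer?"),
    ("Well, put it out, why don't you?", "Na dann löscht er doch!"),
]
pairs = build_training_pairs(doc)
```

For the two-sentence document above, the result holds both sentences in isolation plus one concatenated pair, which mirrors the layout of Table 1.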

3 Experimental setup

We perform experiments in four language arcs, English to German (EN-DE), English to French (EN-FR), English to Russian (EN-RU) and Chinese to English (ZH-EN).

3.1 Training

We use the WMT2019 data set for EN-DE, Open Subtitles 2018 for EN-FR and EN-RU and UN Parallel Corpus V1.0 for ZH-EN, all four containing document-level data. The data sets vary in size from 4M segments for EN-DE to 17.4M for ZH-EN (see Appendix A, Table 13 for details). Development data consists of the News task 2019 development set for DE, IWSLT 2019 for FR and newstest2019 for RU and ZH respectively. In all conditions the development data mirrors the training data, meaning that it is duplicated to contain both multi- and single-segment data for contextual models, and original and distilled data for distillation experiments. In preliminary experiments we found this to play an important role.

Models use the Transformer architecture Vaswani et al. (2017a). We start with a baseline architecture of 6:2 encoder:decoder layers and 2048 feed-forward width, which subsequent experiments increase in decoder depth and feed-forward width respectively. Training is done with Sockeye Domhan et al. (2020). See Appendix A for a complete list of training parameters.

3.2 Testing

We measure performance of contextual models using both targeted and non-targeted testing.

Non-targeted tests consist of contextual, document-level data which is not selected to focus on discourse phenomena. For EN-DE we use the test set splits made available in Maruf et al. (2019a): TED (2.3k segments), News-Commentary (3k) and Europarl (5.1k). We use IWSLT15 (1k) (Cettolo et al., 2012) for EN-FR, WMT newstest2020 (4k) (Barrault et al., 2020) for EN-RU and finally WMT newstest2020 (2k) (Barrault et al., 2020) for ZH-EN. While contextual models may improve performance on these data sets, previous work suggests that the effects are minimal in high-resource scenarios with strong sentence-level baselines (Lopes et al., 2020).

Targeted tests have been developed in order to evaluate performance on discourse phenomena. Table 2 lists the test sets used in this paper. (While highly relevant, the data created by Yin et al. (2021a) had not been released at the time of writing this paper.) These test sets contain contrastive translation pairs, consisting of a correct human-generated translation, and a variant of it where a pronoun, or another linguistic unit of interest, is swapped with an incorrect one. Table 3 shows examples from these data sets.

To complement accuracy of contrastive evaluations, we also use targeted test sets and their references to measure standard translation metrics.

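| LP | Type | Size | Source |
| --- | --- | --- | --- |
| EN-DE | Anaphora | 12,000 | Müller et al. (2018) |
| EN-FR | Anaphora | 12,000 | Lopes et al. (2020) |
| EN-RU | Deixis | 3,000 | Voita et al. (2019b) |
| EN-RU | Lex-coh | 2,000 | Voita et al. (2019b) |
| EN-RU | Ellipsis-vp | 500 | Voita et al. (2019b) |
| EN-RU | Ellipsis-infl | 500 | Voita et al. (2019b) |
| ZH-EN | Anaphora | 500 | Jwalapuram et al. (2019) |

Table 2: Targeted test sets used for evaluating discourse phenomena.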
| Arc | Field | Text |
| --- | --- | --- |
| DE | Src | I forgot to confide it to you. |
| DE | Ctx | What’s your plan? |
| DE | Ctx-tgt | Was hast du vor? |
| DE | Ref | Ich vergaß, es euch zu vertraun. |
| DE | Contr | Ich vergaß, sie euch zu vertraun. |
| FR | Src | And where’s it coming from? |
| FR | Ctx | A sort of mist. |
| FR | Ctx-tgt | Une sorte de brume. |
| FR | Ref | Et elle vient d’où ? |
| FR | Contr | Et il vient d’où ? |
| RU | Src | Identity theft. |
| RU | Ctx | And I solved another crime. |
| RU | Ctx-tgt | И этим решил еще одно преступление. |
| RU | Ref | Кражу. |
| RU | Contr | Кража. |
| ZH | Src | 情况就是这样 |
| ZH | Ctx | 斐济人就好像生来就是打 7人制橄榄球的, 而英国队仍是初出茅庐 |
| ZH | Ctx-tgt | It was as if Fiji had been born to play 7s, while GB are still learning the trade. |
| ZH | Ref | Which is pretty much how it is. |
| ZH | Contr | That is pretty much how it is. |

Table 3: Targeted test set examples. Models are assessed as correct if they score the reference (Ref) higher than a contrastive variant (Contr), given a source segment and its context.
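The contrastive accuracy used with these test sets can be sketched as below. The sketch assumes one contrastive variant per example and takes the model as an arbitrary scoring function; `toy_score` is a hypothetical stand-in for a real model's log-probability and only serves to make the example runnable.

```python
from typing import Callable, Iterable, Tuple

# (source with context, reference translation, contrastive translation)
Example = Tuple[str, str, str]


def contrastive_accuracy(
    examples: Iterable[Example], score: Callable[[str, str], float]
) -> float:
    """Fraction of examples where the reference outscores the contrastive variant."""
    examples = list(examples)
    correct = sum(1 for src, ref, contr in examples if score(src, ref) > score(src, contr))
    return correct / len(examples)


def toy_score(source: str, hypothesis: str) -> float:
    # Hypothetical scorer: prefers hypotheses containing the token "es".
    return 1.0 if "es" in hypothesis.split() else 0.0


examples = [
    ("What's your plan? <sep> I forgot to confide it to you.",
     "Ich vergaß, es euch zu vertraun.",
     "Ich vergaß, sie euch zu vertraun."),
    ("A sort of mist. <sep> And where's it coming from?",
     "Et elle vient d'où ?",
     "Et il vient d'où ?"),
]
accuracy = contrastive_accuracy(examples, toy_score)
```

Ties count as errors in this sketch, since the reference must score strictly higher than the contrastive variant.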

4 Context-aware translation results

We begin our experiments by confirming that concatenated models are indeed able to model context dependencies (Section 4.1). We follow by testing the hypothesis that larger models are better suited for learning the more complex contextual training data (Section 4.2). In order to avoid over-fitting, we use EN-DE as a development language and subsequently test identical settings on FR, RU and ZH in Section 4.3.

4.1 Multi-segment models

For all four language arcs, Ctx models use 6 encoder layers and 2 decoder layers (44M parameters) and are trained using both segments in isolation as well as concatenated context segments. In inference, DE, FR and ZH models use one preceding context sentence, matching the training. However, over 60% of the targeted RU data exhibits longer dependencies, of up to 3 previous segments. For this reason, targeted EN-RU testing concatenates all three context sentences. Baseline (Bl) models use the same original train data and model architecture, this time trained and used to translate one segment at a time.
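A rough sketch of the inference-time handling described above: context sentences are concatenated in front of the source segment, and only the last segment of the output is kept. The helper names and the delimiter spelling are assumptions for illustration, and the translation step itself is left out.

```python
from typing import List

SEP = "<sep>"  # assumed spelling of the delimiter token


def build_input(context: List[str], source: str, sep: str = SEP) -> str:
    """Concatenate zero or more preceding context sentences with the source segment."""
    return f" {sep} ".join([*context, source])


def extract_target(output: str, sep: str = SEP) -> str:
    """Keep only the translation of the last segment of a multi-segment output."""
    return output.rsplit(sep, 1)[-1].strip()


one_ctx = build_input(["What's your plan?"], "I forgot to confide it to you.")
three_ctx = build_input(["A.", "B.", "C."], "D.")  # e.g. the EN-RU targeted setting
target = extract_target("Was hast du vor? <sep> Ich vergaß, es euch zu vertraun.")
```

When the output contains no delimiter, as for single-segment input, `extract_target` returns the whole output unchanged.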

Results are shown in Table 4. As observed by previous work, concatenated models are considerably better than their context-ignoring counterparts, particularly on targeted test sets. In contrastive testing, accuracy increases by 20-30% in absolute values in all languages, with the exception of the lexical cohesion data set in RU and anaphora data set in ZH.

For non-targeted testing, Ctx models significantly out-perform the Bl models in 4 out of the 6 test sets. This differs from previous work, where contextual models using the concatenation approach are reported to degrade BLEU scores: Tiedemann and Scherrer (2017) measure 0.6 BLEU drop, Voita et al. (2019b) show a 0.84 drop for RU, Lopes et al. (2020), 1.2, and Junczys-Dowmunt (2019a) shows a BLEU degradation of 1.5. These results indicate that our approach to train the contextual model with both contextual and non-contextual data alleviates the issue of quality degradation.

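| Arc | Metric | Test set | Targeted | Bl | Ctx |
| --- | --- | --- | --- | --- | --- |
| DE | BLEU | TED | no | 19.9 | 22.4 |
| DE | BLEU | News | no | 26.1 | 29.5 |
| DE | BLEU | Europarl | no | 29.3 | 31.5 |
| DE | BLEU | ContraPro | yes | 20.1 | 21.1 |
| DE | Acc | ContraPro | yes | 0.50 | 0.70 |
| FR | BLEU | IWSLT | no | 40.0 | 39.7 |
| FR | BLEU | LCPT | yes | 27.9 | 32.5 |
| FR | Acc | LCPT | yes | 0.74 | 0.87 |
| FR | Acc | Anaphora | yes | 0.50 | 0.72 |
| RU | BLEU | WMT20 | no | 13.6 | 14.6 |
| RU | Acc | Deixis | yes | 0.50 | 0.83 |
| RU | Acc | Lex-coh | yes | 0.46 | 0.47 |
| RU | Acc | Ellipsis-vp | yes | 0.20 | 0.60 |
| RU | Acc | Ellipsis-infl | yes | 0.52 | 0.68 |
| ZH | BLEU | WMT20 | no | 21.2 | 21.4 |
| ZH | Acc | Eval-anaphora | yes | 0.58 | 0.61 |

Table 4: Concatenated models (Ctx) vs baseline models (Bl) of the same capacity. While all test sets have context, some are targeted towards discourse phenomena, marked as Targeted (see Section 3 for details). LCPT stands for Large-contrastive-pronoun-testset-EN-FR.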

4.2 Increasing model capacity

As the multi-segment models are trained on data exhibiting longer dependencies, we investigate the hypothesis that increased model capacity is needed to learn the more complex data distribution.

Starting with the baseline model used in the previous experiments (Ctx), we investigate two ways to increase its capacity: increasing the depth of the decoder or increasing the width of the feed-forward layers. We test four increased model capacities, by incrementally adding 2 decoder layers to the base model (deep models). For each deep model, we also create an equivalent wide model containing the same number of parameters. This leads to parameter counts ranging from 44M – the baseline setting – to 76M. We leave all other settings unchanged. Table 14 in the Appendix details these architectures.
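The parameter matching between deep and wide models can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper: it assumes a model dimension of 512, ignores biases and layer-norm parameters, and assumes the wide variant widens the feed-forward block in all encoder and decoder layers. The actual architectures are the ones listed in the paper's Table 14.

```python
D_MODEL = 512        # assumption: model dimension is not stated in this section
FF = 2048            # baseline feed-forward width (from the text)
ENC_LAYERS = 6       # baseline encoder depth (from the text)
DEC_LAYERS = 2       # baseline decoder depth (from the text)


def decoder_layer_params(d: int, ff: int) -> int:
    """Approximate weights in one decoder layer (biases and layer norms ignored).

    Self-attention and cross-attention each have four d x d projections;
    the feed-forward block has two d x ff matrices.
    """
    return 8 * d * d + 2 * d * ff


def added_params_deep(extra_layers: int, d: int = D_MODEL, ff: int = FF) -> int:
    """Parameters added by stacking `extra_layers` more decoder layers."""
    return extra_layers * decoder_layer_params(d, ff)


def matched_ff_width(extra_layers: int, d: int = D_MODEL, ff: int = FF,
                     n_layers: int = ENC_LAYERS + DEC_LAYERS) -> int:
    """Feed-forward width giving the same parameter increase as the deep variant.

    Widening the feed-forward block by `delta` in every layer adds
    n_layers * 2 * d * delta parameters.
    """
    delta = added_params_deep(extra_layers, d, ff) / (n_layers * 2 * d)
    return ff + round(delta)


step = added_params_deep(2)        # parameters added by one +2-layer step
wide_equiv = matched_ff_width(2)   # width of the parameter-matched wide model
```

Under these assumptions each step of two decoder layers adds roughly 8.4M parameters, which is in line with the 44M to 52M step reported in the text.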

Figure 1: EN-DE, Ctx models when increasing model capacity, measured in millions of parameters: BLEU scores on three non-targeted test sets (TED, News and Europarl) and targeted metrics (BLEU and Accuracy) on ContraPro.

Non-targeted and targeted testing results are shown in Figure 1. Results show that larger capacity models deliver increased performance across the board. In both cases most of the performance gain comes from the 52M capacity model: +2.2 BLEU gain in non-targeted testing and +0.4 BLEU/+5% accuracy on the ContraPro pronoun test. However, targeted metrics show subsequent improvements with increased depth, up to a +8% absolute accuracy gain with the 6:8 encoder:decoder configuration. While deeper and wider models perform similarly in non-targeted testing, deeper models are clearly superior on the pronoun translation task: wider models improve accuracy from 70% to a maximum of 73%, while deep models achieve 78%.

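| Metric | Test set | Targeted | Ctx used | Bl | Ctx | Ctx-Deep | Ctx-Wide |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BLEU | TED | no | no | 19.9 | 22.1 | 24.3 | 24.3 |
| BLEU | TED | no | yes | - | 22.4 | 24.4 | 24.4 |
| BLEU | News | no | no | 26.1 | 29.7 | 31.3 | 31.3 |
| BLEU | News | no | yes | - | 29.5 | 31.8 | 31.5 |
| BLEU | Europarl | no | no | 29.3 | 31.5 | 34.4 | 34.5 |
| BLEU | Europarl | no | yes | - | 31.5 | 34.4 | 34.7 |
| BLEU | ContraPro | yes | no | 20.1 | 19.1 | 21.2 | 20.6 |
| BLEU | ContraPro | yes | yes | - | 20.4 | 22.9 | 22.3 |
| Acc | ContraPro | yes | no | 0.49 | 0.50 | 0.51 | 0.50 |
| Acc | ContraPro | yes | yes | - | 0.70 | 0.78 | 0.73 |

Table 5: EN-DE, Bl and Ctx models of standard capacity (44M), and the best Deep and Wide models (68M/76M respectively). All used with and without context at test time.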

As noted in Section 4.1, Ctx models do not perform worse than sentence-level baselines on any of the contextual data sets, most likely due to the joint single- and multi-segment training regime. Next we investigate if this training leads to “multi-task” models that maintain baseline performance also when used without context. These experiments use the previously trained models, and contextual models are tested with and without context at inference time. We contrast baseline single-segment models of standard capacity (Bl), similar multi-segment models (Ctx), and the best Deep and Wide models as previously determined (capacity 68M and 76M respectively).

Results are shown in Table 5. Interestingly, in non-targeted testing, multi-segment models used without context approach the optimal performance. Therefore the improvements due to the use of context are smaller than indicated by Section 4.1: +0.5 and +0.2 BLEU in News and negligible or non-existent in the other domains. (A related observation was made in Lopes et al. (2020), where it is shown that if strong baselines are used, no contextual model tested brings any improvements over the context-ignoring baselines on the IWSLT sets for DE and FR.) Specifically, standard capacity Ctx models outperform Bl ones by +2/+3 BLEU points when used without context. Note that these only differ in the use of training data: Ctx duplicates the training data by concatenating two adjacent segments.

In targeted tests, the multi-segment models that ignore context do not outperform the baselines. This confirms that ContraPro is a good benchmark for isolating context-aware phenomena, as the task cannot be solved with a strong baseline.

4.3 Results on FR, RU and ZH translation

In this section we investigate whether EN-DE results carry over to the FR, RU and ZH translation tasks, by testing the optimal DE configurations without any additional language- or task-specific tuning. We test the baseline model (single-segment, standard capacity) against the best contextual model as determined on DE, the multi-segment model using 6 encoder and 8 decoder layers.

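| Arc | Metric | Test set | Targeted | Ctx used | Bl | Ctx-Deep |
| --- | --- | --- | --- | --- | --- | --- |
| FR | BLEU | IWSLT | no | no | 40.0 | 40.8 |
| FR | BLEU | IWSLT | no | yes | - | 40.0 |
| FR | BLEU | LCPT | yes | no | 27.9 | 31.5 |
| FR | BLEU | LCPT | yes | yes | - | 32.3 |
| FR | Acc | LCPT | yes | no | 0.74 | 0.79 |
| FR | Acc | LCPT | yes | yes | - | 0.90 |
| RU | BLEU | WMT20 | no | no | 13.6 | 18.9 |
| RU | BLEU | WMT20 | no | yes | - | 16.5 |
| RU | Acc | deixis | yes | no | 0.50 | 0.51 |
| RU | Acc | deixis | yes | yes | - | 0.85 |
| RU | Acc | lex-coh | yes | no | 0.46 | 0.46 |
| RU | Acc | lex-coh | yes | yes | - | 0.48 |
| RU | Acc | ellipsis-vp | yes | no | 0.20 | 0.25 |
| RU | Acc | ellipsis-vp | yes | yes | - | 0.73 |
| RU | Acc | ellipsis-infl | yes | no | 0.52 | 0.55 |
| RU | Acc | ellipsis-infl | yes | yes | - | 0.80 |
| ZH | BLEU | WMT20 | no | no | 21.2 | 22.1 |
| ZH | BLEU | WMT20 | no | yes | - | 22.4 |
| ZH | Acc | Eval-anaphora | yes | no | 0.58 | 0.60 |
| ZH | Acc | Eval-anaphora | yes | yes | - | 0.62 |

Table 6: Single segment models of standard 44M capacity (Bl) and 68M deep multi-segment models (Ctx-Deep). Best results on each task in bold font.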

Table 6 shows the results. The DE results carry over to FR, RU and ZH to a large extent. Except for ZH, on non-targeted testing the best performance is obtained by the deep multi-segment models used without context. On further analysis of the non-targeted EN-RU test set, where the drop is over 2 BLEU points, we observed that the length divergence between training and test segments is significant; the quality drops dramatically when the segment length deviates by more than 4 standard deviations (SD). See Appendix C for a detailed analysis of segment length variation.

In targeted testing, deep multi-segment models using context improve performance by a large margin, with the exception of the RU lexical cohesion test set and ZH anaphoric pronoun translation, where the improvements are minimal, corroborating prior studies Jwalapuram et al. (2020a, 2019).

5 Distillation

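| Arc (targeted set) | Model | Non-targeted BLEU | Targeted BLEU | Targeted Acc |
| --- | --- | --- | --- | --- |
| EN-DE | Ctx-Deep | 32.0 | 22.9 | 0.78 |
| EN-DE | Student | 31.4 | 23.1 | 0.73 |
| EN-DE | Ctx | 29.5 | 20.1 | 0.70 |
| EN-FR | Ctx-Deep | 40.0 | 32.3 | 0.90 |
| EN-FR | Student | 40.4 | 32.1 | 0.88 |
| EN-FR | Ctx | 39.7 | 32.5 | 0.87 |
| EN-RU (deixis) | Ctx-Deep | 16.5 | - | 0.85 |
| EN-RU (deixis) | Student | 16.2 | - | 0.84 |
| EN-RU (deixis) | Ctx | 14.6 | - | 0.83 |
| EN-RU (lex-coh) | Ctx-Deep | - | - | 0.48 |
| EN-RU (lex-coh) | Student | - | - | 0.46 |
| EN-RU (lex-coh) | Ctx | - | - | 0.46 |
| EN-RU (ellipsis-vp) | Ctx-Deep | - | - | 0.73 |
| EN-RU (ellipsis-vp) | Student | - | - | 0.66 |
| EN-RU (ellipsis-vp) | Ctx | - | - | 0.60 |
| EN-RU (ellipsis-infl) | Ctx-Deep | - | - | 0.80 |
| EN-RU (ellipsis-infl) | Student | - | - | 0.69 |
| EN-RU (ellipsis-infl) | Ctx | - | - | 0.68 |
| ZH-EN | Ctx-Deep | 22.3 | - | 0.62 |
| ZH-EN | Student | 22.2 | - | 0.59 |
| ZH-EN | Ctx | 21.4 | - | 0.60 |

Table 7: Teacher (68M deep multi-segment) and student (44M multi-segment) models. Ctx is a multi-segment model of 44M capacity. Non-targeted test sets are News for DE, IWSLT for FR and WMT for RU.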

The multi-segment models with increased capacity are computationally less efficient than standard single-segment models, with deeper decoders and longer outputs impacting latency and wider models impacting memory.

This section investigates the effectiveness of Knowledge Distillation (KD) in compressing contextual models to the original model capacity. We employ sequence-level KD as proposed in Kim and Rush (2016). Specifically, the Ctx-Deep models are the teachers used to translate the training/development data, and students are trained on data containing both references and teacher output, as recommended in Gordon and Duh (2019) among others. Students with 6:2 encoder:decoder layers and 44M parameters are trained with the loss:

$$\mathcal{L}_{\text{seq-KD}} = -\sum_{j=1}^{J} \sum_{k=1}^{|V|} \mathbb{1}\{\hat{y}_j = k\} \, \log p(y_j = k \mid \hat{y}_{<j}, x)$$

where $\hat{y}$ is the teacher prediction, $J$ is the target length, $|V|$ is the size of the vocabulary and $x$ is the source input.
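As a small illustration of sequence-level KD in the sense of Kim and Rush (2016), the loss reduces to the negative log-likelihood of the teacher's output tokens under the student. The sketch below is pure Python with made-up probabilities; `seq_kd_loss` and `build_student_data` are hypothetical helper names, and a real system would compute this inside the training toolkit.

```python
import math
from typing import List, Sequence, Tuple


def seq_kd_loss(student_probs: Sequence[Sequence[float]],
                teacher_tokens: Sequence[int]) -> float:
    """Negative log-likelihood of the teacher output under the student.

    student_probs[j][k] is the student's probability of vocabulary item k at
    target position j, conditioned on the source and the teacher prefix.
    The indicator in the loss selects only the teacher token at each position.
    """
    return -sum(math.log(student_probs[j][k]) for j, k in enumerate(teacher_tokens))


def build_student_data(sources: List[str], references: List[str],
                       teacher_outputs: List[str]) -> List[Tuple[str, str]]:
    """Student training set containing both references and teacher translations."""
    return list(zip(sources, references)) + list(zip(sources, teacher_outputs))


# Toy example: vocabulary of size 3, teacher output of length 2.
probs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
loss = seq_kd_loss(probs, teacher_tokens=[0, 1])
data = build_student_data(["a"], ["ref a"], ["teacher a"])
```

The second helper reflects the statement that students see both the original references and the teacher output for each source segment.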

5.1 Results

Results in Table 7 show KD can indeed enhance a smaller model’s performance on discourse phenomena. Overall, the performance of the distilled models lies between that of standard capacity models and that of deep models.

We observe that, unexpectedly, on several test sets the student model is better than the teacher. We analyze one such test set, EN-DE WMT19, where teacher and student achieve 30.6 and 31.1 BLEU respectively. We observed that some of this data is paragraph-level and not sentence-level, again leading to a train-test mismatch. We analyzed the performance on the test set against the input length distribution seen in the training data. Results (Table 8) show that the student outperforms the teacher when the input length is more than 2 standard deviations above the median length.

We hypothesize that student translations are more robust to variations and less context-sensitive, due to the simpler data distribution that the student is trained on (Zhou et al. (2020) indeed show that distilled data is less complex under a complexity metric based on cross-entropy across word alignments). While we leave further exploration to future work, we perform a simple experiment to measure the variation observed in translation when context is used or ignored. We measure this irrespective of quality, as the percentage of translations that change at all when context is used. Table 9 shows that student models are indeed less context-sensitive. This is observed on the targeted test set, ContraPro, where translation changes are expected; the gap is even more pronounced on non-targeted test sets, where context-dependence is not part of the data set design.

| Len. range | Ctx-Deep | Student |
| --- | --- | --- |
| (0, m) | 22.7 | 23.0 |
| (m, m+2*SD] | 32.3 | 31.8 |
| (m+2*SD, m+4*SD] | 32.2 | 32.7 |
| (m+4*SD, ) | 25.1 | 28.1 |

Table 8: BLEU scores on the WMT19 EN-DE test set. The test set is split into four partitions with respect to the input length. For example, the second bin contains input of length between the median length seen in training (m) and 2 standard deviations (SD) above it.
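The length-based partitioning behind Table 8 could be reproduced along the following lines. This is a sketch under stated assumptions: lengths are counted in whitespace-separated tokens, the population standard deviation is used, and an input whose length equals the median is placed in the first bin; the paper does not specify these details here.

```python
import statistics
from typing import List, Tuple


def fit_length_stats(train_inputs: List[str]) -> Tuple[float, float]:
    """Median and standard deviation of input lengths seen in training."""
    lengths = [len(s.split()) for s in train_inputs]
    return statistics.median(lengths), statistics.pstdev(lengths)


def length_bin(n_tokens: int, m: float, sd: float) -> int:
    """Index of the length bin: 0 = up to m, 1 = (m, m+2SD], 2 = (m+2SD, m+4SD], 3 = longer."""
    if n_tokens <= m:
        return 0
    if n_tokens <= m + 2 * sd:
        return 1
    if n_tokens <= m + 4 * sd:
        return 2
    return 3


m, sd = fit_length_stats(["a b", "a b c d", "a b c"])
bins = [length_bin(n, m=10, sd=5) for n in (8, 15, 25, 31)]
```

Test segments would then be grouped by bin index and a corpus-level metric computed per group, with the metric implementation left to a standard toolkit.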
| Test set | Ctx-Deep | Student |
| --- | --- | --- |
| TED | 57.0% | 47.0% |
| News | 60.5% | 49.7% |
| Europarl | 53.6% | 42.2% |
| ContraPro | 65.0% | 57.5% |

Table 9: EN-DE, percentage of translations that change with the addition of context (context used vs. ignored).
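The context-sensitivity measure reported in Table 9 is simple enough to state in code. The sketch below assumes exact string comparison after trimming surrounding whitespace; whether the original measurement normalized the outputs in any other way is not stated.

```python
from typing import Sequence


def context_sensitivity(with_ctx: Sequence[str], without_ctx: Sequence[str]) -> float:
    """Percentage of segments whose translation changes when context is used."""
    if len(with_ctx) != len(without_ctx):
        raise ValueError("both lists must cover the same segments")
    changed = sum(a.strip() != b.strip() for a, b in zip(with_ctx, without_ctx))
    return 100.0 * changed / len(with_ctx)


with_ctx = ["Er springt.", "Guten Tag.", "Danke.", "Ja."]
without_ctx = ["Es springt.", "Guten Tag.", "Danke.", "Ja."]
pct_changed = context_sensitivity(with_ctx, without_ctx)
```

The measure is deliberately independent of quality: it counts any change, whether it improves or degrades the translation.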

6 Human evaluation

The scoring accuracy metric used with targeted contrastive references does not measure if the system can actually generate the correct pronoun. In a recent analysis, Vamvas and Sennrich (2021) show that scoring human-written contrastive references may lead to false positives, in particular for distilled NMT models which have not been exposed to real-data distribution during training.

To address this limitation, we complement the automatic metric with human evaluation of the EN-DE teacher and student models. We sampled 250 examples from ContraPro, selecting samples where the antecedent is in the previous sentence. Translators were shown the two consecutive source sentences and their translations. They were asked to rate the quality of both sentences, the context and target, on a scale of 1 to 6 (with increments of 0.2) and to mark if the anaphoric pronoun "it" was correctly translated in the target sentence. Each translator performed two tasks: the first task was to compare the context-ignoring baseline (Bl) and the Ctx-Deep model; the second task was to compare the same baseline with the student model (Student). With this setup, we grounded the evaluation by showing the baseline translations in both tasks.

The inter-annotator agreement on ranking the baseline and teacher models with respect to generic quality of the target sentence (we compute the ranking as [...]) is good, at 0.55 Krippendorff’s Alpha (Hayes and Krippendorff, 2007). On assessing the correctness of pronoun translation, the agreement is very high for the teacher at 0.86 and high for the student at 0.66. (We believe annotator fatigue contributed to the decrease in agreement, as translators first judged the teacher outputs and later the student.) For the baseline model, which was judged twice, translators agree with themselves 86% of the time.

We report the pronoun translation accuracy and the generic quality scores (averaged across annotators and sentences) for the target sentences in Table 10. These results show that the Ctx-Deep is significantly more accurate at translating ambiguous pronouns than the baseline (71.8% vs 28.1%) and at the same time achieves better generic quality scores (+8% relative improvement). The student, which has the same capacity and architecture as the baseline, performs better than the automatic accuracy metric suggested: it retains most of the improved accuracy (61.6% vs 71.8%) and quality of the teacher model (+7% vs +8% ).

Table 11 shows an example of translation where the student model disambiguates the anaphoric pronoun "it" and in addition makes better word choices compared to the baseline: the two occurrences of the verb "jump" ("springt") are correctly translated in both the context and target sentence.

| Model | Targeted accuracy (%) | Accuracy Δ | Quality score | Quality Δ |
| --- | --- | --- | --- | --- |
| Bl | 28.1 | - | 4.43 | - |
| Ctx-Deep | 71.8 | 42.0 | 4.81 | 8% |
| Student | 61.6 | 31.8 | 4.76 | 7% |

Table 10: Human evaluation of EN-DE contextual models versus the (context-ignoring) baseline on ContraPro.
| Model | Context sentence | Targeted sentence | Avg. scores |
| --- | --- | --- | --- |
| Src | This wounded bird is jumping all over the place out there. | And every time it jumps. It gives us more data that we can use. | |
| Bl | Dieser verwundete Vogel zieht sich dort über den gesamten Ort hinaus. | Und jedes Mal, wenn es springt, gibt es mehr Daten, die wir verwenden können. | 4.0 / 4.5 |
| Ctx-Student | Dieser verwundete Vogel springt dort draußen. | Und jedes Mal, wenn er springt, gibt er uns mehr Daten, die wir verwenden können. | 4.4 / 5.4 |

Table 11: Translation examples where the contextual student model is more accurate and has better generic quality compared to the baseline. We underline the antecedent of the ambiguous pronoun and bold words that are translated correctly by the student. We report the average quality scores for both the context and target sentences.

7 Comparison to previous results

Comparisons to previous work are not straightforward due to the variability in training data used, number of model parameters, hyper-parameter tuning and other train-time parameters.

Reported results as well as the replicated results below show that we compare favorably to previous work, despite not using language-specific tuning or techniques. The best EN-DE ContraPro accuracy using parallel data reported in Huo et al. (2020) is 0.83 on a model trained with 22.4M segments, which is 0.05 higher than our Ctx-Deep model, trained with 4M segments. On the same test set Fernandes et al. (2021) shows an accuracy score of 0.66, Lopes et al. (2020) reports a maximum performance of 0.71 while Maruf et al. (2019a) reports a maximum of 0.73/0.69 with an offline and an online model respectively.

On EN-FR Yin et al. (2021b) report an accuracy of 0.91, to our knowledge the best reported results on this test set. We reach 0.90/0.86 accuracy on this test with the Ctx-Deep and the student model respectively. On EN-RU Voita et al. (2019b) also obtains the best results when using concatenation models, however on average these results are lower than the Ctx-Deep ones. Voita et al. (2019a) introduces DocRepair, a two-pass method which obtains optimal results but has considerable computational drawbacks compared to our proposal. DocRepair is tested on EN-RU and achieves 0.92 on deixis, 0.75/0.86 on ellipsis and an impressive 0.80 on lexical cohesion.

We were able to perform a side-by-side comparison of our proposed approach with that of Voita et al. (2019b) on EN-RU, due to the availability of the implementation. We reproduce the results of the CADec model by using the same EN-RU train data set and configuration as in the original paper (Voita et al., 2019b). The train data contains 7.5 million parallel segments, out of which 1.5 million are contextual. For comparison, we train a contextual model with 6:6 encoder:decoder layers, totaling the same number of parameters as CADec. During testing, the source segment is translated using its preceding context and the context translation is subsequently stripped from the output.

The results are shown in Table 12. We report the results for both non-targeted and targeted test sets. Ctx-6:6 shows high quality, comparable to the state of the art CADec model on three of the test sets, and better in two of them (ellipsis-vp and ellipsis-infl).

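| Arc | Metric | Test set | Targeted | CADec | Ctx-6:6 |
| --- | --- | --- | --- | --- | --- |
| RU | BLEU | CADec test | no | 30.1 | 30.4 |
| RU | Acc | Deixis | yes | 0.81 | 0.80 |
| RU | Acc | Lex-coh | yes | 0.47 | 0.48 |
| RU | Acc | Ellipsis-vp | yes | 0.69 | 0.73 |
| RU | Acc | Ellipsis-infl | yes | 0.56 | 0.75 |

Table 12: Comparison of concatenated contextual (Ctx-6:6) and Context Aware Decoder (CADec) models.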

8 Related work

8.1 Document translation

A straight-forward way to include context proposed by Tiedemann and Scherrer (2017) is to train a standard NMT model on pseudo-document parallel data obtained by concatenating two or more consecutive sentences. As large-scale document-level parallel data is not widely available, prior work explored data augmentation: augmenting the training data with back-translated monolingual documents (Junczys-Dowmunt, 2019b) and leveraging monolingual data to train a document-level repair system (Voita et al., 2019a) .

To scale to larger context beyond the previous or next sentence, prior work proposed changes to the architecture to improve how context is compressed and attended to by the encoder and/or decoder: multi-encoder architectures (Bawden et al., 2018), hierarchical or sparse attention over context sentences (Maruf et al., 2019a; Miculicich et al., 2018; Bao et al., 2021), incorporating diverse pretrained context representations that combine local and document-level context (Zhu et al., 2020; Donato et al., 2021). However, recent work (Fernandes et al., 2021) has shown scaling up to longer contexts brings diminishing returns or can even hurt performance due to instability in training attention and positional embeddings (Nguyen et al., 2021).

8.2 Large models and Knowledge distillation

Prior work has shown that increasing model capacity in MT is beneficial, particularly when increasing the size or the diversity of the training data, such as for multi-lingual models. With respect to depth versus width, Kaplan et al. (2020) do an extensive comparison of model capacity for neural language models and find that increasing width and depth while maintaining capacity gives similar results.

Knowledge distillation (KD) (Hinton et al., 2015) has been introduced as a way to compress larger models, or ensembles thereof, into smaller, more computationally efficient models that reach similar performance. For sequence-to-sequence models, Kim and Rush (2016) introduced sequence-level knowledge distillation for improved MT. Subsequently, knowledge distillation has proved beneficial for non-autoregressive MT. In general, distillation is thought to be effective in low data regimes, small data sets, domain adaptation and transfer learning (e.g. Currey et al. (2020)), which makes it particularly suitable for document-level translation, where parallel data is a bottleneck.

A significant body of work has been devoted to understanding why distillation works, such as (Gordon and Duh, 2019; Xu et al., 2021; Zhou et al., 2020) among others. While our work does not focus on investigating why distillation works, we do contribute the observation that students prove to be very robust and out-perform teachers on out-of-distribution data when input length is considered.

9 Conclusion

In this paper we address the task of contextual translation by using multi-segment Transformer models. We show that we can successfully push the limits of this approach to achieve robust and high performance across several languages and benchmarks, without any language or task-specific tuning. This is achieved by training models to perform both contextual and single-segment translation, which has the added benefit of improving single-segment translation quality. We also show that – with fixed data conditions and model capacity – deeper models are superior to wider models in modeling contextual dependencies between pronouns and their antecedents.

Next, we show that the increased computational burden can be mitigated through distillation. Finally, we observe that distilled models are more robust than their teachers on long input, which opens a new direction for improving MT models through distillation.

In this paper we have kept the training data fixed and have not investigated any data manipulations that could lead to improved performance. In particular, our results indicate that standard document-level parallel data sets (such as the non-targeted sets used in this paper) exhibit a limited amount of discourse phenomena. In this light, multi-segment models trained on similar data may not learn to pay sufficient “attention” to context. In future work, we plan to investigate whether parallel data can be improved by measuring and controlling context-sensitivity.

10 Ethical Considerations

In this work we used professional translators to evaluate the quality and accuracy of our models on publicly available data sets. The professional translators were recruited by a language service provider and were compensated according to industry standards.

11 Limitations

Every approach has its limitations, and our approach is no exception. Our contextual model adopts a simple approach of concatenating the previous sentence to the current sentence. While this improves the contextual model’s performance significantly, we have not experimented with the effect of context size on model performance and have used the standard context length from the literature.

Also, when comparing with previous results, we have reproduced a state-of-the-art approach (CADec) on a standard data set for the EN-RU arc. For the remaining language pairs and test sets, we have cited previously reported results without reproducing them.

Finally, the targeted test sets used are limited with respect to the discourse phenomena they explore: anaphoric pronouns, lexical cohesion and verb forms. We have not investigated whether our approach impacts other discourse phenomena or whether it affects biases in translation.

References

Appendix A Experimental Settings

NMT models were built using the Transformer-base architecture (Vaswani et al., 2017b). The source embeddings, target embeddings, and the output layer’s weight matrix are tied. Training is done on 8 GPUs with Sockeye 2’s large batch training.

Parameters for training a 6:8 encoder:decoder model (same parameters are used for the other encoder:decoder configurations):

Arguments: Namespace(allow_missing_params=False, amp=False, amp_scale_interval=2000, batch_sentences_multiple_of=8, batch_size=8192, batch_type='word', bucket_scaling=False, bucket_width=8, cache_last_best_params=0, cache_metric='perplexity', cache_strategy='best', checkpoint_improvement_threshold=0.0, checkpoint_interval=4000, config=None, decode_and_evaluate=-1, decode_and_evaluate_device_id=None, decoder='transformer', device_ids=[-8], disable_device_locking=False, dry_run=False, dtype='float32', embed_dropout=(0.0, 0.0), encoder='transformer', env=None, fixed_param_names=[], fixed_param_strategy=None, gradient_clipping_threshold=1.0, gradient_clipping_type='none', horovod=False, ignore_extra_params=False, initial_learning_rate=0.0002, keep_initializations=False, keep_last_params=60, kvstore='device', label_smoothing=0.1, learning_rate_reduce_factor=0.9, learning_rate_reduce_num_not_improved=8, learning_rate_scheduler_type='plateau-reduce', learning_rate_t_scale=1.0, learning_rate_warmup=0, length_task=None, length_task_layers=1, length_task_weight=1.0, lhuc=None, lock_dir='/tmp', loglevel='INFO', loglevel_secondary_workers='INFO', loss='cross-entropy-without-softmax-output', max_checkpoints=None, max_num_checkpoint_not_improved=30, max_num_epochs=None, max_samples=None, max_seconds=1036800, max_seq_len=(200, 200), max_updates=None, min_num_epochs=1, min_samples=None, min_updates=None, momentum=None, monitor_pattern=None, monitor_stat_func='mx_default', no_bucket_scaling=None, no_bucketing=False, no_hybridization=False, no_logfile=False, num_embed=(None, None), num_layers=(6, 8), num_words=(0, 0), omp_num_threads=None, optimized_metric='perplexity', optimizer='adam', optimizer_params=None, output='deep68_6_8', overwrite_output=True, pad_vocab_to_multiple_of=None, params=None, prepared_data='../data_en_de', quiet=False, quiet_secondary_workers=False, round_batch_sizes_to_multiple_of=None, seed=1, shared_vocab=False, source=None, source_factor_vocabs=[], 
source_factors=[], source_factors_combine=[], source_factors_num_embed=[], source_factors_share_embedding=[], source_factors_use_source_vocab=[], source_vocab=None, stop_training_on_decoder_failure=False, target=None, target_factor_vocabs=[], target_factors=[], target_factors_combine=[], target_factors_num_embed=[], target_factors_share_embedding=[], target_factors_use_target_vocab=[], target_factors_weight=[1.0], target_vocab=None, transformer_activation_type=('relu', 'relu'), transformer_attention_heads=(8, 8), transformer_dropout_act=(0.1, 0.1), transformer_dropout_attention=(0.1, 0.1), transformer_dropout_prepost=(0.1, 0.1), transformer_feed_forward_num_hidden=(2048, 2048), transformer_feed_forward_use_glu=False, transformer_model_size=(512, 512), transformer_positional_embedding_type='fixed', transformer_postprocess=('dr', 'dr'), transformer_preprocess=('n', 'n'), update_interval=1, use_cpu=False, validation_source='../data_en_de/', validation_source_factors=[], validation_target='../data_en_de/', validation_target_factors=[], weight_decay=0.0, weight_init='xavier', weight_init_scale=3.0, weight_init_xavier_factor_type='avg', weight_init_xavier_rand_type='uniform', weight_tying_type='src_trg_softmax', word_min_count=(1, 1))

All data except Chinese (ZH) is tokenized using the Python Moses tokenizer; Chinese is tokenized using the Jieba tokenizer. Words were segmented using BPE (Sennrich et al., 2016) with 32K operations. Source and target subwords shared the same vocabulary, and training segments longer than 95 tokens were removed.
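The length filter described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes that the 95-token limit is applied to both the source and the target side and that tokens are the whitespace-separated subwords produced by BPE. The function name is hypothetical.

```python
from typing import Iterable, List, Tuple

MAX_TOKENS = 95  # threshold stated in the text


def filter_long_segments(
    pairs: Iterable[Tuple[str, str]],
    max_tokens: int = MAX_TOKENS,
) -> List[Tuple[str, str]]:
    """Keep only pairs whose source and target both fit within max_tokens."""
    kept = []
    for src, tgt in pairs:
        # Assumption: a pair is dropped if either side exceeds the limit.
        if len(src.split()) <= max_tokens and len(tgt.split()) <= max_tokens:
            kept.append((src, tgt))
    return kept
```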

LP | #segments | Data source
EN-DE | 4M | WMT, 2019
EN-FR | 9.8M | Open Subtitles 2018
EN-RU | 8.7M | Open Subtitles 2018
ZH-EN | 17.4M | UN Parallel Corpus V1.0
Table 13: Document parallel training data.

Appendix B Multi-segment translation models

Our contextual model is a multi-segment model. The document-level parallel data is transformed to contain concatenated, multi-segment input for our models. Specifically, we concatenate two consecutive segments, both source and target, to create a new data point. Training a model with just concatenated inputs would render the model useless for translating isolated segments, while training a model with just isolated segments would not be useful for contextual translation. Thus we aim to create a single translation model that can translate both in context and in isolation. We do this by doubling the training data, combining the in-context data with the isolated data. See Algorithm 1 for the pseudo-code for this concatenation process.

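Input: Parallel train data T
Output: Augmented parallel train data
1 for each document in T do
2       for each segment pair in the document do
3             add the isolated segment pair to the output
4             add the pair concatenated with its preceding segment pair (source and target), when one exists, to the output
Algorithm 1: Algorithm to transform the training data.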
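The transformation described in this appendix can be sketched in a few lines of Python. This is an illustrative reading of the prose, not the authors' implementation: the separator token `<sep>`, the function name, and the choice to skip the concatenated example for the first segment of a document are all assumptions.

```python
from typing import List, Tuple

SegmentPair = Tuple[str, str]  # (source segment, target segment)
SEP = "<sep>"  # hypothetical separator token between the two segments


def augment(documents: List[List[SegmentPair]], sep: str = SEP) -> List[SegmentPair]:
    """Return isolated pairs plus pairs concatenated with their predecessor."""
    augmented: List[SegmentPair] = []
    for doc in documents:
        for i, (src, tgt) in enumerate(doc):
            # Isolated example: lets the model translate without context.
            augmented.append((src, tgt))
            # In-context example: previous segment prepended on both sides.
            if i > 0:
                prev_src, prev_tgt = doc[i - 1]
                augmented.append((f"{prev_src} {sep} {src}", f"{prev_tgt} {sep} {tgt}"))
    return augmented
```

Concatenation never crosses a document boundary in this sketch, since the preceding segment is always taken from the same document.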

Appendix C Results

Comparison to previous results

The best EN-DE ContraPro accuracy using parallel data reported in Fernandes et al. (2021) is 0.66, obtained with a concatenated input method. Among all the methods tested, this is only outperformed by a model that additionally performs pre-training on a large monolingual corpus (0.80 accuracy, 0.02 higher than our Ctx-Deep model). On the same test set, Lopes et al. (2020) report a maximum performance of 0.71, while Maruf et al. (2019a) report a maximum of 0.73/0.69 with an offline and an online model, respectively.

On EN-FR, Lopes et al. (2020) report an accuracy of 0.83, to our knowledge the best reported result on this test set. We reach 0.90/0.86 accuracy on this test set with the Ctx-Deep and the student model, respectively. On EN-RU, Voita et al. (2019b) also obtain the best results when using concatenation models; however, on average these results are lower than the Ctx-Deep ones. Voita et al. (2019a) introduce DocRepair, a two-pass method which obtains optimal results but has considerable computational drawbacks compared to our proposal. DocRepair is tested on EN-RU and achieves 0.92 on deixis, 0.75/0.86 on ellipsis and an impressive 0.80 on lexical cohesion.

Model | Enc. blocks | Dec. blocks | FF size | Total param.
Ctx | 6 | 2 | 2048 | 44 M
Deep-52 | 6 | 4 | 2048 | 52 M
Wide-52 | 6 | 2 | 3072 | 52 M
Deep-60 | 6 | 6 | 2048 | 60 M
Wide-60 | 6 | 2 | 4096 | 60 M
Deep-68 | 6 | 8 | 2048 | 68 M
Wide-68 | 6 | 2 | 5120 | 68 M
Deep-76 | 6 | 10 | 2048 | 76 M
Wide-76 | 6 | 2 | 6144 | 76 M
Table 14: Multi-segment models. For each increase in model capacity, measured in millions of parameters, we vary the width of the FF layer and the depth of the decoder, to obtain a deep and a wide configuration.
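To see why each deep/wide pair has matching capacity, the following back-of-the-envelope count may help. It is a sketch under stated assumptions rather than the exact Sockeye parameter count: biases and layer normalization are ignored, the model size is 512 as in the training arguments, the vocabulary is taken to be 32,000 entries with a single tied embedding matrix, and the wider FF layer is assumed to apply to both encoder and decoder blocks. The function name is hypothetical.

```python
def approx_params(
    enc_blocks: int,
    dec_blocks: int,
    ff_size: int,
    model_size: int = 512,      # transformer_model_size in the arguments above
    vocab_size: int = 32_000,   # assumption: roughly the number of BPE operations
) -> int:
    """Approximate Transformer parameter count (no biases, no layer norm)."""
    attention = 4 * model_size * model_size       # Q, K, V and output projections
    feed_forward = 2 * model_size * ff_size       # two linear layers
    encoder_block = attention + feed_forward
    decoder_block = 2 * attention + feed_forward  # self- and cross-attention
    embeddings = vocab_size * model_size          # one tied matrix (assumption)
    return enc_blocks * encoder_block + dec_blocks * decoder_block + embeddings


# (encoder blocks, decoder blocks, FF size) for a few rows of Table 14.
configs = {
    "Ctx": (6, 2, 2048),
    "Deep-52": (6, 4, 2048),
    "Wide-52": (6, 2, 3072),
    "Deep-60": (6, 6, 2048),
    "Wide-60": (6, 2, 4096),
}
millions = {name: approx_params(*cfg) / 1e6 for name, cfg in configs.items()}
```

Under these assumptions, two extra decoder blocks and 1024 extra FF units in each of the eight blocks add the same number of parameters, so the deep and wide variants of a pair come out equal, and the totals land close to those reported in Table 14.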

C.1 Segment Length Analysis

Results on the non-targeted test sets (Table 6) show that the best performance is obtained by the deep multi-segment models used without context. They also show a significant difference in performance between translation with and without context for the English (EN) to Russian (RU) arc.

We analyze the non-targeted test set with respect to the segment length of the training data. The median segment length of the training data is 15, with a standard deviation (SD) of 9. We split the test set into 4 partitions according to how far the segment length lies from this training median, measured in SDs. Table 15 shows the performance on each partition when translation is done with and without context. When the test segment length exceeds the training median by more than 4 SD, the performance of translation with context degrades significantly, and most test segments (3077) fall into this partition. We believe this length mismatch between training and test data is the reason for the performance drop between translation with and without context.

Len. Range | #Segments | BLEU (w/o context) | BLEU (with context)
(0, m) | 22 | 5.7 | 6.5
(m, m + 2*SD] | 351 | 17.3 | 19.1
(m + 2*SD, m + 4*SD] | 652 | 20.3 | 21.5
(m + 4*SD, ∞) | 3077 | 19.6 | 16.7
Table 15: BLEU scores on WMT20 EN-RU test set. The test set is split into four partitions with respect to the input length, where m is the median segment length seen in training and SD its standard deviation. For example, the (m, m + 2*SD] bin contains inputs whose length lies between the training median and two standard deviations above it.
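The partitioning used in this analysis can be sketched as below. The function name and bin labels are illustrative, and one detail is an assumption: a segment whose length equals the median exactly is placed in the first bin, since the ranges in Table 15 leave that case open.

```python
from typing import Dict, List


def partition_by_length(lengths: List[int], m: float, sd: float) -> Dict[str, List[int]]:
    """Group segment indices by length relative to training median m and SD."""
    bins: Dict[str, List[int]] = {
        "(0, m]": [],
        "(m, m+2SD]": [],
        "(m+2SD, m+4SD]": [],
        "(m+4SD, inf)": [],
    }
    for idx, n in enumerate(lengths):
        if n <= m:  # assumption: length == m goes to the first bin
            bins["(0, m]"].append(idx)
        elif n <= m + 2 * sd:
            bins["(m, m+2SD]"].append(idx)
        elif n <= m + 4 * sd:
            bins["(m+2SD, m+4SD]"].append(idx)
        else:
            bins["(m+4SD, inf)"].append(idx)
    return bins


# With the training statistics reported above (m = 15, SD = 9),
# the bin boundaries fall at 15, 33 and 51 tokens.
example_bins = partition_by_length([10, 15, 20, 33, 34, 51, 52], m=15, sd=9)
```

BLEU can then be computed separately on the segments of each bin, once for translation with context and once without.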