APE aims to improve the quality of an existing MT system by learning from human-edited samples. The field began with the automatic article selection for English noun phrases (Knight and Chander, 1994) and continued with correcting the errors of more complex statistical MT systems (Bojar et al., 2015, 2016; Chatterjee et al., 2018a). In 2018, the organizers of the WMT shared task introduced, for the first time, the automatic post-editing of neural MT systems (Chatterjee et al., 2018b).
Despite its successful application to SMT systems, APE has proved more challenging for strong NMT systems (Junczys-Dowmunt and Grundkiewicz, 2018). This is mostly because high-quality NMT systems make fewer mistakes, limiting the improvements obtained even by state-of-the-art APE systems such as self-attentive transformer-based models (Tebbifakhr et al., 2018; Junczys-Dowmunt and Grundkiewicz, 2018). In spite of these findings, and considering the dominance of the NMT approach in both academic and industrial applications, the WMT shared task organizers decided to move completely to the NMT paradigm this year and drop the SMT technology. They also provided the previous year's in-domain training set (i.e., <src, mt, pe> triplets), further increasing the difficulty of the task.
Training state-of-the-art APE systems capable of improving high-quality NMT outputs requires large amounts of training data, which is not always available, in particular for this WMT shared task. Augmenting the training set with artificially synthesized data is a popular and effective approach for coping with this challenge. It was first used to improve the quality of NMT systems (Sennrich et al., 2016) and later applied to the APE task (Junczys-Dowmunt and Grundkiewicz, 2016). This approach, however, has shown limited success in automatically post-editing the high-quality translations of NMT systems.
Transfer learning is another solution for dealing with data sparsity in such tasks. It is based on the assumption that knowledge extracted from well-resourced tasks can be transferred to new tasks/domains. Recently, large models pre-trained on multiple tasks with vast amounts of data, such as BERT and MT-DNN (Devlin et al., 2018a; Liu et al., 2019), have obtained state-of-the-art results when fine-tuned on a small set of training samples. Following Correia and Martins (2019), in this paper we use BERT (Devlin et al., 2018a) within the encoder-decoder framework (§2.1) and formulate the task of APE as generating pe, which is (possibly) a modified version of mt, given the original source sentence src. As discussed in §2.1, instead of using a multi-encoder architecture, in this work we concatenate src and mt with the BERT special token (i.e., [SEP]) and feed them to a single encoder.
We also introduce the conservativeness penalty, a simple yet effective mechanism that controls the freedom of our APE system in modifying the given MT output. As we show in §2.2, when the automatic translations are of high quality, this factor forces the APE system to make fewer modifications, hence avoiding the well-known problem of over-correction.
Finally, we augmented our original in-domain training data with a synthetic corpus of <src, mt, pe> triplets (§3.1). As discussed in §4, our system is able to significantly improve the MT outputs in terms of TER (Snover et al., 2006) and BLEU (Papineni et al., 2002), achieving an ex aequo first place in the English-German track.
In this section we describe the main features of our APE system: the BERT-based encoder-decoder (BED) and the conservativeness penalty.
2.1 BERT-based encoder-decoder
Following Correia and Martins (2019), we adapt the BERT model to the APE task by integrating it in an encoder-decoder architecture. To this aim, we use a single BERT encoder to obtain a joint representation of the src and mt sentences, and a BERT-based decoder whose multi-head context attention block is initialized with the weights of the self-attention block. Both the encoder and the decoder are initialized with the pre-trained weights of multilingual BERT (https://github.com/google-research/bert; Devlin et al., 2018b). Figure 1 depicts our BED model.
Instead of using multiple encoders to separately encode src and mt, we follow the BERT pre-training scheme, where the two strings, concatenated by the [SEP] special symbol, are fed to a single encoder. We treat these sentences as sentence A and sentence B, as in Devlin et al. (2018b), and assign different segment embeddings to each of them. This emulates a setting similar to Junczys-Dowmunt and Grundkiewicz (2018), where a dual-source encoder with shared parameters is used to encode both input strings.
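As a concrete illustration, the joint input might be assembled as follows; the toy vocabulary and ids here are hypothetical stand-ins for the multilingual BERT WordPiece tokenizer and its special-token ids:

```python
# Sketch of the joint src/mt input for the single BERT encoder.
# The vocabulary and ids below are toy stand-ins; a real system would
# use the multilingual BERT WordPiece tokenizer and its special tokens.

def build_joint_input(src_tokens, mt_tokens, vocab):
    """Concatenate src and mt as "[CLS] src [SEP] mt [SEP]" and assign
    segment ids: 0 for the src part (sentence A), 1 for mt (sentence B)."""
    tokens = ["[CLS]"] + src_tokens + ["[SEP]"] + mt_tokens + ["[SEP]"]
    token_ids = [vocab[t] for t in tokens]
    # Segment A covers [CLS] + src + first [SEP]; segment B covers mt + [SEP].
    segment_ids = [0] * (len(src_tokens) + 2) + [1] * (len(mt_tokens) + 1)
    return token_ids, segment_ids

vocab = {t: i for i, t in enumerate(
    ["[CLS]", "[SEP]", "click", "the", "button", "klicken", "sie", "auf", "knopf"])}
ids, segs = build_joint_input(["click", "the", "button"],
                              ["klicken", "sie", "auf", "knopf"], vocab)
```

Assigning different segment embeddings to the two halves lets the single encoder distinguish source tokens from machine-translation tokens while still attending across both.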
On the target side, following Correia and Martins (2019), we use a single decoder whose context attention block is initialized with the self-attention weights, and all the self-attention weights are shared with the respective self-attention weights in the encoder.
2.2 Conservativeness penalty
With domain-specific NMT systems making relatively few translation errors, APE systems face new challenges: more careful decisions have to be made, applying the fewest possible edits to the raw mt. To this aim, we introduce our "conservativeness" penalty, built upon the post-editing penalty proposed by Junczys-Dowmunt and Grundkiewicz (2016). It is a simple yet effective method to penalize or reward hypotheses in the beam, at inference time, that diverge from the original input.
More formally, let V be the source and target vocabulary. We define C = src ∪ mt as the conservative tokens of an APE triplet, i.e., the union of the src and mt tokens. For the sake of argument we define C for decoding a single APE triplet; this generalizes to batch decoding, with one set C per batch element. Given the |V|-sized vector of candidate scores at each decoding step t, we modify the score/probability of each candidate y as:

score'(y) = score(y) − c,  if y ∉ C
score'(y) = score(y),      otherwise

where c is the conservativeness penalty, penalizing (or rewarding, for negative values) all tokens of V not present in C. Note that this penalty can be applied either to the raw non-normalized outputs of the model (logits) or to the final probabilities (log probabilities).
As the log probabilities and the logits have different ranges, (−∞, 0] and (−∞, +∞) respectively, c is set accordingly. For positive values of the conservativeness, the aim is to avoid picking tokens absent from both src and mt, thus limiting the number of corrections. On the other hand, negative values enable over-correction.
Moreover, in order to apply the penalty to the log probabilities, some considerations must be taken into account, since we do not renormalize after the transformation. For positive values, the factor lowers the probability of all non-conservative tokens, either increasing the confidence of an already-picked conservative token or favouring conservative tokens that are close to the best candidate; the resulting values are thus closer to scores than to probabilities. In contrast, negative penalties might require carefully selected values and truncation at the upper boundary. We did not experiment with negative values in this work; however, the winning system of the Quality Estimation shared task used an APE-QE system with negative conservativeness (Kepler et al., 2019).
In contrast with Junczys-Dowmunt and Grundkiewicz (2016), our model takes into account both src and mt, allowing it to copy tokens directly from either of them. This is beneficial for handling proper nouns, which should be preserved in the post-edition without any modification. Moreover, instead of setting the penalty to a fixed value, we define it as a hyperparameter, which enables a more dynamic control of our model's post-editions to the mt input.
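A minimal sketch of this scoring rule, using a toy vocabulary of five token ids and plain Python lists in place of the model's tensors:

```python
def apply_conservativeness(scores, conservative_ids, c):
    """Subtract the penalty c from the score of every candidate token
    that is not in the conservative set (the tokens of src and mt).
    Positive c discourages novel tokens; negative c would reward them."""
    return [s if i in conservative_ids else s - c
            for i, s in enumerate(scores)]

# Toy vocabulary of five tokens; tokens {0, 2, 3} occur in src or mt.
log_probs = [-0.5, -1.2, -0.7, -2.0, -0.9]
penalized = apply_conservativeness(log_probs, {0, 2, 3}, c=1.0)
# Tokens 1 and 4 are pushed down; the ranking now favours conservative tokens.
```

In a real decoder the same subtraction would be applied to the score vector at every beam-search step, with one conservative set per batch element.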
Table 1: TER on the dev set for the best and worst conservativeness values (c), with beam sizes 4 and 6.

| System | Beam | w/o c | best c | worst c |
|---|---|---|---|---|
| + log probs | 4 | - | 14.84 | 15.06 |
| + log probs | 6 | - | 14.87 | 15.01 |
| + logits | 4 | - | 15.03 | 15.25 |
| + logits | 6 | - | 15.05 | 15.23 |
This year, for the English-German language pair, participants were provided with an in-domain training set and the eSCAPE corpus, an artificially synthesized generic training corpus for APE (Negri et al., 2018). In addition to these corpora, they were allowed to use any additional data to train their systems. Considering this, and the fact that the in-domain training set belongs to the IT domain, we decided to use our own synthetic training corpus. Thus, we trained our models on a combination of the in-domain data released by the APE task organizers and this synthetic dataset.
In-domain training set: we use the <src, mt, pe> triplets in the IT domain without any preprocessing, as they are already preprocessed by the shared task organizers. Unlike the previous year, where the mt side was generated either by a phrase-based or a neural MT system, this year all the source sentences were translated only by a neural MT system unknown to the participants.
Synthetic training set: instead of the eSCAPE corpus provided by the organizers, we created our own synthetic corpus using the parallel data provided by the Quality Estimation shared task (datasets can be found under Additional Resources at http://www.statmt.org/wmt19/qe-task.html). We found this corpus to be closer to the IT domain, which is the target domain of the APE task. To create this corpus we performed the following steps:
1. Split the corpus into 5 folds.
2. Use OpenNMT (Klein et al., 2017) to train 5 LSTM-based translation models, one for every subset created by removing one fold from the training data.
3. Translate each fold with the model that was not trained on it.
4. Join the translations to get an unbiased machine-translated version of the full corpus.
5. Remove empty lines.
The final corpus consists of the resulting <src, mt, pe> triplets. We then oversampled the in-domain training data 20 times (Junczys-Dowmunt and Grundkiewicz, 2018) and used it together with our synthetic data to train our models.
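The fold scheme above can be sketched as follows; the round-robin fold assignment is an illustrative choice, not necessarily the exact split used in our experiments:

```python
# Sketch of the 5-fold round-trip translation used to synthesize mt sides.
# The round-robin assignment below is illustrative only.

N_FOLDS = 5

def fold_assignments(n_sentences, n_folds=N_FOLDS):
    """Assign each sentence index to one of n_folds folds."""
    return [i % n_folds for i in range(n_sentences)]

def training_folds(fold, n_folds=N_FOLDS):
    """Folds used to train the model that will later translate `fold`,
    so no sentence is translated by a model that saw it in training."""
    return [f for f in range(n_folds) if f != fold]

folds = fold_assignments(12)
```

Because each model never sees the fold it translates, the resulting mt sides contain realistic translation errors rather than memorized training targets.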
3.2 BED training
We follow Correia and Martins (2019) for training our BERT-based encoder-decoder APE models; in particular, we follow their setup for the learning rate, optimizer, number of training and warmup steps, and effective batch size. Furthermore, we use a shared matrix for the input and output token embeddings and the projection layer (Press and Wolf, 2017). Finally, we share the self-attention weights between the encoder and the decoder and initialize the multi-head attention of the decoder with the self-attention weights of the encoder.
Similarly to Junczys-Dowmunt (2018), we apply a data weighting strategy during training. However, we use a different weighting approach, where each sample is assigned a weight w defined as w = 1 − TER(mt, pe). This assigns higher weights to samples with fewer mt errors and vice versa, which might sound counter-intuitive, since in the APE task the goal is to learn from the samples with larger numbers of errors. However, in this task, where the translations are provided by strong NMT systems that make very few errors, our APE system needs to be conservative and learn to perform a limited number of modifications to the mt.
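A minimal sketch of such a weighting scheme, assuming the weight is derived from the TER between mt and pe; a plain word-level edit distance stands in here for full TER, which additionally allows block shifts:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance, a rough proxy for TER's edit
    count (TER additionally allows block shifts)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def sample_weight(mt, pe):
    """Weight close to 1 for near-perfect MT, close to 0 for heavily
    edited MT, mirroring w = 1 - TER(mt, pe) clipped to [0, 1]."""
    ref = pe.split()
    ter = edit_distance(mt.split(), ref) / max(len(ref), 1)
    return max(0.0, 1.0 - ter)
```

Untouched samples thus contribute the most to the loss, biasing the model towards conservative edits.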
3.3 BED decoding
In the decoding step we perform standard beam decoding with our conservativeness factor. We fine-tuned this factor on the dev set provided by the organizers. Furthermore, in our experiments we restrict the search over the conservativeness values and use beam sizes of 4 and 6; in our preliminary experiments, larger beam sizes did not improve the performance further. Finally, we used the evaluation script available on the task website to assess the performance of our model.
4 Results and discussion
In our preliminary experiments we noticed that the pure BED model does not improve the quality of the translations provided by strong NMT systems; as Table 1 shows, it actually degrades the TER. Although the scores in Correia and Martins (2019) are closer to the baseline, we find that using the BED model alone, without controlling its conservativeness with respect to the original mt, can lead to baseline-level scores (on dev). Hence, we applied different conservativeness penalties during beam decoding and, as the results in Table 1 show, different values of this hyperparameter significantly change the performance of our model. For the sake of compactness, we present here only the best (best c) and worst (worst c) scores of our model, to show the effect of this factor.
Furthermore, intuitively the logits stand as the best candidate for applying the penalty: not only was this done in a similar fashion previously (Junczys-Dowmunt and Grundkiewicz, 2018), but also, after normalization, the conservative tokens should exhibit large peaks while keeping a stable behaviour. However, we achieved our best scores with penalties over the log probabilities, suggesting that pruning hypotheses directly after normalizing the logits leads to more conservative outputs. Nonetheless, we leave as future work further investigation of the impact of pruning before and after normalizing the logits, as well as exploring renormalization of the log probabilities. Finally, we hypothesize that not only our BED model but also other APE models could benefit from the conservativeness penalty; we leave this to be explored in future work.
Regarding the performance of our model on the official test set, as the results in Table 2 show, we outperform last year's winning systems in both TER and BLEU, which for strong NMT systems is significant. In addition, our submission ranks first in the official results (available at http://www.statmt.org/wmt19/ape-task.html under Results), ex aequo with 3 other systems. Table 3 shows the official results of the shared task, considering only the best submission of each team.
| System | TER | BLEU |
|---|---|---|
| Tebbifakhr et al. (2018) | 16.46 | 75.53 |
Although we do not present an ablation analysis in this paper (due to time constraints), we hypothesize that three of the training and decoding techniques used in this work were influential in the final result: i) the synthetic training corpus contains more IT-domain samples than the generic eSCAPE corpus, making it a suitable dataset for training APE systems for this domain; ii) the data weighting mechanism encourages the system to be more conservative and learn fewer edits, which is crucial for strong specialized NMT engines; and, finally, iii) the conservativeness factor avoids the well-known problem of over-correction generally posed by APE systems on high-quality NMT outputs, guaranteeing faithfulness to the original mt.
We presented Unbabel's submission to the APE shared task at WMT 2019 for the English-German language pair. Our model uses the BERT pre-trained language model within the encoder-decoder framework and applies a conservativeness factor to control the faithfulness of the APE system to the original input stream.
The results of the official evaluation show that our system is able to effectively detect and correct the few errors made by the strong NMT system, improving the scores in terms of both TER and BLEU.
Finally, using APE to improve strong in-domain NMT systems is increasingly challenging, and ideally the editing system should tend to perform fewer and fewer modifications of the raw mt. In line with Junczys-Dowmunt and Grundkiewicz's (2018) suggestion, studying how to apply APE to engines trained on generic data (domain-agnostic) can be a more challenging task, as it requires more robustness and generalization from the APE system.
The authors would like to thank the anonymous reviewers for their feedback. Moreover, we would like to thank António Góis, Fábio Kepler, and Miguel Vera for the fruitful discussions and help. We would also like to acknowledge the support provided by the EU in the context of the PT2020 project (contracts 027767 and 038510), by the European Research Council (ERC StG DeepSPIN 758969), and by the Fundação para a Ciência e Tecnologia through contract UID/EEA/50008/2019.
- Bojar et al. (2016) Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, pages 131–198, Berlin, Germany. Association for Computational Linguistics.
- Bojar et al. (2015) Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.
- Chatterjee et al. (2018a) Rajen Chatterjee, Matteo Negri, Raphael Rubino, and Marco Turchi. 2018a. Findings of the WMT 2018 shared task on automatic post-editing. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 710–725, Belgium, Brussels. Association for Computational Linguistics.
- Chatterjee et al. (2018b) Rajen Chatterjee, Matteo Negri, Raphael Rubino, and Marco Turchi. 2018b. Findings of the WMT 2018 shared task on automatic post-editing. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 710–725, Belgium, Brussels. Association for Computational Linguistics.
- Correia and Martins (2019) Gonçalo Correia and André Martins. 2019. A simple and effective approach to automatic post-editing with transfer learning. In Proceedings of the 57th annual meeting on association for computational linguistics. Association for Computational Linguistics.
- Devlin et al. (2018a) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018a. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Devlin et al. (2018b) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018b. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Junczys-Dowmunt (2018) Marcin Junczys-Dowmunt. 2018. Microsoft's submission to the WMT2018 news translation task: How I learned to stop worrying and love the data. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 425–430.
- Junczys-Dowmunt and Grundkiewicz (2016) Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2016. Log-linear combinations of monolingual and bilingual neural machine translation models for automatic post-editing. In Proceedings of the First Conference on Machine Translation, pages 751–758, Berlin, Germany. Association for Computational Linguistics.
- Junczys-Dowmunt and Grundkiewicz (2018) Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2018. MS-UEdin submission to the WMT2018 APE shared task: Dual-source transformer for automatic post-editing. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, pages 822–826.
- Kepler et al. (2019) Fabio Kepler, Jonay Trénous, Marcos Treviso, Miguel Vera, António Góis, M. Amin Farajian, António V. Lopes, and André F. T. Martins. 2019. Unbabel's participation in the WMT19 translation quality estimation shared task. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy. Association for Computational Linguistics.
- Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL.
- Knight and Chander (1994) Kevin Knight and Ishwar Chander. 1994. Automated postediting of documents. In Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence, AAAI'94, pages 779–784. AAAI Press.
- Liu et al. (2019) Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.
- Negri et al. (2018) Matteo Negri, Marco Turchi, Rajen Chatterjee, and Nicola Bertoldi. 2018. eSCAPE: a large-scale synthetic corpus for automatic post-editing. In LREC 2018, Eleventh International Conference on Language Resources and Evaluation, pages 24–30. European Language Resources Association (ELRA).
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
- Press and Wolf (2017) Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain. Association for Computational Linguistics.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.
- Snover et al. (2006) Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, pages 223–231.
- Tebbifakhr et al. (2018) Amirhossein Tebbifakhr, Ruchit Agrawal, Matteo Negri, and Marco Turchi. 2018. Multi-source transformer with combined losses for automatic post editing. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 846–852, Belgium, Brussels. Association for Computational Linguistics.