1 Introduction
The goal of automatic post-editing (APE; Simard et al., 2007) is to automatically correct the mistakes produced by a black-box machine translation (MT) system. APE is particularly appealing for rapidly customizing MT, since it avoids the need to train new systems from scratch. Interfaces where human translators can post-edit and improve the quality of MT sentences (Alabau et al., 2014; Federico et al., 2014; Denkowski, 2015; Hokamp, 2018) are a common data source for APE models, since they provide triplets of source sentences (src), machine translation outputs (mt), and human post-edits (pe).
Unfortunately, human post-edits are typically scarce. Existing APE systems circumvent this by generating artificial triplets (Junczys-Dowmunt and Grundkiewicz, 2016; Negri et al., 2018). However, this requires access to a high-quality MT system, similar to (or better than) the black-box MT system itself. This defeats the purpose of APE as an alternative to large-scale MT training in the first place: the time needed to train MT systems in order to extract these artificial triplets, combined with the time to train an APE system on the resulting large dataset, may well exceed the time to train an MT system from scratch.
Meanwhile, there have been many successes of transfer learning for NLP: models such as CoVe (McCann et al., 2017), ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018), ULMFiT (Howard and Ruder, 2018), and BERT (Devlin et al., 2019) obtain powerful representations by training large-scale language models and use them to improve performance in many sentence-level and word-level tasks. However, a language generation task such as APE presents additional challenges.
In this paper, we build upon the successes above and show that transfer learning is an effective and time-efficient strategy for APE, using a pre-trained BERT model. This is an appealing strategy in practice: while large language models like BERT are expensive to train, this step is only done once and covers many languages, reducing engineering efforts substantially. This contrasts with the computational and time resources required to create artificial triplets for APE, which must be generated separately for every language pair for which one wishes to train an APE system.
Current APE systems struggle to overcome the MT baseline without additional data. This baseline corresponds to leaving the MT output uncorrected (the “do-nothing” baseline); if an APE system performs worse than this baseline, there is no point in using it. With only the small shared task dataset (23K triplets), our proposed strategy outperforms this baseline by 4.9 TER and 7.4 BLEU on the English-German WMT 2018 APE shared task, with 3 hours of training on a single GPU. Adding the artificial eSCAPE dataset (Negri et al., 2018) leads to a performance of 17.15 TER, a new state of the art.
Our main contributions are the following:
- We combine BERT's ability to handle sentence-pair inputs with its pre-trained multilingual model, using both src and mt in a single cross-lingual encoder that takes a multilingual sentence pair as input.
- We show how pre-trained BERT models can also be used and fine-tuned as the decoder in a language generation task.
- We make a thorough empirical evaluation of different ways of coupling BERT models in an APE system, comparing different options of parameter sharing, initialization, and fine-tuning.
2 Automatic Post-Editing with BERT
2.1 Automatic Post-Editing
APE (Simard et al., 2007) is inspired by human post-editing, in which a translator corrects mistakes made by an MT system. APE systems are trained from triplets (src, mt, pe), containing respectively the source sentence, the machine-translated sentence, and its post-edited version.
Artificial triplets.
Since little real data is available (e.g., the WMT 2018 APE shared task provides 23K triplets), most research has focused on creating artificial triplets to achieve the scale needed for powerful sequence-to-sequence models to outperform the MT baseline, either from “round-trip” translations (Junczys-Dowmunt and Grundkiewicz, 2016) or starting from parallel data, as in the eSCAPE corpus of Negri et al. (2018), which contains 8M synthetic triplets.
Dual-Source Transformer.
The current state of the art in APE uses a Transformer (Vaswani et al., 2017) with two encoders, for the src and mt, and one decoder, for pe (Junczys-Dowmunt and Grundkiewicz, 2018; Tebbifakhr et al., 2018). When trained on the concatenation of human post-edited data and artificial triplets, these systems greatly improve over the MT baseline. However, few successes have been reported using only the shared task training data.
By contrast, with transfer learning, our work outperforms this baseline considerably, even without any auxiliary synthetic dataset; and, as shown in §3, it achieves state-of-the-art results when combined with the aforementioned artificial datasets.

2.2 BERT as a Cross-Lingual Encoder
Our transfer learning approach is based on the Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2019). This model obtains deep bidirectional representations by training a Transformer (Vaswani et al., 2017) on a large-scale dataset with a masked language modeling objective, where the goal is to predict missing words in a sentence. We use the BERTBASE model, which is composed of 12 self-attention layers with hidden size 768, 12 attention heads, and a feed-forward inner layer of size 3072. In addition to the word and learned position embeddings, BERT also has segment embeddings to differentiate between a segment A and a segment B; this is useful for tasks such as natural language inference, which involve two sentences. In the case of APE, the input is also a pair of sentences (src, mt), but in different languages. Since one of the released BERT models was jointly pre-trained on 104 languages (https://github.com/google-research/bert/blob/master/multilingual.md), we use this multilingual BERT pre-trained model to encode the bilingual input pair of APE.
Therefore, the whole encoder of our APE model is the multilingual BERT: we encode both src and mt in the same encoder and use the segment embeddings to differentiate between languages (Figure 1). We reset positional embeddings when the mt starts, since it is not a continuation of src.
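To make this encoder input concrete, the sketch below builds the token, segment, and position indices for a (src, mt) pair in the format described above. It is a minimal sketch, not the released code: it assumes the HuggingFace tokenizer for the multilingual BERT checkpoint, and the helper name and explicit position-id reset are our own illustration of the scheme.

```python
# Minimal sketch (not the authors' code) of the cross-lingual encoder input:
# [CLS] src [SEP] mt [SEP], with segment A for src, segment B for mt, and
# position ids that restart at 0 when mt begins.
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def build_encoder_inputs(src: str, mt: str):
    src_ids = tokenizer.encode(src, add_special_tokens=False)
    mt_ids = tokenizer.encode(mt, add_special_tokens=False)
    cls, sep = tokenizer.cls_token_id, tokenizer.sep_token_id

    input_ids = [cls] + src_ids + [sep] + mt_ids + [sep]
    # segment A (0) for the source, segment B (1) for the machine translation
    segment_ids = [0] * (len(src_ids) + 2) + [1] * (len(mt_ids) + 1)
    # positions restart when mt starts, since mt is not a continuation of src
    position_ids = list(range(len(src_ids) + 2)) + list(range(len(mt_ids) + 1))

    as_batch = lambda x: torch.tensor(x).unsqueeze(0)  # add a batch dimension
    return as_batch(input_ids), as_batch(segment_ids), as_batch(position_ids)
```

These tensors can then be fed to a BERT encoder through its `input_ids`, `token_type_ids`, and `position_ids` arguments.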
2.3 BERT as a Decoder
Prior work has incorporated pre-trained models in encoders, but not as decoders of sequence-to-sequence models. Doing so requires a strategy for generating fluently from the pre-trained model. Note that the bidirectionality of BERT is lost, since the model cannot look at words that have not been generated yet, and it is an open question how to learn decoder-specific blocks (e.g. context attention), which are absent in the pre-trained model.
One of our key contributions is to use BERT in the decoder by experimenting with different strategies for initializing and sharing the self-attention and context-attention layers and the position-wise feed-forward layers. We tie together the encoder and decoder embedding weights (word, position, and segment) along with the decoder output layer (the transpose of the word embedding layer). We use the same segment embedding for the target sentence (pe) and the second sentence in the encoder (mt), since they are in the same language. The full architecture is shown in Figure 1. We experiment with the following strategies for coupling pre-trained BERT models in the decoder (a code sketch follows the list):
- Transformer. A Transformer decoder as described in Vaswani et al. (2017) without any shared parameters, with the BERTBASE dimensions and randomly initialized weights.
- Pre-trained BERT. This initializes the decoder with the pre-trained BERT model. The only component initialized randomly is the context attention (CA) layer, which is absent in BERT. Unlike the original BERT model, which only encodes sentences, a mask in the self-attention is required to prevent the model from looking at subsequent tokens in the target sentence.
- BERT initialized context attention. Instead of a random initialization, we initialize the context-attention layers with the weights of the corresponding BERT self-attention layers.
- Shared self-attention. Instead of just having the same initialization, the self-attention (SA) weights in the encoder and decoder are tied during training.
- Context attention shared with self-attention. We take this a step further and tie the context-attention and self-attention weights, making all the attention transformation matrices (self and context) in the encoder and decoder tied.
- Shared feed-forward. We tie the feed-forward (FF) weights between the encoder and decoder.
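The sketch below illustrates, in plain PyTorch, the main mechanics behind these strategies: the causal mask that replaces BERT's bidirectional attention in the decoder, initializing the context attention from BERT's self-attention weights, and tying decoder modules to their encoder counterparts. The decoder objects and attribute names (`self_attn`, `context_attn`, `bert_layer.attention`) are hypothetical placeholders for illustration, not the names used in our implementation.

```python
# Hedged sketch of the decoder coupling strategies; the decoder-layer objects
# and attribute names are hypothetical placeholders, not the paper's code.
import copy
import torch

def causal_mask(length: int) -> torch.Tensor:
    # Position i may attend only to positions <= i: BERT's bidirectionality
    # is lost in the decoder because future tokens are not yet generated.
    return torch.tril(torch.ones(length, length, dtype=torch.bool))

def init_context_attention_from_bert(decoder_layer, bert_layer):
    # "BERT initialized context attention": copy the pre-trained self-attention
    # weights into the context-attention block, which has the same multi-head shape.
    decoder_layer.context_attn.load_state_dict(
        copy.deepcopy(bert_layer.attention.state_dict()))

def share_self_attention(decoder_layer, encoder_layer):
    # "Shared self-attention": reuse the encoder module so the parameters are
    # tied during training, not merely initialized to the same values.
    decoder_layer.self_attn = encoder_layer.attention

def share_context_attention_with_self(decoder_layer):
    # "Context attention shared with self-attention": tie CA and SA weights too.
    decoder_layer.context_attn = decoder_layer.self_attn

def share_feed_forward(decoder_layer, encoder_layer):
    # "Shared feed-forward": tie the position-wise feed-forward blocks.
    decoder_layer.feed_forward = encoder_layer.feed_forward
```

Sharing is implemented by pointing two layers at the same module (so gradients update a single set of weights), whereas initialization only copies the values once and then lets the two layers diverge.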
3 Experiments
We now describe our experimental results. Our models were implemented on a fork of OpenNMT-py (Klein et al., 2017), using a PyTorch (Paszke et al., 2017) re-implementation of BERT (https://github.com/huggingface/pytorch-pretrained-BERT). Our model’s implementation is publicly available at https://github.com/deep-spin/OpenNMT-APE.
Datasets.
We use the data from the WMT 2018 APE shared task (Chatterjee et al., 2018) (English-German SMT), which consists of 23,000 triplets for training, 1,000 for validation, and 2,000 for testing. In some of our experiments, we also use the eSCAPE corpus (Negri et al., 2018), which comprises about 8M sentences; when doing so, we oversample the shared task data 35x to increase its weight in the final training data. We segment words with WordPiece (Wu et al., 2016), using the same vocabulary as the multilingual BERT. At training time, we discard triplets with more than 200 tokens in the combination of src and mt, or more than 100 tokens in pe. For evaluation, we use TER (Snover et al., 2006) and tokenized BLEU (Papineni et al., 2002).
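As a concrete example of the length filter above, the snippet below keeps only triplets whose WordPiece-tokenized src plus mt is at most 200 tokens and whose pe is at most 100 tokens. It uses the HuggingFace multilingual BERT tokenizer as a stand-in for the WordPiece vocabulary mentioned in the text; the function name is ours.

```python
# Sketch of the training-data length filter described above (thresholds from the text).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

def keep_triplet(src: str, mt: str, pe: str,
                 max_src_mt: int = 200, max_pe: int = 100) -> bool:
    n_src_mt = len(tokenizer.tokenize(src)) + len(tokenizer.tokenize(mt))
    n_pe = len(tokenizer.tokenize(pe))
    return n_src_mt <= max_src_mt and n_pe <= max_pe

# Usage: filtered = [t for t in triplets if keep_triplet(*t)]
```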
Table 1: Comparison of decoder coupling strategies (§2.3) on the WMT 2018 APE English-German validation set.

| Decoder | TER | BLEU |
|---|---|---|
| Transformer decoder | 20.33 | 69.31 |
| Pre-trained BERT | 20.83 | 69.11 |
| with CA ← SA | 18.91 | 71.81 |
| and SA ↔ Encoder SA | 18.44 | 72.25 |
| and CA ↔ SA | 18.75 | 71.83 |
| and FF ↔ Encoder FF | 19.04 | 71.53 |
Table 2: Results on the WMT 2016–18 APE test sets (English-German SMT). “Train Size” is the number of training triplets; “×4” denotes an ensemble of 4 models.

| Model | Train Size | TER (2016) | BLEU (2016) | TER (2017) | BLEU (2017) | TER (2018) | BLEU (2018) |
|---|---|---|---|---|---|---|---|
| MT baseline (Uncorrected) | — | 24.76 | 62.11 | 24.48 | 62.49 | 24.24 | 62.99 |
| Bérard et al. (2017) | 23K | 22.89 | — | 23.08 | 65.57 | — | — |
| Junczys-Dowmunt and Grundkiewicz (2018) | 5M | 18.92 | 70.86 | 19.49 | 69.72 | — | — |
| Junczys-Dowmunt and Grundkiewicz (2018) ×4 | 5M | 18.86 | 71.04 | 19.03 | 70.46 | — | — |
| Tebbifakhr et al. (2018) | 8M | — | — | — | — | 18.62 | 71.04 |
| Junczys-Dowmunt and Grundkiewicz (2018) | 8M | 17.81 | 72.79 | 18.10 | 71.72 | — | — |
| Junczys-Dowmunt and Grundkiewicz (2018) ×4 | 8M | 17.34 | 73.43 | 17.47 | 72.84 | 18.00 | 72.52 |
| Dual-Source Transformer | 23K | 27.80 | 60.76 | 27.73 | 59.78 | 28.00 | 59.98 |
| BERT Enc. + Transformer Dec. (Ours) | 23K | 20.23 | 68.98 | 21.02 | 67.47 | 20.93 | 67.60 |
| BERT Enc. + BERT Dec. (Ours) | 23K | 18.88 | 71.61 | 19.03 | 70.66 | 19.34 | 70.41 |
| BERT Enc. + BERT Dec. (Ours) ×4 | 23K | 18.05 | 72.39 | 18.07 | 71.90 | 18.91 | 70.94 |
| BERT Enc. + BERT Dec. (Ours) | 8M | 16.91 | 74.29 | 17.26 | 73.42 | 17.71 | 72.74 |
| BERT Enc. + BERT Dec. (Ours) ×4 | 8M | 16.49 | 74.98 | 16.83 | 73.94 | 17.15 | 73.60 |
Training Details.
We use Adam (Kingma and Ba, 2014) with a triangular learning rate schedule that increases linearly during the first 5,000 steps up to its peak value and decays linearly afterwards. When using BERT components, we apply weight decay. We apply dropout (Srivastava et al., 2014) to all layers and use label smoothing (Pereyra et al., 2017). For the small data experiments, we use a batch size of 1024 tokens and save checkpoints every 1,000 steps; when using the eSCAPE corpus, we increase this to 2048 tokens and 10,000 steps. The checkpoints are created with the exponential moving average strategy of Junczys-Dowmunt et al. (2018). At test time, we select the model with the best TER on the development set, and apply beam search with a beam size of 8 and average length penalty.
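For illustration, a minimal version of the triangular schedule described above is given below: a linear warm-up over the first 5,000 steps to a peak learning rate, followed by a linear decay. The peak value and total number of steps are placeholder assumptions, not values reported in the text.

```python
# Minimal sketch of the triangular learning-rate schedule (warm-up + linear decay).
# `peak_lr` and `total_steps` are placeholder assumptions, not the paper's values.
def triangular_lr(step: int, peak_lr: float = 5e-5,
                  warmup_steps: int = 5000, total_steps: int = 100_000) -> float:
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                      # linear increase
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)     # linear decay to 0
```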
Initialization and Parameter Sharing.
Table 1 compares the different decoder strategies described in §2.3 on the WMT 2018 validation set. The best results were achieved by sharing the self-attention between encoder and decoder, and by initializing (but not sharing) the context attention with the same weights as the self-attention. Regarding self-attention sharing, we hypothesize that its benefits come from the encoder and decoder sharing a common language in their input (in the mt and pe sentences, respectively). Future work will investigate whether this is still beneficial when the source and target languages are less similar. On the other hand, initializing the context attention with BERT's self-attention weights is essential to reap the benefits of BERT representations in the decoder: without it, using BERT decreases performance compared to a regular Transformer decoder. This might be because context attention and self-attention share the same neural block architecture (multi-head attention), so the context attention benefits from the better weight initialization of the pre-trained BERT. No benefit was observed from sharing the feed-forward weights.
Final Results.
Finally, Table 2 shows our results on the WMT 2016–18 test sets. The model named BERT Enc. + BERT Dec. corresponds to the best setting found in Table 1, while BERT Enc. + Transformer Dec. only uses BERT in the encoder. We show results for single models and ensembles of 4 independent models.
Using only the small shared task dataset (23K triplets), our single BERT Enc. + BERT Dec. model surpasses the MT baseline by a large margin (4.90 TER on test 2018). The only system we are aware of that beats the MT baseline with only the shared task data is that of Bérard et al. (2017), which we also outperform (by 4.05 TER on test 2017). With only about 3 GPU-hours of training and a much smaller dataset, our model reaches a performance comparable to an ensemble of the best WMT 2018 system trained with an artificial dataset of 5M triplets (a difference of only 0.02 TER on test 2016), which is much more expensive to train. With an ensemble of 4 models, we obtain results competitive with systems trained on 8M triplets.
When adding the eSCAPE corpus (8M triplets), performance surpasses the previous state of the art on all test sets. By ensembling, we improve even further, achieving a final 17.15 TER score on test 2018 (0.85 TER better than the previous state of the art).
4 Related Work
In their Dual-Source Transformer model, Junczys-Dowmunt and Grundkiewicz (2018) also found gains by tying together encoder parameters and the embeddings of both encoders and the decoder. Our work confirms this finding, but shows further gains from using segment embeddings and more careful sharing and initialization strategies. Sachan and Neubig (2018) explore parameter sharing between Transformer layers. However, they focus on sharing decoder parameters in a one-to-many multilingual MT system. In our work, we share parameters between the encoder and the decoder.
As stated in §3, Bérard et al. (2017) also showed improved results over the MT baseline, using exclusively the shared task data. Their system outputs edit operations that decide whether to insert, keep or delete tokens from the machine translated sentence. Instead of relying on edit operations, our approach mitigates the small amount of data with transfer learning through BERT.
Our work makes use of recent advances in transfer learning for NLP (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019). Pre-training these large language models has greatly improved the state of the art on the GLUE benchmark (Wang et al., 2018). In particular, our work uses the pre-trained BERT model and makes use of its representations not only in the encoder but also in the decoder of a language generation task.
More closely related to our work, Lample and Conneau (2019) pre-trained a BERT-like language model using parallel data, which they used to initialize the encoder and decoder of supervised and unsupervised MT systems. They also used segment embeddings (along with word and position embeddings) to differentiate between a pair of sentences in different languages. However, this is only used in one of the pre-training phases of the language model (translation language modeling) and not in the downstream task. In our work, we use segment embeddings during the downstream task itself, which is a natural fit for the APE task.
Lopes et al. (2019) used our model on the harder English-German NMT subtask and obtained better TER performance than the previous state of the art. To obtain this result, the transfer learning capabilities of BERT were not enough and further engineering effort was required. In particular, a conservativeness factor was added during beam decoding to constrain the changes the APE system can make to the mt output. Furthermore, the authors used a data weighting method to augment the importance of data samples that have lower TER: samples that required less post-editing effort are assigned higher weights during training. Since the NMT system makes very few errors in this domain, this data weighting is important for the APE model to learn to make fewer corrections to the mt output. However, their approach still required the creation of an artificial dataset to improve over the MT baseline. We leave it for future work to investigate better methods for improving over the baseline using only real post-edited data in these smaller APE datasets.
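As an illustration of the data-weighting idea just described, the sketch below assigns larger weights to triplets whose mt required little post-editing (low TER between mt and pe). The linear weighting function and its parameters are our own assumption for illustration; Lopes et al. (2019) describe the exact scheme they used.

```python
# Illustrative TER-based sample weighting: low-TER triplets (little post-editing
# needed) receive larger training weights. The linear form is an assumption.
def sample_weight(ter: float, min_weight: float = 0.1) -> float:
    ter = min(max(ter, 0.0), 1.0)   # clip TER to [0, 1]
    return max(1.0 - ter, min_weight)

# The resulting weight can scale each sample's loss during training.
```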
5 Conclusion and Future Work
We proposed a transfer learning approach to APE using BERT pre-trained models and careful parameter sharing. We explored various ways for coupling BERT in the decoder for language generation. We found it beneficial to initialize the context attention of the decoder with BERT’s self-attention and to tie together the parameters of the self-attention layers between the encoder and decoder. Using a small dataset, our results are competitive with systems trained on a large amount of artificial data, with much faster training. By adding artificial data, we obtain a new state of the art in APE.
In future work, we would like to perform an extensive analysis of the capabilities of BERT and of transfer learning in general for different domains and language pairs in APE.
Acknowledgments
This work was supported by the European Research Council (ERC StG DeepSPIN 758969), and by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal). We thank the anonymous reviewers for their feedback.
References
- Alabau et al. (2014) Vicent Alabau, Christian Buck, Michael Carl, Francisco Casacuberta, Mercedes García-Martínez, Ulrich Germann, Jesús González-Rubio, Robin Hill, Philipp Koehn, Luis Leiva, et al. 2014. CASMACAT: A Computer-assisted Translation Workbench. In Proceedings of the Demonstrations at EACL.
- Bérard et al. (2017) Alexandre Bérard, Laurent Besacier, and Olivier Pietquin. 2017. LIG-CRIStAL Submission for the WMT 2017 Automatic Post-Editing Task. In Proceedings of WMT17.
- Chatterjee et al. (2018) Rajen Chatterjee, Matteo Negri, Raphael Rubino, and Marco Turchi. 2018. Findings of the WMT 2018 Shared Task on Automatic Post-Editing. In Proceedings of WMT18.
- Denkowski (2015) Michael Denkowski. 2015. Machine Translation for Human Translators. Ph.D. thesis, Carnegie Mellon University.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL.
- Federico et al. (2014) Marcello Federico, Nicola Bertoldi, Mauro Cettolo, Matteo Negri, Marco Turchi, Marco Trombetti, Alessandro Cattelan, Antonio Farina, Domenico Lupinetti, Andrea Martines, et al. 2014. The MateCat Tool. In Proceedings of COLING, System Demonstrations.
- Hokamp (2018) Christopher M Hokamp. 2018. Deep Interactive Text Prediction and Quality Estimation in Translation Interfaces. Ph.D. thesis, Dublin City University.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In Proceedings of ACL.
- Junczys-Dowmunt and Grundkiewicz (2016) Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2016. Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing. In Proceedings of WMT16.
- Junczys-Dowmunt and Grundkiewicz (2018) Marcin Junczys-Dowmunt and Roman Grundkiewicz. 2018. MS-UEdin Submission to the WMT2018 APE Shared Task: Dual-Source Transformer for Automatic Post-Editing. In Proceedings of WMT18.
- Junczys-Dowmunt et al. (2018) Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, et al. 2018. Marian: Fast Neural Machine Translation in C++. In Proceedings of ACL, System Demonstrations.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. preprint arXiv:1412.6980.
- Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proceedings of ACL 2017, System Demonstrations.
- Lample and Conneau (2019) Guillaume Lample and Alexis Conneau. 2019. Cross-lingual Language Model Pretraining. preprint arXiv:1901.07291.
- Lopes et al. (2019) António V. Lopes, M. Amin Farajian, Gonçalo M. Correia, Jonay Trenous, and André F. T. Martins. 2019. Unbabel’s Submission to the WMT2019 APE Shared Task: BERT-based Encoder-Decoder for Automatic Post-Editing. In Proceedings of WMT19.
- McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in Translation: Contextualized Word Vectors. In Proceedings of NeurIPS.
- Negri et al. (2018) Matteo Negri, Marco Turchi, Rajen Chatterjee, and Nicola Bertoldi. 2018. eSCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing. In Proceedings of LREC.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. Proceedings of NeurIPS Autodiff Workshop.
- Pereyra et al. (2017) Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing Neural Networks by Penalizing Confident Output Distributions. preprint arXiv:1701.06548.
- Peters et al. (2018) Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of NAACL.
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. preprint.
- Sachan and Neubig (2018) Devendra Sachan and Graham Neubig. 2018. Parameter Sharing Methods for Multilingual Self-Attentional Translation Models. In Proceedings of WMT18.
- Simard et al. (2007) Michel Simard, Nicola Ueffing, Pierre Isabelle, and Roland Kuhn. 2007. Rule-Based Translation with Statistical Phrase-Based Post-Editing. In Proceedings of WMT07.
- Snover et al. (2006) Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of AMTA.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
- Tebbifakhr et al. (2018) Amirhossein Tebbifakhr, Ruchit Agrawal, Matteo Negri, and Marco Turchi. 2018. Multi-source Transformer with Combined Losses for Automatic Post-Editing. In Proceedings of WMT18.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Proceedings of NeurIPS.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of EMNLP Workshop BlackboxNLP.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. preprint arXiv:1609.08144.