A Simple and Effective Approach to Automatic Post-Editing with Transfer Learning

June 14, 2019 · Gonçalo M. Correia et al. · Unbabel Inc. · Instituto de Telecomunicações

Automatic post-editing (APE) seeks to automatically refine the output of a black-box machine translation (MT) system through human post-edits. APE systems are usually trained by complementing human post-edited data with large, artificial data generated through back-translations, a time-consuming process often no easier than training an MT system from scratch. In this paper, we propose an alternative where we fine-tune pre-trained BERT models on both the encoder and decoder of an APE system, exploring several parameter sharing strategies. By only training on a dataset of 23K sentences for 3 hours on a single GPU, we obtain results that are competitive with systems that were trained on 5M artificial sentences. When we add this artificial data, our method obtains state-of-the-art results.


1 Introduction

The goal of automatic post-editing (APE; Simard et al., 2007) is to automatically correct the mistakes produced by a black-box machine translation (MT) system. APE is particularly appealing for rapidly customizing MT, avoiding the need to train new systems from scratch. Interfaces where human translators can post-edit and improve the quality of MT sentences (Alabau et al., 2014; Federico et al., 2014; Denkowski, 2015; Hokamp, 2018) are a common data source for APE models, since they provide triplets of source sentences (src), machine translation outputs (mt), and human post-edits (pe).

Unfortunately, human post-edits are typically scarce. Existing APE systems circumvent this by generating artificial triplets (Junczys-Dowmunt and Grundkiewicz, 2016; Negri et al., 2018). However, this requires access to a high-quality MT system, similar to (or better than) the one used in the black-box MT itself. This undermines the motivation for APE as an alternative to large-scale MT training in the first place: the time needed to train MT systems in order to extract these artificial triplets, combined with the time to train an APE system on the resulting large dataset, may well exceed the time to train an MT system from scratch.

Meanwhile, there have been many successes of transfer learning for NLP: models such as CoVe (McCann et al., 2017), ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018), ULMFiT (Howard and Ruder, 2018), and BERT (Devlin et al., 2019) obtain powerful representations by training large-scale language models and use them to improve performance in many sentence-level and word-level tasks. However, a language generation task such as APE presents additional challenges.

In this paper, we build upon the successes above and show that transfer learning is an effective and time-efficient strategy for APE, using a pre-trained BERT model. This is an appealing strategy in practice: while large language models like BERT are expensive to train, this step is only done once and covers many languages, reducing engineering efforts substantially. This is in contrast with the computational and time resources that creating artificial triplets for APE needs—these triplets need to be created separately for every language pair that one wishes to train an APE system for.

Current APE systems struggle to overcome the MT baseline without additional data. This baseline corresponds to leaving the MT output uncorrected (the “do-nothing” baseline); if an APE system performs worse than this baseline, there is no point in using it. With only the small shared task dataset (23K triplets), our proposed strategy outperforms this baseline by 4.9 TER and 7.4 BLEU in the English-German WMT 2018 APE shared task, with 3 hours of training on a single GPU. Adding the artificial eSCAPE dataset (Negri et al., 2018) leads to a performance of 17.15 TER, a new state of the art.

Our main contributions are the following:

  • We combine BERT's ability to handle sentence-pair inputs with its pre-trained multilingual model, encoding both the src and mt in a single cross-lingual encoder that takes a multilingual sentence pair as input.

  • We show how pre-trained BERT models can also be used and fine-tuned as the decoder in a language generation task.

  • We make a thorough empirical evaluation of different ways of coupling BERT models in an APE system, comparing different options of parameter sharing, initialization, and fine-tuning.

2 Automatic Post-Editing with BERT

2.1 Automatic Post-Editing

APE (Simard et al., 2007) is inspired by human post-editing, in which a translator corrects mistakes made by an MT system. APE systems are trained from triplets (src, mt, pe), containing respectively the source sentence, the machine-translated sentence, and its post-edited version.
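
For concreteness, a minimal Python sketch of this data structure is given below; the field names follow the paper's src/mt/pe notation, and the example sentences are invented for illustration, not taken from any dataset.

    from typing import NamedTuple

    class APETriplet(NamedTuple):
        """One APE training example; field names follow the paper's notation."""
        src: str  # source-language sentence
        mt: str   # raw output of the black-box MT system
        pe: str   # human post-edited version of mt

    # Toy example (invented for illustration; not from the shared task data).
    example = APETriplet(
        src="The report was published yesterday.",
        mt="Der Bericht wurde gestern veröffentlicht .",
        pe="Der Bericht wurde gestern veröffentlicht.",
    )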

Artificial triplets.

Since there is little data available (e.g., the WMT 2018 APE shared task has 23K triplets), most research has focused on creating artificial triplets to achieve the scale that is needed for powerful sequence-to-sequence models to outperform the MT baseline, either from “round-trip” translations (Junczys-Dowmunt and Grundkiewicz, 2016) or starting from parallel data, as in the eSCAPE corpus of Negri et al. (2018), which contains 8M synthetic triplets.

Dual-Source Transformer.

The current state of the art in APE uses a Transformer (Vaswani et al., 2017) with two encoders, for the src and mt, and one decoder, for pe (Junczys-Dowmunt and Grundkiewicz, 2018; Tebbifakhr et al., 2018). When concatenating human post-edited data and artificial triplets, these systems greatly improve over the MT baseline. However, few successes have been reported using the shared task training data only.

By contrast, with transfer learning, our work outperforms this baseline considerably, even without any auxiliary synthetic dataset; and, as shown in §3, it achieves state-of-the-art results by combining it with the aforementioned artificial datasets.

Figure 1: Dual-Source BERT. Dashed lines show shared parameters in our best configuration.

2.2 BERT as a Cross-Lingual Encoder

Our transfer learning approach is based on the Bidirectional Encoder Representations from Transformers (BERT; Devlin et al., 2019). This model obtains deep bidirectional representations by training a Transformer (Vaswani et al., 2017) on a large-scale dataset in a masked language modeling task, where the objective is to predict missing words in a sentence. We use the BERTBASE model, which is composed of 12 self-attention layers with hidden size 768, 12 attention heads, and feed-forward inner layer size 3,072. In addition to the word and learned position embeddings, BERT also has segment embeddings to differentiate between a segment A and a segment B; this is useful for tasks such as natural language inference, which involve two sentences. In the case of APE, there is also a pair of input sentences (src, mt), which are in different languages. Since one of the released BERT models was jointly pre-trained on 104 languages (https://github.com/google-research/bert/blob/master/multilingual.md), we use this multilingual BERT pre-trained model to encode the bilingual input pair of APE.

Therefore, the whole encoder of our APE model is the multilingual BERT: we encode both src and mt in the same encoder and use the segment embeddings to differentiate between languages (Figure 1). We reset positional embeddings when the mt starts, since it is not a continuation of src.
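
The following is a minimal PyTorch sketch of how such an input pair could be packed for the encoder, assuming both sentences have already been segmented into WordPiece token ids. The function name and the [CLS]/[SEP] layout follow BERT's standard sentence-pair format and are illustrative, not a description of the released code.

    import torch

    def build_encoder_input(src_ids, mt_ids, cls_id, sep_id):
        """Pack a (src, mt) pair into one cross-lingual BERT encoder input.

        `src_ids` and `mt_ids` are lists of WordPiece token ids produced by
        any BERT-compatible tokenizer (helper names are illustrative).
        """
        # [CLS] src [SEP] mt [SEP], as in BERT's sentence-pair format.
        tokens = [cls_id] + src_ids + [sep_id] + mt_ids + [sep_id]

        # Segment A for the source language, segment B for the target language.
        segments = [0] * (len(src_ids) + 2) + [1] * (len(mt_ids) + 1)

        # Positions restart at zero when mt begins, since mt is not a
        # continuation of src.
        positions = list(range(len(src_ids) + 2)) + list(range(len(mt_ids) + 1))

        return (torch.tensor([tokens]),
                torch.tensor([segments]),
                torch.tensor([positions]))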

2.3 BERT as a Decoder

Prior work has incorporated pre-trained models in encoders, but not as decoders of sequence-to-sequence models. Doing so requires a strategy for generating fluently from the pre-trained model. Note that the bidirectionality of BERT is lost, since the model cannot look at words that have not been generated yet, and it is an open question how to learn decoder-specific blocks (e.g. context attention), which are absent in the pre-trained model.

One of our key contributions is to use BERT in the decoder by experimenting with different strategies for initializing and sharing the self-attention, context attention, and position-wise feed-forward layers. We tie together the encoder and decoder embedding weights (word, position, and segment) along with the decoder output layer (transpose of the word embedding layer). We use the same segment embedding for the target sentence (pe) and the second sentence in the encoder (mt), since they are in the same language. The full architecture is shown in Figure 1. We experiment with the following strategies for coupling pre-trained BERT models in the decoder (a code sketch after this list illustrates the best-performing configuration):

  • Transformer. A Transformer decoder as described in Vaswani et al. (2017), without any shared parameters, with the BERTBASE dimensions and randomly initialized weights.

  • Pre-trained BERT. This initializes the decoder with the pre-trained BERT model. The only component initialized randomly is the context attention (CA) layer, which is absent in BERT. Unlike in the original BERT model, which only encodes sentences, a mask in the self-attention is required to prevent the model from attending to subsequent tokens in the target sentence.

  • BERT initialized context attention. Instead of a random initialization, we initialize the context attention layers with the weights of the corresponding BERT self-attention layers.

  • Shared self-attention. Instead of just having the same initialization, the self-attentions (SA) in the encoder and decoder are tied during training.

  • Context attention shared with self-attention. We take a step further and tie the context attention and self-attention weights, making all the attention transformation matrices (self and context) in the encoder and decoder tied.

  • Shared feed-forward. We tie the feed-forward weights (FF) between the encoder and decoder.
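
To make these options concrete, the sketch below shows, in plain PyTorch, how the best-performing configuration could be wired: shared self-attention, context attention initialized from BERT's self-attention, and tied embeddings and output layer. The sub-module attribute names (self_attn, context_attn, embeddings, word_embeddings) are assumptions in the spirit of OpenNMT-py layers, not the actual identifiers of the released implementation.

    import copy

    def couple_layers(encoder_layer, decoder_layer):
        """Couple one decoder layer to its encoder counterpart (illustrative).

        Both layers are assumed to expose `self_attn` and `context_attn`
        multi-head attention sub-modules of matching shapes.
        """
        # Shared self-attention: encoder and decoder point to the *same* module,
        # so both contribute gradients to a single set of weights.
        decoder_layer.self_attn = encoder_layer.self_attn

        # Context attention is only *initialized* from BERT's self-attention
        # weights; after the copy it is trained as a separate parameter set.
        decoder_layer.context_attn = copy.deepcopy(encoder_layer.self_attn)

        # Feed-forward weights stay independent (sharing them gave no benefit).
        return decoder_layer

    def tie_embeddings_and_output(encoder_embeddings, decoder, output_projection):
        """Tie word/position/segment embeddings and the output projection."""
        # Encoder and decoder read from the same embedding tables.
        decoder.embeddings = encoder_embeddings
        # nn.Linear computes x @ W.T with W of shape (vocab, hidden), so reusing
        # the (vocab, hidden) word embedding matrix realizes the transposed
        # output layer described above.
        output_projection.weight = encoder_embeddings.word_embeddings.weight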

3 Experiments

We now describe our experimental results. Our models were implemented on a fork of OpenNMT-py (Klein et al., 2017), using a PyTorch (Paszke et al., 2017) re-implementation of BERT (https://github.com/huggingface/pytorch-pretrained-BERT). Our model’s implementation is publicly available at https://github.com/deep-spin/OpenNMT-APE.

Datasets.

We use the data from the WMT 2018 APE shared task (Chatterjee et al., 2018) (English-German SMT), which consists of 23,000 triplets for training, 1,000 for validation, and 2,000 for testing. In some of our experiments, we also use the eSCAPE corpus (Negri et al., 2018), which comprises about 8M sentences; when doing so, we oversample the shared task data 35x, so that it makes up roughly 10% of the final training data. We segment words with WordPiece (Wu et al., 2016), using the same vocabulary as the Multilingual BERT. At training time, we discard triplets in which src and mt jointly exceed 200 tokens or pe exceeds 100 tokens. For evaluation, we use TER (Snover et al., 2006) and tokenized BLEU (Papineni et al., 2002).
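
A small Python sketch of the length filtering and oversampling described above is shown below; function and argument names are illustrative, and the triplets are assumed to be already segmented into WordPiece tokens.

    def filter_and_mix(shared_task, escape, max_src_mt=200, max_pe=100, oversample=35):
        """Length-filter triplets and mix real with synthetic data (illustrative)."""
        def keep(triplet):
            src, mt, pe = triplet
            # Discard triplets whose src+mt exceed 200 tokens or whose pe exceeds 100.
            return len(src) + len(mt) <= max_src_mt and len(pe) <= max_pe

        shared_task = [t for t in shared_task if keep(t)]
        escape = [t for t in escape if keep(t)]

        # Repeat the 23K human post-edited triplets 35 times so that they are
        # not drowned out by the ~8M synthetic eSCAPE triplets.
        return shared_task * oversample + escape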

                                TER      BLEU
Transformer decoder             20.33    69.31
Pre-trained BERT                20.83    69.11
  with CA ← SA                  18.91    71.81
  and SA ↔ Encoder SA           18.44    72.25
  and CA ↔ SA                   18.75    71.83
  and FF ↔ Encoder FF           19.04    71.53
Table 1: Ablation study of decoder configurations, obtained by gradually sharing more parameters between the encoder and decoder (trained without synthetic data). ↔ denotes parameter tying and ← an initialization.
Model                                       Train Size   test 2016        test 2017        test 2018
                                                         TER     BLEU     TER     BLEU     TER     BLEU
MT baseline (Uncorrected)                       –        24.76   62.11    24.48   62.49    24.24   62.99
Bérard et al. (2017)                           23K       22.89   –        23.08   65.57    –       –
Junczys-Dowmunt and Grundkiewicz (2018)         5M       18.92   70.86    19.49   69.72    –       –
Junczys-Dowmunt and Grundkiewicz (2018) ×4      5M       18.86   71.04    19.03   70.46    –       –
Tebbifakhr et al. (2018)                        8M       –       –        –       –        18.62   71.04
Junczys-Dowmunt and Grundkiewicz (2018)         8M       17.81   72.79    18.10   71.72    –       –
Junczys-Dowmunt and Grundkiewicz (2018) ×4      8M       17.34   73.43    17.47   72.84    18.00   72.52
Dual-Source Transformer                        23K       27.80   60.76    27.73   59.78    28.00   59.98
BERT Enc. + Transformer Dec. (Ours)            23K       20.23   68.98    21.02   67.47    20.93   67.60
BERT Enc. + BERT Dec. (Ours)                   23K       18.88   71.61    19.03   70.66    19.34   70.41
BERT Enc. + BERT Dec. ×4 (Ours)                23K       18.05   72.39    18.07   71.90    18.91   70.94
BERT Enc. + BERT Dec. (Ours)                    8M       16.91   74.29    17.26   73.42    17.71   72.74
BERT Enc. + BERT Dec. ×4 (Ours)                 8M       16.49   74.98    16.83   73.94    17.15   73.60
Table 2: Results on the WMT 2016–18 APE shared task datasets. Our single models trained on the 23K dataset took only 3h20m to converge on a single Nvidia GeForce GTX 1080 GPU, while models trained on 8M triplets take approximately 2 days on the same GPU. Models marked with “×4” are ensembles of 4 models. The Dual-Source Transformer is a comparable re-implementation of Junczys-Dowmunt and Grundkiewicz (2018).

Training Details.

We use Adam (Kingma and Ba, 2014) with a triangular learning rate schedule that increases linearly to its peak value during the first 5,000 steps and decays linearly afterwards. When using BERT components, we apply weight decay. We apply dropout (Srivastava et al., 2014) to all layers and use label smoothing (Pereyra et al., 2017). For the small data experiments, we use a batch size of 1024 tokens and save checkpoints every 1,000 steps; when using the eSCAPE corpus, we increase these to 2048 tokens and 10,000 steps. The checkpoints are created with the exponential moving average strategy of Junczys-Dowmunt et al. (2018). At test time, we select the model with the best TER on the development set and apply beam search with a beam size of 8 and average length penalty.
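
As an illustration of the triangular schedule, the following PyTorch sketch implements a linear warmup over the first 5,000 steps followed by a linear decay; the total number of steps, the peak learning rate, and the weight decay value shown are placeholders rather than the exact values used in the paper.

    import torch
    from torch.optim import Adam
    from torch.optim.lr_scheduler import LambdaLR

    def triangular_schedule(optimizer, warmup_steps=5000, total_steps=100_000):
        """Linear warmup to the peak learning rate, then linear decay to zero."""
        def lr_lambda(step):
            if step < warmup_steps:
                return step / max(1, warmup_steps)
            return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
        return LambdaLR(optimizer, lr_lambda)

    # Usage sketch: peak learning rate and weight decay below are placeholders.
    model = torch.nn.Linear(768, 768)      # stand-in for the real APE model
    optimizer = Adam(model.parameters(), lr=5e-5, weight_decay=0.01)
    scheduler = triangular_schedule(optimizer)

    for _ in range(10):                    # toy loop: step optimizer, then schedule
        optimizer.step()
        scheduler.step()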

Initialization and Parameter Sharing.

Table 1 compares the different decoder strategies described in §2.3 on the WMT 2018 validation set. The best results were achieved by sharing the self-attention between encoder and decoder, and by initializing (but not sharing) the context attention with the same weights as the self-attention. Regarding the self-attention sharing, we hypothesize that its benefits are due to both encoder and decoder sharing a common language in their input (in the mt and pe sentences, respectively). Future work will investigate whether this is still beneficial when the source and target languages are less similar. On the other hand, the initialization of the context attention with BERT’s self-attention weights is essential to reap the benefits of BERT representations in the decoder: without it, using BERT decreases performance when compared to a regular Transformer decoder. This might be because context attention and self-attention share the same neural block architecture (multi-head attention), and thus the context attention benefits from the better weight initialization of the pre-trained BERT. No benefit was observed from sharing the feed-forward weights.

Final Results.

Finally, Table 2 shows our results on the WMT 2016–18 test sets. The model named BERT Enc. + BERT Dec. corresponds to the best setting found in Table 1, while BERT Enc. + Transformer Dec. only uses BERT in the encoder. We show results for single models and ensembles of 4 independent models.

Using only the small shared task dataset (23K triplets), our single BERT Enc. + BERT Dec. model surpasses the MT baseline by a large margin (4.90 TER in test 2018). The only system we are aware of that beats the MT baseline with only the shared task data is Bérard et al. (2017), which we also outperform (by 4.05 TER in test 2017). With only about 3 GPU-hours and a much smaller dataset, our model reaches a performance comparable to an ensemble of the best WMT 2018 system trained with an artificial dataset of 5M triplets (a difference of only 0.02 TER in test 2016), which is much more expensive to train. With an ensemble of 4 models, we obtain results competitive with systems trained on 8M triplets.

When adding the eSCAPE corpus (8M triplets), performance surpasses the state of the art in all test sets. By ensembling, we improve even further, achieving a final 17.15 TER score in test 2018 (0.85 TER lower than the previous state of the art).

4 Related Work

In their Dual-Source Transformer model, Junczys-Dowmunt and Grundkiewicz (2018) also found gains by tying together encoder parameters and the embeddings of both encoders and the decoder. Our work confirms this, but shows further gains by using segment embeddings and more careful sharing and initialization strategies. Sachan and Neubig (2018) explore parameter sharing between Transformer layers. However, they focus on sharing decoder parameters in a one-to-many multilingual MT system. In our work, we share parameters between the encoder and the decoder.

As stated in §3, Bérard et al. (2017) also showed improved results over the MT baseline, using exclusively the shared task data. Their system outputs edit operations that decide whether to insert, keep or delete tokens from the machine translated sentence. Instead of relying on edit operations, our approach mitigates the small amount of data with transfer learning through BERT.

Our work makes use of recent advances in transfer learning for NLP (Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019). Pre-training these large language models has greatly improved the state of the art on the GLUE benchmark (Wang et al., 2018). In particular, our work uses the BERT pre-trained model and exploits the obtained representations not only in the encoder but also in the decoder of a language generation task.

More closely related to our work, Lample and Conneau (2019) pre-trained a BERT-like language model using parallel data, which they used to initialize the encoder and decoder of supervised and unsupervised MT systems. They also used segment embeddings (along with word and position embeddings) to differentiate between a pair of sentences in different languages. However, this is only used in one of the pre-training phases of the language model (translation language modeling) and not in the downstream task. In our work, we use segment embeddings during the downstream task itself, which is a natural fit for the APE task.

Lopes et al. (2019) used our model on the harder English-German NMT subtask and obtained better TER performance than the previous state of the art. To obtain this result, the transfer learning capabilities of BERT were not enough, and further engineering effort was required. In particular, a conservativeness factor was added during beam decoding to constrain the changes the APE system can make to the mt output. Furthermore, the authors used a data weighting method to augment the importance of data samples that have lower TER, so that samples requiring less post-editing effort are assigned higher weights during training. Since the NMT system makes very few errors in this domain, this data weighting is important for the APE model to learn to make fewer corrections to the mt output. However, their approach still required the creation of an artificial dataset to improve over the MT baseline. We leave it for future work to investigate better methods for improving over the baseline using only real post-edited data in these smaller APE datasets.

5 Conclusion and Future Work

We proposed a transfer learning approach to APE using BERT pre-trained models and careful parameter sharing. We explored various ways for coupling BERT in the decoder for language generation. We found it beneficial to initialize the context attention of the decoder with BERT’s self-attention and to tie together the parameters of the self-attention layers between the encoder and decoder. Using a small dataset, our results are competitive with systems trained on a large amount of artificial data, with much faster training. By adding artificial data, we obtain a new state of the art in APE.

In future work, we would like to do an extensive analysis on the capabilities of BERT and transfer learning in general for different domains and language pairs in APE.

Acknowledgments

This work was supported by the European Research Council (ERC StG DeepSPIN 758969), and by the Fundação para a Ciência e Tecnologia through contracts UID/EEA/50008/2019 and CMUPERI/TIC/0046/2014 (GoLocal). We thank the anonymous reviewers for their feedback.

References