This study describes a segment-level metric for automatic machine translation evaluation (MTE). MTE metrics that correlate highly with human evaluation enable continuous integration and deployment of a machine translation (MT) system.
In a previous study, we proposed RUSE (Regressor Using Sentence Embeddings), a segment-level MTE metric that uses pre-trained sentence embeddings capable of capturing global information that cannot be captured by local features based on character or word N-grams. In the WMT-2018 Metrics Shared Task Ma et al. (2018), RUSE was the best segment-level metric for all to-English language pairs. This result indicates that pre-trained sentence embeddings are effective features for automatic machine translation evaluation.
Research on applying pre-trained language representations to downstream tasks has been developing rapidly in recent years. In particular, BERT (Bidirectional Encoder Representations from Transformers) Devlin et al. (2019) has achieved the best performance on many downstream tasks and is attracting attention. BERT is designed to be pre-trained with a "masked language model" (MLM) and "next sentence prediction" (NSP) on large amounts of raw text, and then fine-tuned for a supervised downstream task. For example, fine-tuning is performed differently for single-sentence classification tasks such as sentiment analysis and for sentence-pair classification tasks such as natural language inference. As a result, BERT also performs well on tasks that estimate the similarity between sentence pairs, which are closely related to automatic machine translation evaluation.
Therefore, we propose an MTE metric using BERT. Experimental results on the segment-level metrics task, conducted using the datasets for all to-English language pairs of WMT17, indicate that the proposed metric correlates more highly with human evaluation than RUSE and achieves the best performance. A detailed analysis clarifies that the three main points of difference from RUSE, namely the pre-training method, the sentence-pair encoding, and the fine-tuning of the pre-trained encoder, each contribute to the performance improvement of BERT.
2 Related Work
In this section, we describe the MTE metrics that achieved the best performance in the WMT-2017 Bojar et al. (2017) and WMT-2018 Ma et al. (2018) Metrics Shared Tasks. These tasks use direct assessment (DA) datasets of human evaluation data. DA datasets provide absolute quality scores for hypotheses, measuring to what extent a hypothesis adequately expresses the meaning of the reference translation. Each metric estimates a quality score from a translation and reference sentence pair as input, and is evaluated by its Pearson correlation with human evaluation. In this paper, we discuss the segment-level metrics task for to-English language pairs.
2.1 Blend: the metric based on local features
Blend, which achieved the best performance in WMT-2017, is an ensemble metric that incorporates 25 lexical metrics provided by the Asiya MT evaluation toolkit, as well as four other metrics. Although Blend uses many features, it relies only on local information that cannot consider the whole sentence simultaneously, such as character-based edit distances and features based on word N-grams.
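Local features of this kind can be illustrated with two toy examples, a character-level edit distance and word N-gram overlap, sketched here in plain Python. This is only an illustrative sketch of what "local features" means, not Blend's actual feature set:

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ngram_overlap(hyp: str, ref: str, n: int = 2) -> float:
    """Fraction of hypothesis word n-grams that also occur in the reference."""
    def ngrams(s):
        toks = s.split()
        return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    h, r = ngrams(hyp), set(ngrams(ref))
    return sum(g in r for g in h) / len(h) if h else 0.0
```

Both functions look only at character or word windows, so neither can account for the meaning of the sentence as a whole, which is the limitation the next section addresses.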
2.2 RUSE: the metric based on sentence embeddings
RUSE Shimanaka et al. (2018), which achieved the best performance in WMT-2018, is a metric that uses sentence embeddings pre-trained on large amounts of text. Unlike previous metrics such as Blend, RUSE has the advantage of simultaneously considering the information of the whole sentence as a distributed representation.
ReVal (https://github.com/rohitguptacs/ReVal) Gupta et al. (2015) is also a metric that uses sentence embeddings. ReVal trains sentence embeddings on labeled data from the WMT Metrics Shared Task and semantic similarity estimation tasks, but cannot achieve sufficient performance because it uses only small amounts of data. RUSE instead trains only a regression model on labeled data, using sentence embeddings pre-trained on large data such as Quick Thought Logeswaran and Lee (2018).
In RUSE, features are extracted by combining the sentence embeddings of the two sentences, and the evaluation score is estimated by a regression model based on a multi-layer perceptron (MLP).
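The combination step can be sketched as follows, assuming an InferSent-style combination (concatenation, element-wise product, and absolute difference) of the two embeddings; this is a simplified illustration, and RUSE's exact feature set may differ:

```python
def combine_features(u, v):
    """Combine two sentence embeddings (lists of floats) into one
    feature vector: [u; v; u * v; |u - v|] (InferSent-style)."""
    prod = [x * y for x, y in zip(u, v)]
    diff = [abs(x - y) for x, y in zip(u, v)]
    return u + v + prod + diff
```

The resulting feature vector is then fed to the MLP regressor, which outputs a single quality score.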
3 BERT for MTE
In this study, we use BERT Devlin et al. (2019) for MTE. Like RUSE, BERT for MTE uses pre-trained sentence embeddings and estimates the evaluation score using the regression model based on MLP. However, as shown in the figure 1(b), in BERT for MTE, both an MT hypothesis and an reference translation are encoded simultaneously by the sentence-pair encoder. Then, the sentence-pair embedding is input to the regression model based on MLP. Unlike RUSE, the pre-trained encoder is also fine-tuning with MLP. In the following, we explain the three differences between RUSE and BERT in detail which are the pre-training method, the sentence-pair encoding, and the fine-tuning of the pre-trained encoder.
3.1 Pre-training Method
BERT is designed to be pre-trained with two types of unsupervised tasks simultaneously on large amounts of raw text.
Masked Language Model (MLM)
After replacing some tokens in the raw corpus with [MASK] tokens, the original tokens are estimated by a bidirectional language model. Through this unsupervised pre-training, the BERT encoder learns relations between tokens within a sentence.
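The masking step can be sketched as follows. This is a simplified illustration: actual BERT pre-training also sometimes keeps the selected token unchanged or substitutes a random token instead of always using [MASK].

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace a random subset of tokens with [MASK]; return the
    masked sequence and a dict of {position: original token} that
    the model must predict. (Simplified sketch of BERT's MLM.)"""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            masked[i] = "[MASK]"
    return masked, targets
```

The model is then trained to recover the original tokens at the masked positions from bidirectional context.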
Next Sentence Prediction (NSP)
Some sentences in the raw corpus are randomly replaced with other sentences, and binary classification is then performed to determine whether two consecutive sentences are adjacent or not. Through this unsupervised pre-training, the BERT encoder learns the relationship between two consecutive sentences.
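The construction of NSP training examples can be sketched as follows (a simplified illustration of the 50/50 sampling; the sampling details in actual BERT pre-training differ slightly):

```python
import random

def make_nsp_examples(sentences, rng=None):
    """For each sentence, keep its true next sentence with probability
    0.5 (label 1) or substitute a randomly chosen sentence (label 0).
    A simplified sketch of BERT's NSP data construction."""
    rng = rng or random.Random(0)
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            examples.append((sentences[i], sentences[i + 1], 1))
        else:
            j = rng.randrange(len(sentences))
            examples.append((sentences[i], sentences[j], 0))
    return examples
```

The binary label is predicted from the joint encoding of the two sentences, which is how the encoder learns inter-sentence relationships.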
3.2 Sentence-pair Encoding
In BERT, instead of encoding each sentence independently, a sentence pair is encoded simultaneously for tasks dealing with sentence pairs, such as NSP and natural language inference. The first token of every sequence is always a special classification token ([CLS]), and the sentences are separated by a special end-of-sentence token ([SEP]) (Figure 2). Finally, the final hidden state corresponding to the [CLS] token is used as the aggregate sequence representation for classification tasks.
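The input layout described above can be sketched as follows (a minimal sketch of the token and segment-id sequence, omitting WordPiece tokenization, padding, and attention masks):

```python
def build_bert_input(tokens_a, tokens_b):
    """Build the BERT sentence-pair input: [CLS] A [SEP] B [SEP],
    with segment ids 0 for the first sentence and 1 for the second."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids
```

For MTE, tokens_a would be the MT hypothesis and tokens_b the reference translation, so both sentences attend to each other in every encoder layer.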
3.3 Fine-tuning of the Pre-trained Encoder
In BERT, after a sentence embedding or a sentence-pair embedding is obtained with the encoder, it is used as the input to an MLP that solves applied tasks such as classification and regression. When training the MLP on labeled data of the applied task, we also fine-tune the pre-trained encoder.
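The difference between fine-tuning and keeping the encoder frozen can be expressed as a choice of which parameter groups the optimizer updates. The sketch below is purely conceptual: `model` is a hypothetical dict of named parameter groups, not a real BERT implementation.

```python
def trainable_parameters(model, fine_tune_encoder=True):
    """Return the parameter groups to be updated during training.
    With fine-tuning, both the pre-trained encoder and the MLP head
    are updated; without it, only the MLP head is updated.
    (Conceptual sketch; `model` is a hypothetical parameter dict.)"""
    params = dict(model["mlp"])
    if fine_tune_encoder:
        params.update(model["encoder"])
    return params
```

Training only the MLP corresponds to RUSE-style use of fixed embeddings, while updating both groups corresponds to BERT's fine-tuning.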
| Metric | cs-en | de-en | fi-en | lv-en | ru-en | tr-en | zh-en | avg |
|--------|-------|-------|-------|-------|-------|-------|-------|-----|
| SentBLEU Bojar et al. (2017) | 0.435 | 0.432 | 0.571 | 0.393 | 0.484 | 0.538 | 0.512 | 0.481 |
| Blend Bojar et al. (2017) | 0.594 | 0.571 | 0.733 | 0.577 | 0.622 | 0.671 | 0.661 | 0.633 |
| RUSE Shimanaka et al. (2018) | 0.614 | 0.637 | 0.756 | 0.705 | 0.680 | 0.704 | 0.677 | 0.682 |
4 Experiments

We performed experiments using the WMT-2017 Metrics Shared Task dataset to verify the performance of BERT for MTE.
Table 1 shows the number of instances in the WMT Metrics Shared Task datasets (segment-level) for to-English language pairs (en: English, cs: Czech, de: German, fi: Finnish, ro: Romanian, ru: Russian, tr: Turkish, lv: Latvian, zh: Chinese) used in this study. A total of 5,360 instances from the WMT-2015 and WMT-2016 Metrics Shared Task datasets were divided randomly: 90% are used for training and 10% for development. A total of 3,920 instances (560 instances per language pair) from the WMT-2017 Metrics Shared Task dataset are used for evaluation.
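The 90/10 split described above can be sketched as follows (a minimal sketch; the paper's actual shuffling procedure and random seed are not specified):

```python
import random

def split_train_dev(instances, dev_ratio=0.1, seed=0):
    """Randomly shuffle and split instances into train/dev sets
    (the paper uses 90% train / 10% dev of 5,360 instances)."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    n_dev = int(len(data) * dev_ratio)
    return data[n_dev:], data[:n_dev]
```

With 5,360 instances this yields 4,824 training and 536 development instances.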
As comparison methods, we use SentBLEU (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl), the baseline of the WMT Metrics Shared Task; Blend Ma et al. (2017), which achieved the best performance in the WMT-2017 Metrics Shared Task; and RUSE Shimanaka et al. (2018), which achieved the best performance in the WMT-2018 Metrics Shared Task. We evaluate each metric using the Pearson correlation coefficient between the metric scores and the DA human scores.
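The evaluation criterion can be written out directly; the following is a plain-Python implementation of the Pearson correlation coefficient used to compare metric scores with DA human scores:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between metric scores `xs`
    and DA human scores `ys`."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value of 1.0 means the metric ranks segments exactly in proportion to human judgments; the tables in this paper report this coefficient per language pair.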
Among the trained models published by the authors, BERT (uncased) (https://github.com/google-research/bert) is used for MTE with BERT. The hyper-parameters for fine-tuning BERT are determined through grid search over the following parameters using the development data.
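The grid-search procedure can be sketched as follows. The parameter values shown are hypothetical placeholders, not the values actually searched in the paper:

```python
from itertools import product

# Hypothetical grid for illustration only; the paper's actual
# searched values are not reproduced here.
grid = {
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "batch_size": [16, 32],
    "epochs": [3, 4],
}

def grid_search(evaluate, grid):
    """Try every hyper-parameter combination and keep the one with
    the best development-set score (higher is better)."""
    best_score, best_cfg = float("-inf"), None
    keys = list(grid)
    for values in product(*grid.values()):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```

`evaluate` here stands for training with a configuration and measuring Pearson correlation on the development data.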
| Metric | cs-en | de-en | fi-en | lv-en | ru-en | tr-en | zh-en | avg |
|--------|-------|-------|-------|-------|-------|-------|-------|-----|
| RUSE with GloVe-BoW | 0.475 | 0.479 | 0.645 | 0.532 | 0.537 | 0.547 | 0.480 | 0.527 |
| RUSE with Quick Thought | 0.599 | 0.588 | 0.736 | 0.690 | 0.655 | 0.710 | 0.645 | 0.660 |
| RUSE with BERT | 0.622 | 0.626 | 0.765 | 0.708 | 0.609 | 0.706 | 0.647 | 0.669 |
| BERT (w/o fine-tuning) | 0.645 | 0.607 | 0.780 | 0.727 | 0.644 | 0.704 | 0.705 | 0.687 |
5 Analysis: Comparison of RUSE and BERT
To analyze the three main points of difference between RUSE and BERT, namely the pre-training method, the sentence-pair encoding, and the fine-tuning of the pre-trained encoder, we conduct experiments with the following settings.
RUSE with GloVe-BoW:
Sentence embeddings obtained by averaging GloVe Pennington et al. (2014) word embeddings (bag-of-words) are used in Figure 1(a).
RUSE with Quick Thought:
Quick Thought Logeswaran and Lee (2018) sentence embeddings are used in Figure 1(a).
RUSE with BERT:
A concatenation of the last four hidden layers (3,072 dimensions) corresponding to the [CLS] token of BERT, which takes a single sentence as input, is used as the sentence embedding in Figure 1(a).
BERT (w/o fine-tuning):
A concatenation of the last four hidden layers (3,072 dimensions) corresponding to the [CLS] token of BERT, which takes a sentence pair as the input sequence, is used as the input to the MLP regressor in Figure 1(b). In this case, the BERT encoder is not fine-tuned.
BERT:
The last hidden layer (768 dimensions) corresponding to the [CLS] token of BERT, which takes a sentence pair as the input sequence, is used as the input to the MLP regressor in Figure 1(b). In this case, the BERT encoder is fine-tuned.
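The feature extraction shared by these settings, taking the [CLS] vector from the last layer or concatenating it across the last four layers, can be sketched as follows (a minimal sketch over hypothetical hidden-state arrays, not the real BERT output format):

```python
def cls_representation(hidden_layers, last_n=4):
    """Concatenate the [CLS] vectors (position 0) of the last
    `last_n` hidden layers. With 768-dimensional layers and
    last_n=4 this yields a 3,072-dimensional feature vector;
    last_n=1 gives the plain 768-dimensional [CLS] vector."""
    feats = []
    for layer in hidden_layers[-last_n:]:
        feats.extend(layer[0])  # vector at the [CLS] position
    return feats
```

The resulting vector is the input to the MLP regressor in Figure 1(b).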
The hyper-parameters for RUSE and BERT (w/o fine-tuning) are determined through grid search over the following parameters using the development data.
Table 3 presents the experimental results on the WMT-2017 Metrics Shared Task dataset.
Pre-training Method

The top three rows of Table 3 show the performance impact of the pre-training method of the sentence encoder. First, Quick Thought, based on sentence embeddings, consistently performs better than GloVe-BoW, based on word embeddings. Second, BERT, pre-trained with both MLM and NSP, performs better than Quick Thought, pre-trained only with NSP, on many language pairs. In other words, the masked language model (MLM) pre-training, one of the major features of BERT, is also useful for MTE.
Sentence-pair Encoding

Comparing RUSE with BERT and BERT (w/o fine-tuning) shows the impact of sentence-pair encoding on MTE performance. For many language pairs, the latter, which encodes an MT hypothesis and a reference translation simultaneously, performs better than the former, which encodes them independently. Although RUSE performs feature extraction that combines the sentence embeddings of the two sentences in the same way as InferSent Conneau et al. (2017), this is not necessarily a feature extraction method suited to MTE. In contrast, the sentence-pair encoding of BERT obtains embeddings that consider the relation between the pair without explicit feature extraction. In BERT, this relation between sentence pairs may already be learned well during NSP pre-training.
Fine-tuning of the Pre-trained Encoder
The bottom two rows of Table 3 show the performance impact of fine-tuning the pre-trained encoder. For all language pairs, BERT, which fine-tunes the pre-trained encoder together with the MLP, performs much better than BERT (w/o fine-tuning), which trains only the MLP. In other words, fine-tuning the pre-trained encoder, one of the major features of BERT, is also useful for machine translation evaluation.
6 Conclusion

In this study, we proposed a metric for automatic machine translation evaluation with BERT. Our segment-level MTE metric with BERT achieved the best performance on the segment-level metrics task of the WMT17 dataset for all to-English language pairs. In addition, an analysis based on comparison with RUSE, our previous work, showed that the pre-training method, the sentence-pair encoding, and the fine-tuning of the pre-trained encoder each contributed to the performance improvement of BERT.
Acknowledgments

Part of this research was funded by a JSPS Grant-in-Aid for Scientific Research (Grant-in-Aid for Research Activity Start-up, grant number 18H06465).
- Bojar et al. (2017) Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017. Results of the WMT17 Metrics Shared Task. In Proceedings of the Second Conference on Machine Translation, pages 489–513.
- Bojar et al. (2016) Ondřej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojević. 2016. Results of the WMT16 Metrics Shared Task. In Proceedings of the First Conference on Machine Translation, pages 199–231.
- Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
- Gupta et al. (2015) Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015. ReVal: A Simple and Effective Machine Translation Evaluation Metric Based on Recurrent Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1066–1072.
- Han et al. (2013) Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield, and Jonathan Weese. 2013. UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems. In Second Joint Conference on Lexical and Computational Semantics, Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 44–52.
- Logeswaran and Lee (2018) Lajanugen Logeswaran and Honglak Lee. 2018. An Efficient Framework for Learning Sentence Representations. In International Conference on Learning Representations, pages 1–16.
- Ma et al. (2018) Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018. Results of the WMT18 Metrics Shared Task: Both Characters and Embeddings Achieve Good Performance. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 682–701.
- Ma et al. (2017) Qingsong Ma, Yvette Graham, Shugen Wang, and Qun Liu. 2017. Blend: a Novel Combined MT Metric Based on Direct Assessment - CASICT-DCU submission to WMT17 Metrics Task. In Proceedings of the Second Conference on Machine Translation, pages 598–603.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.
- Shimanaka et al. (2018) Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 764–771.
- Stanojević et al. (2015) Miloš Stanojević, Philipp Koehn, and Ondřej Bojar. 2015. Results of the WMT15 Metrics Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 256–273.
- Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In 2015 IEEE International Conference on Computer Vision, pages 19–27.