Machine Translation Evaluation with BERT Regressor

07/29/2019 ∙ by Hiroki Shimanaka, et al. ∙ Osaka University

We introduce a metric that uses BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) for automatic machine translation evaluation. Experimental results on the WMT-2017 Metrics Shared Task dataset show that our metric achieves state-of-the-art performance on the segment-level metrics task for all to-English language pairs.


1 Introduction

This study describes a segment-level metric for automatic machine translation evaluation (MTE). MTE metrics that correlate highly with human evaluation enable continuous integration and deployment of machine translation (MT) systems.

In our previous study (Shimanaka et al., 2018), we proposed RUSE (Regressor Using Sentence Embeddings; https://github.com/Shi-ma/RUSE), a segment-level MTE metric that uses pre-trained sentence embeddings to capture global information that local features based on character or word N-grams cannot. In the WMT-2018 Metrics Shared Task (Ma et al., 2018), RUSE was the best segment-level metric for all to-English language pairs. This result indicates that pre-trained sentence embeddings are an effective feature for automatic machine translation evaluation.

Research on applying pre-trained language representations to downstream tasks has developed rapidly in recent years. In particular, BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) has achieved the best performance on many downstream tasks and has attracted considerable attention. BERT is pre-trained with the “masked language model” (MLM) and “next sentence prediction” (NSP) objectives on large amounts of raw text and is then fine-tuned for a supervised downstream task. Fine-tuning is performed differently depending on the task: single-sentence classification tasks such as sentiment analysis are handled differently from sentence-pair classification tasks such as natural language inference. As a result, BERT also performs well at estimating the similarity between sentence pairs, a task closely related to automatic machine translation evaluation.

Therefore, we propose an MTE metric that uses BERT. Experiments on the segment-level metrics task, conducted on the WMT-2017 datasets for all to-English language pairs, show that the proposed metric correlates more strongly with human evaluation than RUSE and achieves the best performance. A detailed analysis clarifies that the three main differences from RUSE, namely the pre-training method, the sentence-pair encoding, and the fine-tuning of the pre-trained encoder, each contribute to the performance improvement of BERT.

(a) MTE with RUSE.
(b) MTE with BERT.
Figure 1: Outline of each metric. Blue components are trained, while red components are fixed.

2 Related Work

In this section, we describe the MTE metrics that achieved the best performance in the WMT-2017 (Bojar et al., 2017) and WMT-2018 (Ma et al., 2018) Metrics Shared Tasks. These tasks use direct assessment (DA) datasets of human evaluations. DA datasets provide an absolute quality score for each hypothesis, measuring the extent to which the hypothesis adequately expresses the meaning of the reference translation. Each metric takes an MT hypothesis and reference translation pair as input, estimates a quality score, and is evaluated by its Pearson correlation with the human scores. In this paper, we focus on the segment-level metrics task for to-English language pairs.

2.1 Blend: the metric based on local features

Blend (Ma et al., 2017), which achieved the best performance in WMT-2017, is an ensemble metric that incorporates 25 lexical metrics provided by the Asiya MT evaluation toolkit, as well as four other metrics. Although Blend uses many features, it relies only on local information, such as character-based edit distances and features based on word N-grams, and cannot consider the whole sentence simultaneously.

2.2 RUSE: the metric based on sentence embeddings

RUSE (Shimanaka et al., 2018), which achieved the best performance in WMT-2018, is a metric that uses sentence embeddings pre-trained on large amounts of text. Unlike previous metrics such as Blend, RUSE has the advantage of considering the information of the whole sentence simultaneously as a distributed representation.

ReVal (Gupta et al., 2015; https://github.com/rohitguptacs/ReVal) is another metric based on sentence embeddings. ReVal trains its sentence embeddings from the labeled data of the WMT Metrics Shared Task and semantic similarity estimation tasks, but it cannot achieve sufficient performance because this data is small. In contrast, RUSE trains only the regression model from labeled data, using sentence embeddings pre-trained on large data such as Quick Thought (Logeswaran and Lee, 2018).

As shown in Figure 1(a), RUSE encodes the MT hypothesis and the reference translation independently with a sentence encoder. Then, following InferSent (Conneau et al., 2017), features are extracted by combining the sentence embeddings of the two sentences, and the evaluation score is estimated by a regression model based on a multi-layer perceptron (MLP).
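
As an illustration of this step, the following is a minimal PyTorch sketch of an InferSent-style feature combination followed by an MLP regressor; the layer sizes and the names combine_features and MLPRegressor are illustrative and not taken from the RUSE implementation.

```python
import torch
import torch.nn as nn

def combine_features(h: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """InferSent-style combination of hypothesis and reference embeddings:
    concatenation, element-wise product, and absolute difference."""
    return torch.cat([h, r, h * r, torch.abs(h - r)], dim=-1)

class MLPRegressor(nn.Module):
    """Small MLP regressor over the combined features (sizes are illustrative)."""
    def __init__(self, emb_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4 * emb_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, h: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # Predicted segment-level quality score, one value per sentence pair.
        return self.net(combine_features(h, r)).squeeze(-1)
```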

Figure 2: BERT sentence-pair encoding.

3 BERT for MTE

In this study, we use BERT (Devlin et al., 2019) for MTE. Like RUSE, BERT for MTE uses pre-trained sentence embeddings and estimates the evaluation score with a regression model based on an MLP. However, as shown in Figure 1(b), BERT for MTE encodes the MT hypothesis and the reference translation simultaneously with a sentence-pair encoder, and the resulting sentence-pair embedding is input to the MLP regression model. Moreover, unlike RUSE, the pre-trained encoder is fine-tuned together with the MLP. In the following, we explain in detail the three differences between RUSE and BERT: the pre-training method, the sentence-pair encoding, and the fine-tuning of the pre-trained encoder.
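
Purely as an illustration of the architecture in Figure 1(b), here is a minimal sketch using the HuggingFace transformers library (which this paper does not use; the original Google BERT release was used). The MLP size and the class name BertMTERegressor are illustrative.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertMTERegressor(nn.Module):
    """Sketch: [CLS] embedding of the sentence pair -> MLP -> quality score."""
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.mlp = nn.Sequential(
            nn.Linear(self.encoder.config.hidden_size, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, **inputs):
        cls = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] position
        return self.mlp(cls).squeeze(-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertMTERegressor()

# The tokenizer builds the sentence-pair input: [CLS] hypothesis [SEP] reference [SEP]
inputs = tokenizer("the cat sat on the mat .",
                   "there is a cat on the mat .", return_tensors="pt")
score = model(**inputs)  # predicted segment-level quality score
```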

3.1 Pre-training Method

BERT is pre-trained with two types of unsupervised tasks simultaneously on large amounts of raw text.

Masked Language Model (MLM)

Some tokens in the raw corpus are replaced with [MASK] tokens, and the original tokens are predicted by a bidirectional language model. Through this unsupervised pre-training, the BERT encoder learns the relations between tokens within a sentence.
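
As a quick illustration of the MLM objective (not part of the proposed metric), a pre-trained BERT can be asked to recover a masked token, for example with the HuggingFace transformers fill-mask pipeline:

```python
from transformers import pipeline

# Illustrative only: recover a masked token with a pre-trained BERT model.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The cat sat on the [MASK] .")[:3]:
    print(pred["token_str"], pred["score"])  # top candidate tokens and their scores
```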

Next Sentence Prediction (NSP)

Some sentences in the raw corpus are randomly replaced with other sentences, and a binary classifier is trained to determine whether two consecutive sentences were actually adjacent. Through this unsupervised pre-training, the BERT encoder learns the relationship between two consecutive sentences.
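
Similarly, the NSP objective can be illustrated with the pre-trained next-sentence head provided by the transformers library (again, only an illustration; the example sentences are arbitrary):

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# Score whether sentence B plausibly follows sentence A.
inputs = tokenizer("He bought a new laptop.", "It arrived two days later.",
                   return_tensors="pt")
logits = model(**inputs).logits          # shape (1, 2)
# Index 0 = "B follows A", index 1 = "B is a random sentence".
prob_is_next = torch.softmax(logits, dim=-1)[0, 0].item()
```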

3.2 Sentence-pair Encoding

For tasks that deal with sentence pairs, such as NSP and natural language inference, BERT encodes the sentence pair simultaneously instead of encoding each sentence independently. The first token of every sequence is always a special classification token ([CLS]), and the sentences are separated by a special separator token ([SEP]) (Figure 2). The final hidden state corresponding to the [CLS] token is used as the aggregate sequence representation for classification tasks.
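
The following small sketch (using the HuggingFace tokenizer, as an illustration only) shows the resulting input layout for a hypothesis–reference pair:

```python
from transformers import BertTokenizer

# Illustration of the input layout: [CLS] hypothesis [SEP] reference [SEP]
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("the cat sat on the mat .", "there is a cat on the mat .")

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
print(tokens)                 # starts with '[CLS]'; '[SEP]' separates and ends the sentences
print(enc["token_type_ids"])  # 0 for the first segment, 1 for the second
```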

3.3 Fine-tuning of the Pre-trained Encoder

In BERT, the sentence embedding or sentence-pair embedding obtained from the encoder is used as the input to an MLP for downstream tasks such as classification and regression. When training the MLP on labeled data of the downstream task, the pre-trained encoder is also fine-tuned.
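
A hypothetical fine-tuning step, reusing the BertMTERegressor sketch from Section 3, makes this explicit: because the optimizer is built over all parameters, the encoder is updated together with the MLP. The learning rate and loss function are illustrative.

```python
import torch

# Optimizer over *all* parameters (encoder + MLP head), so the pre-trained
# BERT encoder is updated as well, not only the regressor.
model = BertMTERegressor()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.MSELoss()

def train_step(batch_inputs, da_scores):
    optimizer.zero_grad()
    pred = model(**batch_inputs)       # predicted segment-level scores
    loss = loss_fn(pred, da_scores)    # regression against DA human scores
    loss.backward()                    # gradients also flow into the BERT layers
    optimizer.step()
    return loss.item()
```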

Dataset     cs-en  de-en  fi-en  lv-en  ro-en  ru-en  tr-en  zh-en
WMT-2015     500    500    500     -      -     500     -      -
WMT-2016     560    560    560     -     560    560    560     -
WMT-2017     560    560    560    560     -     560    560    560
Table 1: Number of instances in the segment-level DA human evaluation datasets for to-English language pairs in the WMT-2015 (Stanojević et al., 2015), WMT-2016 (Bojar et al., 2016), and WMT-2017 (Bojar et al., 2017) Metrics Shared Tasks.
Metric                           cs-en  de-en  fi-en  lv-en  ru-en  tr-en  zh-en  avg.
SentBLEU (Bojar et al., 2017)    0.435  0.432  0.571  0.393  0.484  0.538  0.512  0.481
Blend (Bojar et al., 2017)       0.594  0.571  0.733  0.577  0.622  0.671  0.661  0.633
RUSE (Shimanaka et al., 2018)    0.614  0.637  0.756  0.705  0.680  0.704  0.677  0.682
BERT                             0.720  0.761  0.857  0.828  0.788  0.798  0.763  0.788
Table 2: Segment-level Pearson correlation of metric scores and DA human evaluation scores for to-English language pairs in the WMT-2017 Metrics Shared Task.

4 Experiments

We performed experiments using the WMT-2017 Metrics Shared Task dataset to verify the performance of BERT for MTE.

4.1 Settings

Table 1 shows the number of instances in the segment-level WMT Metrics Shared Task datasets for the to-English language pairs used in this study (en: English, cs: Czech, de: German, fi: Finnish, ro: Romanian, ru: Russian, tr: Turkish, lv: Latvian, zh: Chinese). The 5,360 instances in the WMT-2015 and WMT-2016 Metrics Shared Task datasets are randomly divided, with 90% used for training and 10% for development. The 3,920 instances (560 per language pair) in the WMT-2017 Metrics Shared Task dataset are used for evaluation.
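
A minimal sketch of this split (the shuffling procedure and random seed below are placeholders, not taken from the paper):

```python
import random

# Placeholder data: the 5,360 (hypothesis, reference, DA score) instances
# pooled from the WMT-2015 and WMT-2016 segment-level datasets.
instances = [("mt hypothesis ...", "reference ...", 0.0)] * 5360

random.seed(0)                      # arbitrary seed; not specified in the paper
random.shuffle(instances)
cut = int(0.9 * len(instances))     # 90% training / 10% development
train_data, dev_data = instances[:cut], instances[cut:]
```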

As comparison methods, we use SentBLEU (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl), the baseline of the WMT Metrics Shared Task; Blend (Ma et al., 2017), which achieved the best performance in the WMT-2017 Metrics Shared Task; and RUSE (Shimanaka et al., 2018), which achieved the best performance in the WMT-2018 Metrics Shared Task. We evaluate each metric using the Pearson correlation coefficient between the metric scores and the DA human scores.
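
The evaluation itself reduces to a Pearson correlation between two score vectors per language pair; below is a minimal sketch with placeholder values, using SciPy as an assumed tool rather than the official shared-task scripts.

```python
from scipy.stats import pearsonr

# Placeholder scores: in the shared task each vector has 560 entries per
# language pair (metric scores vs. DA human scores).
metric_scores = [0.71, 0.35, 0.88, 0.52, 0.10]
human_scores  = [0.65, 0.20, 0.91, 0.40, 0.05]

r, _ = pearsonr(metric_scores, human_scores)
print(f"Pearson correlation: {r:.3f}")
```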

Among the pre-trained models published by the authors, BERT (uncased) (https://github.com/google-research/bert) is used for MTE with BERT. The hyper-parameters for fine-tuning BERT are determined through grid search over the following parameters using the development data.

Metric                     cs-en  de-en  fi-en  lv-en  ru-en  tr-en  zh-en  avg.
RUSE with GloVe-BoW        0.475  0.479  0.645  0.532  0.537  0.547  0.480  0.527
RUSE with Quick Thought    0.599  0.588  0.736  0.690  0.655  0.710  0.645  0.660
RUSE with BERT             0.622  0.626  0.765  0.708  0.609  0.706  0.647  0.669
BERT (w/o fine-tuning)     0.645  0.607  0.780  0.727  0.644  0.704  0.705  0.687
BERT                       0.720  0.761  0.857  0.828  0.788  0.798  0.763  0.788
Table 3: Comparison of RUSE and BERT in the WMT-2017 Metrics Shared Task (segment-level, to-English language pairs).

4.2 Results

Table 2 presents the experimental results on the WMT-2017 Metrics Shared Task dataset. BERT for MTE achieved the best performance for all to-English language pairs. In Section 5, we compare RUSE and BERT and present a detailed analysis.

5 Analysis: Comparison of RUSE and BERT

To analyze the three main differences between RUSE and BERT, namely the pre-training method, the sentence-pair encoding, and the fine-tuning of the pre-trained encoder, we conduct experiments with the following settings.

RUSE with GloVe-BoW:

The mean vector of the GloVe word embeddings (Pennington et al., 2014) (glove.840B.300d, https://nlp.stanford.edu/projects/glove; 300 dimensions) in each sentence is used as the sentence embedding in Figure 1(a).

RUSE with Quick Thought:

Quick Thought (Logeswaran and Lee, 2018), pre-trained on both the 45 million sentences of the BookCorpus (Zhu et al., 2015) and about 130 million sentences of the UMBC WebBase corpus (Han et al., 2013), is used as the sentence encoder in Figure 1(a).

RUSE with BERT:

A concatenation of the last four hidden layers (3,072 dimensions) corresponding to the [CLS] token of BERT, which takes a single sentence as input, is used as the sentence embedding in Figure 1(a) (a sketch of this feature extraction follows this list).

BERT (w/o fine-tuning):

A concatenation of the last four hidden layers (3,072 dimensions) corresponding to the [CLS] token of BERT, which takes the sentence pair as the input sequence, is used as the input to the MLP regressor in Figure 1(b). In this case, the BERT encoder is not fine-tuned.

BERT:

The last hidden layer (768 dimensions) corresponding to the [CLS] token of BERT, which takes the sentence pair as the input sequence, is used as the input to the MLP regressor in Figure 1(b). In this case, the BERT encoder is fine-tuned.
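
As a rough sketch of the fixed BERT features used in the RUSE with BERT and BERT (w/o fine-tuning) settings, the concatenated [CLS] vectors of the last four layers can be extracted as follows (the HuggingFace transformers library is used only for illustration; the function name cls_last_four is hypothetical):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
encoder.eval()  # the encoder stays fixed in these two settings

def cls_last_four(text, pair=None):
    """Concatenate the [CLS] vectors of the last four layers (4 x 768 = 3,072 dims).
    Pass only `text` for the single-sentence setting ("RUSE with BERT"), or
    `text` and `pair` for the sentence-pair setting ("BERT (w/o fine-tuning)")."""
    inputs = tokenizer(text, pair, return_tensors="pt")
    with torch.no_grad():
        hidden_states = encoder(**inputs).hidden_states   # embeddings + 12 layers
    return torch.cat([h[:, 0] for h in hidden_states[-4:]], dim=-1).squeeze(0)
```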

The hyper-parameters for RUSE and BERT (w/o fine-tuning) are determined through grid search over the following parameters using the development data.

Table 3 presents the results of these experiments on the WMT-2017 Metrics Shared Task dataset.

Pre-training Method

The top three rows of Table 3 show how the pre-training method of the sentence encoder affects performance. First, Quick Thought, which is based on sentence embeddings, consistently outperforms GloVe-BoW, which is based on word embeddings. Second, BERT, pre-trained with both MLM and NSP, performs better on many language pairs than Quick Thought, which is pre-trained only with NSP. In other words, the masked language model (MLM) pre-training objective, one of the major features of BERT, is also useful for MTE.

Sentence-pair Encoding

Comparing RUSE with BERT and BERT (w/o fine-tuning) shows the impact of sentence-pair encoding on MTE performance. For many language pairs, the latter, which encodes the MT hypothesis and the reference translation simultaneously, outperforms the former, which encodes them independently. Although RUSE extracts features by combining the sentence embeddings of the two sentences in the same way as InferSent, this is not necessarily the feature extraction method best suited for MTE. In contrast, the sentence-pair encoding of BERT yields an embedding that reflects the relation between the two sentences without explicit feature extraction. It is possible that this relation between sentence pairs is already learned well during pre-training with NSP.

Fine-tuning of the Pre-trained Encoder

The bottom two rows of Table 3 show the performance impact of fine-tuning the pre-trained encoder. For all language pairs, BERT, which fine-tunes the pre-trained encoder together with the MLP, performs much better than BERT (w/o fine-tuning), which, like RUSE, trains only the MLP. In other words, fine-tuning the pre-trained encoder, one of the major features of BERT, is also useful for machine translation evaluation.

6 Conclusion

In this study, we proposed a metric for automatic machine translation evaluation based on BERT. Our segment-level MTE metric with BERT achieved the best performance on the segment-level metrics task of the WMT-2017 dataset for all to-English language pairs. In addition, an analysis comparing it with RUSE, our previous work, showed that the pre-training method, the sentence-pair encoding, and the fine-tuning of the pre-trained encoder each contributed to the performance improvement of BERT.

Acknowledgement

Part of this research was funded by a JSPS Grant-in-Aid for Scientific Research (Grant-in-Aid for Research Activity Start-up, Grant Number 18H06465).

References

  • Bojar et al. (2017) Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017. Results of the WMT17 Metrics Shared Task. In Proceedings of the Second Conference on Machine Translation, pages 489–513.
  • Bojar et al. (2016) Ondřej Bojar, Yvette Graham, Amir Kamran, and Miloš Stanojević. 2016. Results of the WMT16 Metrics Shared Task. In Proceedings of the First Conference on Machine Translation, pages 199–231.
  • Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Gupta et al. (2015) Rohit Gupta, Constantin Orasan, and Josef van Genabith. 2015. ReVal: A Simple and Effective Machine Translation Evaluation Metric Based on Recurrent Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1066–1072.
  • Han et al. (2013) Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield, and Jonathan Weese. 2013. UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems. In Second Joint Conference on Lexical and Computational Semantics, Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 44–52.
  • Logeswaran and Lee (2018) Lajanugen Logeswaran and Honglak Lee. 2018. An Efficient Framework for Learning Sentence Representations. In International Conference on Learning Representations, pages 1–16.
  • Ma et al. (2018) Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018. Results of the WMT18 Metrics Shared Task: Both Characters and Embeddings Achieve Good Performance. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 682–701.
  • Ma et al. (2017) Qingsong Ma, Yvette Graham, Shugen Wang, and Qun Liu. 2017. Blend: a Novel Combined MT Metric Based on Direct Assessment - CASICT-DCU submission to WMT17 Metrics Task. In Proceedings of the Second Conference on Machine Translation, pages 598–603.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.
  • Shimanaka et al. (2018) Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor Using Sentence Embeddings for Automatic Machine Translation Evaluation. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 764–771.
  • Stanojević et al. (2015) Miloš Stanojević, Philipp Koehn, and Ondřej Bojar. 2015. Results of the WMT15 Metrics Shared Task. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 256–273.
  • Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In 2015 IEEE International Conference on Computer Vision, pages 19–27.