Metric for Automatic Machine Translation Evaluation based on Universal Sentence Representations

05/18/2018 · Hiroki Shimanaka, et al. · Osaka University

Sentence representations can capture a wide range of information that cannot be captured by local features based on character or word N-grams. This paper examines the usefulness of universal sentence representations for evaluating the quality of machine translation. Although it is difficult to train sentence representations using small-scale translation datasets with manual evaluation, sentence representations trained on large-scale data from other tasks can improve the automatic evaluation of machine translation. Experimental results on the WMT-2016 dataset show that the proposed method achieves state-of-the-art performance using sentence representation features alone.

1 Introduction

This paper describes a segment-level metric for automatic machine translation evaluation (MTE). MTE metrics that correlate highly with human evaluation enable the continuous integration and deployment of a machine translation (MT) system. Various MTE metrics have been proposed in the metrics task of the Workshops on Statistical Machine Translation (WMT), which began in 2008. However, most MTE metrics compute the similarity between an MT hypothesis and a reference translation based on character N-grams or word N-grams; examples include SentBLEU Lin and Och (2004), a smoothed version of BLEU Papineni et al. (2002), as well as Blend Ma et al. (2017), MEANT 2.0 Lo (2017), and chrF++ Popović (2017), which achieved excellent results in the WMT-2017 Metrics task Bojar et al. (2017). Such metrics can therefore exploit only limited information for segment-level MTE. In other words, MTE metrics based on character or word N-grams cannot make full use of sentence-level information; they essentially only check for word matches.

We propose a segment-level MTE metric that uses universal sentence representations capable of capturing information that cannot be captured by local features based on character or word N-grams. Experiments on segment-level MTE conducted using the WMT-2016 datasets for to-English language pairs indicate that the proposed regression model using sentence representations achieves the best performance.

The main contributions of the study are summarized below:

  • We propose a novel supervised regression model for segment-level MTE based on universal sentence representations.

  • We achieved state-of-the-art performance on the WMT-2016 dataset for to-English language pairs without using any complex features or models.

Figure 1: Outline of Skip-Thought.
Figure 2: Outline of InferSent.
Figure 3: Outline of our metric.

2 Related Work

DPMFcomb Yu et al. (2015a) achieved the best performance in the WMT-2016 Metrics task Bojar et al. (2016). It combines 55 default metrics provided by the Asiya MT evaluation toolkit (http://asiya.lsi.upc.edu/) Giménez and Màrquez (2010) with three other metrics, namely DPMF Yu et al. (2015b), REDp Yu et al. (2015a), and ENTFp Yu et al. (2015a), using a ranking SVM to train the weight of each metric score. DPMF evaluates the syntactic similarity between an MT hypothesis and a reference translation. REDp evaluates an MT hypothesis based on the dependency tree of the reference translation, which comprises both lexical and syntactic information. ENTFp Yu et al. (2015a) evaluates the fluency of an MT hypothesis.

After the success of DPMFcomb, Blend (http://github.com/qingsongma/blend) Ma et al. (2017) achieved the best performance in the WMT-2017 Metrics task Bojar et al. (2017). Similar to DPMFcomb, Blend is essentially an SVR (RBF kernel) model that uses the scores of various metrics as features. It incorporates 25 lexical metrics provided by the Asiya MT evaluation toolkit, as well as four other metrics, namely BEER Stanojević and Sima'an (2015), CharacTER Wang et al. (2016), DPMF, and ENTFp. BEER Stanojević and Sima'an (2015) is a linear model based on character N-grams and permutation trees. CharacTER Wang et al. (2016) evaluates an MT hypothesis based on character-level edit distance.
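As a rough illustration of this design, the following minimal Python sketch fits an SVR (RBF kernel) on per-segment metric scores against DA human scores; the metric columns and values are hypothetical placeholders, not Blend's actual feature set.

    # Minimal sketch of a Blend-style ensemble: per-segment scores from
    # several metrics are stacked into a feature vector and an SVR with
    # an RBF kernel is fit against DA human scores.
    import numpy as np
    from sklearn.svm import SVR

    # Rows: segments; columns: scores from individual metrics
    # (hypothetical values standing in for lexical metrics, BEER, etc.).
    metric_scores = np.array([[0.41, 0.52, 0.38],
                              [0.73, 0.69, 0.71],
                              [0.12, 0.25, 0.18]])
    da_scores = np.array([0.3, 0.8, -0.5])  # DA human evaluation scores

    model = SVR(kernel="rbf").fit(metric_scores, da_scores)
    predicted_quality = model.predict(metric_scores)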

DPMFcomb is trained on human evaluation data in the form of relative ranking (RR): the quality of five MT hypotheses of the same source segment is ranked from 1 to 5 via comparison with the reference translation. In contrast, Blend is trained on direct assessment (DA) human evaluation data. DA provides an absolute quality score for each hypothesis by measuring the extent to which the hypothesis adequately expresses the meaning of the reference translation. Experiments in segment-level MTE conducted using the WMT-2016 datasets for to-English language pairs showed that Blend achieved a better performance than DPMFcomb (Table 2). In this study, as with Blend, we propose a supervised regression model trained using DA human evaluation data.

Instead of using local and lexical features, ReVal (https://github.com/rohitguptacs/ReVal) Gupta et al. (2015a, b) uses sentence-level features. It is a metric that trains a Tree-LSTM (Tai et al., 2015) to capture the holistic information of sentences. It is trained on pseudo similarity scores, which are generated by converting RR data, together with out-of-domain similarity scores from the SICK dataset (http://clic.cimec.unitn.it/composes/sick.html). However, the training dataset for this metric consists of only approximately 21,000 sentences; thus, training the Tree-LSTM is unstable and accurate learning is difficult (Table 2). The proposed metric likewise uses LSTM-trained sentence representations as sentence information. However, we apply universal sentence representations to this task; these representations were trained using large-scale data from other tasks. The proposed approach therefore avoids the problem of training sentence representations on a small dataset.

3 Regression Model for MTE Using Universal Sentence Representations

The proposed metric evaluates MT results with universal sentence representations trained using large-scale data from other tasks. First, we explain the two types of sentence representations used in the proposed metric in Section 3.1. Then, we explain the proposed regression model and feature extraction for MTE in Section 3.2.

            cs-en  de-en  fi-en  ro-en  ru-en  tr-en
WMT-2015      500    500    500      -    500      -
WMT-2016      560    560    560    560    560    560
Table 1: Number of DA human evaluation data for to-English language pairs (en: English, cs: Czech, de: German, fi: Finnish, ro: Romanian, ru: Russian, tr: Turkish) in WMT-2015 Stanojević et al. (2015) and WMT-2016 Bojar et al. (2016).

3.1 Universal Sentence Representations

Several approaches have been proposed to learn sentence representations. Because these representations are learned from large-scale data, they constitute potentially useful features for MTE. They have proved effective in various NLP tasks, such as document classification and measuring semantic textual similarity; we call them universal sentence representations.

First, Skip-Thought (https://github.com/ryankiros/skip-thoughts) Kiros et al. (2015) is an unsupervised model of universal sentence representations trained on triples of consecutive sentences $s_{i-1}$, $s_i$, and $s_{i+1}$. It is an encoder-decoder model that encodes the sentence $s_i$ and predicts the previous and next sentences $s_{i-1}$ and $s_{i+1}$ from its sentence representation (Figure 1). As a result of training, the encoder can produce sentence representations. Skip-Thought demonstrates high performance, especially when applied to document classification tasks.
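As an illustration, the following is a minimal sketch of encoding sentences with the pre-trained Skip-Thought model; the function names follow the linked repository's README and assume its pre-trained model files have been downloaded.

    # Minimal sketch: encode sentences with pre-trained Skip-Thought
    # (ryankiros/skip-thoughts); assumes the pre-trained model files
    # have been downloaded as described in the repository's README.
    import skipthoughts

    model = skipthoughts.load_model()
    encoder = skipthoughts.Encoder(model)

    sentences = ["The cat sat on the mat.", "A dog slept on the rug."]
    vectors = encoder.encode(sentences)  # shape: (2, 4800)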

Second, InferSent (https://github.com/facebookresearch/InferSent) Conneau et al. (2017) is a supervised model of universal sentence representations trained on the Stanford Natural Language Inference (SNLI) dataset (https://nlp.stanford.edu/projects/snli/) Bowman et al. (2015). Natural Language Inference is a classification task over sentence pairs with three labels, namely entailment, contradiction, and neutral; thus, InferSent can train sentence representations that are sensitive to differences in meaning. The model encodes a sentence pair $s_1$ and $s_2$ into sentence representations $u$ and $v$ with a bi-directional LSTM architecture with max pooling, and generates features from $u$ and $v$ (Figure 2). InferSent demonstrates high performance across various document classification and semantic textual similarity tasks.
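Similarly, a minimal sketch of encoding sentences with pre-trained InferSent; the parameter values and file paths are assumptions taken from the linked repository's README.

    # Minimal sketch: encode sentences with pre-trained InferSent
    # (facebookresearch/InferSent); parameter values and file paths
    # follow the repository's README and are assumptions here.
    import torch
    from models import InferSent  # module shipped with the repository

    params = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
              'pool_type': 'max', 'dpout_model': 0.0, 'version': 1}
    model = InferSent(params)
    model.load_state_dict(torch.load('encoder/infersent1.pkl'))
    model.set_w2v_path('GloVe/glove.840B.300d.txt')  # pre-trained word vectors

    sentences = ["The cat sat on the mat.", "A dog slept on the rug."]
    model.build_vocab(sentences, tokenize=True)
    embeddings = model.encode(sentences, tokenize=True)  # shape: (2, 4096)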

3.2 Regression Model for MTE

In this paper, we propose a segment-level MTE metric for to-English language pairs. This problem can be treated as a regression problem that estimates translation quality as a real number from an MT hypothesis $e_1$ and a reference translation $e_2$. Once $d$-dimensional sentence vectors $\vec{e_1}$ and $\vec{e_2}$ are generated, the proposed model applies the following three matching methods to extract relations between $\vec{e_1}$ and $\vec{e_2}$ (Figure 3).

  • Concatenation: $(\vec{e_1}, \vec{e_2})$

  • Element-wise product: $\vec{e_1} \odot \vec{e_2}$

  • Absolute element-wise difference: $|\vec{e_1} - \vec{e_2}|$

Thus, we perform regression using the $4d$-dimensional features $(\vec{e_1}, \vec{e_2})$, $\vec{e_1} \odot \vec{e_2}$, and $|\vec{e_1} - \vec{e_2}|$.
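A minimal sketch of this feature extraction, assuming the sentence vectors have already been computed (numpy is used for illustration):

    # Sketch of the three matching methods of Section 3.2: given the
    # d-dimensional sentence vectors of an MT hypothesis and a reference
    # translation, build the 4d-dimensional feature vector for the SVR.
    import numpy as np

    def extract_features(e1, e2):
        """e1: vector of the MT hypothesis; e2: vector of the reference."""
        return np.concatenate([e1,                # concatenation: 2d dims
                               e2,
                               e1 * e2,           # element-wise product: d dims
                               np.abs(e1 - e2)])  # absolute difference: d dims

    d = 4096
    features = extract_features(np.random.rand(d), np.random.rand(d))
    assert features.shape == (4 * d,)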

4 Experiments of Segment-Level MTE for To-English Language Pairs

We performed experiments using evaluation datasets of the WMT Metrics task to verify the performance of the proposed metric.

Metric                                cs-en  de-en  fi-en  ro-en  ru-en  tr-en   Avg.
SentBLEU Bojar et al. (2016)          0.557  0.448  0.484  0.499  0.502  0.532  0.504
Blend Ma et al. (2017)                0.709  0.601  0.584  0.636  0.633  0.675  0.640
DPMFcomb Bojar et al. (2016)          0.713  0.584  0.598  0.627  0.615  0.663  0.633
ReVal Bojar et al. (2016)             0.577  0.528  0.471  0.547  0.528  0.531  0.530
SVR with Skip-Thought                 0.665  0.571  0.609  0.677  0.608  0.599  0.622
SVR with InferSent                    0.679  0.604  0.617  0.640  0.644  0.630  0.636
SVR with InferSent + Skip-Thought     0.686  0.611  0.633  0.660  0.649  0.646  0.648
Table 2: Segment-level Pearson correlation between metric scores and DA human evaluation scores for to-English language pairs in WMT-2016 (newstest2016).

4.1 Setups

Datasets.

We used datasets for to-English language pairs from the WMT-2016 Metrics task Bojar et al. (2016), as summarized in Table 1. Following Ma et al. (2017), for testing on each to-English language pair (560 sentences) in WMT-2016, we employed all of the remaining to-English DA data as training data (4,800 sentences).

Features.

Publicly available pre-trained sentence representations, Skip-Thought and InferSent (see the repositories linked in Section 3.1), were used to extract the features described in Section 3.2. Skip-Thought provides 4,800-dimensional sentence representations trained on 74 million sentences of the BookCorpus dataset Zhu et al. (2015). InferSent provides 4,096-dimensional sentence representations trained on both 560,000 sentences of the SNLI dataset Bowman et al. (2015) and 433,000 sentences of the MultiNLI dataset Williams et al. (2017).

Model.

Our regression model is SVR with the RBF kernel from scikit-learn (http://scikit-learn.org/stable/). Hyper-parameters were determined through 10-fold cross validation on the training data; we examined all combinations of the hyper-parameters $C$, $\varepsilon$, and $\gamma$ (the regularization coefficient, the epsilon-tube width, and the RBF kernel coefficient, respectively).
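A sketch of this setup follows; the candidate grids and data shapes below are illustrative assumptions only, as the actual search ranges are not reproduced here.

    # Sketch of the model setup: SVR (RBF kernel) with hyper-parameters
    # chosen by 10-fold cross validation. The candidate grids and data
    # shapes below are illustrative assumptions only.
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    param_grid = {'C': [0.1, 1.0, 10.0],        # assumed candidates
                  'epsilon': [0.01, 0.1, 1.0],  # assumed candidates
                  'gamma': [0.001, 0.01, 0.1]}  # assumed candidates
    search = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=10)

    X_train = np.random.rand(100, 64)  # placeholder for the 4d matching features
    y_train = np.random.rand(100)      # placeholder for DA human scores
    search.fit(X_train, y_train)
    best_model = search.best_estimator_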

We compare against three methods, described in Section 2: Blend Ma et al. (2017), DPMFcomb Yu et al. (2015a), and ReVal Gupta et al. (2015a, b). Blend and DPMFcomb are the MTE metrics that achieved the best performance in the WMT-2017 Metrics task Bojar et al. (2017) and the WMT-2016 Metrics task, respectively. We compare the Pearson correlation between each metric's scores and the DA human evaluation scores.
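Segment-level Pearson correlation can be computed as follows (values are illustrative):

    # Sketch of the evaluation protocol: segment-level Pearson correlation
    # between metric scores and DA human scores (scipy).
    from scipy.stats import pearsonr

    metric_scores = [0.71, 0.42, 0.88, 0.15]
    da_scores = [0.65, 0.30, 0.91, 0.05]
    r, _ = pearsonr(metric_scores, da_scores)
    print("segment-level Pearson r = %.3f" % r)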

4.2 Result

As can be seen in Table 2, the proposed metric combining InferSent and Skip-Thought representations achieves the best score in three out of six to-English language pairs and achieves state-of-the-art performance on average.

4.3 Discussion

These results indicate that universal sentence representations can be adopted for MTE by training a regression model on DA human evaluation data. Because Blend is an ensemble method that uses combinations of various MTE metrics as features, our results show that universal sentence representations capture richer information than such a complex model. Because ReVal is also based on sentence representations, we conclude that universal sentence representations trained on a large-scale dataset are more effective for MTE than sentence representations trained on a small or limited in-domain dataset.

4.4 Error Analysis

We re-implemented Blend (http://github.com/qingsongma/blend) Ma et al. (2017) and compared its evaluation results with those of the proposed metric. (The average Pearson correlation over all language pairs of our re-implementation of Blend was 0.636, slightly lower than the value reported in their paper; we judged that this difference does not affect the following discussion.)

We analyzed the top 20% of pairs of MT hypotheses and reference translations (112 sentence pairs × 6 language pairs = 672 sentence pairs) in descending order of DA human score in each language pair. In other words, we analyzed the MT hypotheses whose meaning was closest to that of the reference translations in each language pair. Among these, 70 sentence pairs were estimated to have high translation quality only by Blend, and 88 sentence pairs only by our metric.
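The selection step of this analysis can be sketched as follows (the data structure is an illustrative assumption):

    # Sketch of the selection step: per language pair, sort segments by
    # DA human score and keep the top 20% (112 of 560 segments).
    def top_20_percent(segments):
        """segments: list of (mt_hypothesis, reference, da_score) tuples."""
        ranked = sorted(segments, key=lambda s: s[2], reverse=True)
        return ranked[:int(len(ranked) * 0.2)]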

Surface.

Among the pairs estimated to have high translation quality by each method, 26 pairs (Blend) and 42 pairs (proposed method) had a low surface-level word matching rate between the MT hypothesis and the reference translation. This result shows that the proposed metric can evaluate a wide range of sentence information that Blend cannot capture.

Unknown words.

There were 26 MT hypotheses containing words treated as unknown by Skip-Thought or InferSent that were correctly evaluated only by Blend; conversely, 26 such MT hypotheses were correctly evaluated only by the proposed metric. This result shows that the proposed metric is affected by unknown words, although some MT hypotheses containing unknown words can still be evaluated correctly. We therefore analyzed further by focusing on sentence length. Among short MT hypotheses (15 words or less) containing words treated as unknown by either Skip-Thought or InferSent, 17 were correctly evaluated by Blend, whereas only two were correctly evaluated by the proposed metric. This result indicates that the shorter the sentence, the more the proposed metric is affected by unknown words.

5 Conclusions

In this study, we applied universal sentence representations to MTE by training a regression model on DA human evaluation data. Our segment-level MTE metric achieved the best performance on the WMT-2016 dataset. We conclude the following:

  • Universal sentence representations capture information more comprehensively than an ensemble metric that combines various MTE metrics based on character or word N-gram features.

  • Universal sentence representations trained on a large-scale dataset are more effective than sentence representations trained on a small or limited in-domain dataset.

  • Although a metric based on SVR with universal sentence representations is not good at handling unknown words, it correctly estimates the translation quality of MT hypotheses that have a low surface word matching rate with their reference translations.

Following the success of InferSent Conneau et al. (2017), many works on universal sentence representations have been published Wieting and Gimpel (2017); Cer et al. (2018); Subramanian et al. (2018). Based on the results of our work, we expect that MTE metrics will be further improved by using these better universal sentence representations.

References