The human labor needed for machine translation (MT) evaluation is expensive. To alleviate this, automatic evaluation metrics are continuously being introduced, with the aim of correlating well with human judgements. Unfortunately, cutting-edge MT systems are too close in performance and generation style for such metrics to rank them reliably. Even for a metric whose correlation is reliable in most cases, empirical research has shown that it correlates poorly with human ratings when evaluating competitive systems (ma-etal-2019-results; mathur-etal-2020-tangled), limiting the development of MT systems.
Current MT evaluation still faces the challenge of how to better evaluate the overlap between the reference and the model hypothesis, taking adequacy and fluency into consideration: all the evaluation units are treated the same, i.e., all the matching scores are weighted equally. However, in real-world examinations, questions vary in their difficulty. Questions that are easily answered by most subjects tend to receive low weightings, while those that are hard to answer receive high weightings. A subject who is able to solve the more difficult questions receives a higher final score and thus a better ranking. MT evaluation is also a kind of examination. To bridge the gap between human examination and MT evaluation, it is advisable to incorporate a difficulty dimension into the MT evaluation metric.
In this paper, we take translation difficulty into account in MT evaluation and test the effectiveness of this idea on a representative MT metric, BERTScore (zhang2019bertscore). More specifically, the difficulty is first determined across systems with the help of pairwise similarity, and then exploited as a weight in the final score function to distinguish the contributions of different sub-units. Experimental results on the WMT19 English↔German evaluation tasks show that difficulty-aware BERTScore correlates better with human judgement than existing metrics do. Moreover, it agrees very well with human rankings when evaluating competitive systems.
2 Related Work
Existing MT evaluation metrics can be categorized into the following types according to their underlying matching sub-units: n-gram based (papineni-etal-2002-bleu; 10.5555/1289189.1289273; lin-och-2004-automatic; han-etal-2012-lepor; popovic-2015-chrf), edit-distance based (Snover06astudy; leusch-etal-2006-cder), alignment based (banerjee-lavie-2005-meteor), embedding based (zhang2019bertscore; chow-etal-2019-wmdo; lo-2019-yisi) and end-to-end based (sellam-etal-2020-bleurt). BLEU (papineni-etal-2002-bleu) is widely used as a vital criterion in comparing MT system performance, but its reliability has been questioned in the neural machine translation era (shterionov2018human; mathur-etal-2020-tangled). Because BLEU and its variants only assess surface linguistic features, some metrics leveraging contextual embeddings and end-to-end training bring semantic information into the evaluation, which further improves the correlation with human judgement. Among them, BERTScore (zhang2019bertscore) has achieved remarkable performance across MT evaluation benchmarks, balancing speed and correlation. In this paper, we choose BERTScore as our testbed.
3 Our Proposed Method
In real-world examinations, the questions are empirically divided into various levels of difficulty. Since the difficulty varies from question to question, so does the role each question plays in the evaluation. A simple question, which can be answered by most subjects, usually receives a low weighting. But a difficult question, which has more discriminative power and can only be answered by a small number of good subjects, receives a higher weighting.
Motivated by this evaluation mechanism, we measure the difficulty of a translation by viewing the MT systems as the subjects and the sub-units of a sentence as the questions. From this perspective, the impact of the sentence-level sub-units on the evaluation result should be differentiated: sub-units that most systems are likely to translate incorrectly (e.g., polysemous words) should have a higher weight in the assessment, while easier-to-translate sub-units (e.g., the definite article) should receive less weight.
[Table 1: Metric correlations with human ratings on En→De and De→En, over all systems and over the top 30% of systems.]
3.2 Difficulty-Aware BERTScore
In this part, we aim to answer two questions: 1) how to automatically collect the translation difficulty from BERTScore; and 2) how to integrate the difficulty into the score function. Figure 1 presents an overall illustration.
While traditional n-gram overlap cannot capture semantic similarity, word embeddings provide a means of quantifying the degree of overlap, which allows obtaining more accurate difficulty information. Since BERT is a strong language model, its contextual embeddings (i.e., the output of BERT) can be used to obtain representations of the reference $x = (x_1, \dots, x_k)$ and the hypothesis $\hat{x} = (\hat{x}_1, \dots, \hat{x}_l)$. Given a specific hypothesis token $\hat{x}_j$ and reference token $x_i$, with contextual embeddings $\hat{\mathbf{x}}_j$ and $\mathbf{x}_i$, the similarity score is computed as the cosine similarity:

$$s_{ij} = \frac{\mathbf{x}_i^{\top} \hat{\mathbf{x}}_j}{\lVert \mathbf{x}_i \rVert \, \lVert \hat{\mathbf{x}}_j \rVert}.$$
Subsequently, a similarity matrix is constructed by computing the token similarities pairwise. The token-level matching score is then obtained by greedily taking the maximal similarity in each row or column of the matrix, and is further used in sentence-level score aggregation.
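As a concrete illustration, the similarity matrix and greedy matching can be sketched as follows. This is a minimal NumPy sketch with assumed shapes and illustrative function names, not the official bert_score implementation:

```python
import numpy as np

def similarity_matrix(ref_emb: np.ndarray, hyp_emb: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between reference and hypothesis tokens.

    ref_emb: (k, d) embeddings of the k reference tokens
    hyp_emb: (l, d) embeddings of the l hypothesis tokens
    Returns a (k, l) matrix S with S[i, j] = cos(x_i, x_hat_j).
    """
    ref_n = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    hyp_n = hyp_emb / np.linalg.norm(hyp_emb, axis=1, keepdims=True)
    return ref_n @ hyp_n.T

def greedy_match(sim: np.ndarray):
    """Greedy matching: each token keeps its single best counterpart."""
    recall_scores = sim.max(axis=1)     # best hypothesis match per reference token
    precision_scores = sim.max(axis=0)  # best reference match per hypothesis token
    return recall_scores, precision_scores
```

The row-wise maxima feed the recall side of the score and the column-wise maxima feed the precision side, exactly mirroring the greedy search described above.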
The calculation of difficulty can be tailored to different metrics based on their overlap matching scores. In this case, BERTScore evaluates the token-level overlap via pairwise semantic similarity, so the token-level similarity is viewed as the bedrock of the difficulty calculation. For instance, if one token (like “cat”) in the reference finds identical or synonymous substitutions in only a few MT system outputs, then its translation difficulty weight ought to be larger than that of other reference tokens, which further indicates that it is more valuable for evaluating translation capability. Combined with the BERTScore mechanism, this is implemented by averaging the token similarities across systems. Given $M$ systems and their corresponding generated hypotheses $\hat{x}^{(1)}, \dots, \hat{x}^{(M)}$, with pre-normalized contextual embeddings so that the inner product equals the cosine similarity, the difficulty of a specific token $x_i$ in the reference is formulated as

$$d(x_i) = 1 - \frac{1}{M} \sum_{m=1}^{M} \max_{\hat{x}_j \in \hat{x}^{(m)}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j.$$
An example is shown in Figure 1: the entity “cat” is improperly translated as “monkey” and “puppy”, resulting in a lower pairwise similarity for the token “cat”, which indicates higher translation difficulty. Therefore, by incorporating translation difficulty into the evaluation process, the token “cat” contributes more to the overall score, while other words like “cute” matter less.
Because the translations generated by current NMT models are fluent but not always adequate, an F-score, which takes both Precision and Recall into account, is more appropriate for aggregating the matching scores than Precision alone. We thus follow vanilla BERTScore in using the F-score as the final score. The proposed method directly assigns the difficulty weights to the corresponding similarity scores without any hyperparameter:

$$R_{\mathrm{DA}} = \frac{\sum_{x_i \in x} d(x_i) \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j}{\sum_{x_i \in x} d(x_i)}, \qquad P_{\mathrm{DA}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j, \qquad F_{\mathrm{DA}} = 2\,\frac{P_{\mathrm{DA}} \cdot R_{\mathrm{DA}}}{P_{\mathrm{DA}} + R_{\mathrm{DA}}}.$$

For any hypothesis token $\hat{x}_j$, we simply keep a weight of 1, i.e., retaining the original Precision calculation. The motivation is that a human assessor keeps their initial matching judgement if the test taker produces a unique but reasonable alternative answer. We refer to $F_{\mathrm{DA}}$ as DA-BERTScore in the following.
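Putting the pieces together, the difficulty weighting and the difficulty-aware F-score can be sketched as below. This is a minimal sketch assuming pre-normalized token embeddings; the names `difficulty` and `da_bertscore` are illustrative, not the paper's released code, and Precision is kept unweighted:

```python
import numpy as np

def difficulty(ref_emb, hyp_embs):
    """d(x_i) = 1 - average over systems of x_i's best-match similarity.

    ref_emb:  (k, d) L2-normalized reference-token embeddings
    hyp_embs: list of (l_m, d) L2-normalized hypothesis embeddings,
              one matrix per MT system
    """
    best = [(ref_emb @ h.T).max(axis=1) for h in hyp_embs]
    return 1.0 - np.mean(best, axis=0)          # shape (k,)

def da_bertscore(ref_emb, hyp_emb, d):
    """Difficulty-weighted recall, vanilla precision, harmonic mean.

    Assumes at least one reference token has nonzero difficulty.
    """
    sim = ref_emb @ hyp_emb.T                   # (k, l) similarity matrix
    recall = (d * sim.max(axis=1)).sum() / d.sum()
    precision = sim.max(axis=0).mean()          # unweighted, as in BERTScore
    return 2 * precision * recall / (precision + recall)
```

A token that every system matches perfectly gets difficulty 0 and drops out of the weighted recall, while a token matched by few systems dominates it.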
There are many possible variants of our proposed method: 1) designing more elaborate difficulty functions (liu-etal-2020-norm; zhan-etal-2021-metacl); 2) applying a smoothing function to the difficulty distribution; and 3) using other kinds of F-score, e.g., a weighted $F_{\beta}$-score. The aim of this paper is not to explore this whole space but simply to show that a straightforward implementation works well for MT evaluation.
[Table 2: System-level scores and rankings on WMT19 En→De. Column headers reconstructed from context; the number in parentheses is the absolute difference between the metric's ranking and the human ranking.]

| System | BLEU | TER | METEOR | BERTScore | DA-BERTScore | Human |
|---|---|---|---|---|---|---|
| Facebook.6862 | 0.4364 (5) | 0.4692 (5) | 0.6077 (3) | 0.7219 (4) | 0.1555 (0) | 0.347 |
| Microsoft.sd.6974 | 0.4477 (1) | 0.4583 (1) | 0.6056 (3) | 0.7263 (0) | 0.1539 (1) | 0.311 |
| Microsoft.dl.6808 | 0.4483 (1) | 0.4591 (1) | 0.6132 (1) | 0.7260 (0) | 0.1544 (1) | 0.296 |
| MSRA.6926 | 0.4603 (3) | 0.4504 (3) | 0.6187 (3) | 0.7267 (3) | 0.1525 (0) | 0.214 |
| UCAM.6731 | 0.4413 (0) | 0.4636 (0) | 0.6047 (1) | 0.7190 (1) | 0.1519 (1) | 0.213 |
| NEU.6763 | 0.4460 (2) | 0.4563 (4) | 0.6083 (3) | 0.7229 (2) | 0.1521 (1) | 0.208 |
The WMT19 English↔German (En↔De) evaluation tasks are challenging due to the large discrepancy between human and automated assessments in terms of identifying the best system (bojar-etal-2018-findings; barrault-etal-2019-findings; freitag-etal-2020-bleu). To sufficiently validate the effectiveness of our approach, we choose these tasks as our evaluation subjects. There are 22 systems for En→De and 16 for De→En, each with its corresponding human assessment results. The experiments center on the correlation with system-level human ratings.
In order to compare against metrics with different underlying evaluation mechanisms, four representative metrics are included in the experiments: BLEU (papineni-etal-2002-bleu), TER (Snover06astudy), METEOR (banerjee-lavie-2005-meteor; denkowski:lavie:meteor-wmt:2014) and BERTScore (zhang2019bertscore), which are driven by n-grams, edit distance, word alignment and embedding similarity, respectively, and all of which remain popular. To ensure reproducibility, the original implementations of METEOR (https://www.cs.cmu.edu/~alavie/METEOR/index.html) and BERTScore (https://github.com/Tiiiger/bert_score), and the widely used SacreBLEU implementation (https://github.com/mjpost/sacrebleu), were used in the experiments.
Following the correlation criterion adopted by the official WMT organizers, Pearson's $r$ is used to validate the system-level correlation with human ratings. In addition, two rank correlations, Spearman's $\rho$ and the original Kendall's $\tau$, are used to examine the agreement with human rankings, as has been done in recent research (freitag-etal-2020-bleu). Table 1 lists the results. DA-BERTScore achieves competitive correlation results and further improves on the correlation of BERTScore. In addition to the results on all systems, we also present results on the top 30% of systems, where the calculated difficulty is more reliable and our approach should be more effective. The results confirm our intuition that DA-BERTScore can significantly improve the correlations in the competitive scenario, e.g., improving the score from 0.204 to 0.974 on En→De and from 0.271 to 0.693 on De→En.
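For reference, all three system-level correlations can be computed with SciPy; the scores below are made up purely to show the call pattern:

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical system-level metric scores and human assessment scores
metric_scores = [0.155, 0.154, 0.152, 0.151]
human_scores = [0.347, 0.296, 0.214, 0.213]

r, _ = pearsonr(metric_scores, human_scores)      # Pearson's r
rho, _ = spearmanr(metric_scores, human_scores)   # Spearman's rho
tau, _ = kendalltau(metric_scores, human_scores)  # Kendall's tau
```

Pearson's $r$ is sensitive to the absolute score values, while $\rho$ and $\tau$ depend only on the induced rankings, which is why they are the natural choice for examining agreement with human rankings.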
[Table 3: Token-level case study on En→De. Score columns inferred from context as BERTScore and DA-BERTScore; the unlabeled system in the first case is inferred to be Facebook.]

| System | BERTScore | DA-BERTScore | Sentence |
|---|---|---|---|
| Src | - | - | “I’m standing right here in front of you,” one woman said. |
| Ref | - | - | „Ich stehe genau hier vor Ihnen”, sagte eine Frau. |
| MSRA | 0.9656 | 0.0924 | „Ich stehe hier vor Ihnen”, sagte eine Frau. |
| Facebook | 0.9591 | 0.1092 | „Ich stehe hier direkt vor Ihnen”, sagte eine Frau. |
| Src | - | - | France has more than 1,000 troops on the ground in the war-wracked country. |
| Ref | - | - | Frankreich hat über 1.000 Bodensoldaten in dem kriegszerstörten Land im Einsatz. |
| MSRA | 0.6885 | 0.2123 | Frankreich hat mehr als 1.000 Soldaten vor Ort in dem kriegsgeplagten Land. |
Effect of Top-k Systems
Figure 2 compares the variation of the Kendall's $\tau$ correlation over the top-$k$ systems. Echoing previous research, the vast majority of metrics fail to correlate with the human ranking, and even show negative correlation when $k$ is small, meaning that current metrics are ineffective when facing competitive systems. With the help of the difficulty weights, the degradation in correlation is alleviated, e.g., improving the score from 0.07 to 0.73 for BERTScore. These results indicate the effectiveness of our approach and establish the necessity of adding difficulty.
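The top-$k$ analysis can be reproduced in a few lines: restrict both score lists to the $k$ systems ranked best by humans and recompute Kendall's $\tau$ on the slice. A pure-Python sketch with illustrative names:

```python
def kendall_tau(a, b):
    """Original Kendall's tau (no tie correction) over paired score lists."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def top_k_tau(metric_scores, human_scores, k):
    """Tau restricted to the k systems that humans ranked highest."""
    order = sorted(range(len(human_scores)),
                   key=lambda i: human_scores[i], reverse=True)
    idx = order[:k]
    return kendall_tau([metric_scores[i] for i in idx],
                       [human_scores[i] for i in idx])
```

Sweeping `k` from 2 up to the number of systems traces out exactly the kind of curve shown in Figure 2.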
Case Study of Ranking
Table 2 presents a case study on the En→De task. Existing metrics consistently select MSRA's system as the best, which diverges greatly from human judgement. DA-BERTScore ranks it the same as the humans do (4th), because most of its translations are of low difficulty and thus receive lower weights in the score. Encouragingly, DA-BERTScore ranks Facebook's system as the best one, implying that it overcomes more challenging translation difficulties. This testifies to the importance and effectiveness of considering translation difficulty in MT evaluation.
Case Study of Token-Level Difficulty
Table 3 presents two cases illustrating that our proposed difficulty-aware method successfully identifies omission errors ignored by BERTScore. In the first case, Facebook's system correctly translates the token “right”, and in the second case it uses the substitute “Soldaten am Boden”, which is lexically similar to the ground-truth token “Bodensoldaten”. Although MSRA's system suffers from word omissions in both cases, its hypotheses receive the higher ranking from BERTScore, which is inconsistent with human judgements. The reason might be that the semantics of the hypothesis are very close to those of the reference, so the slight lexical difference is hard to detect when calculating the similarity score. By distinguishing the difficulty of the reference tokens, DA-BERTScore successfully focuses the evaluation on the difficult parts and eventually corrects the score of Facebook's system, giving the right rankings.
Distribution of Difficulty Weights
The difficulty weights can reflect the translation ability of a group of MT systems: if the systems in a group have higher translation ability, the calculated difficulty weights will be smaller. Starting from this intuition, we visualize the distribution of difficulty weights in Figure 3. Clearly, the difficulty weights are concentrated at lower values, indicating that most tokens can be correctly translated by all the MT systems. For the difficulty weights calculated on the top 30% of systems, the whole distribution skews towards zero, since these competitive systems have better translation ability and thus most of the translations are easy for them. This confirms that the difficulty weights produced by our approach are reasonable.
5 Conclusion and Future Work
This paper introduces the concept of difficulty into machine translation evaluation and verifies our assumption with a representative metric, BERTScore. Experimental results on the WMT19 English↔German metric tasks show that our approach achieves a remarkable correlation with human assessment, especially when evaluating competitive systems, revealing the importance of incorporating difficulty into machine translation evaluation. Further analyses show that our proposed difficulty-aware BERTScore strengthens the evaluation of word-omission problems and generates reasonable distributions of difficulty weights.
Future work includes: 1) optimizing the difficulty calculation (zhan2021variance); 2) applying the approach to other MT metrics; and 3) testing on other generation tasks, e.g., text summarization.
This work was supported in part by the Science and Technology Development Fund, Macau SAR (Grant No. 0101/2019/A2), and the Multi-year Research Grant from the University of Macau (Grant No. MYRG2020-00054-FST). We thank the anonymous reviewers for their insightful comments.