Does Summary Evaluation Survive Translation to Other Languages?

Creating a large summarization quality dataset is a considerable, expensive, and time-consuming effort that requires careful planning and setup. It includes producing human-written and machine-generated summaries and having the summaries evaluated by humans, preferably linguistic experts, and by automatic evaluation tools. If such an effort is made in one language, it would be beneficial to be able to reuse it in other languages. To investigate how much we can trust the translation of such a dataset without repeating human annotations in another language, we translated an existing English summarization dataset, SummEval, into four different languages and analyzed the scores from the automatic evaluation metrics in the translated languages, as well as their correlation with human annotations in the source language. Our results reveal that although translation changes the absolute value of automatic scores, the scores keep the same rank order and approximately the same correlations with human annotations.




1 Introduction

A large summarization dataset includes thousands of texts and human-written summaries of the texts (for example, CNN/Daily Mail Hermann et al. (2015)). In order to make it applicable for wider research, it may also contain machine-generated summaries by many models, accompanied by human and machine evaluations of the quality of the generated summaries Fabbri et al. (2020). Even the human annotation alone is a complicated effort, requiring careful planning and setup Iskender et al. (2021); Kryscinski et al. (2020). Can we reuse such a dataset in other languages, at least for some research purposes?

In this paper, we focus on answering this question with respect to automated summarization evaluation, using the SummEval dataset Fabbri et al. (2020) as an example. We translate this dataset from English to four other languages - French, German, Italian, and Spanish - and keep the original human annotations. We then consider correlations between automated summary evaluation measures and human annotations and show that the relative ranking of different measures does not change with translation. This means that, at this point, translation can be trusted for purposes of researching and using different summarization evaluation measures in other languages, with the additional benefit of avoiding human re-annotation of the summaries after translation.

2 Methods

We use the English summarization dataset SummEval Fabbri et al. (2020), specifically its human-annotated part. This part consists of 100 texts, and each text is accompanied by 11 human-written reference summaries and 16 machine-generated summaries produced by 16 different models. Each machine-generated summary is annotated by three experts and five crowd workers using a 5-point scale for four quality measures: consistency, relevance, coherence, and fluency. In our analysis, we took the average of the expert scores for each quality of each annotated text-summary pair.
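The averaging step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout (a list with one dict of quality scores per expert) and the example scores are assumptions made for the sketch.

```python
from statistics import mean

QUALITIES = ("coherence", "consistency", "fluency", "relevance")


def average_expert_scores(expert_annotations):
    """Average each quality over the expert annotations of one summary.

    `expert_annotations` is assumed to be a list of dicts, one per expert,
    mapping each quality name to a score on the 5-point scale.
    """
    return {q: mean(a[q] for a in expert_annotations) for q in QUALITIES}


# Hypothetical annotations by three experts for a single summary.
example = [
    {"coherence": 4, "consistency": 5, "fluency": 5, "relevance": 3},
    {"coherence": 3, "consistency": 5, "fluency": 5, "relevance": 4},
    {"coherence": 5, "consistency": 5, "fluency": 4, "relevance": 4},
]
averaged = average_expert_scores(example)
print(averaged)
```

Applying this to all 1600 annotated summaries would give one 1600-dimensional vector of expert scores per quality.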

For our analysis, we translated this entire part of the dataset - 100 source texts, 1100 human reference summaries, and 1600 machine-generated summaries - into four languages (French, German, Italian, and Spanish) using Helsinki-NLP translation models from the transformers library Wolf et al. (2020): opus-mt-en-de, opus-mt-en-fr, opus-mt-en-it, and opus-mt-en-es. We then used the original human annotations provided by the SummEval dataset as the annotations for the four translated versions.
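A translation step of this kind might look like the sketch below, which loads an OPUS-MT model through the MarianMT classes of the transformers library. The helper names, the batching, and the truncation of long inputs are assumptions of this sketch; the paper does not describe how long source texts were split or batched.

```python
TARGET_LANGUAGES = ("de", "fr", "it", "es")


def opus_mt_model_name(target_lang, source_lang="en"):
    """Return the Hugging Face identifier of a Helsinki-NLP OPUS-MT model."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


def translate(texts, target_lang):
    """Translate a list of English strings (downloads the model on first use)."""
    # Imported lazily so the name helper works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    name = opus_mt_model_name(target_lang)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    # Assumption: inputs fit the model's maximum length. Anything longer is
    # truncated here, so full articles would need to be split beforehand.
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


MODEL_NAMES = [opus_mt_model_name(lang) for lang in TARGET_LANGUAGES]
print(MODEL_NAMES)
```

A call such as `translate(["The cat sat on the mat."], "de")` would return a list with one German string; it requires the transformers and PyTorch packages and network access for the first download.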

Figure 1: PCA 2D plot of normalized summary quality scores, reduced from the original dimension of 1600 (the number of summaries scored). The human scores are in black: the average of the 3 experts is shown as a square, and the average of the 5 crowd workers as a triangle. The automated scores are distinguished by language version; the five languages are shown in different colors. The automated scores shown are BERTScore (F1), BLANC, BLEU, Jensen-Shannon, and ROUGE (1, 2, and L).

In each language version of the dataset, we evaluate machine-generated summaries with a few common or promising automated evaluation measures. Since not all automatic evaluation metrics are available for the five languages that we investigate, we focus only on the evaluation measures that are reasonably easy to apply in different languages. We calculated the following reference-based automatic evaluation metrics: BLEU Papineni et al. (2002) with NLTK, BERTScore-F1 Zhang et al. (2020), and ROUGE Lin (2004) as ROUGE-1, 2, 3, and L, with ROUGE-L computed as rougeLsum. Additionally, we calculated two reference-free automatic evaluation metrics: Jensen-Shannon Louis and Nenkova (2009), and BLANC Vasilyev et al. (2020) as BLANC-help, with the underlying BERT model chosen from the transformers library Wolf et al. (2020) according to the language: bert-base-uncased for English, dbmdz-bert-base-french-europeana-cased for French, bert-base-german-dbmdz-cased for German, dbmdz-bert-base-italian-cased for Italian, and dccuchile-bert-base-spanish-wwm-uncased for Spanish.
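To make the reference-free Jensen-Shannon measure concrete, the sketch below computes the Jensen-Shannon divergence between the word distributions of a text and its summary. It is a simplified illustration: whitespace tokenization, lowercasing, and the absence of smoothing or stop-word handling are assumptions of this sketch and may differ from the implementation used in the paper.

```python
import math
from collections import Counter


def word_distribution(text):
    """Relative frequencies of lowercased, whitespace-separated tokens."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("text must contain at least one token")
    total = len(tokens)
    return {word: count / total for word, count in Counter(tokens).items()}


def js_divergence(text, summary):
    """Jensen-Shannon divergence (base 2) between two word distributions.

    The value lies in [0, 1]: 0 for identical distributions, 1 when the
    two vocabularies do not overlap at all.
    """
    p = word_distribution(text)
    q = word_distribution(summary)
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


print(js_divergence("the cat sat on the mat", "the cat sat"))
```

Because it needs only token counts, a measure of this kind carries over to a translated dataset without any language-specific model, which is one reason it is easy to apply across the five languages.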

In each language version, we obtained correlations of the calculated automated evaluation measures with the human annotations provided by the dataset. Our inquiry is: are these correlations reasonably independent of the language? In other words, can we rely on such correlations in a language other than English, knowing that the dataset was translated, and the human annotations are the original annotations that were obtained in English?
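As an illustration of the rank correlation involved, here is a small pure-Python Kendall tau-b. The choice of the tau-b variant, which corrects for ties, is an assumption of this sketch; the text does not state which variant was used. In practice a library routine such as scipy.stats.kendalltau would normally be preferred.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long score sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: counted in neither term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + ties_x) * (untied + ties_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when a variable is constant")
    return (concordant - discordant) / denominator


# Hypothetical scores for five summaries: a human rating and a metric.
human = [1, 2, 3, 4, 5]
metric = [1, 3, 2, 5, 4]
print(kendall_tau_b(human, metric))
```

The pairwise loop is quadratic in the number of summaries, which remains manageable for 1600-dimensional score vectors.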

             Spearman’s correlation        Kendall’s correlation
ROUGE-1      0.851  0.785  0.816  0.800    0.663  0.594  0.624  0.608
ROUGE-2      0.802  0.762  0.792  0.781    0.611  0.567  0.596  0.588
ROUGE-L      0.848  0.817  0.813  0.823    0.665  0.624  0.620  0.635
BLEU         0.816  0.828  0.823  0.816    0.635  0.651  0.646  0.637
BERTScore    0.702  0.761  0.739  0.702    0.512  0.566  0.545  0.515
JS           0.852  0.897  0.861  0.848    0.661  0.719  0.669  0.658
BLANC        0.624  0.547  0.791  0.638    0.451  0.388  0.594  0.455
p < 0.001 for all correlations
Table 1: Spearman’s and Kendall’s correlations between ROUGE, BLEU, BERTScore, JS (Jensen-Shannon), and BLANC in English and in translated languages German, French, Spanish, Italian

3 Results

In order to get an intuitive confirmation of the language independence that we seek, or at least a motivation to proceed further, we considered the scores for the 1600 summaries with reduced dimensionality. The example shown in Figure 1 projects each 1600-dimensional vector to a 2D point. There are eight vectors of human scores, corresponding to two types of annotators (experts and crowd workers) and four human evaluation measures: coherence, consistency, fluency, and relevance. Each score is averaged over its annotators - three annotators for experts and five annotators for crowd workers. Each automated measure (for example, ROUGE-2) produced five 1600-dimensional vectors corresponding to the five languages. With the seven measures, this makes 35 vectors. Figure 1 shows that the scores do not split far apart by language, thus providing us with a motivation for a real analysis.
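The projection described above can be sketched with a plain SVD-based PCA in NumPy. The normalization used here (z-scoring each score vector) is an assumption, since the text only says the scores were normalized, and the data below are random stand-ins rather than real scores.

```python
import numpy as np


def pca_2d(score_vectors):
    """Project each row (one score vector) onto the top two principal axes."""
    x = np.asarray(score_vectors, dtype=float)
    # Assumed normalization: zero mean and unit variance within each vector.
    x = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
    # Center every dimension across vectors before the decomposition.
    x = x - x.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :2] * s[:2]


# Random stand-in for the 8 human and 35 automated vectors of 1600 scores.
rng = np.random.default_rng(0)
toy_scores = rng.normal(size=(43, 1600))
points = pca_2d(toy_scores)
print(points.shape)
```

Each row of `points` is one 2D point of the kind plotted in Figure 1; with real scores, vectors of the same metric in different languages would be expected to land close together.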

(a) Coherence
(b) Consistency
(c) Fluency
(d) Relevance
Figure 2: Kendall’s correlations of ROUGE-1, ROUGE-2, ROUGE-L, BLEU, BERTScore, JS (Jensen-Shannon), and BLANC with coherence (fig. 2(a)), consistency (fig. 2(b)), fluency (fig. 2(c)), and relevance (fig. 2(d)) for English (EN), German (DE), French (FR), Spanish (ES), and Italian (IT). SD means that the correlation is significantly different from the correlation in English.

In our correlation analysis, we investigated two aspects: 1) correlation of automatic metrics for English with each of the automatic metrics for the translated languages, and 2) correlation of human annotations (coherence, consistency, fluency, and relevance) with each of the automatic metrics in five languages. We considered the mean expert annotations for English summaries in the SummEval dataset as the gold standard evaluation; therefore, we did not use the crowd annotations for our correlation analysis with human annotations.

To explore whether the average of the automatic metrics changes after translation, we applied the Kruskal-Wallis test, a non-parametric analogue of ANOVA, to the ROUGE-1, ROUGE-2, ROUGE-L, BLEU, BERTScore, JS, and BLANC scores across the five languages. The test revealed statistically significant differences between the language versions of these automatic metrics, showing that machine translation has influenced their absolute values.
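For reference, the Kruskal-Wallis H statistic behind such a test can be computed as in the sketch below. It omits the tie correction and the p-value (H is compared against a chi-square distribution with one degree of freedom fewer than the number of groups); scipy.stats.kruskal provides both. The toy groups are hypothetical.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (without tie correction)."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical scores of one metric in three language versions.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(h)
```

In the setting of this paper, each group would hold the 1600 scores of one metric in one of the five languages.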

Further, we calculated the correlation between these automatic metrics to determine whether their rank order is affected by the translations. Table 1 shows the Spearman’s and Kendall’s correlation coefficients of the automatic metrics for English with the same metrics for the translations. Here, we observe that the Spearman’s correlation coefficients range from strong to very strong, and that Kendall’s tau correlations are at a strong level for all automatic metrics. This result shows that the machine translations do not influence the rank order of the automatic metrics, and translated summarization datasets can be used for training multilingual summarization tools.

Next, we calculated Kendall’s correlation coefficients of the mean expert coherence, consistency, fluency, and relevance ratings with the automatic metrics in the five languages. In order to make a reliable judgment, we consider correlations between the 1600-dimensional score vector of each automated measure and each of the four 1600-dimensional vectors of expert scores. Figure 2 shows the correlation coefficients as bar plots. Here, we observe that the correlation coefficients range from very weak to moderate. To determine whether the correlation differences between languages are statistically significant, we applied Zou’s confidence intervals test for dependent and overlapping variables Zou (2007).
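A confidence interval of this kind can be sketched as follows, for the difference between two correlations that share one variable (here, the human score correlated with a metric in English and with the same metric after translation). The formula is the commonly cited form of Zou's interval for Pearson correlations with Fisher z limits; how it was applied to Kendall's correlations in this study is not stated, so treat the sketch and its input values as assumptions.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, alpha=0.05):
    """Confidence interval for one correlation via the Fisher z-transform."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z / math.sqrt(n - 3)
    center = math.atanh(r)
    return math.tanh(center - half_width), math.tanh(center + half_width)


def zou_overlapping_ci(r12, r13, r23, n, alpha=0.05):
    """Interval for r12 - r13, where both correlations involve variable 1.

    r23 is the correlation between the two non-shared variables.
    """
    l1, u1 = fisher_ci(r12, n, alpha)
    l2, u2 = fisher_ci(r13, n, alpha)
    # Correlation between the two sample correlations r12 and r13.
    c = ((r23 - 0.5 * r12 * r13) * (1 - r12**2 - r13**2 - r23**2) + r23**3) / (
        (1 - r12**2) * (1 - r13**2)
    )
    diff = r12 - r13
    lower = diff - math.sqrt(
        (r12 - l1) ** 2 + (u2 - r13) ** 2 - 2 * c * (r12 - l1) * (u2 - r13)
    )
    upper = diff + math.sqrt(
        (u1 - r12) ** 2 + (r13 - l2) ** 2 - 2 * c * (u1 - r12) * (r13 - l2)
    )
    return lower, upper


# Hypothetical values: human vs. metric (EN), human vs. metric (translated),
# and metric (EN) vs. metric (translated), over 1600 summaries.
interval = zou_overlapping_ci(r12=0.30, r13=0.30, r23=0.80, n=1600)
print(interval)
```

If the interval excludes zero, the two correlations are judged significantly different at the chosen level; an interval containing zero gives no evidence that translation changed the correlation.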

For coherence (see figure 2(a)), we found statistically significant differences for ROUGE-2 in Spanish, for BERTScore in German, Spanish, and Italian, and for BLANC in French. For consistency (figure 2(b)), none of the correlation coefficients differed significantly except for the correlation of BLANC in French. There were no significant differences for fluency (figure 2(c)). For relevance (figure 2(d)), there was a significant difference for ROUGE-2 in German.

These seven exceptions - statistically significant differences - are discussed in Appendix A, where we argue that at least two of them are certainly not the fault of the translation, and that the remaining cases may not be the fault of the translation either. Overall, the correlations of automatic metrics in translated languages with expert annotations did not differ significantly from the correlations in English for a high proportion of the comparisons, indicating that the translation did not cause a significant difference when correlating with human annotations.

4 Conclusion

In this paper, we explored how automated evaluation of summarization quality may depend on translation of the texts and summaries to another language without repeating the human annotation in the new language. To do so, we focused on the SummEval dataset Fabbri et al. (2020) as our example and considered its translation (by Helsinki-NLP models) to French, German, Italian, and Spanish.

Our analysis in Section 3 demonstrates that although the translation changed the absolute values of the automatic metrics, their rank order and correlations with human annotations stay the same for a very high proportion of correlations. In other words, summary evaluation mostly survives translation to another language, at least to a language that is not very different.

We tentatively suggest that a human-annotated English summarization dataset can be translated and used for evaluating summaries and summarization systems in other languages. In this sense, the translation quality is currently higher than the quality of summary evaluation, and summarization research can benefit from it. In the future, it would be interesting to consider translation to more languages and to review human-metric correlations for a wider range of metrics. As translation to other languages can be more challenging than to the languages we considered here, it would be interesting to find out whether one of the evaluation metrics could best give a reliable warning against relying on the original human scores for a specific quality (such as coherence).


References

  • A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev (2020) SummEval: re-evaluating summarization evaluation. arXiv:2007.12626v4.
  • K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 1693–1701.
  • N. Iskender, T. Polzehl, and S. Möller (2021) Reliability of human evaluation for text summarization: lessons learned and challenges ahead. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pp. 86–96.
  • W. Kryscinski, B. McCann, C. Xiong, and R. Socher (2020) Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 9332–9346.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Proceedings of Workshop on Text Summarization Branches Out, pp. 74–81.
  • A. Louis and A. Nenkova (2009) Automatically evaluating content selection in summarization without human models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, P. Koehn and R. Mihalcea (Eds.), pp. 306–314.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pp. 311–318.
  • O. Vasilyev, V. Dharnidharka, and J. Bohannon (2020) Fill in the BLANC: human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, S. Eger, Y. Gao, M. Peyrard, W. Zhao, and E. Hovy (Eds.), pp. 11–20.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2020) BERTScore: evaluating text generation with BERT. arXiv:1904.09675v3.
  • G. Y. Zou (2007) Toward using confidence intervals to compare correlations. Psychological Methods 12 (4), pp. 399.

Appendix A Exceptional cases

As we have found (Figure 2), there were 7 cases where translation resulted in a statistically significant change in the correlation of an evaluation measure with the original human scores. These cases are summarized in Table 2.

Lang   Quality       Metric      Change
DE     Coherence     BERTScore   decreased
DE     Relevance     ROUGE-2     decreased
FR     Coherence     BLANC       increased
FR     Consistency   BLANC       decreased
ES     Coherence     ROUGE-2     increased
ES     Coherence     BERTScore   decreased
IT     Coherence     BERTScore   decreased
Table 2: Exceptional cases: each row represents a case in which the correlation of an automated metric (third column) with the original human scores (second column) changed statistically significantly after translation from English to the language in the first column. The last column shows how the correlation changed.

Any such case can be a consequence of either one or both of the following situations:

  1. The quality of the translation is not high enough for using existing summary evaluation measures.

  2. The evaluation metric involved has a different quality in the considered language than in English. The reason could lie in the language itself and/or in a different quality of the metric's underlying model. The latter may be applicable to BERTScore and BLANC.

The two exceptional cases in Table 2 where the correlation increased - FR-Coherence-BLANC and ES-Coherence-ROUGE-2 - can be explained only by the second situation, because the quality of the evaluation measure only increased after translation. This leaves 5 suspect cases out of 112 comparisons (4 languages x 4 qualities x 7 measures). We have no definitive resolution for these 5 cases, but we can consider the following arguments.

If the translation lowers some summary quality (for purposes of using our evaluation measures), then it would be likely that not one but several measures would have significantly lower correlations with the original human scores. However, for each of the 5 cases we have only a single metric per Language-Quality pair. In 4 cases the metric (BERTScore or BLANC) uses an underlying model, and it is possible that the underlying models for BERTScore in German, Spanish, and Italian and for BLANC in French are not of the same quality as in English. It is also possible that the principles on which a metric is based (whether the metric is ROUGE-2, BERTScore, or BLANC) are successful to different degrees in different languages.