1 Introduction
A large summarization dataset includes thousands of texts and human-written summaries of those texts (for example, CNN/Daily Mail, Hermann et al. (2015)). To make it useful for a wider range of research, it may also contain machine-generated summaries produced by many models, accompanied by human and automatic evaluations of the quality of the generated summaries Fabbri et al. (2020). The human annotation alone is already a complicated effort, requiring careful planning and setup Iskender et al. (2021); Kryscinski et al. (2020). Can we reuse such a dataset in other languages, at least for some research purposes?
In this paper, we focus on answering this question with respect to automated summarization evaluation, using the SummEval dataset Fabbri et al. (2020) as an example. We translate this dataset from English into four other languages - French, German, Italian, and Spanish - and keep the original human annotations. We then consider correlations between automated summary evaluation measures and human annotations and show that the relative ranking of different measures does not change with translation. This means that, at this point, translation can be trusted for the purposes of researching and using different summarization evaluation measures in other languages, with the additional benefit of avoiding human re-annotation of the summaries after translation.
2 Methods
We use the English summarization dataset SummEval (https://github.com/Yale-LILY/SummEval) Fabbri et al. (2020), specifically its human-annotated part. This part consists of 100 texts, and each text is accompanied by 11 human-written reference summaries and 16 machine-generated summaries produced by 16 different models. Each machine-generated summary is annotated by three experts and five crowd workers on a 5-point scale for four quality measures: consistency, relevance, coherence, and fluency. In our analysis, we took the average of the expert scores for each quality of each annotated text-summary pair.
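As a sketch of this averaging step, the snippet below assumes the SummEval annotations are stored as a jsonl file in which each record carries an expert_annotations list of per-rater score dictionaries; the exact field names are an assumption and should be checked against the released data.

```python
import json

QUALITIES = ["coherence", "consistency", "fluency", "relevance"]

def mean_expert_scores(jsonl_path):
    """Average the expert ratings per quality for every text-summary pair."""
    averaged = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            experts = record["expert_annotations"]  # assumed field name
            averaged.append(
                {q: sum(a[q] for a in experts) / len(experts) for q in QUALITIES}
            )
    return averaged
```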
For our analysis, we translated this entire part of the dataset - 100 source texts, 1,100 human reference summaries, and 1,600 machine-generated summaries - into four languages (French, German, Italian, and Spanish) using the Helsinki-NLP translation models (https://huggingface.co/Helsinki-NLP) from the transformers library Wolf et al. (2020): opus-mt-en-de, opus-mt-en-fr, opus-mt-en-it, and opus-mt-en-es. We then used the original human annotations provided by SummEval as the annotations for the four translated versions.
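A minimal sketch of how such a translation step can be implemented with the Helsinki-NLP MarianMT checkpoints through the transformers pipeline API; this is an illustration rather than the exact procedure used, and long source texts would additionally need sentence splitting to respect the models' input length limit.

```python
from transformers import pipeline

# One opus-mt checkpoint per target language.
MODELS = {
    "de": "Helsinki-NLP/opus-mt-en-de",
    "fr": "Helsinki-NLP/opus-mt-en-fr",
    "it": "Helsinki-NLP/opus-mt-en-it",
    "es": "Helsinki-NLP/opus-mt-en-es",
}

def translate(texts, lang):
    """Translate a list of English texts into the given target language."""
    translator = pipeline("translation", model=MODELS[lang])
    # Long documents should be split into sentences before this call.
    return [out["translation_text"] for out in translator(texts)]

# Example: translate one summary into German.
print(translate(["The report was published on Monday."], "de"))
```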

In each language version of the dataset, we evaluate the machine-generated summaries with several common or promising automated evaluation measures. Since not all automatic evaluation metrics are available for the five languages we investigate, we focus on measures that are reasonably easy to apply across languages. We calculated the following reference-based automatic evaluation metrics: BLEU Papineni et al. (2002) with NLTK; BERTScore-F1 (https://github.com/Tiiiger/bert_score) Zhang et al. (2020); and ROUGE Lin (2004) as ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-L, with ROUGE-L computed as rougeLsum (https://github.com/google-research/google-research/tree/master/rouge). We also calculated two reference-free automatic evaluation metrics: Jensen-Shannon Louis and Nenkova (2009) and BLANC Vasilyev et al. (2020) as BLANC-help (https://github.com/PrimerAI/blanc), with the underlying BERT model chosen from the transformers library Wolf et al. (2020) according to the language: bert-base-uncased for English, dbmdz/bert-base-french-europeana-cased for French, bert-base-german-dbmdz-cased for German, dbmdz/bert-base-italian-cased for Italian, and dccuchile/bert-base-spanish-wwm-uncased for Spanish.
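For illustration, the reference-based metrics can be computed with the cited libraries roughly as follows; this sketch scores a single summary against a single reference and omits the multi-reference aggregation and the reference-free metrics (JS and BLANC).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def reference_based_scores(summary, reference, lang="en"):
    """ROUGE, BLEU and BERTScore-F1 for a single summary-reference pair."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"])
    rouge = {k: v.fmeasure for k, v in scorer.score(reference, summary).items()}
    bleu = sentence_bleu(
        [reference.split()],
        summary.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    # BERTScore selects a default model for the given language code.
    _, _, f1 = bert_score([summary], [reference], lang=lang)
    return {**rouge, "bleu": bleu, "bertscore_f1": f1.item()}
```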
In each language version, we obtained correlations of the calculated automated evaluation measures with the human annotations provided by the dataset. Our inquiry is: are these correlations reasonably independent of the language? In other words, can we rely on such correlations in a language other than English, knowing that the dataset was translated, and the human annotations are the original annotations that were obtained in English?
Table 1: Spearman's and Kendall's correlations between each automatic metric computed on the original English summaries and the same metric computed on the translated summaries (p < 0.001 for all correlations).

|           | Spearman's Correlation        | Kendall's Correlation         |
| Metric    | EN-DE | EN-FR | EN-ES | EN-IT | EN-DE | EN-FR | EN-ES | EN-IT |
| ROUGE-1   | 0.851 | 0.785 | 0.816 | 0.800 | 0.663 | 0.594 | 0.624 | 0.608 |
| ROUGE-2   | 0.802 | 0.762 | 0.792 | 0.781 | 0.611 | 0.567 | 0.596 | 0.588 |
| ROUGE-L   | 0.848 | 0.817 | 0.813 | 0.823 | 0.665 | 0.624 | 0.620 | 0.635 |
| BLEU      | 0.816 | 0.828 | 0.823 | 0.816 | 0.635 | 0.651 | 0.646 | 0.637 |
| BERTScore | 0.702 | 0.761 | 0.739 | 0.702 | 0.512 | 0.566 | 0.545 | 0.515 |
| JS        | 0.852 | 0.897 | 0.861 | 0.848 | 0.661 | 0.719 | 0.669 | 0.658 |
| BLANC     | 0.624 | 0.547 | 0.791 | 0.638 | 0.451 | 0.388 | 0.594 | 0.455 |
3 Results
To get an intuitive confirmation of the language independence that we seek, or at least a motivation to proceed further, we examined the scores of the 1,600 summaries with reduced dimensionality. In the example shown in Figure 1, each 1600-dimensional vector of scores is projected to a 2D point. There are eight vectors of human scores, corresponding to the two types of annotators (experts and crowd workers) and the four human evaluation measures: coherence, consistency, fluency, and relevance. Each score is averaged over its annotators - three for experts and five for crowd workers. Each automated measure (for example, ROUGE-2) produces five 1600-dimensional vectors, one per language; with seven measures, this makes 35 vectors. Figure 1 shows that the scores do not split far apart by language, which motivates a more detailed analysis.
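The choice of projection is not essential for this illustration; as one possible sketch, the stacked score vectors can be normalized and reduced with PCA (the figure in the paper may rely on a different technique).

```python
import numpy as np
from sklearn.decomposition import PCA

def project_scores_to_2d(score_vectors: np.ndarray) -> np.ndarray:
    """Project stacked score vectors (here 43 x 1600: 8 human + 35 automated)
    onto two dimensions for plotting."""
    # Standardize each vector so metrics on different scales are comparable.
    centered = score_vectors - score_vectors.mean(axis=1, keepdims=True)
    normalized = centered / centered.std(axis=1, keepdims=True)
    return PCA(n_components=2).fit_transform(normalized)
```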
In our correlation analysis, we investigated two aspects: 1) the correlation of each automatic metric computed for English with the same metric computed for each of the translated languages, and 2) the correlation of the human annotations (coherence, consistency, fluency, and relevance) with each of the automatic metrics in the five languages. We considered the mean expert annotations of the English summaries in the SummEval dataset as the gold-standard evaluation; therefore, we did not use the crowd annotations in our correlation analysis with human annotations.
To explore whether the average values of the automatic metrics change after translation, we applied the Kruskal-Wallis test (a non-parametric analogue of ANOVA) to the ROUGE-1, ROUGE-2, ROUGE-L, BLEU, BERTScore, JS, and BLANC scores across the five languages. The test revealed statistically significant differences for these automatic metrics, showing that machine translation influences their absolute values.
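A minimal sketch of this test with scipy, assuming the per-summary scores of one metric are grouped by language:

```python
from scipy.stats import kruskal

def metric_shift_across_languages(scores_by_lang):
    """Kruskal-Wallis H-test for one automatic metric across languages.

    `scores_by_lang` maps a language code to that metric's 1600 per-summary
    scores, e.g. {"en": [...], "de": [...], "fr": [...], "it": [...], "es": [...]}.
    A small p-value indicates the metric's absolute values shift with language.
    """
    statistic, p_value = kruskal(*scores_by_lang.values())
    return statistic, p_value
```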
Further, we calculated the correlation between these automatic metrics to determine whether their rank order is affected by the translations. Table 1 shows Spearman's and Kendall's correlation coefficients between each automatic metric computed for English and the same metric computed for the translations. We observe that the Spearman's correlation coefficients range from strong to very strong, and that the Kendall's tau correlations are strong for all automatic metrics. This result shows that machine translation does not change the rank order of the automatic metrics, and that translated summarization datasets can be used for training multilingual summarization tools.
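These rank correlations can be computed with scipy, for example:

```python
from scipy.stats import kendalltau, spearmanr

def rank_agreement(english_scores, translated_scores):
    """Rank correlation of a metric's English scores with its translated scores."""
    rho, rho_p = spearmanr(english_scores, translated_scores)
    tau, tau_p = kendalltau(english_scores, translated_scores)
    return {"spearman": rho, "kendall": tau, "p_values": (rho_p, tau_p)}
```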
Next, we calculated Kendall's correlation coefficients between the mean expert ratings of coherence, consistency, fluency, and relevance and the automatic metrics in the five languages. To make a reliable judgment, we consider correlations between each automated measure's 1600-dimensional vector of scores and each of the four 1600-dimensional vectors of mean expert scores. Figure 2 shows the correlation coefficients as bar plots; they range from very weak to moderate. To determine whether the differences in correlation between languages are statistically significant, we applied Zou's confidence interval test for dependent and overlapping correlations Zou (2007). For coherence (Figure 2(a)), we found statistically significant differences for ROUGE-2 in Spanish, for BERTScore in German, Spanish, and Italian, and for BLANC in French. For consistency (Figure 2(b)), the only significant difference was for BLANC in French. There were no significant differences for fluency (Figure 2(c)). For relevance (Figure 2(d)), there was a significant difference for ROUGE-2 in German.
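A sketch of Zou's (2007) confidence interval for the difference between two overlapping dependent correlations (for example, corr(expert scores, metric in English) versus corr(expert scores, metric in a translation), which share the expert scores). The formulas follow the standard statement of the method for Pearson correlations; the exact implementation used in the paper is not spelled out here, so treat this as an approximation.

```python
import math
from scipy.stats import norm

def zou_ci_overlapping(r12, r13, r23, n, alpha=0.05):
    """CI for r12 - r13 when variable 1 (the expert scores) is shared.

    r12: corr(experts, metric in English); r13: corr(experts, metric translated);
    r23: corr(metric in English, metric translated); n: number of summaries.
    If the interval excludes 0, the two correlations differ significantly.
    """
    z = norm.ppf(1 - alpha / 2)

    def fisher_ci(r):
        # Individual confidence interval via the Fisher z-transform.
        zr = math.atanh(r)
        se = 1.0 / math.sqrt(n - 3)
        return math.tanh(zr - z * se), math.tanh(zr + z * se)

    l1, u1 = fisher_ci(r12)
    l2, u2 = fisher_ci(r13)
    # Correlation between the two sample correlations (they share variable 1).
    c = ((r23 - 0.5 * r12 * r13) * (1 - r12**2 - r13**2 - r23**2) + r23**3) / (
        (1 - r12**2) * (1 - r13**2)
    )
    diff = r12 - r13
    lower = diff - math.sqrt(
        (r12 - l1) ** 2 + (u2 - r13) ** 2 - 2 * c * (r12 - l1) * (u2 - r13)
    )
    upper = diff + math.sqrt(
        (u1 - r12) ** 2 + (r13 - l2) ** 2 - 2 * c * (u1 - r12) * (r13 - l2)
    )
    return lower, upper
```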
These seven exceptions - the statistically significant differences - are discussed in Appendix A, where we argue that at least two of them are certainly not the fault of the translation, and that the remaining cases may not be the fault of the translation either. Overall, for a high proportion of the comparisons, the correlations of automatic metrics with expert annotations in the translated languages did not differ significantly from those in English, indicating that translation did not substantially affect the correlations with human annotations.
4 Conclusion
In this paper, we explored how automated evaluation of summarization quality may depend on translating the texts and summaries into another language without repeating the human annotation in the new language. To do so, we focused on the SummEval dataset Fabbri et al. (2020) as our example and considered its translation (by Helsinki-NLP models, https://huggingface.co/Helsinki-NLP) into French, German, Italian, and Spanish.
Our analysis in Section 3 demonstrates that although the translation changed the absolute values of the automatic metrics, their rank order and their correlations with human annotations stayed the same for a very high proportion of the comparisons. In other words, summary evaluation mostly survives translation into another language, at least into a language that is not very different.
We tentatively suggest that a human-annotated English summarization dataset can be translated and used for evaluating summaries and summarization systems in other languages. In this sense, translation quality is currently higher than the quality of summary evaluation, and summarization research can benefit from it. In the future, it would be interesting to consider translation into more languages and to review human-metric correlations for a wider range of metrics. Since translation into other languages can be more challenging than into the languages considered here, it would also be interesting to determine whether one of the evaluation metrics could provide a reliable warning about whether the original human scores can still be trusted for a specific quality (such as coherence).
References
- A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev (2020). SummEval: Re-evaluating summarization evaluation. arXiv:2007.12626v4.
- K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015). Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, pp. 1693–1701.
- N. Iskender, T. Polzehl, and S. Möller (2021). Reliability of human evaluation for text summarization: Lessons learned and challenges ahead. In Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval), pp. 86–96.
- W. Kryściński, B. McCann, C. Xiong, and R. Socher (2020). Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 9332–9346.
- C.-Y. Lin (2004). ROUGE: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, pp. 74–81.
- A. Louis and A. Nenkova (2009). Automatically evaluating content selection in summarization without human models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 306–314.
- K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318.
- O. Vasilyev, V. Dharnidharka, and J. Bohannon (2020). Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pp. 11–20.
- T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, et al. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45.
- T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2020). BERTScore: Evaluating text generation with BERT. arXiv:1904.09675v3.
- G. Y. Zou (2007). Toward using confidence intervals to compare correlations. Psychological Methods, 12(4), pp. 399–413.
Appendix A Exceptional cases
As we have found (Figure 2), there were 7 cases in which translation resulted in a statistically significant change in the correlation of an evaluation measure with the original human scores. These cases are summarized in Table 2.
Table 2: The seven cases in which translation led to a statistically significant change in the correlation of an automatic metric with the original expert scores.

| Lang | Quality     | Metric    | Change    |
| DE   | Coherence   | BERTScore | decreased |
| DE   | Relevance   | ROUGE-2   | decreased |
| FR   | Coherence   | BLANC     | increased |
| FR   | Consistency | BLANC     | decreased |
| ES   | Coherence   | ROUGE-2   | increased |
| ES   | Coherence   | BERTScore | decreased |
| IT   | Coherence   | BERTScore | decreased |
Any such case can be a consequence of either one or both of the following situations:
- The quality of the translation is not high enough for using existing summary evaluation measures.
- The evaluation metric involved has a different quality in the considered language than in English. The reason could be the language itself and/or a difference in quality of the metric's underlying model; the latter may apply to BERTScore and BLANC.
Two of the cases in Table 2 - FR-Coherence-BLANC and ES-Coherence-ROUGE-2 - can be explained only by the second situation, because the quality of the evaluation measure after translation only increased. This leaves 5 suspect cases out of 112 comparisons (4 languages x 4 qualities x 7 measures). We have no definitive resolution for these 5 cases, but we can consider the following arguments.
If translation degraded some aspect of summary quality (for the purposes of our evaluation measures), it is likely that not one but several measures would have significantly lower correlations with the original human scores. However, in each of the 5 cases only a single metric is affected per language-quality pair. In 4 of the cases the metric (BERTScore or BLANC) relies on an underlying model, and it is possible that the underlying models for BERTScore in German, Spanish, and Italian and for BLANC in French are not of the same quality as the English ones. It is also possible that the principles on which a metric is based (whether it is ROUGE-2, BERTScore, or BLANC) succeed to different degrees in different languages.