MTG: A Benchmarking Suite for Multilingual Text Generation

08/13/2021 · Yiran Chen et al. · ByteDance Inc.

We introduce MTG, a new benchmark suite for training and evaluating multilingual text generation. It is the first and largest text generation benchmark with 120k human-annotated multi-way parallel examples for three tasks (story generation, question generation, and title generation) across four languages (English, German, French, and Spanish). Based on it, we define various evaluation scenarios and conduct an in-depth analysis of several popular multilingual generation models from different aspects. We hope our benchmark suite fosters research on multilingual text generation by providing more human-annotated parallel data and more diverse generation scenarios.


1 Introduction

Natural language generation (NLG) is the task of automatically generating meaningful texts, typically using a non-linguistic or textual representation of information as input Covington (2001). These texts generally aim to realize an underlying communicative goal while remaining coherent with the input information and grammatically correct. Multilingual text generation extends the natural language generation task to produce texts in different languages, which is important for overcoming language barriers and enabling universal information access for the world’s citizens Artetxe et al. (2020); Arivazhagan et al. (2019).

However, most of the released text generation datasets are limited to English Rajpurkar et al. (2018); Ritter et al. (2011); Akoury et al. (2020). This limits the ability of researchers to understand many aspects of multilingual text generation related to open vocabularies and complex morphology Kageura (2012). To this end, some works have proposed text generation datasets in non-English languages, such as a multilingual question answering dataset Longpre et al. (2020), a cross-lingual summarization dataset Ladhak et al. (2020), or monolingual text generation datasets covering many language pairs Liang et al. (2020); Gehrmann et al. (2021). However, they either apply to a single task Longpre et al. (2020); Ladhak et al. (2020) or are limited to non-parallel data across languages Liang et al. (2020); Gehrmann et al. (2021), which constrains the possible evaluation scenarios. A benchmark that enables comprehensive evaluation of both parallel and non-parallel multilingual testing scenarios across a diverse range of languages and generation tasks is still missing.

In this paper, we propose MTG, a new benchmark suite for multilingual text generation and evaluation, to address the above problems. MTG is a human-annotated multi-way parallel dataset with three text generation tasks (story generation, question generation, and title generation) across four languages (English, German, French, and Spanish). The multi-way parallel property means every example is fully aligned across all four languages, which provides 12 cross-lingual pairs and 4 monolingual pairs for each task. With this dataset in hand, we evaluate several strong, representative pre-trained models, including multilingual BERT (M-BERT) Devlin et al. (2019), XLM Lample and Conneau (2019), mBART Liu et al. (2020), and mT5 Xue et al. (2020), under six evaluation scenarios covering monolingual training, multilingual training, monolingual-multitask training, multilingual-multitask training, cross-lingual generation, and zero-shot transfer.

In summary, the contributions of this paper are as follows: (i) We release a new text generation benchmark suite, MTG, covering three tasks across four languages, with human-annotated multi-way parallel data for each task. (ii) Based on the multi-way parallel characteristic, we provide six different test scenarios: monolingual, multilingual, monolingual-multitask, multilingual-multitask, cross-lingual training, and zero-shot transfer. We evaluate several representative pre-trained models across these tasks and scenarios and give an extensive analysis of the experimental results from different aspects. We further conduct qualitative experiments to verify the advantages of human-annotated multi-way parallel data. (iii) We propose a new evaluation metric and show that it correlates more strongly with human scores than other automatic metrics.

2 Related Work

2.1 Multilingual Dataset

BENG Moussallem et al. (2020) is a benchmarking platform for natural language generation and knowledge extraction systems, but it is limited to English data. XTREME Hu et al. (2020) is a multilingual understanding benchmark across languages and tasks, but it does not cover any generation task. Jiang et al. (2020) propose X-FACTR, a cross-lingual factual retrieval benchmark. Longpre et al. (2020) propose MKQA, an open-domain question answering evaluation dataset covering diverse languages. Ladhak et al. (2020) present WikiLingua, a large-scale multilingual dataset for cross-lingual abstractive summarization systems. Wiki-40B Guo et al. (2020) is a multilingual language modeling dataset spanning more than 40 languages. Although these datasets cover multiple languages, each targets a single, specific generation task, which limits researchers' ability to draw general findings across a set of tasks. XGLUE Liang et al. (2020) is a cross-lingual benchmark dataset with nine understanding tasks and two generation tasks. GEM Gehrmann et al. (2021) is a recently presented natural language generation benchmark covering a broad range of generation tasks. A marked difference between MTG and these benchmarks is that MTG is parallel across all languages, which enables more testing scenarios.

2.2 Multilingual Modeling

Multilingual pre-trained models can bring better performance on downstream tasks. Multilingual BERT (M-BERT) Devlin et al. (2019) is a single model pre-trained on monolingual corpora in 104 languages with the Masked Language Modeling task. XLM Lample and Conneau (2019) is pre-trained simultaneously with the Masked Language Modeling and Translation Language Modeling tasks, and was later extended to a RoBERTa-style version called XLM-R Conneau et al. (2019). Unicoder Huang et al. (2019) further leverages more cross-lingual pre-training tasks and achieves better results on XNLI than XLM. Multilingual BART Liu et al. (2020) is a pre-trained encoder-decoder model using a denoising auto-encoding objective on monolingual data covering 25 languages. Multilingual T5 (mT5) Xue et al. (2020) is a multilingual variant of T5 leveraging a text-to-text format; it is pre-trained with a span-corruption version of the Masked Language Modeling objective over 101 languages.

3 Dataset Collection and Methodology

In this section, we introduce how we create the benchmarking suite for multilingual text generation (MTG). First, several criteria for selecting the tasks and datasets are described. Then one language is chosen as the starting language, the other languages are selected according to several principles, and the multi-way dataset is constructed by translating from the starting language into the others. Finally, the data annotation process is described in detail to ensure the quality of the dataset.

Task | Corpus | Domain | Format | Goal
Story Generation | ROCStories | Daily life | <story> | Generate the end of the story
Question Generation | SQuAD 1.0 | Wikipedia | <passage, answer, question> | Generate the question for the answer
Title Generation | ByteCup | News | <article, title> | Generate the title of the document
Table 1: The description of tasks and English datasets included in MTG.

3.1 Task and Dataset Selection

There are plenty of generation tasks in monolingual text generation. It is important to select suitable tasks for our MTG benchmark to make it diverse and challenging. Thus, we define several criteria during the task selection:

Task Definition Tasks should be well-defined, which means that humans can easily determine whether the generated results meet the task requirements. Besides, these tasks should have been well-studied in one language and rarely been studied in multilingual scenarios.

Task Difficulty Tasks should be solvable by most college-educated speakers. Beyond that, they should be challenging to current models, the performance of which in various test scenarios falls short of human performance.

Task Diversity Tasks should cover a wide range of relevant generation challenges that allow for findings to be as general as possible.

Input Format The input format of the tasks needs to be as simple as possible in order to reduce the difficulty of data processing. Besides, the input should contain only text and no other modalities (e.g., images or videos).

In order to meet the above criteria, 8 domain experts are asked to vote among 10 typical generation tasks (story generation, commonsense generation, style transfer, question generation, question answering, dialogue generation, title generation, text summarization, image captioning, and data-to-text). Finally, three generation tasks are selected for MTG: story generation, question generation, and title generation. Story generation (SG) aims to generate the ending of a given story context, which requires the model to understand the context and produce a reasonable and fluent ending Guan et al. (2019). Question generation (QG) aims to generate a correct question for a given passage and its answer Duan et al. (2017); for the same passage with different answers, the system should generate different questions. Title generation (TG) converts a given article into a condensed sentence while preserving its main idea Jin and Hauptmann (2002); the title should be faithful to the original document while encouraging users to read the news. These three tasks are very different from each other and focus on different generative abilities.

After determining the tasks, the next step is to choose the dataset for each task. The two selection principles are as follows: (1) License: task data must be available under licenses that allow use and redistribution for research purposes, and the dataset should be freely available for download. (2) Quality: the dataset should be as large as possible and its quality should be checked.

Based on these accessibility and quality concerns, English is selected as the starting language and English datasets for the three tasks are gathered. We choose ROCStories Mostafazadeh et al. (2016) for story generation, SQuAD Rajpurkar et al. (2016) for question generation, and ByteCup (https://www.biendata.xyz/competition/bytecup2018/) for title generation. These datasets are popular in their respective fields, have been verified to be high-quality by many works, and are all available under permissive licenses. A specific example of question generation is shown in Figure 1. An overview of all task datasets is given in Table 1.

Figure 1: An example of question generation task.

3.2 Language Selection

The original datasets are only in English (en), and we want to extend them into a multi-way parallel setting. This means that all English text must be translated into the other languages, which would incur a substantial annotation cost. Thus, a state-of-the-art translator is used to perform the translation, and annotators are then asked to correct the translated text. Accordingly, MTG should contain languages that (1) have good English-to-X translators and (2) are diverse in terms of language family. Finally, German (de), French (fr), and Spanish (es) are chosen. German belongs to the same language branch as English, while French and Spanish belong to a different one. In the future, more distant languages, such as Chinese, will be added to MTG to uncover more findings across diverse languages.

3.3 Data Collection

After determining the tasks and languages, we introduce the data annotation process used to build MTG. First, Google Translate (https://translate.google.com/) is used to translate the English datasets into the selected languages. Then the same translator is used to translate the results back into English. If the n-gram overlap ratio between the original English text and the back-translated one is less than a preset threshold, the example is removed. Different threshold values (from 0.3 to 0.6 with 0.1 as the step length) are tried; if the threshold is set to 0.6, the training data size of QG drops by more than half. Thus we use 0.5 as the threshold to improve the quality of the filtered data while still retaining more than 70% of the original training data. (The detailed sizes of the filtered datasets with respect to different thresholds are included in the appendix.) All four languages are aligned to ensure the dataset is multi-way parallel. Afterward, 87k, 72k, and 280k rough parallel training examples are constructed for SG, QG, and TG respectively.
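A minimal sketch of this filtering step is given below; the whitespace tokenization, the n-gram order, and the function names are illustrative assumptions rather than the exact implementation used to build MTG.

```python
from collections import Counter

def ngram_overlap(original: str, back_translated: str, n: int = 2) -> float:
    """Fraction of the original text's n-grams that also appear in the back-translation.
    Whitespace tokenization and n=2 are assumptions for illustration."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    orig, back = ngrams(original), ngrams(back_translated)
    if not orig:
        return 0.0
    matched = sum(min(count, back[gram]) for gram, count in orig.items())
    return matched / sum(orig.values())

def filter_by_back_translation(examples, threshold: float = 0.5):
    """Keep examples whose (original_en, back_translated_en) overlap meets the threshold."""
    return [ex for ex in examples
            if ngram_overlap(ex["original_en"], ex["back_translated_en"]) >= threshold]
```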

10k samples of each task are randomly chosen from the merged development and test sets as data for annotation. The annotators are required to further check the translated results based on the following rules: (1) Relevance: whether the target text is meaningful and relevant to the source text. (2) Fluency: whether the translated text is grammatically correct. (3) Style: whether the translation conforms to local culture, language conventions, and gender-related word usage.

If the translated text violates any of the above rules, annotators correct it accordingly. The annotated data is then split into 5k/2k/3k subsets for training, validation, and test.

The multi-way parallel generation benchmark MTG is then complete. It contains three generation tasks (SG, QG, TG) in four languages (en, de, fr, es). Each example is fully parallel across the four languages, which means we can pair the input in one language with the output in another language. The statistics of MTG are shown in Table 2.

Task SG, QG, TG
For each language
Rough training size 87k/72k/280k
Annotated training size 5k/5k/5k
Annotated development size 2k/2k/2k
Annotated test size 3k/3k/3k
For four languages (en, de, fr, es)
Total Annotated size 120k
Total dataset size 1.87m
Table 2: The statistics of MTG. MTG consists of four subsets: rough training, annotated training, development, and test sets. The rough training set is filtered by back-translation across the four languages. The annotated training, development, and test sets are corrected by human experts.

3.4 Annotation Process

A team of 10 full-time experts (3 language experts for German, 3 for French, and 4 for Spanish) is hired to do the annotation; they are paid daily. Some part-time workers (16 participated in the German annotation, 39 in French, and 4 in Spanish) are also employed to increase annotation throughput; they are paid by the number of annotations. Each annotator is an expert in at least two languages (English and another target language). They are first trained on how to correct translation errors according to the above rules and then annotate a small number of examples as a test; we recheck these examples and give feedback to help them understand the tasks. After this training process, the annotators start annotating the dataset. For quality control, we sample from the produced annotations and arrange 9 experts to recheck them. Each sampled example is assigned to two other experts, and the data is qualified only if both of them agree that the annotation is correct (the grammar, expressions, and punctuation of the annotated text are completely correct and the expressions conform to the target language). If more than a set proportion of an annotator's daily annotations fail, all of that annotator's data for that day is rechecked.

The annotation process takes multiple days to complete. Full-time experts are paid a fixed daily wage and part-time annotators are paid per example; in both cases, the effective pay exceeds the local minimum hourly wage.

4 Experiments

In this section, we conduct extensive experiments to benchmark the difficulty of our proposed MTG via several state-of-the-art multilingual models under different scenarios.

Human score AdaBoost Bagging DecisionTree ExtraTree GradientBoosting Kneighbors Linear RandomForest SVR
Grammar 0.203 0.235 0.065 0.312 0.319 0.162 0.255 0.248 0.280
Fluency 0.166 0.265 0.087 0.299 0.333 0.288 0.255 0.233 0.254
Logical 0.243 0.136 -0.021 0.190 0.315 0.037 0.184 0.059 0.239
Avg 0.260 0.251 0.110 0.183 0.437 0.176 0.316 0.215 0.299
Table 3: The Pearson correlation scores between different regressors' predictions and human-annotated scores. Avg means the average human-annotated score over Grammar, Fluency, and Logical.

4.1 Evaluation Models

We explore the performance of the following four popular multilingual pre-trained models:

M-BERT Multilingual BERT (M-BERT) Devlin et al. (2019) is a single model pre-trained on monolingual corpora in 104 languages with the Masked Language Modeling (MLM) task. M-BERT uses a shared WordPiece vocabulary across languages and follows the 12-layer BERT-base architecture.

XLM The Cross-Lingual Language Model (XLM) Lample and Conneau (2019) is pre-trained simultaneously with the Masked Language Modeling (MLM) task on monolingual data and the Translation Language Modeling (TLM) task on parallel data. XLM uses a shared vocabulary of byte-pair encoded (BPE) subwords Sennrich et al. (2016) learned jointly across languages.

mBART Multilingual BART (mBART) Liu et al. (2020) is a pre-trained encoder-decoder model using a denoising auto-encoding objective on monolingual data covering 25 languages. mBART uses a shared SentencePiece vocabulary of about 250k tokens and consists of a 12-layer encoder and a 12-layer decoder.

mT5 Multilingual T5 (mT5) Xue et al. (2020) is a multilingual variant of T5 Raffel et al. (2020) leveraging a text-to-text format. mT5 is pre-trained with a span-corruption version of the Masked Language Modeling objective over 101 languages and keeps the T5 encoder-decoder architecture.

4.2 Evaluation Scenarios

A salient feature of MTG is that it is multi-way parallel across all languages. Thus, we conduct experiments in various scenarios to demonstrate the wide usability of MTG.

Monolingual fine-tuning The pre-trained model is trained on a downstream task using the training data of a specific language and evaluated on the test set of the same language. The input and output of the downstream task are in the same language.

Monolingual Multitask fine-tuning A single model is trained for all tasks, but it remains language-specific as in monolingual fine-tuning.

Multilingual fine-tuning The pre-trained model is jointly finetuned with data in all languages for a specific task. Different from the monolingual fine-tuning setting, there is only one model for each downstream task, which can serve all languages.

Multilingual Multitask fine-tuning Rather than combining only the training data of all languages as in multilingual fine-tuning, data from all languages and all tasks are gathered to train a single model for all languages and tasks.

Cross-lingual generation Since MTG is multi-way parallel, it can be reorganized to create input-output pairs belonging to different languages. For example, in title generation, a sample can consist of a source document in English and a target title in German. For a multilingual dataset with n languages, n × (n − 1) directed input-output pairs can be constructed, since the direction matters; a sketch of this construction is given after the scenario list. The cross-lingual generation performance in all directions is evaluated.

Zero-shot transfer We also try to explore the zero-shot generation on the three tasks. The model is finetuned on a specific task with English input and output. Then it is used to generate output in other languages with a given language tag.
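Because every MTG example is aligned across the four languages, the 12 directed cross-lingual pairs can be enumerated mechanically from the parallel data. The sketch below illustrates this construction for the cross-lingual scenario; the dictionary layout is an assumed representation, not the released data format.

```python
from itertools import permutations

LANGS = ["en", "de", "fr", "es"]

def crosslingual_pairs(example: dict):
    """`example` maps a language code to its (source, target) tuple, e.g.
    {"en": (doc_en, title_en), "de": (doc_de, title_de), ...}.
    Yields the 4 x 3 = 12 directed cross-lingual pairs: source in lang1, target in lang2."""
    for src_lang, tgt_lang in permutations(LANGS, 2):
        source, _ = example[src_lang]
        _, target = example[tgt_lang]
        yield {"src_lang": src_lang, "tgt_lang": tgt_lang,
               "source": source, "target": target}
```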

Figure 2: The Pearson correlation scores between automatic metric scores and human-annotated scores. Avg denotes the average of the Grammar, Fluency, and Logical scores.
Figure 3: Ensemble metric scores for the four models under four different settings, with one panel per task: (a) SG, (b) QG, (c) TG. Here Mono and Multi represent monolingual and multilingual training respectively, and *-M denotes the corresponding multitask training counterparts.

4.3 Evaluation Metrics

The quality of the generated texts is evaluated from different aspects to fully understand model performance. Moreover, we propose a new ensemble metric that correlates better with human annotation scores.

N-gram based Metrics N-gram-based metrics evaluate the text-overlap score between the outputs and references. The following three metrics are used: (1) BLEU Papineni et al. (2002) is a popular metric that calculates word-overlap scores between the generated texts and the gold-standard ones; we use BLEU-4, the average score over unigrams, bigrams, trigrams, and 4-grams. (2) ROUGE Lin (2004) is a recall-oriented metric that counts the number of overlapping units such as n-grams, word sequences, and word pairs between the produced texts and the gold-standard ones; ROUGE-L measures the overlap of the longest common subsequence between the generated results and the references. (3) METEOR Banerjee and Lavie (2005) scores hypotheses against references using unigram matching that also accounts for stems and synonyms.
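The n-gram metrics can be computed with standard open-source packages; the snippet below assumes the sacrebleu and rouge_score libraries, which are common implementations and not necessarily the exact scoring scripts used for the reported numbers.

```python
import sacrebleu
from rouge_score import rouge_scorer

hypotheses = ["the model generated this title"]          # system outputs
references = ["the reference title for this article"]    # gold-standard texts

# Corpus-level BLEU (sacrebleu averages 1- to 4-gram precisions by default).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])

# Sentence-level ROUGE-L based on the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(references[0], hypotheses[0])["rougeL"].fmeasure

print(bleu.score, rouge_l)
```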

Embedding based Metrics The embedding-based metrics can, to a large extent, capture the semantic-level similarity between the generated texts and the ground truth. (1) BERTScore Zhang et al. (2019) computes the token similarity of candidates and references as a sum of cosine similarities between tokens, using pre-trained BERT contextual embeddings. (2) BLEURT Sellam et al. (2020) is a metric learned from a diverse set of lexical- and semantic-level supervision on top of the BERT-base architecture. However, it currently only supports English.

Diversity Metrics We also employ the distinct metric Li et al. (2016), which calculates the proportion of distinct n-grams among all the system hypotheses and can be used to evaluate the diversity of the generated texts. In this paper, we report distinct-1 for unigram diversity.
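Distinct-n itself is simple enough to compute directly; a self-contained sketch over whitespace tokens (an assumption) follows.

```python
def distinct_n(hypotheses, n: int = 1) -> float:
    """Ratio of unique n-grams to total n-grams across all system hypotheses."""
    total, unique = 0, set()
    for text in hypotheses:
        tokens = text.split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

print(distinct_n(["a red apple", "a green apple"], n=1))  # 4 unique / 6 total ≈ 0.67
```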

Human Evaluation Human evaluation is also used to better understand the quality of the model outputs. Specifically, a set of cases is randomly sampled from the test set of each task and language and presented to human annotators together with the model outputs. The annotators evaluate each model output on three aspects, Grammar, Fluency, and Relevance, giving a score from 1 to 5 for each aspect. The detailed annotation rules are in the appendix.

Ensemble Metric For the samples with human annotations, the automatic metric scores (except BLEURT, which only supports English) are gathered as features, and the human-annotated scores serve as targets. These data are split into training, development, and test sets. After comparing the performance of different regression models, shown in Table 3, we choose the gradient boosting regression model as our ensemble metric.

The correlations between the automatic metrics and human evaluation scores on this test set are displayed in Figure 2. Our ensemble metric outperforms the other three metrics (a comparison with all other automatic metrics is included in the appendix; our ensemble metric still achieves the best performance).
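A minimal sketch of how such an ensemble metric can be fitted and evaluated, assuming scikit-learn and SciPy; the feature layout and the placeholder data are illustrative, not the actual annotated samples.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import GradientBoostingRegressor

# X: per-sample automatic metric scores (e.g. BLEU, ROUGE-L, METEOR, BERTScore, Distinct-1);
# y: the corresponding (average) human score. Random placeholders stand in for real data.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 5)), rng.uniform(1, 5, 200)
X_test, y_test = rng.random((50, 5)), rng.uniform(1, 5, 50)

ensemble_metric = GradientBoostingRegressor().fit(X_train, y_train)
predicted = ensemble_metric.predict(X_test)

# Pearson correlation between the ensemble metric and human judgments on the held-out split.
corr, _ = pearsonr(predicted, y_test)
print(round(corr, 3))
```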

Figure 4: The cross-lingual ensemble metric results for the four models (M-BERT, XLM, mBART, and mT5) on the three tasks (SG, QG, and TG), with one panel per task-model combination (panels (a)-(l)). Here, every cell in row lang1 and column lang2 means the model is trained on lang1 and tested on lang2. Deeper gray represents better cross-lingual performance while deeper red indicates worse performance.

4.4 Results

We finetune the state-of-the-art multilingual pre-trained models on the three tasks in MTG and evaluate their performance under the different scenarios. Due to limited space, we report part of the results in the main body; the full results and experimental settings can be found in the appendix.

Monolingual and Multilingual

In most cases, multilingual training improves model performance. As shown in Figure 3, for M-BERT, mBART, and mT5, training with data from all languages improves performance on all tasks compared with monolingual training. In the multitask setting, multilingual training also brings performance improvements. Compared with the other models, multilingual training brings larger gains for mT5, because mT5 has more parameters and benefits more as the training data size increases.

Multitask and Single-task

Besides multilingual training, we also explore the influence of multitask training by training with the data of all tasks. As shown in Figure 3, compared with single-task training on monolingual data, the multitask counterparts tend to boost performance. For multilingual data, however, adding the multitask training objective can sometimes cause a performance decline, especially for story generation. This is because the references for story generation are less deterministic than those for title generation and question generation, and this difference between story generation and the other two tasks causes the multitask training performance to drop.

Setting Model BLEU R-1 R-2 R-L METEOR BERTScore D-1 D-2 Ensemble
SG M-BERT 0.0 3.2 0.0 3.2 0.048 0.703 0.94 0.98 2.746
XLM 0.5 9.6 1.1 9.1 0.068 0.694 0.89 0.92 3.074
mBART 0.1 4.0 0.0 4.0 0.038 0.687 0.90 0.98 2.458
mT5 0.1 3.3 0.0 3.3 0.044 0.705 0.98 1.00 2.768
QG M-BERT 0.6 3.2 0.4 3.1 0.059 0.731 0.92 0.99 2.636
XLM 2.9 20.7 4.1 19.6 0.126 0.738 0.94 0.98 3.297
mBART 1.0 5.5 1.0 5.3 0.071 0.745 0.98 1.00 2.820
mT5 1.3 5.0 1.0 4.9 0.075 0.748 0.97 1.00 2.838
TG M-BERT 2.0 7.5 2.0 7.2 0.106 0.658 0.91 0.95 3.007
XLM 3.4 14.9 5.2 13.7 0.133 0.688 0.88 0.93 3.150
mBART 4.5 13.9 5.1 13.3 0.145 0.711 0.97 1.00 3.222
mT5 4.4 11.8 4.3 11.2 0.128 0.686 0.97 0.99 3.135
Table 4: Zero-shot results for the four models on the three tasks. The results are average scores on the en->de, en->fr, and en->es test sets. Here R-* and D-* represent ROUGE-* and Distinct-* respectively.
Setting Model Gram. Flu. Rel.
Mono M-BERT 3.89 3.93 3.55
XLM 4.37 4.40 4.13
mBART 4.02 4.03 3.83
mT5 4.18 4.17 4.05
Multi M-BERT 3.89 3.78 3.70
XLM 4.03 3.99 3.84
mBART 3.96 3.72 3.58
mT5 3.84 3.74 3.72
Cross M-BERT 3.82 3.90 3.94
XLM 3.64 3.64 3.69
mBART 3.68 3.64 3.70
mT5 3.72 3.58 3.81
Zero M-BERT 2.47 1.76 1.67
XLM 3.16 2.69 2.72
mBART 3.31 3.00 3.18
mT5 3.41 3.22 3.20
Table 5: Human evaluation on the TG task under four evaluation settings. Scores range from 1 to 5, with 5 being the best, and are averaged over the four languages. ‘Gram.’, ‘Flu.’, ‘Rel.’ indicate Grammar, Fluency, and Relevance respectively. The bolded results are the best among the four models. The full results are included in the appendix.

Cross-lingual

In this paper, we make use of the multi-way parallel data to perform supervised cross-lingual training. For example, for English-centric cross-lingual training, we take the English source as the input and the parallel German, French, and Spanish targets as the output, and evaluate the model in the same settings (en->de, en->es, en->fr). Figure 4 contains the cross-lingual results centered on each of the four languages together with the monolingual results. Compared with the other three models, XLM performs best in most cases (the cells for XLM are closer to gray), because XLM uses parallel data for TLM during the pre-training phase, which gives it cross-lingual capability. Besides, mT5 outperforms mBART under the cross-lingual setting even though both use only monolingual corpora for pre-training; the reason is that mT5's pre-training covers 101 languages compared with 25 for mBART.

On the other hand, as displayed in Figure 4, it is much easier for models to transfer to English than to German (nearly all columns for English are gray while most cells in the columns for German are red), because the word order of German differs more from that of the other three languages.

Zero-shot

We also explore zero-shot transfer by training only with English input and output, and then generating output in other languages directly given an English input and a language tag. Different from the cross-lingual setting, the model does not see en->x (x=de, es, fr) data during training. Table 4 indicates that XLM still outperforms the other models by a large margin on the SG and QG tasks. However, mT5 fails because it has no language tags during pre-training: since we only use en->en data to finetune the models on the downstream tasks, mT5 has never seen the language tags of the target languages and cannot generate the corresponding language.

Human evaluation

Table 5 presents the human evaluation scores for TG in four evaluation scenarios. From the table, we find that XLM and mT5 usually obtain higher scores than M-BERT and mBART, because they are pre-trained on more languages with better pre-training tasks, which enables them to better fuse the semantic spaces of different languages.

Takeaways

Based on the analysis above, we can draw several conclusions: (1) Multilingual training can boost model performance in most cases. (2) German is harder for cross-lingual models to generate than the other three languages. (3) XLM has the best cross-lingual performance across all three tasks, and it also maintains superior performance in the zero-shot setting.

4.5 Discussion

Pseudo vs. Annotated

To answer the question of whether the 5k annotated training examples help the model generate better text, we use the rough training data filtered by back-translation for a first finetuning stage and the annotated training data for a second stage. We conduct this two-step finetuning ablation on QG under all evaluation scenarios with mBART and present the results in Figure 5. To compare the performance of the two stages, we also run a t-test, which shows that the improvement from the annotated training data is significant in nearly all settings (except for the multilingual-multitask setting); the average scores and p-values are reported in the appendix. This illustrates that although the amount of annotated data is small, it can further improve performance, and it highlights the necessity of human-annotated multilingual data compared with pseudo-parallel data produced by machine translation.

Figure 5: The performance of models at the two fine-tuning stages under various settings. Here stage 1 represents models trained only on the rough training data, while stage 2 represents models further trained on the human-annotated training data starting from the stage-1 models.

Translation vs. Cross-lingual

Different from cross-lingual generation, one can also generate output in the same language as the input and then translate it into another language. Thus, we compare cross-lingual generation with a generation-translation baseline, which uses the English generation model to produce English output and then translates it into German, French, and Spanish. The results are plotted in Figure 6. As the figure shows, the cross-lingual generation model outperforms the generation-translation baseline in almost all tasks and languages. This means that supervised cross-lingual generation is the better solution when the source and target are in different languages. Our multi-way parallel multilingual benchmark provides plenty of cross-lingual data in different directions, which will encourage research on cross-lingual generation.

Figure 6: The cross-lingual performance of translation-based models and of models directly trained on our cross-lingual data.

5 Conclusion

In this paper, we propose MTG, a multilingual benchmark for text generation. It contains three typical generation tasks: story, question, and title generation. The key feature of MTG is its multi-way parallel data across four languages: English, German, French, and Spanish. This enables more evaluation scenarios, such as cross-lingual training and zero-shot transfer. We also benchmark state-of-the-art multilingual pre-trained models on MTG with different metrics to explore its features and challenges and to promote research and progress in multilingual text generation.

6 Ethics Consideration

Since we propose a new multilingual text generation benchmark, MTG, we address some possible ethical considerations in this section.

English dataset

We choose ROCStories, SQuAD 1.0, and ByteCup as the English datasets for the story, question, and title generation tasks. All of them are available for research use under their licenses and can be downloaded freely from their websites (ROCStories requires providing some contact information). We ensure that these datasets are only used for academic research and that the dataset construction process is consistent with the intellectual property and privacy rights of the original authors.

Annotation process

As described in Section 3.4, we hire some full-time and part-time language experts to do the annotation and all of them are paid fairly. Their salary is higher than the local minimum average hourly wage. They are voluntary participants and are aware of any risks of harm associated with their participation. The annotation process is consistent with the intellectual property and privacy rights of the recruited annotators as well.

References

  • N. Akoury, S. Wang, J. Whiting, S. Hood, N. Peng, and M. Iyyer (2020) STORIUM: a dataset and evaluation platform for machine-in-the-loop story generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6470–6484. Cited by: §1.
  • N. Arivazhagan, A. Bapna, O. Firat, D. Lepikhin, M. Johnson, M. Krikun, M. X. Chen, Y. Cao, G. Foster, C. Cherry, et al. (2019) Massively multilingual neural machine translation in the wild: findings and challenges. arXiv preprint arXiv:1907.05019. Cited by: §1.
  • M. Artetxe, S. Ruder, and D. Yogatama (2020) On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Cited by: §1.
  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §4.3.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116. Cited by: §2.2.
  • M. A. Covington (2001) Building natural language generation systems. Language 77 (3), pp. 611–612. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §2.2, §4.1.
  • N. Duan, D. Tang, P. Chen, and M. Zhou (2017) Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 866–874. Cited by: §3.1.
  • S. Gehrmann, T. Adewumi, K. Aggarwal, P. S. Ammanamanchi, A. Anuoluwapo, A. Bosselut, K. R. Chandu, M. Clinciu, D. Das, K. D. Dhole, et al. (2021) The gem benchmark: natural language generation, its evaluation and metrics. arXiv preprint arXiv:2102.01672. Cited by: §1, §2.1.
  • J. Guan, Y. Wang, and M. Huang (2019) Story ending generation with incremental encoding and commonsense knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6473–6480. Cited by: §3.1.
  • M. Guo, Z. Dai, D. Vrandečić, and R. Al-Rfou (2020) Wiki-40b: multilingual language model dataset. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 2440–2452. Cited by: §2.1.
  • J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson (2020) XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pp. 4411–4421. Cited by: §2.1.
  • H. Huang, Y. Liang, N. Duan, M. Gong, L. Shou, D. Jiang, and M. Zhou (2019) Unicoder: a universal language encoder by pre-training with multiple cross-lingual tasks. arXiv preprint arXiv:1909.00964. Cited by: §2.2.
  • Z. Jiang, A. Anastasopoulos, J. Araki, H. Ding, and G. Neubig (2020) X-factr: multilingual factual knowledge retrieval from pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5943–5959. Cited by: §2.1.
  • R. Jin and A. G. Hauptmann (2002) A new probabilistic model for title generation. In COLING 2002: The 19th International Conference on Computational Linguistics, Cited by: §3.1.
  • K. Kageura (2012) The quantitative analysis of the dynamics and structure of terminologies. Vol. 15, John Benjamins Publishing. Cited by: §1.
  • F. Ladhak, E. Durmus, C. Cardie, and K. McKeown (2020) WikiLingua: a new benchmark dataset for cross-lingual abstractive summarization. arXiv preprint arXiv:2010.03093. Cited by: §1, §2.1.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: §1, §2.2, §4.1.
  • J. Li, M. Galley, C. Brockett, J. Gao, and W. B. Dolan (2016) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119. Cited by: §4.3.
  • Y. Liang, N. Duan, Y. Gong, N. Wu, F. Guo, W. Qi, M. Gong, L. Shou, D. Jiang, G. Cao, et al. (2020) XGLUE: a new benchmark dataset for cross-lingual pre-training, understanding and generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6008–6018. Cited by: §1, §2.1.
  • C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §4.3.
  • Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, pp. 726–742. Cited by: §1, §2.2, §4.1.
  • S. Longpre, Y. Lu, and J. Daiber (2020) MKQA: a linguistically diverse benchmark for multilingual open domain question answering. arXiv preprint arXiv:2007.15207. Cited by: §1, §2.1.
  • N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016) A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 839–849. Cited by: §3.1.
  • D. Moussallem, P. Kaur, T. C. Ferreira, C. van der Lee, A. Shimorina, F. Conrads, M. Röder, R. Speck, C. Gardent, S. Mille, et al. (2020) A general benchmarking framework for text generation. In Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)., pp. 27–33. Cited by: §2.1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318. Cited by: §4.3.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. Cited by: §4.1.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784–789. Cited by: §1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392. Cited by: §3.1.
  • A. Ritter, C. Cherry, and W. B. Dolan (2011) Data-driven response generation in social media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 583–593. Cited by: §1.
  • T. Sellam, D. Das, and A. P. Parikh (2020) BLEURT: learning robust metrics for text generation. arXiv preprint arXiv:2004.04696. Cited by: §4.3.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725. Cited by: §4.1.
  • L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel (2020) MT5: a massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934. Cited by: §1, §4.1.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2019) Bertscore: evaluating text generation with bert. arXiv preprint arXiv:1904.09675. Cited by: §4.3.

Appendix A Experimental settings

We use the encoder-decoder architecture for all generation tasks. Among the models described above, mBART and mT5 have been pre-trained for generation, while M-BERT and XLM-R are only pre-trained for encoder representations; therefore, we initialize the decoder with the encoder parameters for M-BERT and XLM-R. During the pre-training phase, M-BERT and mT5 have no language tags, so we manually add a language tag at the beginning of the source and target for M-BERT and add the target-language tag to the beginning of the source for mT5.

We adjust the input format for each task. For QG, we append the answer to the passage and insert a special token to separate them. For SG, we take the first four sentences as the source and the last sentence as the target.
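A rough sketch of this preprocessing, covering both the per-task input formats and the language tags described above; the separator and tag tokens are illustrative placeholders rather than the exact special tokens of each tokenizer.

```python
SEP = "<sep>"  # placeholder separator token

def build_qg_input(passage: str, answer: str) -> str:
    # Question generation: append the answer to the passage, separated by a special token.
    return f"{passage} {SEP} {answer}"

def build_sg_pair(story_sentences):
    # Story generation: first four sentences as the source, the last sentence as the target.
    return " ".join(story_sentences[:4]), story_sentences[4]

def add_language_tags(source, target, src_lang, tgt_lang, model):
    # M-BERT: a language tag at the beginning of both source and target
    # (source-side tag on the source, target-side tag on the target is an assumption);
    # mT5: only the target-language tag, prepended to the source.
    if model == "m-bert":
        return f"<{src_lang}> {source}", f"<{tgt_lang}> {target}"
    if model == "mt5":
        return f"<{tgt_lang}> {source}", target
    return source, target
```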

We adopt two-step finetuning to make full use of the MTG benchmark: we first train on the large rough parallel training data for the downstream tasks for 20 epochs, and then finetune on the small annotated training data for 10 epochs to further improve generation quality. We evaluate the model every 2000 steps and use the loss on the development set to choose the best model. The batch size is 32, and the learning rate and optimizer parameters are set to the default values for each model.

Threshold QG TG SG
0 82306 393792 88161
0.3 80836 355034 88158
0.4 79390 333461 88077
0.5 71819 280376 87003
0.6 32261 144109 75892
Table 6: The sizes of the datasets filtered by back-translation with respect to different thresholds.

Appendix B Human evaluation

We use Grammar, Fluency, and Relevance to evaluate the generation performance. Each score ranges from 1 to 5. The criteria for each level are listed below.

Grammar

  1. Not the target language at all

  2. Is the target language, but has too many grammatical errors to convey the meaning

  3. Barely conveys the meaning of the sentence, but has many errors

  4. Grammatically correct overall, with a few errors

  5. Perfect sentence

Fluency

  1. Completely incomplete, with loose words

  2. Basically incomplete, with some phrases

  3. Barely formed sentences, but not fluent

  4. Basically complete, with a few flaws

  5. Perfect sentence

Relevance

  1. Not related at all

  2. A few words are relevant (e.g., character, place, time, etc.), but the overall narrative is not relevant

  3. Relevant, but not logical

  4. Basically reasonable, with a few irrelevant descriptions

  5. Very reasonable

Appendix C Back Translation Threshold Testing

The detailed sizes of the back-translation-filtered datasets for the different tasks are presented in Table 6.

Appendix D Automated Metric Performances

The full comparison between the ensemble metric and the other automatic metrics is displayed in Figure 7. The ensemble metric outperforms all other metrics.

Figure 7: The Pearson correlation scores between automatic metric scores and human-annotated scores. Avg denotes the average of the Grammar, Fluency, and Logical scores.

Appendix E Significant Test Results

The average ensemble metric scores for stage 1 and stage 2 in question generation and the corresponding significance-test p-values are displayed in Table 7. As shown, adding human-annotated training data consistently improves model performance under the different settings, and the improvements are significant in all settings except the multilingual-multitask setting.

Mono Multi Mono-M Multi-M Cross Zero
stage1 3.375 3.375 3.392 3.784 3.352 2.397
stage2 3.754 3.761 3.754 3.790 3.447 2.820
p-value 0.000 0.000 0.000 0.234 0.000 0.000
Table 7: The average ensemble metric scores for stage 1 and stage 2 on the question generation task and the corresponding significance-test p-values. Here stage 1 represents models trained only on the rough training data, while stage 2 represents models further trained on the human-annotated training data starting from the stage-1 models.

Appendix F Human evaluation scores

The human evaluation scores for the different models and tasks are included in Tables 8, 9, and 10.

Setting Model Gram. Flu. Rel.
Mono M-BERT 4.30 4.47 4.06
XLM 4.53 4.56 4.00
mBART 4.65 4.69 4.14
mT5 4.51 4.63 4.09
Multi M-BERT 4.31 4.38 4.22
XLM 4.20 4.39 4.43
mBART 4.22 4.34 4.31
mT5 4.12 4.13 4.15
Mono-M M-BERT 4.23 4.37 3.76
XLM 4.16 4.32 3.60
mBART 4.13 4.38 3.51
mT5 4.18 4.44 3.67
Multi-M M-BERT 4.32 4.42 3.78
XLM 4.31 4.41 3.78
mBART 4.18 4.38 3.59
mT5 4.20 4.48 3.79
Cross M-BERT 4.38 4.56 4.28
XLM 4.09 4.19 4.19
mBART 3.97 4.06 3.97
mT5 4.26 4.28 4.12
Zero M-BERT 2.07 2.10 2.12
XLM 2.04 1.98 1.98
mBART 2.07 2.04 2.04
mT5 2.19 2.21 2.22
Table 8: Human evaluation on the QG task under six evaluation settings. Scores range from 1 to 5, with 5 being the best, and are averaged over the four languages. ‘Gram.’, ‘Flu.’, ‘Rel.’ indicate Grammar, Fluency, and Relevance respectively.
Setting Model Gram. Flu. Rel.
Mono M-BERT 3.89 3.93 3.55
XLM 4.37 4.40 4.13
mBART 4.02 4.03 3.83
mT5 4.18 4.17 4.05
Multi M-BERT 3.89 3.78 3.70
XLM 4.03 3.99 3.84
mBART 3.96 3.72 3.58
mT5 3.84 3.74 3.72
Mono-M M-BERT 4.09 4.26 3.43
XLM 4.12 4.35 3.38
mBART 4.03 4.19 3.30
mT5 4.21 4.35 3.54
Multi-M M-BERT 4.28 4.33 3.47
XLM 4.21 4.34 3.48
mBART 4.14 4.29 3.45
mT5 4.16 4.43 3.57
Cross M-BERT 3.82 3.90 3.94
XLM 3.64 3.64 3.69
mBART 3.68 3.64 3.70
mT5 3.72 3.58 3.81
Zero M-BERT 2.47 1.76 1.67
XLM 3.16 2.69 2.72
mBART 3.31 3.00 3.18
mT5 3.41 3.22 3.20
Table 9: Human evaluation on the TG task under six evaluation settings. Scores range from 1 to 5, with 5 being the best, and are averaged over the four languages. ‘Gram.’, ‘Flu.’, ‘Rel.’ indicate Grammar, Fluency, and Relevance respectively.
Setting Model Gram. Flu. Rel.
Mono M-BERT 4.52 4.55 4.14
XLM 4.56 4.53 4.08
mBART 4.39 4.59 3.94
mT5 4.40 4.73 4.22
Multi M-BERT 3.70 3.62 3.91
XLM 4.00 4.09 4.16
mBART 4.04 4.04 4.08
mT5 4.07 4.06 4.03
Mono-M M-BERT 4.52 4.63 3.81
XLM 4.48 4.63 3.71
mBART 4.53 4.63 3.62
mT5 4.50 4.59 3.73
Multi-M M-BERT 4.47 4.58 3.90
XLM 4.53 4.63 3.80
mBART 4.43 4.66 3.61
mT5 4.53 4.63 3.67
Cross M-BERT 3.76 3.67 3.47
XLM 3.91 3.79 3.59
mBART 3.94 3.86 3.50
mT5 3.88 3.81 3.61
Zero M-BERT 1.36 1.28 1.23
XLM 1.47 1.47 1.36
mBART 1.16 1.12 1.07
mT5 1.34 1.32 1.26
Table 10: Human evaluation on the SG task under six evaluation settings. Scores range from 1 to 5, with 5 being the best, and are averaged over the four languages. ‘Gram.’, ‘Flu.’, ‘Rel.’ indicate Grammar, Fluency, and Relevance respectively.

Appendix G Experimental Results

We present the detailed experimental results of our four baseline models under the different evaluation settings here.

Task Model Language N-gram-based Embedding-based Others Ours
BLEU ROUGE-L METEOR BERTScore BLEURT Distinct-1 Ensemble
SG M-BERT en->en 2.44 18.1 0.105 0.885 -0.771 0.93 3.68
de->de 1.39 11.8 0.124 0.704 - 0.918 3.172
fr->fr 1.14 12 0.205 0.711 - 0.935 3.259
es->es 1.26 12.4 0.106 0.701 - 0.949 3.16
XLM en->en 3.17 20.4 0.092 0.879 -0.747 0.96 3.768
de->de 2.5 15.8 0.101 0.712 - 0.962 3.325
fr->fr 3.4 18.3 0.167 0.727 - 0.952 3.394
es->es 3.13 16.3 0.123 0.725 - 0.96 3.341
mBART en->en 4.09 21.4 0.112 0.894 -0.683 0.98 3.755
de->de 3.63 15.3 0.135 0.733 - 0.989 3.336
fr->fr 4.07 17 0.194 0.75 - 0.985 3.403
es->es 4.03 15.4 0.136 0.743 - 0.987 3.31
mT5 en->en 1.75 16.6 0.09 0.886 -0.807 0.98 3.656
de->de 1.39 11.4 0.116 0.719 - 0.955 3.204
fr->fr 2.34 12.6 0.189 0.737 - 0.963 3.312
es->es 2.25 12.5 0.117 0.729 - 0.95 3.177
QG M-BERT en->en 12.87 38.6 0.189 0.891 -0.45 0.918 3.939
de->de 4.8 19.6 0.18 0.727 - 0.921 3.285
fr->fr 5.02 18.5 0.252 0.737 - 0.935 3.392
es->es 6.28 21.3 0.286 0.791 - 0.939 3.571
XLM en->en 19.24 44 0.226 0.911 -0.256 0.959 4.036
de->de 10.19 35.5 0.236 0.771 - 0.97 3.632
fr->fr 16.47 39.9 0.351 0.794 - 0.944 3.733
es->es 19.2 47 0.376 0.829 - 0.948 3.865
mBART en->en 19.09 44.7 0.216 0.912 -0.27 0.971 4.037
de->de 11.1 27.6 0.233 0.778 - 0.986 3.522
fr->fr 14.81 32.5 0.327 0.797 - 0.982 3.672
es->es 16.59 34.6 0.368 0.831 - 0.972 3.784
mT5 en->en 17.55 41.6 0.207 0.908 -0.382 0.967 4.01
de->de 9.93 23.3 0.216 0.76 - 0.979 3.447
fr->fr 13.03 25.3 0.312 0.762 - 0.964 3.544
es->es 15.93 28.5 0.354 0.817 - 0.956 3.702
TG M-BERT en->en 9.65 27.5 0.167 0.838 -0.743 0.914 3.72
de->de 3.34 11.6 0.132 0.678 - 0.887 3.067
fr->fr 5.27 18.2 0.217 0.709 - 0.9 3.307
es->es 6.75 21.8 0.227 0.718 - 0.871 3.387
XLM en->en 12.61 31.6 0.186 0.879 -0.581 0.96 3.855
de->de 6.24 19.8 0.177 0.71 - 0.942 3.348
fr->fr 9.28 25.8 0.27 0.739 - 0.907 3.52
es->es 9.6 27.2 0.263 0.747 - 0.897 3.557
mBART en->en 22.15 42.4 0.233 0.901 -0.362 0.984 4.007
de->de 9.48 22.9 0.203 0.734 - 0.977 3.41
fr->fr 12.29 29.4 0.282 0.759 - 0.949 3.551
es->es 13.8 32.1 0.315 0.77 - 0.939 3.62
mT5 en->en 15.02 34.7 0.193 0.881 -0.505 0.975 3.887
de->de 7.35 17.4 0.174 0.695 - 0.958 3.232
fr->fr 10.28 24.8 0.256 0.729 - 0.933 3.45
es->es 12.76 28.7 0.3 0.744 - 0.92 3.515
Table 11: The full results under the monolingual evaluation scenario.
Task Model Language N-gram-based Embedding-based Others Ours
BLEU ROUGE-L METEOR BERTScore BLEURT Distinct-1 Ensemble
SG M-BERT en->en 2.61 18.1 0.098 0.888 -0.773 0.971 3.702
de->de 1.58 12 0.111 0.721 - 0.978 3.242
fr->fr 1.01 12 0.188 0.714 - 0.971 3.278
es->es 1.63 11.7 0.099 0.707 - 0.972 3.229
XLM en->en 2.14 18.3 0.083 0.874 -0.808 0.964 3.734
de->de 2.19 15.3 0.101 0.71 - 0.966 3.308
fr->fr 2.85 17.4 0.168 0.725 - 0.957 3.396
es->es 2.7 14.9 0.118 0.721 - 0.97 3.322
mBART en->en 4.45 21.9 0.116 0.895 -0.668 0.981 3.764
de->de 3.62 15.4 0.138 0.734 - 0.988 3.337
fr->fr 4.31 16.5 0.202 0.749 - 0.988 3.407
es->es 4.33 15.8 0.141 0.743 - 0.986 3.306
mT5 en->en 3.77 20.5 0.108 0.894 -0.7 0.972 3.739
de->de 3.12 14.6 0.134 0.732 - 0.983 3.327
fr->fr 3.61 16.5 0.188 0.748 - 0.965 3.376
es->es 3.56 15.2 0.135 0.744 - 0.972 3.26
QG M-BERT en->en 14.8 39.2 0.194 0.891 -0.468 0.918 3.929
de->de 6.09 21.1 0.188 0.733 - 0.953 3.317
fr->fr 6.34 21.6 0.262 0.741 - 0.926 3.419
es->es 7.36 21.9 0.298 0.789 - 0.931 3.566
XLM en->en 18.99 44.4 0.222 0.91 -0.288 0.955 4.036
de->de 9.81 35.8 0.238 0.77 - 0.972 3.623
fr->fr 16.19 40.3 0.351 0.797 - 0.946 3.737
es->es 19 47.3 0.375 0.83 - 0.942 3.865
mBART en->en 19.7 45.7 0.223 0.914 -0.236 0.974 4.041
de->de 11.92 29 0.245 0.782 - 0.986 3.557
fr->fr 14.79 32.6 0.327 0.793 - 0.978 3.66
es->es 16.42 34.5 0.366 0.832 - 0.972 3.784
mT5 en->en 19.87 45.9 0.227 0.914 -0.241 0.974 4.051
de->de 12.7 29.6 0.259 0.787 - 0.985 3.589
fr->fr 16.19 32.7 0.343 0.799 - 0.97 3.682
es->es 18.77 36.8 0.39 0.839 - 0.97 3.829
TG M-BERT en->en 9.88 28.1 0.164 0.848 -0.736 0.931 3.727
de->de 3.83 12.1 0.137 0.681 - 0.909 3.088
fr->fr 5.43 18.6 0.207 0.709 - 0.913 3.308
es->es 6.47 21.5 0.222 0.719 - 0.895 3.378
XLM en->en 12.05 32.1 0.19 0.88 -0.579 0.956 3.874
de->de 5.99 20.2 0.184 0.711 - 0.935 3.347
fr->fr 9.48 26.6 0.279 0.741 - 0.898 3.521
es->es 10.98 29.4 0.284 0.755 - 0.889 3.587
mBART en->en 21.36 42 0.231 0.901 -0.365 0.985 4.01
de->de 9.42 22.7 0.202 0.734 - 0.973 3.408
fr->fr 11.87 29.1 0.276 0.758 - 0.949 3.551
es->es 13.75 32.1 0.317 0.771 - 0.935 3.621
mT5 en->en 16.98 37.1 0.214 0.883 -0.448 0.976 3.924
de->de 8.21 18.8 0.193 0.697 - 0.954 3.243
fr->fr 11.31 26.3 0.276 0.733 - 0.934 3.476
es->es 12.76 28.7 0.3 0.744 - 0.92 3.515
Table 12: The full results under the multilingual evaluation scenario.
Task Model Language N-gram-based Embedding-based Others Ours
BLEU ROUGE-L METEOR BERTScore BLEURT Distinct-1 Ensemble
SG M-BERT en->en 2.73 19.4 0.102 0.886 -0.812 0.92 3.69
de->de 2.46 13.3 0.13 0.718 - 0.946 3.226
fr->fr 1.53 13.6 0.196 0.721 - 0.935 3.28
es->es 1.86 13.3 0.115 0.717 - 0.955 3.178
XLM en->en 3.06 19.9 0.091 0.877 -0.756 0.962 3.753
de->de 2.98 16.3 0.107 0.708 - 0.973 3.336
fr->fr 3.52 18.3 0.173 0.723 - 0.959 3.398
es->es 2.44 15 0.116 0.718 - 0.962 3.32
mBART en->en 3.96 20.8 0.11 0.894 -0.692 0.983 3.744
de->de 3.23 14.7 0.131 0.731 - 0.989 3.326
fr->fr 3.89 16.3 0.197 0.749 - 0.989 3.4
es->es 3.64 14.8 0.131 0.741 - 0.988 3.29
mT5 en->en 2.51 17.7 0.099 0.889 -0.755 0.981 3.687
de->de 1.92 12.5 0.129 0.722 - 0.971 3.271
fr->fr 2.7 13.1 0.2 0.738 - 0.962 3.331
es->es 2.5 13 0.12 0.735 - 0.974 3.212
QG M-BERT en->en 13.03 38.4 0.192 0.89 -0.431 0.895 3.924
de->de 5.97 20.8 0.19 0.736 - 0.915 3.305
fr->fr 6.64 21.5 0.256 0.747 - 0.909 3.427
es->es 8.1 22.9 0.311 0.79 - 0.928 3.613
XLM en->en 20.18 45.3 0.231 0.912 -0.259 0.961 4.043
de->de 11.95 38.3 0.259 0.783 - 0.974 3.682
fr->fr 16.99 41.4 0.359 0.8 - 0.951 3.753
es->es 18.85 47 0.379 0.83 - 0.95 3.872
mBART en->en 18.89 44.8 0.218 0.912 -0.259 0.976 4.04
de->de 11.29 28.3 0.234 0.781 - 0.988 3.538
fr->fr 13.99 31.8 0.324 0.789 - 0.982 3.653
es->es 15.82 34.5 0.365 0.831 - 0.976 3.784
mT5 en->en 19.45 44.8 0.223 0.912 -0.273 0.973 4.039
de->de 8.93 22.2 0.211 0.733 - 0.956 3.399
fr->fr 15.63 30.1 0.339 0.78 - 0.955 3.624
es->es 16.51 29.5 0.374 0.821 - 0.957 3.737
TG M-BERT en->en 14.84 36.3 0.204 0.882 -0.527 0.915 3.896
de->de 6.18 15.7 0.165 0.707 - 0.892 3.122
fr->fr 7.53 21.8 0.238 0.732 - 0.898 3.404
es->es 8.96 25.3 0.268 0.749 - 0.898 3.507
XLM en->en 15.86 35.4 0.209 0.887 -0.523 0.963 3.923
de->de 7.51 21.8 0.194 0.716 - 0.932 3.367
fr->fr 11.18 28.3 0.294 0.746 - 0.901 3.554
es->es 10.79 29 0.285 0.753 - 0.9 3.595
mBART en->en 20.58 41.7 0.224 0.9 -0.384 0.985 4.004
de->de 8.62 21.7 0.19 0.73 - 0.979 3.391
fr->fr 10.95 28.4 0.261 0.755 - 0.955 3.534
es->es 12.18 30.8 0.296 0.766 - 0.945 3.597
mT5 en->en 14.31 33.1 0.187 0.883 -0.533 0.974 3.879
de->de 6.15 15.7 0.153 0.701 - 0.95 3.256
fr->fr 9.51 23.9 0.248 0.731 - 0.934 3.458
es->es 11.25 26.5 0.271 0.74 - 0.922 3.503
Table 13: The full results under the monolingual multitask evaluation scenario.
Task Model Language N-gram-based Embedding-based Others Ours
BLEU ROUGE-L METEOR BERTScore BLEURT Distinct-1 Ensemble
SG M-BERT en->en 2.8 19.9 0.105 0.887 -0.78 0.918 3.709
de->de 2.71 14.1 0.135 0.72 - 0.944 3.234
fr->fr 1.54 13.7 0.205 0.719 - 0.935 3.274
es->es 2.04 13.7 0.121 0.718 - 0.952 3.18
XLM en->en 2.69 19.8 0.087 0.876 -0.761 0.966 3.759
de->de 2.94 16.9 0.109 0.711 - 0.967 3.35
fr->fr 3.34 18 0.173 0.723 - 0.956 3.403
es->es 2.74 15.8 0.12 0.721 - 0.97 3.354
mBART en->en 3.95 21.4 0.111 0.894 -0.69 0.98 3.758
de->de 3.26 15.1 0.135 0.731 - 0.987 3.333
fr->fr 3.65 16.1 0.191 0.748 - 0.986 3.387
es->es 3.96 15.1 0.136 0.742 - 0.986 3.286
mT5 en->en 3.09 19.9 0.106 0.892 -0.72 0.97 3.738
de->de 2.96 14.3 0.139 0.729 - 0.973 3.317
fr->fr 3.25 15.5 0.2 0.744 - 0.959 3.369
es->es 3.05 14.4 0.129 0.741 - 0.968 3.234
QG M-BERT en->en 17.68 44 0.224 0.902 -0.285 0.925 4.008
de->de 9.22 25.7 0.241 0.76 - 0.943 3.469
fr->fr 8.04 23.9 0.286 0.758 - 0.931 3.499
es->es 10.57 25.4 0.353 0.804 - 0.94 3.683
XLM en->en 18.16 43.6 0.224 0.909 -0.303 0.956 4.03
de->de 11.25 37.2 0.256 0.778 - 0.974 3.675
fr->fr 15.97 40 0.357 0.796 - 0.94 3.744
es->es 18.05 46.4 0.374 0.827 - 0.942 3.865
mBART en->en 20.24 47.2 0.229 0.916 -0.199 0.978 4.069
de->de 12.52 30.6 0.254 0.789 - 0.988 3.591
fr->fr 15.17 33.7 0.332 0.798 - 0.98 3.679
es->es 17.46 36.9 0.387 0.838 - 0.974 3.822
mT5 en->en 21.4 47.3 0.238 0.916 -0.206 0.971 4.068
de->de 12.33 28.2 0.262 0.774 - 0.978 3.559
fr->fr 17.03 33.2 0.356 0.794 - 0.968 3.668
es->es 18.25 31.7 0.399 0.831 - 0.966 3.781
TG M-BERT en->en 15.34 36.3 0.207 0.885 -0.501 0.941 3.916
de->de 6.94 16 0.177 0.71 - 0.919 3.214
fr->fr 7.6 22.1 0.238 0.731 - 0.907 3.404
es->es 9.09 25.5 0.271 0.749 - 0.913 3.497
XLM en->en 14.88 35.2 0.208 0.887 -0.514 0.962 3.922
de->de 8.17 23 0.209 0.721 - 0.937 3.392
fr->fr 11.16 28.4 0.294 0.748 - 0.901 3.557
es->es 12.13 30.5 0.303 0.76 - 0.901 3.62
mBART en->en 20.61 42 0.221 0.9 -0.372 0.987 4.003
de->de 8.76 22.3 0.193 0.731 - 0.977 3.393
fr->fr 11.08 28.8 0.258 0.756 - 0.954 3.531
es->es 13.17 32.1 0.312 0.77 - 0.94 3.615
mT5 en->en 16.46 36.1 0.205 0.888 -0.475 0.975 3.921
de->de 8.07 18.6 0.186 0.712 - 0.953 3.315
fr->fr 11 25.8 0.269 0.737 - 0.934 3.504
es->es 12.53 28.4 0.297 0.749 - 0.922 3.544
Table 14: Full results under the multilingual multitask evaluation scenario.
Task Model Language N-gram-based Embedding-based Others Ours
BLEU ROUGE-L METEOR BERTScore BLEURT Distinct-1 Ensemble
SG M-BERT en->de 2.13 13.4 0.135 0.714 - 0.939 3.221
en->fr 1.21 12.7 0.201 0.719 - 0.949 3.29
en->es 1.69 13.1 0.115 0.714 - 0.934 3.139
XLM en->de 2.14 14.7 0.093 0.711 - 0.977 3.363
en->fr 2.7 17.2 0.176 0.723 - 0.932 3.372
en->es 1.93 14.1 0.107 0.716 - 0.954 3.299
mBART en->de 1.48 11 0.103 0.701 - 0.969 3.136
en->fr 1.43 12.7 0.164 0.727 - 0.977 3.318
en->es 1.25 11.2 0.094 0.723 - 0.961 3.149
mT5 en->de 3.33 15.2 0.138 0.733 - 0.978 3.32
en->fr 3.11 16.4 0.192 0.747 - 0.965 3.38
en->es 3.42 15.1 0.131 0.743 - 0.974 3.259
QG M-BERT en->de 5.68 19.5 0.181 0.705 - 0.922 3.289
en->fr 4.96 19.1 0.239 0.73 - 0.914 3.39
en->es 5.43 19.6 0.278 0.764 - 0.917 3.532
XLM en->de 9.18 35.3 0.239 0.77 - 0.967 3.626
en->fr 14.73 39.6 0.346 0.795 - 0.944 3.74
en->es 16.93 46 0.365 0.827 - 0.936 3.858
mBART en->de 4.62 20.6 0.171 0.75 - 0.985 3.383
en->fr 8.1 25.6 0.277 0.772 - 0.967 3.576
en->es 9.55 29.7 0.313 0.823 - 0.973 3.74
mT5 en->de 8.81 27.2 0.231 0.776 - 0.983 3.543
en->fr 12.4 31.5 0.336 0.799 - 0.967 3.695
en->es 14.35 35.4 0.372 0.838 - 0.966 3.829
TG M-BERT en->de 4.35 13.8 0.15 0.694 - 0.934 3.173
en->fr 3.35 13.2 0.145 0.696 - 0.926 3.197
en->es 6.6 22 0.231 0.726 - 0.912 3.422
XLM en->de 6.9 21.5 0.194 0.716 - 0.938 3.367
en->fr 9.94 27.2 0.284 0.744 - 0.898 3.532
en->es 11.24 29.8 0.292 0.756 - 0.895 3.593
mBART en->de 6.36 14.5 0.153 0.706 - 0.951 3.183
en->fr 8.39 21.6 0.213 0.738 - 0.958 3.412
en->es 9.85 23.6 0.25 0.748 - 0.949 3.457
mT5 en->de 8.38 20 0.201 0.686 - 0.964 3.174
en->fr 10.6 26.6 0.282 0.726 - 0.921 3.416
en->es 12.46 29.3 0.307 0.736 - 0.919 3.47
Table 15: Full results under the English-centric cross-lingual evaluation scenario.
Task Model Language N-gram-based Embedding-based Others Ours
BLEU ROUGE-L METEOR BERTScore BLEURT Distinct-1 Ensemble
SG M-BERT de->en 1.78 16.9 0.098 0.884 -0.81 0.93 3.669
de->fr 1.19 12.8 0.192 0.716 - 0.95 3.29
de->es 1.27 12.9 0.108 0.712 - 0.948 3.153
XLM de->en 2.61 18.8 0.085 0.875 -0.792 0.958 3.748
de->fr 2.42 17.1 0.172 0.718 - 0.952 3.38
de->es 2.33 15 0.115 0.719 - 0.966 3.319
mBART de->en 1.99 18.4 0.094 0.887 -0.796 0.969 3.691
de->fr 1.21 11.8 0.144 0.723 - 0.933 3.174
de->es 0.8 9 0.081 0.711 - 0.941 2.957
mT5 de->en 3.61 20 0.104 0.893 -0.7 0.97 3.732
de->fr 3 15.6 0.186 0.744 - 0.963 3.356
de->es 3.01 14.4 0.126 0.741 - 0.975 3.243
QG M-BERT de->en 7.79 33 0.149 0.88 -0.649 0.884 3.834
de->fr 3.38 17.4 0.226 0.736 - 0.904 3.345
de->es 4 18.7 0.262 0.78 - 0.914 3.537
XLM de->en 12.03 37.6 0.186 0.9 -0.441 0.953 3.971
de->fr 11.58 36.5 0.317 0.786 - 0.944 3.709
de->es 12.88 42.5 0.323 0.816 - 0.947 3.831
mBART de->en 5.91 27.8 0.126 0.877 -0.827 0.872 3.757
de->fr 4.53 18.4 0.215 0.723 - 0.855 3.295
de->es 0.83 10.5 0.131 0.686 - 0.537 3.105
mT5 de->en 11.52 38.7 0.18 0.904 -0.424 0.97 3.99
de->fr 9.67 27.4 0.295 0.785 - 0.966 3.618
de->es 10.75 30.7 0.323 0.826 - 0.965 3.754
TG M-BERT de->en 6.71 24.6 0.142 0.859 -0.821 0.904 3.709
de->fr 5.15 17.7 0.205 0.71 - 0.879 3.295
de->es 6.06 21.5 0.216 0.728 - 0.872 3.379
XLM de->en 9.95 28.1 0.166 0.874 -0.65 0.962 3.822
de->fr 8.58 24.8 0.257 0.736 - 0.911 3.509
de->es 9.14 26.9 0.259 0.747 - 0.906 3.561
mBART de->en 15.5 37.5 0.205 0.894 -0.452 0.983 3.972
de->fr 4.31 17.5 0.146 0.675 - 0.737 3.161
de->es 0.94 13.4 0.095 0.597 - 0.374 2.955
mT5 de->en 11.71 32.4 0.185 0.884 -0.56 0.974 3.904
de->fr 9.59 24.7 0.264 0.746 - 0.919 3.499
de->es 10.85 27.5 0.279 0.757 - 0.909 3.58
Table 16: Full results under the German-centric cross-lingual evaluation scenario.
Task Model Language N-gram-based Embedding-based Others Ours
BLEU ROUGE-L METEOR BERTScore BLEURT Distinct-1 Ensemble
SG M-BERT fr->en 1.63 16.8 0.099 0.884 -0.795 0.928 3.665
fr->de 2.04 13.2 0.134 0.716 - 0.953 3.245
fr->es 1.29 13 0.109 0.712 - 0.942 3.162
XLM fr->en 2.48 18.8 0.085 0.875 -0.78 0.952 3.739
fr->de 2.3 16 0.104 0.705 - 0.954 3.308
fr->es 2.32 15 0.114 0.719 - 0.963 3.329
mBART fr->en 1.49 18.1 0.078 0.881 -0.952 0.966 3.619
fr->de 1.89 12.6 0.124 0.726 - 0.967 3.207
fr->es 3.57 14.8 0.128 0.742 - 0.989 3.29
mT5 fr->en 3.46 20.1 0.105 0.893 -0.697 0.968 3.728
fr->de 3.28 14.9 0.137 0.732 - 0.978 3.315
fr->es 3.46 15 0.13 0.743 - 0.973 3.25
QG M-BERT fr->en 8.18 33.7 0.155 0.884 -0.573 0.877 3.865
fr->de 5.01 20.7 0.183 0.733 - 0.911 3.309
fr->es 4.98 20.3 0.283 0.788 - 0.914 3.578
XLM fr->en 13.29 38.7 0.192 0.902 -0.391 0.951 3.983
fr->de 7.1 33.2 0.212 0.761 - 0.978 3.609
fr->es 15.49 44.5 0.348 0.822 - 0.938 3.845
mBART fr->en 7.37 33.8 0.147 0.894 -0.559 0.982 3.935
fr->de 5.44 21.9 0.179 0.752 - 0.987 3.397
fr->es 9.17 27.7 0.296 0.819 - 0.979 3.698
mT5 fr->en 10.25 37.7 0.172 0.902 -0.441 0.966 3.976
fr->de 6.33 22.6 0.194 0.759 - 0.981 3.438
fr->es 11.54 30.8 0.327 0.825 - 0.964 3.754
TG M-BERT fr->en 6.77 24.7 0.144 0.86 -0.791 0.91 3.649
fr->de 5.16 14.2 0.151 0.696 - 0.888 3.125
fr->es 6.72 22.6 0.227 0.734 - 0.88 3.43
XLM fr->en 10.26 29.5 0.174 0.876 -0.619 0.965 3.848
fr->de 6.34 20.3 0.183 0.713 - 0.95 3.366
fr->es 9.74 27.7 0.265 0.751 - 0.903 3.575
mBART fr->en 14.57 37.4 0.196 0.892 -0.472 0.977 3.966
fr->de 3.81 10.8 0.102 0.669 - 0.82 2.976
fr->es 0.89 12.8 0.093 0.633 - 0.586 2.929
mT5 fr->en 12.17 33 0.191 0.886 -0.545 0.973 3.915
fr->de 7.69 19.3 0.189 0.72 - 0.954 3.36
fr->es 11.85 28.8 0.295 0.762 - 0.912 3.587
Table 17: Full results under the French-centric cross-lingual evaluation scenario.
Task Model Language N-gram-based Embedding-based Others Ours
BLEU ROUGE-L METEOR BERTScore BLEURT Distinct-1 Ensemble
SG M-BERT es->en 2.06 17.1 0.1 0.885 -0.797 0.928 3.67
es->de 2.09 12.8 0.131 0.717 - 0.957 3.236
es->fr 1.01 12.1 0.187 0.715 - 0.95 3.27
XLM es->en 2.38 18.9 0.084 0.875 -0.787 0.955 3.737
es->de 1.99 15.7 0.105 0.704 - 0.957 3.311
es->fr 3.13 17.8 0.176 0.722 - 0.936 3.377
mBART es->en 1.6 17.6 0.084 0.878 -1.079 0.923 3.651
es->de 1.3 10.5 0.115 0.716 - 0.94 3.092
es->fr 0.85 11.3 0.129 0.71 - 0.958 3.168
mT5 es->en 3.36 19.9 0.104 0.893 -0.699 0.972 3.733
es->de 3.12 14.8 0.135 0.732 - 0.979 3.315
es->fr 3.08 16.1 0.187 0.747 - 0.966 3.364
QG M-BERT es->en 7.87 33.1 0.153 0.882 -0.606 0.876 3.854
es->de 4.9 20.7 0.183 0.732 - 0.914 3.294
es->fr 3.88 18.4 0.242 0.74 - 0.91 3.384
XLM es->en 12.99 39.3 0.194 0.903 -0.386 0.952 3.995
es->de 7.62 33.7 0.221 0.765 - 0.975 3.618
es->fr 13.16 38 0.329 0.792 - 0.946 3.724
mBART es->en 2.85 10.4 0.06 0.83 -0.968 0.972 3.134
es->de 0.8 2.5 0.055 0.72 - 0.968 2.755
es->fr 8.2 25.4 0.271 0.777 - 0.988 3.587
mT5 es->en 11.47 39.2 0.182 0.904 -0.419 0.969 3.992
es->de 7.15 24 0.206 0.763 - 0.979 3.459
es->fr 10.66 28.7 0.313 0.788 - 0.97 3.649
TG M-BERT es->en 7.33 25.5 0.146 0.863 -0.758 0.915 3.692
es->de 5.24 14.2 0.15 0.697 - 0.889 3.121
es->fr 5.84 19.1 0.22 0.72 - 0.891 3.327
XLM es->en 10.37 29.6 0.173 0.877 -0.618 0.964 3.859
es->de 6.47 20.8 0.187 0.715 - 0.951 3.381
es->fr 9.09 25.8 0.268 0.74 - 0.905 3.53
mBART es->en 0.3 3.7 0.012 0.766 -0.68 0.317 2.778
es->de 4.57 11.9 0.123 0.696 - 0.934 3.097
es->fr 0.19 8.2 0.041 0.568 - 0.322 2.688
mT5 es->en 12.29 33.4 0.191 0.887 -0.537 0.973 3.886
es->de 7.68 19.3 0.191 0.722 - 0.953 3.358
es->fr 10.41 26.2 0.283 0.752 - 0.924 3.547
Table 18: Full results under the Spanish-centric cross-lingual evaluation scenario.
Task Model Language N-gram-based Embedding-based Others Ours
BLEU ROUGE-L METEOR BERTScore BLEURT Distinct-1 Ensemble
SG M-BERT en->de 0.04 2.8 0.053 0.694 - 0.939 2.756
en->fr 0.04 3.7 0.05 0.706 - 0.942 2.78
en->es 0.04 3.1 0.04 0.708 - 0.941 2.701
XLM en->de 0.31 7.9 0.051 0.686 - 0.893 3.034
en->fr 0.71 10.5 0.094 0.701 - 0.899 3.12
en->es 0.35 8.8 0.058 0.696 - 0.884 3.067
mBART en->de 0.08 3.7 0.039 0.681 - 0.907 2.435
en->fr 0.07 4.3 0.037 0.689 - 0.908 2.48
en->es 0.06 3.9 0.038 0.691 - 0.897 2.46
mT5 en->de 0.07 2.7 0.05 0.697 - 0.98 2.782
en->fr 0.04 3.8 0.044 0.708 - 0.979 2.801
en->es 0.04 3.3 0.038 0.711 - 0.98 2.721
QG M-BERT en->de 0.63 4.1 0.067 0.72 - 0.925 2.737
en->fr 0.57 3 0.057 0.729 - 0.924 2.673
en->es 0.47 2.2 0.054 0.745 - 0.924 2.497
XLM en->de 2.14 19.9 0.119 0.722 - 0.927 3.273
en->fr 2.94 19 0.121 0.738 - 0.939 3.304
en->es 3.53 19.8 0.137 0.754 - 0.942 3.313
mBART en->de 1.21 6 0.079 0.735 - 0.977 2.897
en->fr 1.01 6.6 0.062 0.742 - 0.978 2.88
en->es 0.82 3.4 0.072 0.758 - 0.974 2.683
mT5 en->de 1.51 5.6 0.084 0.739 - 0.971 2.912
en->fr 1.22 5.1 0.067 0.745 - 0.972 2.86
en->es 1.06 3.9 0.073 0.76 - 0.971 2.741
TG M-BERT en->de 2.52 7.8 0.107 0.651 - 0.903 2.988
en->fr 1.81 6.9 0.104 0.658 - 0.916 3.029
en->es 1.74 6.8 0.106 0.666 - 0.906 3.004
XLM en->de 3.36 13.7 0.132 0.677 - 0.885 3.099
en->fr 3.23 13.2 0.131 0.69 - 0.885 3.168
en->es 3.48 14.2 0.135 0.697 - 0.879 3.184
mBART en->de 5.66 14.6 0.148 0.705 - 0.971 3.203
en->fr 3.91 12.9 0.131 0.71 - 0.978 3.233
en->es 4.07 12.4 0.155 0.718 - 0.969 3.231
mT5 en->de 5.23 12.2 0.131 0.676 - 0.967 3.108
en->fr 3.92 10.8 0.118 0.687 - 0.975 3.147
en->es 4 10.6 0.135 0.696 - 0.967 3.149
Table 19: Full results under the zero-shot evaluation scenario.
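The Distinct-1 column in the tables above reports lexical diversity as the ratio of unique unigrams to the total number of generated tokens. The exact tokenization and aggregation behind these numbers are not specified here, so the following is only a minimal sketch in Python, assuming whitespace tokenization and averaging the per-output ratio over all generated texts.

    from typing import List

    def distinct_1(outputs: List[str]) -> float:
        """Average per-output ratio of unique unigrams to total unigrams.

        Sketch only: assumes whitespace tokenization; the tables above may
        use a different tokenizer or corpus-level aggregation.
        """
        scores = []
        for text in outputs:
            tokens = text.split()
            if not tokens:
                continue  # skip empty generations
            scores.append(len(set(tokens)) / len(tokens))
        return sum(scores) / len(scores) if scores else 0.0

    # Example with two hypothetical generated titles
    print(distinct_1([
        "a new benchmark for multilingual generation",
        "the cat sat on the mat",
    ]))

Under these assumptions, short outputs with few repeated words score close to 1.0, which is consistent with the range of Distinct-1 values reported above.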