Natural language generation (NLG) is the task of automatically generating meaningful texts, typically from a non-linguistic or textual representation of information Covington (2001). These texts generally aim to realize an underlying communicative goal while remaining coherent with the input information and grammatically correct. Multilingual text generation extends this task to producing texts in different languages, which is important for overcoming language barriers and enabling universal information access for the world’s citizens Artetxe et al. (2020); Arivazhagan et al. (2019).
However, most of the released text generation datasets are limited to English Rajpurkar et al. (2018); Ritter et al. (2011); Akoury et al. (2020). This limits the ability of researchers to understand many aspects of multilingual text generation related to open vocabularies and complex morphology Kageura (2012). To this end, some works have proposed text generation datasets in non-English languages, such as a multilingual question answering dataset Longpre et al. (2020), a cross-lingual summarization dataset Ladhak et al. (2020), and monolingual text generation datasets covering many languages Liang et al. (2020); Gehrmann et al. (2021). However, they either apply to a single task Longpre et al. (2020); Ladhak et al. (2020) or are limited to non-parallel data across different languages Liang et al. (2020); Gehrmann et al. (2021), which constrains the possible evaluation scenarios. A benchmark that enables the comprehensive evaluation of both parallel and non-parallel multilingual testing scenarios on a diverse range of languages and generation tasks is still missing.
In this paper, we propose MTG, a new benchmark suite for multilingual text generation and evaluation, to address the above problems. MTG is a human-annotated multi-way parallel dataset with three text generation tasks (story generation, question generation, and title generation) across four languages (English, German, French, and Spanish). The multi-way parallel feature means the data is fully parallel across all four languages, providing 12 cross-lingual pairs and 4 monolingual pairs for each task. With the dataset in hand, we evaluate several strong, representative pre-trained models, including multilingual BERT (M-BERT) Devlin et al. (2019), XLM Lample and Conneau (2019), mBART Liu et al. (2020), and mT5 Xue et al. (2020), under six evaluation scenarios: monolingual training, multilingual training, monolingual-multitask training, multilingual-multitask training, cross-lingual generation, and zero-shot transfer.
In summary, the contributions of this paper are as follows: (i) We release a new text generation benchmark suite MTG covering three tasks across four languages, with human-annotated multi-way parallel data for each task. (ii) Based on the multi-way parallel characteristic, we provide six different test scenarios: monolingual, multilingual, monolingual-multitask, multilingual-multitask, cross-lingual generation, and zero-shot transfer. We evaluate several representative pre-trained models on different tasks and scenarios and give an extensive analysis of the experimental results from different aspects. We further conduct qualitative experiments to verify the advantages of human-annotated multi-way parallel data. (iii) We propose a new evaluation metric and show that it correlates better with human scores than other automatic metrics.
2 Related Work
2.1 Multilingual Dataset
BENG Moussallem et al. (2020) is a benchmarking platform for natural language generation and knowledge extraction systems, which is limited to English data. XTREME Hu et al. (2020) is a multilingual understanding benchmark across languages and tasks, but it does not cover any generation task. Jiang et al. (2020) propose X-FACTR, a cross-lingual factual retrieval benchmark. Longpre et al. (2020) propose MKQA, an open-domain question answering evaluation dataset covering diverse languages. Ladhak et al. (2020) present WikiLingua, a large-scale multilingual dataset for cross-lingual abstractive summarization systems. Wiki-40B Guo et al. (2020) is a multilingual language model dataset across many languages. Although these datasets cover multiple languages, each belongs to a single, specific generation task, which prevents researchers from obtaining general findings across a set of tasks. XGLUE Liang et al. (2020) is a cross-lingual benchmark dataset with nine understanding tasks and two generation tasks. GEM Gehrmann et al. (2021) is a newly presented natural language generation benchmark covering a range of tasks. A marked difference between our MTG and these benchmarks is that MTG is parallel across all languages, which enables more testing scenarios.
2.2 Multilingual Modeling
Multilingual pre-trained models can bring better performance on downstream tasks. Multilingual BERT (M-BERT) Devlin et al. (2019) is a single language model pre-trained on monolingual corpora in 104 languages with the Masked Language Modeling task. XLM Lample and Conneau (2019) is pre-trained simultaneously with the Masked Language Modeling task and the Translation Language Modeling task, and was later extended to a RoBERTa version called XLM-R Conneau et al. (2019). Unicoder Huang et al. (2019) further leverages more cross-lingual pre-training tasks and achieves better results on XNLI than XLM. Multilingual BART (mBART) Liu et al. (2020) is a pre-trained encoder-decoder model trained with a denoising auto-encoding objective on monolingual data over 25 languages. Multilingual T5 (mT5) Xue et al. (2020) is a multilingual variant of T5 leveraging a unified text-to-text format; it is pre-trained with a span-corruption version of the Masked Language Modeling objective over 101 languages.
3 Dataset Collection and Methodology
In this section, we introduce how the benchmark suite for multilingual text generation (MTG) is created. First, several criteria for selecting the tasks and datasets are described. Then one language is chosen as the starting language, and the multi-way dataset is constructed by translating from the starting language to the other languages, which are selected according to a few principles. Finally, the data annotation process is described in detail to ensure the quality of the dataset.
|Task|Dataset|Domain|Format|Goal|
|---|---|---|---|---|
|Story Generation|ROCStories|Daily life|<story>|Generate the end of the story|
|Question Generation|SQuAD 1.0|Wikipedia|<passage, answer, question>|Generate the question of the answer|
|Title Generation|ByteCup|News|<article, title>|Generate the title of the document|
3.1 Task and Dataset Selection
There are plenty of generation tasks in monolingual text generation. It is important to select suitable tasks for our MTG benchmark to make it diverse and challenging. Thus, we define several criteria during the task selection:
Task Definition Tasks should be well-defined, which means that humans can easily determine whether the generated results meet the task requirements. Besides, these tasks should have been well-studied in one language and rarely been studied in multilingual scenarios.
Task Difficulty Tasks should be solvable by most college-educated speakers. Beyond that, they should be challenging to current models, the performance of which in various test scenarios falls short of human performance.
Task Diversity Tasks should cover a wide range of relevant generation challenges that allow for findings to be as general as possible.
Input Format The input format of the tasks needs to be as simple as possible in order to reduce the difficulty of data processing. Besides, the input should contain only text, not other modalities such as images or videos.
In order to meet the above criteria, 8 domain experts are asked to vote among 10 typical generation tasks (story generation, commonsense generation, style transfer, question generation, question answering, dialogue generation, title generation, text summarization, image captioning, and data-to-text). Finally, three generation tasks are selected for MTG: story generation, question generation, and title generation. Story generation (SG) aims to generate the end of a given story context, which requires the model to understand the story context and generate a reasonable and fluent ending Guan et al. (2019). Question generation (QG) aims to generate a correct question for a given passage and its answer Duan et al. (2017). For the same passage with different answers, the system should be able to generate different questions. Title generation (TG) converts a given article into a condensed sentence while preserving its main idea Jin and Hauptmann (2002). The title should be faithful to the original document while encouraging users to read the news. These three tasks are very different from each other and focus on different generative abilities.
After determining the tasks, the next step is to choose the dataset for each task. The two selection principles are as follows: (1) License: Task data must be available under licenses that allow use and redistribution for research purposes. The dataset should be freely available for download. (2) Quality: The dataset size should be as large as possible and the quality should be checked.
Based on these accessibility and quality concerns, English is selected as the starting language and English datasets for the three tasks are gathered. We choose ROCStories Mostafazadeh et al. (2016) for story generation, SQuAD Rajpurkar et al. (2016) for question generation, and ByteCup (https://www.biendata.xyz/competition/bytecup2018/) for title generation. These datasets are popular in the corresponding fields and have been verified to be high-quality by many works. Moreover, they are all under permissive licenses. A specific example of question generation is shown in Figure 1. An overview of all task datasets is shown in Table 1.
3.2 Language Selection
The original datasets are only in English (en), and we want to extend them to multi-way parallel scenarios. This means that all English text should be translated into the other languages, which would incur an expensive annotation cost. Thus, a state-of-the-art translator is used to do the translation, and annotators are then asked to correct the translated text. Accordingly, MTG should contain languages that (1) have good English-to-X translators and (2) are diverse in language family. Finally, German (de), French (fr), and Spanish (es) are chosen. German is from the same language branch as English, while French and Spanish are from a different one. In the future, more distant languages, such as Chinese, will be added to MTG to discover more interesting results across various languages.
3.3 Data Collection
After determining the tasks and languages, we introduce the data annotation process used to build MTG. First, the Google Translator (https://translate.google.com/) is used to translate the English datasets into the selected languages. Then, the same translator is used to translate the results back to English. If the n-gram overlap ratio between the original English text and the back-translated one is less than a set threshold, the example is removed. Several threshold values are tried; a higher threshold causes the QG training data size to drop sharply, so we choose a threshold that improves the quality of the filtered data while still retaining most of the original training data. (The detailed sizes of the filtered datasets for different thresholds are included in the appendix.) All four languages are aligned to ensure the dataset is multi-way parallel. Afterward, 87k, 72k, and 280k roughly parallel training examples for SG, QG, and TG are constructed.
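The round-trip filtering step above can be sketched as follows. This is a minimal illustration of an n-gram overlap ratio; the whitespace tokenization, the choice of n, and the threshold value are assumptions, not the paper's exact settings:

```python
from collections import Counter

def ngram_overlap(reference: str, candidate: str, n: int = 2) -> float:
    """Fraction of the reference's n-grams that also appear in the candidate."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref:
        return 0.0
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / sum(ref.values())

def keep_example(original_en: str, back_translated_en: str,
                 threshold: float = 0.5) -> bool:
    # Drop the example when the round-trip (en -> X -> en) translation
    # diverges too much from the original English text.
    # The threshold value here is illustrative only.
    return ngram_overlap(original_en, back_translated_en) >= threshold
```

In this sketch, an example survives only if most of its original n-grams are preserved after translating to the target language and back, which filters out sentences the machine translator handles poorly.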
10k samples of each task are randomly chosen from the merged development and test sets as data for annotation. The annotators are required to further check the translated results based on the following rules: (1) Relevance: whether the target text is meaningful and relevant to the source text. (2) Fluency: whether the translated text is grammatically correct. (3) Style: whether the translation follows local culture, language conventions, and gender-related words.
If the translated text violates any of the above rules, annotators correct it accordingly. The annotated data is then split into 5k/2k/3k training/development/test subsets.
Then the multi-way parallel generation benchmark MTG is finally completed. It contains three different generation tasks (SG, QG, TG) in four languages (en, de, fr, es). Each example is fully parallel across the four languages, which means we can take the input in one language and use the output in another language. The statistics of MTG are shown in Table 2.
|Task|SG/QG/TG|
|---|---|
|*For each language*| |
|Rough training size|87k/72k/280k|
|Annotated training size|5k/5k/5k|
|Annotated development size|2k/2k/2k|
|Annotated test size|3k/3k/3k|
|*For four languages (en, de, fr, es)*| |
|Total annotated size|120k|
|Total dataset size|1.87m|
3.4 Annotation Process
A team of 10 full-time experts (3 language experts for German, 3 for French, and 4 for Spanish) is hired to do the annotation; they are paid daily. Some part-time workers (16 participated in the German annotation, 39 in French, and 4 in Spanish) are also employed to increase the annotation throughput; they are paid by the number of annotations. Each annotator is an expert in at least two languages (English and another target language). They are first trained on how to correct translation errors according to the above rules and then annotate a small number of examples as a test. We recheck these examples and give feedback to help them understand the tasks. After this annotation training process, the annotators start annotating the dataset. For quality control, we sample from the produced annotations and arrange 9 experts to recheck them. Each example is assigned to two other experts, and the data is qualified only if both of them agree that the annotation is correct (the grammar, expressions, and punctuation of the annotated text are completely correct, and the expressions accord with the conventions of the target language). If too large a fraction of an annotator's annotations fail, all of that annotator's data for that day is re-checked.
The annotation process takes several days to finish. Full-time experts are paid a fixed daily salary and part-time annotators are paid per example; both rates are above the local minimum hourly wage.
4 Experiments
In this section, we conduct extensive experiments to benchmark the difficulty of our proposed MTG via several state-of-the-art multilingual models under different scenarios.
4.1 Evaluation Models
The performance of the following four most popular multilingual pre-trained models is explored:
M-BERT Multilingual BERT (M-BERT) Devlin et al. (2019) is a single language model pre-trained on monolingual corpora in 104 languages with the Masked Language Modeling (MLM) task. M-BERT leverages a shared WordPiece vocabulary and a 12-layer Transformer encoder.
XLM The Cross-Lingual Language Model (XLM) Lample and Conneau (2019) is pre-trained simultaneously with the Masked Language Modeling (MLM) task on monolingual data and the Translation Language Modeling (TLM) task on parallel data. XLM uses a shared vocabulary of byte-pair encoded (BPE) subwords Sennrich et al. (2016).
mBART Multilingual BART (mBART) Liu et al. (2020) is a pre-trained encoder-decoder model trained with a denoising auto-encoding objective on monolingual data over 25 languages. mBART uses a shared SentencePiece vocabulary and consists of a 12-layer encoder and a 12-layer decoder.
mT5 Multilingual T5 (mT5) Xue et al. (2020) is a multilingual variant of T5 leveraging a unified text-to-text format. It is pre-trained with a span-corruption version of the Masked Language Modeling objective over 101 languages.
4.2 Evaluation Scenarios
A salient feature of MTG is that it is multi-way parallel across all languages. Thus, we conduct experiments in various scenarios to demonstrate the wide usability of MTG.
Monolingual fine-tuning The pre-trained model is trained for a downstream task on the training data of a specific language and evaluated on the test set of the same language. The input and output of the downstream task are in the same language.
Monolingual Multitask fine-tuning A universal model for all tasks is trained, but it is still language-specific as in monolingual fine-tuning.
Multilingual fine-tuning The pre-trained model is jointly finetuned with data in all languages for a specific task. Different from the monolingual fine-tuning setting, there is only one model for each downstream task, which can serve all languages.
Multilingual Multitask fine-tuning Rather than only combining training data of all languages as in multilingual fine-tuning, data from all languages and all tasks are gathered to train a single model for all languages and tasks.
Cross-lingual generation Since MTG is multi-way parallel, it can be reorganized to create input-output pairs in different languages. For example, in title generation, a sample can consist of a source document in English and a target title in German. For a multilingual dataset with n languages, n × (n − 1) directed input-output pairs can be constructed, since direction matters. The cross-lingual generation performance in all directions is evaluated.
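Building the directed cross-lingual pairs from one multi-way parallel sample can be sketched as follows; the field names and texts are hypothetical placeholders, not actual MTG data:

```python
from itertools import permutations

# Multi-way parallel example: the same title-generation instance in all
# four languages (placeholder texts for illustration only).
sample = {
    "en": {"article": "An English news article ...", "title": "An English title"},
    "de": {"article": "Ein deutscher Artikel ...", "title": "Ein deutscher Titel"},
    "fr": {"article": "Un article francais ...", "title": "Un titre francais"},
    "es": {"article": "Un articulo espanol ...", "title": "Un titulo espanol"},
}

def make_pairs(parallel_sample, src_field="article", tgt_field="title"):
    """Build every directed (source-language input, target-language output) pair."""
    return {
        (src, tgt): (parallel_sample[src][src_field], parallel_sample[tgt][tgt_field])
        for src, tgt in permutations(parallel_sample, 2)
    }

pairs = make_pairs(sample)
# With 4 languages this yields 4 * 3 = 12 directed cross-lingual pairs,
# matching the n * (n - 1) count for n = 4.
```

Each key is an ordered (source, target) language pair, so en->de and de->en are distinct training directions.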
Zero-shot transfer We also explore zero-shot generation on the three tasks. The model is finetuned on a specific task with English input and output, and then used to generate output in other languages given a language tag.
4.3 Evaluation Metrics
The quality of the generated texts is evaluated from different aspects to fully understand model performance. Moreover, we propose a new ensemble metric that has higher correlation with human annotation scores.
N-gram based Metrics N-gram-based metrics evaluate the text-overlapping score between the outputs and references. The following three metrics are used: (1) BLEU Papineni et al. (2002) is a popular metric that calculates the word-overlap scores between the generated texts and gold-standard ones. We use the BLEU-4, which is the average score for unigram, bigram, trigram, and 4-gram. (2) ROUGE Lin (2004) is a recall-oriented metric that counts the number of overlapping units such as n-gram, word sequences, and word pairs between the produced texts and gold-standard ones. ROUGE-L calculates the overlapping of the longest common subsequence between generated results and the references. (3) METEOR Banerjee and Lavie (2005) relies on semantic features to predict the similarity scores between system hypotheses and human references.
Embedding based Metrics The embedding-based metrics can, to a large extent, capture the semantic-level similarity between the generated texts and the ground truth. (1) BERTScore Zhang et al. (2019) computes the token similarity of candidates and references as a sum of cosine similarities between tokens using pre-trained BERT contextual embeddings. (2) BLEURT Sellam et al. (2020) is a metric learned from a diverse set of lexical- and semantic-level supervision via the BERT-base architecture. However, it currently only supports English.
Diversity Metrics We also employ the distinct metric Li et al. (2016), which calculates the proportion of the distinct n-grams in all the system hypotheses and can be used to evaluate the diversity of the generated texts. In this paper, we choose the distinct-1 for unigram diversity.
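The distinct-n computation described above can be sketched as follows; whitespace tokenization and lowercasing are assumptions for illustration:

```python
def distinct_n(hypotheses, n=1):
    """Ratio of unique n-grams to total n-grams over all system outputs.

    Higher values indicate more diverse generations; distinct-1 (n=1)
    measures unigram diversity as used in the paper.
    """
    total, unique = 0, set()
    for hyp in hypotheses:
        tokens = hyp.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

For example, the two outputs "the cat" and "the dog" contain 4 unigrams of which 3 are unique, giving a distinct-1 of 0.75.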
Human Evaluation Human evaluation is also leveraged to better understand the quality of model outputs. Specifically, cases are randomly sampled from the test set for each task and language and presented to human annotators together with the model outputs. The annotators evaluate each model output on three aspects (Grammar, Fluency, and Relevance) and give a score from 1 to 5 for each aspect. The detailed annotation rules are in the appendix.
Ensemble Metric For the samples with human annotations, their automatic metric scores (except for BLEURT, because it only supports English) are gathered as features, and the human-annotated scores serve as targets. All these data are split into training, development, and test sets. After comparing the performance of different regression models, as shown in Table 3, we finally choose a gradient boosting regression model as the ensemble metric.
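The ensemble metric can be sketched with scikit-learn's GradientBoostingRegressor. The synthetic features and targets below are illustrative stand-ins for the paper's actual automatic-metric scores and human annotations:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: each row holds automatic-metric scores for one
# sample (e.g. BLEU, ROUGE-L, METEOR, BERTScore, distinct-1), and y is a
# stand-in human score in [1, 5].
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = 1 + 4 * X.mean(axis=1)  # toy target correlated with the features

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
preds = model.predict(X_test)

# Pearson correlation between the ensemble metric's predictions and the
# (here synthetic) human scores on the held-out split.
corr = np.corrcoef(preds, y_test)[0, 1]
```

The regressor learns how to weight and combine the individual metrics so that its output tracks human judgments more closely than any single metric does.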
The correlations between the automatic metrics and human evaluation scores on the above test set are displayed in Figure 2. Our ensemble metric outperforms the other three metrics (comparisons with all other automatic metrics are included in the appendix; our ensemble metric still achieves the best performance there).
We finetune the state-of-the-art multilingual pre-trained models on the three tasks in MTG and evaluate their performance under different scenarios. Due to limited space, we list part of the results in the main body; the full results and experimental settings can be found in the appendix.
Monolingual and Multilingual
In most cases, multilingual training improves model performance. As shown in Figure 3, for M-BERT, mBART, and mT5, training with data in all languages improves performance on all tasks compared with monolingual training. In the multitask setting, multilingual training also brings performance improvements. Compared with the other models, multilingual training brings a larger performance gain for mT5, because mT5 has more parameters and performs better as the training data size increases.
Multitask and Single-task
Besides multilingual training, we also explore the influence of multitask training by training with all tasks' data. As shown in Figure 3, compared with single-task training on monolingual data, the multitask counterparts tend to boost performance. With multilingual data, however, adding the multitask training objective can sometimes cause a performance decline, especially for story generation. This is because the reference of story generation is less deterministic than those of title generation and question generation; the difference between story generation and the other two tasks causes the multitask training performance to drop.
In this paper, we make use of the multi-way parallel data for supervised cross-lingual training; e.g., for English-centric cross-lingual training, we take the English source as the input and the parallel German, French, and Spanish targets as the output. We then evaluate the model in the same settings (en->de, en->es, en->fr). Figure 4 contains the cross-lingual results centered on each of the four languages, along with the monolingual results. Compared with the other three models, XLM performs best in most cases (cells for XLM are closer to gray), because XLM makes use of parallel data for TLM during the pre-training phase, which gives it cross-lingual capabilities. Besides, mT5 outperforms mBART under the cross-lingual setting, although both use only monolingual corpora for pre-training. The reason is that mT5 is pre-trained on 101 languages, compared with 25 languages for mBART.
On the other hand, as displayed in Figure 4, it is much easier for models to transfer to English than to German (nearly all English columns are gray, while most cells in the German columns are red). This is because the word order of German differs more from that of the other three languages.
We also explore zero-shot transfer by training on English input and output and then generating output in other languages directly, given an English input and a language tag. Different from the cross-lingual setting, the model does not see en->x (x = de, es, fr) data during training. Table 19 indicates that XLM still outperforms the other models by a large margin on the SG and QG tasks. However, mT5 fails because it does not use language tags during pre-training: since we only use en->en data to finetune the models on the downstream tasks, mT5 has never seen the tags of the target languages and cannot generate the corresponding language.
Table 5 presents the human evaluation scores for TG in four evaluation scenarios. From the table, we find that XLM and mT5 usually obtain higher scores than M-BERT and mBART, because they are pre-trained with more languages and better pre-training tasks, which enables them to better fuse the semantic spaces of different languages.
Based on the analysis above, we can draw several conclusions: (1) Multilingual training boosts model performance in most cases. (2) German is harder for cross-lingual models to generate than the other three languages. (3) XLM has the best cross-lingual performance across all three tasks, and it maintains superior performance in the zero-shot setting.
Pseudo vs. Annotated
To answer the question of whether the 5k annotated examples help the model generate better text, we use the rough training data filtered by back-translation for the first finetuning stage and the annotated training data for the second stage. We conduct an ablation study of this two-stage finetuning on QG under all evaluation scenarios with mBART and present the results in Figure 5. To compare the performance of the two stages, we also conduct t-tests, which show that the improvement from the annotated training data is significant in nearly all settings (except for the Multilingual-Multitask setting; the average scores and p-values are displayed in the appendix). This illustrates that although the amount of annotated data is small, it can further improve performance. It also highlights the necessity of human-annotated multilingual data compared with pseudo-parallel data obtained via machine translation.
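The significance test can be sketched as a paired t-test over per-sample metric scores from the two finetuning stages. This is a generic illustration rather than the paper's exact test setup; in practice, scipy.stats.ttest_rel would also return the p-value:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(stage1_scores, stage2_scores):
    """t statistic of a paired t-test over per-sample scores.

    A large positive value indicates that the second stage (annotated-data
    finetuning) scores significantly higher than the first stage.
    """
    diffs = [b - a for a, b in zip(stage1_scores, stage2_scores)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    return mean(diffs) / (sd / math.sqrt(n))
```

The t statistic is then compared against the t-distribution with n − 1 degrees of freedom to obtain a p-value.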
Translation vs. Cross-lingual
Different from cross-lingual generation, we can also generate output in the same language as the input and then translate it into another language. Thus, we compare the results of cross-lingual generation with a generation-translation baseline, which uses the English generation model to produce English output and then translates it into German, French, and Spanish. The results are plotted in Figure 6. As the figure shows, the cross-lingual generation model outperforms the generation-translation baseline in almost all tasks and languages. This means that supervised cross-lingual generation is the better solution when the source and target are in different languages. Our multi-way parallel multilingual benchmark provides plenty of cross-lingual data in different directions, which will encourage research on cross-lingual generation.
5 Conclusion
In this paper, we propose a multilingual benchmark MTG for text generation. It contains three typical generation tasks: story, question, and title generation. The key feature of MTG is its multi-way parallel data across four languages: English, German, French, and Spanish. This enables the benchmark to support more evaluation scenarios, such as cross-lingual training and zero-shot transfer. We also benchmark state-of-the-art multilingual pre-trained models on MTG with different metrics to explore its features and challenges and to promote research and progress in multilingual text generation.
6 Ethics Considerations
Since we propose a new multilingual text generation benchmark MTG, we address some possible ethical considerations in this section.
We choose ROCStories, SQuAD 1.0, and ByteCup as the English datasets for the story, question, and title generation tasks. All of them are available for research use under their licenses, and they can be downloaded freely from their websites (ROCStories requires some necessary contact information). We ensure that these datasets are only used for academic research and that the dataset construction process is consistent with the intellectual property and privacy rights of the original authors.
As described in Section 3.4, we hire some full-time and part-time language experts to do the annotation and all of them are paid fairly. Their salary is higher than the local minimum average hourly wage. They are voluntary participants and are aware of any risks of harm associated with their participation. The annotation process is consistent with the intellectual property and privacy rights of the recruited annotators as well.
HUSH: a dataset and platform for human-in-the-loop story generation.
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6470–6484. Cited by: §1.
Massively multilingual neural machine translation in the wild: findings and challenges. arXiv preprint arXiv:1907.05019. Cited by: §1.
- On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Cited by: §1.
- METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §4.3.
- Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116. Cited by: §2.2.
- Building natural language generation systems. Language 77 (3), pp. 611–612. Cited by: §1.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §2.2, §4.1.
- Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 866–874. Cited by: §3.1.
- The gem benchmark: natural language generation, its evaluation and metrics. arXiv preprint arXiv:2102.01672. Cited by: §1, §2.1.
Story ending generation with incremental encoding and commonsense knowledge.
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6473–6480. Cited by: §3.1.
- Wiki-40b: multilingual language model dataset. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 2440–2452. Cited by: §2.1.
XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalisation.
International Conference on Machine Learning, pp. 4411–4421. Cited by: §2.1.
- Unicoder: a universal language encoder by pre-training with multiple cross-lingual tasks. arXiv preprint arXiv:1909.00964. Cited by: §2.2.
- X-factr: multilingual factual knowledge retrieval from pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5943–5959. Cited by: §2.1.
- A new probabilistic model for title generation. In COLING 2002: The 19th International Conference on Computational Linguistics, Cited by: §3.1.
- The quantitative analysis of the dynamics and structure of terminologies. Vol. 15, John Benjamins Publishing. Cited by: §1.
- WikiLingua: a new benchmark dataset for cross-lingual abstractive summarization. arXiv preprint arXiv:2010.03093. Cited by: §1, §2.1.
- Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291. Cited by: §1, §2.2, §4.1.
- A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 110–119. Cited by: §4.3.
- XGLUE: a new benchmark dataset for cross-lingual pre-training, understanding and generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6008–6018. Cited by: §1, §2.1.
- Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §4.3.
- Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, pp. 726–742. Cited by: §1, §2.2, §4.1.
- MKQA: a linguistically diverse benchmark for multilingual open domain question answering. arXiv preprint arXiv:2007.15207. Cited by: §1, §2.1.
- A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 839–849. Cited by: §3.1.
- A general benchmarking framework for text generation. In Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)., pp. 27–33. Cited by: §2.1.
- Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318. Cited by: §4.3.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. Cited by: §4.1.
- Know what you don’t know: unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784–789. Cited by: §1.
- SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392. Cited by: §3.1.
- Data-driven response generation in social media. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 583–593. Cited by: §1.
- BLEURT: learning robust metrics for text generation. arXiv preprint arXiv:2004.04696. Cited by: §4.3.
- Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725. Cited by: §4.1.
- MT5: a massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934. Cited by: §1, §4.1.
- Bertscore: evaluating text generation with bert. arXiv preprint arXiv:1904.09675. Cited by: §4.3.
Appendix A Experimental settings
We use the encoder-decoder architecture for our generation tasks. Among the models described above, mBART and mT5 are pre-trained for generation, whereas M-BERT and XLM-R are pre-trained only as encoders. We therefore initialize the decoder with the encoder parameters for M-BERT and XLM-R. Since M-BERT and mT5 use no language tags during the pre-training phase, we manually add a language tag at the beginning of both the source and the target for M-BERT, and add the target language tag to the beginning of the source for mT5.
We adjust the input format for each task. For QG, we append the answer to the passage and insert a special token to separate them. For SG, we take the first four sentences as the source and the last sentence as the target.
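As a concrete illustration, the input construction described above can be sketched as follows; the separator token and the language-tag format are illustrative assumptions, not the exact tokens used in our experiments.

```python
SEP = "<sep>"  # assumed separator token between passage and answer

def build_qg_source(passage, answer):
    """QG: append the answer to the passage, separated by a special token."""
    return f"{passage} {SEP} {answer}"

def build_sg_pair(story_sentences):
    """SG: first four sentences form the source, the last is the target."""
    source = " ".join(story_sentences[:4])
    target = story_sentences[-1]
    return source, target

def add_language_tags(source, target, model, src_lang, tgt_lang):
    """Prepend language tags for models pre-trained without them."""
    if model == "m-bert":   # tag both source and target
        return f"<{src_lang}> {source}", f"<{tgt_lang}> {target}"
    if model == "mt5":      # target-language tag on the source only
        return f"<{tgt_lang}> {source}", target
    return source, target   # models with their own tagging stay unchanged
```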
We adopt two-step finetuning to make full use of our MTG benchmark. We first train our models on the downstream tasks for 20 epochs using the large, roughly parallel training data, and then finetune them for 10 epochs on the small annotated training data to further improve generation performance. We evaluate the model every 2,000 steps and select the best checkpoint by the loss on the development set. The batch size is 32; the learning rate and optimizer parameters are kept at each model's defaults.
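Schematically, the two-step schedule can be expressed as below; `train` is a hypothetical helper standing in for each model's actual training loop, not the authors' code.

```python
def two_step_finetune(model, rough_parallel_data, annotated_data, dev_data, train):
    """Two-step finetuning: rough parallel data first, then annotated data."""
    # Step 1: 20 epochs on the large, roughly parallel training data,
    # evaluating every 2000 steps and keeping the best checkpoint by dev loss.
    model = train(model, rough_parallel_data, dev_data,
                  epochs=20, batch_size=32, eval_every_steps=2000)
    # Step 2: 10 more epochs on the small human-annotated training data.
    model = train(model, annotated_data, dev_data,
                  epochs=10, batch_size=32, eval_every_steps=2000)
    return model
```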
Appendix B Human evaluation
We use Grammar, Fluency, and Relevance to evaluate generation performance. Each score ranges from 1 to 5; the criterion for each level is given below, ordered from lowest to highest.

Grammar:
- Not the target language at all
- Is the target language, but has too many grammatical errors to convey the meaning
- Barely conveys the meaning of the sentence, but has many errors
- Grammatically correct overall, with a few errors

Fluency:
- Completely incomplete, with loose words
- Basically incomplete, with some phrases
- Barely formed sentences, but not fluent
- Basically complete, with a few flaws

Relevance:
- Not related at all
- A few words are relevant (e.g., character, place, time), but the overall narrative is not relevant
- Relevant, but not logical
- Basically reasonable, with a few irrelevant descriptions
Appendix C Back Translation Threshold Testing
The detailed data sizes of the back-translation-filtered dataset for the different tasks are presented in Table 6.
Appendix D Automated Metric Performances
The full comparison of the ensemble metric with the other automatic metrics is displayed in Figure 7. The ensemble metric outperforms all the other metrics.
Appendix E Significance Test Results
The average ensemble metric scores for stage 1 and stage 2 in question generation, together with the corresponding significance-test p-values, are displayed in Table 7. As the table shows, adding human-annotated training data consistently improves model performance. The improvements are significant in all settings except the Multilingual-Multitask setting.
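The exact statistical test is not fixed above; as one plausible sketch, a one-sided paired bootstrap test over per-example metric scores could be computed as follows (the function name and resampling setup are illustrative assumptions, not the authors' procedure).

```python
import random

def paired_bootstrap_p(scores_stage1, scores_stage2, n_resamples=10000, seed=0):
    """One-sided paired bootstrap: estimate the probability that stage 2
    does NOT outperform stage 1 on a resampled test set."""
    assert len(scores_stage1) == len(scores_stage2)
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_stage1, scores_stage2)]
    n = len(diffs)
    wins = 0
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) > 0:  # stage 2 beats stage 1 on this resample
            wins += 1
    return 1.0 - wins / n_resamples  # small p-value => significant gain
```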
Appendix F Human evaluation scores
Appendix G Experimental Results
Here we present the detailed experimental results of our four baseline models under four different evaluation settings.