
VieSum: How Robust Are Transformer-based Models on Vietnamese Summarization?

by Hieu Nguyen, et al.
Case Western Reserve University

Text summarization is a challenging task within natural language processing that involves text generation from lengthy input sequences. While this task has been widely studied in English, there is very limited research on summarization for Vietnamese text. In this paper, we investigate the robustness of transformer-based encoder-decoder architectures for Vietnamese abstractive summarization. Leveraging transfer learning and self-supervised learning, we validate the performance of the methods on two Vietnamese datasets.


1 Introduction

In recent years, Transformer-based architectures and pre-trained language models (LMs) have played an important role in the development of Natural Language Processing (NLP) systems. Large pre-trained models such as ELMo Peters et al. (2018), GPT Brown et al. (2020), and BERT Devlin et al. (2018) are trained on large corpora and can derive contextual representations of the language(s) in the training data. After pre-training, these models have achieved state-of-the-art results on a broad range of downstream tasks Devlin et al. (2018). These self-supervised learning methods use learning objectives such as Masked Language Modeling (MLM) Devlin et al. (2018), where random tokens in the input sequence are masked and the model attempts to predict the original tokens in order to gain a better understanding of context. The successes of pre-trained models in English have inspired new research efforts to develop pre-trained models in other languages such as Vietnamese (e.g., PhoBERT Nguyen and Nguyen (2020) and ViBERT Bui et al. (2020)). There are also ongoing efforts to develop multilingual pre-trained models (mT5 Xue et al. (2020), mBART Liu et al. (2020)) in order to improve performance across multiple languages by learning both general and language-specific representations.
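The masking step of MLM can be illustrated with a minimal sketch. It is a simplification, not the procedure of any particular model: every selected token is replaced by a `[MASK]` placeholder, whereas BERT also leaves some selected tokens unchanged or swaps them for random tokens. The function name `mask_tokens`, the whitespace tokenization, and the example sentence are assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder symbol; real tokenizers define their own mask token


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for one masked-language-modeling example.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a training loss would only be computed where a token was hidden.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels.append(tok)    # remember what the model should predict
        else:
            masked.append(tok)
            labels.append(None)   # position is not scored
    return masked, labels


# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "tôi thích đọc báo mỗi buổi sáng".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
```

During pre-training, the model would receive `masked` as input and be trained to recover the non-`None` entries of `labels`.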

Text summarization is a text generation task in which the input is a free-form paragraph or document(s) and the output is a short summary of the input. Although plain English text summarization Manor and Li (2019) and cross-lingual tasks Zagar and Robnik-Sikonja (2020); Xu et al. (2020) have been studied extensively, there have been few studies of Vietnamese text summarization.

We perform robust experiments on Transformer-based encoder-decoder models applied to Vietnamese summarization tasks. Following the transformer architecture of Vaswani et al. (2017) and the Bert2Bert architecture of Rothe et al. (2019), we use existing monolingual Vietnamese pre-trained models and multilingual pre-trained models for our experiments. We fine-tune the models on two datasets: Wikilingua Ladhak et al. (2020) and Vietnews Nguyen et al. (2019). Our results show that mT5 and mBART achieve more competitive results than the monolingual Vietnamese pre-trained models. This indicates that there are significant opportunities to further improve performance on Vietnamese summarization tasks (and Vietnamese NLP in general) with the development of stronger monolingual pre-trained models.

2 Related works

There are many abstractive summarization studies in English. In an early example, Gehrmann et al. (2018) employed a bottom-up content selector (BottomUp) to determine which phrases in the source document should be part of the summary, and then a copy mechanism was applied only to pre-selected phrases during decoding. Their experiments obtained significant improvements on ROUGE for some canonical summarization datasets.

In recent years, pre-trained language models have been used to enhance performance on language generation tasks. Liu and Lapata (2019) developed a Transformer-based encoder-decoder model so that pre-trained language models like BERT can be adopted for abstractive summarization. The authors proposed a novel document-level BERT-based encoder (BERTSum) and a general framework encompassing both extractive and abstractive summarization tasks. Building on BERTSum, Dou et al. (2021) introduced GSum, which uses different types of guidance signals as input in order to generate more suitable words and more accurate summaries. This model achieved state-of-the-art performance on four popular English summarization datasets.

Meanwhile, there are only a small number of studies on Vietnamese text summarization, and most of them focus on extractive summarization. Nguyen et al. (2018) compared a wide range of extractive methods, including unsupervised ranking methods (e.g., LexRank, LSA, KL-divergence), supervised learning methods using TF-IDF and classifiers (e.g., Support Vector Machine, AdaBoost, Learning-2-rank), and deep learning methods (e.g., Convolutional Neural Network, Long Short-Term Memory). Similarly, Nguyen et al. (2019) evaluated extractive methods on their own dataset, which was released publicly as a benchmark for future studies.

Recent work by Quoc et al. (2021) investigated the combination of a pre-trained BERT model and an unsupervised K-means clustering algorithm for extractive text summarization. The authors used multilingual and monolingual BERT models to encode sentence-level contextual information and then ranked this information using the K-means algorithm. Their report showed that monolingual models achieved better results than multilingual models on the same extractive summarization tasks. However, given the lack of studies on Vietnamese abstractive summarization, we compare both multilingual and monolingual encoder-decoder models.

3 Models

3.1 BERT

BERT Devlin et al. (2018) (Bidirectional Encoder Representations from Transformers) is a multilayer bidirectional Transformer encoder based on the architecture described in Vaswani et al. (2017). In BERT, each input token is represented by the sum of its token embedding, segment embedding, and positional embedding. BERT uses the WordPiece tokenizer to split words into subwords.
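The summed input representation can be sketched with toy lookup tables. The function name `embed_input`, the hidden size of 2, and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow; a real model learns these tables.

```python
def embed_input(token_ids, segment_ids, tok_table, seg_table, pos_table):
    """Sum token, segment, and positional embeddings position by position."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [
            t + s + p
            for t, s, p in zip(tok_table[tok], seg_table[seg], pos_table[pos])
        ]
        vectors.append(vec)
    return vectors


# Toy embedding tables with hidden size 2 (hypothetical values).
tok_table = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]}
seg_table = {0: [0.0, 0.0], 1: [1.0, 1.0]}
pos_table = [[0.01, 0.01], [0.02, 0.02], [0.03, 0.03]]

# Three tokens: the first two belong to segment 0, the last to segment 1.
vectors = embed_input([0, 2, 1], [0, 0, 1], tok_table, seg_table, pos_table)
```

Each entry of `vectors` is the element-wise sum of one row from each table, so the same token receives a different input vector depending on its position and segment.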

3.2 PhoBERT

PhoBERT Nguyen and Nguyen (2020) is the first public large-scale monolingual language model pre-trained for Vietnamese. PhoBERT follows the architecture and pre-training approach of RoBERTa Liu et al. (2019), which optimizes the BERT pre-training procedure for more robust performance, and employs the implementation in fairseq Ott et al. (2019). The model is trained on a 20GB word-level Vietnamese news corpus with a maximum sequence length of 256.

3.3 ViBERT

ViBERT Bui et al. (2020) leverages checkpoints from the multilingual pre-trained model mBERT Devlin et al. (2018) and continues training on a monolingual Vietnamese corpus (10GB of syllable-level text). The authors also remove insignificant vocabulary entries from mBERT, keeping only the Vietnamese vocabulary.

3.4 mBERT

mBERT follows the same training procedure as BERT Devlin et al. (2018) on a multilingual dataset. The training languages are the 100 languages with the largest amount of text on Wikipedia, which include Vietnamese.

3.5 mBART

mBART Liu et al. (2020) is a sequence-to-sequence denoising encoder-decoder model that was pre-trained on large-scale monolingual corpora (including Vietnamese) using the BART objective and architecture Lewis et al. (2019).

3.6 mT5

mT5 Xue et al. (2020) is a multilingual variant of T5 Raffel et al. (2019), a large pre-trained encoder-decoder text-to-text model. mT5 inherits the benefits of T5, including its scale and its design, which was based on a large-scale empirical study Raffel et al. (2019).

4 Methods

Following the work described in Vaswani et al. (2017), a transformer-based sequence-to-sequence architecture is an encoder-decoder architecture with multi-layer self-attention. This architecture consists of two components: an encoder and a decoder.

Therefore, in order to generate a target sequence from an input sequence, a full transformer architecture with both an encoder and a decoder is required. PhoBERT and ViBERT are encoder-only models developed for encoding Vietnamese language representations, so they can be plugged directly into the encoder side of the architecture. To reuse them on the decoder side, however, we change the self-attention layers from bidirectional to left-context-only. We also insert a randomly initialized cross-attention mechanism into the decoder. We denote these models as PhoBERT2PhoBERT and ViBERT2ViBERT. Following the practice of Press and Wolf (2017), we tie the input embedding and output embedding in the decoder block, allowing the model to learn from the input embedding weights instead of random weights.
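Two of the decoder changes described above, left-context-only attention and tied embeddings, can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: `causal_mask` and `TiedDecoderEmbeddings` are hypothetical names, and the embedding table holds made-up values.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


class TiedDecoderEmbeddings:
    """Toy decoder embeddings whose output projection reuses the input table."""

    def __init__(self, table):
        self.input_embedding = table    # vocab_size x hidden_size
        self.output_embedding = table   # same object, so the weights are tied

    def logits(self, hidden):
        # Score each vocabulary entry by its dot product with the hidden state.
        return [
            sum(h * w for h, w in zip(hidden, row))
            for row in self.output_embedding
        ]


mask = causal_mask(4)
emb = TiedDecoderEmbeddings([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = emb.logits([0.5, 2.0])
```

Row `i` of `mask` allows attention only to positions `0..i`, which is what turns a bidirectional encoder layer into a left-context-only decoder layer. Because both embedding attributes point to the same table, any update to the input embeddings is also an update to the output projection.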

For existing encoder-decoder models like mT5 and mBART, we continue fine-tuning from the released checkpoints while strictly keeping their structure and vocabulary.

5 Datasets

| Statistic | Wikilingua | Vietnews |
|---|---|---|
| Train size | 13707 | 105418 |
| Dev size | 1957 | 22642 |
| Test size | 3916 | 22643 |
| Avg. number of words in body | 521 | 519 |
| Avg. number of words in abstract | 44 | 38 |

Table 1: Data statistics of the fine-tuning datasets

We test the VieSum models on two datasets: Wikilingua Ladhak et al. (2020) and Vietnews Nguyen et al. (2019). The statistics for these datasets are shown in Table 1.

5.1 Wikilingua

Wikilingua Ladhak et al. (2020) is a large-scale multilingual corpus for abstractive summarization tasks. The corpus covers 18 languages, including Vietnamese. The article and summary pairs are extracted from WikiHow, an online resource containing guidelines and how-to articles for a wide range of topics. These articles are written by human authors; to ensure quality and accuracy, human experts reviewed the content before inclusion in this study. Most of the non-English articles on the site are manually translated from the original English articles. Prior to publication, these foreign-language articles are reviewed by WikiHow's international translation team.

5.2 Vietnews

Vietnews Nguyen et al. (2019) collects news data from three well-known Vietnamese news websites. The collected articles were written from 2016 to 2019. The authors then removed all articles related to questionnaires, analytical comments, and weather forecasts, as they are not relevant to document summarization. The final corpus only contains news events.

| Models | WikiLingua Rouge-1 | WikiLingua Rouge-2 | WikiLingua Rouge-L | Vietnews Rouge-1 | Vietnews Rouge-2 | Vietnews Rouge-L |
|---|---|---|---|---|---|---|
| Transformer (RND2RND) | 46.25 | 16.57 | 29.82 | 57.56 | 24.25 | 35.53 |
| PhoBERT2RND | 46.72 | 17.00 | 30.13 | 57.6 | 24.18 | 35.5 |
| ViBERT2ViBERT | 53.08 | 20.18 | 31.79 | 59.75 | 27.29 | 36.79 |
| PhoBERT2PhoBERT | 50.4 | 19.88 | 32.49 | 60.37* | 29.12* | 39.44* |
| mBERT | 52.82 | 20.57 | 31.55 | 59.67 | 27.36 | 36.73 |
| mBART | 55.21† | 25.69† | 37.33† | 59.81† | 28.28† | 38.71† |
| mT5 | 55.27* | 27.63* | 38.3* | 58.05 | 26.76 | 37.38 |

  • Notes: The best scores are marked with * and the second-best scores with †.

Table 2: Test results on Wikilingua and Vietnews

6 Experimental Setup

6.1 Baselines

We verify the effectiveness of our proposed methods by comparing them with the Transformer architecture of Vaswani et al. (2017). The Transformer model has an encoder and a decoder, in which each encoder and decoder layer includes two major components: multi-head self-attention and a feed-forward network. These models are initialized with random weights and are labeled RND.
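The two components of a Transformer layer can be sketched as follows. This is a deliberately reduced illustration under stated assumptions: it uses a single attention head, no learned query/key/value projections, and no residual connections or layer normalization, all of which a full Transformer layer includes. The function names and the toy inputs are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output vector is a weighted average of the value vectors.
        out.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return out


def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: linear, ReLU, linear.

    w1 and w2 hold one weight vector per output unit.
    """
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row)) + b)
              for row, b in zip(w1, b1)]
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]


# Toy sequence of two 2-dimensional vectors attending to itself.
x = [[1.0, 0.0], [0.0, 1.0]]
attended = self_attention(x, x, x)
ffn_out = feed_forward([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                       [[1.0, 1.0]], [0.5])
```

In the RND setting described above, the weights of such layers would start from random values rather than from a pre-trained checkpoint.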

6.2 Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the number of overlapping units (including n-grams and word sequences) between the generated summary and the reference summary.

  • ROUGE-N: measures the overlap of unigrams, bigrams, trigrams, and higher-order n-grams. For our experiments, we use ROUGE-1 and ROUGE-2 (unigrams and bigrams).

  • ROUGE-L: measures longest matching sequence of words using Longest Common Subsequences (LCS) with the assumption that longer LCS between the generated summary and the reference summary shows more similarity (higher quality of results).
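The two metrics above can be sketched as follows. This is a minimal illustration, not the official ROUGE toolkit: it assumes whitespace tokenization, applies no stemming, and reports the F1 combination of precision and recall, which is an assumption since the text does not state which variant is reported. The example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """Combine precision (overlap/candidate) and recall (overlap/reference)."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1: clipped n-gram overlap between candidate and reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())  # & keeps the minimum count per n-gram
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence."""
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))


reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
r1 = rouge_n(candidate, reference, 1)
r2 = rouge_n(candidate, reference, 2)
rl = rouge_l(candidate, reference)
```

In this example the candidate shares five of six unigrams and three of five bigrams with the reference, and the longest common subsequence ("the cat on the mat") has length five.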

7 Results

We report the results of transformer encoder-decoder models on two datasets: Wikilingua and Vietnews in Table 2.

7.1 Wikilingua

For the Wikilingua dataset, a first obvious takeaway is that a pre-trained decoder is important for a transformer model to perform well on Vietnamese summarization tasks. PhoBERT2RND shows only a minor improvement over the untrained Transformer baseline. In contrast, incorporating a pre-trained decoder in PhoBERT2PhoBERT improves the model's understanding of the Vietnamese language and of summarization: the scores improve by 2 to 4 points across all metrics (Rouge-1, Rouge-2, and Rouge-L). This setting can therefore be further studied and improved by combining a larger decoder like GPT Brown et al. (2020), pre-trained on Vietnamese, with existing pre-trained encoders.
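The size of this improvement can be checked directly against the Wikilingua columns of Table 2. The short calculation below copies the PhoBERT2RND and PhoBERT2PhoBERT scores from the table; the variable names are chosen for illustration.

```python
# Wikilingua test scores copied from Table 2.
pho2rnd = {"Rouge-1": 46.72, "Rouge-2": 17.00, "Rouge-L": 30.13}
pho2pho = {"Rouge-1": 50.40, "Rouge-2": 19.88, "Rouge-L": 32.49}

# Absolute gain from replacing the random decoder with a pre-trained one.
gains = {metric: round(pho2pho[metric] - pho2rnd[metric], 2) for metric in pho2rnd}

for metric, gain in gains.items():
    print(f"{metric}: +{gain:.2f}")
```

The gains are 3.68, 2.88, and 2.36 points for Rouge-1, Rouge-2, and Rouge-L respectively, all within the reported range.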

mT5 has the highest score across all metrics (Rouge-1, Rouge-2, and Rouge-L), followed by mBART. PhoBERT2PhoBERT, which was pre-trained monolingually on Vietnamese, has the lowest Rouge-1 and Rouge-2 scores among the pre-trained encoder-decoder models. This result can be attributed to the training data. mT5 and mBART are trained on mC4 Raffel et al. (2019) and CC25 Lewis et al. (2019) respectively, both of which are extracted from a large Common Crawl corpus (Schwenk et al. (2020)). Common Crawl is a public web archive that provides text scraped from websites. These datasets (mC4, CC25, and Common Crawl) may include Vietnamese-language Wikipedia pages, which would help the models excel on a constrained-domain corpus (Wikipedia). We further discuss the effect of pre-training data on constrained-domain corpora in Section 7.2.

7.2 Vietnews

Following the discussion of constrained-domain training data in Section 7.1, PhoBERT2PhoBERT, which was pre-trained on 20GB of news text, excels on the Vietnews corpus. On the other hand, the large multilingual models mT5 and mBART do not show a significant increase in scores compared to the RND2RND baseline.

The importance of a pre-trained decoder also shows in the Vietnews corpus. PhoBERT2RND shows approximately the same result as the Transformer baseline.

8 Conclusion

In this manuscript, we test a series of available pre-trained Transformer models on Vietnamese abstractive summarization tasks. Following the transformer architecture proposed by Vaswani et al. (2017), we showed that adding a pre-trained decoder to an existing pre-trained encoder improves the model's understanding of the Vietnamese language and its summarization ability. We also show that the quality of the pre-training data affects the performance of the model on a constrained domain.


References

  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. CoRR abs/2005.14165.
  • T. V. Bui, O. T. Tran, and P. Le-Hong (2020) Improving sequence tagging for Vietnamese text using transformer-based neural models. CoRR abs/2006.15994.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
  • Z. Dou, P. Liu, H. Hayashi, Z. Jiang, and G. Neubig (2021) GSum: a general framework for guided neural abstractive summarization. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
  • S. Gehrmann, Y. Deng, and A. Rush (2018) Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4098–4109.
  • F. Ladhak, E. Durmus, C. Cardie, and K. R. McKeown (2020) WikiLingua: a new benchmark dataset for cross-lingual abstractive summarization. CoRR abs/2010.03093.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR abs/1910.13461.
  • Y. Liu and M. Lapata (2019) Text summarization with pretrained encoders. In EMNLP/IJCNLP.
  • Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. CoRR abs/2001.08210.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
  • L. Manor and J. J. Li (2019) Plain English summarization of contracts. CoRR abs/1906.00424.
  • D. Q. Nguyen and A. T. Nguyen (2020) PhoBERT: pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1037–1042.
  • M. Nguyen, H. Nguyen, T. Nguyen, and V. Nguyen (2018) Towards state-of-the-art baselines for Vietnamese multi-document summarization. In 2018 10th International Conference on Knowledge and Systems Engineering (KSE), pp. 85–90.
  • V. Nguyen, T. Nguyen, M. Nguyen, and N. Hoai (2019) VNDS: a Vietnamese dataset for summarization. pp. 375–380.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. CoRR abs/1904.01038.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. CoRR abs/1802.05365.
  • O. Press and L. Wolf (2017) Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 157–163.
  • H. T. Quoc, K. V. Nguyen, N. L. Nguyen, and A. G. Nguyen (2021) Monolingual versus multilingual BERTology for Vietnamese extractive multi-document summarization. arXiv:2108.13741.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. CoRR abs/1910.10683.
  • S. Rothe, S. Narayan, and A. Severyn (2019) Leveraging pre-trained checkpoints for sequence generation tasks. CoRR abs/1907.12461.
  • H. Schwenk, G. Wenzek, S. Edunov, E. Grave, and A. Joulin (2020) CCMatrix: mining billions of high-quality parallel sentences on the web. arXiv:1911.04944.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762.
  • R. Xu, C. Zhu, Y. Shi, M. Zeng, and X. Huang (2020) Mixed-lingual pre-training for cross-lingual summarization. CoRR abs/2010.08892.
  • L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel (2020) mT5: a massively multilingual pre-trained text-to-text transformer. CoRR abs/2010.11934.
  • A. Zagar and M. Robnik-Sikonja (2020) Cross-lingual approach to abstractive summarization. CoRR abs/2012.04307.