BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

by Nguyen Luong Tran, et al.

We present BARTpho with two versions, BARTpho_word and BARTpho_syllable, the first public large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese. BARTpho uses the "large" architecture and pre-training scheme of the sequence-to-sequence denoising model BART, and is thus especially suitable for generative NLP tasks. Experiments on a downstream task of Vietnamese text summarization show that, in both automatic and human evaluations, BARTpho outperforms the strong baseline mBART and improves the state-of-the-art. We release BARTpho to facilitate future research and applications of generative Vietnamese NLP tasks. Our BARTpho models are available at:









The masked language model BERT (Devlin et al., 2019) and its variants, pre-trained on large-scale corpora, help improve the state-of-the-art (SOTA) performance of various natural language understanding tasks. However, due to their bidirectional nature, it might be difficult to directly apply those pre-trained language models to natural language generation tasks (Wang and Cho, 2019). Therefore, pre-trained sequence-to-sequence (seq2seq) models have been proposed to handle this issue (Dong et al., 2019; Lewis et al., 2020; Zhang et al., 2020; Raffel et al., 2020; Qi et al., 2020). The success of these pre-trained seq2seq models has largely been limited to the English language. For other languages, one could employ existing pre-trained multilingual seq2seq models (Liu et al., 2020; Xue et al., 2021; Qi et al., 2021) or retrain a language-specific model using the proposed seq2seq architecture (Eddine, Tixier, and Vazirgiannis, 2020). It is worth noting that retraining a language-specific model might be preferable, as dedicated language-specific models still outperform multilingual ones (Nguyen and Nguyen, 2020). To the best of our knowledge, there is no existing monolingual seq2seq model pre-trained for Vietnamese.

In this paper, we introduce BARTpho with two versions, BARTpho_word and BARTpho_syllable, the first large-scale monolingual seq2seq models pre-trained for Vietnamese. The two versions differ in the type of input text they take: syllable level for BARTpho_syllable (e.g. the 7-syllable written text “chúng tôi là những nghiên cứu viên”, meaning "we are researchers") vs. word level for BARTpho_word (e.g. the 4-word text “chúng_tôi là những nghiên_cứu_viên”, where “chúng_tôi” means "we", “là” means "are" and “nghiên_cứu_viên” means "researcher"). We compare BARTpho with mBART (Liu et al., 2020), a multilingual variant of BART (Lewis et al., 2020), on a downstream task of Vietnamese text summarization. We find that our BARTpho models outperform mBART in both automatic and human evaluations, and help produce a new SOTA performance. We publicly release BARTpho, which can be used with the popular libraries fairseq (Ott et al., 2019) and transformers (Wolf et al., 2020). We hope that BARTpho can serve as a strong baseline for future research and applications of generative NLP tasks for Vietnamese.

Our BARTpho

This section describes the architecture, the pre-training data and the optimization setup that we use for BARTpho.


Architecture

Both BARTpho_word and BARTpho_syllable use the “large” architecture and pre-training scheme of the seq2seq denoising autoencoder BART (Lewis et al., 2020). In particular, pre-training BART has two stages: (i) corrupting the input text with an arbitrary noising function, and (ii) learning to reconstruct the original text, i.e. optimizing the cross-entropy between its decoder’s output and the original text. Here, BART uses the standard Transformer architecture (Vaswani et al., 2017), but employs the GeLU activation function (Hendrycks and Gimpel, 2016) rather than ReLU and initializes parameters from $\mathcal{N}(0, 0.02)$. Following Liu et al. (2020), we also add a layer-normalization layer on top of both the encoder and decoder. Following Lewis et al. (2020), we employ two types of noise in the noising function: text infilling and sentence permutation. For text infilling, we sample a number of text spans with their lengths drawn from a Poisson distribution ($\lambda = 3.5$), and replace each span by a single special mask token. For sentence permutation, consecutive sentences are grouped to generate sentence blocks of 512 tokens, and sentences in each block are then shuffled in random order.
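The two noise types described above can be illustrated with a minimal sketch. It is not the fairseq implementation: the function names, the `<mask>` string and the 30% masking budget (`mask_ratio`) are assumptions made for illustration, and only the Poisson parameter of 3.5 and the 512-token block size come from the text.

```python
import math
import random

MASK = "<mask>"  # placeholder name for the special mask token (assumption)


def sample_poisson(lam, rng):
    """Draw one sample from Poisson(lam) using Knuth's method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def text_infilling(tokens, mask_ratio=0.3, lam=3.5, rng=None):
    """Replace sampled spans with a single mask token each.

    mask_ratio is an assumed budget: the share of tokens allowed to be masked.
    A span of length 0 inserts a mask token without removing anything.
    """
    if rng is None:
        rng = random.Random()
    out = list(tokens)
    budget = int(round(len(tokens) * mask_ratio))
    while budget > 0 and out:
        span = min(sample_poisson(lam, rng), budget, len(out))
        start = rng.randrange(len(out) - span + 1)
        out[start:start + span] = [MASK]
        budget -= max(span, 1)  # always make progress, even for empty spans
    return out


def permute_sentences(sentences, block_size=512, rng=None):
    """Group consecutive tokenized sentences into blocks, then shuffle each block."""
    if rng is None:
        rng = random.Random()
    blocks, current, length = [], [], 0
    for sent in sentences:
        if current and length + len(sent) > block_size:
            blocks.append(current)
            current, length = [], 0
        current.append(sent)
        length += len(sent)
    if current:
        blocks.append(current)
    for block in blocks:
        rng.shuffle(block)
    return blocks
```

In this simplified version a later span may overlap an earlier mask token, which a production implementation would normally avoid.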

Pre-training data

For BARTpho_word, we employ the PhoBERT pre-training corpus (Nguyen and Nguyen, 2020), which contains 20GB of uncompressed texts (about 145M word-segmented sentences). In addition, we reuse PhoBERT’s tokenizer, which applies a vocabulary of 64K subword types and BPE (Sennrich, Haddow, and Birch, 2016) to segment those word-segmented sentences into subword units. Our BARTpho_word thus has about 420M parameters. Pre-training data for BARTpho_syllable is a detokenized variant of the PhoBERT pre-training corpus. We employ the pre-trained SentencePiece model (Kudo and Richardson, 2018) used in mBART (Liu et al., 2020) to segment sentences into sub-syllable units and select a vocabulary of the top 40K most frequent types. The number of parameters of our BARTpho_syllable is 396M.
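As a rough illustration of the relation between the two input types, the sketch below converts word-segmented text to syllable-level text by splitting the underscore-joined syllables of each word. The helper names are hypothetical, and the actual detokenization used for the corpus may involve further steps (e.g. punctuation spacing) that are not described here.

```python
def to_syllable_level(word_segmented: str) -> str:
    """Turn word-segmented text into syllable-level text.

    Assumes underscores only ever join the syllables of one word,
    so a literal underscore in the source text would be mishandled.
    """
    return word_segmented.replace("_", " ")


def count_units(text: str) -> int:
    """Count whitespace-separated units (words or syllables)."""
    return len(text.split())


word_level = "chúng_tôi là những nghiên_cứu_viên"
syllable_level = to_syllable_level(word_level)
```

With the example sentence from the introduction, the word-level form has 4 units and the syllable-level form has 7.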


Optimization

We utilize the BART implementation with the denoising task from fairseq (Ott et al., 2019). We use Adam (Kingma and Ba, 2015) for optimization, with a batch size of 512 sequence blocks across 8 A100 GPUs (40GB each) and a peak learning rate of 0.0001. Note that we initialize the parameter weights of BARTpho_syllable with those from mBART. For each BARTpho model, we run for 15 training epochs in about 6 days (here, the learning rate is warmed up for 1.5 epochs).
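The learning-rate schedule can be sketched as a function of training progress measured in epochs. The peak value, the 1.5-epoch warm-up and the 15-epoch budget come from the text; the linear warm-up from zero and the linear decay to zero afterwards are assumptions, since the decay shape is not stated.

```python
def learning_rate(epoch, peak_lr=1e-4, warmup_epochs=1.5, total_epochs=15.0):
    """Learning rate at a (fractional) epoch.

    Assumed shape: linear warm-up from 0 to peak_lr, then linear decay to 0.
    """
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    remaining = (total_epochs - epoch) / (total_epochs - warmup_epochs)
    return peak_lr * max(remaining, 0.0)
```

Under these assumptions the rate reaches its peak exactly at epoch 1.5 and returns to zero at epoch 15.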


Experimental setup

We evaluate and compare the performance of BARTpho with the strong baseline mBART on a downstream generative task of Vietnamese text summarization. Here, mBART is pre-trained on a Common Crawl dataset of 25 languages, which contains 137 GB of syllable-level Vietnamese texts.

We employ the single-document summarization dataset VNDS (Nguyen et al., 2019), consisting of 150704 word-level news articles, each including a news abstract (i.e. gold summary) and body content (i.e. input text). In particular, 105418, 22642 and 22644 articles are used for training, validation and test, respectively. However, we find that there are duplicate articles in this dataset. Therefore, we filter the duplicates, resulting in 102044, 21040 and 20733 articles for training, validation and test, respectively. When fine-tuning BARTpho_syllable and mBART, we use a detokenized version of the filtered dataset, while its word-level (i.e. word-segmented) version is used for fine-tuning BARTpho_word.

We formulate this summarization task as a monolingual translation problem, and fine-tune our BARTpho and the baseline mBART using the same hyper-parameter tuning strategy. We fix the maximum number of tokens in a batch at 4096. We use Adam and run for 20 training epochs. We also perform grid search to select the Adam initial learning rate from {1e-5, 2e-5, 3e-5, 5e-5}. We employ beam search with a beam size of 4 for decoding. We evaluate each model 4 times in every epoch. We select the model checkpoint that produces the highest ROUGE-L score (Lin, 2004) on the validation set, and then apply the selected one to the test set. Note that we compute the detokenized and case-sensitive ROUGE scores for all models (here, we detokenize the fine-tuned BARTpho_word’s output before computing the scores).
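The checkpoint-selection step can be sketched as follows. This is a simplified, sentence-level ROUGE-L F1 based on the longest common subsequence over whitespace tokens, not the official ROUGE package; the function names and the checkpoint labels in the usage note are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Case-sensitive ROUGE-L F1 over whitespace tokens (simplified variant)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def select_checkpoint(validation_scores):
    """Return the checkpoint name with the highest validation ROUGE-L."""
    return max(validation_scores, key=validation_scores.get)
```

For example, given validation scores such as `{"ckpt_a": 39.1, "ckpt_b": 39.8}` (made-up values), `select_checkpoint` returns `"ckpt_b"`, and only that checkpoint would then be run on the test set.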

Main results

Table 1 presents our obtained scores on the validation and test sets for the baseline mBART and our two BARTpho versions. Both BARTpho versions achieve significantly better ROUGE scores than mBART on both validation and test sets. It is not entirely surprising that BARTpho_syllable achieves higher ROUGE-1 scores, but lower ROUGE-L scores, than BARTpho_word. For comparison with previously published word-level results (Nguyen et al., 2019), we apply our fine-tuned BARTpho_word to the original test set of 22644 word-segmented articles, and obtain a word-level ROUGE-L score of 40.03. Note that Nguyen et al. (2019) use a larger training set of 105418 articles, in which 1470 training articles also appear in the test set. However, the highest ROUGE-L score among the 13 different models reported by Nguyen et al. (2019) is 37.64, which is still 2.4 points lower than ours. Clearly, BARTpho helps attain a new SOTA performance for this task.

Model        Validation set           Test set
             R-1    R-2    R-L        R-1    R-2    R-L    Human
mBART        60.25  28.94  39.04      60.23  28.92  39.09  21/100
BARTpho_s    60.73  29.90  39.67      60.77  29.86  39.75  37/100
BARTpho_w    60.41  29.93  39.82      60.38  29.86  39.85  42/100
Table 1: ROUGE scores (in %). R-1, R-2, R-L, BARTpho_s and BARTpho_w abbreviate ROUGE-1, ROUGE-2, ROUGE-L, BARTpho_syllable and BARTpho_word, respectively. The Human column reports, for the test set, how many of the 100 sampled summaries per model were judged the best. Every score difference between mBART and each BARTpho version is statistically significant with p-value < 0.05.
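The test-set margins over mBART implied by Table 1 can be recomputed with a few lines of arithmetic. The scores below are copied from the table; the dictionary layout and the helper name are only illustrative.

```python
# Test-set ROUGE scores (in %) copied from Table 1.
TEST_SCORES = {
    "mBART": {"R-1": 60.23, "R-2": 28.92, "R-L": 39.09},
    "BARTpho_syllable": {"R-1": 60.77, "R-2": 29.86, "R-L": 39.75},
    "BARTpho_word": {"R-1": 60.38, "R-2": 29.86, "R-L": 39.85},
}


def margin_over(baseline, model, scores=TEST_SCORES):
    """Per-metric difference (model minus baseline), rounded to 2 decimals."""
    return {
        metric: round(scores[model][metric] - scores[baseline][metric], 2)
        for metric in scores[model]
    }


syllable_margin = margin_over("mBART", "BARTpho_syllable")
word_margin = margin_over("mBART", "BARTpho_word")
```

On the test set this gives gains of 0.54, 0.94 and 0.66 points (R-1, R-2, R-L) for BARTpho_syllable and 0.15, 0.94 and 0.76 points for BARTpho_word.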

We also conduct a human-based manual comparison between the outputs produced by the baseline mBART and our two BARTpho versions. In particular, we randomly sample 100 input text examples from the test set; and for each input example, we anonymously shuffle the summary outputs from the three fine-tuned models (here, each sampled input example is such that no two of the three summary outputs are exactly the same). We then ask two external Vietnamese annotators to choose which summary they think is the best. We obtain a Cohen’s kappa coefficient of 0.61 for the inter-annotator agreement between the two annotators. Our second co-author then hosts and participates in a discussion session with the two annotators to resolve annotation conflicts (here, he does not know which model produces which summary). Table 1 shows the final scores, where BARTpho obtains better human evaluation results than mBART.
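For readers unfamiliar with the agreement measure, the sketch below computes Cohen's kappa for two annotators who each pick one label per item. The function name and the toy labels in the usage note are hypothetical; they are not the annotations collected in this study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' label proportions.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)
```

As a toy usage, `cohens_kappa(["w", "w", "s", "m", "w", "s"], ["w", "s", "s", "m", "w", "w"])` has an observed agreement of 4/6 and a chance agreement of 14/36, which gives a kappa of 10/22.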

Our automatic and human evaluation results from Table 1 show the effectiveness of large-scale BART-based monolingual seq2seq models for Vietnamese. Though mBART uses 137 / 20 ≈ 7 times more Vietnamese pre-training data, it is surpassed by BARTpho. This reconfirms that a dedicated language-specific model still performs better than a multilingual one (Nguyen and Nguyen, 2020).


We have presented BARTpho_word and BARTpho_syllable, the first large-scale monolingual seq2seq models pre-trained for Vietnamese. We demonstrate the usefulness of BARTpho by showing that it performs better than its competitor mBART and helps produce the SOTA performance for the Vietnamese text summarization task. We hope that our public BARTpho models can foster future research and applications of generative Vietnamese NLP tasks.


  • Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL, 4171–4186.
  • Dong et al. (2019) Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. In Proceedings of NeurIPS, volume 32.
  • Eddine, Tixier, and Vazirgiannis (2020) Eddine, M. K.; Tixier, A. J.-P.; and Vazirgiannis, M. 2020. BARThez: a Skilled Pretrained French Sequence-to-Sequence Model. arXiv preprint, arXiv:2010.12321.
  • Hendrycks and Gimpel (2016) Hendrycks, D.; and Gimpel, K. 2016. Gaussian Error Linear Units (GELUs). arXiv preprint, arXiv:1606.08415.
  • Kingma and Ba (2015) Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of ICLR.
  • Kudo and Richardson (2018) Kudo, T.; and Richardson, J. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of EMNLP: System Demonstrations, 66–71.
  • Lewis et al. (2020) Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of ACL, 7871–7880.
  • Lin (2004) Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74–81.
  • Liu et al. (2020) Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; and Zettlemoyer, L. 2020. Multilingual Denoising Pre-training for Neural Machine Translation. Transactions of the ACL, 8: 726–742.
  • Nguyen and Nguyen (2020) Nguyen, D. Q.; and Nguyen, A. T. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of EMNLP, 1037–1042.
  • Nguyen et al. (2019) Nguyen, V.-H.; Nguyen, T.-C.; Nguyen, M.-T.; and Hoai, N. X. 2019. VNDS: A Vietnamese Dataset for Summarization. In Proceedings of NICS, 375–380.
  • Ott et al. (2019) Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; and Auli, M. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, 48–53.
  • Qi et al. (2021) Qi, W.; Gong, Y.; Yan, Y.; Xu, C.; Yao, B.; Zhou, B.; Cheng, B.; Jiang, D.; Chen, J.; Zhang, R.; Li, H.; and Duan, N. 2021. ProphetNet-X: Large-Scale Pre-training Models for English, Chinese, Multi-lingual, Dialog, and Code Generation. In Proceedings of ACL: System Demonstrations, 232–239.
  • Qi et al. (2020) Qi, W.; Yan, Y.; Gong, Y.; Liu, D.; Duan, N.; Chen, J.; Zhang, R.; and Zhou, M. 2020. ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training. In Findings of EMNLP, 2401–2410.
  • Raffel et al. (2020) Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140): 1–67.
  • Sennrich, Haddow, and Birch (2016) Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of ACL, 1715–1725.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is All you Need. In Proceedings of NIPS, volume 30.
  • Wang and Cho (2019) Wang, A.; and Cho, K. 2019. BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model. arXiv preprint, arXiv:1902.04094.
  • Wolf et al. (2020) Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Le Scao, T.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of EMNLP 2020: System Demonstrations, 38–45.
  • Xue et al. (2021) Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; and Raffel, C. 2021. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of NAACL, 483–498.
  • Zhang et al. (2020) Zhang, J.; Zhao, Y.; Saleh, M.; and Liu, P. 2020. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In Proceedings of ICML, 11328–11339.