ParsBERT: Transformer-based Model for Persian Language Understanding

by   Mehrdad Farahani, et al.
Shahed University

The surge of pre-trained language models has begun a new era in the field of Natural Language Processing (NLP) by allowing us to build powerful language models. Among these models, Transformer-based models such as BERT have become increasingly popular due to their state-of-the-art performance. However, these models are usually focused on English, leaving other languages to multilingual models with limited resources. This paper proposes a monolingual BERT for the Persian language (ParsBERT), which achieves state-of-the-art performance compared to other architectures and multilingual models. Also, since the amount of data available for NLP tasks in Persian is very restricted, a massive dataset is composed both for different NLP tasks and for pre-training the model. ParsBERT obtains higher scores on all datasets, including existing ones as well as composed ones, and improves the state of the art by outperforming both multilingual BERT and other prior works in Sentiment Analysis, Text Classification, and Named Entity Recognition tasks.




1 Introduction

Natural language is the tool humans use to communicate with each other. Thus, a vast amount of data is encoded as text using this tool. Extracting meaningful information from this type of data and manipulating it using computers lie within the field of Natural Language Processing (NLP). There are different NLP tasks such as Named Entity Recognition (NER), Sentiment Analysis (SA), and Question Answering, each focusing on a particular aspect of the text data. To achieve successful performance on each of these tasks, a variety of pre-trained word embedding and language modeling methods have been proposed in recent years.

Word2Vec [mikolov2013distributed] and GloVe [pennington2014glove] are pre-trained word embedding methods based on Neural Networks (NNs) that investigate the semantic, syntactic, and logical relationships between words in a sequence to provide static word representation vectors based on the training data. While these methods leave the context of the input sequence out of the equation, contextualized word embedding methods such as ELMo [peters2018deep] provide dynamic word embeddings by taking the context into account.

There are two approaches towards pre-trained language representations [devlin2018bert]: feature-based, such as ELMo, and fine-tuning, such as OpenAI GPT [radford2018improving]. Fine-tuning approaches (also known as Transfer Learning methods) seek to train a language model with large datasets of unlabeled plain texts. The parameters of these models are then fine-tuned using task-specific data to achieve state-of-the-art performance over various NLP tasks [devlin2018bert, radford2018improving, liu2019roberta]. The fine-tuning phase, relative to pre-training, requires much less energy and time. Therefore, pre-trained language models can be used to save energy, time, and cost. However, this comes with specific challenges. The amount of data and the computational resources required to pre-train an efficient language model with acceptable performance are substantial: hundreds of gigabytes of text documents and hundreds of Graphics Processing Units (GPUs) [yang2019xlnet, liu2019roberta, raffel2019exploring, conneau2019unsupervised].

As a solution, multilingual models have been developed, which can be beneficial for languages with similar morphology and syntactic structure (e.g., Latin-based languages). Other, non-Latin languages differ significantly from Latin-based languages and cannot benefit from their shared representations. Therefore, a language-specific approach should be adopted. For instance, the framework of Recurrent Neural Network (RNN), along with morpheme representation, is proposed to overcome feature engineering and data sparsity for the Mongolian NER task.


A similar situation applies to the Persian language. Although some multilingual models include Persian, they are likely to fall behind monolingual models that are trained specifically on a language-specific vocabulary with larger amounts of Persian text data. To the best of our knowledge, no specific effort has been made to pre-train a Bidirectional Encoder Representations from Transformers (BERT) [devlin2018bert] model for the Persian language.

In this paper, we take advantage of the BERT architecture [devlin2018bert] to build a pre-trained language model for the Persian Language, which we call ParsBERT hereafter. We evaluate this model on three Persian NLP downstream tasks: (a) Sentiment Analysis, (b) Text Classification, and (c) Named Entity Recognition. We show that for all these tasks, ParsBERT outperforms several baselines, including previous multilingual and monolingual models. Thus, our contribution can be summarized as follows:

  • We propose a monolingual Persian language model (ParsBERT) based on the BERT architecture.

  • ParsBERT achieves better performance than other multilingual and deep-hybrid architectures.

  • ParsBERT is lighter than the original multilingual BERT model.

  • In the course of this work, we provide a massive set of Persian text corpora and NLP task datasets for other use cases.

The rest of this paper is organized as follows. Section 2 provides a comprehensive study of previous related works. Section 3 outlines the methodology used to pre-train ParsBERT. Section 4 describes the NLP downstream tasks and benchmark datasets on which the model is evaluated. Section 5 provides a thorough discussion of the obtained results. Section 6 concludes this paper by providing a guideline for possible future works. Finally, Section 7 acknowledges everyone who supported this research and made it possible.

2 Related Work

2.1 Language Modelling

Language modeling has gained popularity in recent years, and many works have been dedicated to building models for different languages based on varying contexts. Some works have sought to build character-level models. For example, a character-level model with an RNN is presented in [huang2019c]. This model reasons about word spelling and grammar dynamically. Another multi-task character-level attentional network model for medical concepts has been used to address the Out-Of-Vocabulary (OOV) problem and to preserve morphological information inside the concept [niu2019multi].

Contextualized language modeling is centered around the idea that words can be represented differently based on the context in which they appear. Encoder-decoder language models, sequence autoencoders, and sequence-to-sequence models embody this concept [dai2015semi, ramachandran2016unsupervised, sutskever2014sequence]. ELMo and ULMFiT [howard2018universal] are contextualized language models pre-trained on large general-domain corpora. They are both based on LSTM networks [hochreiter1997long]; ULMFiT benefits from a regular multi-layer LSTM network, while ELMo utilizes a bidirectional LSTM structure to predict both the next and the previous words in a sequence of words. It then composes the final embedding for each token by concatenating the left-to-right and the right-to-left representations. Both ULMFiT and ELMo show considerable improvement in downstream tasks compared to preceding language models and word embedding methods.

Another candidate for sequence-to-sequence mapping is the Transformer model [vaswani2017attention], which is based on the attention mechanism to evaluate dependencies between input/output sequences. Unlike LSTM, this model does not incorporate any recurrence. The Transformer model depends on two entities named encoder and decoder; the encoder takes the input sequence and maps it to a higher dimensional vector. This vector is then mapped to an output sequence by the decoder. Several pre-trained language modeling architectures are based on the transformer model, namely GPT [radford2018improving] and BERT [devlin2018bert].

GPT includes a stack of twelve Transformer decoders. However, its structure is unidirectional, meaning that each token attends only to the tokens that precede it in the sequence. On the other hand, BERT performs joint conditioning on both left and right contexts by using a Masked Language Model (MLM) objective and a stack of Transformer encoders. This way, BERT achieves an accurate pre-trained deep bidirectional representation. There are other Transformer-based architectures such as XLNet [yang2019xlnet], RoBERTa [liu2019roberta], XLM [lample2019cross], T5 [raffel2019exploring], and ALBERT [lan2019albert], all of which have presented state-of-the-art results on multiple NLP benchmarks such as GLUE [wang2018glue] and SQuAD [rajpurkar2018know].

Monolingual pre-trained models have been developed for several languages other than English. ELMo models are available for Portuguese, Japanese, German, and Basque. Regarding BERT-based models, BERTje for Dutch [de2019bertje], AlBERTo for Italian [polignano2019alberto], AraBERT for Arabic [AraBert], and other models for Finnish [virtanen2019multilingual], Russian [kuratov2019adaptation], and Portuguese [souza2019portuguese] have been released.

For the Persian language, several word embeddings such as Word2Vec, GloVe, and FastText [grave2018learning] have been presented. All of these word embedding models are trained on the Wikipedia corpus. A thorough comparison between these models is provided in [zahedi2018persian] and shows that FastText and Word2Vec outperform other models. An LSTM-based language model for Persian is presented in [saravani2018persian]. This model utilizes word embeddings as word representations and achieves its best performance with a two-layer bidirectional LSTM network.

2.2 NLP Downstream Tasks

Although several works have addressed NLP downstream tasks such as NER and Sentiment Analysis for the Persian language, the subject of pre-trained networks in the Persian language is a new topic. Most of the work done in this area is centered around machine learning or neural network methods built from scratch for each task, since these approaches cannot be fine-tuned. For instance, a machine learning-based approach for Persian NER, using a Hidden Markov Model (HMM), is presented in [ahmadi2015hybrid]. Another approach for Persian NER, which incorporates a rule-based grammatical approach, is provided by [dashtipour2017persian]. Moreover, a Deep Learning approach for Persian NER that employs bidirectional LSTM networks is provided in [bokaei2018improved]. Beheshti-NER [taher2020beheshti] uses multilingual Google BERT to form a fine-tuned model for Persian NER and is the closest work to the present work. However, it only involves a fine-tuning phase for NER and does not entail developing a monolingual BERT-based model for the Persian language.

The same situation applies to Persian sentiment analysis. In [dastgheib2020application], a hybrid combination of Convolutional Neural Networks (CNN) and Structural Correspondence Learning is presented to improve sentiment classification. Also, a graph-based text representation along with Deep Neural Learning is proposed in [bijari2020leveraging]. The closest work in sentiment analysis to the present work is DeepSentiPers [sharami2020deepsentipers], which leverages CNN and bidirectional LSTM networks combined with FastText, trained over a balanced and augmented version of a Persian sentiment dataset known as SentiPers [hosseini2018sentipers].

It should be noted that, apart from Beheshti-NER, which fine-tunes multilingual BERT, none of these works uses pre-trained networks, and all of them focus solely on designing and combining methods to produce a task-specific approach.

3 ParsBERT: Methodology

In this section, the methodology of our proposed model is presented. It consists of five main tasks, of which the first three concern the dataset and the next two concern model development. These tasks are data gathering, data pre-processing, accurate sentence segmentation, pre-training setup, and fine-tuning.

3.1 Data Gathering

Although a few Persian text corpora are provided by the University of Leipzig [goldhahn-etal-2012-building] and University of Sorbonne [ortizsuarez:hal-02148693], the sentences in those corpora do not follow a logical corpus-level order and are somewhat erroneous. Also, these resources cover only a limited number of writing styles and subjects. Therefore, to increase the generality and efficiency of our pre-trained model at the word, phrase, and sentence levels, it was necessary to compose a new corpus from scratch to tackle the limitations mentioned earlier. This was done by crawling many sources such as Persian Wikipedia, BigBangPage, Chetor, Eligasht, Digikala, Ted Talks subtitles, several fictional books and novels, and MirasText [SABETI18.385]. The latter source was crawled from more than 250 Persian news websites. Table 1 shows the statistics of our general-domain corpus:

# | Source | Type | Total Documents
1 | Persian Wikipedia | General (encyclopedia) | 1,119,521
2 | BigBang Page | Scientific | 135
3 | Chetor | Lifestyle | 3,583
4 | Eligasht | Itinerary | 9,629
5 | Digikala | Digital magazine | 8,645
6 | Ted Talks | General (conversational) | 2,475
7 | Books | Novels, storybooks, short stories from old to the contemporary era | 13
8 | Miras-Text | News categories | 2,835,414
Table 1: Statistics and types of each source in the proposed corpus, entailing a varied range of written styles.

3.2 Data Pre-Processing

After gathering the pre-training corpus, an extensive hierarchy of processing steps, including cleaning, replacing, sanitizing, and normalizing, is needed to transform the dataset into a proper format. This is done via a two-step process, illustrated in Figure 1.

(a) Step 1
(b) Step 2
Figure 1: Specific Persian corpus pre-processing that includes two steps: (a) removing all the trivial and junk characters and (b) standardizing the corpus with respect to Persian characters.
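The two steps in Figure 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the choice of which characters count as junk (Unicode control and format characters, except the zero-width non-joiner that Persian orthography uses), and the small Arabic-to-Persian character map are all hypothetical.

```python
import re
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner, meaningful in Persian spelling

# Hypothetical mapping from Arabic code points that often appear in
# Persian text to their standard Persian counterparts.
ARABIC_TO_PERSIAN = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}
# Arabic-Indic digits (U+0660..U+0669) -> Persian digits (U+06F0..U+06F9)
for i in range(10):
    ARABIC_TO_PERSIAN[chr(0x0660 + i)] = chr(0x06F0 + i)

TRANSLATION = str.maketrans(ARABIC_TO_PERSIAN)


def remove_junk(text: str) -> str:
    """Step (a), assumed: drop control/format characters, collapse spaces."""
    kept = []
    for ch in text:
        if ch == ZWNJ or ch in "\n\t":
            kept.append(ch)
        elif unicodedata.category(ch).startswith("C"):
            continue  # control, format, unassigned, private-use characters
        else:
            kept.append(ch)
    return re.sub(r"[ \t]+", " ", "".join(kept)).strip()


def standardize(text: str) -> str:
    """Step (b), assumed: map Arabic variants to Persian characters."""
    return text.translate(TRANSLATION)


def preprocess(text: str) -> str:
    return standardize(remove_junk(text))
```

A real pipeline would likely need a much larger character map and rules for URLs, markup, and emoji; the sketch only shows the shape of a clean-then-standardize pass.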

3.3 Document Segmentation into True Sentences

After the corpus is pre-processed, each document should be segmented into True Sentences to achieve good pre-training results. A True Sentence in Persian is recognized based on the punctuation marks [?!.:]. However, dividing content based merely on these marks has been shown to cause problems. Figure 2 illustrates an example of such issues. It can be seen that the result includes short, meaningless sentences without any vital information because there are abbreviations in Persian separated with the dot (.) notation. As an alternative, Part-Of-Speech (POS) tagging can be a proper solution to handle these types of errors and to produce the desired outputs.

(a) Notation Segmentation
(b) POS Segmentation
Figure 2: Example of segmenting a document into its sentences based on (a) only writing notations and (b) POS
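The failure mode of punctuation-only segmentation is easy to reproduce. The sketch below splits on the marks listed above and is applied to an English stand-in sentence (an assumption made for readability; the paper's example is Persian). It does not implement the POS-based alternative, whose details are not given here.

```python
import re

# Split on whitespace that follows one of the marks ? ! . :
SENTENCE_END = re.compile(r"(?<=[?!.:])\s+")


def naive_split(document: str) -> list:
    """Segment a document using punctuation marks only."""
    return [s for s in SENTENCE_END.split(document.strip()) if s]


# Abbreviations containing dots are wrongly treated as sentence ends.
example = "Dr. Smith arrived at 5 p.m. today. He left early."
segments = naive_split(example)
# segments == ["Dr.", "Smith arrived at 5 p.m.", "today.", "He left early."]
```

The fragments "Dr." and "today." carry no standalone meaning, which is the kind of short, uninformative sentence the text describes.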

This procedure enables the system to learn the real relationship between the sentences in each document. Table 2 shows the statistics for the pre-training corpus segmented with the POS approach, resulting in 38,269,471 lines of True Sentences.

# | Source | Total True Sentences
1 | Persian Wikipedia | 1,878,008
2 | BigBang Page | 3,017
3 | Chetor | 166,312
4 | Eligasht | 214,328
5 | Digikala | 177,357
6 | Ted Talks | 46,833
7 | Books | 25,335
8 | Miras-Text | 35,758,281
Table 2: Statistics of the pre-training corpus.

3.4 Pre-training Setup

Our model is based on the BERT model architecture [devlin2018bert], which includes a multi-layer bidirectional Transformer. In particular, we use the original BERT BASE configuration: 12 hidden layers, 12 attention heads, and a hidden size of 768. The total number of parameters in this configuration is 110M. As per the original BERT pre-training objective, our pre-training objective consists of two tasks:

  1. A Masked Language Model (MLM) is employed to train the model to predict randomly masked tokens using a cross-entropy loss. For this purpose, given N tokens, 15% of them are selected at random. Of these selected tokens, 80% are replaced by the special [MASK] token, 10% are replaced with a random token, and 10% remain unchanged.

  2. A Next Sentence Prediction (NSP) task is employed, in which the model learns to predict whether the second sentence in a pair of sentences is the actual next sentence of the first one or not. In the original BERT paper [devlin2018bert], it has been argued that removing NSP from pre-training can degrade the performance of the model on some tasks. Therefore, we employ NSP in our model to ensure high efficiency on different tasks.
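The MLM corruption rule in the first task can be sketched as below. This is a minimal illustration, assuming tokens are plain strings and that random replacements are drawn uniformly from a vocabulary list; the function name and signature are hypothetical and not taken from any BERT codebase.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels) following the 15% / 80-10-10 rule.

    labels[i] holds the original token at selected positions (the
    prediction targets) and None elsewhere.
    """
    if not tokens:
        return [], []
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    n_select = max(1, round(len(tokens) * select_prob))
    for i in rng.sample(range(len(tokens)), n_select):
        labels[i] = tokens[i]
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


tokens = [f"tok{i}" for i in range(20)]
corrupted, labels = mask_tokens(tokens, ["foo", "bar"], random.Random(0))
```

The cross-entropy loss would then be computed only at positions where `labels[i]` is not None.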

For model optimization, the Adam optimizer [kingma2014adam] is used for 1.9M steps. The batch size is set to 32, and each sequence contains at most 512 tokens. Finally, the learning rate is set to 1e-4.

Subword tokenization, which is necessary for better performance, is achieved using the WordPiece method [kudo-2018-subword]. WordPiece operates as an intermediary between the BPE [sennrich-etal-2016-neural] and Unigram Language Model (ULM) approaches. WordPiece is trained on our pre-training corpus with a minimum frequency of three and an alphabet limit of 1.5K tokens. The resulting vocabulary consists of 100K tokens, including the BERT-specific special tokens [PAD], [UNK], [CLS], [MASK], and [SEP], as well as the ## marker, which is used as a prefix for tokens that continue a word. Table 3 shows an example of the tokenization process based on the WordPiece method.

Table 3: Example of the segmentation process: (1) unsegmented sentence (2) segmented sentence using WordPiece method (␣ interpret as -).
brAy bAzdyd az dyw^c^smh bAyad bh nw^shr brawyd, ^shry kh az ^smAl bh dryAy xzr, az jnwb bh kwhhAy albrz, az ^sarq bh ^sahrstAn nwr w az .grb bh ^cAlws mnthy my^swad. (1)
brAy – bAzdyd – az – dyw – ##^c^smh – bAyad – bh – nw^shr – brawyd –, – ^shry – kh – az – ^smAl – bh – dryAy – xzr – , – az – jnwb – bh – kwhhAy – albrz – , – az – ^sarq – bh – ^sahrstAn – nwr – w – az – .grb – bh – ^cAlws – mnthy – my^swad – . (2)
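How a trained WordPiece vocabulary is applied to a word can be sketched with the greedy longest-match-first procedure used by BERT-style tokenizers. The toy vocabulary below is hypothetical and only mimics the "dyw" + "##..." split visible in Table 3; vocabulary training itself is not shown.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary in the transliteration style of Table 3.
toy_vocab = {"dyw", "##csmh", "az", "bh"}
pieces = wordpiece_tokenize("dywcsmh", toy_vocab)
# pieces == ["dyw", "##csmh"]
```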

3.5 Fine-Tuning Setup

The final language model (our proposed model) should be fine-tuned towards different tasks: Sentiment Analysis, Text Classification, and Named Entity Recognition. Sentiment Analysis and Text Classification belong to a broader task called Sequence Classification. Sentiment Analysis can be regarded as a specific case of Text Classification that captures the emotions behind the text.

3.5.1 Sequence Classification

Sequence classification is the process of labeling texts in a supervised manner. In our model, the special [CLS] token serves as the representation of each sequence for predicting its class. We then add a simple feed-forward Softmax layer on top of it to predict the output classes. During fine-tuning, both the classifier and the pre-trained model weights are adjusted to maximize the log-probability of the correct class.
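The classification head described here amounts to a linear layer followed by a softmax, trained by minimizing the negative log-probability of the correct class. The sketch below uses plain Python lists and tiny hypothetical dimensions; in practice the input would be the encoder output at the [CLS] position.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias):
    """Linear layer + softmax over a [CLS] representation.

    weights has one row per class; bias has one entry per class.
    """
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def negative_log_likelihood(probs, gold_class):
    """Loss whose minimization maximizes the correct-class log-probability."""
    return -math.log(probs[gold_class])


# Hypothetical 2-dimensional [CLS] vector and a 2-class head.
probs = classify([1.0, 0.0], [[2.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
loss = negative_log_likelihood(probs, 0)
```

During fine-tuning, the gradient of this loss would flow into both the head parameters and the pre-trained encoder weights.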

3.5.2 Named Entity Recognition

This task aims to extract named entities in the text, such as names, and label them with appropriate NER classes such as locations, organizations, etc. The datasets used for this task contain sentences that are labeled in the IOB format. In this format, tokens that are not part of an entity are tagged as "O", the "B" tag corresponds to the first word of an entity, and the "I" tag corresponds to the rest of the words of the same entity. Both "B" and "I" tags are followed by a hyphen (or underscore), followed by the entity category. Therefore, the NER task is a multi-class token classification problem that labels each token of a raw input text.
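The IOB scheme can be made concrete with a pair of small helpers that convert between entity spans and per-token tags. The function names and the ORG/LOC label names are illustrative assumptions; the actual tag sets of the datasets are not listed here.

```python
def spans_to_iob(tokens, spans):
    """Convert (start, end_exclusive, label) token spans to IOB tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining words of the entity
    return tags


def iob_to_spans(tags):
    """Recover (start, end_exclusive, label) spans from IOB tags."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes last span
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tokens = ["Shahed", "University", "is", "in", "Tehran"]
tags = spans_to_iob(tokens, [(0, 2, "ORG"), (4, 5, "LOC")])
# tags == ["B-ORG", "I-ORG", "O", "O", "B-LOC"]
```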

4 Evaluation

ParsBERT is evaluated on three downstream tasks: Sentiment Analysis (SA), Text Classification, and Named Entity Recognition (NER). Each of these tasks requires its own datasets on which the model is fine-tuned and evaluated.

4.1 Sentiment Analysis

Sentiment analysis aims to classify texts, such as comments, based on their emotional bias. The proposed model is evaluated on three sentiment datasets as follows:

  1. Digikala user comments provided by the Open Data Mining Program (ODMP). This dataset contains 62,321 user comments with three labels: (0) No Idea, (1) Not Recommended, and (2) Recommended.

  2. Snappfood (an online food delivery company) user comments, containing 70,000 comments with two labels (i.e., polarity classification): (0) Happy and (1) Sad.

  3. DeepSentiPers [sharami2020deepsentipers], which is a balanced and augmented version of SentiPers [hosseini2018sentipers], contains 12,138 user opinions about digital products labeled with five different classes; two positives (i.e., happy and delighted), two negatives (i.e., furious and angry) and one neutral class. Therefore, this dataset can be utilized for both multi-class and binary classification. In the case of binary classification, the neutral class and its corresponding sentences are removed from the dataset.
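The binary variant of the third dataset can be derived from the five-class labels with a simple filter, sketched below. The label strings and the (text, label) tuple layout are assumptions for illustration; collapsing the two positive and the two negative classes into one class each is the natural reading of "binary classification" but is not spelled out in the text.

```python
POSITIVE = {"happy", "delighted"}
NEGATIVE = {"furious", "angry"}


def to_binary(examples):
    """Map five-class (text, label) pairs to binary; drop neutral ones."""
    binary = []
    for text, label in examples:
        if label in POSITIVE:
            binary.append((text, "positive"))
        elif label in NEGATIVE:
            binary.append((text, "negative"))
        # neutral examples are removed from the binary dataset
    return binary


sample = [("s1", "happy"), ("s2", "neutral"), ("s3", "furious")]
binary_sample = to_binary(sample)
# binary_sample == [("s1", "positive"), ("s3", "negative")]
```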

The second dataset of the above list was not readily available. We extracted it using our tools to provide a more comprehensive evaluation. Figure 3 illustrates the class distribution for all three sentiment datasets.

Figure 3: Class distribution for (a) Multi-class DeepSentiPers, (b) Binary-class DeepSentiPers, (c) Digikala and (d) SnappFood datasets.

Baselines: Since no prior work has been done on the Digikala and SnappFood datasets, our baseline for these datasets is the multilingual BERT model. As for the DeepSentiPers [sharami2020deepsentipers] dataset, we compare our results with those reported in that paper. Their methodology for addressing the SA task entails hybrid CNN and BiLSTM networks.

4.2 Text Classification

Text classification is an important NLP task in which the objective is to classify a text based on pre-determined classes. The number of classes is usually higher than that of sentiment analysis, and the distribution of words makes finding the correct main class difficult. The datasets used for this task come from two sources:

  1. A total of 8,515 articles scraped from Digikala online magazine. This dataset includes seven different classes.

  2. A dataset of various news articles scraped from different online news agencies’ websites. The total number of articles is 16,438, spread over eight different classes.

We have scraped and prepared both of these datasets using our own tools. Figure 4 shows the class distribution for each of these datasets.

Figure 4: Class distribution for (a) Digikala Online Magazine and (b) Persian news articles scraped from various websites.

Baseline: Since we have prepared both datasets for this task using our own tools, no prior work exists on them. Therefore, the multilingual BERT model is the only baseline we compare our model against for this task.

4.3 Named Entity Recognition

For the NER task evaluation, the readily available PEYMA [shahshahani2018peyma] and ARMAN [poostchi2018bilstm] datasets are used. The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 tokens are tagged with seven different classes. On the other hand, the ARMAN dataset holds 7,682 sentences with 250,015 tokens tagged over six different classes. The class distribution for these datasets is shown in Figure 5.

Figure 5: Class distribution for (a) ARMAN and (b) PEYMA datasets

Baselines: We compare the result of our model for the NER task to that of Beheshti-NER [taher2020beheshti]. Beheshti-NER utilizes a multilingual BERT model to tackle the same NER task as ours.

5 Results

5.1 Sentiment Analysis Results

Table 4 shows the results obtained on the Digikala and SnappFood datasets. This table shows that ParsBERT outperforms the multilingual BERT model in terms of accuracy and F1 score.

Model | Digikala Accuracy | Digikala F1 | SnappFood Accuracy | SnappFood F1
ParsBERT | 82.52 | 81.74 | 87.80 | 88.12
multilingualBERT | 81.83 | 80.74 | 87.44 | 87.87
Table 4: ParsBERT performance on Digikala and SnappFood datasets compared to multilingual BERT model.
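For reference, accuracy and an F1 score can be computed from gold and predicted labels as sketched below. The averaging scheme is not stated in this section, so macro-averaging over classes is an assumption here, and the function names are illustrative.

```python
def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores (assumed averaging)."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


gold = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
acc = accuracy(gold, pred)   # 0.75
f1 = macro_f1(gold, pred)    # (2/3 + 4/5) / 2
```

On imbalanced datasets the two metrics can diverge, which is one reason to report both.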

The results for the DeepSentiPers dataset are presented in Table 5. It can be seen that ParsBERT achieves higher scores for both multi-class and binary sentiment analysis compared to the methods mentioned in DeepSentiPers [sharami2020deepsentipers].

Model | Multi-Class | Binary
ParsBERT | 71.11 | 92.13
CNN + FastText [sharami2020deepsentipers] | 66.30 | 80.06
CNN [sharami2020deepsentipers] | 66.65 | 91.90
BiLSTM + FastText [sharami2020deepsentipers] | 69.33 | 90.59
BiLSTM [sharami2020deepsentipers] | 66.50 | 91.98
SVM [sharami2020deepsentipers] | 67.62 | 91.31
Table 5: ParsBERT performance on DeepSentiPers dataset compared to methods mentioned in DeepSentiPers [sharami2020deepsentipers]

5.2 Text Classification Results

The results obtained for the text classification task are summarized in Table 6. It can be seen that ParsBERT achieves better accuracy and F1 scores than the multilingual BERT model on both the Digikala Magazine and Persian news datasets.

Model | Digikala Magazine Accuracy | Digikala Magazine F1 | Persian News Accuracy | Persian News F1
ParsBERT | 94.28 | 93.59 | 97.20 | 97.19
multilingualBERT | 91.31 | 90.72 | 95.80 | 95.79
Table 6: ParsBERT performance on text classification task compared to multilingual BERT model.

5.3 Named Entity Recognition Results

The results obtained for the NER task indicate that ParsBERT outperforms all prior works in this area by achieving scores as high as 98.79 and 93.10 for the PEYMA and ARMAN datasets, respectively. A thorough comparison between ParsBERT performance and other works on these two datasets is provided in Table 7.

Model | PEYMA | ARMAN
ParsBERT | 98.79 | 93.10
MorphoBERT [Taghizadeh2020NSURL2019T7] | - | 89.9
Beheshti-NER [taher2020beheshti] | 90.59 | 84.03
LSTM-CRF [Hafezi2018] | - | 86.55
Rule-Based-CRF [shahshahani2018peyma] | 84.00 | -
BiLSTM-CRF [poostchi2018bilstm] | - | 77.45
LSTM [Hafezi2018] | - | 73.61
Deep CRF [bokaei2018improved] | - | 81.50
Deep Local [bokaei2018improved] | - | 79.10
SVM-HMM [poostchietal2016personer] | - | 72.59
Table 7: ParsBERT performance on PEYMA and ARMAN datasets for the NER task compared to prior works.

5.4 Discussion

ParsBERT achieves state-of-the-art performance on all mentioned downstream tasks. This supports the view that monolingual language models can outperform multilingual ones. In the case of ParsBERT, this improvement stems from several causes. Firstly, the standardization and pre-processing employed in the current methodology overcome the lack of correct sentences in Persian corpora and take into account the complexities of the Persian language. Secondly, the range of topics and writing styles included in the pre-training dataset is much more diverse than that of multilingual BERT, which is trained only on the Wikipedia dataset. Another limitation of the multilingual model caused by using the small Wikipedia corpus is that it has a vocabulary size of 70K tokens for all 100 languages it supports. ParsBERT, on the other hand, incorporates a 14GB corpus of more than 3.9M documents with a vocabulary size of 100K. All in all, the obtained results indicate that ParsBERT is more competent at perceiving and understanding the Persian language than multilingual BERT or any of the previous works that have followed the same objective.

6 Conclusion

There are few specific language models for the Persian language capable of providing state-of-the-art performance on different NLP tasks. ParsBERT is a new model that is lighter than multilingual BERT and presents state-of-the-art results in downstream tasks such as Sentiment Analysis, Text Classification, and Named Entity Recognition. Compared to other Persian NER competitor models, ParsBERT outperforms all prior works by achieving scores of 98% and 93% for the PEYMA and ARMAN datasets, respectively. Moreover, in the SA task, ParsBERT performs better than the DeepSentiPers model on the SentiPers dataset, achieving scores as high as 71% and 92% for the multi-class and binary scenarios, respectively. In all cases, ParsBERT outperforms multilingual BERT and the other proposed networks.

The number of datasets for downstream tasks in Persian is limited. Therefore, we composed a considerable set of datasets on which to evaluate ParsBERT performance. These datasets will soon be published for public use. Also, we happily announce that ParsBERT is available through Huggingface Transformers for public use and can serve as a new baseline for numerous Persian NLP use cases.

7 Acknowledgments

We hereby express our gratitude to the TensorFlow Research Cloud (TFRC) program for providing us with the necessary computation resources. We also thank Hooshvare Research Group for facilitating dataset gathering and scraping online text resources.