Most spoken languages around the world, such as Persian, are rhythmic; this makes the words of a sentence more meaningful in speech, and one can change the meaning of a sentence by changing the emphasis and significance of a particular word. By making sure that punctuation marks are used accurately, we can avoid probable ambiguity. Otherwise, the sentence will be understood differently; hence, not paying attention to the accurate use of punctuation leads to confusion and misunderstanding.
Automatic Speech Recognition (ASR) is a technology that allows users to speak entries rather than punching numbers or letters on a keypad. However, in a typical ASR system, punctuation and the capitalization of words are removed because they do not affect the pronunciation of words. The output contains purely a sequence of words or alphabetic characters. This output is sufficient for many applications that use short, independent segments of speech, but it is difficult to use in applications that decipher long speech segments. Furthermore, ASR results are often fed into natural language processing (NLP) models, for which punctuation and word capitalization are important pieces of information that can help boost performance.
Achieving readable Persian text involves several tasks. We consider the punctuation restoration, zero-width non-joiner (ZWNJ) recognition, and ezafe construction tasks to process texts for NLP. Punctuation restoration is the task of identifying the punctuation marks that increase the readability of a text. Punctuation restoration and correcting word casing are requirements for a reliable post-processing system for texts before applying complex NLP algorithms. For English, results comparable to the state of the art have been obtained, while for Persian there are only a limited number of works in the field. In English, hyphens are used to join words and to separate the syllables of a single word; this process is called hyphenation, and such a word is called a multi-part word. The Persian language contains multi-part words as well. In Persian morphology, the ZWNJ character is used to separate the parts of multi-part words. Persian ezafe is an unstressed morpheme syllable that appears at the end of words. It is pronounced as -e after consonants and as -ye after vowels. This syntactic phenomenon links a head noun, head pronoun, head adjective, head preposition, or head adverb to its modifiers in a constituent called the ezafe construction.
In most studies, these three concepts (punctuation restoration, ZWNJ recognition, and ezafe construction) are investigated separately. One reason could be that the available datasets serve different purposes; another is that the complexity of the tasks varies from one task to another. In this work, however, we argue that an enriched text must attend to all three aspects, so we propose a framework that combines three models for text refinement. In practice, we used publicly available datasets to create models that produce readable text for numerous NLP tasks. In general, our research contributions are twofold:
We proposed a framework for text refinement in Persian for various NLP tasks, combining three major text clarification models in one framework.
We used ParsBERT (a BERT variant for the Persian language) for sequence labeling.
After reviewing the related works for the three tasks in Section II, we introduce our proposed method in Section III. We then discuss the punctuation restoration, ZWNJ recognition, and ezafe recognition tasks and their results in Section IV, and we conclude the paper in Section V.
2 Related Works
We first review methods for English, and then methods for the Persian language. Since our work relies on multiple tasks, we summarize related literature on these topics as well.
There is a large body of work for the English language; we divide it into four categories, namely prosodic features, n-gram language models, RNN-based approaches, and transformers. Prosodic features are features that appear when sounds are put together in connected speech. An early work used a statistical finite-state model that combines prosodic, linguistic, and punctuation-class features for the punctuation restoration task. In similar work, a combined punctuation generation and speech recognition system was proposed that uses prosodic features to boost performance. N-gram language models (LMs) have a long history in NLP and are useful for punctuation annotation systems. One study proposed a metadata detection system based on an n-gram LM that combines information from different textual knowledge sources with a prosodic classifier for the automatic detection of sentence boundaries and disfluencies. Other authors used n-gram LMs with progressively larger training datasets and concluded that using more data improves performance, while increasing the n-gram order does not help. With progress in deep learning, RNN-based models attracted greater attention in the text domain, and subsequent research explored different approaches to this task. For example, one work proposed a Deep Neural Network (DNN) sequence labeling model that uses a Bidirectional Long Short-Term Memory (BiLSTM) network and a Convolutional Neural Network (CNN) to predict punctuation. A combined word- and character-level embedding CNN-RNN model for punctuation restoration has also been investigated, and an LSTM approach has been proposed for the punctuation prediction task.
The transformer is a novel NLP architecture that aims to solve sequence-to-sequence tasks while handling long-range dependencies with ease. Transformers are groundbreaking models in various NLP tasks such as sequence labeling. Several works have used transformer models for the punctuation restoration task and achieved promising results.
For the Persian language, we divide the related works into three parts, one per task, so that each can be explored in the same way.
2.2.1 ZWNJ Recognition
ZWNJ has been investigated in Persian, and there are a few related works for this task. To the best of our knowledge, the earliest work proposed a model based on the statistical machine translation paradigm for space correction. In machine translation, text in the source language is translated into text in the destination language; in the proposed model, the output is the space-corrected text. The statistical machine-translation method reaches a BLEU score of 0.91. Parsivar is a toolkit that performs various tasks, including normalization, space correction, tokenization, stemming, part-of-speech tagging, and shallow parsing. Its authors proposed a Naïve Bayes model for space correction with a BIO scheme to find words with multiple parts separated by spaces, achieving an F1 score of 89.50%. In a similar manner, another study investigated various NLP toolkits for the Persian language and found that, apart from Parsivar, none of them handles spaces desirably. Believing that Hazm (https://www.sobhe.ir/hazm/) does not properly handle spaces in Persian, they trained a model that uses Hazm tokenization as input for a space correction model. First, they combined various datasets to obtain 1,750,607 words. Next, they trained an n-gram language model using the KenLM toolkit for space correction. Their method outperforms the Hazm model for space correction with an F1 score of 81.94%. A BERT-based approach has been proposed for word segmentation correction and ZWNJ recognition. Its authors treat the problem jointly as a sequence labeling problem and propose a BERT-CRF model, which consists of a BERT model followed by a CRF layer, for ZWNJ recognition. They specified three classes, namely none, space, and ZWNJ. They then proposed an algorithm that extends the existing dataset by adding noise, which boosted their model's performance from an F1 score of 96.67% to 98.14%. They also collected 500 difficult sentences for further evaluation and reported an F1 score of 92.40%.
2.2.2 Punctuation Restoration
Punctuation prediction/restoration is a crucial task in ASR. A sequence of words with no punctuation needs to be processed to make sense to humans and to other areas of NLP. One work introduced the first corpus for automatic punctuation prediction in Persian texts. The authors made many revisions to their corpus, such as word replacements, normalization, word type corrections, and numerous corrections to the punctuation marks. They trained a CRF model and achieved an F1 score of 69.00% in their preliminary experiments.
2.2.3 Ezafe Construction
Ezafe construction is a peculiar aspect of the Persian language that can be a good indicator of important information in a text. It has been reported that adding information about ezafe can boost the performance of dependency parsing and shallow parsing by 4.6% and 9%, respectively. Most work in the field has used rule-based methods and hand-crafted features to tackle this challenge. One work proposed a head-driven phrase structure grammar to formalize Persian syntax and determine phrase boundaries, and another used a rule-based method to create a Persian morphological analyzer. The CART tree classification approach has also been employed for ezafe construction. A probabilistic approach, namely CRF, was used to create an ezafe recognition model in Persian and achieved an accuracy of 98.04%. Another work uses both a rule-based method and a genetic algorithm for ezafe recognition and reports an accuracy of 95.26%. The newest work in ezafe recognition uses transformers and studies the role of ezafe recognition in part-of-speech tagging; its authors employed transformer-based methods, BERT and XLM-RoBERTa, and achieved an F1 score of 98.10%.
3 Proposed Method
In this section, the methodology of the proposed system is described in detail. The proposed system consists of four layers: the input processing layer, the model layers, which are divided into two modules (the prediction and mapping layers), and the output construction layer. The architecture of the proposed method is presented in Figure 1. First, in the processing layer, documents are cleaned, normalized, and tokenized for the next layer. Next, the model layers, consisting of the prediction and mapping layers, use the processed documents for punctuation construction, ZWNJ construction, and ezafe recognition. In the last layer, the outputs are combined into the final result of the system. The model layers consist of three sections, one per task, working in parallel. The prediction layer trains the models for the different sections: in the punctuation section it identifies the probability of the specified punctuation marks, the ZWNJ section predicts whether a ZWNJ should occur or not, and the ezafe section works similarly. The next layer in the model layers is the mapping layer, which maps the prediction results onto the text. Mapping starts with the punctuation section, which constructs the punctuation. The output of the punctuation section then feeds into the ZWNJ section for ZWNJ construction at the mapping layer. Likewise, the punctuation- and ZWNJ-constructed output from the previous section feeds into the ezafe section to determine which tokens need ezafe construction. In the following sections, we describe each layer in more detail.
3.1 Problem Formulations
In this paper, our training data consists of sentences, each of which consists of tokens. Accordingly, we can define $X$ and $Y$ as the input and output representations of the input sequences, respectively:

$$X = (x_1, x_2, \dots, x_T), \qquad Y = (y_1, y_2, \dots, y_T)$$

where each $y_t$ changes based on the model. In this paper, we define $y_t$ for each model at the model layer as follows:

$$y_t \in \mathcal{L}, \qquad \mathcal{L} \in \{\mathcal{L}_{punc}, \mathcal{L}_{zwnj}, \mathcal{L}_{ezafe}\}$$

where $\mathcal{L}$ is the label set for tokens in the different tasks. For example, for the punctuation task we specify only four types of marks for restoration, and the last label, unk, is for a token that does not take any punctuation mark, so our model is a five-class sequence labeling model. The other tasks are binary classification, where 1 means a change is required and 0 means it is not; the change is construction for the ZWNJ task and specification (recognition) for the ezafe task. Overall, we are dealing with a classification task. Given these definitions, we can define the training, validation, and test sets for each problem as follows:

$$D_{train} = \{(X_i, Y_i)\}_{i=1}^{N}, \quad D_{valid} = \{(X_i, Y_i)\}_{i=1}^{M}, \quad D_{test} = \{(X_i, Y_i)\}_{i=1}^{K}$$

where $N$, $M$, and $K$ are the sizes of the train, validation, and test sets, respectively. The goal is to create the models $M_{punc}$, $M_{zwnj}$, and $M_{ezafe}$, which are used for punctuation, ZWNJ, and ezafe construction:

$$\hat{Y} = M_{ezafe}\big(M_{zwnj}\big(M_{punc}(X)\big)\big)$$

where $M_{punc}$ is the punctuation construction model that predicts and constructs the specified punctuation marks in the input $X$; the resulting output is then used by the $M_{zwnj}$ model for joining ZWNJs. Finally, the output of the previous model enters the $M_{ezafe}$ model for ezafe construction.
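As a concrete illustration of these label sets, the sketch below encodes per-token labels in plain Python. The particular four punctuation marks and the label names are assumptions for illustration only, since they are not code from the paper.

```python
# Hypothetical label inventories; the exact four marks are an assumption.
# Five-class scheme for punctuation restoration: four marks plus "unk"
# for tokens that take no punctuation mark after them.
PUNC_LABELS = {".": 0, ",": 1, "?": 2, "!": 3, "unk": 4}

# Binary schemes for the ZWNJ and ezafe tasks: 1 = change required, 0 = not.
BINARY_LABELS = {"change": 1, "no_change": 0}

def encode_labels(tokens_with_marks):
    """Turn (token, following_mark) pairs into a per-token label sequence."""
    return [PUNC_LABELS.get(mark, PUNC_LABELS["unk"])
            for _, mark in tokens_with_marks]
```

For example, `encode_labels([("salam", None), ("donya", ".")])` yields `[4, 0]`: the first token takes no mark (unk) and the second is followed by a period.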
3.2 Processing Layer
Raw documents first enter the processing layer, which aims to create input pairs from the raw documents. Generally speaking, raw documents can be sentences or paragraphs, and they need to be processed into the form of input pairs for the models. In this layer, we also remove every punctuation mark from the inputs. This allows us to use the proposed framework for the punctuation restoration task, since it makes the input resemble the output of ASR systems; however, the framework is usable for punctuation completion as well.
First, a document is converted into tokens using a customized tokenizer that considers spaces, half-spaces, and punctuation when splitting raw documents into texts. Next, special characters are removed from the tokens, and English numerals are converted into Persian numerals. Finally, modifications are applied to the output to produce the required labels for the sequences.
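The steps above can be sketched as follows. The regex-based tokenizer, the mark inventory, and the digit mapping are our own assumptions, since the paper's personalized tokenizer is not published.

```python
import re

# Latin and Persian punctuation marks to split on and discard (assumed set).
PUNC_MARKS = ".,!?;:؟،؛"
# Map Western digits 0-9 to Persian digits ۰-۹.
EN_TO_FA_DIGITS = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")

def process(document):
    # 1) Split on whitespace, half-space (ZWNJ, U+200C), and punctuation,
    #    which also drops the punctuation marks from the token stream.
    tokens = re.findall(r"[^\s\u200c" + re.escape(PUNC_MARKS) + r"]+", document)
    # 2) Convert English numerals to Persian numerals.
    return [t.translate(EN_TO_FA_DIGITS) for t in tokens]
```

For instance, `process("سال 2020, خوب")` returns `["سال", "۲۰۲۰", "خوب"]`: the comma is stripped and the year is converted to Persian digits.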
3.3 Model Layers
The model layers aim to make predictions and mappings for refinement. In these layers, a model is trained in the prediction layer to predict labels for new, unseen input sequences, and the mapping layer uses the trained model for refinement. In the following, we describe each layer carefully.
ParsBERT is a monolingual version of the BERT language model for the Persian language that uses the base BERT configuration: 12 hidden layers and a hidden size of 768 with 12 attention heads. ParsBERT is a transformer-based language model with an encoder-only structure. The input sequence is mapped to a contextualized encoded sequence by going through a series of bi-directional self-attention blocks, each with two feed-forward layers. We then take the output sequences and map them to binary and multi-class classifications with sigmoid and softmax activation functions, respectively, for each task. First, we used the ParsBERT tokenizer to create attention masks and word mappings; then we fine-tuned ParsBERT for the tasks. For the training procedure, we used CrossEntropy and CategoricalCrossEntropy loss functions for the binary and multi-class classification models, respectively. The rest of the parameters are the same for all models: we set the learning rate to 2e-5, the number of epochs to 3, and the dropout to 0.1.
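The per-token output heads can be illustrated in plain Python: softmax produces the five-way punctuation distribution and sigmoid the binary ZWNJ/ezafe probability. The logit values below are made up for illustration; this is only a sketch of the activation step, not the trained model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of per-class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(logit):
    """Probability that a binary change (ZWNJ join / ezafe) is required."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical per-token logits: five punctuation classes, one binary score.
punc_probs = softmax([2.0, 0.1, -1.0, 0.5, 3.0])
zwnj_prob = sigmoid(1.2)
```

In the multi-class case the predicted mark is the argmax of `punc_probs`; in the binary case a threshold (commonly 0.5) on `zwnj_prob` decides whether the change is applied.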
The goal of the mapping layer is to use the models trained in the prediction layer to apply changes to the tokens. The mapping layer takes the raw token sequence as input and first uses the punctuation prediction model to identify which punctuation mark needs to be applied to each token. After punctuation construction, the refined sequence enters ZWNJ construction, which applies its changes. Similarly, the output of the ZWNJ model enters ezafe recognition, which marks ezafe on the predicted tokens.
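The cascade can be sketched with stub functions standing in for the trained models; the stub behaviors (appending a final period, joining the prefix «می» to the next token with a ZWNJ) are illustrative assumptions only, not the models' actual rules.

```python
def m_punc(tokens):
    # Stub punctuation model: append a period after the last token.
    return tokens[:-1] + [tokens[-1] + "."]

def m_zwnj(tokens):
    # Stub ZWNJ model: join the verbal prefix «می» to the next token
    # with a ZWNJ character (U+200C).
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "می" and i + 1 < len(tokens):
            out.append("می\u200c" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def m_ezafe(tokens):
    # Stub ezafe model: mark nothing.
    return tokens

def refine(tokens):
    # Cascade order as in the framework: punctuation, then ZWNJ, then ezafe.
    return m_ezafe(m_zwnj(m_punc(tokens)))
```

For example, `refine(["می", "روم"])` yields `["می\u200cروم."]`: the stub punctuation model appends a period, and the stub ZWNJ model joins the two parts with a half-space.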
3.4 Output Layer
The output layer at the top of the proposed framework combines the previous layer's predictions from the mapper to construct sentences from the output sequences.
Figure 3 presents sample inputs and outputs from the proposed framework. In this figure, the blue-colored text shows correct ZWNJ constructions in the input sentences, on which the model does not need to make any changes. The green-colored and red-colored texts, however, are incorrect ZWNJ constructions in the input sequences that our models recognized and corrected in the outputs. In general, there are two types of modifications for ZWNJ: the first, shown in green, occurs where a ZWNJ is required in the input sentence; the second, shown in red, occurs where a ZWNJ is not required in the input sentence and our model removes it in the output. Punctuation mark restorations are shown with background-colored punctuation, where yellow means the mark was used incorrectly in the input sentence and blue means the input sentence was missing it. More experimental evaluations are presented in the next sections.
In this section, we will explore the dataset and results of the proposed framework.
For the training and evaluation of the proposed framework, we used the Bijankhan corpus. Bijankhan is a well-known dataset for Persian natural language processing tasks. It consists of news articles and common texts covering 4,300 different topics in total, with 2,597,939 words annotated with morpho-syntactic and partly semantic features. The corpus has been tagged with 550 tags in hierarchical order, including fine-grained POS tags like noun-plural-subj. Nearly 23% of the words in this corpus are tagged with ezafe. Punctuation marks are specified with the DELM tag, with a total count of 246,023. Each word in this corpus occupies a single row, and sentences are divided by #. Multi-part words are also specified in a single row, so we modified the dataset using this structure to obtain a dataset for the ZWNJ model: we split multi-part words into separate words and labeled them for ZWNJ recognition. We found almost 30% of the corpus appropriate for the ZWNJ recognition task. We split the dataset for each task 80%/10%/10% into train, validation, and test sets, respectively. Table 1 presents the details of the train, validation, and test sets for each task.
Set | # of Sentences | # of Tokens
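A minimal sketch of the 80%/10%/10% split, assuming simple deterministic slicing (the paper does not state how the split was performed):

```python
def split_dataset(samples, train=0.8, valid=0.1):
    """Split a list of samples into train/validation/test partitions."""
    n = len(samples)
    n_train = int(n * train)
    n_valid = int(n * valid)
    return (samples[:n_train],
            samples[n_train:n_train + n_valid],
            samples[n_train + n_valid:])
```

For a list of 10 samples, this yields partitions of sizes 8, 1, and 1. In practice one would typically shuffle the samples first; the deterministic slicing here is only for illustration.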
Considering all the classes together, the averaged macro F1 score is:

$$F1_{macro} = \frac{1}{C}\sum_{c=1}^{C} F1_c$$

where $C$ is the number of classes and $F1_c$ is the F1 score for class $c$. The overall accuracy over all tokens is calculated as follows:

$$Accuracy = \frac{\text{number of correctly labeled tokens}}{\text{total number of tokens}}$$
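A minimal sketch of these two metrics, computed per class from gold and predicted label sequences:

```python
def macro_f1(gold, pred, classes):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    f1s = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def accuracy(gold, pred):
    """Fraction of tokens whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

Macro averaging weights every class equally, so rare labels (such as individual punctuation marks) influence the score as much as the dominant unk class.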
Table 2 presents the results achieved on both the validation and test sets. For both the punctuation and ezafe models, our proposed method achieved a higher F1 score than the related works in the field; however, it placed second for ZWNJ. In the following, we discuss each model separately.
The proposed method achieved an F1 score of 92.13% on the test set. This is the first reported result for the punctuation restoration task on the Bijankhan dataset; the only prior work we found reports an F1 score of 69.00% on a different dataset.
The proposed approach for ezafe recognition achieved the highest F1 score, 98.50%, on the test set. The second best performing approach used XLM-RoBERTa and obtained an F1 score of 98.10%, a 0.40% margin below our proposed model. In conclusion, the proposed model for ezafe recognition is the best performing model.
One prior work used BERT-CRF with additions to the training dataset, which boosted its performance from an F1 score of 96.67% to 98.14%. Under normal conditions (without the added data), our proposed method performs 0.23% higher; however, our model performs 2.24% below their best result. Note also that they used a BIO scheme while training their model, whereas we used only a binary classification scheme and achieved an F1 score of 96.90%.
In this paper, we presented the ViraPart framework for automatic text refinement for various NLP tasks in the Persian language, such as part-of-speech tagging, named entity recognition, ASR, grapheme-to-phoneme conversion, spell checking, text summarization, and machine translation. ViraPart consists of three models, namely punctuation restoration, ZWNJ recognition, and ezafe recognition. We fine-tuned ParsBERT as a sequence labeling model for the tasks and achieved promising results: averaged macro F1 scores of 92.13%, 98.50%, and 96.90% for punctuation restoration, ezafe recognition, and ZWNJ recognition, respectively. The presented results show the effectiveness of the proposed framework for text refinement.
The authors would like to thank Part AI Research Center (the biggest AI company in Iran) for supporting and funding this research.
-  E. Cho, J. Niehues, and A. Waibel, Segmentation and punctuation prediction in speech language translation using a monolingual translation system, in International Workshop on Spoken Language Translation (IWSLT) 2012, 2012.
-  Ghorbaniyun, H. (2005). The Method of Representing the Written and Spoken Points. Mashhad: Jam Aria Press.
-  Zahedi, M., Arjomandzadeh, A. (2016). A new model for persian multi-part words edition based on statistical machine translation. Journal of AI and Data Mining, 4(1), 27-34. doi: 10.5829/idosi.JAIDM.2016.04.01.04
-  Doostmohammadi, E., Nassajian, M., & Rahimi, A. (2020, November). Persian Ezafe Recognition Using Transformers and Its Role in Part-Of-Speech Tagging. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings (pp. 961-971).
-  H. Christensen, Y. Gotoh, and S. Renals, Punctuation annotation using statistical prosody models. in Proc Isca Workshop on Prosody in Speech Recognition and Understanding, 2001.
-  J. Kim and P. Woodland, A combined punctuation generation and speech recognition system and its performance enhancement using prosody, Speech Commun., vol. 41, pp. 563–577, 2003
-  D. Beeferman, A. Berger, and J. Lafferty, Cyberpunc: a lightweight punctuation annotation system for speech, in ICASSP, 1998, pp. 689–692 vol.2.
-  Y. Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper, Enriching speech recognition with automatic detection of sentence boundaries and disfluencies, IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 5, p. 1526–1540, Sep. 2006.
-  A. Gravano, M. Jansche, and M. Bacchiani, Restoring punctuation and capitalization in transcribed speech, in ICASSP, 2009, USA, p. 4741–4744
-  P. Zelasko, P. Szymanski, J. Mizgajski, A. Szymczak, Y. Carmiel, ˙ and N. Dehak, Punctuation prediction model for conversational speech, in Proc. Interspeech, 2018.
-  Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
-  B. Lin and L. Wang, Joint Prediction of Punctuation and Disfluency in Speech Transcripts, in Proc. Interspeech, 2020, pp. 716–720.
-  J. Yi, J. Tao, Z. Tian, Y. Bai, and C. Fan, Focal Loss for Punctuation Prediction, in Proc. Interspeech, 2020, pp. 721–725.
-  M. Courtland, A. Faulkner, and G. McElvain, Efficient automatic punctuation restoration using bidirectional transformers with robust inference, in Proceedings of the 17th International Conference on Spoken Language Translation, 2020, pp. 272–279.
-  A. Nagy, B. Bial, and J. Acs, Automatic punctuation restoration with BERT models, arXiv preprint arXiv:2101.07343, 2021.
-  K. Makhija, T.-N. Ho, and E. Chng, Transfer learning for punctuation prediction, in Proceedings of APSIPA, 2019, pp. 268–273
-  M. Á. Tündik and G. Szaszák, Joint Word- and Character-level Embedding CNN-RNN Models for Punctuation Restoration, 2018 9th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), 2018, pp. 000135-000140, doi: 10.1109/CogInfoCom.2018.8639876.
-  Kaituo Xu, Lei Xie, and Kaisheng Yao. 2016. Investigating LSTM for Punctuation Prediction. In 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), pages 1–5.
-  Salar Mohtaj, Behnam Roshanfekr, Atefeh Zafarian, Habibollah Asghari, Parsivar: A Language Processing Toolkit for Persian. In the Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018.
-  M. Panahandeh and S. Ghanbari, Correction of Spaces in Persian Sentences for Tokenization, 2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI), 2019, pp. 670-674, doi: 10.1109/KBEI.2019.8734954.
-  Doostmohammadi, E., Nassajian, M., & Rahimi, A. (2020, December). Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 4612-4618).
-  Heafield, K. (2011, July). KenLM: Faster and smaller language model queries. In Proceedings of the sixth workshop on statistical machine translation (pp. 187-197).
-  S. M. Hosseini and H. Sameti, Creating a corpus for automatic punctuation prediction in Persian texts, 2017 Iranian Conference on Electrical Engineering (ICEE), 2017, pp. 1537-1542, doi: 10.1109/IranianCEE.2017.7985288.
-  Nourian, A., Rasooli, M.S., Imany, M., & Faili, H. (2015). On the Importance of Ezafe Construction in Persian Parsing. ACL.
-  Seraji, M., Megyesi, B., & Nivre, J. (2012). A basic language resource kit for Persian. In Eight International Conference on Language Resources and Evaluation (LREC 2012), 23-25 May 2012, Istanbul, Turkey (pp. 2245-2252). European Language Resources Association.
-  Müller, S., & Ghayoomi, M. (2010). PerGram: A TRALE implementation of an HPSG fragment of Persian. Proceedings of the International Multiconference on Computer Science and Information Technology, 461-467.
-  Megerdoomian, K. (2000). Persian Computational Morphology: A Unification-Based Approach.
-  Koochari, A., QasemiZadeh, B., & Kasaeiyan, M. (2006). Ezafe Prediction in Phrases of Farsi Using CART. In Proceedings of the I International Conference on Multidisciplinary Information Sciences and Technologies (pp. 329-332).
-  Asghari, H., Maleki, J., & Faili, H. (2014, April). A probabilistic approach to persian ezafe recognition. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers (pp. 138-142).
-  Noferesti, S., & Shamsfard, M. (2014). A Hybrid Algorithm for Recognizing the Position of Ezafe Constructions in Persian Texts. Int. J. Interact. Multim. Artif. Intell., 2, 17-25.
-  Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019, January). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1).
-  Mehrdad Farahani, Mohammad Gharachorloo, Marzieh Farahani, and M. Manthouri. Parsbert: Transformerbased model for persian language understanding. ArXiv, abs/2005.12515, 2020.
-  Bijankhan, M. (2004). The role of the corpus in writing a grammar: An introduction to a software. Iranian Journal of Linguistics, 19(2), 48-67.