Non-Standard Vietnamese Word Detection and Normalization for Text-to-Speech

by Huu-Tien Dang, et al.

Converting written texts into their spoken forms is an essential problem in any text-to-speech (TTS) system. However, building an effective text normalization solution for a real-world TTS system faces two main challenges: (1) the semantic ambiguity of non-standard words (NSWs), e.g., numbers, dates, ranges, scores, and abbreviations, and (2) transforming NSWs such as URLs, email addresses, hashtags, and contact names into pronounceable syllables. In this paper, we propose a new two-phase normalization approach to deal with these challenges. First, a model-based tagger is designed to detect NSWs. Then, depending on the NSW type, a rule-based normalizer expands those NSWs into their final verbal forms. We conducted three empirical experiments for NSW detection using Conditional Random Fields (CRFs), BiLSTM-CNN-CRF, and BERT-BiGRU-CRF models on a manually annotated dataset of 5819 sentences extracted from Vietnamese news articles. In the second phase, we propose a forward lexicon-based maximum matching algorithm to split hashtags, email addresses, URLs, and contact names. The experimental results of the tagging phase show that the average F1 scores of the BiLSTM-CNN-CRF and CRF models are above 90.00, reaching the highest F1 of 95.00 [...] approach has low sentence error rates, at 8.15 [...] BiLSTM-CNN-CRF taggers, and only 6.67 [...]
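To make the second phase more concrete, the following is a minimal sketch of forward (left-to-right) lexicon-based maximum matching, the general technique the abstract names. It is not the authors' implementation: the function name `forward_max_match`, the toy `LEXICON`, and the single-character fallback for unmatched input are all assumptions made for illustration.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring found in the lexicon.
    The single-character fallback is an assumption of this sketch.
    """
    text = text.lower()
    if max_len is None:
        # Longest lexicon entry bounds how far ahead we need to look.
        max_len = max((len(w) for w in lexicon), default=1)

    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink it one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:
            # No lexicon entry starts here: emit one character and move on.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy lexicon; a real system would use a full syllable or word list.
LEXICON = {"viet", "nam", "vietnam", "new", "news", "today"}

print(forward_max_match("VietnamNewsToday", LEXICON))
# → ['vietnam', 'news', 'today']
```

Because the match is greedy, "vietnam" wins over "viet" and "news" over "new". Greedy matching can mis-segment when a long entry swallows the start of the next word, so the lexicon's coverage largely determines how well such a splitter works.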


Normalization of Non-Standard Words in Croatian Texts

This paper presents text normalization which is an integral part of any ...

FinBERT-MRC: financial named entity recognition using BERT under the machine reading comprehension paradigm

Financial named entity recognition (FinNER) from literature is a challen...

Semantic Tagging with LSTM-CRF

In the present paper, two models are presented namely LSTM-CRF and BERT-...

Toward a Standardized and More Accurate Indonesian Part-of-Speech Tagging

Previous work in Indonesian part-of-speech (POS) tagging are hard to com...

Improving Agreement and Disagreement Identification in Online Discussions with A Socially-Tuned Sentiment Lexicon

We study the problem of agreement and disagreement detection in online d...

Measurement Context Extraction from Text: Discovering Opportunities and Gaps in Earth Science

We propose Marve, a system for extracting measurement values, units, and...

To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging

Does normalization help Part-of-Speech (POS) tagging accuracy on noisy, ...
