Normalizing Text using Language Modelling based on Phonetics and String Similarity

06/25/2020
by   Fenil Doshi, et al.

Social media networks and chat platforms often use an informal version of natural text. Adversarial spelling attacks also alter the input text by modifying its characters. Normalizing such text is an essential step for applications like language translation and text-to-speech synthesis, where the models are trained on clean, standard English. We propose a new robust model to perform text normalization. Our system uses the BERT language model to predict the masked words that correspond to the unnormalized words. We propose two unique masking strategies that try to replace the unnormalized words in the text with their root form, using a unique score based on phonetic and string similarity metrics. We use human-centric evaluations in which volunteers were asked to rank the normalized text. Our strategies yield an accuracy of 86.7%, which indicates the effectiveness of our system in dealing with text normalization.
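The idea of ranking candidate replacements by a combined phonetic and string similarity score can be illustrated with a minimal sketch. The abstract does not say which metrics or weighting the authors use, so everything here is an assumption: Soundex stands in for the phonetic metric, the `difflib.SequenceMatcher` ratio stands in for the string metric, the `weight` parameter is hypothetical, and the hard-coded candidate list is a placeholder for the words a masked language model such as BERT might predict.

```python
from difflib import SequenceMatcher

# Soundex digit for each consonant group (vowels, h, w, y carry no code).
_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        _CODES[ch] = digit


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        # 'h' and 'w' do not separate two letters with the same code.
        if ch not in "hw":
            prev = code
    return (encoded + "000")[:4]


def phonetic_similarity(a, b):
    """Similarity of the two Soundex codes, in [0, 1]."""
    return SequenceMatcher(None, soundex(a), soundex(b)).ratio()


def string_similarity(a, b):
    """Character-level similarity of the two words, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def combined_score(noisy, candidate, weight=0.5):
    """Weighted mix of phonetic and string similarity (weight is assumed)."""
    return (weight * phonetic_similarity(noisy, candidate)
            + (1 - weight) * string_similarity(noisy, candidate))


def best_candidate(noisy, candidates, weight=0.5):
    """Pick the candidate that scores highest against the noisy word."""
    return max(candidates, key=lambda c: combined_score(noisy, c, weight))


# Hypothetical candidates, standing in for masked-language-model predictions.
print(best_candidate("tmrw", ["today", "time", "tomorrow"]))  # tomorrow
```

In a full pipeline the candidate list would come from the language model's predictions for the masked position, so context narrows the options and the similarity score selects among them.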
