
indic-punct: An automatic punctuation restoration and inverse text normalization framework for Indic languages

03/31/2022
by Anirudh Gupta, et al.

Automatic Speech Recognition (ASR) generates text that is, most of the time, devoid of any punctuation. The absence of punctuation in text can affect readability. Downstream NLP tasks such as sentiment analysis and machine translation also benefit greatly from punctuation and sentence boundary information. We present an approach for automatic punctuation of text using a pretrained IndicBERT model. Inverse text normalization is done by hand-writing weighted finite state transducer (WFST) grammars. We have developed this tool for 11 Indic languages, namely Hindi, Tamil, Telugu, Kannada, Gujarati, Marathi, Odia, Bengali, Assamese, Malayalam and Punjabi. All code and data are publicly available.
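
The abstract does not state how the model's predictions are turned into punctuated text. One common framing of punctuation restoration is token classification, where each word receives a label naming the punctuation mark that follows it. The sketch below illustrates only that final step, under this assumption; the label names, the `PUNCT_FOR_LABEL` table, and the `apply_punctuation` helper are hypothetical and are not taken from the indic-punct code.

```python
# Hypothetical label set: the abstract does not specify the labels used.
# For Indic scripts a sentence-final label might map to the danda ("।") instead.
PUNCT_FOR_LABEL = {
    "O": "",          # no punctuation after this word
    "COMMA": ",",
    "PERIOD": ".",
    "QUESTION": "?",
}


def apply_punctuation(tokens, labels):
    """Attach the punctuation mark named by each label to its word.

    `labels` stands in for per-word predictions from a token
    classification model; no model is run here.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return " ".join(tok + PUNCT_FOR_LABEL[lab] for tok, lab in zip(tokens, labels))


# Unpunctuated ASR-style input with hand-written (not predicted) labels.
tokens = ["hello", "how", "are", "you"]
labels = ["COMMA", "O", "O", "QUESTION"]
print(apply_punctuation(tokens, labels))  # hello, how are you?
```

In a real system the labels would come from the fine-tuned model, typically predicted per subword and then merged back to word level before a step like this one.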


04/11/2021

NeMo Inverse Text Normalization: From Development To Production

Inverse text normalization (ITN) converts spoken-domain automatic speech...
04/03/2018

Attentive Sequence-to-Sequence Learning for Diacritic Restoration of Yorùbá Language Text

Yorùbá is a widely spoken West African language with a writing system ri...
11/07/2022

Streaming, fast and accurate on-device Inverse Text Normalization for Automatic Speech Recognition

Automatic Speech Recognition (ASR) systems typically yield output in lex...
03/13/2021

OkwuGbé: End-to-End Speech Recognition for Fon and Igbo

Language is inherent and compulsory for human communication. Whether exp...
04/03/2019

A Large-Scale Comparison of Historical Text Normalization Systems

There is no consensus on the state-of-the-art approach to historical tex...
02/12/2021

Neural Inverse Text Normalization

While there have been several contributions exploring state of the art t...