Punctuation Restoration for Singaporean Spoken Languages: English, Malay, and Mandarin

12/10/2022
by Abhinav Rao, et al.

This paper presents work on restoring punctuation in transcripts generated by multilingual ASR systems. The focus languages are English, Mandarin, and Malay, three of the most popular languages in Singapore. To the best of our knowledge, this is the first system that can tackle punctuation restoration for these three languages simultaneously. Traditional approaches usually treat the task as sequence labeling; this work instead adopts a slot-filling approach that predicts the presence and type of punctuation mark at each word boundary. The approach is similar to the masked language model objective used during BERT pre-training, but instead of predicting a masked word, our model predicts masked punctuation. Additionally, we find that using Jieba for word segmentation, instead of relying only on the built-in SentencePiece tokenizer of XLM-R, can significantly improve the performance of punctuating Mandarin transcripts. Experimental results on the English and Mandarin IWSLT2022 datasets and Malay News show that the proposed approach achieves state-of-the-art results for Mandarin with an F1-score of 73.8 while maintaining reasonable F1-scores for English and Malay, i.e. 74.7 and 78 respectively. Code for building a simple web-based application for demonstration purposes is available on GitHub.
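
The slot-filling formulation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the label set, the `<mask>` placeholder string, and the helper names `split_words_and_labels` and `slot_examples` are assumptions made only for illustration. The sketch strips punctuation from a sentence, then builds one training example per word boundary, with a mask placed in the slot whose punctuation is to be predicted.

```python
from typing import List, Tuple

# Hypothetical label set; the abstract does not list the punctuation classes.
PUNCT_TO_LABEL = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}
NO_PUNCT = "O"
MASK = "<mask>"  # placeholder string for the masked punctuation slot


def split_words_and_labels(text: str) -> Tuple[List[str], List[str]]:
    """Strip punctuation from whitespace-separated text and record, for the
    boundary after each word, which mark (if any) followed it."""
    words: List[str] = []
    labels: List[str] = []
    for token in text.split():
        label = NO_PUNCT
        # Peel one trailing punctuation mark off the token, if present.
        if token[-1] in PUNCT_TO_LABEL:
            label = PUNCT_TO_LABEL[token[-1]]
            token = token[:-1]
        if token:
            words.append(token.lower())
            labels.append(label)
        elif words:
            # A free-standing mark belongs to the boundary after the previous word.
            labels[-1] = label
    return words, labels


def slot_examples(words: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Build one masked input per word boundary: the mask sits in the slot
    after word i, and the target is the punctuation label for that slot."""
    examples = []
    for i, label in enumerate(labels):
        masked = words[: i + 1] + [MASK] + words[i + 1:]
        examples.append((" ".join(masked), label))
    return examples


words, labels = split_words_and_labels("Hello, how are you? I am fine.")
examples = slot_examples(words, labels)
for masked_text, label in examples:
    print(f"{label:9s} {masked_text}")
```

In a real system each masked input would be passed to a classifier on top of a pretrained encoder such as XLM-R, which predicts the label for the masked slot. The whitespace split used here only suits English and Malay; for Mandarin, the word list would presumably come from a segmenter such as Jieba (for example `jieba.lcut`), which is the step the abstract reports as improving results.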
