Unsupervised Stemming based Language Model for Telugu Broadcast News Transcription

08/10/2019
by   Mythili Sharan Pala, et al.

In Indian languages, native speakers are able to understand new words formed by combining or modifying root words with tense and/or gender. Due to data insufficiency, an Automatic Speech Recognition (ASR) system may not accommodate all the words in the language model, irrespective of the size of the text corpus. It also becomes computationally challenging if the volume of data increases exponentially due to morphological changes to the root word. In this paper, a new unsupervised method is proposed for an Indian language, Telugu, based on the unsupervised method for Hindi, to generate the Out of Vocabulary (OOV) words in the language model. By using techniques like smoothing and interpolation of pre-processed data with supervised and unsupervised stemming, different issues in the language model for Telugu have been addressed. We observe that the smoothing techniques Witten-Bell and Kneser-Ney perform well when compared to other techniques on pre-processed data from supervised learning. The ASR's accuracy is improved by 0.76 supervised and unsupervised stemming respectively.
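To make the smoothing step mentioned in the abstract concrete, the following is a minimal sketch of interpolated Witten-Bell smoothing for a bigram model. It is an illustration under stated assumptions, not the authors' implementation: the function name `train_witten_bell`, the sentence-boundary markers `<s>` and `</s>`, and the idea of feeding it already-stemmed token lists are all hypothetical choices made for this example. The lower-order unigram estimate is left unsmoothed for brevity, so a word never seen in training still receives probability zero.

```python
from collections import Counter, defaultdict


def train_witten_bell(sentences):
    """Return prob(word, prev) for an interpolated Witten-Bell bigram model.

    `sentences` is a list of token lists (assumed here to be stems or
    words produced by some earlier pre-processing step).
    """
    unigrams = Counter()            # counts of predicted tokens
    bigrams = defaultdict(Counter)  # bigrams[prev][word] = count

    for sent in sentences:
        tokens = ["<s>"] + list(sent) + ["</s>"]
        unigrams.update(tokens[1:])  # "<s>" is never predicted
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[prev][word] += 1

    total = sum(unigrams.values())

    def prob(word, prev):
        # Unsmoothed unigram estimate used as the back-off distribution.
        p_uni = unigrams[word] / total
        followers = bigrams.get(prev)
        if not followers:
            # Unseen history: fall back to the unigram estimate.
            return p_uni
        c_h = sum(followers.values())  # tokens observed after prev
        t_h = len(followers)           # distinct types observed after prev
        # Witten-Bell: reserve mass t_h / (c_h + t_h) for the back-off.
        return (followers[word] + t_h * p_uni) / (c_h + t_h)

    return prob
```

Histories followed by many distinct word types give more weight to the lower-order estimate, which is the behaviour that makes this family of smoothing methods a natural fit for morphologically rich vocabularies. Kneser-Ney differs mainly in using absolute discounting and a continuation-count unigram distribution, and is not shown here.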


