Language Modeling for Code-Switched Data: Challenges and Approaches

11/09/2017
by Ganji Sreeram, et al.

Code-switching has recently gained considerable attention and has emerged as an active area of research. In bilingual communities, speakers commonly embed words and phrases of a non-native language into the syntax of a native language in their day-to-day communication. Although code-switching is a global phenomenon among multilingual communities, very limited acoustic and linguistic resources are available for it so far. For developing effective speech-based applications, the ability of existing language technologies to deal with code-switched data cannot be overemphasized. Code-switching is broadly classified into two modes: inter-sentential and intra-sentential. In this work, we study the intra-sentential problem in the context of the code-switching language modeling task. The salient contributions of this paper include: (i) the creation of a Hindi-English code-switching text corpus by crawling a few blogging sites that educate readers about the usage of the Internet, (ii) the exploration of parts-of-speech (POS) features towards more effective modeling of Hindi-English code-switched data by a monolingual language model (LM) trained on native (Hindi) language data, and (iii) the proposal of a novel textual factor, referred to as the code-switch factor (CS-factor), which allows the LM to predict code-switching instances. In the context of recognizing code-switched data, a substantial reduction in perplexity (PPL) is achieved with the use of POS factors, and the proposed CS-factor provides both an independent and an additive gain in PPL.
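The abstract does not say how the CS-factor is computed, so the following is only a minimal sketch of one way a per-token code-switch annotation could be derived. It assumes that Hindi tokens are written in Devanagari and English tokens in Latin script (which would not hold for romanized Hindi), and that a token is marked "CS" when the next token is in the other language. The function names `token_language` and `cs_factors`, the labels, and the example sentence are all hypothetical and are not taken from the paper.

```python
def token_language(token):
    """Guess a token's language from its script (an assumption, not the paper's method)."""
    # Any character in the Devanagari block (U+0900 to U+097F) marks the token as Hindi.
    if any('\u0900' <= ch <= '\u097f' for ch in token):
        return "HI"
    # Otherwise, any ASCII letter marks the token as English.
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "EN"
    # Digits, punctuation, and anything else carry no language label.
    return "OTHER"


def cs_factors(tokens):
    """Return (token, language, factor) triples.

    The hypothetical factor is "CS" when the following token is in the
    other language, and "NO" otherwise (including the last token).
    """
    langs = [token_language(t) for t in tokens]
    triples = []
    for i, (tok, lang) in enumerate(zip(tokens, langs)):
        nxt = langs[i + 1] if i + 1 < len(langs) else None
        switches = (
            nxt is not None
            and lang != "OTHER"
            and nxt != "OTHER"
            and nxt != lang
        )
        triples.append((tok, lang, "CS" if switches else "NO"))
    return triples


# Illustrative sentence with English words embedded in Hindi syntax.
example = "मैं internet पर blog पढ़ता हूँ".split()
annotated = cs_factors(example)
for tok, lang, factor in annotated:
    print(f"{tok}\t{lang}\t{factor}")
```

Annotations of this kind could be supplied alongside POS tags as extra per-token features to a factored LM; whether the paper's CS-factor is defined this way cannot be determined from the abstract alone.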


research
09/24/2018

Hindi-English Code-Switching Speech Corpus

Code-switching refers to the usage of two languages within a sentence or...
research
12/14/2016

Grammatical Constraints on Intra-sentential Code-Switching: From Theories to Working Models

We make one of the first attempts to build working models for intra-sent...
research
05/30/2018

Code-Switching Language Modeling using Syntax-Aware Multi-Task Learning

Lack of text data has been the major issue on code-switching language mo...
research
11/01/2018

On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition

Code-switching (CS) refers to a linguistic phenomenon where a speaker us...
research
07/31/2022

The Who in Code-Switching: A Case Study for Predicting Egyptian Arabic-English Code-Switching Levels based on Character Profiles

Code-switching (CS) is a common linguistic phenomenon exhibited by multi...
research
03/24/2017

Crowdsourcing Universal Part-Of-Speech Tags for Code-Switching

Code-switching is the phenomenon by which bilingual speakers switch betw...
research
12/13/2021

Predicting User Code-Switching Level from Sociological and Psychological Profiles

Multilingual speakers tend to alternate between languages within a conve...
