Subword-Level Language Identification for Intra-Word Code-Switching

04/03/2019
by   Manuel Mager, et al.
0

Language identification for code-switching (CS), the phenomenon of alternating between two or more languages in conversations, has traditionally been approached under the assumption of a single language per token. However, if at least one language is morphologically rich, a large number of words can be composed of morphemes from more than one language (intra-word CS). In this paper, we extend the language identification task to the subword-level, such that it includes splitting mixed words while tagging each part with a language ID. We further propose a model for this task, which is based on a segmental recurrent neural network. In experiments on a new Spanish--Wixarika dataset and on an adapted German--Turkish dataset, our proposed model performs slightly better than or roughly on par with our best baseline, respectively. Considering only mixed words, however, it strongly outperforms all baselines.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
09/28/2019

Part of speech tagging for code switched data

We address the problem of Part of Speech tagging (POS) in the context of...
research
10/26/2022

Reducing Language confusion for Code-switching Speech Recognition with Token-level Language Diarization

Code-switching (CS) refers to the phenomenon that languages switch withi...
research
05/02/2022

TuGeBiC: A Turkish German Bilingual Code-Switching Corpus

In this paper we describe the process of collection, transcription, and ...
research
11/14/2019

Training a code-switching language model with monolingual data

A lack of code-switching data complicates the training of code-switching...
research
12/14/2016

Grammatical Constraints on Intra-sentential Code-Switching: From Theories to Working Models

We make one of the first attempts to build working models for intra-sent...
research
12/14/2014

Recurrent-Neural-Network for Language Detection on Twitter Code-Switching Corpus

Mixed language data is one of the difficult yet less explored domains of...
research
10/16/2018

Strategies for Language Identification in Code-Mixed Low Resource Languages

In the recent years, substantial work has been done on language tagging ...

Please sign up or login with your details

Forgot password? Click here to reset