Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training

10/28/2018
by Hila Gonen, et al.

We focus on the problem of language modeling for code-switched language, in the context of automatic speech recognition (ASR). Language modeling for code-switched language is challenging for (at least) three reasons: (1) the lack of large-scale code-switched data available for training; (2) the lack of a replicable evaluation setup that is ASR-directed yet isolates language modeling performance from the other intricacies of the ASR system; and (3) the reliance on generative modeling. We tackle these three issues: we propose an ASR-motivated evaluation setup that is decoupled from both an ASR system and the choice of vocabulary, and provide an evaluation dataset for English-Spanish code-switching. This setup lends itself to a discriminative training approach, which we demonstrate to work better than generative language modeling. Finally, we present an effective training protocol that integrates small amounts of code-switched data with large amounts of monolingual data, for both the generative and discriminative cases.


research
01/28/2022

Reducing language context confusion for end-to-end code-switching automatic speech recognition

Code-switching is about dealing with alternative languages in the commun...
research
07/04/2021

Arabic Code-Switching Speech Recognition using Monolingual Data

Code-switching in automatic speech recognition (ASR) is an important cha...
research
09/24/2018

Hindi-English Code-Switching Speech Corpus

Code-switching refers to the usage of two languages within a sentence or...
research
10/21/2022

Optimizing Bilingual Neural Transducer with Synthetic Code-switching Text Generation

Code-switching describes the practice of using more than one language in...
research
05/30/2018

Code-Switching Language Modeling using Syntax-Aware Multi-Task Learning

Lack of text data has been the major issue on code-switching language mo...
research
06/14/2021

Using heterogeneity in semi-supervised transcription hypotheses to improve code-switched speech recognition

Modeling code-switched speech is an important problem in automatic speec...
research
08/01/2017

A Generative Parser with a Discriminative Recognition Algorithm

Generative models defining joint distributions over parse trees and sent...
