Streaming End-to-End Bilingual ASR Systems with Joint Language Identification

07/08/2020
by Surabhi Punjabi, et al.

Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on the fly with minimal latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing of ASR output. In this paper, we introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification (LID) using the recurrent neural network transducer (RNN-T) architecture. On the input side, embeddings from pretrained acoustic-only LID classifiers are used to guide RNN-T training and inference, while on the output side, language targets are jointly modeled with ASR targets. The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India. Experiments show that for English-Spanish, the bilingual joint ASR-LID architecture matches monolingual ASR and acoustic-only LID accuracies. For English-Hindi, which is more challenging owing to within-utterance code switching, English ASR and LID metrics show degradation. Overall, in scenarios where users switch dynamically between languages, the proposed architecture offers a promising simplification over running multiple monolingual ASR models and an LID classifier in parallel.
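
To make the two ideas in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that the LID embedding is attached to each acoustic frame by simple concatenation, and that the language target is a single tag token appended to the ASR token sequence. The abstract specifies neither choice. The names `LANG_TAGS`, `append_lid_embedding`, `add_language_target`, and `split_hypothesis` are hypothetical and introduced only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical tag tokens; the abstract does not specify the tag format.
LANG_TAGS = {"en": "<en>", "es": "<es>"}


def append_lid_embedding(
    frames: Sequence[Sequence[float]],
    lid_embedding: Sequence[float],
) -> List[List[float]]:
    """Input side: attach an LID embedding to every acoustic frame.

    Assumption: plain concatenation; other ways of conditioning the
    encoder on the embedding are equally possible.
    """
    return [list(frame) + list(lid_embedding) for frame in frames]


def add_language_target(tokens: Sequence[str], lang: str) -> List[str]:
    """Output side: build one target sequence carrying ASR tokens and language.

    Assumption: the language tag is appended at the end of the sequence.
    """
    return list(tokens) + [LANG_TAGS[lang]]


def split_hypothesis(hyp: Sequence[str]) -> Tuple[List[str], Optional[str]]:
    """Separate a decoded hypothesis into ASR tokens and the predicted tag."""
    tags = set(LANG_TAGS.values())
    words = [tok for tok in hyp if tok not in tags]
    langs = [tok for tok in hyp if tok in tags]
    # If several tags appear, keep the last one; None if no tag was emitted.
    return words, (langs[-1] if langs else None)


if __name__ == "__main__":
    frames = [[0.1, 0.2], [0.3, 0.4]]        # two toy acoustic frames
    lid_embedding = [1.0, 0.0]               # toy LID embedding
    print(append_lid_embedding(frames, lid_embedding))
    target = add_language_target(["hola", "mundo"], "es")
    print(target)
    print(split_hypothesis(target))
```

Under these assumptions, a single decoded sequence yields both the transcript and the language label, which is what would let one bilingual model stand in for several monolingual recognizers plus a separate LID classifier.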

research
06/01/2020

Streaming Language Identification using Combination of Acoustic Representations and ASR Hypotheses

This paper presents our modeling and architecture approaches for buildin...
research
02/22/2023

UML: A Universal Monolingual Output Layer for Multilingual ASR

Word-piece models (WPMs) are commonly used subword units in state-of-the...
research
08/29/2022

A Language Agnostic Multilingual Streaming On-Device ASR System

On-device end-to-end (E2E) models have shown improvements over a convent...
research
09/13/2022

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

Language identification is critical for many downstream tasks in automat...
research
05/09/2017

Phone-aware Neural Language Identification

Pure acoustic neural models, particularly the LSTM-RNN model, have shown...
research
02/21/2022

Adaptive Discounting of Implicit Language Models in RNN-Transducers

RNN-Transducer (RNN-T) models have become synonymous with streaming end-...
research
03/01/2023

Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training

We propose gated language experts to improve multilingual transformer tr...
