How Phonotactics Affect Multilingual and Zero-shot ASR Performance

10/22/2020
by Siyuan Feng, et al.

Training a single automatic speech recognition (ASR) model on recordings from multiple languages promises the emergence of a universal speech representation. Recently, a Transformer encoder-decoder model was shown to leverage multilingual data well when producing IPA transcriptions of languages seen during training. However, the representations it learned did not transfer successfully to unseen languages in the zero-shot setting. Because that model lacks an explicit factorization into an acoustic model (AM) and a language model (LM), it is unclear to what degree its performance suffered from differences in pronunciation or from a mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. We then perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and that retaining only the target language's phonotactic data in LM training is preferable.
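
To make the notion of a phonotactic mismatch concrete, the following is a minimal sketch of a phone-level bigram LM with add-alpha smoothing, scored by perplexity on phone sequences from a seen and an unseen "language". It is an illustration only, not the LM used in the paper: the bigram order, the smoothing, the function names `train_bigram` and `perplexity`, and the toy phone sequences are all assumptions made for this example.

```python
import math
from collections import Counter


def train_bigram(sequences, vocab, alpha=1.0):
    """Fit an add-alpha smoothed bigram model over phone sequences.

    Returns a function prob(prev, cur) giving P(cur | prev).
    """
    # Symbols that can be predicted: every phone plus the end marker.
    symbols = sorted(set(vocab) | {"</s>"})
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1

    def prob(prev, cur):
        num = pair_counts[(prev, cur)] + alpha
        den = context_counts[prev] + alpha * len(symbols)
        return num / den

    return prob


def perplexity(prob, sequences):
    """Per-symbol perplexity of the sequences under the bigram model."""
    log_sum, n = 0.0, 0
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            log_sum += math.log(prob(prev, cur))
            n += 1
    return math.exp(-log_sum / n)


# Hypothetical toy data: language A alternates consonants and vowels,
# language B allows consonant clusters. Both share one phone inventory.
vocab = {"k", "a", "t", "s"}
lang_a = [["k", "a", "t", "a"], ["t", "a", "k", "a"], ["a", "k", "a"]]
lang_b = [["s", "t", "k", "a"], ["k", "t", "s"], ["t", "s", "k", "a"]]

model_a = train_bigram(lang_a, vocab)
ppl_seen = perplexity(model_a, lang_a)    # matched phonotactics
ppl_unseen = perplexity(model_a, lang_b)  # mismatched phonotactics
print(f"seen: {ppl_seen:.2f}  unseen: {ppl_unseen:.2f}")
```

In this toy setup the model trained on language A assigns a higher perplexity to language B, because B's consonant clusters never occur in A. In a hybrid system, where the LM is separate from the AM, such a model can be swapped or retrained on target-language phone sequences without touching the acoustic model.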
