A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition

06/09/2021
by   Shigeki Karita, et al.

End-to-end (E2E) modeling is advantageous for automatic speech recognition (ASR), especially for Japanese, since word-based tokenization of Japanese is not trivial and E2E modeling can model character sequences directly. This paper focuses on the latest E2E modeling techniques and investigates their performance on character-based Japanese ASR through comparative experiments. The results are analyzed and discussed in order to understand the relative advantages of long short-term memory (LSTM) and Conformer models in combination with connectionist temporal classification (CTC), transducer, and attention-based loss functions. Furthermore, the paper investigates the effectiveness of recent training techniques such as data augmentation (SpecAugment), variational noise injection, and exponential moving average. The best configuration found in the paper achieved state-of-the-art character error rates of 4.1 ... on the eval1, eval2, and eval3 tasks, respectively. The system is also shown to be computationally efficient thanks to the efficiency of Conformer transducers.
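
The character error rate (CER) used as the evaluation metric above can be illustrated with a minimal sketch. It assumes the common definition (character-level edit distance divided by the number of reference characters, reported in percent); the function names are hypothetical, and any text normalization the paper applies before scoring is not described in this abstract and is not modeled here.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER in percent: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_chars


if __name__ == "__main__":
    # Illustrative sentences only; they are not taken from the paper's data.
    print(character_error_rate(["今日は晴れです"], ["今日は雨です"]))
```

Because the score is computed over characters, no word segmentation of the Japanese text is needed, which matches the motivation for character-based modeling given in the abstract.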

research
05/13/2021

Exploring CTC Based End-to-End Techniques for Myanmar Speech Recognition

In this work, we explore a Connectionist Temporal Classification (CTC) b...
research
11/20/2019

On using 2D sequence-to-sequence models for speech recognition

Attention-based sequence-to-sequence models have shown promising results...
research
07/03/2017

Improving LSTM-CTC based ASR performance in domains with limited training data

This paper addresses the observed performance gap between automatic spee...
research
07/29/2022

Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition

For Mandarin end-to-end (E2E) automatic speech recognition (ASR) tasks, ...
research
03/03/2017

Exponential Moving Average Model in Parallel Speech Recognition Training

As training data rapid growth, large-scale parallel training with multi-...
research
11/05/2018

Manner of Articulation Detection using Connectionist Temporal Classification to Improve Automatic Speech Recognition Performance

Conventionally, the manner of articulations in speech signal are derived...
research
05/25/2022

Heterogeneous Reservoir Computing Models for Persian Speech Recognition

Over the last decade, deep-learning methods have been gradually incorpor...
