A Comparison of Modeling Units in Sequence-to-Sequence Speech Recognition with the Transformer on Mandarin Chinese

05/16/2018
by Shiyu Zhou, et al.

The choice of modeling units is critical to automatic speech recognition (ASR) tasks. Conventional ASR systems typically choose context-dependent states (CD-states) or context-dependent phonemes (CD-phonemes) as their modeling units. However, this convention has been challenged by sequence-to-sequence attention-based models, which integrate the acoustic, pronunciation, and language models into a single neural network. On English ASR tasks, previous attempts have shown that graphemes can outperform phonemes as modeling units in sequence-to-sequence attention-based models. In this paper, we investigate modeling units for Mandarin Chinese ASR using sequence-to-sequence attention-based models built on the Transformer. Five modeling units are explored: context-independent phonemes (CI-phonemes), syllables, words, sub-words, and characters. Experiments on the HKUST dataset demonstrate that lexicon-free modeling units can outperform lexicon-related modeling units in terms of character error rate (CER). Among the five modeling units, the character-based model performs best and establishes a new state-of-the-art CER of 26.64% on the HKUST dataset without a hand-designed lexicon or extra language model integration, a 4.8% relative improvement over the previous best CER of 28.0%, achieved by a joint CTC-attention based encoder-decoder network.
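Since the abstract's headline numbers are reported in character error rate (CER) and its argument is that character units need no pronunciation lexicon, a minimal sketch may help make both concrete. The snippet below is illustrative only, not the authors' implementation: it computes CER as the Levenshtein distance between the hypothesis and reference character sequences, normalized by the reference length, and the Mandarin example strings are invented for demonstration.

```python
# Illustrative sketch (not the paper's code): CER is the edit distance
# between reference and hypothesis characters, divided by reference length.

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))          # dp[j] = distance(ref[:0], hyp[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i       # prev holds the old dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]              # old dp[i-1][j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution or match
            )
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: normalized edit distance over characters."""
    return edit_distance(ref, hyp) / len(ref)

# Character units are lexicon-free: a Mandarin string decomposes directly
# into characters, with no pronunciation dictionary needed (invented example).
ref = "今天天气很好"
hyp = "今天天汽很好"
print(f"CER = {cer(ref, hyp):.2%}")  # one substitution in six characters -> 16.67%
```

This also illustrates why character units are called lexicon-free in the abstract: unlike CI-phoneme or syllable units, the target sequence is obtained by splitting the transcript directly, and the evaluation metric (CER) operates on exactly the same symbols the model predicts.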


Related research

04/28/2018
Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese
Sequence-to-sequence attention-based models have recently shown very pro...

10/25/2019
Exploring Lexicon-Free Modeling Units for End-to-End Korean and Korean-English Code-Switching Speech Recognition
As the character-based end-to-end automatic speech recognition (ASR) mod...

05/18/2020
Weak-Attention Suppression For Transformer Based Speech Recognition
Transformers, originally proposed for natural language processing (NLP) ...

03/01/2017
Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling
Most existing sequence labelling models rely on a fixed decomposition of...

06/12/2018
Multilingual End-to-End Speech Recognition with A Single Transformer on Low-Resource Languages
Sequence-to-sequence attention-based models integrate an acoustic, pronu...

05/24/2022
Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition
The choice of modeling units affects the performance of the acoustic mod...
