Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition

04/19/2021
by Wei Zhou, et al.

Subword units are commonly used for end-to-end automatic speech recognition (ASR), yet a fully acoustic-oriented subword modeling approach is largely missing. We propose an acoustic data-driven subword modeling (ADSM) approach that combines the advantages of several text-based and acoustic-based subword methods in one pipeline. With a fully acoustic-oriented label design and learning process, ADSM produces acoustically structured subword units and acoustically matched target sequences for further ASR training. The obtained ADSM labels are evaluated with different end-to-end ASR approaches, including CTC, RNN-transducer, and attention models. Experiments on the LibriSpeech corpus show that ADSM clearly outperforms both byte pair encoding (BPE) and pronunciation-assisted subword modeling (PASM) in all cases. Detailed analysis shows that ADSM achieves acoustically more logical word segmentation and more balanced sequence lengths, and is thus suitable for both time-synchronous and label-synchronous models. We also briefly describe how to apply acoustic-based subword regularization and unseen text segmentation using ADSM.

research
05/20/2020

A Comparison of Label-Synchronous and Frame-Synchronous End-to-End Models for Speech Recognition

End-to-end models are gaining wider attention in the field of automatic ...
research
11/03/2022

Phonetic-assisted Multi-Target Units Modeling for Improving Conformer-Transducer ASR system

Exploiting effective target modeling units is very important and has alw...
research
11/10/2018

Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling

In recent years, end-to-end models have become popular for application i...
research
03/03/2018

On Modular Training of Neural Acoustics-to-Word Model for LVCSR

End-to-end (E2E) automatic speech recognition (ASR) systems directly map...
research
11/05/2018

When CTC Training Meets Acoustic Landmarks

Connectionist temporal classification (CTC) training criterion provides ...
research
06/15/2016

Automatic Pronunciation Generation by Utilizing a Semi-supervised Deep Neural Networks

Phonemic or phonetic sub-word units are the most commonly used atomic el...
research
07/09/2021

On lattice-free boosted MMI training of HMM and CTC-based full-context ASR models

Hybrid automatic speech recognition (ASR) models are typically sequentia...
