Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular Subword Units

10/08/2021
by   Yosuke Higuchi, et al.

In end-to-end automatic speech recognition (ASR), a model is expected to implicitly learn representations suitable for recognizing word-level sequences. However, the large abstraction gap between input acoustic signals and output linguistic tokens makes it difficult for a model to learn these representations. In this work, to promote word-level representation learning in end-to-end ASR, we propose a hierarchical conditional model based on connectionist temporal classification (CTC). The model is trained with auxiliary CTC losses applied to intermediate layers, where the vocabulary size of each target subword sequence is gradually increased as the layer gets closer to the word-level output. Each level of sequence prediction is explicitly conditioned on the sequences predicted at the lower levels. With this approach, we expect the model to learn word-level representations effectively by exploiting a hierarchy of linguistic structures. Experimental results on LibriSpeech (100h and 960h) and TEDLIUM2 demonstrate that the proposed model improves over a standard CTC-based model and other competitive models from prior work. We further analyze the results to confirm the effectiveness of the intended representation learning.
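The core idea of the abstract can be sketched in code: stack encoder blocks, attach an auxiliary CTC head with a progressively larger subword vocabulary after each block, and feed each level's frame-wise posteriors back into the next block so that higher-level predictions are conditioned on lower-level ones. The following is a minimal PyTorch sketch under assumed design choices (the vocabulary sizes, layer counts, and the linear self-conditioning projection are illustrative assumptions, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class HierarchicalConditionalCTC(nn.Module):
    """Sketch of hierarchical conditional CTC.

    Each level consists of a few Transformer encoder layers followed by a
    CTC head over a subword vocabulary that grows with depth. The next
    level is conditioned on the current level's frame-wise posteriors by
    projecting them back into the feature space and adding them to the
    encoder states (a form of self-conditioning).
    """

    def __init__(self, d_model=64, nhead=4, layers_per_level=2,
                 vocab_sizes=(100, 500, 2000)):  # hypothetical sizes
        super().__init__()
        self.levels = nn.ModuleList()
        self.ctc_heads = nn.ModuleList()
        self.cond_projs = nn.ModuleList()
        for vocab in vocab_sizes:
            layer = nn.TransformerEncoderLayer(
                d_model, nhead, dim_feedforward=4 * d_model,
                batch_first=True)
            self.levels.append(nn.TransformerEncoder(layer, layers_per_level))
            self.ctc_heads.append(nn.Linear(d_model, vocab + 1))  # +1 for blank
            self.cond_projs.append(nn.Linear(vocab + 1, d_model))

    def forward(self, x):
        """x: (batch, frames, d_model) acoustic features.

        Returns a list of per-level CTC log-probabilities, one tensor of
        shape (batch, frames, vocab+1) per granularity level.
        """
        log_probs = []
        for enc, head, proj in zip(self.levels, self.ctc_heads,
                                   self.cond_projs):
            x = enc(x)
            lp = head(x).log_softmax(dim=-1)
            log_probs.append(lp)
            # Condition the next level on this level's posteriors.
            x = x + proj(lp.exp())
        return log_probs
```

Training would then sum a `nn.CTCLoss` term per level, each computed against the target sequence tokenized with that level's subword vocabulary; the final (largest-vocabulary) head produces the recognition output.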

Related research

07/31/2020
Modular End-to-end Automatic Speech Recognition Framework for Acoustic-to-word Model
End-to-end (E2E) systems have played a more and more important role in a...

08/08/2018
End-to-end Speech Recognition with Word-based RNN Language Models
This paper investigates the impact of word-based RNN language models (RN...

04/01/2022
Multi-sequence Intermediate Conditioning for CTC-based ASR
End-to-end automatic speech recognition (ASR) directly maps input speech...

10/18/2019
End-to-End Speech Recognition: A review for the French Language
Recently, end-to-end ASR based either on sequence-to-sequence networks o...

03/25/2021
Residual Energy-Based Models for End-to-End Speech Recognition
End-to-end models with auto-regressive decoders have shown impressive re...

09/19/2023
Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition
We present a novel integration of an instruction-tuned large language mo...

03/11/2023
Transcription free filler word detection with Neural semi-CRFs
Non-linguistic filler words, such as "uh" or "um", are prevalent in spon...
