Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers

by   Yusuke Kida, et al.

This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR). The speech-to-text alignment is a problem of splitting long audio recordings with un-aligned transcripts into utterance-wise pairs of speech and text. Unlike conventional methods based on frame-synchronous prediction, the proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem. This enables an accurate alignment benefiting from the strong inference ability of the state-of-the-art attention-based encoder-decoder models, which cannot be applied to the conventional methods. Two different Transformer models named forward Transformer and backward Transformer are respectively used for estimating an initial and final tokens of a given speech segment based on end-of-sentence prediction with teacher-forcing. Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment, that matches the manually annotated alignment with as few as 0.2 Transformer-based hybrid CTC/Attention ASR model using the aligned speech and text pairs as an additional training data reduces character error rates relatively up to 59.0 conventional alignment method based on connectionist temporal classification model.



There are no comments yet.


page 1

page 2

page 3

page 4


Synchronous Transformers for End-to-End Speech Recognition

For most of the attention-based sequence-to-sequence models, the decoder...

Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept

With the advent of direct models in automatic speech recognition (ASR), ...

A Comparison of Label-Synchronous and Frame-Synchronous End-to-End Models for Speech Recognition

End-to-end models are gaining wider attention in the field of automatic ...

Optimizing Alignment of Speech and Language Latent Spaces for End-to-End Speech Recognition and Understanding

The advances in attention-based encoder-decoder (AED) networks have brou...

Sequence-to-Sequence Learning via Attention Transfer for Incremental Speech Recognition

Attention-based sequence-to-sequence automatic speech recognition (ASR) ...

CTC-synchronous Training for Monotonic Attention Model

Monotonic chunkwise attention (MoChA) has been studied for the online st...

Emotion recognition by fusing time synchronous and time asynchronous representations

In this paper, a novel two-branch neural network model structure is prop...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.