NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism

03/31/2022
by Jingbei Li, et al.

Although deep learning and end-to-end models have been widely used and have shown superiority in automatic speech recognition (ASR) and text-to-speech (TTS) synthesis, state-of-the-art forced alignment (FA) models are still based on hidden Markov models (HMMs). HMMs have a limited view of contextual information and are developed with long pipelines, leading to error accumulation and unsatisfactory performance. Inspired by the capability of the attention mechanism to capture long-term contextual information and learn alignments in ASR and TTS, we propose a neural network based end-to-end forced aligner called NeuFA, in which a novel bidirectional attention mechanism plays an essential role. NeuFA integrates the alignment learning of both ASR and TTS tasks in a unified framework by learning bidirectional alignment information from a shared attention matrix in the proposed bidirectional attention mechanism. Alignments are extracted from the learnt attention weights and optimized by the ASR, TTS and FA tasks in a multi-task learning manner. Experimental results demonstrate the effectiveness of our proposed model: the mean absolute error on the test set drops from 25.8 ms to 23.7 ms at the word level, and from 17.0 ms to 15.7 ms at the phoneme level, compared with the state-of-the-art HMM-based model.
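The idea of reading two alignment directions from one shared attention matrix can be illustrated with a small sketch. This is not the NeuFA implementation: the function names, the toy score matrix, and the argmax-based segment extraction are all assumptions made for illustration, and the abstract does not specify how boundaries are actually derived from the attention weights. The sketch only shows that one score matrix, normalized along each axis in turn, yields a text-to-speech view and a speech-to-text view.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def bidirectional_attention(scores):
    """Derive two attention views from one shared score matrix.

    scores[i][j] is a (hypothetical) compatibility score between
    text token i and speech frame j.
    """
    # Text-to-speech view: each token distributes weight over frames.
    text_to_speech = [softmax(row) for row in scores]
    # Speech-to-text view: each frame distributes weight over tokens.
    columns = [list(col) for col in zip(*scores)]
    speech_to_text = [softmax(col) for col in columns]
    return text_to_speech, speech_to_text


def frames_to_segments(speech_to_text):
    """Illustrative simplification: assign each frame to its most-attended
    token and merge consecutive frames into (token, start, end) spans,
    with end exclusive."""
    segments = []
    for frame, weights in enumerate(speech_to_text):
        token = max(range(len(weights)), key=weights.__getitem__)
        if segments and segments[-1][0] == token:
            segments[-1] = (token, segments[-1][1], frame + 1)
        else:
            segments.append((token, frame, frame + 1))
    return segments


# Toy example: 3 text tokens, 6 speech frames.
scores = [
    [4.0, 3.5, 0.2, 0.1, 0.0, 0.1],
    [0.1, 0.3, 3.8, 4.1, 0.2, 0.0],
    [0.0, 0.1, 0.3, 0.2, 3.9, 4.2],
]
t2s, s2t = bidirectional_attention(scores)
segments = frames_to_segments(s2t)
print(segments)
```

Multiplying the frame indices in each span by a frame shift (for example 10 ms, also an assumption) would turn the spans into time boundaries of the kind a forced aligner reports.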

research
10/12/2017

Convolutional Attention-based Seq2Seq Neural Network for End-to-End ASR

This thesis introduces the sequence to sequence model with Luong's atten...
research
03/13/2018

LCANet: End-to-End Lipreading with Cascaded Attention-CTC

Machine lipreading is a special type of automatic speech recognition (AS...
research
11/13/2018

An Online Attention-based Model for Speech Recognition

Attention-based end-to-end (E2E) speech recognition models such as Liste...
research
11/12/2018

Multi-encoder multi-resolution framework for end-to-end speech recognition

Attention-based methods and Connectionist Temporal Classification (CTC) ...
research
11/01/2022

Speech-text based multi-modal training with bidirectional attention for improved speech recognition

To let the state-of-the-art end-to-end ASR model enjoy data efficiency, ...
research
06/12/2023

FocalGatedNet: A Novel Deep Learning Model for Accurate Knee Joint Angle Prediction

Predicting knee joint angles accurately is critical for biomechanical an...
research
10/08/2021

Explaining the Attention Mechanism of End-to-End Speech Recognition Using Decision Trees

The attention mechanism has largely improved the performance of end-to-e...