Attention-based Wav2Text with Feature Transfer Learning

09/22/2017
by Andros Tjandra, et al.

Conventional automatic speech recognition (ASR) typically performs multi-level pattern recognition tasks that map the acoustic speech waveform into a hierarchy of speech units. However, it is widely known that information loss in an earlier stage can propagate through the later stages. Since the resurgence of deep learning, interest has emerged in developing a purely end-to-end ASR system that maps the raw waveform to the transcription without any predefined alignments or hand-engineered models. So far, though, successful end-to-end architectures have still used spectral-based features, while successful raw-waveform approaches have still been based on the hybrid deep neural network-hidden Markov model (DNN-HMM) framework. In this paper, we construct the first end-to-end attention-based encoder-decoder model that maps the raw speech waveform directly to the text transcription. We call the model "Attention-based Wav2Text". To assist the training of the end-to-end model, we propose using feature transfer learning. Experimental results reveal that the proposed Attention-based Wav2Text model, operating directly on the raw waveform, achieves a better result than an attentional encoder-decoder model trained on standard front-end filterbank features.
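To make the phrase "attention-based encoder-decoder" concrete, the following is a minimal sketch of a single attention step, in which a decoder state is compared against every encoder state and the resulting weights form a context vector. It assumes dot-product scoring and toy two-dimensional vectors; the function names `softmax` and `attend` are hypothetical, and the abstract does not specify the scoring function or encoder architecture the paper actually uses.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """One attention step (assumed dot-product scoring, for illustration only).

    Returns the attention weights over encoder time steps and the
    context vector, i.e. the weighted sum of the encoder states.
    """
    # Score each encoder state against the current decoder state.
    scores = [
        sum(q * h for q, h in zip(decoder_state, h_t))
        for h_t in encoder_states
    ]
    weights = softmax(scores)

    # Context vector: weighted sum of encoder states, one value per dimension.
    dim = len(encoder_states[0])
    context = [
        sum(w * h_t[i] for w, h_t in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy data: three encoder time steps, each a 2-dimensional state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
print("weights:", weights)
print("context:", context)
```

In a full model, the decoder would repeat this step for every output character, so the alignment between audio frames and text is learned rather than predefined.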

