End-to-End Speech Recognition with High-Frame-Rate Features Extraction

07/03/2019
by   Cong-Thanh Do, et al.
0

State-of-the-art end-to-end automatic speech recognition (ASR) extracts acoustic features from input speech signal every 10 ms which corresponds to a frame rate of 100 frames/second. In this paper, we investigate the use of high-frame-rate features extraction in end-to-end ASR. High frame rates of 200 and 400 frames/second are used in the features extraction and provide additional information for end-to-end ASR. The effectiveness of high-frame-rate features extraction is evaluated independently and in combination with speed perturbation based data augmentation. Experiments performed on two speech corpora, Wall Street Journal (WSJ) and CHiME-5, show that using high-frame-rate features extraction yields improved performance for end-to-end ASR, both independently and in combination with speed perturbation. On WSJ corpus, the relative reduction of word error rate (WER) yielded by high-frame-rate features extraction independently and in combination with speed perturbation are up to 21.3 WER reductions are up to 2.8 by microphone arrays and up to 11.8 recorded by binaural microphones.

READ FULL TEXT
research
10/09/2021

Data Augmentation with Locally-time Reversed Speech for Automatic Speech Recognition

Psychoacoustic studies have shown that locally-time reversed (LTR) speec...
research
11/01/2022

Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems

Automatic speech recognition (ASR) systems typically rely on an external...
research
06/09/2023

Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition

End-to-end (E2E) systems have shown comparable performance to hybrid sys...
research
10/11/2022

An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition

Differential privacy (DP) is one data protection avenue to safeguard use...
research
06/23/2022

Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems

Fundamental modelling differences between hybrid and end-to-end (E2E) au...
research
03/24/2022

Complex Frequency Domain Linear Prediction: A Tool to Compute Modulation Spectrum of Speech

Conventional Frequency Domain Linear Prediction (FDLP) technique models ...
research
12/22/2019

End-Point Detection with State Transition Model based on Chunk-Wise Classification

A state transition model (STM) based on chunk-wise classification was pr...

Please sign up or login with your details

Forgot password? Click here to reset