Small energy masking for improved neural network training for end-to-end speech recognition

02/15/2020
by Chanwoo Kim, et al.

In this paper, we present a Small Energy Masking (SEM) algorithm, which masks inputs whose values fall below a certain threshold. More specifically, a time-frequency bin is masked if its filterbank energy is less than an energy threshold. The ratio of this threshold to the peak filterbank energy of each utterance, expressed in decibels, is drawn at random from a uniform distribution. The unmasked feature elements are then scaled so that the total sum of the feature values remains unchanged by the masking procedure. This very simple algorithm shows relative Word Error Rate (WER) improvements of 11.2% and … on the standard LibriSpeech test-clean and test-other sets over the baseline end-to-end speech recognition system. Additionally, compared to the input dropout algorithm, the SEM algorithm shows relative improvements of 7.7% and … on the test-clean and test-other sets. With a modified shallow-fusion technique using a Transformer LM, we obtained WERs of 2.62% and … on the LibriSpeech test-clean and test-other sets, respectively.
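The masking step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the input is a non-negative matrix of linear filterbank energies for one utterance (frames × channels) and that the sum-preserving rescaling is applied to those same values. The function name `small_energy_mask` and the uniform-range bounds `min_ratio_db` / `max_ratio_db` (and their defaults) are hypothetical, since the abstract does not state the range.

```python
import random
from typing import List, Optional

# Hypothetical layout: one utterance as frames x filterbank channels,
# holding non-negative linear (not log) filterbank energies.
Matrix = List[List[float]]


def small_energy_mask(
    energies: Matrix,
    min_ratio_db: float = -80.0,  # hypothetical lower bound of the uniform range
    max_ratio_db: float = -60.0,  # hypothetical upper bound of the uniform range
    rng: Optional[random.Random] = None,
) -> Matrix:
    """Sketch of Small Energy Masking (SEM) applied to one utterance."""
    rng = rng or random.Random()

    # Peak filterbank energy over all time-frequency bins of the utterance.
    peak = max(max(row) for row in energies)

    # Draw the threshold-to-peak ratio (in dB) from a uniform distribution
    # and convert it to a linear energy threshold.
    ratio_db = rng.uniform(min_ratio_db, max_ratio_db)
    threshold = peak * 10.0 ** (ratio_db / 10.0)

    # Mask (zero) every bin whose energy falls below the threshold.
    masked = [[e if e >= threshold else 0.0 for e in row] for row in energies]

    # Rescale the surviving bins so the total sum is unchanged.
    total_before = sum(sum(row) for row in energies)
    total_after = sum(sum(row) for row in masked)
    scale = total_before / total_after if total_after > 0 else 1.0
    return [[e * scale for e in row] for row in masked]
```

For example, with both bounds fixed at -10 dB the threshold is one tenth of the peak, so for `[[1.0, 0.05, 2.0], [0.5, 0.01, 1.0]]` (peak 2.0, threshold 0.2) the bins 0.05 and 0.01 are zeroed and the remaining four are multiplied by 4.56 / 4.5, keeping the total at 4.56. Drawing a fresh ratio per utterance makes the mask vary from one training pass to the next.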
