Effects of Number of Filters of Convolutional Layers on Speech Recognition Model Accuracy

02/03/2021
by James Mou, et al.

Inspired by the progress of the End-to-End approach [1], this paper systematically studies the effects of the Number of Filters of convolutional layers on the prediction accuracy of CNN+RNN (Convolutional Neural Networks combined with Recurrent Neural Networks) models for ASR (Automatic Speech Recognition). Experimental results show that adding a CNN to an RNN improves the performance of the CNN+RNN speech recognition model only when the CNN Number of Filters exceeds a certain threshold value; in other parameter ranges of the CNN, adding it to the RNN model brings no benefit. Our results show a strong dependency of word accuracy on the Number of Filters of the convolutional layers. Based on the experimental results, the paper suggests a possible hypothesis of Sound-2-Vector Embedding (Convolutional Embedding) to explain the above observations. Building on this Embedding hypothesis and on parameter optimization, the paper develops an End-to-End speech recognition system that has high word accuracy as well as a light model weight. The developed LVCSR (Large Vocabulary Continuous Speech Recognition) model achieves a word accuracy of 90.2% without any intermediate phonetic representation and without any Language Model. Its acoustic model contains only 4.4 million weight parameters, compared with the 35-68 million acoustic-model weight parameters of DeepSpeech2 [2] (one of the top state-of-the-art LVCSR models), which achieves a word accuracy of 91.5%. The light-weight model improves transcribing efficiency and is also useful for mobile devices, driverless vehicles, etc. Our model weight is reduced to roughly 10% of DeepSpeech2's, while its word accuracy remains close to that of DeepSpeech2. If combined with a Language Model, our LVCSR system is able to achieve 91.5% word accuracy.
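
To make the studied variable concrete, here is a minimal PyTorch sketch of a CNN+RNN acoustic model in which the number of convolutional filters is an explicit hyperparameter. This is an illustration only, not the authors' published architecture (the abstract does not specify it); the kernel sizes, spectrogram dimensions, GRU width, and vocabulary size are assumptions chosen for the example.

```python
# Minimal sketch (assumed layer sizes, not the paper's exact model): a CNN
# front-end whose number of filters is tunable, feeding a bidirectional GRU
# acoustic model that outputs per-frame character logits (e.g. for CTC).
import torch
import torch.nn as nn


class ConvRNNAcousticModel(nn.Module):
    def __init__(self, num_filters: int, n_mels: int = 161,
                 rnn_hidden: int = 256, vocab_size: int = 29):
        super().__init__()
        # 2D convolution over (frequency, time); num_filters is the quantity
        # whose effect on word accuracy the paper studies.
        self.conv = nn.Sequential(
            nn.Conv2d(1, num_filters, kernel_size=(41, 11),
                      stride=(2, 2), padding=(20, 5)),
            nn.BatchNorm2d(num_filters),
            nn.ReLU(),
        )
        freq_out = n_mels // 2 + 1  # frequency bins after the stride-2 conv
        self.rnn = nn.GRU(input_size=num_filters * freq_out,
                          hidden_size=rnn_hidden,
                          num_layers=3, bidirectional=True,
                          batch_first=True)
        self.classifier = nn.Linear(2 * rnn_hidden, vocab_size)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, n_mels, time)
        x = self.conv(spectrogram)                         # (B, F, mel', T')
        b, f, mel, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, f * mel)   # (B, T', F*mel')
        x, _ = self.rnn(x)
        return self.classifier(x)                          # per-frame logits


# Sweep the number of filters to probe the accuracy / model-size trade-off.
for num_filters in (4, 8, 16, 32):
    model = ConvRNNAcousticModel(num_filters)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"filters={num_filters:3d}  parameters={n_params / 1e6:.2f}M")
```

Sweeping num_filters this way mirrors the paper's experimental axis: the recurrent part is held fixed while only the convolutional front-end's filter count (and hence the capacity of its learned Convolutional Embedding) varies.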

