ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

05/07/2020
by   Wei Han, et al.

Convolutional neural networks (CNNs) have shown promising results for end-to-end speech recognition, although they still lag behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond it with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet and achieves a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without an external language model (LM), 1.9%/4.1% with an LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system, which achieved 2.0%/4.6% with an LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.
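
To illustrate how a squeeze-and-excitation module can inject global context into a convolution layer's output, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `squeeze_excite_1d`, the weight shapes, the bottleneck size, and the use of ReLU in the bottleneck are all assumptions made for illustration, following the generic squeeze-and-excitation recipe (average over time, pass through a small two-layer network, gate each channel with a sigmoid).

```python
import math
import random


def squeeze_excite_1d(x, w1, b1, w2, b2):
    """Gate a (time, channels) feature map with a global context vector.

    x  : list of T frames, each a list of C floats
    w1 : C x B weights (bottleneck projection), b1 : B biases
    w2 : B x C weights (expansion back to C),   b2 : C biases
    All names and shapes here are hypothetical, chosen for illustration.
    """
    num_frames = len(x)
    num_channels = len(x[0])

    # Squeeze: average each channel over the whole utterance (global context).
    context = [
        sum(frame[c] for frame in x) / num_frames for c in range(num_channels)
    ]

    # Excitation, step 1: bottleneck layer with ReLU (assumed activation).
    hidden = [
        max(0.0, sum(context[c] * w1[c][j] for c in range(num_channels)) + b1[j])
        for j in range(len(b1))
    ]

    # Excitation, step 2: expand to C channels and squash each gate to (0, 1).
    gates = []
    for c in range(num_channels):
        z = sum(hidden[j] * w2[j][c] for j in range(len(hidden))) + b2[c]
        gates.append(1.0 / (1.0 + math.exp(-z)))

    # Scale: every frame is multiplied channel-wise by the same global gate.
    scaled = [[frame[c] * gates[c] for c in range(num_channels)] for frame in x]
    return scaled, gates


if __name__ == "__main__":
    random.seed(0)
    T, C, B = 5, 4, 2  # frames, channels, bottleneck width (illustrative sizes)
    feats = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(T)]
    w1 = [[random.uniform(-0.5, 0.5) for _ in range(B)] for _ in range(C)]
    w2 = [[random.uniform(-0.5, 0.5) for _ in range(C)] for _ in range(B)]
    out, gates = squeeze_excite_1d(feats, w1, [0.0] * B, w2, [0.0] * C)
    print("per-channel gates:", gates)
```

Because the gate is computed from an average over all frames, each output frame depends on the entire utterance rather than only on the local receptive field of the convolution, which is the sense in which the module adds global context.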

research
02/03/2021

Effects of Number of Filters of Convolutional Layers on Speech Recognition Model Accuracy

Inspired by the progress of the End-to-End approach [1], this paper syst...
research
05/16/2020

Conformer: Convolution-augmented Transformer for Speech Recognition

Recently Transformer and Convolution neural network (CNN) based models h...
research
06/20/2018

Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition

Recently, the connectionist temporal classification (CTC) model coupled ...
research
07/24/2017

Exploring Neural Transducers for End-to-End Speech Recognition

In this work, we perform an empirical comparison among the CTC, RNN-Tran...
research
11/07/2018

CNN-based MultiChannel End-to-End Speech Recognition for everyday home environments

Casual conversations involving multiple speakers and noises from surroun...
research
10/18/2021

Automatic Learning of Subword Dependent Model Scales

To improve the performance of state-of-the-art automatic speech recognit...
research
10/29/2018

Cascaded CNN-resBiLSTM-CTC: An End-to-End Acoustic Model For Speech Recognition

Automatic speech recognition (ASR) tasks are resolved by end-to-end deep...
