ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

by Wei Han, et al.

Convolutional neural networks (CNN) have shown promising results for end-to-end speech recognition, albeit still behind other state-of-the-art methods in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple scaling method that scales the widths of ContextNet, achieving a good trade-off between computation and accuracy. We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without an external language model (LM), 1.9%/4.1% with LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the previous best published system at 2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is also verified on a much larger internal dataset.





1 Introduction

Convolutional Neural Network (CNN) based models for end-to-end (E2E) speech recognition are attracting an increasing amount of attention [zhang2017very, zeghidour2018fully, li2019jasper, kriman2019quartznet]. Among them, the Jasper model [li2019jasper] recently achieved a close to state-of-the-art word error rate (WER) of 2.95% on LibriSpeech test-clean [panayotov2015librispeech] with an external neural language model. The main feature of the Jasper model is a deep convolution based encoder with stacked layers of 1D convolutions and skip connections. Depthwise separable convolutions [chollet2017xception] have been utilized to further increase the speed and accuracy of CNN models [hannun2019sequence, kriman2019quartznet]. The key advantage of a CNN based model is its parameter efficiency; however, the WER achieved by the best CNN model, QuartzNet [kriman2019quartznet], is still behind that of RNN/transformer based models [largespecaugment, karita2019comparative, wang2019transformer, zhang2020transformer].

A major difference between the RNN/Transformer [karita2019comparative, wang2019transformer, zhang2020transformer] based models and a CNN model is the length of the context. In a bidirectional RNN model, a cell in theory has access to the information of the whole sequence; in a Transformer model, the attention mechanism explicitly allows the nodes at two distant time stamps to attend to each other. However, a naive convolution with a limited kernel size only covers a small window in the time domain; hence the context is small and global information is not incorporated. In this paper, we argue that this lack of global context is the main cause of the WER gap between CNN based ASR models and the RNN/Transformer based models.

To enhance the global context in the CNN model, we draw inspiration from the squeeze-and-excitation (SE) layer introduced in [hu2018squeeze], and propose a novel CNN model for ASR, which we call ContextNet. An SE layer squeezes a sequence of local feature vectors into a single global context vector, broadcasts this context back to each local feature vector, and merges the two via multiplication. When we place an SE layer after a naive convolution layer, we grant the convolution output access to global information. Empirically, we observe that adding squeeze-and-excitation layers to ContextNet introduces the largest reduction in WER on LibriSpeech test-other.

Previous works on hybrid ASR have successfully introduced the context to acoustic models by either stacking a large number of layers, or having a separately trained global vector to represent the speaker and the environment information [peddinti2015jhu, xue2014fast, karafiat2011ivector, saon2013speaker]. In [sailor2019unsupervised], SE has been adopted to RNN for unsupervised adaptation. In this paper, we show that SE can also be effective for CNN encoders.

The architecture of ContextNet is also inspired by the design choices of QuartzNet [kriman2019quartznet], such as the use of depthwise separable 1D convolutions in the encoder. However, there are some key differences in the architectures in addition to the incorporation of the SE layer. For instance, we use an RNN-T decoder [graves2012sequence, rao2017exploring, he2017streaming] instead of the CTC decoder [graves2006connectionist]. Moreover, we use the Swish activation function [ramachandran2017searching], which contributes a slight but consistent reduction in WER. Overall, ContextNet achieves the state-of-the-art WER of 1.9%/4.1% on LibriSpeech test-clean/test-other, outperforming all existing CNN, transformer and LSTM based models [kriman2019quartznet, zhang2020transformer, wang2019transformer, zeyer2019comparison, karita2019comparative, park2019specaugment].

This paper also studies how to reduce the computation cost of ContextNet for faster training and inference. First, we adopt a progressive downsampling scheme that is commonly used in vision models. Specifically, we progressively reduce the length of the encoded sequence by a factor of eight, significantly lowering the computation while maintaining the encoder's representation power and the overall model accuracy. As a benefit, this downsampling scheme allows us to reduce the kernel size of all the convolution layers to five without significantly reducing the effective receptive field of an encoder output node.

Figure 1: LibriSpeech test-other WER vs. model size. All numbers for E2E models are without an external LM. For the Transformer-Hybrid model we report the encoder size. ContextNet numbers are highlighted in red; α is the main model scaling parameter discussed in Section 2.2.6. Detailed results are in Table 2.

We can scale ContextNet by globally changing the number of channels in convolutional filters. Figure 1 illustrates the trade-off of ContextNet between model size and WER, as well as its comparison against other methods. Clearly, our scaled model achieves the best trade-offs among all.

In summary, the main contributions of this paper are: (1) a CNN architecture with global context that achieves state-of-the-art performance, (2) a progressive downsampling and model scaling scheme to achieve superior accuracy and model size trade-off.

2 Model

This section introduces the architecture details of ContextNet. Section 2.1 discusses the high-level design of ContextNet. Then Section 2.2 introduces our convolutional encoder, and discusses how we progressively reduce the temporal length of the input utterance in the network to reduce the computation while maintaining the accuracy of the model.

2.1 End-to-end Network: CNN-RNN-Transducer

Our network is based on the RNN-Transducer framework [graves2012sequence, rao2017exploring, he2017streaming]. The network contains three components: an audio encoder on the input utterance, a label encoder on the input labels, and a joint network that combines the two and decodes. We directly use the architecture of the label encoder and the joint network from [he2017streaming], but propose a new CNN based audio encoder.

2.2 Encoder Design

Let the input sequence be x = (x_1, ..., x_T). The encoder transforms the original signal x into a high level representation h = (h_1, ..., h_{T'}), where T' <= T. Our convolution based AudioEnc(·) is defined as:

    AudioEnc(x) = C_K(C_{K-1}(... C_1(x))),    (1)

where each C_k defines a convolution block. It contains a few layers of convolutions, each followed by batch normalization [goodfellow2016deep] and an activation function. It also includes the squeeze-and-excitation component [hu2018squeeze] and skip connections [he2016deep].

Before presenting the details of C_k, we first elaborate on the important modules in C_k.

2.2.1 Squeeze-and-excitation

As illustrated in Figure 2, the Squeeze-and-excitation [hu2018squeeze] function, SE(·), performs global average pooling on the input x, transforms it into a global channelwise weight θ(x) and element-wise multiplies each frame by this weight. We adapt the idea to the 1D case:

    x̄ = (1/T) Σ_t x_t,
    θ(x) = Sigmoid(W_2 (Act(W_1 x̄ + b_1)) + b_2),
    SE(x) = θ(x) ∘ x,

where ∘ represents element-wise multiplication, W_1 and W_2 are weight matrices, and b_1, b_2 are bias vectors.

Figure 2: 1D Squeeze-and-excitation module. The input first goes through a convolution layer followed by batch normalization and activation. Then average pooling is applied to condense the conv result into a 1D vector, which is then processed by a bottleneck structure formed by two fully connected (FC) layers with activation functions. The output goes through a Sigmoid function to be mapped to (0, 1), and then tiled and applied on the conv output using pointwise multiplications.
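As a concrete illustration, the squeeze-and-excitation computation above can be sketched in NumPy. This is a minimal sketch: the bottleneck size, the weight shapes, and the choice of `np.tanh` as the bottleneck activation are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite_1d(x, w1, b1, w2, b2, act=np.tanh):
    """1D squeeze-and-excitation on an utterance x of shape (T, C).

    Squeeze: global average pooling over the T frames -> context vector (C,).
    Excite:  bottleneck FC -> FC, then a sigmoid gate in (0, 1).
    The channel-wise gate is tiled over all frames and applied with
    element-wise multiplication.
    """
    x_bar = x.mean(axis=0)                           # global context vector
    theta = sigmoid(w2 @ act(w1 @ x_bar + b1) + b2)  # channel-wise gate
    return theta * x                                 # broadcast over time
```

Because the gate θ(x) depends on the whole utterance, every output frame carries global information, which is exactly the property the SE module adds to an otherwise local convolution.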

2.2.2 Depthwise separable convolution

Let Conv(·) represent the convolution function used in the encoder. In this paper, we choose depthwise separable convolution as Conv(·), because such a design has been previously shown in various applications [chollet2017xception, kriman2019quartznet, sandler2018mobilenetv2] to achieve better parameter efficiency without impacting accuracy.

For simplicity, we use the same kernel size on all depthwise convolution layers in the network.
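To make the parameter-efficiency argument concrete, here is a minimal NumPy sketch of a depthwise separable 1D convolution (the shapes and the 'same' zero padding are illustrative assumptions, not the paper's exact implementation). A standard convolution needs K·C_in·C_out weights, while the separable version needs only K·C_in + C_in·C_out.

```python
import numpy as np

def depthwise_separable_conv1d(x, depth_k, point_w, stride=1):
    """Depthwise separable 1D convolution.

    x:       input utterance, shape (T, C_in)
    depth_k: one temporal kernel per channel, shape (K, C_in), K odd
    point_w: 1x1 channel-mixing projection, shape (C_in, C_out)
    Uses 'same' zero padding, so the output length is ceil(T / stride).
    """
    T, _ = x.shape
    K = depth_k.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # Depthwise step: each channel is convolved with its own kernel.
    depth_out = np.stack([(xp[t:t + K] * depth_k).sum(axis=0)
                          for t in range(0, T, stride)])
    # Pointwise step: a 1x1 convolution mixes the channels.
    return depth_out @ point_w
```

In frameworks such as PyTorch the depthwise step corresponds to a grouped convolution with `groups == C_in`, followed by a kernel-size-1 convolution for the pointwise step.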

2.2.3 Swish activation function

Let Act(·) represent the activation function in the encoder. To choose Act(·), we experimented with both ReLU and the swish function [ramachandran2017searching], defined as:

    Swish(x) = x · σ(βx),

where β = 1 for all our experiments. We observed that the swish function works consistently better than ReLU.

2.2.4 Convolution block

With all the individual modules introduced, we now present the convolution block C(·) from Equation (1). Figure 3 illustrates the high-level architecture of C(·). A block can contain a few functions; let m be the number of functions. Let BN be batch normalization [ioffe2015batch]. We define each layer as f(x) = Act(BN(Conv(x))). Therefore,

    C(x) = Act(SE(f^m(x)) + P(x)),    (2)

where f^m means stacking m layers of the function f on the input and P(·) represents a pointwise projection function on the residual. By a slight abuse of notation, we allow the first layer and the last layer to be different from the other m - 2 layers: if the block needs to downsample the input sequence by a factor of two, the last layer has a stride of two while all the other layers have a stride of one; otherwise all m layers have a stride of one. Additionally, if the block has an input number of channels D_in and an output number of channels D_out, the first layer turns the D_in channels into D_out channels while the remaining m - 1 layers keep the number of channels at D_out. Following this convention, the projection function P(·) uses the same overall stride as the block, so that the residual matches the length of the block output.

Figure 3: A convolution block contains a number of convolutions, each followed by batch normalization and activation. A squeeze-and-excitation (SE) block operates on the output of the last convolution layer. A skip connection with projection is applied on the output of the squeeze-and-excitation block.
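The composition in Equation (2) can be sketched structurally in a few lines of Python. The callables here are stand-ins for the real layers (each element of `layers` plays the role of Act(BN(Conv(·)))); only the wiring of the block is shown.

```python
def conv_block(x, layers, se, proj, act):
    """C(x) = Act(SE(f^m(x)) + P(x)): apply the m conv layers in sequence,
    gate the result with squeeze-and-excitation, add the projected
    residual, then apply the final activation. Structural sketch only;
    layers, se, proj and act are stand-in callables."""
    y = x
    for f in layers:   # f^m: m stacked layers of f
        y = f(y)
    return act(se(y) + proj(x))
```

For example, with two layers that each add 1, an SE stand-in that doubles its input, an identity projection and an identity activation, `conv_block(3, ...)` evaluates Act(2·(3+2) + 3).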

2.2.5 Progressive downsampling

We use strided convolution for temporal downsampling. More downsampling layers reduce the computation cost, but excessive downsampling in the encoder may negatively impact the decoder. Empirically, we find that a progressive downsampling scheme achieves a good trade-off between speed and accuracy. These trade-offs are discussed in Section 3.3.
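The 8x progressive reduction can be sanity-checked with a few lines, assuming 'same' padding so that each stride-2 layer maps a length T to ceil(T/2):

```python
import math

def downsampled_length(T, strides=(2, 2, 2)):
    # Three stride-2 convolution blocks give the 8x progressive
    # temporal reduction; each stride-s layer maps T -> ceil(T / s).
    for s in strides:
        T = math.ceil(T / s)
    return T

# One second of audio at a 10 ms frame stride is 100 input frames,
# which the encoder reduces as 100 -> 50 -> 25 -> 13.
```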

2.2.6 Configuration details of ContextNet

ContextNet has 23 convolution blocks C0, ..., C22. All convolution blocks have five layers of convolution, except C0 and C22, which only have one layer of convolution each. Table 1 summarizes the architecture details. Note that a global parameter α controls the scaling of our model. Increasing α increases the number of channels of the convolutions, giving the model more representation power with a larger model size.

Block ID   #Conv layers   #Output channels   Kernel size   Other
C0         1              256α               5             No residual
C1-C2      5              256α               5
C3         5              256α               5             stride is 2
C4-C6      5              256α               5
C7         5              256α               5             stride is 2
C8-C10     5              256α               5
C11-C13    5              512α               5
C14        5              512α               5             stride is 2
C15-C21    5              512α               5
C22        1              640α               5             No residual
Table 1: Configuration of the ContextNet encoder. The parameter α controls the number of output channels, and thus the scaling of our model. The kernel size refers to the window size in the temporal domain; the convolutions are across frequency. If the stride of a convolution block is 2, its last conv layer has a stride of two while the rest of the conv layers have a stride of one, as discussed in Section 2.2.

3 Experiments

We conduct experiments on the LibriSpeech [panayotov2015librispeech] dataset, which consists of 970 hours of labeled speech and an additional text-only corpus for building language models. We extract 80-dimensional filterbank features using a 25 ms window with a stride of 10 ms.

We use the Adam optimizer [kingma2014adam] and a transformer learning rate schedule [vaswani2017attention] with 15k warm-up steps and a peak learning rate that scales as d^{-1/2}, where d is the number of output channels of the final block in Table 1. An ℓ2 regularization is also added to all the trainable weights in the network. We use a single-layer LSTM as the decoder, with an input dimension of 640. Variational noise is introduced to the decoder as a regularization.

We use SpecAugment [park2019specaugment, largespecaugment] with mask parameter (F = 27), and ten time masks with maximum time-mask ratio (p_S = 0.05), where the maximum size of the time mask is set to p_S times the length of the utterance. Time warping is not used. We use a 3-layer LSTM LM with width 4096 trained on the LibriSpeech language model corpus with the LibriSpeech 960h transcripts added, tokenized with the 1k WPM built from LibriSpeech 960h. The LM has a word-level perplexity of 63.9 on the dev-set transcripts. The LM weight for shallow fusion is tuned on the dev-set via grid search.
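The transformer learning rate schedule referenced above pairs a linear warm-up with inverse-square-root decay. A sketch, normalized so the maximum equals `peak` at the end of warm-up (the paper's absolute peak value is not reproduced here):

```python
def transformer_lr(step, warmup=15_000, peak=1.0):
    """Linear warm-up to `peak` over `warmup` steps, then decay
    proportional to 1/sqrt(step), as in [vaswani2017attention].
    Normalized so the maximum value is `peak`."""
    step = max(step, 1)  # avoid division by zero at step 0
    return peak * min(step / warmup, (warmup / step) ** 0.5)
```

Halfway through warm-up the rate is half the peak; at four times the warm-up step count it has decayed back to half the peak.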

3.1 Results on LibriSpeech

Table 2 shows the WER of three different configurations of our ContextNet on LibriSpeech, as well as a comparison with some published state-of-the-art systems with different architectures. The difference between these ContextNet configurations is the network width, controlled by α in Table 1. Specifically, we choose α ∈ {0.5, 1, 2} for the small, medium and large ContextNet. We also build our own LSTM baseline as a reference.

From Table 2, we find that with only 30M parameters, our medium model, ContextNet(M), is already comparable with the best previously published system [zhang2020transformer]. Our large model, ContextNet(L), outperforms all existing models by 13% relatively on test-clean and 18% relatively on test-other without an LM, and achieves 1.9%/4.1% with an LM. Our small model, ContextNet(S), is still significantly better than QuartzNet (CNN) [kriman2019quartznet], with or without a language model. This clearly demonstrates the representation power and parameter efficiency of ContextNet.

Method #Params (M) Without LM With LM
testclean testother testclean testother
 Transformer [wang2019transformer] - - - 2.26 4.85
 QuartzNet (CNN) [kriman2019quartznet] 19 3.90 11.28 2.69 7.25
 Transformer [karita2019comparative] - - - 2.6 5.7
 Transformer [synnaeve2019endtoend] 270 2.89 6.98 2.33 5.17
 LSTM 360 2.6 6.0 2.2 5.2
 Transformer [zhang2020transformer] 139 2.4 5.6 2.0 4.6
This Work
 ContextNet(S) 10 2.9 7.0 2.3 5.5
 ContextNet(M) 30 2.4 5.4 2.0 4.5
 ContextNet(L) 112 2.1 4.6 1.9 4.1
Table 2: Compared to the state-of-the-art models, ContextNet achieves superior performance both with and without language models.

3.2 Effect of Context Size

To validate the effectiveness of adding global context to the CNN model for ASR, we perform an ablation study on how the squeeze-and-excitation module affects the WER on LibriSpeech test-clean/test-other. ContextNet from Table 1 with all squeeze-and-excitation modules removed serves as the zero-context baseline.

The vanilla squeeze-and-excitation module uses the whole utterance as context. To investigate the effect of different context sizes, we replace the global average pooling operator of the squeeze-and-excitation module with a stride-one pooling operator, where the context can be controlled by the size of the pooling window. In this study, we compare window sizes of 256, 512 and 1024 on all convolutional blocks.
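The stride-one pooling used in this ablation can be sketched as a centered moving average (the exact window alignment in the paper is an assumption here):

```python
import numpy as np

def windowed_context(x, w):
    """Stride-one average pooling with window size w on x of shape (T, C).

    Replaces the global average of the SE module with a local context:
    frame t sees roughly w/2 frames on each side (clipped at the edges),
    so w controls how much of the utterance each SE gate can observe.
    """
    T = x.shape[0]
    out = np.empty_like(x, dtype=float)
    for t in range(T):
        lo, hi = max(0, t - w // 2), min(T, t + w // 2 + 1)
        out[t] = x[lo:hi].mean(axis=0)
    return out
```

As w grows past the utterance length, every frame recovers the global average, matching the vanilla SE module; this is why the "global" row in Table 3 is the limiting case of the windowed variants.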

As illustrated in Table 3, the SE module provides a major improvement over the baseline. In addition, the benefit becomes greater as the length of the context window increases. This is consistent with the observation in a similar study of SE on image classification models [hu2018gather].

Context dev clean dev other test clean test other
None 2.6 7.0 2.6 6.9
256 2.1 5.4 2.3 5.5
512 2.1 5.1 2.3 5.2
1024 2.1 5.0 2.3 5.1
global 2.0 4.9 2.3 4.9
Table 3: Effect of the context window size. All models have .

3.3 Depth, Width, Kernel Size and Downsampling

Depth:  We sweep over the number of convolutional blocks; our best configuration is given in Table 1. We find that with this configuration, we can train a model in a day with stable convergence.

Downsampling   Kernel size   GFLOPS*   test-clean   test-other
2x             3             2.131     2.7          6.3
2x             5             2.137     2.6          5.8
2x             11            2.156     2.4          5.4
2x             23            2.194     2.3          5.0
8x             3             1.036     2.3          5.1
8x             5             1.040     2.3          5.0
8x             11            1.050     2.3          5.0
8x             23            1.071     2.3          5.2
Table 4: The effect of temporal reduction and convolution kernel size on FLOPS and model accuracy. *We report the average encoder FLOPS for processing one second of audio.
α     #Params (M)   dev-clean   dev-other   test-clean   test-other
0.5   10            2.7         7.0         2.9          7.0
1     30            2.2         5.1         2.4          5.4
1.5   65            2.0         4.7         2.2          4.8
2     112           2.0         4.6         2.1          4.6
Table 5: Model scaling by network width.

Width:   We globally scale the width of the network (i.e., the number of channels) on all layers and study how it impacts the model performance. Specifically, we take the ContextNet model from Table 1, sweep α ∈ {0.5, 1, 1.5, 2}, and report the model size and the WER on LibriSpeech. Table 5 summarizes the result; it demonstrates the good trade-off between model size and WER of ContextNet.

Downsampling and kernel size:  Table 4 summarizes the FLOPS and WER on LibriSpeech with various choices of downsampling and filter size. We use the same model with only one downsampling layer as the baseline; hence the baseline only does 2x temporal reduction. We sweep the kernel size in {3, 5, 11, 23}; each kernel size is applied to all the depthwise convolution layers. The results suggest that progressive downsampling introduces significant savings in the number of FLOPS. Moreover, it actually benefits the accuracy of the model slightly. In addition, with progressive downsampling, increasing the kernel size decreases the WER of the model.

3.4 Large Scale Experiments

Finally, we show that the proposed architecture is also effective on large scale datasets. We use an experiment setup similar to [Chiu19longform], where the training set consists of public YouTube videos with semi-supervised transcripts generated by the approach in [liao2013large]. We evaluate on 117 videos with a total duration of 24.12 hours. The test set has diverse and challenging acoustic environments. (The eval set has been changed recently, so the numbers in Table 6 differ from those reported in [Chiu19longform].) Table 6 summarizes the result. We can see that ContextNet outperforms the previous best architecture from [Chiu19longform], which is a combination of convolution and bidirectional LSTM, by 12% relatively, with fewer parameters and fewer FLOPS.

Model #Params (M) GFLOPS Youtube
TDNN [Chiu19longform] 192 3.834 9.3
ContextNet 112 2.647 8.2
Table 6: Comparing ContextNet with the previous best results on the YouTube test set.

4 Conclusion

In this work, we proposed and evaluated a CNN-RNN-transducer architecture for end-to-end speech recognition. A number of modeling choices were discussed and compared. The model achieves a new state-of-the-art accuracy on the LibriSpeech benchmark with significantly fewer parameters compared to previously published E2E models. The proposed architecture can also easily be used to search for small ASR models by limiting the width of the network. An initial study on a much larger and more challenging dataset also confirms our findings.