Convolutional neural network (CNN) based models for end-to-end (E2E) speech recognition are attracting an increasing amount of attention [zhang2017very, zeghidour2018fully, li2019jasper, kriman2019quartznet]. Among them, the Jasper model [li2019jasper] recently achieved a word error rate (WER) of 2.95% on LibriSpeech test-clean [panayotov2015librispeech] with an external neural language model, which is close to the state of the art. The main feature of the Jasper model is a deep convolution-based encoder with stacked layers of 1D convolutions and skip connections. Depthwise separable convolutions [chollet2017xception] have been utilized to further increase the speed and accuracy of CNN models [hannun2019sequence, kriman2019quartznet]. The key advantage of a CNN-based model is its parameter efficiency; however, the WER achieved by the best CNN model, QuartzNet [kriman2019quartznet], is still behind that of the RNN/Transformer-based models [largespecaugment, karita2019comparative, wang2019transformer, zhang2020transformer].
A major difference between the RNN/Transformer-based models [karita2019comparative, wang2019transformer, zhang2020transformer] and a CNN model is the length of the context. In a bidirectional RNN model, a cell in theory has access to the information of the whole sequence; in a Transformer model, the attention mechanism explicitly allows the nodes at two distant time stamps to attend to each other. However, a naive convolution with a limited kernel size only covers a small window in the time domain; hence the context is small and the global information is not incorporated. In this paper, we argue that the lack of global context is the main cause of the WER gap between the CNN-based ASR model and the RNN/Transformer-based models.
To enhance the global context in the CNN model, we draw inspiration from the squeeze-and-excitation (SE) layer introduced in [hu2018squeeze], and propose a novel CNN model for ASR, which we call ContextNet. An SE layer squeezes a sequence of local feature vectors into a single global context vector, broadcasts this context back to each local feature vector, and merges the two via multiplication. When we place an SE layer after a naive convolution layer, we grant the convolution output access to global information. Empirically, we observe that adding squeeze-and-excitation layers to ContextNet yields the largest reduction in WER on LibriSpeech test-other.
Previous work on hybrid ASR has successfully introduced context to acoustic models by either stacking a large number of layers, or using a separately trained global vector to represent the speaker and the environment information [peddinti2015jhu, xue2014fast, karafiat2011ivector, saon2013speaker]. In [sailor2019unsupervised], SE has been applied to RNNs for unsupervised adaptation. In this paper, we show that SE can also be effective for CNN encoders.
The architecture of ContextNet is also inspired by the design choices of QuartzNet [kriman2019quartznet], such as the use of depthwise separable 1D convolution in the encoder. However, there are some key differences in the architectures in addition to the incorporation of the SE layer. For instance, we use an RNN-T decoder [graves2012sequence, rao2017exploring, he2017streaming] instead of the CTC decoder [graves2006connectionist]. Moreover, we use the Swish activation function [ramachandran2017searching], which contributes a slight but consistent reduction in WER. Overall, ContextNet achieves a state-of-the-art WER of 1.9%/4.1% on LibriSpeech test-clean/test-other, outperforming all existing CNN, Transformer and LSTM based models [kriman2019quartznet, zhang2020transformer, wang2019transformer, zeyer2019comparison, karita2019comparative, park2019specaugment].
This paper also studies how to reduce the computation cost of ContextNet for faster training and inference. First, we adopt a progressive downsampling scheme that is commonly used in vision models. Specifically, we progressively reduce the length of the encoded sequence by a factor of eight, which significantly lowers the computation while maintaining the encoder's representation power and the overall model accuracy. As a benefit, this downsampling scheme allows us to reduce the kernel size of all the convolution layers to five without significantly reducing the effective receptive field of an encoder output node.
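The interaction between downsampling and receptive field can be checked with a few lines of arithmetic. The sketch below uses the standard receptive-field recurrence for stacked 1D convolutions; the two nine-layer stacks are hypothetical and are not the ContextNet configuration.

```python
def receptive_field(layers):
    """Receptive field, in input frames, of one output node.

    `layers` is a list of (kernel_size, stride) pairs ordered from input to output.
    """
    rf = 1    # a single input frame sees only itself
    jump = 1  # spacing, in input frames, between neighbouring nodes at this depth
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks: nine layers with kernel size 5.
no_downsampling = [(5, 1)] * 9
progressive = [(5, 1), (5, 1), (5, 2)] * 3  # three stride-2 layers, 8x reduction overall

print(receptive_field(no_downsampling))  # 37
print(receptive_field(progressive))      # 85
```

With the same kernel size, every layer placed after a stride-two layer widens the receptive field twice as fast, which is why a small kernel can still cover a long span of audio.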
We can scale ContextNet by globally changing the number of channels in the convolutional filters. Figure 1 illustrates the trade-off of ContextNet between model size and WER, as well as its comparison against other methods. Our scaled models achieve the best trade-off among all the compared methods.
In summary, the main contributions of this paper are: (1) a CNN architecture with global context that achieves state-of-the-art performance, and (2) a progressive downsampling and model scaling scheme that achieves a superior trade-off between accuracy and model size.
This section introduces the architecture details of ContextNet. Section 2.1 discusses the high-level design of ContextNet. Then Section 2.2 introduces our convolutional encoder, and discusses how we progressively reduce the temporal length of the input utterance in the network to reduce the computation while maintaining the accuracy of the model.
2.1 End-to-end Network: CNN-RNN-Transducer
Our network is based on the RNN-Transducer framework [graves2012sequence, rao2017exploring, he2017streaming]. The network contains three components: an audio encoder on the input utterance, a label encoder on the input label, and a joint network that combines the two and decodes. We directly use the architecture of the label encoder and the joint network from [he2017streaming], but propose a new CNN-based audio encoder.
2.2 Encoder Design
Let the input sequence be $x = (x_1, \ldots, x_T)$. The encoder transforms the original signal $x$ into a high-level representation $h = (h_1, \ldots, h_{T'})$, where $T' \le T$. Our convolution-based $\mathrm{AudioEncoder}(\cdot)$ is defined as:
$$h = \mathrm{AudioEncoder}(x) = C_K\left(C_{K-1}\left(\cdots C_1(x)\right)\right), \tag{1}$$
where each $C_k(\cdot)$ defines a convolution block. It contains a few layers of convolutions, each followed by batch normalization [goodfellow2016deep] and an activation function. It also includes the squeeze-and-excitation component [hu2018squeeze] and skip connections [he2016deep].
Before presenting the details of $C(\cdot)$, we first elaborate on the important modules in $C(\cdot)$.
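2.2.1 Squeeze-and-excitation
As illustrated in Figure 2, the squeeze-and-excitation [hu2018squeeze] function, $\mathrm{SE}(\cdot)$, performs global average pooling on the input $x$, transforms it into a global channel-wise weight $\theta(x)$, and multiplies each frame element-wise by this weight. We adapt the idea to the 1D case,
$$\bar{x} = \frac{1}{T}\sum_{t} x_t, \qquad \theta(x) = \mathrm{Sigmoid}\left(W_2\,\mathrm{Act}\left(W_1 \bar{x} + b_1\right) + b_2\right), \qquad \mathrm{SE}(x) = \theta(x) \circ x,$$
where $\circ$ represents element-wise multiplication, $\mathrm{Act}(\cdot)$ is the activation function, $W_1, W_2$ are weight matrices, and $b_1, b_2$ are bias vectors.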
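The 1D squeeze-and-excitation operation described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function and argument names are hypothetical, the input is a list of frames (each a list of channel values), and the swish activation between the two affine maps is an assumption.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def swish(v):
    return v * sigmoid(v)


def affine(weights, bias, vec):
    """Compute W @ vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def squeeze_excite_1d(x, w1, b1, w2, b2):
    """x: list of T frames, each a list of C channel values."""
    num_frames = len(x)
    num_channels = len(x[0])
    # Squeeze: average every channel over the whole utterance.
    mean = [sum(frame[c] for frame in x) / num_frames for c in range(num_channels)]
    # Excite: two affine maps with an activation in between (assumed swish),
    # then a sigmoid so each channel weight lies in (0, 1).
    hidden = [swish(v) for v in affine(w1, b1, mean)]
    gate = [sigmoid(v) for v in affine(w2, b2, hidden)]
    # Broadcast the same per-channel weight to every frame.
    return [[g * v for g, v in zip(gate, frame)] for frame in x]


# Toy example: 2 frames, 2 channels, bottleneck of size 1, all-zero parameters.
x = [[1.0, 2.0], [3.0, 4.0]]
out = squeeze_excite_1d(x, w1=[[0.0, 0.0]], b1=[0.0], w2=[[0.0], [0.0]], b2=[0.0, 0.0])
print(out)  # [[0.5, 1.0], [1.5, 2.0]] because every gate is sigmoid(0) = 0.5
```

Because the gate is computed from the time average, every output frame depends on the whole utterance even though the multiplication itself is frame-local.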
2.2.2 Depthwise separable convolution
Let $\mathrm{conv}(\cdot)$ represent the convolution function used in the encoder. In this paper, we choose depthwise separable convolution as $\mathrm{conv}(\cdot)$, because such a design has previously been shown in various applications [chollet2017xception, kriman2019quartznet, sandler2018mobilenetv2] to achieve better parameter efficiency without impacting accuracy.
For simplicity, we use the same kernel size on all depthwise convolution layers in the network.
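The parameter saving of a depthwise separable 1D convolution over a standard one follows directly from the weight shapes. The sketch below counts weights only (biases are ignored), and the channel count and kernel size in the example are hypothetical.

```python
def standard_conv1d_params(in_channels, out_channels, kernel_size):
    # One length-k filter for every (input channel, output channel) pair.
    return in_channels * out_channels * kernel_size


def separable_conv1d_params(in_channels, out_channels, kernel_size):
    depthwise = in_channels * kernel_size   # one length-k filter per input channel
    pointwise = in_channels * out_channels  # kernel-size-1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 256 input channels, 256 output channels, kernel size 5.
print(standard_conv1d_params(256, 256, 5))   # 327680
print(separable_conv1d_params(256, 256, 5))  # 66816
```

For wide layers the pointwise term dominates, so the kernel size has little effect on the parameter count of a separable layer.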
2.2.3 Swish activation function
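Let $\mathrm{Act}(\cdot)$ represent the activation function in the encoder. To choose $\mathrm{Act}(\cdot)$, we experimented with both ReLU and the swish function [ramachandran2017searching], defined as:
$$\mathrm{Act}(x) = x \cdot \sigma(\beta x) = \frac{x}{1 + \exp(-\beta x)},$$
where $\beta$ is held fixed for all our experiments. We observed that the swish function works consistently better than ReLU.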
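For reference, the swish function and ReLU can be compared numerically with the short sketch below. The default `beta=1.0` is an assumption made for illustration, not a value taken from the text.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x, beta=1.0):
    # beta=1.0 is an assumed default for this sketch.
    return x * sigmoid(beta * x)


def relu(x):
    return max(0.0, x)


for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(value, relu(value), round(swish(value), 4))
```

Unlike ReLU, swish is smooth and lets small negative values through, approaching ReLU for inputs of large magnitude.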
2.2.4 Convolution block
With all the individual modules introduced, we now present the convolution block $C(\cdot)$ from Equation (1). Figure 3 illustrates the high-level architecture of $C(\cdot)$. A block can contain several $\mathrm{conv}(\cdot)$ functions; let $m$ be the number of $\mathrm{conv}(\cdot)$ functions. Let $\mathrm{BN}(\cdot)$ be the batch normalization [ioffe2015batch]. We define each layer as $f(x) = \mathrm{Act}(\mathrm{BN}(\mathrm{conv}(x)))$. Therefore,
$$C(x) = \mathrm{Act}\left(\mathrm{SE}\left(f^{m}(x)\right) + P(x)\right),$$
where $f^{m}$ means stacking $m$ layers of the function $f(\cdot)$ on the input, and $P(\cdot)$ represents a pointwise projection function on the residual. By a slight abuse of notation, we allow the first layer and the last layer to be different from the other $m-2$ layers: if the block needs to downsample the input sequence by a factor of two, the last layer has a stride of two while the remaining $m-1$ layers have a stride of one; otherwise all $m$ layers have a stride of one. Additionally, if the block has $D_{in}$ input channels and $D_{out}$ output channels, the first layer turns $D_{in}$ channels into $D_{out}$ channels while the remaining $m-1$ layers keep the number of channels at $D_{out}$. Following convention, the projection function $P$ has the same stride as the first layer.
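The per-layer channel and stride rules of a block can be written down as a small planning function. This is a sketch under stated assumptions: the function name and the tuple layout are hypothetical, and only the channel sizes of the residual projection are modelled.

```python
def block_layer_plan(num_layers, in_channels, out_channels, downsample):
    """Return (layers, projection) for one convolution block.

    Each entry of `layers` is (in_channels, out_channels, stride).
    `projection` is (in_channels, out_channels) for the pointwise residual branch.
    """
    layers = []
    for i in range(num_layers):
        # Only the first layer changes the channel count.
        layer_in = in_channels if i == 0 else out_channels
        # Only the last layer downsamples, and only if the block is a downsampling block.
        is_last = i == num_layers - 1
        stride = 2 if (downsample and is_last) else 1
        layers.append((layer_in, out_channels, stride))
    projection = (in_channels, out_channels)
    return layers, projection


# Hypothetical block: 3 layers, 4 -> 8 channels, with downsampling.
layers, projection = block_layer_plan(3, 4, 8, downsample=True)
print(layers)      # [(4, 8, 1), (8, 8, 1), (8, 8, 2)]
print(projection)  # (4, 8)
```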
2.2.5 Progressive downsampling
We use strided convolution for temporal downsampling. More downsampling layers reduce the computation cost, but excessive downsampling in the encoder may negatively impact the decoder. Empirically, we find that a progressive downsampling scheme achieves a good trade-off between speed and accuracy. These trade-offs are discussed in Section 3.3.
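The effect of progressive downsampling on sequence length can be traced with the sketch below. It assumes "same" padding, so a stride-s layer maps a length-L sequence to ceil(L / s) frames; the number of input frames is hypothetical.

```python
def lengths_after_strides(num_frames, strides):
    """Sequence length after each strided layer, assuming 'same' padding."""
    lengths = []
    for stride in strides:
        num_frames = -(-num_frames // stride)  # ceiling division
        lengths.append(num_frames)
    return lengths


# Hypothetical 10-second utterance at a 10 ms frame stride: 1000 input frames.
# Three stride-2 layers give an overall 8x temporal reduction.
print(lengths_after_strides(1000, [2, 2, 2]))  # [500, 250, 125]
```

Every layer placed after a stride-two layer processes half as many frames, so spreading the reductions through the encoder trades early-layer resolution against total computation.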
2.2.6 Configuration details of ContextNet
ContextNet stacks the convolution blocks listed in Table 1. All convolution blocks have five layers of convolution, except two, which have only one layer of convolution each. Table 1 summarizes the architecture details. Note that a global parameter $\alpha$ controls the scaling of our model. Increasing $\alpha$ increases the number of channels of the convolutions, giving the model more representation power at the cost of a larger model size.
|Block ID||#Conv layers||#Output channels||Kernel size||Other|
We conduct experiments on the LibriSpeech [panayotov2015librispeech] dataset, which consists of 970 hours of labeled speech and an additional text-only corpus for building a language model. We extract 80-dimensional filterbank features using a 25 ms window with a stride of 10 ms.
We use the Adam optimizer [kingma2014adam] and a Transformer learning rate schedule [vaswani2017attention] with 15k warm-up steps and a peak learning rate that depends on the number of output channels of one of the blocks in Table 1. A regularization term is also added to all the trainable weights in the network. We use a single-layer LSTM as the decoder with an input dimension of 640. Variational noise is introduced to the decoder as regularization.
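A Transformer-style schedule with linear warm-up followed by inverse-square-root decay can be sketched as follows. The form below is the schedule of [vaswani2017attention] rescaled so that its maximum equals `peak_lr`; the peak value used in the example is hypothetical and not taken from the text.

```python
import math


def transformer_lr(step, peak_lr, warmup_steps=15000):
    """Linear warm-up to peak_lr, then decay proportional to 1 / sqrt(step)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return peak_lr * min(step / warmup_steps, math.sqrt(warmup_steps / step))


# Hypothetical peak learning rate of 1e-3.
for step in (1500, 7500, 15000, 60000):
    print(step, transformer_lr(step, peak_lr=1e-3))
```

The two branches of the `min` meet exactly at `warmup_steps`, where the learning rate reaches its peak.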
We use SpecAugment [park2019specaugment, largespecaugment] with a frequency mask parameter $F$ and ten time masks with a maximum time-mask ratio $p_S$, where the maximum size of each time mask is set to $p_S$ times the length of the utterance. Time warping is not used. We use a 3-layer LSTM LM with width 4096 trained on the LibriSpeech language model corpus with the LibriSpeech 960h transcripts added, tokenized with the 1k WPM built from LibriSpeech 960h. The LM has a word-level perplexity of 63.9 on the dev-set transcripts. The LM weight for shallow fusion is tuned on the dev set via grid search.
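The adaptive time masking described above, where the maximum mask size is a fixed ratio of the utterance length, can be sketched as follows. The function names, the ratio of 0.05, and masking with zeros are assumptions made for illustration only.

```python
import random


def time_mask_spans(num_frames, num_masks, max_ratio, rng):
    """Sample [start, end) spans whose size is at most max_ratio * num_frames."""
    max_size = int(max_ratio * num_frames)
    spans = []
    for _ in range(num_masks):
        size = rng.randint(0, max_size)            # inclusive on both ends
        start = rng.randint(0, num_frames - size)  # keep the span inside the utterance
        spans.append((start, start + size))
    return spans


def apply_time_masks(features, spans):
    """Return a copy of `features` (list of frames) with masked frames zeroed."""
    masked = [list(frame) for frame in features]
    for start, end in spans:
        for t in range(start, end):
            masked[t] = [0.0] * len(masked[t])
    return masked


rng = random.Random(0)
features = [[1.0] * 80 for _ in range(1000)]  # 1000 frames of 80-dim filterbanks
spans = time_mask_spans(len(features), num_masks=10, max_ratio=0.05, rng=rng)
masked = apply_time_masks(features, spans)
print(max(end - start for start, end in spans) <= 50)  # True
```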
3.1 Results on LibriSpeech
Table 2 shows the WER of three different configurations of ContextNet on LibriSpeech, as well as a comparison with published state-of-the-art systems with different architectures. The difference between these ContextNet configurations is the network width, controlled by $\alpha$ in Table 1. Specifically, we choose three values of $\alpha$ for the small, medium and large ContextNet, respectively. We also build our own LSTM baseline as a reference.
From Table 2, we find that with only 30M parameters, our medium model, ContextNet(M), is already comparable with the best previously published system [zhang2020transformer]. Our large model, ContextNet(L), outperforms all the existing models by a relative 13% on test-clean and a relative 18% on test-other without an LM, and achieves 1.9%/4.1% with an LM. Our small model, ContextNet(S), is still significantly better than QuartzNet (CNN) [kriman2019quartznet], with or without a language model. This clearly demonstrates the representation power and parameter efficiency of ContextNet.
|Method||#Params (M)||Without LM: test-clean||Without LM: test-other||With LM: test-clean||With LM: test-other|
|QuartzNet (CNN) [kriman2019quartznet]||19||3.90||11.28||2.69||7.25|
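Relative improvements such as those quoted in the text are computed as the WER difference divided by the baseline WER. The sketch below applies this to the QuartzNet row, comparing its WER without and with a language model; the helper name is hypothetical.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# QuartzNet row above: effect of adding an LM.
print(round(relative_reduction(3.90, 2.69), 1))   # 31.0 (test-clean)
print(round(relative_reduction(11.28, 7.25), 1))  # 35.7 (test-other)
```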
3.2 Effect of Context Size
To validate the effectiveness of adding global context to the CNN model for ASR, we perform an ablation study on how the squeeze-and-excitation module affects the WER on LibriSpeech test-clean/test-other. ContextNet in Table 1 with all squeeze-and-excitation modules removed serves as the zero-context baseline.
The vanilla squeeze-and-excitation module uses the whole utterance as context. To investigate the effect of different context sizes, we replace the global average pooling operator of the squeeze-and-excitation module with a stride-one pooling operator, where the context is controlled by the size of the pooling window. In this study, we compare three window sizes on all convolutional blocks.
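A stride-one average pooling operator with a controllable window can be sketched for a single channel as follows. The centred window and the truncation at the utterance boundaries are assumptions of this sketch; the source does not specify how edges are handled.

```python
def windowed_average(values, window):
    """Stride-one average pooling over time; output has the same length as the input."""
    n = len(values)
    left = (window - 1) // 2
    right = window // 2
    out = []
    for t in range(n):
        lo = max(0, t - left)        # truncate the window at the start
        hi = min(n, t + right + 1)   # truncate the window at the end
        segment = values[lo:hi]
        out.append(sum(segment) / len(segment))
    return out


channel = [1.0, 2.0, 3.0, 4.0, 5.0]
print(windowed_average(channel, 3))    # [1.5, 2.0, 3.0, 4.0, 4.5]
print(windowed_average(channel, 100))  # [3.0, 3.0, 3.0, 3.0, 3.0]
```

Once the window covers the whole utterance, the operator reduces to the global average used by the vanilla module.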
As shown in Table 3, the SE module provides a major improvement over the baseline. In addition, the benefit becomes greater as the length of the context window increases. This is consistent with the observation in a similar study of SE on image classification models [hu2018gather].
|Context||dev clean||dev other||test clean||test other|
3.3 Depth, Width, Kernel Size and Downsampling
Depth: We sweep the number of convolutional blocks; our best configuration is given in Table 1. We find that with this configuration, we can train a model in a day with stable convergence.
|GFLOPS (average encoder FLOPS for processing one second of audio)||test-clean||test-other|
|#Params(M)||dev clean||dev other||test clean||test other|
Width: We globally scale the width of the network (i.e., the number of channels) on all layers and study how it impacts the model performance. Specifically, we take the ContextNet model from Table 1, sweep $\alpha$, and report the model size and the WER on LibriSpeech. Table 5 summarizes the results, which demonstrate the good trade-off between model size and WER of ContextNet.
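To see why width is an effective knob for model size, note that the weight count of a depthwise separable layer grows roughly quadratically with the number of channels. The sketch below illustrates this with a hypothetical base width, kernel size, and set of scaling factors; none of these values are taken from Table 1.

```python
def scaled_channels(base_channels, scale):
    return int(round(base_channels * scale))


def separable_layer_params(channels, kernel_size):
    # Depthwise weights plus pointwise weights; biases are ignored.
    return channels * kernel_size + channels * channels


# Hypothetical base width of 256 channels and kernel size 5.
for scale in (0.5, 1.0, 2.0):
    channels = scaled_channels(256, scale)
    print(scale, channels, separable_layer_params(channels, 5))
```

Doubling the width roughly quadruples the parameters of each layer, so a modest sweep of the scaling factor covers a wide range of model sizes.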
Downsampling and kernel size: Table 4 summarizes the FLOPS and WER on LibriSpeech with various choices of downsampling and filter size. We use the same model with only one downsampling layer as the baseline; hence the baseline only does 2x temporal reduction. We sweep the kernel size, applying each kernel size to all the depthwise convolution layers. The results suggest that progressive downsampling yields a significant saving in the number of FLOPS. Moreover, it slightly improves the accuracy of the model. In addition, with progressive downsampling, increasing the kernel size decreases the WER of the model.
3.4 Large Scale Experiments
Finally, we show that the proposed architecture is also effective on large-scale datasets. We use an experimental setup similar to [Chiu19longform], where the training set consists of public YouTube videos with semi-supervised transcripts generated by the approach in [liao2013large]. We evaluate on 117 videos with a total duration of 24.12 hours. The test set has diverse and challenging acoustic environments. (The eval set has been changed recently, so the numbers in Table 6 are different from those reported in [Chiu19longform].) Table 6 summarizes the results. We can see that ContextNet outperforms the previous best architecture from [Chiu19longform], which is a combination of convolution and bidirectional LSTM, by a relative 12% with fewer parameters and FLOPS.
In this work, we proposed and evaluated a CNN-RNN-based transducer for end-to-end speech recognition. Several modeling choices are discussed and compared. This model achieves a new state-of-the-art accuracy on the LibriSpeech benchmark with far fewer parameters compared to published E2E models. The proposed architecture can easily be used to search for small ASR models by limiting the width of the network. An initial study on a much larger and more challenging dataset also confirms our findings.