End-to-end automatic speech recognition (ASR) has become the standard for state-of-the-art approaches. Indeed, the availability of large-scale hand-labeled datasets and sufficient computing resources has made it possible to train powerful deep neural networks for ASR, reaching very low Word Error Rates (WER) on academic benchmarks. Yet even as these new approaches keep pushing the state of the art, one major pitfall for using them in the real world is their resource cost. To achieve high performance, the training budget is often very large, requiring a sizeable number of GPUs [kriman2020quartznet, majumdar2021citrinet]. And once the model has been successfully trained, the inference time may also become prohibitive for some specific usages. For instance, available devices often do not come with a GPU, and an ideal usage in a production environment would be to compute the inference on a single basic CPU.
The integration of neural networks as a production-ready technology has been broadly explored in many fields such as vision [sandler2018mobilenetv2], and different approaches have been proposed to address these problems. They may be gathered into several broad categories [neill2020overview], such as weight sharing [dabre2019recurrent], pruning [cantu2003pruning], quantization [dally2015high], knowledge distillation [bucilua2006model], low-rank decomposition [xue2013restructuring] and efficient architecture design [tan2019efficientnet]. Each of these methods (separately or together) may help to reduce the model complexity. In this paper we choose to focus on the design of an efficient architecture for the ASR problem.
Different types of architectures have been used for ASR, such as RNNs [graves2013speech, hannun2014deep, chan2015listen, rao2017exploring, he2019streaming], CNNs [collobert2016wav2letter, li2019jasper, kriman2020quartznet, han2020contextnet, majumdar2021citrinet] and Transformers [dong2018speech, karita2019comparative, zhang2020transformer, guo2021recent]. More recently, architectures modelling both local and global dependencies have been introduced. For instance, [han2020contextnet, majumdar2021citrinet] enhance the global context of CNNs with the squeeze-and-excitation (SE) mechanism [hu2018squeeze], while [gulati2020conformer] augments the Transformer network with convolution to model both local and global dependencies, achieving state-of-the-art results.
We aim at reducing this Conformer complexity while coping with a strict constraint: limiting the training resource budget to a maximum of 4 Nvidia RTX 2080 Ti GPUs without harming the recognition performance. Recent work done in ASR to reduce the computation cost of CNNs for faster training and inference [han2020contextnet] shows the interest of applying a progressive downsampling from the bottom to the top of the model. We propose to introduce this progressive downsampling and dimension scaling to the Conformer. Following the same pattern proposed in [han2020contextnet], we progressively reduce the length of the encoded sequence by a factor of 8. We also study the benefit of using multi-head attention as a global downsampling operation, instead of a convolution downsampling.
Yet, one main drawback of applying a progressive subsampling approach to an attention-based architecture is the introduction of a computation asymmetry into the network. As attention complexity is quadratic in the sequence length, earlier attention layers require far more computation than later layers, resulting in a time bottleneck. The same problem is found in vision, where recent works [bello2019attention, ramachandran2019stand, srinivas2021bottleneck] have proposed to replace or augment convolution with self-attention in the ResNet family backbone [he2016deep]. The adopted solution is then to restrict attention to later layers with the smallest spatial dimensions, thereby meeting computation and memory constraints. Yet, in the specific case of the Conformer, this may mean a performance degradation. A solution would be to build an efficient self-attention mechanism (its original form has a quadratic time complexity). Indeed, [li2021efficient] shows the benefits brought by efficient attention [shen2021efficient], while [wang2021efficient] proposes a prob-sparse mechanism to decide whether the attention operation should be computed. These approaches deal well with the problem of handling longer sequences, but may bring only marginal improvements for small sequences [wang2021efficient] (Figure 3). Another alternative to regular attention is local attention [parmar2018image, ramachandran2019stand], which is inspired by CNNs and restricts the attended positions to a local neighbourhood around the query position. In this work, we take inspiration from all these approaches and propose a sequence-length agnostic attention mechanism that we call grouped attention. Grouped attention reduces attention complexity from $O(n^2 d)$ to $O(n^2 d / g)$ by grouping neighbouring time elements of the sequence along the feature dimension before applying scaled dot-product attention.
Therefore, this paper proposes an Efficient Conformer which combines both Progressive Subsampling and Grouped Attention. We apply a stronger grouped multi-head self-attention to the early attention layers in the first encoder stage, where the sequence is the longest and therefore the complexity the highest. We show that this greatly reduces the computation asymmetry and thus the computation time.
Finally, the original Conformer was initially trained with an RNN-T criterion, while works using the ESPnet toolkit [li2021efficient, shen2021efficient] propose a training based on a combined CTC and Attention loss. Recent works have shown that fully convolutional models can also reach strong performance using a single CTC loss [majumdar2021citrinet], substantially reducing decoding time. We propose to further investigate these benefits in a resource-constrained environment, comparing our encoder trained with CTC and with RNN-T.
This work brings four main contributions: (a) the introduction of Progressive Downsampling to the Conformer encoder, leading to a more efficient architecture achieving better recognition performance with fewer multiply-adds, (b) a novel attention mechanism that we call Grouped Attention, allowing us to further reduce training and decoding time of our Efficient Conformer model while maintaining similar recognition performance, (c) a comparative study of the benefits brought by training the Conformer with the original RNN-T approach versus the CTC one, under a restricted training budget, and (d) a small Efficient Conformer with competitive recognition performance.
We propose two main strategies to reduce the Conformer complexity. Our first strategy is the introduction of progressive downsampling to the Conformer architecture, allowing us to reach better recognition performance and faster decoding. The second strategy aims at increasing the efficiency of earlier self-attention layers, using grouped and local attention to balance the model's overall complexity without hurting accuracy. We experiment with CTC [graves2006connectionist] and RNN-T [graves2012sequence] losses, comparing the impact of the proposed set of methods for both criteria.
2.1 RNN-T and CTC Criteria
RNN-T extends CTC by defining a distribution over output sequences of all lengths, and by jointly modelling both input-output and output-output dependencies. The audio encoder (or transcription network) is combined with a label decoder (or prediction network) and a joint network [graves2013speech]. The joint network combines the audio encoder and label decoder outputs using a feed-forward neural network with a softmax output layer over the vocabulary. For CTC, the encoder is simply augmented with a final softmax layer that directly converts the encoder outputs to probabilities.
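To make the RNN-T combination concrete, here is a minimal NumPy sketch of a joint network. The additive tanh joint and all dimensions below are illustrative assumptions for exposition, not the exact configuration used in this work:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_network(enc, dec, W_enc, W_dec, W_out):
    # Project encoder frames (T, d_enc) and decoder states (U, d_dec)
    # to a common joint dimension, combine every (t, u) pair additively,
    # and map to token probabilities over a vocabulary of size V.
    f = enc @ W_enc                              # (T, d_joint)
    g = dec @ W_dec                              # (U, d_joint)
    h = np.tanh(f[:, None, :] + g[None, :, :])   # (T, U, d_joint)
    return softmax(h @ W_out)                    # (T, U, V)
```

The resulting (T, U, V) lattice is what the transducer loss marginalizes over all alignments.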
2.2 Progressive Downsampling
Inspired by recent works done in ASR to reduce the computation cost of CNNs for faster training and inference with progressive downsampling [han2020contextnet, majumdar2021citrinet], we experiment with introducing progressive downsampling to the Conformer encoder. Our Efficient Conformer encoder, illustrated in Figure 1, first downsamples audio features with a convolution stem of stride 2. The resulting features are fed to three encoder stages, where each stage comprises a number of conformer blocks [gulati2020conformer] of the same feature dimension. A conformer block is composed of a multi-head self-attention module and a convolution module sandwiched between two feed-forward networks. Each block is followed by a post layer normalization. Sequence downsampling is performed in the last block of the first and second encoder stages. We replace the original convolution module with a convolution downsampling module, illustrated in Figure 2. We also experiment with attention downsampling using a strided attention in the multi-head self-attention module, as shown in Figure 4. This results in an 8× progressive downsampling along the time dimension. The encoded sequence is progressively projected to wider feature dimensions such that the complexity of the hidden layers stays the same for each encoder stage. This is achieved in the convolution module of every downsampling block.
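The shape bookkeeping of this progressive downsampling can be sketched as follows. The stage feature dimensions below are illustrative placeholders, not necessarily the values used in our models:

```python
import math

def encoder_shapes(n_frames, stage_dims=(120, 168, 240)):
    """Trace (sequence length, feature dim) through the encoder:
    a stride-2 convolution stem followed by three stages, with stride-2
    downsampling at the end of stages 1 and 2 (8x overall reduction).
    `stage_dims` widens as the sequence shortens, keeping per-stage
    hidden-layer complexity roughly constant."""
    shapes = []
    length = math.ceil(n_frames / 2)   # convolution stem, stride 2
    for i, d in enumerate(stage_dims):
        shapes.append((length, d))
        if i < 2:                      # downsampling block ends stages 1 and 2
            length = math.ceil(length / 2)
    return shapes
```

For a 1000-frame input this yields stages of lengths 500, 250 and 125, i.e. the final stage is 8 times shorter than the input.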
2.3 Towards an efficient Self-Attention
Relative Multi-Head Self-Attention Self-attention is used to introduce global dependencies into the network by computing dot-products between each pair of elements of the hidden sequence. In the case of multi-head self-attention (MHSA) [vaswani2017attention], a scaled dot-product attention is performed individually for a number of heads $h$ on a hidden sequence $X \in \mathbb{R}^{n \times d}$ as:

$$\mathrm{MHSA}(X) = \mathrm{Concat}(head_1, \ldots, head_h) W^O, \qquad head_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$
where $Q_i = X W_i^Q$, $K_i = X W_i^K$ and $V_i = X W_i^V$ are query, key and value linear projections with parameter matrices $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$, and $W^O \in \mathbb{R}^{h d_k \times d}$ is the output linear projection matrix. As in [gulati2020conformer], we use multi-head self-attention with relative sinusoidal positional encodings, allowing the model to generalize better to different input lengths. We adapt the original relative positional encodings of Transformer-XL [dai2019transformer] to full context using a sinusoidal matrix $E$ with positions ranging from $-(n-1)$ to $n-1$. The output of a head becomes:

$$head_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top} + S^{rel}_i}{\sqrt{d_k}}\right) V_i$$
where $S^{rel}_i$ is a relative position score matrix that satisfies $S^{rel}_i[a, b] = Q_i[a] \cdot E_i[b - a]$ with relative positional embedding $E_i$. This condition is achieved by reindexing the relative logits $Q_i E_i^{\top}$, moving them to their correct absolute positions. The memory-efficient relative-to-absolute position indexing algorithm for unmasked sequences is described in [bello2019attention] (Appendix A.3).
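This reindexing can be reproduced with the pad-and-reshape trick of [bello2019attention]; a single-matrix NumPy sketch:

```python
import numpy as np

def rel_to_abs(x):
    """Convert relative logits of shape (L, 2L-1) to absolute logits (L, L).
    Column k of the input holds relative position k - (L - 1), so the output
    satisfies out[i, j] == x[i, j - i + L - 1]."""
    L = x.shape[0]
    x = np.pad(x, ((0, 0), (0, 1)))           # append one zero column -> (L, 2L)
    flat = np.pad(x.reshape(-1), (0, L - 1))  # flatten, pad L-1 zeros
    # Reshaping to (L+1, 2L-1) shifts each row by one position; slicing
    # keeps the first L rows and the last L columns = absolute logits.
    return flat.reshape(L + 1, 2 * L - 1)[:L, L - 1:]
```

The padding zeros never land inside the kept slice, so no masking is needed for unmasked (full-context) sequences.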
Strided Multi-Head Self-Attention Downsampling is generally performed using strided convolution or pooling operations, which perform local downsampling by processing nearby elements of a hidden sequence. In this work, we experiment with the use of MHSA as a global downsampling operation. This is achieved by striding the attention query along the temporal dimension, resulting in a subsampled query $Q \in \mathbb{R}^{(n/s) \times d_k}$. This produces strided attention maps, as shown in Figure 4, where subsampled query positions can attend to the entire sequence context to perform downsampling. Strided MHSA has an $O(n^2 d / s)$ complexity, where $s$ defines the stride applied to the query projection. Progressive attention downsampling is performed by replacing regular MHSA layers with strided MHSA layers in each downsampling block. Figure 4 illustrates the multi-head self-attention downsampling module.
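A minimal single-head sketch (without relative positional encodings, which are omitted here for brevity) makes the strided query explicit:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def strided_attention(x, Wq, Wk, Wv, stride=2):
    """Single-head strided self-attention used for downsampling: only every
    `stride`-th query position is kept, but each kept query still attends
    to the full-length keys/values (global downsampling)."""
    q = (x @ Wq)[::stride]                 # (n/stride, d_k)
    k, v = x @ Wk, x @ Wv                  # (n, d_k)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v             # (n/stride, d_k)
```

Unlike a strided convolution, every output position here sees the whole sequence, which is what makes this a global downsampling operation.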
Grouped Multi-Head Self-Attention While a similar complexity per hidden layer can be obtained for the different encoder stages by varying the block feature dimension of each stage, attention complexity is quadratic in the sequence length, which introduces a computation asymmetry into the network: earlier attention layers require far more multiply-adds than later layers. We propose to solve this problem by defining a novel attention mechanism, which we call grouped attention (Figure 5). Grouped attention reduces attention complexity from $O(n^2 d)$ to $O(n^2 d / g)$ by grouping nearby time elements along the feature dimension before applying scaled dot-product attention. Attention queries, keys, values and relative positional embeddings are reshaped from length $n$ and width $d_k$ to length $n_g$ and width $d_g$, where $n_g = n / g$ and $d_g = d_k \cdot g$. The output of a head becomes:

$$head_i = \mathrm{softmax}\left(\frac{Q_g K_g^{\top} + S^{rel}_g}{\sqrt{d_g}}\right) V_g$$
The concatenated grouped attention output is reshaped back to $\mathbb{R}^{n \times d}$ before the output projection layer. Grouped multi-head self-attention is motivated by the fact that nearby elements are expected to encode similar features, and therefore a low-resolution attention pattern can approximate regular dense attention. We apply grouped multi-head self-attention starting from the earlier attention layers in the first stage, where the encoded sequence is the longest, before experimenting with the second and third stages.
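The grouping is a pure reshape around a standard scaled dot-product; a single-head sketch (relative positional scores omitted, sequence length assumed divisible by the group size):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_attention(x, Wq, Wk, Wv, group=3):
    """Single-head grouped attention: fold `group` neighbouring time steps
    into the feature dimension ((n, d) -> (n/g, d*g)) before the scaled
    dot-product, reducing attention cost from O(n^2 d) to O(n^2 d / g),
    then unfold the output back to (n, d)."""
    n, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    qg = q.reshape(n // group, d * group)
    kg = k.reshape(n // group, d * group)
    vg = v.reshape(n // group, d * group)
    scores = qg @ kg.T / np.sqrt(d * group)
    out = softmax(scores) @ vg
    return out.reshape(n, d)   # ungroup before the output projection
```

With `group=1` this reduces exactly to regular scaled dot-product attention, which is a useful sanity check.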
Local Multi-Head Self-Attention Introduced in [parmar2018image, ramachandran2019stand], local attention restricts the attended positions to a local neighborhood around the query position. This is achieved by defining an attention window of size $w$ and segmenting the hidden sequence into non-overlapping blocks of size $w$. Regular MHSA is then performed in parallel for each block, where all queries attend to the same content matrix comprising all block positions.
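A single-head sketch of this block-local scheme (sequence length assumed divisible by the window size):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(x, Wq, Wk, Wv, window=4):
    """Block-local attention: the sequence is cut into non-overlapping
    blocks of `window` positions and regular scaled dot-product attention
    is run independently inside each block."""
    n, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.empty_like(v)
    for s in range(0, n, window):
        blk = slice(s, s + window)
        scores = q[blk] @ k[blk].T / np.sqrt(d)
        out[blk] = softmax(scores) @ v[blk]
    return out
```

When the window covers the whole sequence, this coincides with regular dense attention; shrinking the window trades global context for speed.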
3.1 Data and Training Setup
Data We train and evaluate our models on the LibriSpeech [panayotov2015librispeech] dataset. LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech with corresponding text transcripts. An additional 800-million-token text-only corpus is provided for language model (LM) training. We use input spectrograms of 80-dimensional mel-scale log filter banks, computed over windows of 20ms strided by 10ms. SpecAugment [park2020specaugment] is applied during training to prevent overfitting, with two frequency masks of mask size parameter $F = 27$ and ten time masks with adaptive size $p_S = 0.05$. We only use five time masks for CTC experiments, as our models failed to converge with ten masks.
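A minimal sketch of this masking policy (the default parameters below follow common SpecAugment settings and should be treated as placeholders rather than our exact configuration):

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, F=27, n_time_masks=10, p_s=0.05, rng=None):
    """Zero out random frequency bands and random time spans of a
    (n_mels, n_frames) spectrogram. The maximum time-mask width adapts
    to the utterance length via the ratio p_s."""
    rng = rng or np.random.default_rng(0)
    spec = spec.copy()                   # do not modify the input in place
    n_mels, n_frames = spec.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, F + 1)       # mask width in mel bins
        f0 = rng.integers(0, n_mels - f + 1)
        spec[f0:f0 + f, :] = 0.0
    T = max(1, int(p_s * n_frames))      # adaptive maximum time-mask width
    for _ in range(n_time_masks):
        t = rng.integers(0, T + 1)
        t0 = rng.integers(0, n_frames - t + 1)
        spec[:, t0:t0 + t] = 0.0
    return spec
```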
Training Setup We experiment with RNN-Transducer and CTC models of 10M and 13M parameters respectively. Table 1 describes our model hyper-parameters. Transducer models use a single LSTM layer decoder. The decoder and joint network dimensions are set to 320 in every experiment. A byte-pair encoding tokenizer is built from the LibriSpeech transcripts using SentencePiece [kudo2018sentencepiece], following previous works [gulati2020conformer, majumdar2021citrinet]. Models are implemented in PyTorch [paszke2019pytorch].
We train CTC models for 450 epochs with a global batch size of 256 on 4 GPUs, using a batch size of 32 per GPU with 2 accumulated steps. Transducer models are trained for 250 epochs using batch sizes of 16 per GPU and 4 accumulated steps. We use the Adam optimizer [kingma2014adam] with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$ and a Transformer learning rate schedule [vaswani2017attention] with 10k warmup steps and a peak learning rate scaled by $d^{-0.5}$ (with different scaling constants for CTC and RNN-T), where $d$ is the encoder output dimension. Gaussian weight noise [jim1996analysis] ($\mu = 0$, $\sigma = 0.075$) is added to the transducer decoder during training for regularization, starting at 20k steps and re-sampling the noise at every training step. We also add an L2 regularization with a small weight to all the trainable weights of the model.
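The Transformer schedule referred to above can be sketched as follows (the `d_model` default is illustrative, and the global scaling constant is omitted):

```python
def transformer_lr(step, d_model=240, warmup=10_000):
    """Transformer learning-rate schedule [vaswani2017attention]:
    linear warm-up over `warmup` steps, then inverse-square-root decay,
    both scaled by d_model ** -0.5. Peaks exactly at step == warmup."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of the `min` intersect at `step == warmup`, which is where the peak learning rate is reached.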
We train a 6-gram external language model [heafield2011kenlm] on the LibriSpeech LM corpus for re-scoring during beam search.
| Model            | Conformer Transducer | Eff Conf Transducer | Conformer CTC | Eff Conf CTC |
|------------------|----------------------|---------------------|---------------|--------------|
| Num Params (M)   | 10.3                 | 10.8                | 13.0          | 13.2         |
| Conv Kernel Size | 31                   | 15,15,15            | 31            | 15,15,15     |
| Att Group Size   | -                    | 3,1,1               | -             | 3,1,1        |
3.2 Results on LibriSpeech
Table 2 compares the Word Error Rates (WER) of our experiments with state-of-the-art CTC (QuartzNet, CitriNet) and RNN-T (Conformer Transducer, ContextNet) models on the LibriSpeech test-clean and test-other sets. Our Efficient Conformer CTC model achieves competitive results of 3.57/8.99 without a language model with only 13M parameters. It even outperforms the 21M-parameter Citrinet-384 when using an external 6-gram language model during beam search, achieving WERs of 2.72/6.66. Moreover, using grouped attention with group size $g = 3$ in the first stage, we were able to recover similar results compared to the non-grouped version with 35% faster training. Our Efficient Conformer Transducer model achieves satisfying results but still lags behind the original work, which was trained with larger batches and more resources. We found RNN-T models to converge faster, with fewer epochs, than CTC models, achieving lower greedy WER. However, using an external language model during beam search allows CTC models to bridge the WER gap with RNN-T, which is in line with what was observed in [majumdar2021citrinet].
| Model Architecture | Model Type | LM     | test WER    | Params (M) |
|--------------------|------------|--------|-------------|------------|
| w/o Grouped Att    |            | 6-gram | 2.79 / 6.65 |            |
| w/o Grouped Att    |            | 6-gram | 2.79 / 7.03 |            |
3.3 Ablation Studies
We propose a detailed ablation study to better understand the improvements (in terms of complexity reduction and WER) brought by the different methods composing the Efficient Conformer. We report the number of operations, measured in multiply-adds (MAdds), needed by the encoder to process a ten-second audio clip. The Inverse Real Time Factor (Inv RTF) is measured on the LibriSpeech dev-clean set by decoding with a batch size of 1 on a single Intel Core i9-9940X 3.3GHz CPU thread. We also report experiment training times on 4 Nvidia RTX 2080 Ti GPUs.
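For clarity, the Inv RTF metric can be computed as the seconds of audio decoded per second of wall-clock compute (higher means faster than real time); a minimal sketch where `decode_fn` stands in for the model's batch-1 decoder:

```python
import time

def inverse_rtf(decode_fn, audio_batches, audio_seconds):
    """Inverse Real Time Factor: total audio duration divided by the
    wall-clock time spent decoding it. An Inv RTF of 40 means the model
    decodes 40 seconds of audio per second of compute."""
    start = time.perf_counter()
    for batch in audio_batches:
        decode_fn(batch)
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed
```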
Progressive Downsampling We first study the impact of using progressive downsampling with regular MHSA in every stage. Although applying MHSA on long sequences incurs a significant computation overhead in earlier layers, a progressively downsampled architecture achieves better accuracy with fewer multiply-adds as well as shorter training and decoding times for both CTC and RNN-T experiments. We observe an improvement in WER, especially on the dev-other set, as shown in Table 3. These benefits extend to self-attention models what had already been observed for fully convolutional models in ASR [han2020contextnet].
| Model Architecture | Model Type | dev clean | dev other | MAdds (B) | Inv RTF | Train Time (h) |
|--------------------|------------|-----------|-----------|-----------|---------|----------------|
| + Prog Down        | RNN-T      | 3.13      | 8.05      | 2.84      | 43.2    | 147            |
| + Prog Down        | CTC        | 3.44      | 9.12      | 3.91      | 48.8    | 191            |
Downsampling: Convolution vs. Attention We experiment with two downsampling methods: a local downsampling performed with strided depthwise convolution in the convolution module, and a global downsampling performed by strided attention. As seen in Table 4, we find attention downsampling to perform as well as convolution downsampling. Furthermore, attention downsampling slightly reduces the decoding time of our Efficient Conformer model. This shows that MHSA can successfully be applied as a global downsampling operation to reduce the encoded sequence length.
| Downsampling Method | Model Type | dev clean | dev other | MAdds (B) | Inv RTF | Train Time (h) |
|---------------------|------------|-----------|-----------|-----------|---------|----------------|
Attention Group Size To study the effect of grouped attention on model complexity and recognition performance, we gradually increase the attention group size in each encoder stage. The results in Table 5 demonstrate the effectiveness of using grouped attention in earlier attention layers to reduce model complexity and memory cost without impacting recognition performance. Introducing grouped attention in the first stage of our progressively downsampled Conformer CTC model results in a 21% inference speedup with 35% faster training. Inference can be further accelerated to a 29% speedup with 41% faster training by introducing grouped multi-head attention in every stage, at the cost of small performance losses.
| Attention Group Sizes | Model Type | dev clean | dev other | MAdds (B) | Inv RTF | Train Time (h) |
|-----------------------|------------|-----------|-----------|-----------|---------|----------------|
Local Attention We study the impact of local self-attention on recognition performance and inference time. Table 6 shows the results obtained when introducing local attention in the encoder stages. We find local attention to perform similarly to regular multi-head attention when using a sufficiently large local attention window in the first stage. However, further restricting the size of the attention window can negatively impact recognition performance. These results show the importance of a global context for better recognition performance, and may also explain why grouped attention achieves better results. Moreover, we find local attention to be slower than grouped attention at decoding time, due to the computation overhead introduced by padding sequences to partition them into non-overlapping blocks of size $w$.
| Attention Window | Model Type | dev clean | dev other | MAdds (B) | Inv RTF | Train Time (h) |
|------------------|------------|-----------|-----------|-----------|---------|----------------|
3.4 Models Complexity on Long Sequences
Figure 6 shows the impact of using progressive downsampling and attention variants on memory usage for different sequence lengths. It confirms that a progressively downsampled Conformer architecture can effectively reduce overall memory consumption when the sequence length does not grow too large. However, very long sequences can result in higher memory usage due to the quadratic cost of applying MHSA in earlier layers. This can be solved using efficient attention variants like local or grouped attention in earlier layers. Grouped multi-head attention can significantly reduce memory consumption for long sequences when applied in every stage, outperforming our Conformer model using regular MHSA.
In this paper, we proposed a set of methods to reduce the Conformer complexity, leading to a more efficient architecture design, the Efficient Conformer. We showed that progressive downsampling can effectively be introduced into convolution-augmented transformer networks, resulting in better recognition performance and faster decoding. We then solved the computation asymmetry caused by attention in earlier layers using a novel attention mechanism named grouped attention. Moreover, we successfully applied strided multi-head self-attention as a global downsampling operation, achieving similar accuracy while being faster than convolution downsampling. Finally, we demonstrated the effectiveness of our methods by conducting detailed ablation studies on the LibriSpeech dataset. Our 13M-parameter Efficient Conformer CTC model achieves competitive performance of 2.7%/6.7% on test-clean/test-other when trained on a limited computing budget of 4 GPUs, while being 29% faster than our Conformer CTC baseline at inference and 36% faster to train.
In the future, we would like to explore other forms of attention, such as efficient attention. We also plan to apply complementary complexity reduction techniques like weight pruning and quantization to further reduce inference time.
Appendix A Additional Experiments
a.1 Model Scaling
In order to study the effect of model scaling on recognition performance, we design larger Efficient Conformer CTC models of 31M and 125M parameters. Table 8 describes the architecture hyper-parameters of the Efficient Conformer Small, Medium and Large CTC variants. We similarly define Medium and Large Conformer CTC models within the same parameter range (Table 8).
| Model            | Eff Conf Small | Eff Conf Medium | Eff Conf Large |
|------------------|----------------|-----------------|----------------|
| Num Params (M)   | 13.2           | 31.5            | 125.6          |
| Conv Kernel Size | 15,15,15       | 15,15,15        | 15,15,15       |
| Att Group Size   | 3,1,1          | 3,1,1           | 3,1,1          |

| Model            | Conformer Small | Conformer Medium | Conformer Large |
|------------------|-----------------|------------------|-----------------|
| Num Params (M)   | 13.0            | 30.5             | 121.5           |
| Conv Kernel Size | 31              | 31               | 31              |
Table 9 compares the Word Error Rates obtained on the LibriSpeech dataset with recently published CTC, sequence-to-sequence (S2S) and Transducer approaches. Our Efficient Conformer CTC Large model, trained on 4 Nvidia RTX 3090 GPUs, achieves near state-of-the-art performance of 2.5%/5.8% without a language model and 2.1%/4.7% with an external n-gram language model on test-clean/test-other. We find the Small, Medium and Large Efficient Conformer CTC models to reach lower word error rates than Citrinet models when using an external 6-gram language model. However, this small gain in accuracy isn't sufficient to close the gap with SOTA Transducer approaches. We suppose that more computing resources and longer training would help to further reduce this gap and compare these approaches more equitably.
| Model Architecture | Model Type | LM | test WER | Params (M) |
|--------------------|------------|----|----------|------------|
a.2 Models Inference Time
Figure 7 shows the inference times of Conformer and Efficient Conformer CTC variants for different input lengths. The Efficient Conformer architecture greatly reduces inference time and allows us to reach better recognition performance while requiring less CPU time to process audio sequences. As shown in Table 2 and Table 9, we find Efficient Conformer CTC models to consistently outperform Conformer variants.
a.3 Multi-Head Linear Self-Attention
Linear attention, also known as efficient attention, has linear memory and computational complexity with respect to the size of the input. It does not compute a similarity between each pair of positions, but global context vectors for each feature dimension. This results in an $O(n d^2)$ computational complexity, which brings an efficiency advantage over regular dot-product attention when $n$ is much larger than $d$. We study the use of multi-head linear self-attention as proposed in [li2021efficient] for our small progressively downsampled CTC model, where the feature dimension is relatively small compared to the sequence length. Multi-head linear self-attention is defined as:

$$head_i = \sigma_{row}(Q_i)\left(\sigma_{col}(K_i)^{\top} V_i\right)$$
where $\sigma_{row}$ and $\sigma_{col}$ denote applying the softmax function along the rows and columns of a matrix, respectively. Table 10 compares the use of linear attention in the progressively downsampled Conformer encoder with regular dot-product attention. Linear attention greatly reduces inference and training time for our small model but also hurts recognition performance. A good trade-off between accuracy and model complexity can be achieved using grouped attention in earlier layers. An interesting follow-up to this work would be to experiment with multi-head linear self-attention in earlier layers, where the hidden sequence length is relatively larger than the model feature dimension, and compare it with grouped attention.
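A single-head sketch of efficient attention in the sense of [shen2021efficient], showing why the cost is $O(n d^2)$: the $d \times d$ context matrix $\sigma_{col}(K)^{\top} V$ is formed first, so no $n \times n$ attention map is ever materialized.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linear_attention(q, k, v):
    """Efficient ("linear") attention sketch: softmax over features for
    queries (rows) and over positions for keys (columns). (kn.T @ v) is a
    (d, d) global context matrix, giving O(n d^2) total cost instead of
    the O(n^2 d) of regular dot-product attention."""
    qn = softmax(q, axis=-1)    # row-wise, over the feature dimension
    kn = softmax(k, axis=0)     # column-wise, over the position dimension
    return qn @ (kn.T @ v)      # (n, d)
```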
| Attention Type | Model Type | dev clean | dev other | MAdds (B) | Inv RTF | Train Time (h) |
|----------------|------------|-----------|-----------|-----------|---------|----------------|