Lightweight and Efficient End-to-End Speech Recognition Using Low-Rank Transformer

10/30/2019 ∙ by Genta Indra Winata, et al.

High-performing deep neural networks come at the cost of computational complexity that limits their practicality for deployment on portable devices. We propose Low-Rank Transformer (LRT), a memory-efficient and fast neural architecture that significantly reduces the parameters and boosts the speed of training and inference for end-to-end speech recognition. Our approach reduces the number of parameters of the network by more than 50% and speeds up inference by around 1.26x compared to the baseline transformer model. The experiments show that LRT models generalize better and yield lower error rates on both validation and test sets compared to the uncompressed transformer model. LRT models outperform existing works on several datasets in an end-to-end setting without using any external language model or acoustic data.




1 Introduction

End-to-end automatic speech recognition (ASR) models have shown great success in replacing traditional hybrid HMM-based models by integrating acoustic, pronunciation, and language models into a single model structure. They rely only on paired acoustic and text data without additional acoustic knowledge such as phone sets and dictionaries. There are mainly two kinds of end-to-end encoder-decoder ASR architectures. The first is RNN-based sequence-to-sequence (Seq2Seq) models with attention [3, 8], which learn the alignment between audio sequences and their corresponding text. The second [4, 10] applies the transformer [17], a fully attentional feed-forward architecture, which improves on RNN-based ASR in terms of performance and training speed through its multi-head self-attention mechanism and parallel-in-time computation. However, the modeling capacity of both approaches relies on a large number of parameters. Scaling up model size increases the computational overhead, which limits their practicality for deployment on portable devices without connectivity and slows down both the training and inference processes.

Figure 1: Low-Rank Transformer Architecture.
Figure 2: Low-Rank Transformer Unit. Left: Low-Rank Multi-head Attention (LRMHA), Center: Low-Rank Position-wise Feed-forward Network (LRFFN), and Right: Linear Encoder-Decoder (LED).

We propose a novel factorized transformer-based model architecture, low-rank transformer (LRT), which reduces the number of parameters in the transformer model by replacing large high-rank matrices with low-rank matrices to eliminate the computational bottleneck. This optimizes the model in terms of space and time complexity when we choose a factorization rank that is relatively small compared to the original matrix dimensions. We design LRT by borrowing the idea of the autoencoder, which compresses a high-dimensional input into a compact vector representation and then decodes it back to a high-rank matrix in order to learn latent space representations of that matrix. This approach is considered an in-training compression method, where we compress the parameters of the model prior to the training process. Our contributions are described below.

  • We introduce a novel lightweight transformer architecture that leverages low-rank matrices and achieves state-of-the-art performance on AiShell-1 and HKUST test sets in an end-to-end setting.

  • We speed up inference by up to 1.35x on GPU and 1.23x on CPU while removing more than 50% of the baseline's parameters.

  • Interestingly, based on our experiments, LRT models generalize better and yield lower error rates on both validation and test sets compared to the uncompressed transformer model.

2 Related Work

2.1 Low-Rank Model Compression

Training end-to-end deep learning ASR models requires high computational resources and long training time to converge. [15] proposed a low-rank matrix factorization of the final weight layer, which reduced the number of parameters by up to 31% on a large-vocabulary continuous speech recognition task. [5] introduced a reinforcement learning method to compress ASR models iteratively and learn compression ranks, but it requires more than a week to train. In another line of work, [18] proposed a post-training compression method for LSTMs using non-negative matrix factorization to compress large pre-trained models. However, this technique does not speed up the training process. The aforementioned approaches reduce the number of model parameters while keeping the performance loss low. In this work, we extend the idea of an in-training compression method [9] by implementing low-rank units in the transformer model [17], which effectively shrinks the whole network and, at the same time, reduces the computational cost of training and evaluation while improving the error rate.

2.2 End-to-end Speech Recognition

Current end-to-end automatic speech recognition models mainly fall into two types: (a) CTC-based models [6, 1], and (b) Seq2Seq-based models such as LAS [3]. A combination of both is also proposed by [7]. Recent work by [4, 10] takes a different approach by utilizing the transformer block. The present study is related to these recent transformer ASR approaches, and it leverages the effectiveness of in-training low-rank compression methods, which were not considered in the aforementioned works.

AiShell-1

| Approach | Model | Params | CER |
|---|---|---|---|
| Hybrid | HMM-DNN [7] | - | 8.5% |
| End-to-end | Attention Model [12] | - | 23.2% |
| End-to-end | + RNNLM [12] | - | 22.0% |
| End-to-end | CTC [11] | 11.7M | 19.43% |
| End-to-end | Framewise-RNN [11] | 17.1M | 19.38% |
| End-to-end | ACS + RNNLM [12] | 14.6M | 18.7% |
| End-to-end | Transformer (large) | 25.1M | 13.49% |
| End-to-end | Transformer (medium) | 12.7M | 14.47% |
| End-to-end | Transformer (small) | 8.7M | 15.66% |
| End-to-end | LRT (r = 100) | 12.7M | 13.09% |
| End-to-end | LRT (r = 75) | 10.7M | 13.23% |
| End-to-end | LRT (r = 50) | 8.7M | 13.60% |

HKUST

| Approach | Model | Params | CER |
|---|---|---|---|
| Hybrid | DNN-hybrid [7] | - | 35.9% |
| Hybrid | LSTM-hybrid (with perturb.) [7] | - | 33.5% |
| Hybrid | TDNN-hybrid, lattice-free MMI (with perturb.) [7] | - | 28.2% |
| End-to-end | Attention Model [7] | - | 37.8% |
| End-to-end | CTC + LM [14] | 12.7M | 34.8% |
| End-to-end | MTL + joint dec. (one-pass) [7] | 9.6M | 33.9% |
| End-to-end | + RNNLM (joint train) [7] | 16.1M | 32.1% |
| End-to-end | Transformer (large) | 22M | 29.21% |
| End-to-end | Transformer (medium) | 11.5M | 29.73% |
| End-to-end | Transformer (small) | 7.8M | 31.30% |
| End-to-end | LRT (r = 100) | 11.5M | 28.95% |
| End-to-end | LRT (r = 75) | 9.7M | 29.08% |
| End-to-end | LRT (r = 50) | 7.8M | 30.74% |

Table 1: Results on AiShell-1 and HKUST test sets. For the end-to-end approach, we limit the evaluation to systems without any external data or perturbation to examine the effectiveness of the approach. We approximate the number of parameters based on the descriptions in the previous works.

3 Low-Rank Transformer ASR

We propose a compact and more generalized low-rank transformer unit by extending the idea of an in-training compression method [9]. In our transformer architecture, we replace the linear feed-forward unit [17] with a factorized linear unit called the linear encoder-decoder (LED) unit. Figure 1 shows the architecture of our proposed low-rank transformer, and Figure 2 shows the low-rank version of the multi-head attention and position-wise feed-forward network, including LED. The proposed end-to-end ASR model accepts a spectrogram as input and produces a sequence of characters as output, similar to [19]. It consists of a stack of encoder layers and a stack of decoder layers. We employ multi-head attention to allow the model to jointly attend to information from different representation subspaces at different positions.

3.1 Linear Encoder-Decoder (LED)

We propose to leverage encoder-decoder units in the transformer model instead of a single linear layer. The design is based on matrix factorization: we approximate the matrix $W \in \mathbb{R}^{m \times n}$ in the linear feed-forward unit using two smaller matrices, $E \in \mathbb{R}^{m \times r}$ and $D \in \mathbb{R}^{r \times n}$.

$$W \approx E \times D$$

The matrix $W$ requires $mn$ parameters and $mn$ flops, while $E$ and $D$ require $rm + rn = r(m + n)$ parameters and $r(m + n)$ flops. If we take the rank to be very low, $r \ll m, n$, the number of parameters in $E$ and $D$ is much smaller than in $W$.
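To make the saving concrete, the sketch below compares the parameter count of a dense m × n weight matrix (m·n) with that of its rank-r factorization into an m × r and an r × n matrix (r(m + n)). The function names and the 512 × 512 layer size are illustrative assumptions, not the paper's code or configuration.

```python
def full_params(m: int, n: int) -> int:
    """Parameters in a dense m x n weight matrix."""
    return m * n


def led_params(m: int, n: int, r: int) -> int:
    """Parameters in the two factors: an m x r encoder and an r x n decoder."""
    return r * (m + n)


def led_saving(m: int, n: int, r: int) -> float:
    """Fraction of parameters removed by factorizing at rank r."""
    return 1.0 - led_params(m, n, r) / full_params(m, n)


if __name__ == "__main__":
    m = n = 512  # hypothetical layer size, chosen only for illustration
    for r in (100, 75, 50):
        print(r, led_params(m, n, r), f"{led_saving(m, n, r):.1%}")
```

The factorization only reduces the count when r < mn / (m + n); for a 512 × 512 layer that threshold is 256, at which point the saving drops to zero.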

3.2 Low-Rank Multi-Head Attention (LRMHA)

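The LED is incorporated in the multi-head attention by factorizing the projection layers of keys $W^K$, values $W^V$, queries $W^Q$, and the output layer $W^O$. A residual connection from a query $Q$ to the output is added.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

$$\mathrm{head}_i = \mathrm{Attention}(Q E_i^Q D_i^Q,\; K E_i^K D_i^K,\; V E_i^V D_i^V)$$

$$f(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, E^O D^O + Q$$

where $f$ is the LRMHA function, $\mathrm{head}_i$ is the $i$-th head, $h$ is the number of heads, and the projections are the encoder and decoder parameter matrices $E_i^Q, E_i^K, E_i^V \in \mathbb{R}^{d_{model} \times d_r}$, $D_i^Q, D_i^K \in \mathbb{R}^{d_r \times d_k}$, $D_i^V \in \mathbb{R}^{d_r \times d_v}$, $E^O \in \mathbb{R}^{h d_v \times d_r}$, and $D^O \in \mathbb{R}^{d_r \times d_{model}}$. $d_k$ and $d_v$ are the dimensions of key and value, and $d_r$ denotes the rank.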
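The following is a minimal pure-Python sketch of how factorized projections could slot into multi-head attention, assuming each projection is applied as an encoder matrix followed by a decoder matrix and that the query is added back as a residual. All names (`lrmha`, `low_rank_project`, the `heads` layout) are hypothetical, and details such as biases, dropout, and normalization are left out.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d_k = len(k[0])
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def low_rank_project(x, enc, dec):
    """Apply a factorized projection: x @ enc @ dec."""
    return matmul(matmul(x, enc), dec)


def lrmha(q, k, v, heads, enc_o, dec_o):
    """Low-rank multi-head attention with a residual connection from q.

    heads is a list of dicts holding the (encoder, decoder) pair for the
    query, key, and value projection of each head (a hypothetical layout).
    """
    outputs = []
    for h in heads:
        qh = low_rank_project(q, *h["q"])
        kh = low_rank_project(k, *h["k"])
        vh = low_rank_project(v, *h["v"])
        outputs.append(attention(qh, kh, vh))
    # Concatenate the heads along the feature axis.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(q))]
    projected = low_rank_project(concat, enc_o, dec_o)
    return [[p + r for p, r in zip(prow, qrow)] for prow, qrow in zip(projected, q)]
```

Applying the encoder before the decoder keeps every intermediate at the rank dimension, which is where the reduction in multiplications comes from; multiplying the two factors together first would rebuild the full-size matrix.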

3.3 Low-Rank Position-wise Feed-Forward Network (LRFFN)

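Each encoder and decoder layer has a position-wise feed-forward network that contains two low-rank LED units with a ReLU function in between. To alleviate the vanishing gradient issue, a residual connection is added, as shown in Figure 2.

$$g(x) = \max(0,\; x E_1 D_1)\, E_2 D_2 + x$$

where $g$ is the LRFFN function, and $E_1, D_1$ and $E_2, D_2$ are the encoder and decoder matrices of the two LED units.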
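As a rough illustration, the position-wise network described above can be sketched for a single input vector as two factorized linear maps with a ReLU between them and the input added back at the end. The function and argument names are hypothetical, and anything the text does not mention (biases, normalization, dropout) is omitted as a simplifying assumption.

```python
def matvec(x, w):
    """Multiply a row vector x by a matrix w given as a list of rows."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def lrffn(x, enc1, dec1, enc2, dec2):
    """Two factorized linear units with ReLU in between, plus a residual."""
    hidden = relu(matvec(matvec(x, enc1), dec1))
    out = matvec(matvec(hidden, enc2), dec2)
    return [o + xi for o, xi in zip(out, x)]


if __name__ == "__main__":
    # Toy sizes: model dimension 2, inner dimension 3, rank 1.
    x = [1.0, -2.0]
    enc1, dec1 = [[1.0], [1.0]], [[1.0, -1.0, 2.0]]
    enc2, dec2 = [[1.0], [2.0], [3.0]], [[0.5, -1.0]]
    print(lrffn(x, enc1, dec1, enc2, dec2))  # [2.0, -4.0]
```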

3.4 Training Phase

The encoder module uses a VGG net [16] with a 6-layer CNN architecture. The VGG convolutional layers are added to learn a universal audio representation and generate the input embedding. The input of the unit is a spectrogram. The decoder receives the encoder outputs and applies multi-head attention to the decoder input. We apply a mask in the attention layer to avoid any information flow from future tokens. Then, we run a non-autoregressive step and calculate the cross-entropy loss.
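The masking step can be illustrated with a small sketch: a lower-triangular mask lets position i see only positions j ≤ i, and masked scores are set to negative infinity before the softmax so they receive zero weight. This is a generic causal-mask example under that assumption, not the authors' implementation; the function names are hypothetical.

```python
import math


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero weight to masked-out positions."""
    weights = []
    for row, allowed in zip(scores, mask):
        kept = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        peak = max(kept)  # finite, because the diagonal is always allowed
        exps = [math.exp(s - peak) for s in kept]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    scores = [[0.0, 0.0, 0.0] for _ in range(3)]
    for row in masked_softmax(scores, causal_mask(3)):
        print(row)
```

Because every target position is masked in one pass, the loss for all positions of a training sequence can be computed together rather than token by token.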

3.5 Evaluation Phase

At inference time, we decode the sequence using autoregressive beam search, selecting the best sub-sequence scored by the softmax probability of the characters. We define $P(Y)$ as the probability of the sentence. A word count is added to avoid generating very short sentences. $P(Y)$ is calculated as follows.

$$P(Y) = \alpha P_{trans}(Y \mid X) + \gamma \sqrt{wc(Y)}$$

where $\alpha$ is the parameter to control the decoding probability from the decoder $P_{trans}(Y \mid X)$, and $\gamma$ is the parameter to control the effect of the word count $wc(Y)$.
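A small sketch of how such a score could rank beam hypotheses is given below. It assumes the score is a weighted decoder probability plus a weighted square root of the word count; that exact form, the function names, and the weights 1.0 and 0.1 used in the example are assumptions for illustration only.

```python
import math


def sentence_score(p_trans: float, word_count: int, alpha: float, gamma: float) -> float:
    """Weighted decoder probability plus a length bonus on the word count."""
    return alpha * p_trans + gamma * math.sqrt(word_count)


def best_hypothesis(hypotheses, alpha, gamma):
    """Pick the (tokens, p_trans) pair with the highest score."""
    return max(hypotheses, key=lambda h: sentence_score(h[1], len(h[0]), alpha, gamma))


if __name__ == "__main__":
    # Hypothetical beam contents: (tokens, decoder probability).
    beam = [(["a"], 0.60), (["a", "b", "c", "d"], 0.55)]
    print(best_hypothesis(beam, alpha=1.0, gamma=0.1)[0])  # longer hypothesis wins
    print(best_hypothesis(beam, alpha=1.0, gamma=0.0)[0])  # without the bonus, the short one wins
```

The example shows the intended effect of the word-count term: with the bonus switched off, the shorter hypothesis is preferred on probability alone.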

4 Experiments

4.1 Dataset

The experiments were conducted on two benchmark datasets: AiShell-1 [2], a multi-accent Mandarin speech dataset, and HKUST [13], a conversational telephone speech corpus. The former consists of 150 hours, 10 hours, and 5 hours of training, validation, and test data, respectively. The latter has a 5-hour test set; we extract 4.2 hours from its training data as the validation set and use the remaining 152 hours as the training set.

4.2 Setup

We collect all characters in the corpus into the vocabulary, along with three special tokens: PAD, SOS, and EOS. For all models, we use two encoder layers and four decoder layers. The large transformer consists of 2048, of 512, and of 512. For the smaller transformers, we select the same number of parameters as the LRT models with $r = 100$ and $r = 50$. In the beam-search decoding, we take , , and a beam size of 8. We evaluate our model using a single GeForce GTX 1080Ti GPU and three Intel Xeon E5-2620 v4 CPU cores. We use character error rate (CER) as the evaluation metric.
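For reference, character error rate is conventionally the character-level edit distance between hypothesis and reference, divided by the reference length. The sketch below follows that standard definition; it is not taken from the authors' evaluation scripts, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("abc", "abd"))  # one substitution over three characters
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.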

5 Results and Discussions

5.1 Evaluation Performance

Table 1 shows the experimental results. LRT models improve slightly even at a compression rate of more than 50% and outperform vanilla transformers on both AiShell-1 and HKUST test sets, with 13.09% CER and 28.95% CER respectively. In addition, we further narrow the gap between the HMM-based hybrid and end-to-end approaches without leveraging any perturbation strategy or external language model. Interestingly, our LRT models achieve lower validation loss than the uncompressed Transformer (large) baseline model, which implies that our LRT models regularize better, as shown in Figure 3. The models converge faster and stop in a better local minimum compared to vanilla transformers.

5.2 Memory and Time Efficiency

As shown in Table 1, our LRT ($r = 50$) model achieves performance similar to the large transformer model despite having only about one-third of its parameters. In terms of time efficiency, our LRT models gain an inference speed-up of up to 1.35x on GPU and 1.23x on CPU, and a 1.10x training speed-up on GPU, compared to the uncompressed Transformer (large) baseline model, as shown in Table 2. We also compute the average length of the generated sequences to get a precise comparison. In general, both LRT and baseline models generate sequences of similar length, which implies that our speed-up scores are valid.

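| dataset | r | ΔCER | compress. | speed-up (GPU) | speed-up (CPU only) | mean length |
|---|---|---|---|---|---|---|
| AiShell-1 | base | 0 | 0 | 1 | 1 | 23.08 |
| AiShell-1 | 100 | 0.40% | 49.40% | 1.17x | 1.15x | 23.15 |
| AiShell-1 | 75 | 0.26% | 57.37% | 1.23x | 1.16x | 23.17 |
| AiShell-1 | 50 | -1.10% | 65.34% | 1.30x | 1.23x | 23.19 |
| HKUST | base | 0 | 0 | 1 | 1 | 22.43 |
| HKUST | 100 | 0.26% | 47.72% | 1.21x | 1.14x | 22.32 |
| HKUST | 75 | 0.13% | 55.90% | 1.26x | 1.15x | 22.15 |
| HKUST | 50 | -1.53% | 64.54% | 1.35x | 1.22x | 22.49 |

Table 2: Compression rate and inference speed-up of LRT models vs. Transformer (large). ΔCER denotes the CER improvement, and mean length is the mean length of generated sequences.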
Figure 3: Training and validation losses on AiShell-1 data.
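The compression rates in Table 2 can be cross-checked against the parameter counts in Table 1 with a short calculation, taking the compression rate to be one minus the ratio of compressed to baseline parameters. That definition, and the pairing of each rank with a parameter count (inferred by matching the two tables), are assumptions of this sketch.

```python
def compression_rate(base_params_m: float, compressed_params_m: float) -> float:
    """Fraction of parameters removed relative to the baseline."""
    return 1.0 - compressed_params_m / base_params_m


# Parameter counts in millions, copied from the AiShell-1 rows of Table 1.
baseline = 25.1
lrt_params = {100: 12.7, 75: 10.7, 50: 8.7}

for rank, params in lrt_params.items():
    print(f"r={rank}: {compression_rate(baseline, params):.2%}")
# r=100: 49.40%
# r=75: 57.37%
# r=50: 65.34%
```

These match the AiShell-1 rows of Table 2. Since the published parameter counts are rounded to one decimal place, rates recomputed this way may differ from the table in the last digit.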

6 Conclusion

We propose Low-Rank Transformer (LRT), a memory-efficient and fast neural architecture for end-to-end speech recognition that compresses the network parameters, shortens training time, and speeds up inference by up to 1.26x on GPU and 1.16x on CPU. It even improves performance after removing more than 50% of the parameters of the baseline transformer model. Our approach generalizes better than uncompressed vanilla transformers and achieves state-of-the-art performance on AiShell-1 and HKUST datasets in an end-to-end setting without using additional external data.


References

  • [1] K. Audhkhasi, B. Kingsbury, B. Ramabhadran, G. Saon, and M. Picheny (2018) Building competitive direct acoustics-to-word models for english conversational speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4759–4763.
  • [2] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng (2017) Aishell-1: an open-source mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1–5.
  • [3] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964.
  • [4] L. Dong, S. Xu, and B. Xu (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888.
  • [5] Ł. Dudziak, M. Abdelfattah, R. Vipperla, S. Laskaridis, and N. Lane (2019) ShrinkML: end-to-end asr model compression using reinforcement learning. In INTERSPEECH.
  • [6] A. Graves and N. Jaitly (2014) Towards end-to-end speech recognition with recurrent neural networks. In International conference on machine learning, pp. 1764–1772.
  • [7] T. Hori, S. Watanabe, Y. Zhang, and W. Chan (2017) Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm. Proc. Interspeech 2017, pp. 949–953.
  • [8] S. Kim, T. Hori, and S. Watanabe (2017) Joint ctc-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4835–4839.
  • [9] O. Kuchaiev and B. Ginsburg (2017) Factorization tricks for lstm networks. ICLR Workshop.
  • [10] J. Li, X. Wang, Y. Li, et al. (2019) The speechtransformer for large-scale mandarin chinese speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7095–7099.
  • [11] M. Li, Y. Cao, W. Zhou, and M. Liu (2019) Framewise supervised training towards end-to-end speech recognition models: first results. Proc. Interspeech 2019, pp. 1641–1645.
  • [12] M. Li, M. Liu, and H. Masanori (2019) End-to-end speech recognition with adaptive computation steps. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6246–6250.
  • [13] Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff (2006) Hkust/mts: a very large scale mandarin telephone speech corpus. In International Symposium on Chinese Spoken Language Processing, pp. 724–735.
  • [14] Y. Miao, M. Gowayyed, X. Na, T. Ko, F. Metze, and A. Waibel (2016) An empirical exploration of ctc acoustic models. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2623–2627.
  • [15] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran (2013) Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6655–6659.
  • [16] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
  • [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
  • [18] G. I. Winata, A. Madotto, J. Shin, E. J. Barezi, and P. Fung (2019) On the effectiveness of low-rank matrix factorization for lstm model compression. In Proceedings of the 33rd Pacific Asia Conference on Language, Information and Computation, Hakodate, Japan.
  • [19] G. I. Winata, A. Madotto, C. Wu, and P. Fung (2019) Code-switched language models using neural based synthetic data from parallel sentences. In Proceedings of the 23rd Conference on Computational Natural Language Learning, Hong Kong.