End-to-end automatic speech recognition (ASR) models have shown great success in replacing traditional hybrid HMM-based models by integrating the acoustic, pronunciation, and language models into a single model structure. They rely only on paired acoustic and text data, without additional acoustic knowledge such as phone sets and dictionaries. There are mainly two kinds of end-to-end encoder-decoder ASR architectures. The first is RNN-based sequence-to-sequence (Seq2Seq) models with attention [3, 8], which learn the alignment between an audio sequence and its corresponding text. The second [4, 10] applies the transformer, a fully-attentional feed-forward architecture that improves on RNN-based ASR in both performance and training speed through multi-head self-attention and parallel-in-time computation. However, the modeling capacity of both approaches relies on a large number of parameters. Scaling up the model size increases the computational overhead, which limits their practicality for deployment on portable devices without connectivity and slows down both training and inference.
We propose a novel factorized transformer-based model architecture, the low-rank transformer (LRT), which reduces the number of parameters in the transformer model by replacing large high-rank matrices with low-rank matrices, eliminating the computational bottleneck. This reduces the model's space and time complexity whenever the factorization rank is chosen to be much smaller than the original matrix dimensions. We design LRT by borrowing the idea of the autoencoder, which compresses a high-dimensional input into a compact vector representation and then decodes it back to a high-rank matrix, thereby learning a latent representation of that matrix. This approach is an in-training compression method: we compress the parameters of the model before the training process. Our contributions are described below.
We introduce a novel lightweight transformer architecture leveraging low-rank matrices that achieves state-of-the-art performance on the AiShell-1 and HKUST test sets in an end-to-end setting.
We reduce the inference time by up to a 1.35x speed-up on GPU and a 1.23x speed-up on CPU while shrinking the parameter count by more than 50% relative to the baseline.
Interestingly, based on our experiments, LRT models generalize better and yield lower error rates on both the validation and test sets compared to the uncompressed transformer model.
2 Related Work
2.1 Low-Rank Model Compression
Training end-to-end deep learning ASR models requires high computational resources and a long training time to converge. A low-rank matrix factorization of the final weight layer was proposed that reduced the parameters of a large-vocabulary continuous speech recognition model by up to 31%. A reinforcement learning method was later introduced to compress ASR models iteratively and learn the compression ranks, but it requires more than a week to train. In another line of work, a post-training compression method for LSTMs using non-negative matrix factorization was proposed to compress large pre-trained models; however, this technique does not speed up the training process. The aforementioned approaches reduce the number of model parameters while keeping the performance loss low. In this work, we extend the idea of in-training compression by implementing low-rank units in the transformer model, which effectively shrinks the whole network and, at the same time, reduces the computational cost of training and evaluation, with improvements in the error rate.
2.2 End-to-end Speech Recognition
Current end-to-end automatic speech recognition models mainly fall into two types: (a) CTC-based models [6, 1], and (b) Seq2Seq-based models such as LAS. A combination of both models has also been proposed. Recent work [4, 10] employs a different approach by utilizing the transformer block. The present study builds on these transformer ASR approaches and leverages the effectiveness of in-training low-rank compression, which was not considered in the aforementioned works.
3 Low-Rank Transformer ASR
We propose a compact and more generalizable low-rank transformer unit by extending the idea of an in-training compression method. In our transformer architecture, we replace the linear feed-forward unit with a factorized linear unit called the linear encoder-decoder (LED) unit. Figure 1 shows the architecture of our proposed low-rank transformer, and Figure 2 shows the low-rank versions of the multi-head attention and the position-wise feed-forward network, including the LED. The proposed end-to-end ASR model accepts a spectrogram as input and produces a sequence of characters as output. It consists of a stack of encoder layers and a stack of decoder layers. We employ multi-head attention to allow the model to jointly attend to information from different representation subspaces at different positions.
3.1 Linear Encoder-Decoder (LED)
We propose to leverage encoder-decoder units in the transformer model instead of a single linear layer. The design is based on matrix factorization: the weight matrix $W \in \mathbb{R}^{m \times n}$ of the linear feed-forward unit is approximated by the product of two smaller matrices $E \in \mathbb{R}^{m \times r}$ and $D \in \mathbb{R}^{r \times n}$, so that $W \approx ED$.
The matrix $W$ requires $mn$ parameters and $O(mn)$ flops, while $E$ and $D$ together require $r(m+n)$ parameters and $O(r(m+n))$ flops. If we take the rank to be very low, $r \ll \min(m, n)$, the number of parameters in $E$ and $D$ is much smaller than in $W$.
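As a rough illustration of these parameter savings, the counts for a dense layer versus its LED factorization can be computed directly. The dimensions below are illustrative transformer-style sizes, not the paper's exact configuration:

```python
# Parameter counts for a dense weight W (m x n) versus its low-rank
# factorization E (m x r) @ D (r x n).

def dense_params(m, n):
    return m * n

def led_params(m, n, r):
    return m * r + r * n

m, n = 512, 2048   # e.g. model dimension x inner feed-forward dimension
r = 100            # factorization rank, r << min(m, n)

print(dense_params(m, n))   # -> 1048576
print(led_params(m, n, r))  # -> 256000, roughly 4x fewer parameters
```

The savings grow as the rank shrinks relative to the matrix dimensions, which is why the choice of rank trades model capacity against compression.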
3.2 Low-Rank Multi-Head Attention (LRMHA)
The LED is incorporated into the multi-head attention by factorizing the projection layers of the keys $K$, values $V$, queries $Q$, and the output layer. A residual connection from the query $Q$ to the output is added:
$$\mathrm{LRMHA}(Q, K, V) = Q + \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, E^O D^O,$$
$$\mathrm{head}_i = \mathrm{Attention}(Q E_i^Q D_i^Q,\; K E_i^K D_i^K,\; V E_i^V D_i^V),$$
where $\mathrm{LRMHA}$ is the low-rank multi-head attention function, $\mathrm{head}_i$ is the $i$-th head of $h$ heads, and the projections are parameter matrices $E_i^Q, E_i^K, E_i^V \in \mathbb{R}^{d_{model} \times r}$, $D_i^Q, D_i^K \in \mathbb{R}^{r \times d_k}$, $D_i^V \in \mathbb{R}^{r \times d_v}$, $E^O \in \mathbb{R}^{h d_v \times r}$, and $D^O \in \mathbb{R}^{r \times d_{model}}$. Here $d_k$ and $d_v$ are the dimensions of the key and value, and $r$ denotes the rank.
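A minimal NumPy sketch of this low-rank multi-head attention follows. The weights are random placeholders and the shapes are illustrative assumptions, not the paper's trained parameters or exact dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over (batch, time, dim) arrays
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)
    return softmax(scores) @ v

def lrmha(Q, K, V, heads, d_model, r, rng):
    d_k = d_model // heads
    head_outputs = []
    for _ in range(heads):
        # each projection W is replaced by a factorized pair E @ D
        EQ, DQ = rng.standard_normal((d_model, r)), rng.standard_normal((r, d_k))
        EK, DK = rng.standard_normal((d_model, r)), rng.standard_normal((r, d_k))
        EV, DV = rng.standard_normal((d_model, r)), rng.standard_normal((r, d_k))
        head_outputs.append(attention(Q @ EQ @ DQ, K @ EK @ DK, V @ EV @ DV))
    concat = np.concatenate(head_outputs, axis=-1)
    # factorized output projection, then the residual from the query
    EO, DO = rng.standard_normal((d_model, r)), rng.standard_normal((r, d_model))
    return Q + concat @ EO @ DO

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 10, 64))  # (batch, time, d_model)
y = lrmha(x, x, x, heads=4, d_model=64, r=8, rng=rng)
print(y.shape)  # -> (1, 10, 64)
```

Each per-head projection costs $r(d_{model} + d_k)$ parameters instead of $d_{model} \cdot d_k$, so the savings compound across heads and layers.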
3.3 Low-Rank Position-wise Feed-Forward Network (LRFFN)
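Assuming the LRFFN mirrors the standard transformer position-wise feed-forward network, $\mathrm{FFN}(x) = \max(0, xW_1)W_2$, with each weight matrix replaced by an LED factorization and the usual residual connection, a minimal sketch (with illustrative dimensions and biases omitted for brevity) is:

```python
import numpy as np

def lrffn(x, E1, D1, E2, D2):
    # factorized expansion (d_model -> d_inner) followed by ReLU
    hidden = np.maximum(0.0, x @ E1 @ D1)
    # factorized projection back to d_model, plus residual connection
    return x + hidden @ E2 @ D2

rng = np.random.default_rng(0)
d_model, d_inner, r = 64, 256, 8
x = rng.standard_normal((10, d_model))
E1, D1 = rng.standard_normal((d_model, r)), rng.standard_normal((r, d_inner))
E2, D2 = rng.standard_normal((d_inner, r)), rng.standard_normal((r, d_model))
print(lrffn(x, E1, D1, E2, D2).shape)  # -> (10, 64)
```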
3.4 Training Phase
The encoder module uses a VGG net with a 6-layer CNN architecture. The convolutional layers are added to learn a universal audio representation and generate the input embedding. The input to this unit is a spectrogram. The decoder receives the encoder outputs and applies multi-head attention to the decoder input. We apply a mask to the attention layer to avoid any information flow from future tokens. Then, we run a non-autoregressive step and calculate the cross-entropy loss.
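The mask that blocks information flow from future tokens can be built as a standard upper-triangular causal mask; this is a generic sketch of the technique, not the paper's exact implementation:

```python
import numpy as np

def causal_mask(t):
    # True marks positions the attention must not attend to
    # (i.e., tokens strictly in the future of each query position).
    return np.triu(np.ones((t, t), dtype=bool), k=1)

print(causal_mask(4).astype(int))
# -> [[0 1 1 1]
#     [0 0 1 1]
#     [0 0 0 1]
#     [0 0 0 0]]
```

In practice the masked positions are set to a large negative value before the softmax, so they receive (near-)zero attention weight.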
3.5 Evaluation Phase
At inference time, we decode the sequence using autoregressive beam search, selecting the best sub-sequence scored by the softmax probability of the characters. We define $P(Y|X)$ as the probability of the sentence. A word count is added to avoid generating very short sentences. $P(Y|X)$ is calculated as follows:
$$P(Y|X) = \alpha \log P_{dec}(Y|X) + \gamma \sqrt{wc(Y)},$$
where $\alpha$ is the parameter to control the decoding probability from the decoder $P_{dec}$, and $\gamma$ is the parameter to control the effect of the word count $wc(Y)$.
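The scoring above can be sketched as a small helper; the square-root form of the word-count bonus is an assumption from the reconstruction above, and the $\alpha$, $\gamma$ values here are placeholders rather than the paper's tuned settings:

```python
import math

def hypothesis_score(log_p_dec, word_count, alpha=1.0, gamma=0.1):
    # alpha scales the decoder log-probability; gamma rewards longer
    # hypotheses to counteract beam search's bias toward short outputs.
    return alpha * log_p_dec + gamma * math.sqrt(word_count)

# A slightly less probable but longer hypothesis can still win the beam:
short = hypothesis_score(log_p_dec=-2.0, word_count=3)
longer = hypothesis_score(log_p_dec=-2.1, word_count=12)
```

Without the word-count term, beam search systematically prefers shorter hypotheses, since every additional token multiplies in another probability below one.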
4 Experimental Setup
The experiments were conducted on two benchmark datasets: AiShell-1, a multi-accent Mandarin speech dataset, and HKUST, a conversational telephone speech dataset. The former provides 150 hours of training, 10 hours of validation, and 5 hours of test data. The latter provides a 5-hour test set; we extract 4.2 hours from the training data as the validation set and use the remaining 152 hours as the training set.
We concatenate all characters in the corpus, including three special tokens: PAD, SOS, and EOS. For all models, we use two encoder layers and four decoder layers. The large transformer uses an inner feed-forward dimension of 2048, a model dimension of 512, and an embedding dimension of 512. For the smaller transformers, we select the same parameters as the LRT models. In the beam-search decoding, we use a beam size of 8. We evaluate our models using a single GeForce GTX 1080Ti GPU and three Intel Xeon E5-2620 v4 CPU cores. We use character error rate (CER) as the evaluation metric.
5 Results and Discussions
5.1 Evaluation Performance
Table 1 shows the experimental results. LRT models gain a slight improvement even after a compression rate of more than 50%, and outperform vanilla transformers on both the AiShell-1 and HKUST test sets with 13.09% and 28.95% CER, respectively. In addition, we further close the gap between HMM-based hybrid and end-to-end approaches without leveraging any perturbation strategy or external language model. Interestingly, our LRT models achieve a lower validation loss than the uncompressed Transformer (large) baseline, which implies that they regularize better, as shown in Figure 3. They also converge faster and stop at a better local minimum than vanilla transformers.
5.2 Memory and Time Efficiency
As shown in Table 1, our LRT model achieves performance similar to the large transformer model despite having only one-third of its parameters. In terms of time efficiency, our LRT models gain inference-time speed-ups of up to 1.35x on GPU and 1.23x on CPU, and a 1.10x training-time speed-up on GPU, compared to the uncompressed Transformer (large) baseline, as shown in Table 2. We also compute the average length of the generated sequences for a precise comparison. In general, both the LRT and baseline models generate sequences of similar length, which confirms that our speed-up scores are valid.
6 Conclusion
We propose the Low-Rank Transformer (LRT), a memory-efficient and fast neural architecture for end-to-end speech recognition that compresses the network parameters and boosts the inference speed by up to 1.26x on GPU and 1.16x on CPU, as well as the training time, while even improving performance after reducing the parameters of the baseline transformer model by more than 50%. Our approach generalizes better than uncompressed vanilla transformers and achieves state-of-the-art performance on the AiShell-1 and HKUST datasets in an end-to-end setting without using additional external data.
- (2018) Building competitive direct acoustics-to-word models for English conversational speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4759–4763. Cited by: §2.2.
- (2017) AiShell-1: an open-source Mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1–5. Cited by: §4.1.
- (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §1, §2.2.
- (2018) Speech-Transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888. Cited by: §1, §2.2.
- (2019) ShrinkML: end-to-end ASR model compression using reinforcement learning. In INTERSPEECH. Cited by: §2.1.
- (2014) Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pp. 1764–1772. Cited by: §2.2.
- (2017) Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. In Proc. Interspeech 2017, pp. 949–953. Cited by: §2.2, Table 1.
- (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835–4839. Cited by: §1.
- (2017) Factorization tricks for LSTM networks. In ICLR Workshop. Cited by: §2.1, §3.
- (2019) The SpeechTransformer for large-scale Mandarin Chinese speech recognition. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7095–7099. Cited by: §1, §2.2.
- (2019) Framewise supervised training towards end-to-end speech recognition models: first results. In Proc. Interspeech 2019, pp. 1641–1645. Cited by: Table 1.
- (2019) End-to-end speech recognition with adaptive computation steps. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6246–6250. Cited by: Table 1.
- (2006) HKUST/MTS: a very large scale Mandarin telephone speech corpus. In International Symposium on Chinese Spoken Language Processing, pp. 724–735. Cited by: §4.1.
- (2016) An empirical exploration of CTC acoustic models. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2623–2627. Cited by: Table 1.
- (2013) Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 6655–6659. Cited by: §2.1.
- (2015) Very deep convolutional networks for large-scale image recognition. In ICLR. Cited by: §3.4.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §2.1, §3.
- (2019) On the effectiveness of low-rank matrix factorization for LSTM model compression. In Proceedings of the 33rd Pacific Asia Conference on Language, Information and Computation, Hakodate, Japan. Cited by: §2.1.
- (2019) Code-switched language models using neural based synthetic data from parallel sentences. In Proceedings of the 23rd Conference on Computational Natural Language Learning, Hong Kong. Cited by: §3.