Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks

10/23/2019 ∙ by Xingchen Song, et al.

Self-attention networks (SAN) can benefit significantly from bi-directional representation learning through unsupervised pretraining paradigms such as BERT and XLNet. In this paper, we present an XLNet-like pretraining scheme "Speech-XLNet" for unsupervised acoustic model pretraining to learn speech representations with a SAN. The pretrained SAN is finetuned under the hybrid SAN/HMM framework. We conjecture that by shuffling the speech frame orders, the permutation in Speech-XLNet serves as a strong regularizer to encourage the SAN to make inferences by focusing on global structures through its attention weights. In addition, Speech-XLNet also allows the model to explore bi-directional contexts for effective speech representation learning. Experiments on TIMIT and WSJ demonstrate that Speech-XLNet greatly improves the SAN/HMM performance in terms of both convergence speed and recognition accuracy compared to the one trained from randomly initialized weights. Our best systems achieve a relative improvement of 11.9% and 8.3% on the TIMIT and WSJ tasks respectively. In particular, the best system achieves a phone error rate (PER) of 13.3% on TIMIT, which, to our knowledge, is the lowest PER obtained from a single system.







1 Introduction

Recently, unsupervised representation learning paradigms such as BERT [3] and XLNet [23] have been highly successful for language modeling with the self-attention network (SAN), such as the transformer [22]. More specifically, a SAN can benefit from bi-directional context representation learning through unsupervised pretraining with large-scale unlabeled data before finetuning with labeled data. For automatic speech recognition (ASR), SANs have been introduced for acoustic modeling either in an attention framework (also known as the speech transformer [4]) or with a CTC loss [6, 18].

Unsupervised speech representation learning has also been investigated by learning to predict either raw audio samples [19] or features [10, 2]. The pretrained networks are then used as feature extractors for various downstream tasks, such as speaker and phone recognition [2], emotion recognition [10] and speech recognition [19]. Recently, representation learning with a SAN was employed in [10] for speech emotion recognition, where the SAN was pretrained in an autoregressive (AR) manner to predict the next frame (Future Observation Prediction (FOP)). The main purpose of FOP is to exploit the local smoothness of the speech signal. While this objective is important for emotion recognition, FOP may not work well for sequence mapping tasks like speech recognition. Firstly, it is well known that neighboring speech frames are highly correlated; exploiting such local smoothness might already be sufficient to predict the next frame [2]. Therefore, without regularization, the pretraining may have difficulty capturing longer dependencies. Secondly, AR pretraining also suffers from the lack of bi-directional context information, which is quite important for sequence classification as well.

Given the success of XLNet for language modeling, in this paper, we present “Speech-XLNet”, an XLNet-like acoustic model pretraining scheme for SANs. Instead of using a fixed forward order as in [10], Speech-XLNet maximizes the expected log likelihood of a speech feature sequence with respect to all possible permutations of the factorization order. We conjecture that by shuffling the frame orders, the permutation serves as a strong regularizer to encourage the network to explore longer span structures through its attention weights. In addition, the permutation also allows a frame to utilize contextual information from other positions in a particular permutation order to capture bi-directional context. After pretraining, the SAN is finetuned with labeled data to predict senone targets under a SAN/HMM framework rather than to extract features.

Experimental results on TIMIT and WSJ benchmark tasks clearly demonstrate that with Speech-XLNet pretrained weights, the subsequent finetuning is much more stable and converges much faster than training from randomly initialized weights. More importantly, the finetuned SAN consistently outperforms its counterpart trained from randomly initialized weights. To the best of our knowledge, our best system achieves the lowest phone error rate of 13.3% on TIMIT test set among all published results.

2 System overview

The system overview of the proposed Speech-XLNet is shown in Fig.1. The SAN consists of a stack of self-attention blocks as in Fig.1-(b). Each block has two sub-modules: a multi-head attention layer [22] and a position-wise feed-forward layer. As in Fig.1-(c), dropout [21], residual connections [7] and layer normalization [1] are applied after both the self-attention and feed-forward layers. The permutation-based AR pretraining of the SAN is given in Fig.1-(a). Different from most previous speech representation learning approaches, we finetune the pretrained SAN directly on labelled data for speech recognition.


Figure 1: Speech-XLNet system overview. (a) XLNet-like pretraining. (b) Finetuning. (c) Self-attention block.

3 Speech-XLNet

XLNet is a generalized AR pretraining method proposed to overcome the limitations of BERT, namely pretrain-finetune discrepancy and independence assumptions, while retaining its advantage of bi-directional context learning. In this section, we present “Speech-XLNet”, a permutation-based AR acoustic model pretraining scheme adapted from XLNet.

3.1 Pretrain objective function

Instead of the density estimation in XLNet, Speech-XLNet aims to “predict” the next acoustic frame, i.e. a regression task. Specifically, let Z_T be the set of all possible permutations of a length-T frame sequence, and let z_t and z_{<t} denote the t-th element and the first t−1 elements of a permutation z ∈ Z_T. The permutation acoustic modeling objective can be expressed as:

    min_θ  E_{z∼Z_T} [ Σ_{t=1}^{T} ℒ(x_{z_t}, x̂_{z_t}) ]

where θ denotes the parameter set and x̂_{z_t} is the predicted frame given the previous frames x_{z_{<t}} of the permutation order z. For optimization, we adopted the smooth Mean Absolute Error (MAE) loss (Huber loss) as in [16] as our pretraining loss, which is a combination of the L1 and L2 losses as given below:

    ℒ(x, x̂) = 0.5 (x − x̂)²          if |x − x̂| < δ
    ℒ(x, x̂) = δ |x − x̂| − 0.5 δ²    otherwise

where δ is a scalar to balance the L1 and L2 losses. As θ is shared across all orders in Z_T, each frame can see all other frames via permutations, which encourages the model to learn bi-directional contexts and longer span structures.
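For concreteness, the Huber pretraining loss can be sketched in PyTorch (the framework used for the implementation); the tensor shapes are illustrative assumptions, and torch.nn.HuberLoss computes the same quantity:

```python
import torch

def huber_loss(pred, target, delta=1.0):
    """Smooth MAE (Huber) loss: quadratic (L2) for small errors,
    linear (L1) beyond the delta threshold."""
    err = (pred - target).abs()
    quad = 0.5 * err.pow(2)
    lin = delta * err - 0.5 * delta ** 2
    return torch.where(err < delta, quad, lin).mean()

# Illustrative shapes: (batch, frames, 40-dim log-Mel filterbanks)
pred = torch.randn(4, 10, 40)
target = torch.randn(4, 10, 40)
loss = huber_loss(pred, target, delta=1.0)
```

The same loss is available as torch.nn.HuberLoss(delta=1.0) in PyTorch, which can be used directly instead of the hand-rolled version above.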

3.2 Permutation via attention masks

Same as XLNet, the permutation is achieved via attention masks while keeping the original sequence order. Fig.2-(a) illustrates a permutation order of [3, 2, 4, 1] via masking. To reduce the optimization difficulty, we also choose to only predict a certain percentage of frames from the tail portion of the permuted sequence. Formally, we introduce a hyperparameter p = K/T to determine the percentage of frames selected for prediction, where T is the sequence length and K is the number of selected frames. The objective function thus becomes:

    min_θ  E_{z∼Z_T} [ Σ_{t=T−K+1}^{T} ℒ(x_{z_t}, x̂_{z_t}) ]
For more robust pretraining, instead of performing the permutation once during data preprocessing as in the original XLNet, we adopted a dynamic permutation strategy where we generate a new permutation order every time a speech sequence is fed to the trainer. By using a different permutation for each sequence in every epoch, we effectively increase the amount of training data, which is helpful when the pretraining data is limited.
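As a sketch of how the dynamic permutation, attention masking and tail prediction fit together, the following PyTorch function builds visibility masks for one randomly drawn factorization order. The function and variable names are our own, and the convention mask[i, j] = True meaning "frame j is visible when predicting frame i" is an illustrative assumption:

```python
import torch

def permutation_masks(seq_len, p=0.2, generator=None):
    """Visibility masks for one randomly drawn factorization order."""
    z = torch.randperm(seq_len, generator=generator)  # dynamic permutation
    # rank[i] = position of original frame i within the permuted order z
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[z] = torch.arange(seq_len)
    # Content stream: a frame sees frames at or before itself in the order.
    content_mask = rank.unsqueeze(1) >= rank.unsqueeze(0)
    # Query stream: strictly-earlier frames only (no access to own content).
    query_mask = rank.unsqueeze(1) > rank.unsqueeze(0)
    # Tail prediction: only the last K = round(p * T) frames of the
    # permuted order are kept as prediction targets.
    k = max(1, int(round(p * seq_len)))
    predict = rank >= seq_len - k
    return content_mask, query_mask, predict
```

Calling this once per sequence, every epoch, realizes the dynamic permutation strategy; with p = 0.2, roughly the last fifth of the permuted order is selected for prediction.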

Figure 2: (a) Permutation order [3, 2, 4, 1] via attention mask for content (upper) and query streams (lower). Shaded blue grid indicates that the content of the frames corresponding to the columns can be used to predict the frame corresponding to the row. (b) Two-stream attention.

3.3 Two-stream attention

The permutation in XLNet introduces ambiguity in target prediction, since the standard hidden representation does not carry the information of which position it will predict. This is addressed by target-aware representations via a two-stream attention mechanism [23], namely the content stream and the query stream, as shown in Fig.2-(b). The two streams of representations are schematically updated with a shared set of parameters θ as follows [23]:

    g_{z_t}^{(m)} = Attention(Q = g_{z_t}^{(m−1)}, KV = h_{z_{<t}}^{(m−1)}; θ)   (query stream)
    h_{z_t}^{(m)} = Attention(Q = h_{z_t}^{(m−1)}, KV = h_{z_{≤t}}^{(m−1)}; θ)   (content stream)

where z_t is the frame position in the original feature sequence corresponding to the t-th element of a permutation sequence, and g_{z_t}^{(m)} is the query stream of layer m, which uses the position z_t but has no access to the content x_{z_t}. On the other hand, the content stream h_{z_t}^{(m)} can see both the content and the position as in standard self-attention. The two attention streams are again realized via the masks in Fig.2-(a). Note that the query stream and the permutation are only used for pretraining and are discarded during finetuning.
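A minimal single-head, single-layer sketch of the two-stream update is given below, assuming a visibility-mask convention of mask[i, j] = True meaning frame j is visible to frame i. The real model uses multi-head attention with residual connections and layer normalization, which are omitted here for brevity; batch dimensions are also dropped:

```python
import math
import torch
import torch.nn as nn

class TwoStreamAttention(nn.Module):
    """Sketch of one two-stream attention layer with shared weights."""
    def __init__(self, d_model=64):
        super().__init__()
        # One shared set of projections (theta) serves both streams.
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.scale = math.sqrt(d_model)

    def _attend(self, q, h, visible):
        scores = self.wq(q) @ self.wk(h).transpose(-2, -1) / self.scale
        scores = scores.masked_fill(~visible, float('-inf'))
        probs = torch.softmax(scores, dim=-1)
        # The first frame in the order sees nothing in the query stream;
        # map its NaN softmax row to zeros instead of propagating it.
        probs = torch.nan_to_num(probs, nan=0.0)
        return probs @ self.wv(h)

    def forward(self, h, g, content_mask, query_mask):
        h_new = self._attend(h, h, content_mask)  # sees z_<=t, incl. own content
        g_new = self._attend(g, h, query_mask)    # sees z_<t only
        return h_new, g_new
```

Both streams attend over the content representations h; only the masks differ, which is exactly how the permutation is realized without reordering the input.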

4 Experiments and Analyses

4.1 Pretraining experimental setups

We conducted the pretraining using a pool of the Librispeech [12], TED-LIUM release 2 [17] and WSJ-si284 corpora. The features are 40-dimensional log-Mel filterbanks extracted using Kaldi [14] with global cepstral mean and variance normalization. No delta or frame stacking is used. Our SAN consists of a stack of six self-attention blocks. Each block has eight attention heads and the model dimension is 512. The dimension of the feed-forward layer is 2048. Dropout is set to 0.1. The tail prediction percentage p is set to 0.2. Our SAN is implemented with PyTorch [13]. Weights are initialized randomly from the Xavier normal distribution [5]. The network is optimized with Adam [9] with a weight decay of 0.01. The Huber loss δ is set to 1.0. The network was pretrained for a total of 50 epochs, with five warm-up epochs and a linear learning rate decay [3]. The pretraining was conducted using four Tesla M40 GPUs with a batch size of 6000 frames, and the model parameters were updated with a gradient accumulation of 10 batches. The pretrained SAN was finetuned with the cross-entropy loss under the hybrid SAN/HMM setup.
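The encoder configuration and learning rate schedule above can be sketched with stock PyTorch modules. The input projection, the regression head, and the peak learning rate value are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

# Configuration from the text: 6 self-attention blocks, 8 heads,
# model dimension 512, feed-forward dimension 2048, dropout 0.1.
d_model, n_heads, n_layers, d_ff, n_mels = 512, 8, 6, 2048, 40

block = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
    dropout=0.1, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=n_layers)
in_proj = nn.Linear(n_mels, d_model)   # 40-dim log-Mel frames -> model dim
out_proj = nn.Linear(d_model, n_mels)  # regression head for frame prediction

params = (list(in_proj.parameters()) + list(encoder.parameters())
          + list(out_proj.parameters()))
# The peak learning rate below is a placeholder assumption.
optimizer = torch.optim.Adam(params, lr=1e-4, weight_decay=0.01)

def lr_scale(epoch, warmup=5, total=50):
    """Linear warm-up over the first 5 epochs, then linear decay to 0."""
    if epoch < warmup:
        return (epoch + 1) / warmup
    return max(0.0, (total - epoch) / (total - warmup))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)

feats = torch.randn(2, 100, n_mels)    # (batch, frames, mel bins)
pred = out_proj(encoder(in_proj(feats)))
```

Stepping the scheduler once per epoch reproduces the warm-up-then-decay profile; the prediction head outputs one 40-dimensional filterbank vector per input frame, matching the regression objective of Section 3.1.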

4.2 Phone recognition on TIMIT

For TIMIT phone recognition, we followed the same recipe as PyTorch-Kaldi [15]. The frame-level alignments were obtained using the Kaldi s5 recipe with a triphone GMM/HMM with 1936 senone clusters. For decoding, a bigram phone language model trained on the training-data transcriptions was used. The baseline SAN was trained from randomly initialized weights with a batch size of 4000 frames. A total of 40 epochs were conducted, with four warm-up epochs. The same hyper-parameters were used for finetuning the pretrained SAN.

Figure 3: PERs(%) on TIMIT dev-set.

The PER trend on the dev set is depicted in Fig.3. We can clearly observe that the pretrained SAN converges much faster than the one with randomly initialized weights. More importantly, the pretrained SAN also consistently outperforms the baseline system.

(a) Randomly initialized weights, start of training. (b) Speech-XLNet pretrained weights, start of finetuning. (c) Randomly initialized weights, end of training. (d) Speech-XLNet pretrained weights, end of finetuning. (The remaining four heads show the same trend and are omitted.)
Figure 4: Plots of attention scores of the last self-attention block for a TIMIT sentence “FEDW0-SX364”.

We also present some PER comparisons with different hyper-parameter setups in Table 1. It is well known that SANs are very sensitive to hyper-parameters. Therefore, huge efforts have to be devoted to architecture engineering and hyper-parameter tuning. This can also be seen in Table 1 as different learning rates affect the PERs of the randomly initialized SANs significantly. On the contrary, for pretrained SANs, the performance is much more stable. This may indicate that with permutation-based pretraining, the attention weights are more focused on learning discriminative information, making the further classification tasks much easier.

Learning rate   Randomly Init.          Speech-XLNet Pretrained
                dev      test           dev      test
3e-5            16.7     18.4           11.8     13.3 (27.7%)
1e-4            14.7     16.8           11.8     13.5 (19.6%)
2e-4            14.5     16.1           11.7     13.4 (16.8%)
1e-3            13.2     15.1           12.2     13.5 (10.6%)
2e-3            diverge  diverge        13.3     14.8 (–)

Table 1: PER(%) comparison of different learning rates. Numbers inside parentheses are relative improvements over randomly initialized weights with the same training hyper-parameter setup.

To validate our conjectures, we visualized the attention scores of the last self-attention block under different configurations in Fig.4. Comparing Fig.4-(a) and Fig.4-(b), Speech-XLNet clearly learns some prior knowledge of how to represent the data, which helps the subsequent finetuning process converge faster. Furthermore, at the end of training with randomly initialized weights, the attention scores of all heads in Fig.4-(c) manifest a clear diagonal pattern, which means only limited context around the current frame is explored. On the other hand, it is interesting to note that in Fig.4-(d), the attention scores of the second head are all off-diagonal by a large margin, with fairly spread-out probability distributions. This may indicate that the attention is distributed to learn some “global” structures useful for the following classification task.

Architecture        Features            Pretrain        PER(%)
DNN-Sigmoid [11]    fMLLR               RBM-DBN         16.8
DNN-Sigmoid [11]    fMLLR               Random          16.5
LSTM [11]           fMLLR               Random          15.0
Li-GRU [15]         mfcc/fbank/fMLLR    Random          13.8
SAN (this work)     fbank               Random          15.1
SAN (this work)     fbank               Speech-XLNet    13.3

Table 2: PER comparison with previous approaches

Lastly, we give a PER comparison with other approaches in Table 2. Note that all the systems in the table use the same Kaldi recipe to generate the training alignments and decoding graphs. The most famous unsupervised pretraining scheme for ASR is the deep belief network (DBN), proposed at the early adoption of DNNs. A DBN is a generative model built with stacks of restricted Boltzmann machines (RBMs). However, by comparing the first and second rows in Table 2, similar to [8], we observe that with proper hyper-parameter settings, performance similar to RBM-DBN pretraining can be obtained from purely discriminative training with randomly initialized weights. Although the proposed Speech-XLNet follows the same pretrain-finetune training strategy, the downstream classification task clearly benefits from the permutation-based representation learning, as seen from the last two rows. To the best of our knowledge, our PER of 13.3% is the lowest among all published results on the TIMIT test set.

4.3 Word recognition on WSJ

We further evaluate Speech-XLNet on WSJ-si284 with the same pretrained SAN as for TIMIT. For WSJ-si284 finetuning, a batch size of 15000 frames is used and the targets consist of a total of 3392 senone clusters. For decoding, a pruned 4-gram language model (fgpr in the Kaldi recipe) is trained from the supplied language model training texts. The WER performance for different learning rates is shown in Table 3. Similar to the TIMIT task, finetuning with Speech-XLNet consistently outperforms the network trained from randomly initialized weights. The best system achieves a WER of 4.4% with a learning rate of 1e-4, translating to an 8.3% relative WER reduction compared to the best baseline WER of 4.8%.

Learning rate   Randomly Init.   Speech-XLNet Pretrained
2e-5            5.2              4.5 (13.5%)
3e-5            4.8              4.5 (6.3%)
6e-5            5.0              4.5 (10.0%)
1e-4            4.9              4.4 (10.2%)

Table 3: WER(%) on WSJ eval92. Numbers inside parentheses are relative improvements over randomly initialized weights with the same training hyper-parameter setup.

5 Conclusions

In this work, we present Speech-XLNet for unsupervised speech representation learning with self-attention networks (SANs). The effectiveness of the pretraining was evaluated using a hybrid SAN/HMM system with senone targets on two benchmark tasks, namely TIMIT and WSJ. Compared to the SAN trained from randomly initialized weights, finetuning from Speech-XLNet is much more stable and converges much faster. More importantly, the finetuned SAN consistently outperforms the one trained from randomly initialized weights. Our best systems achieve a relative improvement of 11.9% and 8.3% on TIMIT and WSJ respectively. Specifically, our best system achieved a PER of 13.3% on the TIMIT core test set, which is the best published performance to our knowledge. In the future, we will apply the proposed Speech-XLNet to end-to-end speech transformers on larger training sets. In addition, we believe Speech-XLNet is also very appealing for code-switching ASR [20], by pretraining on a large pool of monolingual data from all languages considered and finetuning with code-switching data.


  • [1] J. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. ArXiv abs/1607.06450. Cited by: §2.
  • [2] Y. Chung, W. Hsu, H. Tang, and J. Glass (2019) An unsupervised autoregressive model for speech representation learning. In Proc. Interspeech 2019, pp. 146–150. Cited by: §1.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §1, §4.1.
  • [4] L. Dong, S. Xu, and B. Xu (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. ICASSP 2018, pp. 5884–5888. Cited by: §1.
  • [5] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In AISTATS, Cited by: §4.1.
  • [6] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, Cited by: §1.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. CVPR 2016, pp. 770–778. Cited by: §2.
  • [8] G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. W. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29, pp. 82–97. Cited by: §4.2.
  • [9] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §4.1.
  • [10] Z. Lian, J. Tao, B. Liu, and J. Huang (2019) Unsupervised representation learning with future observation prediction for speech emotion recognition. In Proc. Interspeech 2019, pp. 3840–3844. Cited by: §1.
  • [11] J. Michalek and J. Vanek (2018) A survey of recent dnn architectures on the timit phone recognition task. ArXiv abs/1806.07974. Cited by: Table 2.
  • [12] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. ICASSP 2015, pp. 5206–5210. Cited by: §4.1.
  • [13] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §4.1.
  • [14] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. K. Goel, M. Hannemann, P. Motlícek, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý (2011) The kaldi speech recognition toolkit. Cited by: §4.1.
  • [15] M. Ravanelli, T. Parcollet, and Y. Bengio (2018) The pytorch-kaldi speech recognition toolkit. ICASSP 2019, pp. 6465–6469. Cited by: §4.2, Table 2.
  • [16] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, pp. 1137–1149. Cited by: §3.1.
  • [17] A. Rousseau, P. Deléglise, and Y. Estève (2012) TED-lium: an automatic speech recognition dedicated corpus. In LREC, Cited by: §4.1.
  • [18] J. Salazar, K. Kirchhoff, and Z. Huang (2019) Self-attention networks for connectionist temporal classification in speech recognition. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7115–7119. Cited by: §1.
  • [19] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) Wav2vec: unsupervised pre-training for speech recognition. In Proc. Interspeech 2019, pp. 3465–3469. Cited by: §1.
  • [20] C. Shan, C. Weng, G. Wang, D. Su, M. X. Luo, D. Yu, and L. Xie (2019) Investigating end-to-end speech recognition for mandarin-english code-switching. ICASSP 2019, pp. 6056–6060. Cited by: §5.
  • [21] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, pp. 1929–1958. Cited by: §2.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §1, §2.
  • [23] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. ArXiv abs/1906.08237. Cited by: §1, §3.3.