Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation

07/09/2021 ∙ by Jian Luo, et al. ∙ NetEase

Predicting altered acoustic frames is an effective self-supervised learning approach for speech representation. However, it is challenging to prevent the pretrained model from overfitting. In this paper, we propose introducing two dropout regularization methods into the pretraining of the transformer encoder: (1) attention dropout and (2) layer dropout. Both dropout methods encourage the model to utilize global speech information rather than simply copying local spectrum features when reconstructing the masked frames. We evaluated the proposed methods on phoneme classification and speaker recognition tasks. The experiments demonstrate that our dropout approaches achieve competitive results and improve classification accuracy on downstream tasks.




1 Introduction

In recent years, deep-learning models have shown remarkable success in speech tasks such as automatic speech recognition, speaker identification, and spoken language understanding [Gulati2020Conformer, Luo2021Unidirectional, Ding2020AutoSpeech, qin2021cointeractive]. Among these models, transformer-based architectures have obtained a substantial performance improvement. Despite these achievements, the collection of paired speech data remains a burden for researchers and engineers: speech data requires intensive labeling and alignment work that can only be done manually. In contrast, unpaired speech data are far more available than paired data. With hardly any data collection cost, they appear to be an appealing solution to the dilemma that supervised learning faces. Therefore, the research community is shifting its focus to self-supervised and semi-supervised learning [baskar2019semisupervised, fan2020unsupervised, Karita2018Semi, Hori2019Cycle].

Self-Supervised Learning (SSL) is an approach that learns data representations from unlabeled data and then retrains the model on labeled data [baevski2020wav2vec]. In this paper, we focus on SSL of transformer networks to extract high-level speech representations. Through SSL pretraining, the learned transformer models can be applied to downstream Speech and Language Processing (SLP) tasks. Recent works have proposed several kinds of SSL schemes. Autoregressive Predictive Coding (APC) [chung2019unsupervised] and Contrastive Predictive Coding (CPC) [Oord2018Representation] focus, respectively, on maximizing the probability of predicting future frames and on a contrastive loss that separates a negative sample set. APC and CPC are based on unidirectional RNNs, which limits speech representation learning because future frames cannot be attended to.

[Wang2020Unsupervised] proposed using bidirectional RNNs in pretraining and incorporated them into bidirectional speech recognition models. Audio Word2Vec [chung2016audio] generates vector representations of audio segments; it is trained with an RNN-based autoencoder that reconstructs the input speech audio. VQ-wav2vec [baevski2020vqwav2vec] learns discrete speech representations of audio segments using a VQ-VAE-style codebook; it takes discrete tokens as input and achieves impressive results on speech recognition tasks.

Transformer-based SSL models use a multi-layer transformer encoder to predict masked frames or spectrum bands, forcing the model to learn hidden speech features from both directions. The learned features can serve as the input of downstream SLP tasks. Masked Predictive Coding (MPC) [jiang2019improving] applied a transformer-based model to unsupervised pretraining, improving the performance of speech recognition. Mockingjay [Liu2020Mocking] introduced a BERT-style masking strategy into the pretraining of speech representations. TERA [liu2020tera] proposed a multi-target auxiliary task to pretrain the transformer encoder, handling alterations along the temporal, channel, and magnitude axes. Speech XLNet [Song2020] presented an XLNet-style pretraining scheme, introducing dynamic permutation for further exploitation of the speech. Despite their outstanding performance on downstream tasks, the above models may still suffer from overfitting or degradation: a model may merely exploit the local smoothness of speech, simply copying local spectrum features when reconstructing the masked frames. To address this issue, APC predicts frames that are several steps away [chung2019unsupervised], while MPC and TERA mask a consecutive range of speech frames instead of a single frame on the temporal axis. In this work, we propose introducing dropout regularization to alleviate this problem.
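The consecutive-span masking idea can be illustrated with a short sketch; the span length and masking probability below are illustrative placeholders, not the exact settings of MPC or TERA:

```python
import numpy as np

def mask_time_spans(features, span_len=7, mask_prob=0.15, rng=None):
    """Zero out consecutive spans of frames (a sketch of MPC/TERA-style
    time alteration). features has shape (frames, dims)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    frames, _ = features.shape
    masked = features.copy()
    mask = np.zeros(frames, dtype=bool)
    # Walk over non-overlapping candidate spans and mask each with mask_prob.
    for start in range(0, frames, span_len):
        if rng.random() < mask_prob:
            mask[start:start + span_len] = True
    masked[mask] = 0.0                   # altered frames the model must predict
    return masked, mask
```

Because whole spans are zeroed rather than single frames, the model cannot recover the target by interpolating its immediate neighbors.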

Dropout [hinton2012improving, Srivastava2014Dropout] is a popular regularization method for fully-connected neural networks. It forces the network to discard some neural units randomly and to learn new patterns using the remaining connections and parameters [chen2020selectscale]. Dropout has also been applied in many deep learning architectures, including RNNs [Watt2018Dropout] and CNNs [cai2020effective], and is widely used to prevent models from overfitting. In natural language processing, [zhang2020token] proposed using dropout in machine translation, training the model to generate the same output from less input. In [wu2020generating], dropout is used to generate multiple translations that share similar meanings. In terms of model robustness, dropout can be applied to ensure the safety of systems and make them robust to perturbations [goodfellow2015explaining, jayashankar2020detecting].
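For reference, common (inverted) dropout, the baseline that the methods below depart from, can be sketched as:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Standard inverted dropout: zero each unit with probability p and
    rescale survivors by 1/(1-p) so the expected activation is unchanged."""
    if not training or p == 0.0:
        return x                          # identity at inference time
    rng = rng if rng is not None else np.random.default_rng(0)
    keep = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * keep / (1.0 - p)
```

The key property is that units are dropped uniformly at random, without regard to how discriminative they are; the methods discussed next instead target the most active regions.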

Different from common dropout, which randomly cuts off the co-adaptation between units, some works instead use dropout to discard the most discriminative activation regions. In weakly supervised object detection, dropout is added to exploit less significant patterns and avoid overfitting the ground-truth bounding box [gao2020cascade, Junsuk2019Attention]. In text classification, DropAttention [Zehui2019DropAttention] regularizes the attention weights in the transformer, helping the model utilize more contextualized word vectors. In this paper, we propose introducing attention dropout and layer dropout into the SSL of transformer encoder speech representations. Both dropout methods prevent the transformer encoder from degrading to a trivial solution that copies local features. Attention dropout and layer dropout encourage the model to use features that are far away from the current predicted frame, hence capturing global speech information.

2 Proposed Method

In this paper, we propose using dropout regularization for the SSL of speech. The architecture is based on TERA (Transformer Encoder Representations from Alteration) [liu2020tera], which pretrains the model with three auxiliary objectives: (1) time alteration, (2) channel alteration, and (3) magnitude alteration. We introduce two dropout methods into the transformer encoder: (1) attention dropout, which reweights the attention weight matrix of the self-attention mechanism, and (2) layer dropout, which masks the most active elements of the feed-forward layer. The transformer encoder network is pretrained by reconstructing the altered acoustic features. Afterwards, the hidden states of the last layer are extracted and incorporated into downstream tasks.

Figure 1: The Transformer Encoder Architecture of Self-Supervised Learning with dropout regularization

2.1 Architecture

Transformer [Vaswani2017Attention] has shown impressive performance in the SSL of speech representation. Our model architecture uses a multi-layer transformer encoder with a multi-head self-attention mechanism, illustrated in Figure 1. The input audio sequence X = (x_1, ..., x_T) ∈ R^{T×d} is fed into the network, where T is the number of audio frames and d is the dimension of the mel-scale features. Each encoder layer has two sub-layers: (1) a multi-head self-attention network and (2) a feed-forward layer. We apply attention dropout in the self-attention network and layer dropout in the feed-forward layer. The total number of encoder layers is denoted N, and the output of layer l is H^l. The last layer is projected to the reconstructed features X̂ = (x̂_1, ..., x̂_T). The model is pretrained by directly optimizing the L1 loss between the input sequence X and output sequence X̂:

    L = Σ_{t=1}^{T} ||x̂_t − x_t||_1

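A minimal sketch of this reconstruction objective follows; restricting the loss to altered frames via an optional mask is our illustrative addition, as the paper states the L1 loss over the sequence:

```python
import numpy as np

def l1_reconstruction_loss(target, reconstructed, frame_mask=None):
    """Mean absolute error between the input spectrogram X and the
    reconstruction X_hat, optionally averaged over masked frames only."""
    diff = np.abs(target - reconstructed)   # (frames, dims)
    if frame_mask is not None:
        diff = diff[frame_mask]             # keep only altered frames
    return diff.mean()
```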
2.2 Attention Dropout

For each transformer encoder layer l, the input feature sequence is H^{l-1} ∈ R^{T×d_model}, where d_model is the dimension of the self-attention mechanism. For each head h, the multi-head self-attention mechanism projects H^{l-1} into three matrices: the query matrix Q_h, the key matrix K_h, and the value matrix V_h:

    Q_h = H^{l-1} W_h^Q,  K_h = H^{l-1} W_h^K,  V_h = H^{l-1} W_h^V
    A_h = softmax(Q_h K_h^T / √d_k)

Here W_h^Q, W_h^K, and W_h^V are the learnable parameters of head h, and d_k is the per-head dimension. A_h is denoted the attention weight matrix. As illustrated in Algorithm 1, the attention dropout method reweights the matrix A_h with probability p_att. First, the algorithm obtains the maximum value m of the weight matrix A_h by a global max-pooling operation. Then, attention dropout is applied to each element a_ij as follows:

    a_ij ← 0  if a_ij > γ_att · m,  else a_ij

We set a threshold ratio γ_att ∈ (0, 1]. The dropout erases the most attentive locations, preventing the model from overfitting to local features. After element-wise dropout, each row vector of A_h is renormalized so that the attention weights again sum to 1. Through attention renormalization, the multi-head attention weights are distributed over the whole spatial dimension, encouraging the model to utilize global information.

1:  Input: A: attention weight matrix of head h; p_att: probability of conducting attention dropout; γ_att: threshold ratio of attention dropout
2:  pick a random float number r ∈ [0, 1)
3:  if r ≥ p_att then
4:     return A
5:  end if
6:  m ← max(A)   {global max-pooling}
7:  for each weight element a_ij in A do
8:     apply the attention dropout: a_ij ← 0 if a_ij > γ_att · m
9:  end for
10:  for all row vectors a_i of A do
11:     normalized rescale: a_i ← a_i / Σ_j a_ij
12:  end for
13:  return A
Algorithm 1 Attention Dropout Algorithm
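Algorithm 1 can be sketched in NumPy as follows; function and variable names are ours, and this is an illustration of the described procedure rather than the authors' implementation:

```python
import numpy as np

def attention_dropout(attn, p=0.9, ratio=0.8, rng=None):
    """With probability p, zero attention weights above ratio * global_max,
    then renormalize each row so its weights sum to 1 again."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if rng.random() >= p:
        return attn                       # skip dropout this step
    m = attn.max()                        # global max-pooling over the matrix
    dropped = np.where(attn > ratio * m, 0.0, attn)
    row_sums = dropped.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0       # guard against fully-dropped rows
    return dropped / row_sums             # row-wise renormalization
```

Note that, unlike standard dropout, the erased positions are chosen by magnitude (the most attended locations), not at random; only whether dropout fires at all is stochastic.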
Pretraining Method | PhonemeLinear | Phoneme1Hidden | SpeakerFrame | SpeakerUtterance
3L-TERA-base [liu2020tera] | 70.65 (65.1) | 78.51 (77.3) | 99.52 (98.9) | 99.47 (99.2)
3L-Encoder + Attention Dropout 0.9 | 70.56 | 78.69 | 99.27 | 99.26
3L-Encoder + Attention Dropout 0.8 | 70.91 | 78.79 | 99.51 | 99.35
3L-Encoder + Attention Dropout 0.6 | 70.85 | 78.57 | 99.45 | 99.30
3L-Encoder + Attention Dropout 0.4 | 69.08 | 77.27 | 99.44 | 99.36
3L-Encoder + Layer Dropout 0.9 | 70.45 | 78.54 | 99.24 | 99.23
3L-Encoder + Layer Dropout 0.8 | 71.11 | 78.72 | 99.51 | 99.33
3L-Encoder + Layer Dropout 0.6 | 71.19 | 78.68 | 99.46 | 99.42
3L-Encoder + Layer Dropout 0.4 | 69.07 | 76.90 | 99.21 | 98.94
3L-Encoder + Attention & Layer Dropout 0.8/0.6 | 70.71 | 78.64 | 99.37 | 99.35
3L-Encoder + Attention & Layer Dropout 0.9/0.9 | 71.12 | 78.95 | 99.51 | 99.31
3L-Encoder + Attention then Layer Dropout 0.8/0.6 | 70.88 | 78.76 | 99.52 | 99.33
3L-Encoder + Attention then Layer Dropout 0.9/0.9 | 71.64 | 79.51 | 99.50 | 99.40
3L-Encoder + Layer then Attention Dropout 0.8/0.6 | 71.22 | 78.66 | 99.45 | 99.44
3L-Encoder + Layer then Attention Dropout 0.9/0.9 | 70.44 | 78.54 | 99.24 | 99.22
Table 1: Different Configurations of the Threshold Ratio; Phoneme and Speaker Classification Results on LibriSpeech, Accuracy (%)
Pretraining Method | PhonemeLinear | Phoneme1Hidden | SpeakerFrame | SpeakerUtterance
CPC [Oord2018Representation] | 64.6 | 72.5 | 97.4 | –
Modified CPC [Riviere2020Unsupervised] | 68.9 | – | – | –
AALBERT [Chi2020Audio] | – | – | 98.79 | 99.12
Mockingjay [Liu2020Mocking] | 64.3 | 76.8 | 68.4 | 96.1
TERA [liu2020tera] | 70.65 (65.1) | 78.51 (77.3) | 99.52 (98.9) | 99.47 (99.2)
3L-Encoder + Attention then Layer Dropout (ours) | 71.64 | 79.51 | 99.50 | 99.40
Table 2: Comparison with Other SSL Methods; Phoneme and Speaker Classification Results on LibriSpeech, Accuracy (%)

2.3 Layer Dropout

For each transformer encoder layer l, the layer dropout method is applied to the output H^l with probability p_layer. Similar to the attention dropout calculation, we first obtain the maximum absolute value m of the feature map by spatial max-pooling:

    m = max_{i,j} |h_ij|

Then, we design a binary mask map M to indicate whether each location is dropped. Each element M_ij is calculated as:

    M_ij = 0  if |h_ij| > γ_layer · m,  else 1

Here γ_layer is the threshold ratio, and |·| is the absolute value function, meaning that both large positive and large negative values are discarded. Finally, the binary mask map M is multiplied with the original map H^l to obtain the final feature map:

    H^l ← M ⊙ H^l

where ⊙ denotes element-wise matrix multiplication.
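A minimal NumPy sketch of this layer dropout step (names are illustrative, and the probability gate mirrors the one in attention dropout):

```python
import numpy as np

def layer_dropout(h, p=0.9, ratio=0.6, rng=None):
    """With probability p, zero feed-forward activations whose absolute
    value exceeds ratio * max|H|, via an element-wise binary mask."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if rng.random() >= p:
        return h                          # skip dropout this step
    m = np.abs(h).max()                   # spatial max-pooling on |H|
    mask = (np.abs(h) <= ratio * m).astype(h.dtype)
    return h * mask                       # element-wise multiplication
```

Using the absolute value means both strongly positive and strongly negative activations are discarded, matching the mask definition above.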

3 Experimental Setup

In this work, we focus on the representation extraction approach for downstream speech tasks. Following previous works, the experiments proceed in two stages: (1) pretrain the transformer encoder network by SSL, reconstructing the altered acoustic features, and (2) extract the representations from the last layer of the model and compare performance on downstream tasks. In this section, we explore the experimental results of different dropout configurations of the threshold ratio, and we also visualize how dropout regularization changes the attention weight matrix and the layer feature map.

3.1 Dataset

For most experiments, we used the publicly available LibriSpeech corpus [Panayotov2015Librispeech]. The train-clean-100 subset (100 hours) of LibriSpeech was used for pretraining. Following previous SSL works, we used four downstream tasks for evaluation:

  • PhonemeLinear: phoneme classification with linear network

  • Phoneme1Hidden: phoneme classification with one hidden layer and linear network

  • SpeakerFrame: frame-wise speaker recognition

  • SpeakerUtterance: utterance-wise speaker recognition

For the phoneme classification tasks, we used the aligned phoneme labels and train/test split provided by CPC [Oord2018Representation] and Modified CPC [Riviere2020Unsupervised]. A linear classifier and a classifier with a single hidden layer are used to measure the linear separability of phonemes. For the speaker recognition tasks, we also used the same train/test split as CPC. Two task types are provided: predicting the speaker for each input frame, and predicting the speaker identity conditioned on the averaged vector of each utterance.
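The difference between the two speaker tasks can be sketched as follows; this is a linear classifier on frozen representations, with the weight matrix standing in for trained parameters:

```python
import numpy as np

def frame_predictions(frame_feats, weights):
    """Frame-wise speaker recognition: classify every frame independently.
    frame_feats: (frames, dim); weights: (dim, n_speakers)."""
    logits = frame_feats @ weights        # (frames, n_speakers)
    return logits.argmax(axis=1)          # one speaker label per frame

def utterance_prediction(frame_feats, weights):
    """Utterance-wise: mean-pool the frames into one vector, then classify."""
    pooled = frame_feats.mean(axis=0)     # (dim,)
    return int((pooled @ weights).argmax())
```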

(a) Original Weight Matrix
(b) After Attention Dropout
(c) Difference between (a) and (b)
Figure 2: Visualization of Attention Weight Matrix from Attention Dropout
(a) Original Feature Map
(b) After Layer Dropout
(c) Difference between (a) and (b)
Figure 3: Visualization of Layer Feature Map from Layer Dropout

3.2 Configuration

We conducted all the experiments with the s3prl toolkit [S3PRL] on the PyTorch framework. The parameters of self-supervised pretraining and the downstream tasks are listed in Table 3.


Self-Supervised Pretraining
input mel-scale features
transformer encoder layers
attention hidden size
feed-forward dimension
attention dropout probability
layer dropout probability
batch size
training steps
Phoneme Classification Task
phoneme classes
one hidden layer dimension
batch size
training steps
Speaker Recognition Task
speaker classes
batch size
training steps
Table 3: Parameters of Pretraining and Downstream Tasks

The overall architecture is a three-layer transformer encoder network. The input audio is encoded as mel-scale features. Each transformer encoder layer contains two parts: (1) a self-attention sub-layer with multi-head attention and attention dropout, and (2) a feed-forward sub-layer with layer dropout. The models were pretrained with the total number of steps and batch size given in Table 3.

For the phoneme classification task, we adopt the common setup for the number of phoneme classes and the hidden-layer dimension. For the speaker recognition task, the classes correspond to the speakers in the dataset. All downstream tasks were trained for the number of steps listed in Table 3, and the parameters of the pretrained models are frozen while the downstream tasks are trained.

3.3 Results

We conducted experiments on different configurations of the attention threshold ratio and the layer threshold ratio. As shown in Table 1, the three-layer transformer encoder achieves its best single-dropout performance with a threshold ratio of 0.8 for attention dropout and 0.6 for layer dropout. The threshold cannot be set too small; otherwise too many high-activation regions are discarded and performance degrades. In addition, the closer the threshold is to 1, the closer the results are to 3L-TERA-base [liu2020tera]. For a fair comparison, all of the experiments in Table 1 use the same configuration as Table 3, and the TERA numbers are quoted from [liu2020tera].

We also investigated three fusion strategies for the two dropout regularizations: (1) Attention & Layer Dropout, applying both dropouts together, each with half the dropout probability; (2) Attention then Layer Dropout, pretraining for a number of steps with attention dropout and then for further steps with layer dropout; and (3) Layer then Attention Dropout, the reverse order. In our experiments, Attention then Layer Dropout with threshold ratios of 0.9 works better than the other two fusion strategies and outperforms attention or layer dropout alone, as presented in Table 1.

As depicted in Table 2, we compared our approach with other SSL methods, choosing published results that use the same training set, train-clean-100 of LibriSpeech. Our best model (Attention then Layer Dropout) achieves a 1.4% relative improvement in accuracy on the PhonemeLinear task and 1.3% on the Phoneme1Hidden task over the original TERA-base model. Although the speaker recognition results are very close to each other, our approach outperforms most of the listed methods on the downstream tasks.

3.4 Visualization

In Figures 2 and 3, we visualize the attention weight matrix under attention dropout and the layer feature map under layer dropout. After attention dropout, the most nearby attention weights of each location in Figure 2(a) are discarded (see Figure 2(b)), and the remaining attention weights are redistributed to more distant locations (see the yellow lines in Figure 2(c)). By contrast, layer dropout functions more like conventional regularization: it suppresses the largest negative activations (see the yellow regions in Figure 3(c)) and discards the largest positive values (see the blue regions in Figure 3(c)). As a result, the feature map (Figure 3(b)) becomes smoother than the original (Figure 3(a)). Overall, the visualization demonstrates that with dropout regularization, the model suppresses overemphasized local features and captures more global information.

4 Conclusions

In this paper, we proposed using attention dropout and layer dropout in the SSL of speech representation. Attention dropout reweights the multi-head attention matrix of each transformer encoder layer, while layer dropout discards the most discriminative activation regions found by spatial max-pooling. The experiments show that downstream phoneme classification and speaker recognition tasks obtain substantial performance improvements with attention and layer dropout. In future work, we will explore the effect of dropout on other downstream tasks such as speech recognition. We are also interested in investigating the performance of dropout regularization on SSL models beyond the transformer encoder architecture.

5 Acknowledgement

This paper is supported by the National Key Research and Development Program of China under grants No. 2018YFB0204403, No. 2017YFB1401202, and No. 2018YFB1003500. The corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd.