. Among these models, transformer-based architectures have achieved substantial performance improvements. Despite these achievements, the collection of paired speech data remains a major obstacle for researchers and engineers: paired data require labor-intensive labeling and alignment that can only be done manually. In contrast, unpaired speech data are far more abundant than paired data and come with hardly any collection cost, which makes them an appealing way out of the dilemma that supervised learning faces. Therefore, the research community is shifting its focus to self-supervised or semi-supervised learning [baskar2019semisupervised, fan2020unsupervised, Karita2018Semi, Hori2019Cycle].
Self-Supervised Learning (SSL) is an approach that learns data representations from unlabeled data and then retrains the model on labeled data [baevski2020wav2vec]. In this paper, we focus on the SSL of transformer networks to extract high-level speech representations. Through SSL pretraining, the learned transformer models can be applied to downstream Speech and Language Processing (SLP) tasks. Recent works have proposed several kinds of SSL schemes. Autoregressive Predictive Coding (APC) [chung2019unsupervised] and Contrastive Predictive Coding (CPC) [Oord2018Representation] focus, respectively, on maximizing the probability of predicting future frames and on a contrastive loss that separates a set of negative samples. APC and CPC are based on unidirectional RNNs, which limits speech representation learning because future frames cannot be attended. [Wang2020Unsupervised] proposed to use bidirectional RNNs in pretraining and incorporated them into bidirectional speech recognition models. Audio Word2Vec [chung2016audio] learns representations of audio segments, while VQ-wav2vec [baevski2020vqwav2vec] learns discrete speech representations using a VQ-VAE-style codebook; it takes the discrete tokens as input and achieves impressive results on speech recognition tasks.
Transformer-based SSL models use a multi-layer transformer encoder to predict masked frames or spectrum bands, forcing the model to learn hidden speech features from both directions. The learned features can serve as the input of downstream SLP tasks. Masked Predictive Coding (MPC) [jiang2019improving] applied a transformer-based model to unsupervised pretraining, improving the performance of speech recognition. Mockingjay [Liu2020Mocking] introduced a BERT-style masking strategy into the pretraining of speech representations. TERA [liu2020tera] proposed a multi-target auxiliary task to pretrain the transformer encoder, handling alterations along the temporal, channel, and magnitude axes. Speech XLNet [Song2020] presented an XLNet-style pretraining scheme, introducing dynamic permutation for further exploitation of the speech. Despite their outstanding performance on downstream tasks, the above-mentioned models may still suffer from overfitting or degradation: a model may simply exploit the local smoothness of speech and copy nearby spectrum features when reconstructing the masked frames. To mitigate this issue, APC predicts frames that are several steps away [chung2019unsupervised], while MPC and TERA mask a consecutive range of speech frames instead of a single frame on the temporal axis. In this work, we propose to introduce dropout regularization to alleviate this problem.
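To make the consecutive-masking idea concrete, here is a minimal NumPy sketch (the function name, the span length, and zero-filling the masked frames are our own illustrative assumptions, not details from the cited papers):

```python
import numpy as np

def mask_consecutive(frames, span=7, rng=None):
    """Zero out one random consecutive span of frames (time alteration).

    frames: (T, d) matrix of acoustic features.
    Returns (masked, mask): the altered features and a boolean vector
    marking the positions the model must reconstruct.
    """
    rng = rng or np.random.default_rng(0)
    T = frames.shape[0]
    start = int(rng.integers(0, max(T - span, 1)))
    masked = frames.copy()
    masked[start:start + span] = 0.0          # erase a consecutive range, not a single frame
    mask = np.zeros(T, dtype=bool)
    mask[start:start + span] = True
    return masked, mask
```

Masking a whole span forces the model to rely on context beyond its immediate neighbors, since local smoothness alone cannot fill a multi-frame gap.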
Dropout [hinton2012improving, Srivastava2014Dropout]
is a popular regularization method for fully-connected neural networks. It forces the network to discard some neural units randomly and to learn new patterns using the remaining connections and parameters [chen2020selectscale]. Dropout has also been applied in many deep learning architectures, including RNNs [Watt2018Dropout] and CNNs [cai2020effective]. Dropout has been used in many tasks to prevent models from overfitting. In natural language processing, [zhang2020token] proposed to use dropout in machine translation, making the model generate the same output with less input. In [wu2020generating], dropout is used to generate multiple translations that share similar meanings. In terms of model robustness, dropout can be applied to ensure the safety of systems and make them robust to perturbations [goodfellow2015explaining, jayashankar2020detecting].
Different from common dropout, which randomly cuts off the co-adaptation between units, some works instead use dropout to discard the most discriminative activation regions. In weakly supervised object detection, dropout is added to exploit less significant patterns and avoid overfitting the ground-truth bounding box [gao2020cascade, Junsuk2019Attention]. In text classification, DropAttention [Zehui2019DropAttention] regularizes the attention weights in the transformer, helping the model utilize more contextualized word vectors. In this paper, we propose to introduce attention dropout and layer dropout into the SSL of transformer-encoder speech representations. Both dropout methods prevent the transformer encoder from degrading to a trivial solution that copies local features; they encourage the model to use features far away from the currently predicted frame, hence capturing global speech information.
2 Proposed Method
In this paper, we propose to use dropout regularization for the SSL of speech. The architecture is based on TERA (Transformer Encoder Representations from Alteration) [liu2020tera], which pretrains the model with three auxiliary objectives: (1) time alteration, (2) channel alteration, (3) magnitude alteration. We introduce two dropout methods into the transformer encoder: (1) attention dropout, which reweights the attention weight matrix of the self-attention mechanism, and (2) layer dropout, which masks the most active elements of the feed-forward layer. The transformer encoder network is pretrained by reconstructing the altered acoustic features. After that, the hidden states of the last layer are extracted and incorporated into downstream tasks.
2.1 Model Architecture

Transformer [Vaswani2017Attention] has shown impressive performance in the SSL of speech representation. Our model architecture uses a multi-layer transformer encoder with a multi-head self-attention mechanism, illustrated in Figure 1. The input audio sequence $X \in \mathbb{R}^{T \times d}$ is fed into the network, where $T$ is the number of audio frames and $d$ is the dimension of the mel-scale features. Each encoder layer has two sub-layers: (1) a multi-head self-attention network, (2) a feed-forward layer. We apply the attention dropout in the self-attention network, and the layer dropout in the feed-forward layer. The total number of encoder layers is denoted as $N$, and the output of layer $l$ is $H^l$. The last layer is projected to the reconstructed features $\hat{X}$. The model is pretrained by directly optimizing the L1 loss between the input sequence $X$ and the output sequence $\hat{X}$:

$\mathcal{L} = \lVert \hat{X} - X \rVert_1$
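As a minimal sketch of this objective (the array shapes and function name are our own illustrative choices), the pretraining loss reduces to a mean absolute error between the reconstruction and the clean input:

```python
import numpy as np

def l1_reconstruction_loss(x_hat, x):
    """L1 reconstruction loss between the projected encoder output
    x_hat and the unaltered input features x, both of shape (T, d)."""
    return float(np.mean(np.abs(x_hat - x)))
```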
2.2 Attention Dropout
For each transformer encoder layer $l$, the input feature sequence is $H^{l-1} \in \mathbb{R}^{T \times d_{model}}$, where $d_{model}$ is the dimension of the self-attention mechanism. The multi-head self-attention mechanism projects $H^{l-1}$ into three matrices: the query matrix $Q_i = H^{l-1}W_i^Q$, the key matrix $K_i = H^{l-1}W_i^K$, and the value matrix $V_i = H^{l-1}W_i^V$, in which $W_i^Q$, $W_i^K$, $W_i^V$ are learnable parameters of head $i$, and the attention weight matrix is $A_i = \mathrm{softmax}(Q_i K_i^\top / \sqrt{d_k})$, with $d_k$ the per-head dimension. As illustrated in Algorithm 1, the attention dropout method reweights the attention weight matrix with probability $p_{att}$. First, the algorithm obtains the maximum value $a_{max}$ of the weight matrix by a global max-pooling operation. Then, attention dropout is applied to each element as follows:

$A'_{ij} = 0$ if $A_{ij} \ge \gamma \cdot a_{max}$, and $A'_{ij} = A_{ij}$ otherwise,

where $\gamma$ is a threshold ratio we set. This erases the most attentive locations, preventing the model from overfitting local features. After the element-wise dropout, each row vector is renormalized, to ensure that the sum of attention weights remains $1$. Through attention renormalization, the multi-head attention weights are redistributed over the whole spatial dimension, encouraging the model to utilize global information.
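The following NumPy sketch illustrates this procedure (the function signature, the probability gate, and the guard against fully erased rows are our own assumptions; the paper's Algorithm 1 remains the authoritative description):

```python
import numpy as np

def attention_dropout(attn, gamma=0.8, p=0.5, rng=None):
    """Erase the most attentive weights, then renormalize each row.

    attn: (T, T) row-stochastic attention weight matrix.
    gamma: threshold ratio; weights >= gamma * max(attn) are erased.
    p: probability of applying the dropout at all.
    """
    rng = rng or np.random.default_rng(0)
    if rng.random() > p:                      # applied only with probability p
        return attn
    a_max = attn.max()                        # global max-pooling over the matrix
    kept = np.where(attn >= gamma * a_max, 0.0, attn)   # erase high-attention locations
    row_sums = kept.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0             # guard: a fully erased row stays all-zero
    return kept / row_sums                    # renormalize so each row sums to 1 again
```

Because the erased mass is redistributed by renormalization, the surviving (more distant) attention weights grow, which is exactly the push toward global context described above.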
Table 1: Downstream task accuracy (%) under different threshold ratios.

| Method | Attention threshold $\gamma$ | Layer threshold $\lambda$ | PhonemeLinear | Phoneme1Hidden | SpeakerFrame | SpeakerUtterance |
|---|---|---|---|---|---|---|
| 3L-TERA-base [liu2020tera] | – | – | 70.65 (65.1) | 78.51 (77.3) | 99.52 (98.9) | 99.47 (99.2) |
| 3L-Encoder + Attention Dropout | 0.9 | – | 70.56 | 78.69 | 99.27 | 99.26 |
| 3L-Encoder + Attention Dropout | 0.8 | – | 70.91 | 78.79 | 99.51 | 99.35 |
| 3L-Encoder + Attention Dropout | 0.6 | – | 70.85 | 78.57 | 99.45 | 99.30 |
| 3L-Encoder + Attention Dropout | 0.4 | – | 69.08 | 77.27 | 99.44 | 99.36 |
| 3L-Encoder + Layer Drop | – | 0.9 | 70.45 | 78.54 | 99.24 | 99.23 |
| 3L-Encoder + Layer Drop | – | 0.8 | 71.11 | 78.72 | 99.51 | 99.33 |
| 3L-Encoder + Layer Drop | – | 0.6 | 71.19 | 78.68 | 99.46 | 99.42 |
| 3L-Encoder + Layer Drop | – | 0.4 | 69.07 | 76.90 | 99.21 | 98.94 |
| 3L-Encoder + Attention & Layer Dropout | 0.8 | 0.6 | 70.71 | 78.64 | 99.37 | 99.35 |
| 3L-Encoder + Attention & Layer Dropout | 0.9 | 0.9 | 71.12 | 78.95 | 99.51 | 99.31 |
| 3L-Encoder + Attention then Layer Dropout | 0.8 | 0.6 | 70.88 | 78.76 | 99.52 | 99.33 |
| 3L-Encoder + Attention then Layer Dropout | 0.9 | 0.9 | 71.64 | 79.51 | 99.50 | 99.40 |
| 3L-Encoder + Layer then Attention Dropout | 0.8 | 0.6 | 71.22 | 78.66 | 99.45 | 99.44 |
| 3L-Encoder + Layer then Attention Dropout | 0.9 | 0.9 | 70.44 | 78.54 | 99.24 | 99.22 |
Table 2: Comparison with other SSL methods pretrained on train-clean-100 (accuracy, %).

| Method | PhonemeLinear | Phoneme1Hidden | SpeakerFrame | SpeakerUtterance |
|---|---|---|---|---|
| Modified CPC [Riviere2020Unsupervised] | 68.9 | – | – | – |
| TERA [liu2020tera] | 70.65 (65.1) | 78.51 (77.3) | 99.52 (98.9) | 99.47 (99.2) |
| 3L-Encoder + Attention then Layer Dropout (ours) | 71.64 | 79.51 | 99.50 | 99.40 |
2.3 Layer Dropout
For each transformer encoder layer $l$, the layer dropout method is applied on the feed-forward output $F^l \in \mathbb{R}^{T \times d_{model}}$ with probability $p_{layer}$. Similar to the attention dropout calculation, we first obtain the maximum absolute value of the feature map by spatial max-pooling:

$f_{max} = \max_{i,j} |F^l_{ij}|$

Then, we design a binary mask map $M$ to indicate whether each location is dropped or not. Each element of $M$ is calculated as:

$M_{ij} = 0$ if $|F^l_{ij}| \ge \lambda \cdot f_{max}$, and $M_{ij} = 1$ otherwise,

in which $\lambda$ is the threshold ratio and $|\cdot|$ is the absolute value function, meaning that both large positive and large negative values will be discarded. Finally, the binary mask map is multiplied with the original map $F^l$ to obtain the final feature map:

$\hat{F}^l = F^l \odot M$,

where $\odot$ denotes element-wise matrix multiplication.
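A minimal NumPy sketch of this layer dropout (the function name, argument handling, and probability gate are our own assumptions for illustration):

```python
import numpy as np

def layer_dropout(feat, lam=0.6, p=0.5, rng=None):
    """Mask the highest-magnitude activations of a feed-forward feature map.

    feat: (T, d) output of a feed-forward sub-layer.
    lam: threshold ratio; locations with |feat| >= lam * max(|feat|) are dropped.
    p: probability of applying the dropout at all.
    """
    rng = rng or np.random.default_rng(0)
    if rng.random() > p:                      # applied only with probability p
        return feat
    f_max = np.abs(feat).max()                # spatial max-pooling over |feat|
    mask = (np.abs(feat) < lam * f_max).astype(feat.dtype)  # binary masked map
    return feat * mask                        # element-wise multiplication
```

Because the threshold is taken on the absolute value, both the largest positive and the largest negative activations are discarded, matching the definition above.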
3 Experimental Setup
In this work, we focus on the representation extraction approach for downstream speech tasks. Following previous works, the experiments proceed in two stages: (1) pretrain the transformer encoder network by SSL, reconstructing the altered acoustic features; (2) extract the representations from the last layer of the model and compare the performance on downstream tasks. In this section, we explore the experimental results of different threshold-ratio configurations of the dropout methods, and visualize how dropout regularization changes the attention weight matrix and the layer feature map.
For most experiments, we used the publicly available LibriSpeech corpus [Panayotov2015Librispeech]. The train-clean-100 subset (100 hours) of LibriSpeech was used for pretraining. Like previous works on SSL, we used four downstream tasks for evaluation:
PhonemeLinear: phoneme classification with a linear network
Phoneme1Hidden: phoneme classification with one hidden layer followed by a linear layer
SpeakerFrame: frame-wise speaker recognition
SpeakerUtterance: utterance-wise speaker recognition
For the phoneme classification task, we used the aligned phoneme labels and the train/test split provided in CPC [Oord2018Representation] and Modified CPC [Riviere2020Unsupervised]. A linear classifier and a classifier with a single hidden layer are used to measure the linear separability of phonemes. For the speaker recognition task, we also used the same train/test split as provided in CPC. Two types of task are provided: predicting the speaker for each input frame, and predicting the speaker identity conditioned on the averaged vector of each utterance.
We conducted all the experiments using the s3prl toolkit [S3PRL] on the PyTorch framework. The parameters of self-supervised pretraining and the downstream tasks are listed in Table 3.
Table 3: Parameters of self-supervised pretraining and downstream tasks.

| Parameter | Value |
|---|---|
| input mel-scale features | |
| transformer encoder layers | |
| attention hidden size | |
| attention dropout probability | |
| layer dropout probability | |
| Phoneme Classification Task | |
| one hidden layer dimension | |
| Speaker Recognition Task | |
The overall architecture is a three-layer transformer encoder network. The input audio is encoded as mel-scale features. Each transformer encoder layer contains two parts: (1) a multi-head self-attention sub-layer with attention dropout, and (2) a feed-forward sub-layer with layer dropout; the hidden sizes and dropout probabilities follow Table 3, as do the total number of pretraining steps and the batch size.
For the phoneme classification task, we adopt the common setup, using the phoneme classes and the one-hidden-layer dimension of CPC. For the speaker recognition task, the dataset consists of the speakers of train-clean-100. All downstream tasks were trained for the number of steps listed in Table 3, with the parameters of the pretrained models frozen.
We conducted experiments on different configurations of the attention threshold ratio $\gamma$ and the layer threshold ratio $\lambda$. As shown in Table 1, the three-layer transformer encoder model achieves its best single-dropout performance with $\gamma = 0.8$ for attention dropout and $\lambda = 0.6$ for layer dropout. The threshold cannot be set too small, otherwise too many high-activation regions are discarded and the performance degrades. In addition, the closer the threshold is to 1, the closer the results are to 3L-TERA-base [liu2020tera]. For a fair comparison, all of the experimental results in Table 1 were obtained with the same configuration as in Table 3, and the TERA numbers in parentheses are quoted from the original paper [liu2020tera].
We also investigated three strategies for fusing the two dropout regularizations: (1) Attention & Layer Dropout, applying both dropouts together, each with half of the dropout probability; (2) Attention then Layer Dropout, pretraining with attention dropout first and then continuing the pretraining with layer dropout; (3) Layer then Attention Dropout, the reverse order of (2). In our experiments, we found that Attention then Layer Dropout with threshold ratios of 0.9 works better than the two other fusion strategies, and it outperforms attention or layer dropout alone, as presented in Table 1.
As depicted in Table 2, we compared our approach with other SSL methods, choosing published results that use the same training set, train-clean-100 of LibriSpeech. Our best model (Attention then Layer Dropout) achieves a 1.4% relative improvement in accuracy on the PhonemeLinear task and 1.3% on the Phoneme1Hidden task over the original TERA-base model. Although the speaker recognition results are very close to each other, our approach outperforms most of the listed methods on the downstream tasks.
In Figure 2 and Figure 3, we visualize the attention weight matrix under attention dropout and the layer feature map under layer dropout. After attention dropout, the most nearby attention weights of each location in Figure 2(a) are discarded (see Figure 2(b)), and the remaining attention weights are redistributed to more distant locations (see the yellow lines in Figure 2(c)). By contrast, layer dropout acts more like conventional regularization: it suppresses the largest negative activations (see the yellow regions in Figure 3(c)) and discards the largest positive values (see the blue regions in Figure 3(c)). As a result, the feature map (Figure 3(b)) becomes smoother than the original one (Figure 3(a)). Overall, the visualization demonstrates that with dropout regularization, the model suppresses overemphasized local features and captures more global information.
In this paper, we proposed to use attention dropout and layer dropout in the SSL of speech representation. Attention dropout reweights the multi-head attention matrix of each transformer encoder layer, while layer dropout discards the most discriminative activation regions found by spatial max-pooling. The experiments show that the downstream phoneme classification and speaker recognition tasks obtain substantial performance improvements with attention and layer dropout. In future work, we will explore the effect of dropout on other downstream tasks such as speech recognition. We are also interested in investigating the performance of dropout regularization on SSL models beyond the transformer encoder architecture.
This paper is supported by the National Key Research and Development Program of China under grants No. 2018YFB0204403, No. 2017YFB1401202 and No. 2018YFB1003500. The corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd.