have become the most powerful network architecture in many fields, including speech, natural language processing (NLP), and computer vision (CV). Their outstanding performance stems not only from capturing long-term dependencies in the input sequence but also from remarkable training efficiency. Among all transformer-based models, BERT [devlin2018bert] is probably the most famous one. It can learn strong language representations from unlabeled text. The idea of training transformer-based models on unlabeled audio data has also been widely studied [song2019speech, jiang2019improving, baevski2019effectiveness, schneider2019wav2vec, pascual2019learning]. The basic idea is that the transformer-based model learns to recover masked audio during pre-training, and the pre-trained model is then fine-tuned for downstream applications.
Nonetheless, transformer-based models usually incur a high computational cost because of their self-attention mechanism. Despite its effectiveness, vanilla self-attention [lin2017structured], also known as full attention, has memory and computation requirements that are quadratic in the input sequence length. Input sentences in NLP already run into this problem, not to mention the much longer input sequences in speech. Therefore, several new attention mechanisms have been proposed to reduce the time complexity.
Here we summarize variants of attention mechanisms for the transformer. Sparse Transformers [child2019generating] craft new attention patterns that imitate the original attention pattern in order to reduce memory usage and computation. Two attention masks are proposed in that work, both of which have lower time complexity and satisfactory performance. Routing Transformers [roy2020efficient] use K-means clustering to determine the candidates for each query in self-attention, while Reformer [kitaev2020reformer] addresses the same issue with a hashing algorithm.
Moreover, we introduce locality-sensitive hashing (LSH), which can solve the Maximum Inner Product Search (MIPS) problem efficiently [huang2018accurate]; the self-attention mechanism in transformers can be regarded as a MIPS problem. Asymmetric LSH (ALSH) [shrivastava2014asymmetric] provides the first provably sublinear-time hashing algorithm for MIPS, and its authors prove that there does not exist any LSH family for MIPS. Subsequently, the same authors propose an improved ALSH algorithm [shrivastava2014improved], which reaches better performance. XBOX [bachrach2014speeding] is another ALSH algorithm. At the same time, Simple LSH [neyshabur2014symmetric] argues that neither LSH nor ALSH can deal with MIPS well and proposes two stronger LSH/ALSH algorithms. Finally, Query Normalized First (QNF) [huang2018accurate] is a method that can be used together with the previously mentioned LSH algorithms and shows superior empirical performance.
Furthermore, aside from hashing methods and handcrafted attention masks, other algorithms can also reduce the time complexity of self-attention. Adaptive Attention [sukhbaatar2019adaptive] is a self-attention mechanism that learns its optimal attention span, allowing models to deal with longer sequences. Longformer [beltagy2020longformer] introduces an attention mechanism that scales linearly with the input length, which makes it easy to process longer input sequences. It also introduces dilated attention, analogous to dilated CNNs [oord2016wavenet]. Lite Transformer [wu2020lite] uses CNNs along with the original self-attention to accelerate the model and boost its performance. Last but not least, SYNTHESIZER [tay2020SYNTHESIZER] proposes two simple yet effective ways to generate attention weights directly without token-token interactions, which accelerates both training and inference drastically. Note that SYNTHESIZER differs from the LSH/ALSH attentions and Sparse attentions, all of which generate attention masks from token-token interactions.
Among these attentions, some have already been applied to NLP or CV tasks, whereas the others have only been examined theoretically and with simple empirical evaluations. In this paper, we make the following key contributions:
We implement each of these attentions and evaluate their efficiency and effectiveness on self-supervised transformer-based models learned from unlabeled audio. Table 1 summarizes the attentions implemented in this paper.
We propose a new attention mechanism, inspired by two previous works [tay2020SYNTHESIZER, raganato2020fixed], that yields competitive performance with a large reduction in training time.
Each transformer layer takes an input vector sequence $X = \{x_1, \dots, x_L\}$, and $X \in \mathbb{R}^{L \times d}$, where $L$ is the input sequence length and $d$ stands for the hidden dimension. Next, three different linear layers project the input vector sequence to its corresponding query matrix ($Q$), key matrix ($K$), and value matrix ($V$), respectively. Each vector $x_i$ in $X$ has a query vector $q_i$ (the $i$-th row of $Q$), a key vector $k_i$ (the $i$-th row of $K$), and a value vector $v_i$ (the $i$-th row of $V$). For standard single-head attention [vaswani2017attention], the attention weight matrix $A$ is generated by multiplying the query matrix by the transpose of the key matrix. An element $A_{ij}$ of $A$ is computed as below.
For multi-head attention, $H$ attention weight matrices are generated, one per head, where $H$ stands for the number of attention heads.
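As a concrete illustration of the single-head case described above, the following minimal Python sketch computes an attention weight matrix from small query and key matrices. It assumes the usual scaled dot-product form with a row-wise softmax from [vaswani2017attention]; the function name `attention_weights` and the toy inputs are illustrative and are not taken from this paper.

```python
import math

def attention_weights(queries, keys):
    """Row-wise softmax of scaled query-key dot products (assumed standard form)."""
    d = len(queries[0])
    weights = []
    for q in queries:
        # Scaled dot product between this query and every key.
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy example: three 2-dimensional queries and keys.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = attention_weights(Q, K)
```

Every query is compared with every key, so the cost of this step grows quadratically with the sequence length, which is the bottleneck the remaining sections try to reduce.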
Although multi-head attention is powerful, it requires an enormous amount of computation, and the cost grows as input sequences become longer. A natural remedy is to restrict the number of keys each query attends to. More specifically, the number of keys attended to should not have a strong positive correlation with the input length. This can be achieved efficiently with locality-sensitive hashing (LSH). In the following subsections, we describe all the LSH algorithms we implement in this paper.
2.2 Locality-sensitive hashing (LSH)
In (1), the dot-products of all pairs of query vectors $q_i$ and key vectors $k_j$ have to be computed. The basic idea of LSH is to quickly identify the pairs of $q_i$ and $k_j$ that lead to large enough dot-products. Only the dot-products of the identified pairs have to be computed, while the rest are directly set to zero.
In general, there exist two different transformations $f_q(\cdot)$ and $f_k(\cdot)$ in asymmetric LSH (ALSH) and exactly one transformation $f(\cdot)$ in symmetric LSH (LSH). Each transformation takes a vector $x$ as input and outputs another vector, where $x$ can be a query vector $q_i$ or a key vector $k_j$. In the asymmetric case, query vectors and key vectors are encoded with the different transformations $f_q$ and $f_k$, respectively, while in the symmetric case we encode $q_i$ and $k_j$ with the same transformation $f$. Then, we define the hash function as:
\[ h(x) = \mathrm{sign}(a^\top x), \]
where $a$ is a random vector with $a \sim \mathcal{N}(0, I)$. For ALSH, if $h(f_q(q_i)) = h(f_k(k_j))$, query $q_i$ will attend to this key $k_j$; otherwise, we take no action. For LSH, query $q_i$ attends to the key $k_j$ only if $h(f(q_i)) = h(f(k_j))$. Specifically, we define a hyperparameter to control the number of keys being attended to. That is, we choose only that many keys instead of all the keys that meet the hashing condition. Here we directly choose the keys with the top values:
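The hashing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it assumes a sign random projection hash with a Gaussian random vector, uses identity transformations by default, and omits the top-value selection step because its ranking criterion is not recoverable from the text. The names `make_sign_hash` and `candidate_keys` are hypothetical.

```python
import random

def make_sign_hash(dim, seed=0):
    """Return h(x) = sign(a . x) for a fixed Gaussian random vector a (assumed form)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    def h(x):
        projection = sum(ac * xc for ac, xc in zip(a, x))
        return 1 if projection >= 0 else -1

    return h

def candidate_keys(query, keys, h, q_transform=lambda x: x, k_transform=lambda x: x):
    """Indices of keys whose hash matches the hash of the (transformed) query."""
    hq = h(q_transform(query))
    return [j for j, k in enumerate(keys) if h(k_transform(k)) == hq]

# Toy example: the second key is the negated query, so its hash must differ.
query = [0.5, -1.0, 2.0]
keys = [[0.5, -1.0, 2.0], [-0.5, 1.0, -2.0], [1.0, 0.0, 1.0]]
h = make_sign_hash(dim=3, seed=0)
candidates = candidate_keys(query, keys, h)
```

Passing two different functions as `q_transform` and `k_transform` gives the asymmetric case, while passing the same function for both gives the symmetric case.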
Due to space limitations, we only briefly list the formulations of the LSH algorithms evaluated in this paper. For further explanation of each algorithm, please refer to the original papers. All queries and keys have to be normalized before hashing. Note that the normalization methods may differ across LSH algorithms; for simplicity, the normalization is not explicitly formulated in the following description.
2.2.1 Sign-ALSH
Sign-ALSH [shrivastava2014improved] has two hyperparameters, for which we adopt the values shown to give the best empirical results [shrivastava2014improved]. Next, to get rid of the norms of the queries and keys, the vectors are rescaled and the two transformations are defined:
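For reference, the transformations of the improved ALSH scheme as we understand them from [shrivastava2014improved] can be sketched as follows. This is a hedged illustration: the defaults `m=2` and `U=0.75` are the values we recall being recommended in that paper and are assumptions here, and the function names are hypothetical.

```python
import math

def rescale(vectors, U=0.75):
    """Scale all vectors by one constant so that the largest norm equals U < 1."""
    max_norm = max(math.sqrt(sum(c * c for c in x)) for x in vectors)
    return [[U * c / max_norm for c in x] for x in vectors]

def sign_alsh_key(x, m=2):
    """Append 1/2 - ||x||^2, 1/2 - ||x||^4, ..., 1/2 - ||x||^(2^m) to a key."""
    power = sum(c * c for c in x)  # ||x||^2
    extras = []
    for _ in range(m):
        extras.append(0.5 - power)
        power = power * power      # ||x||^2 -> ||x||^4 -> ||x||^8 ...
    return list(x) + extras

def sign_alsh_query(x, m=2):
    """Append m zeros to a query so that it matches the key dimension."""
    return list(x) + [0.0] * m

keys = rescale([[3.0, 4.0], [1.0, 0.0]])
transformed_key = sign_alsh_key(keys[0])
transformed_query = sign_alsh_query([1.0, 0.0])
```

With these definitions the squared norm of a transformed key equals m/4 plus ||x|| raised to the power 2^(m+1), which approaches a constant as that last term vanishes; this is what makes a sign-based hash usable for inner products.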
2.2.2 XBOX
XBOX [bachrach2014speeding] is an asymmetric LSH that involves neither normalization nor hyperparameters. Its two transformations are defined as:
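For reference, the XBOX transformations as we understand them from [bachrach2014speeding] can be written as below. The symbols $f_q$, $f_k$, and $M$ are notation assumed here for illustration, with $M$ the largest key norm; they are not taken from this paper.

```latex
\[
  f_k(k_j) = \Bigl[\, k_j ;\ \sqrt{M^2 - \lVert k_j \rVert_2^2} \,\Bigr],
  \qquad
  f_q(q_i) = \bigl[\, q_i ;\ 0 \,\bigr],
  \qquad
  M = \max_j \lVert k_j \rVert_2 .
\]
```

Under this form, $f_q(q_i) \cdot f_k(k_j) = q_i \cdot k_j$ and every transformed key has norm $M$, so ranking keys by inner product becomes equivalent to ranking them by angle to the transformed query.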
2.2.3 Simple LSH & Simple ALSH
There is no hyperparameter in Simple LSH and Simple ALSH [neyshabur2014symmetric]. The queries and keys are first normalized. For Simple LSH, a single transformation is defined:
whereas for Simple ALSH, two transformations are defined:
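The two schemes can be illustrated with a short Python sketch based on our reading of [neyshabur2014symmetric]. It assumes that all vectors have already been scaled into the unit ball; the function names are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def residual(x):
    """sqrt(1 - ||x||^2), clipped at zero to guard against rounding errors."""
    return math.sqrt(max(0.0, 1.0 - dot(x, x)))

def simple_lsh(x):
    """Symmetric case: the same transformation [x; sqrt(1 - ||x||^2)] for queries and keys."""
    return list(x) + [residual(x)]

def simple_alsh_key(x):
    """Asymmetric case, key side: [x; sqrt(1 - ||x||^2); 0]."""
    return list(x) + [residual(x), 0.0]

def simple_alsh_query(x):
    """Asymmetric case, query side: [x; 0; sqrt(1 - ||x||^2)]."""
    return list(x) + [0.0, residual(x)]

q = [0.6, 0.0]
k = [0.3, 0.4]
lsh_key = simple_lsh(k)
alsh_inner = dot(simple_alsh_query(q), simple_alsh_key(k))
```

In the asymmetric variant the extra coordinates never overlap, so the inner product of the transformed vectors equals the original inner product while both transformed vectors have unit norm.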
2.2.4 Query Normalized First (QNF)
In QNF [huang2018accurate], the transformations are defined as follows:
2.3 Sparse attention
Instead of using algorithms to determine which keys should be attended to, Sparse Attention [child2019generating] merely uses hand-crafted attention masks $M$. With this mask, the attention weight matrix is defined as below,
where $M$ is multiplied element-wise with the attention weight in (1). If $M$ is very sparse, a real implementation of (13) only has to consider the elements that are not masked, so the computation can be greatly accelerated. There are two different masks proposed in [child2019generating], which are shown in Figure 0(b) and Figure 0(c).
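The two mask families can be sketched in Python as follows. This is an illustration under assumptions: it follows the causal strided and fixed patterns as described in [child2019generating], whereas the masks used in this paper may be adapted (for example, to attend in both directions), and the masked weights are not renormalized. All function names are hypothetical.

```python
def strided_mask(n, stride):
    """Causal mask: a local window plus every `stride`-th earlier position."""
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            local = i - j < stride
            periodic = (i - j) % stride == 0
            mask[i][j] = int(local or periodic)
    return mask

def fixed_mask(n, block, c=1):
    """Causal mask: the query's own block plus the last `c` positions of every block."""
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            same_block = j // block == i // block
            summary = j % block >= block - c
            mask[i][j] = int(same_block or summary)
    return mask

def apply_mask(weights, mask):
    """Element-wise product of attention weights and a 0/1 mask."""
    return [[w * m for w, m in zip(w_row, m_row)] for w_row, m_row in zip(weights, mask)]

strided = strided_mask(6, stride=3)
fixed = fixed_mask(6, block=3)
uniform = [[1.0 / 6] * 6 for _ in range(6)]
masked = apply_mask(uniform, strided)
```

A practical implementation would skip the masked positions entirely instead of multiplying by zero, which is where the savings in computation come from.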
2.4 SYNTHESIZER
Instead of learning attention masks by algorithms, SYNTHESIZER [tay2020SYNTHESIZER] learns the attention weights directly. We compare the typical attention weight generation process of Transformer and SYNTHESIZER in Figure 3. We implement two versions of SYNTHESIZER in this paper.
2.4.1 Dense SYNTHESIZER
2.4.2 Random SYNTHESIZER
Here $R$ ($R \in \mathbb{R}^{L \times L}$, with $L$ the input sequence length) is a learnable matrix (in other words, the elements of $R$ are considered network parameters), which is randomly initialized and learned jointly with the other parts of the network. Note that $R$ does not depend on the input sequences; it is the same across all inputs. Namely, all data in a batch share the same attention weights.
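A minimal Python sketch of this input-independent attention is given below. It assumes, following [tay2020SYNTHESIZER], that the learnable matrix is passed through a row-wise softmax before being applied to the value vectors; the class name `RandomSynthesizerHead` is hypothetical, and no training step is shown.

```python
import math
import random

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

class RandomSynthesizerHead:
    """One head whose attention weights come from a learnable matrix, not from the input."""

    def __init__(self, length, seed=0):
        rng = random.Random(seed)
        # Randomly initialized parameters; in training these would be updated by gradients.
        self.R = [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(length)]

    def attention(self):
        return [softmax(row) for row in self.R]

    def forward(self, values):
        A = self.attention()
        dim = len(values[0])
        return [[sum(A[i][j] * values[j][c] for j in range(len(values))) for c in range(dim)]
                for i in range(len(A))]

head = RandomSynthesizerHead(length=4)
out_a = head.forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
out_b = head.forward([[2.0, 2.0], [2.0, 2.0], [2.0, 2.0], [2.0, 2.0]])
```

The attention matrix returned by `attention()` is identical for both inputs; only the value vectors differ between the two calls.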
2.4.3 Proposed method
The proposed method is based on Random SYNTHESIZER, but the initialization comes from the hand-crafted attention patterns of a previous work [raganato2020fixed]. Among the 12 attention heads in our model, seven are initialized according to the patterns proposed in [raganato2020fixed], while the others are randomly initialized.
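The mixed initialization can be sketched as follows. This is only an illustration under assumptions: the seven offsets below are placeholder diagonal patterns, not the actual patterns of [raganato2020fixed], and the choice of large and small logits as well as the function names are hypothetical.

```python
import random

def pattern_init(length, offset, on=5.0, off=-5.0):
    """Logits that put most softmax mass on position i + offset (clamped to the sequence)."""
    rows = []
    for i in range(length):
        target = min(max(i + offset, 0), length - 1)
        rows.append([on if j == target else off for j in range(length)])
    return rows

def random_init(length, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(length)]

def init_heads(length, pattern_offsets, num_heads=12, seed=0):
    """Pattern-initialized heads first, then randomly initialized heads."""
    rng = random.Random(seed)
    heads = [pattern_init(length, offset) for offset in pattern_offsets]
    heads += [random_init(length, rng) for _ in range(num_heads - len(heads))]
    return heads

# Seven placeholder patterns (hypothetical) plus five random heads.
heads = init_heads(8, pattern_offsets=[0, -1, 1, -2, 2, -3, 3])
```

All twelve matrices remain trainable after initialization, so the pattern heads are free to drift away from their starting patterns during pre-training.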
|Attention||Time Complexity (Training)||Time Complexity (Inference)||Application Fields|
|Baseline (QK) [vaswani2017attention]|| || ||Speech, CV, NLP|
|Baseline (Q) [kitaev2020reformer]|| || ||CV, NLP|
|Sparse (strided) [child2019generating]|| || ||CV|
|Sparse (fixed) [child2019generating]|| || ||CV|
|Sign-ALSH [shrivastava2014improved]|| || ||RS|
|XBOX (QNF) [huang2018accurate]|| || ||RS|
|Simple LSH [neyshabur2014symmetric]|| || ||RS|
|Simple ALSH [neyshabur2014symmetric]|| || ||RS|
|SYN. (Dense) [tay2020SYNTHESIZER]|| || ||NLP|
|SYN. (Dense+M) [tay2020SYNTHESIZER]|| || ||NLP|
|SYN. (Random) [tay2020SYNTHESIZER]|| ||-||NLP|
RS: Recommender Systems. M denotes multi-head attention.

|Baseline (Mel)||0.0060||0.0033||0.5246||0.5768|
Baseline (Mel): apply input acoustic features (Mel features) directly to downstream models.
All attentions and SYNTHESIZER models are compared in Table 1. There are three main groups: Sparse Attention, LSH, and SYNTHESIZER. We compare the time complexity of both training and inference, and we also list the fields in which each attention has been applied.
Here, we can make four key observations: 1) For Sparse Attention and LSH, though inference time is a bit longer, they require less training time as the input sequence length grows larger. 2) The second term in each LSH complexity is for hashing, which we implement in float16; this is why the hashing term carries an extra factor. Also, there is no gradient in the hashing function; thus, the training time complexity is the same as the inference complexity. 3) For Syn. (Dense), an appropriate hyperparameter setting can accelerate training dramatically, and we adopt such a setting in this work. 4) Syn. (Random) does not need any computation to generate attention weights during inference.
Following the two-stage training process of BERT [devlin2018bert], we pre-train the transformer-based models on unlabeled audio data (Librispeech [panayotov2015librispeech] train-clean-360 subset) and fine-tune them for the downstream tasks. All transformer-based models have the same architecture as a 6-layered Audio ALBERT (AALBERT) [chi2019aalbert]. We adopt the shared-QK attention [kitaev2020reformer] for all the methods mentioned in Sections 2.2 and 2.3, which ties the weights of Q and K to further reduce the computation requirement. All the other settings of the pre-training stage follow those of [chi2019aalbert]. The models are trained for 500k steps with a batch size of 50 and a learning rate of 5e-5. The optimizer is LAMB [you2019large].
3.2 Performance of Downstream tasks
We evaluate the self-supervised models on three downstream tasks: utterance-level speaker classification (on the train-clean-100 subset), frame-level speaker classification (on the train-clean-100 subset), and phoneme classification (on the train-clean-360 subset with phoneme labels). In downstream tasks, the pre-trained models are used as feature extractors whose parameters are fixed during training. For utterance-level speaker classification, the extracted representations are passed to an average pooling layer, and the mean representation is then fed to a linear classifier. For the frame-level task, we simply train a linear classifier to predict the speaker label of every audio frame. For phoneme classification, we use both a one-hidden-layer model and a two-hidden-layer model; here a layer consists of a linear layer followed by a ReLU activation function.
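The two speaker classification heads described above can be sketched in Python as follows. This is a minimal illustration with toy weights, assuming the frozen pre-trained model has already produced one representation vector per frame; the function names are hypothetical.

```python
def mean_pool(frames):
    """Average the frame representations of one utterance."""
    count = len(frames)
    return [sum(frame[c] for frame in frames) / count for c in range(len(frames[0]))]

def linear(x, weight, bias):
    """One linear layer: weight has one row per output class."""
    return [sum(w * xc for w, xc in zip(row, x)) + b for row, b in zip(weight, bias)]

def utterance_logits(frames, weight, bias):
    """Utterance-level task: pool first, then classify once."""
    return linear(mean_pool(frames), weight, bias)

def frame_logits(frames, weight, bias):
    """Frame-level task: classify every frame independently."""
    return [linear(frame, weight, bias) for frame in frames]

# Toy example: two frames of dimension 2 and three speaker classes.
frames = [[1.0, 2.0], [3.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
utt = utterance_logits(frames, weight, bias)
per_frame = frame_logits(frames, weight, bias)
predicted_speaker = utt.index(max(utt))
```

Only `weight` and `bias` would be trained in this setting, since the parameters of the feature extractor stay fixed.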
The results are shown in Table 2. Baseline (QK) and Baseline (Q) (shared-QK attention) remarkably outperform Baseline (Mel), which shows the importance of pre-training. LSH/ALSH algorithms hurt performance on most downstream tasks, showing that restricting the attention with LSH/ALSH algorithms is not effective enough. For utterance-level speaker classification, the average pooling layer in the downstream model acts like a global attention mechanism, which compensates for the effects of LSH/ALSH.
Sparse Attention obtains higher accuracy than LSH/ALSH, which shows that local information might be important since Sparse Attention always contains a fixed-size local window whereas LSH/ALSH does not.
SYNTHESIZER models perform even better on average than the other two groups; however, they fail to match Baseline (Q) on the frame-level speaker classification task. Our model, which combines a SYNTHESIZER random model with some hand-crafted masks, achieves performance competitive with Baseline (Q) and even outperforms Baseline (QK) on the frame-level speaker classification task. Notably, a SYNTHESIZER random model can require less training time and less memory than Baseline (QK).
Figure 4 plots utterance-level speaker classification performance against the number of keys attended to. Sparse attention obtains better performance than LSH/ALSH but attends to more keys. The number of keys for all the SYNTHESIZER models is counted as zero because they generate the attention weights directly. This figure shows that the proposed approach achieves better performance with less computation.
In Figure 2, we visualize the attention weights of all 12 heads in our model. The seven pattern-initialized heads remain very similar to their corresponding patterns at initialization, while the other heads are rather random. This outcome may be due partially to the fact that the patterns of these seven heads already capture the information well enough; therefore, the other heads only need to learn the details neglected by the previous seven heads.
We explore the possibility of reducing computational complexity in transformer-based models for self-supervised representation learning. We try LSH and Sparse attention to limit the number of token-token interactions in self-attention. We also introduce the recently proposed SYNTHESIZER attention modules. Then, we propose to combine the SYNTHESIZER random model with hand-crafted patterns. In the experiments, our proposed architecture not only performs comparably to vanilla transformer-based models on downstream tasks but also requires less training time and less memory.