1 Introduction
Transformers [vaswani2017attention]
have become the most powerful network architecture in many fields, including speech, natural language processing (NLP), and computer vision (CV). Their outstanding performance rests not only on capturing long-term dependencies of the input sequence but also on remarkable training efficiency. Among all transformer-based models, BERT
[devlin2018bert] is probably the most famous one. It learns strong language representations from unlabeled text. The idea of training transformer-based models on unlabeled audio data has also been widely studied
[song2019speech, jiang2019improving, baevski2019effectiveness, schneider2019wav2vec, pascual2019learning]. The basic idea is that the transformer-based model learns to recover masked audio as pre-training, and the pre-trained model is then fine-tuned on downstream applications. Nonetheless, transformer-based models usually incur a high computational cost because of their self-attention mechanism. Despite its effectiveness, vanilla self-attention [lin2017structured], also known as full attention, suffers severely from both quadratic memory and computation requirements with respect to the input sequence length. Input sentences in NLP already run into this problem, not to mention the much longer input sequences in speech. Therefore, several new attention mechanisms have been proposed to reduce the time complexity.
Here we summarize variants of attention mechanisms for the transformer. Sparse Transformers [child2019generating] craft new attention patterns that imitate the original one in order to reduce memory usage and computation. Two attention masks are proposed in that work, both of which have lower time complexity and satisfactory performance. Routing Transformers [roy2020efficient]
use K-means clustering to determine the candidate keys for each query in self-attention, while Reformer
[kitaev2020reformer] addresses this issue with a hashing algorithm. Moreover, we introduce locality-sensitive hashing (LSH), which can solve the Maximum Inner Product Search (MIPS) problem efficiently [huang2018accurate]; the self-attention mechanism in transformers can be regarded as a MIPS problem. Asymmetric LSH (ALSH) [shrivastava2014asymmetric] provides the first provably sublinear-time hashing algorithm for MIPS, and its authors prove that no symmetric LSH family exists for general MIPS. Subsequently, the same authors propose an improved ALSH algorithm [shrivastava2014improved] that reaches better performance. XBOX [bachrach2014speeding] is another ALSH algorithm. Meanwhile, Simple LSH [neyshabur2014symmetric] argues that neither of the above LSH/ALSH schemes handles MIPS well and proposes two stronger LSH/ALSH algorithms. Then, Query Normalized First (QNF) [huang2018accurate] is proposed; it can be used together with the previously mentioned LSH algorithms and shows superior empirical performance.
Furthermore, aside from hashing methods and handcrafted attention masks, there exist other algorithms that can reduce the time complexity of self-attention. Adaptive Attention [sukhbaatar2019adaptive] is a self-attention mechanism that learns its optimal attention span, allowing models to deal with longer sequences. Longformer [beltagy2020longformer] introduces an attention mechanism that scales linearly with the input length, which makes it easy to process longer input sequences; it also brings in the idea of dilated attention, analogous to dilated CNNs [oord2016wavenet]. Lite Transformer [wu2020lite] uses CNNs along with the original self-attention to accelerate computation and boost performance. Last but not least, SYNTHESIZER [tay2020SYNTHESIZER] proposes two simple yet effective ways to generate attention weights directly, without token-to-token interactions. Note that SYNTHESIZER differs from the LSH/ALSH attentions and Sparse Attention, all of which generate attention masks from token-to-token interactions. This modification accelerates both training and inference drastically.
Among these attentions, some have already been applied to NLP or CV tasks, whereas the others have merely been examined theoretically or with simple empirical evaluations. In this paper, we have the following key contributions:

We implement these attentions, examining their efficiency and effectiveness in self-supervised transformer-based models learned from unlabeled audio. Table 1 summarizes the attentions implemented in this paper.

We propose a new attention mechanism, inspired by two previous works [tay2020SYNTHESIZER, raganato2020fixed], that yields competitive performance with a great reduction in training time.
2 Methodology
2.1 Self-attention
Each transformer layer takes an input vector sequence $X = (x_1, \dots, x_T)$, where $x_i \in \mathbb{R}^{d}$, $T$ is the input sequence length and $d$ stands for the hidden dimension. Next, three different linear layers project the input vector sequence to its corresponding query matrix ($Q$), key matrix ($K$), and value matrix ($V$), respectively. Each vector $x_i$ in $X$ has a query vector $q_i$ ($i$-th row of $Q$), a key vector $k_i$ ($i$-th row of $K$), and a value vector $v_i$ ($i$-th row of $V$). For standard single-head attention [vaswani2017attention], the attention weight matrix $A$ is generated by multiplying the query matrix and the key transpose. An element in $A$ is computed as below:

$$A_{ij} = \frac{q_i^{\top} k_j}{\sqrt{d}} \qquad (1)$$

For multi-head attention, $H$ sets of $(Q, K, V)$ are generated, where $H$ stands for the number of attention heads.
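As a concrete reference point, the full attention of (1) followed by a row-wise softmax can be sketched in a few lines of NumPy; the function name and shapes here are our own illustration, not the paper's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Vanilla full self-attention: O(T^2 * d) time and memory.
    Q, K, V: (T, d) query, key, and value matrices.
    Returns the (T, d) output and the (T, T) attention weight matrix."""
    T, d = Q.shape
    A = Q @ K.T / np.sqrt(d)                   # (T, T) logits, as in (1)
    A = A - A.max(axis=-1, keepdims=True)      # subtract row max for stability
    W = np.exp(A)
    W = W / W.sum(axis=-1, keepdims=True)      # row-wise softmax
    return W @ V, W
```

Both the logits and the softmax are computed over all $T \times T$ pairs, which is exactly the quadratic cost the following sections try to avoid.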
Although multi-head attention is powerful, it requires an enormous amount of computation. This situation deteriorates as input sequences grow longer. A primitive idea is to restrict the number of keys attended to by each query; more specifically, the number of keys attended to should not have a strong positive correlation with the input length. This problem can be solved efficiently by locality-sensitive hashing (LSH). In the following subsections, we elaborate on all the LSH algorithms implemented in this paper.
2.2 Locality-sensitive hashing (LSH)
In (1), the dot-products of all pairs of $q_i$ and $k_j$ have to be computed. The basic idea of LSH is to quickly identify the pairs of $q_i$ and $k_j$ that lead to large enough dot-products. Only the dot-products of the identified pairs have to be computed, while the rest are directly set to zero.
In general, there exist two different transformations $P(\cdot)$ and $Q(\cdot)$ in asymmetric LSH (ALSH) and exactly one transformation $P(\cdot)$ in symmetric LSH (LSH). Both $P(\cdot)$ and $Q(\cdot)$ take a vector as input and output another vector. In the asymmetric case, query vectors and key vectors are encoded with the different transformations $Q(\cdot)$ and $P(\cdot)$ respectively, while we encode $q_i$ and $k_j$ with the same transformation $P(\cdot)$ in the symmetric one. Then, we define the hash function as:

$$h_a(x) = \operatorname{sign}(a^{\top} x) \qquad (2)$$

where $a$ is a random vector with $a \sim \mathcal{N}(0, I)$. For ALSH, if $h_a(Q(q_i)) = h_a(P(k_j))$, query $q_i$ will attend to this key $k_j$; otherwise, we take no action. For LSH, query $q_i$ attends to key $k_j$ only if $h_a(P(q_i)) = h_a(P(k_j))$. Specifically, we define the hyperparameter $c$ to control the number of keys being attended to. That is, we choose only $c$ keys instead of all the keys that meet the hashing condition: among the colliding keys, we directly choose those with the top-$c$ scores

$$\operatorname{top\text{-}}c_{\,j} \left( Q(q_i)^{\top} P(k_j) \right). \qquad (3)$$
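The hash-then-select step described above can be sketched as follows. This is our own illustration: the bit-packing scheme, function names, and single hash table are simplifications (practical implementations batch this and use several tables):

```python
import numpy as np

def sign_hash(X, A):
    """Hash each row of X with random projections A: h_a(x) = sign(a^T x).
    The sign bits are packed into a single integer bucket code per row."""
    bits = (X @ A.T) > 0                         # (n, n_bits) sign bits
    return bits @ (1 << np.arange(A.shape[0]))   # pack bits -> one int per row

def lsh_candidates(Q_t, K_t, A, c):
    """For each (transformed) query, keep at most c keys that fall into the
    same hash bucket. Q_t, K_t: (n, d) transformed queries/keys;
    A: (n_bits, d) random Gaussian projections."""
    q_codes, k_codes = sign_hash(Q_t, A), sign_hash(K_t, A)
    candidates = []
    for code in q_codes:
        idx = np.flatnonzero(k_codes == code)[:c]  # colliding keys, capped at c
        candidates.append(idx)
    return candidates
```

Only the dot-products between a query and its returned candidates need to be computed; all other attention weights are treated as zero.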
Due to space limitations, we only briefly list the formulations of the LSH algorithms evaluated in this paper; for further explanation of each algorithm, please refer to the original papers. All queries and keys have to be normalized before hashing. Note that the normalization methods may differ between LSH algorithms, but for simplicity the normalization is not explicitly formulated in the following descriptions.
2.2.1 Sign-ALSH

There are two hyperparameters $m$ and $U$; we adopt the setting shown to have the best empirical results in [shrivastava2014improved]. Next, to get rid of the norms of the keys, let $x \leftarrow U x / M$ with $M = \max_j \|k_j\|_2$, and define the transformations $P(\cdot)$ and $Q(\cdot)$:

$$P(x) = \left[ x;\; \tfrac{1}{2} - \|x\|_2^{2};\; \tfrac{1}{2} - \|x\|_2^{4};\; \dots;\; \tfrac{1}{2} - \|x\|_2^{2^{m}} \right] \qquad (4)$$

$$Q(x) = \left[ x;\; 0;\; \dots;\; 0 \right] \qquad (5)$$
2.2.2 XBOX

XBOX [bachrach2014speeding] is an asymmetric LSH with neither normalization nor hyperparameters. Its transformations $P(\cdot)$ and $Q(\cdot)$ are defined as:

$$P(x) = \left[ x;\; \sqrt{M^{2} - \|x\|_2^{2}} \right] \qquad (6)$$

$$Q(x) = \left[ x;\; 0 \right] \qquad (7)$$

where $M = \max_j \|k_j\|_2$.
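A minimal sketch of the XBOX transforms of (6) and (7), assuming queries and keys are given as row matrices (the function name is our own):

```python
import numpy as np

def xbox_transforms(Q, K):
    """XBOX asymmetric transforms: append one extra coordinate so that
    maximizing the inner product becomes maximizing cosine similarity.
    P(k) = [k; sqrt(M^2 - ||k||^2)],  Q(q) = [q; 0],  M = max_j ||k_j||."""
    key_norms = np.linalg.norm(K, axis=1)
    M = key_norms.max()
    extra = np.sqrt(M**2 - key_norms**2)           # per-key extra coordinate
    P_K = np.hstack([K, extra[:, None]])           # transformed keys
    Q_Q = np.hstack([Q, np.zeros((Q.shape[0], 1))])  # transformed queries
    return Q_Q, P_K
```

Because the appended query coordinate is zero, all inner products are preserved exactly, while every transformed key now has the same norm $M$, which turns MIPS into a cosine-similarity search that sign hashing handles well.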
2.2.3 Simple LSH & Simple ALSH
There are no hyperparameters in Simple LSH and Simple ALSH [neyshabur2014symmetric]. As for the normalization, let $M = \max_j \|k_j\|_2$ and rescale $x \leftarrow x / M$. For Simple LSH, define the transformation $P(\cdot)$:

$$P(x) = \left[ x;\; \sqrt{1 - \|x\|_2^{2}} \right] \qquad (8)$$

whereas for Simple ALSH, define the two transformations $P(\cdot)$ and $Q(\cdot)$:

$$P(x) = \left[ x;\; \sqrt{1 - \|x\|_2^{2}};\; 0 \right] \qquad (9)$$

$$Q(x) = \left[ x;\; 0;\; \sqrt{1 - \|x\|_2^{2}} \right] \qquad (10)$$
2.2.4 Query Normalized First (QNF)
In QNF [huang2018accurate], the queries are normalized first: let $\hat{q}_i = q_i / \|q_i\|_2$, and define the transformations $P(\cdot)$ and $Q(\cdot)$:

$$P(x) = \left[ x;\; \sqrt{M^{2} - \|x\|_2^{2}} \right] \qquad (11)$$

$$Q(\hat{x}) = \left[ \hat{x};\; 0 \right] \qquad (12)$$
2.3 Sparse attention
Instead of using algorithms to determine which keys should be attended to, Sparse Attention [child2019generating] simply crafts attention masks $M \in \{0, 1\}^{T \times T}$. With this mask, the attention weight matrix $W$ is defined as below,

$$W = M \odot A \qquad (13)$$

where $M$ is multiplied element-wise with the attention weights $A$ in (1). If $M$ is very sparse, a real implementation of (13) only has to consider the elements that are not masked, so the computational speed can be greatly increased. Two different masks are proposed in [child2019generating], which are shown in Figure 0(b) and Figure 0(c).
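To illustrate, a toy strided mask and the masked softmax of (13) can be sketched as below. The exact patterns of [child2019generating] differ in their details; this is only an approximation under our own parameterization (`stride`, `window`):

```python
import numpy as np

def strided_mask(T, stride, window):
    """Toy causal strided mask in the spirit of Sparse Attention: each query
    attends to a local window plus every stride-th earlier position."""
    M = np.zeros((T, T), dtype=bool)
    for i in range(T):
        M[i, max(0, i - window):i + 1] = True        # local causal window
        idx = np.arange(i + 1)
        M[i, idx[idx % stride == 0]] = True          # strided positions
    return M

def masked_softmax(A, M):
    """Apply mask M to logits A: masked-out entries get zero attention weight."""
    A = np.where(M, A, -np.inf)                      # drop masked logits
    A = A - A.max(axis=-1, keepdims=True)
    W = np.exp(A)
    return W / W.sum(axis=-1, keepdims=True)
```

A dense implementation like this still touches all $T^2$ entries; the speedup in practice comes from kernels that skip the masked blocks entirely.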
2.4 Synthesizer
Instead of learning attention masks by algorithms, SYNTHESIZER [tay2020SYNTHESIZER] learns the attention weights directly. We compare the typical attention weight generation processes of the Transformer and SYNTHESIZER in Figure 3. We implement two versions of SYNTHESIZER in this paper.
2.4.1 Dense SYNTHESIZER
The attention weights are generated by feeding $X$ to a function $F(\cdot)$ with two hidden layers, as the left flow in Figure 2(b). Here, $W_1 \in \mathbb{R}^{d_h \times d}$ and $W_2 \in \mathbb{R}^{T \times d_h}$:

$$B_i = W_2 \, \sigma(W_1 x_i + b_1) + b_2 \qquad (14)$$

where $B_i$ is the $i$-th row of attention logits, $\sigma$ is the ReLU activation function, and $d_h$ is a user-defined hyperparameter.

2.4.2 Random SYNTHESIZER
Here $R \in \mathbb{R}^{T \times T}$ is a learnable matrix (in other words, the elements in $R$ are considered network parameters): it is randomly initialized and learned jointly with the other parts of the network. Note that $R$ does not depend on the input sequences; it is the same across all inputs. Namely, all data in a batch share the same attention weights.
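Both SYNTHESIZER variants can be sketched as follows (single head, no batching; the parameter shapes and function names are our own illustration):

```python
import numpy as np

def softmax(A):
    """Row-wise softmax."""
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def random_synthesizer(R, V):
    """Random SYNTHESIZER: the (T, T) logits R are a learnable parameter,
    independent of the input, so every sequence shares the same weights."""
    return softmax(R) @ V

def dense_synthesizer(X, W1, b1, W2, b2, V):
    """Dense SYNTHESIZER, a sketch of (14): a two-layer MLP with ReLU maps
    each token's features to its own row of (T,) attention logits."""
    H = np.maximum(X @ W1 + b1, 0.0)           # (T, d_h) hidden layer, ReLU
    logits = H @ W2 + b2                       # (T, T) attention logits
    return softmax(logits) @ V
```

In neither case are query-key dot-products computed, which is why the attention-weight generation itself is so cheap, and why the Random variant costs nothing extra at inference.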
2.4.3 Proposed method
The proposed method is based on Random SYNTHESIZER, but the initialization comes from the crafted attention masks of previous work [raganato2020fixed]. Among the 12 attention heads in our model, seven are initialized according to the proposed patterns [raganato2020fixed], while the remaining five are randomly initialized.
| Attention | Time Complexity (Training) | Time Complexity (Inference) | Application Fields |
| --- | --- | --- | --- |
| Baseline (QK) [vaswani2017attention] |  |  | Speech, CV, NLP |
| Baseline (Q) [kitaev2020reformer] |  |  | CV, NLP |
| Sparse (strided) [child2019generating] |  |  | CV |
| Sparse (fixed) [child2019generating] |  |  | CV |
| Sign-ALSH [shrivastava2014improved] |  |  | RS‡ |
| XBOX [bachrach2014speeding] |  |  | RS |
| XBOX (QNF) [huang2018accurate] |  |  | RS |
| Simple LSH [neyshabur2014symmetric] |  |  | RS |
| Simple ALSH [neyshabur2014symmetric] |  |  | RS |
| SYN. (Dense) [tay2020SYNTHESIZER] |  |  | NLP |
| SYN. (Dense+M§) [tay2020SYNTHESIZER] |  |  | NLP |
| SYN. (Random) [tay2020SYNTHESIZER] |  | — | NLP |
| SYN. (Ours) |  | — | — |

‡ RS: Recommender Systems. § M denotes multi-head attention.
| Attention | Speaker (Utterance) | Speaker (Frame) | Phoneme (1-hidden) | Phoneme (2-hidden) |
| --- | --- | --- | --- | --- |
| Baseline (Mel¶) | 0.0060 | 0.0033 | 0.5246 | 0.5768 |
| Baseline (QK) | 0.9926 | 0.9824 | 0.6460 | 0.6887 |
| Baseline (Q) | 0.9898 | 0.9622 | 0.5893 | 0.6345 |
| Sparse (Strided) | 0.9786 | 0.9039 | 0.6048 | 0.6450 |
| Sparse (Fixed) | 0.9597 | 0.7960 | 0.6069 | 0.6846 |
| Sign-ALSH | 0.9716 | 0.8237 | 0.5863 | 0.6393 |
| XBOX | 0.9639 | 0.7994 | 0.5860 | 0.6262 |
| XBOX (QNF) | 0.9667 | 0.7958 | 0.5819 | 0.6241 |
| Simple LSH | 0.9628 | 0.7370 | 0.5771 | 0.6189 |
| Simple ALSH | 0.9678 | 0.7999 | 0.5783 | 0.6214 |
| SYN. (Dense) | 0.9660 | 0.9027 | 0.6180 | 0.6287 |
| SYN. (Dense+M) | 0.9509 | 0.9135 | 0.6073 | 0.6471 |
| SYN. (Random) | 0.9803 | 0.8868 | 0.5820 | 0.6237 |
| SYN. (Ours) | 0.9842 | 0.9855 | 0.6157 | 0.6492 |

¶ Input acoustic features (Mel features) applied directly to the downstream models.
3 Experiments
All attentions and SYNTHESIZER models are compared in Table 1. There are three main groups, which stand for Sparse Attention, LSH, and SYNTHESIZER respectively. We compare the time complexity of both training and inference; we also list the application fields to which each attention has previously been applied.
Here, we can make four key observations: 1) For Sparse Attention and LSH, though inference takes slightly longer, they require less training time as the input length $T$ grows. 2) The second terms in the LSH complexities account for hashing, which we implement in float16; this is the source of the extra factor in those terms. Also, the hashing function carries no gradient; thus, its training time complexity is the same as its inference complexity. 3) For SYN. (Dense), choosing a small hidden dimension $d_h$ accelerates training dramatically, which is what we do in this work. 4) SYN. (Random) does not need any computation to generate attention weights during inference.
3.1 Pre-training
Following the two-stage training process of BERT [devlin2018bert], we pre-train the transformer-based models on unlabeled audio data (the LibriSpeech [panayotov2015librispeech] train-clean-360 subset) and fine-tune them for the downstream tasks. All transformer-based models have the same architecture as a 6-layered Audio ALBERT (AALBERT) [chi2019aalbert]. We adopt the shared-QK attention [kitaev2020reformer], which ties the weights of Q and K to further reduce the computation requirement, for all the methods mentioned in Sections 2.2 and 2.3. All other settings of the pre-training stage follow those of [chi2019aalbert]. The models are trained for 500k steps with a batch size of 50 and a learning rate of 5e-5. The optimizer is LAMB [you2019large].
3.2 Performance on downstream tasks
We evaluate the self-supervised models on three downstream tasks: utterance-level speaker classification (on the train-clean-100 subset), frame-level speaker classification (on train-clean-100), and phoneme classification (on the train-clean-360 subset with phoneme labels). In the downstream tasks, the pre-trained models are used as feature extractors whose parameters are fixed during training. For utterance-level speaker classification, the extracted representations are passed to an average pooling layer, and the mean representation is then fed to a linear classifier. In the frame-level task, we simply train a linear classifier to predict the speaker label of every audio frame. For phoneme classification, we use both a one-hidden-layer and a two-hidden-layer downstream model; here a layer consists of a linear layer followed by a ReLU activation.
The results are shown in Table 2. Baseline (QK) and Baseline (Q) (shared-QK attention) remarkably outperform Baseline (Mel), which shows the importance of pre-training. The LSH/ALSH algorithms have a negative influence on most downstream tasks, showing that restricting attention by LSH/ALSH is not effective enough. The exception is utterance-level speaker classification, where the average pooling layer in the downstream model acts like a global attention mechanism that compensates for the effects of LSH/ALSH.
Sparse Attention obtains higher accuracy than LSH/ALSH, which suggests that local information is important: Sparse Attention always contains a fixed-size local window, whereas LSH/ALSH does not.
The SYNTHESIZER models perform even better on average than the other two groups; however, they fail to match Baseline (Q) on frame-level speaker classification. Our model, which combines a Random SYNTHESIZER with some handcrafted masks, achieves performance competitive with Baseline (Q) and even outperforms Baseline (QK) on frame-level speaker classification. Notably, a Random SYNTHESIZER model can require less training time and less memory than Baseline (QK).
Figure 4 plots utterance-level speaker classification accuracy against the number of keys attended to. Sparse Attention obtains better performance than LSH/ALSH but attends to more keys. The number of keys for all SYNTHESIZER variants is counted as zero because they generate the attention weights directly. The figure shows that the proposed approach achieves better performance with less computation.
In Figure 2, we visualize the attention weights of all 12 heads in our model. The seven heads initialized with the crafted patterns stay very similar to those patterns, while the other heads remain rather random. This outcome may be partly because the crafted patterns already capture the relevant information well enough; therefore, the remaining heads only need to learn the details neglected by the first seven heads.
4 Conclusion
We explore reducing the computation complexity of transformer-based models for self-supervised speech representation learning. We apply LSH and Sparse Attention to limit the number of token-to-token interactions in self-attention, and we also introduce the recently proposed attention module SYNTHESIZER. We then propose combining the Random SYNTHESIZER with handcrafted patterns. In the experiments, our proposed architecture not only performs comparably to vanilla transformer-based models on downstream tasks but also requires less training time and memory.