1 Introduction
Sequence modelling is a fundamental task in machine learning, integral to a variety of applications such as neural machine translation
(Bahdanau et al., 2014; Xu et al., 2015), summarization (Maybury, 1999), and automatic speech recognition (Dong et al., 2018) and synthesis (Oord et al., 2016). Transformers (Vaswani et al., 2017) have proven to be a powerful tool, significantly advancing the state of the art for the majority of the aforementioned tasks. In particular, transformers employ self-attention, which allows them to handle long sequences without the vanishing-gradient problem inherent in RNNs
(Hochreiter et al., 2001; Arjovsky et al., 2016). Nonetheless, despite their impressive performance, self-attention comes with computational and memory requirements that scale quadratically with the sequence length, limiting its applicability to long sequences. The quadratic complexity becomes apparent if we consider the core mechanism of self-attention, namely splitting the input sequence into queries and keys and then having each query attend to all keys. Consequently, there has recently been increasing interest in developing methods that address this limitation (Dai et al., 2019; Sukhbaatar et al., 2019; Child et al., 2019; Kitaev et al., 2020).
These methods can be broadly categorized into two distinct lines of work: those that focus on improving the asymptotic complexity of the self-attention computation (Child et al., 2019; Kitaev et al., 2020) and those that develop techniques for making transformers applicable to longer sequences without addressing the quadratic complexity of self-attention (Dai et al., 2019; Sukhbaatar et al., 2019). The former limit the number of keys that each query attends to, thus reducing the asymptotic complexity. The latter increase the length of the sequence that a transformer can attend to without altering the underlying complexity of the self-attention mechanism.
In this work, we propose clustered attention, a fast approximation of self-attention. Clustered attention exploits similarities between queries and groups them in order to reduce the computational cost. In particular, we perform fast clustering using locality-sensitive hashing and K-Means and compute the attention only once per cluster. This results in linear complexity for a fixed number of clusters (§ 3.2). In addition, we showcase that we can further improve the quality of our approximation by separately considering, for each cluster, the keys with the highest attention (§ 3.3). Finally, we provide theoretical bounds on our approximation quality with respect to the full attention (§ 3.2.1, § B) and show that our model can be applied for inference of pre-trained transformers with minimal loss in performance. We evaluate our model on two automatic speech recognition datasets and showcase that clustered attention consistently achieves better performance than vanilla attention when the computational budget is equalized. Moreover, we demonstrate that our proposed attention can approximate a pre-trained RoBERTa model on the popular GLUE and SQuAD benchmarks with only 25 clusters and without loss in performance.
2 Related Work
In this section, we discuss the most relevant works on scaling transformers to larger sequences. We start by presenting approaches that aim to speed up the attention computation in general. Subsequently, we discuss approaches that speed up transformers without changing the complexity of the attention layer and finally, we summarize the most related works on improving the asymptotic complexity of the attention layer in transformer models.
2.1 Attention Improvements Before Transformers
Attention has been an integral component of neural networks for sequence modelling for several years
(Bahdanau et al., 2014; Xu et al., 2015; Chan et al., 2016). However, its quadratic complexity with respect to the sequence length hinders its applicability to long sequences. Among the first attempts to address this was the work of Britz et al. (2017),
who propose to aggregate the information of the input sequence into fewer vectors and to attend over these fewer vectors, thus speeding up the attention computation and reducing the memory requirements. However, the input aggregation is performed using a learned but fixed matrix that remains constant across sequences, significantly limiting the expressivity of the model. Similarly,
Chiu & Raffel (2017) limit the number of elements accessible to the attention by attending monotonically from the past to the future. Namely, if output timestep $i$ attends to input position $j$, then output timestep $i+1$ cannot attend to any position earlier than $j$. Note that in order to speed up the attention computation, the above methods limit the number of elements that each layer attends to. Recently, some of these approaches have also been applied in the context of transformers (Ma et al., 2020).

2.2 Non-asymptotic Improvements
In this section, we summarize techniques that seek to apply transformers to long sequences without focusing on improving the quadratic complexity of selfattention. The most important are Adaptive Attention Span Transformers (Sukhbaatar et al., 2019) and TransformerXL (Dai et al., 2019).
Sukhbaatar et al. (2019) propose to limit the self-attention context to the closest samples (the attention span), in terms of relative distance with respect to the current time step, thus reducing both the time and memory requirements of the self-attention computation. This is achieved using a masking function with learnable parameters that allows the network to increase the attention span if necessary. Transformer-XL (Dai et al., 2019), on the other hand, seeks to increase the effective sequence length by introducing segment-level recurrent training, namely splitting the input into segments and attending jointly to the previous and the current segment. The above, combined with a new relative positional encoding, results in models that attend to more distant positions than the length of the segment used during training.
Although both approaches have proven effective, the underlying limitation of self-attention still remains: attending to an element that is $t$ timesteps away requires $O(t)$ memory and computation. In contrast, our model trades off a small error in the computation of the full attention for an improved, linear asymptotic complexity. This makes processing long sequences possible.
2.3 Improvements in Asymptotic Complexity
Child et al. (2019) factorize the self-attention mechanism into local and strided attention. The local attention is computed between the $k$ nearest positions and the strided attention is computed between positions that are $k$ steps away from each other. When $k$ is set to $\sqrt{N}$, the total asymptotic complexity becomes $O(N\sqrt{N})$, both in terms of memory and computation time. With the aforementioned factorization, two self-attention layers are required in order for any position to attend to any other position. As a result, Child et al. (2019) propose to double the number of layers in a given model. In addition, the factorization is fixed and data independent. This makes it intuitive for certain signals (e.g. images); however, in most cases it is arbitrary. In contrast, our method automatically groups input queries that are similar, without the need for a manually designed factorization. Moreover, in our model information always flows from every position to every other position.

Recently, Kitaev et al. (2020) introduced Reformer, a method that groups positions based on their similarity using locality-sensitive hashing (LSH) and only computes the attention within groups. For groups of fixed size, the asymptotic complexity of Reformer becomes linear with respect to the sequence length. Note that Reformer constrains the queries and keys of self-attention to be equal. As a result, it cannot be applied to neural machine translation, image captioning, or memory networks, or generally to any application with heterogeneous queries and keys. In addition, as it uses hash collisions to form groups, it can only handle a small number of bits, significantly reducing the quality of the grouping. Instead, our method uses clustering to group the queries, resulting in significantly better groups than hash collisions. We show that our method has the same asymptotic complexity, while it can also be used to speed up inference of pre-trained models without additional training.
3 Scaling Attention with Fast Clustering
In this section, we formalize the proposed method for approximate softmax attention. In § 3.1, we first discuss the attention mechanism in vanilla transformers and present its computational complexity. We then introduce clustered attention in § 3.2 and show that for queries close in Euclidean space, the attention difference can be bounded by the distance between the queries. This property allows us to reduce the computational complexity by clustering the queries. Subsequently, in § 3.3 we show that we can further improve the approximation by first extracting the top-$k$ keys with the highest attention per cluster and then computing the attention on these keys separately for each query that belongs to the cluster. A graphical illustration of our method is provided in the supplementary material.
3.1 Vanilla Attention
For any sequence of length $N$, the standard attention mechanism used in transformers is the dot-product attention introduced by Vaswani et al. (2017). Following standard notation, we define the attention matrix $A \in \mathbb{R}^{N \times N}$ as,
$A = \mathrm{softmax}\left( \frac{Q K^T}{\sqrt{D}} \right)$  (1)
where $Q \in \mathbb{R}^{N \times D}$ denotes the queries and $K \in \mathbb{R}^{N \times D}$ denotes the keys. Note that $\mathrm{softmax}(\cdot)$ is applied row-wise. Using the attention weights $A$ and the values $V \in \mathbb{R}^{N \times M}$, we compute the new values $\hat{V}$ as follows,
$\hat{V} = A V$  (2)
An intuitive understanding of the attention, as described above, is that given $N$ queries we create $N$ new values as weighted averages of the old ones, where the weights are defined by the attention matrix $A$. Computing equation 1 requires $O(N^2 D)$ operations and the weighted average of equation 2 requires $O(N^2 M)$. This results in an asymptotic complexity of $O(N^2 \max(D, M))$.
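As a concrete reference, equations 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the helper names (`softmax_rows`, `attention`) are ours:

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax, stabilized by subtracting the row maximum."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """A = softmax(Q K^T / sqrt(D)); returns A V.

    Q, K: (N, D), V: (N, M). Forming A costs O(N^2 D)."""
    D = Q.shape[-1]
    A = softmax_rows(Q @ K.T / np.sqrt(D))  # (N, N) attention matrix
    return A @ V                            # (N, M) new values

rng = np.random.default_rng(0)
N, D, M = 8, 4, 4
Q, K, V = rng.normal(size=(N, D)), rng.normal(size=(N, D)), rng.normal(size=(N, M))
out = attention(Q, K, V)
```

The quadratic cost is visible in the explicit `(N, N)` matrix: both memory and compute grow with $N^2$.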
3.2 Clustered Attention
Instead of computing the attention matrix for all queries, we group the queries into $C$ clusters and compute the attention only for these clusters. Then, we use the same attention weights for all queries that belong to the same cluster. As a result, the attention computation now costs $O(NC\max(D, M))$, where $C \ll N$.
More formally, let us define $S \in \{0,1\}^{N \times C}$, a partitioning of the queries into $C$ non-overlapping clusters, such that $S_{ij} = 1$ if the $i$-th query belongs to the $j$-th cluster and $S_{ij} = 0$ otherwise. Using this partitioning, we can now compute the clustered attention. First, we compute the cluster centroids as follows,
$Q^c_j = \frac{\sum_{i=1}^{N} S_{ij} Q_i}{\sum_{i=1}^{N} S_{ij}}$  (3)
where $Q^c_j$ is the centroid of the $j$-th cluster. Let us denote $Q^c \in \mathbb{R}^{C \times D}$ as the centroid matrix. Now, we can compute the clustered attention as if $Q^c$ were the queries. Namely, we compute the clustered attention matrix $A^c$
$A^c = \mathrm{softmax}\left( \frac{Q^c K^T}{\sqrt{D}} \right)$  (4)
and the new cluster values $V^c$
$V^c = A^c V$  (5)
Finally, the value of the $i$-th query becomes the value of its closest centroid, namely,
$\hat{V}_i = V^c_j, \quad \text{where } S_{ij} = 1$  (6)
From the above analysis, it is evident that we only need to compute the attention weights and the weighted average of the values once per cluster. Then, we can broadcast the same value to all queries belonging to the same cluster. This allows us to reduce the number of dot products from $N$ per query to $N$ per cluster, which results in an asymptotic complexity of $O(NC\max(D, M))$.
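The four steps above (equations 3 to 6) can be sketched as follows. This is an illustrative NumPy version, assuming the partitioning $S$ is already given as a label vector; the function name is ours:

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def clustered_attention(Q, K, V, labels, C):
    """labels[i] = cluster index of query i (a compact form of S).

    Attention is computed once per cluster: O(N C max(D, M)) for C clusters."""
    D = Q.shape[-1]
    # Eq. 3: each centroid is the mean of the queries in its cluster.
    Qc = np.stack([Q[labels == j].mean(axis=0) for j in range(C)])
    # Eqs. 4-5: attention and values computed once per cluster.
    Ac = softmax_rows(Qc @ K.T / np.sqrt(D))  # (C, N)
    Vc = Ac @ V                               # (C, M)
    # Eq. 6: broadcast each cluster's value to its member queries.
    return Vc[labels]

rng = np.random.default_rng(0)
N, D, M, C = 12, 4, 4, 3
Q, K, V = rng.normal(size=(N, D)), rng.normal(size=(N, D)), rng.normal(size=(N, M))
labels = np.arange(N) % C                     # toy assignment: 3 balanced clusters
V_hat = clustered_attention(Q, K, V, labels, C)
```

Note that all queries of a cluster receive the exact same output value, which is what the broadcast in equation 6 expresses.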
3.2.1 Quality of the approximation
From the above, we have shown that grouping queries into clusters can speed up the self-attention computation. However, the previous analysis does not consider the effect of clustering on the attention weights $A$. To address this, we derive a bound for the approximation error. In particular, we show that the difference in attention can be bounded as a function of the Euclidean distance between the queries.
Proposition 1.
Given two queries $Q_i$ and $Q_j$ such that $\|Q_i - Q_j\|_2 \le \epsilon$,
$\left\| \mathrm{softmax}\left(Q_i K^T\right) - \mathrm{softmax}\left(Q_j K^T\right) \right\|_2 \le \epsilon \, \|K\|_2$  (7)
where $\|K\|_2$ denotes the spectral norm of $K$.
Proof. The softmax function is Lipschitz with constant 1 with respect to the Euclidean norm (Gao & Pavel, 2017). Hence, $\|\mathrm{softmax}(Q_i K^T) - \mathrm{softmax}(Q_j K^T)\|_2 \le \|(Q_i - Q_j) K^T\|_2 \le \|K\|_2 \|Q_i - Q_j\|_2 \le \epsilon \|K\|_2$. ∎
Proposition 1 shows that queries that are close in Euclidean space have similar attention distributions. As a result, the error in the attention approximation for the $i$-th query assigned to the $j$-th cluster can be bounded by its distance from the cluster centroid $Q^c_j$.
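The bound of Proposition 1 can be checked numerically on random data. This is a sketch under the assumption that the proposition uses unscaled logits, as in equation 7:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
N, D = 32, 16
K = rng.normal(size=(N, D))
Qi = rng.normal(size=D)
Qj = Qi + 0.01 * rng.normal(size=D)      # a nearby query, e.g. a cluster centroid

eps = np.linalg.norm(Qi - Qj)            # ||Q_i - Q_j||_2
spectral = np.linalg.norm(K, ord=2)      # spectral norm ||K||_2
lhs = np.linalg.norm(softmax(Qi @ K.T) - softmax(Qj @ K.T))
assert lhs <= eps * spectral             # Proposition 1 holds on this sample
```

In practice the left-hand side is typically far below the bound, which is what makes clustering nearby queries a good approximation.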
3.2.2 Grouping the Queries
From the discussion above, we have shown that given a representative set of queries, we can approximate the attention with fewer computations. Thus, the problem now becomes finding this representative set of queries. K-Means clustering minimizes the sum of squared distances between the cluster members and their centroid, which would be optimal given our analysis from § 3.2.1. However, for a sequence of length $N$, one iteration of Lloyd's algorithm for the K-Means optimization problem has an asymptotic complexity of $O(NCD)$. To speed up the distance computations, we propose to use Locality-Sensitive Hashing (LSH) on the queries and then K-Means in Hamming space. In particular, we use the signs of random projections (Shrivastava & Li, 2014) to hash the queries, followed by K-Means clustering with Hamming distance as the metric. This results in an asymptotic complexity of $O(NDB + LNCB)$, where $L$ is the number of Lloyd iterations and $B$ is the number of bits used for hashing.
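A minimal sketch of this clustering pipeline follows; the helper names (`hash_queries`, `kmeans_hamming`) are ours and the implementation is illustrative, not the paper's code:

```python
import numpy as np

def hash_queries(Q, B, rng):
    """LSH by signs of B random projections (Shrivastava & Li, 2014)."""
    planes = rng.normal(size=(Q.shape[-1], B))
    return (Q @ planes > 0).astype(np.uint8)  # (N, B) bit codes

def kmeans_hamming(bits, C, iters, rng):
    """Lloyd's algorithm in Hamming space; centroids are majority-vote bits."""
    N, B = bits.shape
    centroids = bits[rng.choice(N, C, replace=False)]
    labels = np.zeros(N, dtype=int)
    for _ in range(iters):
        # Hamming distance from every code to every centroid: O(N C B).
        d = (bits[:, None, :] != centroids[None, :, :]).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(C):
            members = bits[labels == j]
            if len(members):
                centroids[j] = (members.mean(axis=0) > 0.5).astype(np.uint8)
    return labels

rng = np.random.default_rng(0)
Q = rng.normal(size=(64, 16))
bits = hash_queries(Q, B=16, rng=rng)
labels = kmeans_hamming(bits, C=4, iters=5, rng=rng)
```

Because the codes are short bit vectors, each distance computation is a cheap XOR-and-popcount rather than a $D$-dimensional dot product.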
3.3 Improving clustered attention
In the previous section, we showed that clustered attention provides a fast approximation of softmax attention. In this section, we discuss how this approximation can be further improved by separately considering the keys with the highest attention. To understand the importance of the above, it suffices to consider a scenario where a key with low attention for some query receives a high attention when approximated with the cluster centroid. This can happen when the number of clusters is too small or when K-Means fails to converge. For the clustered attention described in § 3.2, this introduces significant error in the computed value. The variation discussed below addresses such limitations.
After having computed the clustered attention of equation 4, we find the $k$ keys with the highest attention for each cluster. The main idea then is to improve the attention approximation on these top-$k$ keys for each query that belongs to the cluster. To do so, we first compute the dot-product attention, as defined in equation 1, on these top-$k$ keys for all queries belonging to this cluster. For any query, the computed attention on these top-$k$ keys will sum to one. This means that it cannot directly substitute the clustered attention on these keys. To address this, before substitution, we scale the computed attention by the total probability mass assigned by the clustered attention to these top-$k$ keys.

More formally, we start by introducing $T \in \{0,1\}^{C \times N}$, where $T_{ji} = 1$ if the $i$-th key is among the top-$k$ keys for the $j$-th cluster and 0 otherwise. We can then compute the probability mass, let it be $\hat{m}_j$, of the top-$k$ keys for the $j$-th cluster, as follows
$\hat{m}_j = \sum_{i=1}^{N} T_{ji} A^c_{ji}$  (9)
We now formulate an improved attention matrix approximation as follows
$\hat{A}_{il} = \begin{cases} \hat{m}_j \, \dfrac{\exp\left(Q_i K_l^T / \sqrt{D}\right)}{\sum_{r : T_{jr} = 1} \exp\left(Q_i K_r^T / \sqrt{D}\right)} & \text{if } T_{jl} = 1 \\ A^c_{jl} & \text{otherwise} \end{cases}$  (10)
Note that in the above, $\hat{A}_i$ denotes the attention of the $i$-th query, which belongs to the $j$-th cluster; the dependence on $j$ is omitted for clarity. In particular, equation 10 keeps the clustered attention of equation 4 for keys that are not among the top-$k$ keys of a given cluster. For the rest, it redistributes the mass $\hat{m}_j$ according to the dot-product attention of the queries with the top-$k$ keys. The corresponding new values, $\hat{V}$, are a simple matrix product of $\hat{A}$ with the values,
$\hat{V} = \hat{A} V$  (11)
Equation 11 can be decomposed into the clustered attention computation and two sparse dot products: one for every query with the top-$k$ keys, and one for the top-$k$ attention weights with the corresponding values. This adds $O(Nk\max(D, M))$ to the asymptotic complexity of the attention approximation of equation 4.
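The top-k correction of equations 9 to 11 can be sketched as follows. `improved_values` is an illustrative name of ours, and the loop over queries is written for clarity rather than speed:

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def improved_values(Q, K, V, Ac, labels, k):
    """Recompute exact attention on the k keys with the highest clustered
    attention per cluster, rescaled by their probability mass (eq. 9)."""
    D = Q.shape[-1]
    out = (Ac @ V)[labels].copy()          # start from clustered values (eqs. 5-6)
    topk = np.argsort(Ac, axis=1)[:, -k:]  # top-k keys per cluster (the matrix T)
    for i, j in enumerate(labels):
        keys = topk[j]
        mass = Ac[j, keys].sum()           # eq. 9: mass on the top-k keys
        logits = Q[i] @ K[keys].T / np.sqrt(D)
        e = np.exp(logits - logits.max())
        exact = mass * e / e.sum()         # eq. 10: rescaled exact attention
        # eq. 11: swap the clustered contribution on the top-k keys for the exact one.
        out[i] += (exact - Ac[j, keys]) @ V[keys]
    return out

rng = np.random.default_rng(0)
N, D, M, C, k = 12, 4, 4, 3, 5
Q, K, V = rng.normal(size=(N, D)), rng.normal(size=(N, D)), rng.normal(size=(N, M))
labels = np.arange(N) % C
Qc = np.stack([Q[labels == j].mean(axis=0) for j in range(C)])
Ac = softmax_rows(Qc @ K.T / np.sqrt(D))
V_improved = improved_values(Q, K, V, Ac, labels, k)
```

Only the top-$k$ columns are recomputed per query, which is what keeps the correction sparse.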
3.3.1 Quality of the approximation
In the following, we provide proof that improved clustered attention (eq. 10) is a direct improvement over clustered attention (eq. 4) in terms of the distance from the full attention matrix $A$.
Proposition 2.
For the $i$-th query belonging to the $j$-th cluster, the improved clustered attention $\hat{A}_i$ and the clustered attention $A^c_j$ relate to the full attention $A_i$ as follows,
$\left\| \hat{A}_i - A_i \right\|_2 \le \left\| A^c_j - A_i \right\|_2$  (12)
Due to lack of space, the proof of the above proposition is presented in the supplementary material. From equation 12 it becomes evident that improved clustered attention will always approximate the full attention better compared to clustered attention.
4 Experiments
In this section, we analyze experimentally the performance of our proposed method. Initially, we show that our model outperforms our baselines for a given computational budget on a real-world sequence-to-sequence task, namely automatic speech recognition on two datasets: the Wall Street Journal dataset (§ 4.1) and the Switchboard dataset (§ 4.2). Subsequently, in § 4.3, we demonstrate that our model can approximate a pre-trained RoBERTa model (Liu et al., 2019) on the GLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2018) benchmarks with minimal loss in performance, even when the number of clusters is less than one tenth of the sequence length. Due to lack of space, we also provide, in the supplementary material, a thorough benchmark that showcases the linear complexity of clustered attention and an ablation study regarding how the number of clusters scales with respect to the sequence length.
We compare our model with the vanilla transformer (Vaswani et al., 2017), which we refer to as full, and the Reformer (Kitaev et al., 2020), which we refer to as lsh-X, where X denotes the number of hashing rounds. We refer to clustered attention, introduced in § 3.2, as clustered-X and to improved clustered attention, introduced in § 3.3, as i-clustered-X, where X denotes the number of clusters. Unless mentioned otherwise, we use a fixed number of top keys $k$ with improved clustered.
All experiments are conducted using an NVidia GTX 1080 Ti with 11 GB of memory, and all models are implemented in PyTorch (Paszke et al., 2019). For Reformer, we use a PyTorch port of the published code. Note that we do not use reversible layers, since this is a technique that could be applied to all methods. Our PyTorch code can be found at https://clusteredtransformers.github.io.

4.1 Evaluation on Wall Street Journal (WSJ)
In our first experiment, we employ the Wall Street Journal dataset (Paul & Baker, 1992). The input to all transformers consists of filter-bank features with fixed positional embeddings. We train using the Connectionist Temporal Classification (CTC) loss (Graves et al., 2006) with phonemes as ground-truth labels.
Speed-Accuracy Tradeoff: We start by comparing the performance of our proposed model with various transformer variants under an equalized computational budget. To this end, we train full with varying numbers of layers to get a range of required computation times and achieved phone error rates (PER). Similarly, we train i-clustered at several depths, each with 100 and 200 clusters, clustered with several cluster counts, and Reformer with 1 and 4 hashing rounds. We refer the reader to our supplementary for the specifics of all transformer architectures as well as their training details. In figure 1a, we plot the achieved PER on the validation set with respect to the time required to perform a full forward pass. Our i-clustered achieves lower PER than all other baselines for a given computational budget.
Approximation Quality:
To assess the approximation capabilities of our method, we train different transformer variants on the aforementioned task and evaluate them using other self-attention implementations during inference. As the Reformer requires the queries to be identical to the keys, to evaluate its approximation ability we also train a full-attention model with shared queries and keys, which we refer to as
shared-full. Note that both clustered attention and improved clustered attention can be used to approximate shared-full, simply by setting the keys equal to the queries. Table 1 summarizes the results. We observe that improved clustered attention (rows 7 and 8) achieves the lowest phone error rate in every comparison. This implies that it is the best choice for approximating pretrained models. In addition, we note that as the number of clusters increases, the approximation improves as well.

Table 1: PER (%) on the WSJ validation set when training with one attention variant (columns) and evaluating with another (rows).

Evaluate with \ Train with | full | shared-full | lsh-1 | lsh-4 | clustered-100 | i-clustered-100
full | 5.14 | - | - | - | 7.10 | 5.56
shared-full | - | 6.57 | 25.16 | 41.61 | - | -
lsh-1 | - | 71.40 | 10.43 | 13.76 | - | -
lsh-4 | - | 64.29 | 9.35 | 9.33 | - | -
clustered-100 | 44.88 | 40.86 | 68.06 | 66.43 | 7.06 | 18.83
clustered-200 | 21.76 | 25.86 | 57.75 | 57.24 | 6.34 | 8.95
i-clustered-100 | 9.29 | 13.22 | 41.65 | 48.20 | 8.80 | 5.95
i-clustered-200 | 6.38 | 8.43 | 30.09 | 42.43 | 7.71 | 5.60
oracle-top | 17.16 | 77.18 | 43.35 | 59.38 | 24.32 | 6.96
Furthermore, to show that the top keys alone are not sufficient for approximating full, we also compare with an attention variant that, for each query, only keeps the keys with the highest attention. We refer to the latter as oracle-top. We observe that oracle-top achieves a significantly larger phone error rate than improved clustered in all cases. This implies that improved clustered attention also captures the significant long tail of the attention distribution.
Convergence Behaviour: In Table 2, we report the required time per epoch as well as the total training time for all transformer variants with 9 layers. For completeness, we also provide the corresponding phone error rates on the test set. We observe that clustered attention is more than two times faster than full (per epoch) and achieves significantly lower PER than both Reformer variants (lsh-1 and lsh-4). Improved clustered is the only method that is faster not only per epoch but also in the total wall-clock time required to converge.

Table 2: Per-epoch and total training time, and test-set PER, for all variants with 9 layers.

 | full | lsh-1 | lsh-4 | clustered-100 | i-clustered-100
PER (%) | 5.03 | 9.43 | 8.59 | 7.50 | 5.61
Time/Epoch (s) | 2514 | 1004 | 2320 | 803 | 1325
Convergence Time (h) | 87.99 | 189.64 | 210.09 | 102.15 | 72.14
4.2 Evaluation on Switchboard
We also evaluate our model on the Switchboard dataset (Godfrey et al., 1992), a collection of telephone conversations on common topics among strangers. All transformers are trained with the lattice-free MMI loss (Povey et al., 2016), and as inputs we use filter-bank features with fixed positional embeddings. Details regarding the transformer architectures as well as their training are provided in the supplementary.
Speed-Accuracy Tradeoff: Similar to § 4.1, we compare the performance of various transformer models given a specific computational budget. To this end, we train full, i-clustered, and clustered with varying numbers of layers and clusters. In figure 1b, we plot the achieved word error rate (WER) on the validation set of Switchboard with respect to the time required to perform a full forward pass. Our i-clustered is consistently better than full for a given computational budget. Furthermore, we note that it is consistently better than clustered attention for all computational budgets.
Convergence Behaviour: Table 3 summarizes the computational cost of training the transformer models with 12 layers on the Switchboard dataset, as well as the WER on the test set. We observe that, due to the longer sequences in this dataset, both clustered and i-clustered are faster to train per epoch and with respect to the total required wall-clock time.
Table 3: Training cost and test-set WER on Switchboard for all variants with 12 layers.

 | full | clustered-100 | i-clustered-100
WER (%) | 15.0 | 18.5 | 15.5
Time/Epoch (h) | 3.84 | 1.91 | 2.57
Convergence Time (h) | 228.05 | 132.13 | 127.44
4.3 RoBERTa Approximation
To highlight the ability of our model to approximate arbitrarily complicated attention distributions, we evaluate our proposed method on approximating a fine-tuned RoBERTa model (Liu et al., 2019) on the GLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2018) benchmarks. In particular, we evaluate on 10 different tasks, among which are question answering (SQuAD) and textual entailment (RTE), which exhibit arbitrary and sparse attention patterns. We refer the reader to Wang et al. (2019); Rajpurkar et al. (2018) for a detailed analysis of all tasks.
For the GLUE tasks, the maximum sequence length is 128, while for SQuAD it is 384. For each task, we use 25 clusters for the approximation, which is less than 20% and 7% of the input sequence length for the GLUE and SQuAD tasks respectively. In Table 4, we summarize the performance per task. We observe that improved clustered performs as well as the full transformer on all tasks but SQuAD, where it is only marginally worse. Moreover, we note that clustered performs significantly worse on tasks that require more complicated attention patterns, such as SQuAD and RTE.
Table 4: Per-task performance of a fine-tuned RoBERTa model evaluated with full, clustered, and improved clustered attention (25 clusters).

 | CoLA | MNLI | MRPC | QNLI | QQP | RTE | SST-2 | STS-B | WNLI | SQuAD
full | 0.601 | 0.880 | 0.868 | 0.929 | 0.915 | 0.682 | 0.947 | 0.900 | 0.437 | 0.904
clustered-25 | 0.598 | 0.794 | 0.436 | 0.746 | 0.894 | 0.498 | 0.944 | 0.789 | 0.437 | 0.006
i-clustered-25 | 0.601 | 0.880 | 0.873 | 0.930 | 0.915 | 0.704 | 0.947 | 0.900 | 0.437 | 0.876
5 Conclusions
We have presented clustered attention, a method that approximates vanilla transformers with significantly lower computational requirements. In particular, we have shown that our model can be substantially faster during training and inference with minimal loss in performance. In contrast to recent fast variations of transformers, we have also shown that our method can efficiently approximate pretrained models with full attention while retaining the linear asymptotic complexity.
The proposed method opens several research directions towards applying transformers on long sequence tasks such as music generation, scene flow estimation etc. We consider masked language modeling for long texts to be of particular importance, as it will allow finetuning for downstream tasks that need a context longer than the commonly used 512 tokens.
Broader Impact
This work contributes towards the wider adoption of transformers by reducing their computational requirements, thus enabling their use on embedded or otherwise resource-constrained devices. In addition, we have shown that for long sequences clustered attention can substantially reduce GPU training time, which translates to a corresponding reduction in CO2 emissions and energy consumption.
Acknowledgements
Apoorv Vyas was supported by the Swiss National Science Foundation under grant number FNS-30213 "SHISSM". Angelos Katharopoulos was supported by the Swiss National Science Foundation under grant numbers FNS-30209 "ISUL" and FNS-30224 "CORTI".
References

Arjovsky et al. (2016)
Arjovsky, M., Shah, A., and Bengio, Y.
Unitary evolution recurrent neural networks.
In International Conference on Machine Learning, pp. 1120–1128, 2016.  Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Britz et al. (2017) Britz, D., Guan, M. Y., and Luong, M.T. Efficient attention using a fixedsize memory representation. arXiv preprint arXiv:1707.00110, 2017.
 Chan et al. (2016) Chan, W., Jaitly, N., Le, Q., and Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. IEEE, 2016.
 Child et al. (2019) Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
 Chiu & Raffel (2017) Chiu, C.C. and Raffel, C. Monotonic chunkwise attention. arXiv preprint arXiv:1712.05382, 2017.
 Dai et al. (2019) Dai, Z., Yang, Z., Yang, Y., Cohen, W. W., Carbonell, J., Le, Q. V., and Salakhutdinov, R. Transformerxl: Attentive language models beyond a fixedlength context. arXiv preprint arXiv:1901.02860, 2019.
 Dong et al. (2018) Dong, L., Xu, S., and Xu, B. Speechtransformer: a norecurrence sequencetosequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5884–5888. IEEE, 2018.

Gao & Pavel (2017)
Gao, B. and Pavel, L.
On the properties of the softmax function with application in game theory and reinforcement learning, 2017.
 Godfrey et al. (1992) Godfrey, J. J., Holliman, E. C., and McDaniel, J. Switchboard: Telephone speech corpus for research and development. In [Proceedings] ICASSP92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pp. 517–520. IEEE, 1992.
 Graves et al. (2006) Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
 Hochreiter et al. (2001) Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. Gradient flow in recurrent nets: the difficulty of learning longterm dependencies, 2001.
 Kitaev et al. (2020) Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB.

Liu et al. (2020)
Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J.
On the variance of the adaptive learning rate and beyond.
In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgz2aEKDr.  Liu et al. (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
 Ma et al. (2020) Ma, X., Pino, J. M., Cross, J., Puzon, L., and Gu, J. Monotonic multihead attention. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hyg96gBKPS.

Maybury (1999)
Maybury, M.
Advances in automatic text summarization
. MIT press, 1999.  Oord et al. (2016) Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Paszke et al. (2019)
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen,
T., Lin, Z., Gimelshein, N., Antiga, L., et al.
Pytorch: An imperative style, highperformance deep learning library.
In Advances in Neural Information Processing Systems, pp. 8024–8035, 2019.  Paul & Baker (1992) Paul, D. B. and Baker, J. M. The design for the wall street journalbased csr corpus. In Proceedings of the Workshop on Speech and Natural Language, HLT ’91, 1992.
 Povey et al. (2011) Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., and Vesely, K. The kaldi speech recognition toolkit. 2011.
 Povey et al. (2015) Povey, D., Zhang, X., and Khudanpur, S. Parallel training of dnns with natural gradient and parameter averaging. In In International Conference on Learning Representations: Workshop track, 2015.
 Povey et al. (2016) Povey, D., Peddinti, V., Galvez, D., Ghahremani, P., Manohar, V., Na, X., Wang, Y., and Khudanpur, S. Purely sequencetrained neural networks for asr based on latticefree mmi. In Interspeech, pp. 2751–2755, 2016.
 Rajpurkar et al. (2018) Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784–789, 2018.
 Shrivastava & Li (2014) Shrivastava, A. and Li, P. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pp. 2321–2329, 2014.
 Sukhbaatar et al. (2019) Sukhbaatar, S., Grave, E., Bojanowski, P., and Joulin, A. Adaptive attention span in transformers. arXiv preprint arXiv:1905.07799, 2019.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
Wang et al. (2019) Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019, 2019.
 Xu et al. (2015) Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057, 2015.
Appendix A Scaling Attention with Fast Clustering
In this section we present graphical illustrations of the proposed clustered and i-clustered attention models in § A.1 and § A.2, respectively.
A.1 Clustered attention
In figure 2, we present the steps involved in the clustered attention computation for an example sequence of queries. We first cluster the queries using K-means clustering, which yields the membership of the queries to the different clusters; we use different colors to represent different clusters. After clustering, the centroids are used to compute the attention weights and the new values for the centroids. Finally, these values are broadcast to obtain the new values corresponding to each query.
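The steps above can be sketched in a few lines of code. The following is a minimal single-head NumPy illustration, assuming plain K-means in place of the LSH-accelerated clustering described in the paper; all function names are ours and batching is omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kmeans(Q, n_clusters, iters=10, seed=0):
    """Cluster the queries; returns per-query labels and the centroids."""
    rng = np.random.default_rng(seed)
    centroids = Q[rng.choice(len(Q), n_clusters, replace=False)]
    labels = np.zeros(len(Q), dtype=int)
    for _ in range(iters):
        # assign each query to its nearest centroid
        dists = ((Q[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            if (labels == c).any():
                centroids[c] = Q[labels == c].mean(axis=0)
    return labels, centroids

def clustered_attention(Q, K, V, n_clusters):
    """Attention computed once per cluster centroid, then broadcast."""
    N, D = Q.shape
    labels, centroids = kmeans(Q, n_clusters)
    A_c = softmax(centroids @ K.T / np.sqrt(D))   # (C, N): one attention row per centroid
    V_c = A_c @ V                                 # (C, D_v): new values per centroid
    return V_c[labels]                            # broadcast back to the member queries
```

Setting the number of clusters equal to the number of queries recovers exact softmax attention, which makes a convenient sanity check; with C much smaller than N, the attention cost drops from O(N²) to O(CN).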
A.2 Improved clustered attention
In this section, we first describe how we can efficiently compute the i-clustered attention using sparse dot products with the top-$k$ keys and values. We then present a flow chart demonstrating this computation.
As discussed in § 3.3 of the main paper, the improved attention matrix approximation for the $i$-th query, belonging to the $j$-th cluster, is computed as follows:

(13) $A^{t}_{il} = \begin{cases} \hat{m}_j \frac{\exp\left(Q_i K_l^\top\right)}{\sum_{r : T_{jr} = 1} \exp\left(Q_i K_r^\top\right)} & \text{if } T_{jl} = 1 \\ A^{c}_{jl} & \text{otherwise} \end{cases}$

where $T \in \{0, 1\}^{C \times N}$ stores the top-$k$ keys for each cluster: $T_{jl} = 1$ if the $l$-th key is among the top-$k$ keys for the $j$-th cluster and $T_{jl} = 0$ otherwise.
As described in the main paper, $\hat{m}_j$ is the total probability mass on the top-$k$ keys for the $j$-th cluster, given by:

(14) $\hat{m}_j = \sum_{l : T_{jl} = 1} A^{c}_{jl}$
Note that we can compute the attention weights on the top-$k$ keys by first taking the sparse dot product of the query with the top-$k$ keys, followed by the softmax activation and rescaling with the total probability mass $\hat{m}_j$. For the rest of the keys, the attention weight is the clustered attention weight $A^{c}_{jl}$.
Similarly, the new values can be decomposed into the following two terms:

(15) $\hat{V}_i = \hat{V}^{t}_{i} + \hat{V}^{b}_{j}$

where $\hat{V}^{t}_{i}$ is the weighted average of the values corresponding to the top-$k$ keys, with the weights being the improved attention on the top-$k$ keys, and $\hat{V}^{b}_{j}$ is the weighted average of the rest of the values, with the weights being the clustered attention. The following equations show how we compute $\hat{V}^{t}_{i}$ and $\hat{V}^{b}_{j}$:

(16) $\hat{V}^{t}_{i} = \sum_{l : T_{jl} = 1} A^{t}_{il} V_l$

(17) $\hat{V}^{b}_{j} = \sum_{l : T_{jl} = 0} A^{c}_{jl} V_l$

Note that $\hat{V}^{t}_{i}$ is a weighted average of $k$ values for each query and thus requires $\mathcal{O}\left(Nk\right)$ operations in total. $\hat{V}^{b}_{j}$ only needs to be computed once per cluster centroid and thus requires $\mathcal{O}\left(CN\right)$ operations.
In figure 3 we present the i-clustered attention computation for the same example sequence, with the number of clusters and top-$k$ keys set to 3. The lower half of the figure shows the new value $\hat{V}^{t}_{i}$, computed by first taking sparse dot products with the top 3 keys to get the attention weights and then taking the weighted average of the 3 corresponding values. The top half of the figure shows the $\hat{V}^{b}_{j}$ computation. This is the same as the clustered attention computation, but with the attention weights corresponding to the top-$k$ keys set to 0. The resulting value is the sum of $\hat{V}^{t}_{i}$ and $\hat{V}^{b}_{j}$.
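For concreteness, the decomposition above can be sketched as follows. This is our own single-head NumPy illustration, taking a precomputed clustering as input and ignoring the batched, grouped sparse products of an efficient implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def improved_clustered_attention(Q, K, V, labels, centroids, k):
    """i-clustered attention: exact attention on each cluster's top-k keys,
    clustered attention on the remaining keys."""
    N, D = Q.shape
    A_c = softmax(centroids @ K.T / np.sqrt(D))        # clustered attention (C, N)
    out = np.empty((N, V.shape[1]))
    for c in range(centroids.shape[0]):
        top = np.argsort(A_c[c])[-k:]                  # top-k keys for this cluster
        rest = np.setdiff1d(np.arange(N), top)
        m_hat = A_c[c, top].sum()                      # probability mass on the top-k keys
        V_b = A_c[c, rest] @ V[rest]                   # cluster-wide part, once per cluster
        for i in np.where(labels == c)[0]:
            # sparse dot products with the top-k keys, softmax, rescale by m_hat
            a_t = m_hat * softmax(Q[i] @ K[top].T / np.sqrt(D))
            out[i] = a_t @ V[top] + V_b                # sum of the two value terms
    return out
```

With $k$ equal to the sequence length, the rescaled top-$k$ term covers all keys and the method reduces to exact attention, which is a useful correctness check.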
Appendix B Quality of the approximation
Proposition 3.
For the $i$-th query belonging to the $j$-th cluster, the improved clustered attention $A^{t}_{i}$ and the clustered attention $A^{c}_{j}$ relate to the full attention $A_i$ as follows:

(18) $\left\lVert A_i - A^{t}_{i} \right\rVert_1 \leq \left\lVert A_i - A^{c}_{j} \right\rVert_1$
Proof.
As discussed before, the improved attention matrix approximation for the $i$-th query is computed as follows:

(19) $A^{t}_{il} = \begin{cases} \hat{m}_j \frac{\exp\left(Q_i K_l^\top\right)}{\sum_{r : T_{jr} = 1} \exp\left(Q_i K_r^\top\right)} & \text{if } T_{jl} = 1 \\ A^{c}_{jl} & \text{otherwise} \end{cases}$

where $T \in \{0, 1\}^{C \times N}$ stores the top-$k$ keys for each cluster: $T_{jl} = 1$ if the $l$-th key is among the top-$k$ keys for the $j$-th cluster and $T_{jl} = 0$ otherwise. $\hat{m}_j$ is the total probability mass on the top-$k$ keys for the $j$-th cluster, computed as follows:

(20) $\hat{m}_j = \sum_{l : T_{jl} = 1} A^{c}_{jl}$

Given the full attention $A_i$, equation 19 can be simplified to

(21) $A^{t}_{il} = \begin{cases} \frac{\hat{m}_j}{m_i} A_{il} & \text{if } T_{jl} = 1 \\ A^{c}_{jl} & \text{otherwise} \end{cases}$

where $m_i$ is the total probability mass on the same top-$k$ keys for the $i$-th query, computed using the true attention $A_i$, as follows:

(22) $m_i = \sum_{l=1}^{N} T_{jl} A_{il}$

(23) $\phantom{m_i} = \sum_{l : T_{jl} = 1} A_{il}$
Without loss of generality, let us assume that $T_{jl} = 1$ for $l \leq k$ and $T_{jl} = 0$ for $l > k$, i.e. the top-$k$ keys are the first $k$ keys.
In this case, equation 21 can be written as:

(24) $A^{t}_{il} = \begin{cases} \frac{\hat{m}_j}{m_i} A_{il} & \text{if } l \leq k \\ A^{c}_{jl} & \text{if } l > k \end{cases}$

The total probability masses on the top-$k$ keys, $\hat{m}_j$ and $m_i$, can now be expressed as:

(25) $\hat{m}_j = \sum_{l=1}^{k} A^{c}_{jl}$

(26) $m_i = \sum_{l=1}^{k} A_{il}$
From equation 24 it is clear that the clustered attention $A^{c}_{j}$ and the improved clustered attention $A^{t}_{i}$ only differ on the keys $l \leq k$. Thus, it suffices to show that $A^{t}_{i}$ has a lower approximation error on these keys. The approximation error on the top-$k$ keys, let it be $e_t$, between the i-clustered attention and the full attention is as follows:

(27) $e_t = \sum_{l=1}^{k} \left| A_{il} - A^{t}_{il} \right|$
(28) $\phantom{e_t} = \sum_{l=1}^{k} \left| A_{il} - \frac{\hat{m}_j}{m_i} A_{il} \right|$
(29) $\phantom{e_t} = \sum_{l=1}^{k} A_{il} \left| 1 - \frac{\hat{m}_j}{m_i} \right|$
(30) $\phantom{e_t} = \left| 1 - \frac{\hat{m}_j}{m_i} \right| \sum_{l=1}^{k} A_{il}$
(31) $\phantom{e_t} = \left| 1 - \frac{\hat{m}_j}{m_i} \right| m_i$
(32) $\phantom{e_t} = \left| m_i - \hat{m}_j \right|$
(33) $\phantom{e_t} = \left| \sum_{l=1}^{k} A_{il} - \sum_{l=1}^{k} A^{c}_{jl} \right|$
(34) $\phantom{e_t} \leq \sum_{l=1}^{k} \left| A_{il} - A^{c}_{jl} \right|$
Therefore,

(35) $\left\lVert A_i - A^{t}_{i} \right\rVert_1 = e_t + \sum_{l=k+1}^{N} \left| A_{il} - A^{c}_{jl} \right|$
(36) $\phantom{\left\lVert A_i - A^{t}_{i} \right\rVert_1} \leq \sum_{l=1}^{k} \left| A_{il} - A^{c}_{jl} \right| + \sum_{l=k+1}^{N} \left| A_{il} - A^{c}_{jl} \right|$
(37) $\phantom{\left\lVert A_i - A^{t}_{i} \right\rVert_1} = \sum_{l=1}^{N} \left| A_{il} - A^{c}_{jl} \right|$
(38) $\phantom{\left\lVert A_i - A^{t}_{i} \right\rVert_1} = \left\lVert A_i - A^{c}_{j} \right\rVert_1$
∎
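The inequality can also be verified numerically. The sketch below is our own sanity check, not part of the original proof: it builds the full, clustered, and improved clustered attention distributions for random queries and keys under an arbitrary fixed clustering, and checks the per-query L1 bound.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def check_bound(N=16, D=8, C=4, k=5, seed=0):
    """Verify ||A_i - A^t_i||_1 <= ||A_i - A^c_j||_1 for every query i."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((N, D))
    K = rng.standard_normal((N, D))
    labels = np.arange(N) % C                      # arbitrary fixed clustering
    centroids = np.stack([Q[labels == c].mean(axis=0) for c in range(C)])
    A = softmax(Q @ K.T / np.sqrt(D))              # full attention
    A_c = softmax(centroids @ K.T / np.sqrt(D))    # clustered attention
    for i in range(N):
        j = labels[i]
        top = np.argsort(A_c[j])[-k:]              # top-k keys of cluster j
        m_hat = A_c[j, top].sum()
        A_t = A_c[j].copy()                        # start from the clustered weights
        A_t[top] = m_hat * softmax(Q[i] @ K[top].T / np.sqrt(D))
        err_t = np.abs(A[i] - A_t).sum()
        err_c = np.abs(A[i] - A_c[j]).sum()
        if err_t > err_c + 1e-9:                   # small slack for float error
            return False
    return True
```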
Appendix C Experiments
C.1 Time and Memory Benchmark
To measure the computational cost, we compare the memory consumption and computation time on artificially generated sequences of various lengths. For clustered attention we use clusters, bits for the LSH, and Lloyd iterations for the K-means. For the improved clustered attention, we use the same configuration with . For Reformer, we evaluate two variants using and rounds of hashing. All models consist of layer with attention heads, embedding dimension of for each head, and a feed-forward dimension of .
In this experiment, we measure the memory and GPU time required per sequence element to perform a forward/backward pass with the various self-attention models. Figure 4 illustrates how these metrics evolve as the sequence length increases. For a fair comparison, we use the maximum possible batch size for each method and divide the computational cost and memory by the number of samples in each batch and the sequence length.
We note that, in contrast to all other methods, the vanilla transformer scales quadratically with respect to the sequence length and does not fit in GPU memory beyond a certain length. All other methods scale linearly. Clustered attention becomes faster than the vanilla transformer beyond a certain sequence length, and improved clustered attention surpasses it for somewhat longer sequences. Note that with respect to per-sample memory, both clustered and improved clustered attention perform better than all other methods. This can be explained by the fact that our method does not require storing intermediate results to compute the gradients from multiple hashing rounds, as Reformer does. It can be seen that lsh is faster than improved clustered attention; however, as also mentioned by Kitaev et al. (2020), Reformer requires multiple hashing rounds to generalize.
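A per-element timing of this kind can be gathered with a simple harness. The sketch below is our own minimal CPU/NumPy illustration for the quadratic baseline only; the benchmarks reported here use GPU implementations and also include the backward pass:

```python
import time
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    """Plain O(N^2) softmax attention."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[1])) @ V

def time_per_element(attention, N, D=64, repeats=3, seed=0):
    """Best-of-`repeats` wall-clock time per sequence element, in seconds."""
    rng = np.random.default_rng(seed)
    Q, K, V = (rng.standard_normal((N, D)) for _ in range(3))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        attention(Q, K, V)
        best = min(best, time.perf_counter() - start)
    return best / N
```

Sweeping `N` and plotting `time_per_element` for each attention variant reproduces the shape of the comparison: the per-element cost of the quadratic baseline grows with `N`, while that of a linear-complexity variant stays flat.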
C.2 Ablation on clusters and sequence length
Following Kitaev et al. (2020), we introduce a synthetic task to analyze the relationship between the number of clusters and the sequence length. In our task, the transformer models need to copy some symbols that are masked out from either the first or the second half of the sequence. In particular, we generate a random sequence of tokens $s$ and prepend a unique separator token, let it be 0. The sequence is then copied to get a target of the form $0\,s\,0\,s$, where the tokens in $s$ are drawn from a fixed-size vocabulary and the total sequence length is fixed. To generate the input, we replace some symbols from the first half of the sequence and some different symbols from the second half, such that the target sequence can be reconstructed from the input. An example of an input/output pair can be seen in figure 6. Note that to solve this task, transformers simply need to learn to attend to the corresponding tokens in the two identical halves of the sequence.
Input:   0  4  M  2  2  0  4  5  M  2
Output:  0  4  5  2  2  0  4  5  2  2
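Examples like the input/output pair above can be generated with a few lines of code. The following sketch is our own illustration of the construction described in the text; the separator and mask token ids are placeholders of ours:

```python
import numpy as np

SEP = 0    # unique separator token (placeholder id)
MASK = 1   # mask token (placeholder id)

def make_example(half_len, vocab, n_masked, rng):
    """Build one input/target pair for the masked-copy task.

    Requires 2 * n_masked <= half_len so that the two halves mask
    disjoint positions and the target stays recoverable."""
    symbols = rng.integers(2, vocab, size=half_len)            # random tokens, ids >= 2
    target = np.concatenate(([SEP], symbols, [SEP], symbols))  # 0 s 0 s
    inp = target.copy()
    # mask some positions in the first half and *different* positions in the
    # second half, so every masked symbol has an unmasked copy
    pos = rng.permutation(half_len)
    first, second = pos[:n_masked], pos[n_masked:2 * n_masked]
    inp[1 + first] = MASK                  # offset 1 skips the leading separator
    inp[2 + half_len + second] = MASK      # offset 2 skips both separators
    return inp, target
```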
We set the sequence length to one of several values, and the input length varies accordingly. For each sequence, we sample tokens uniformly from the vocabulary and randomly mask out a fixed fraction of the tokens. To analyze the impact of the number of clusters on performance, we train the full transformer as well as clustered variants with different numbers of clusters, and Reformer with different numbers of hashing rounds.
All transformer variants consist of layers, attention heads, embedding dimension of for each head, and feed-forward dimension of . For both clustered and improved clustered attention, we set the number of bits for the LSH to and the number of Lloyd iterations for the K-means to . Both clustered and improved clustered attention are trained with , , and clusters. We also train Reformer with , , and hashing rounds. Finally, all models are trained using the RAdam optimizer (Liu et al., 2020) with a learning rate of and a batch size of for iterations.
In figure 5, we illustrate the results of this experiment as heatmaps depicting the achieved accuracy for a given combination of number of clusters and sequence length for the clustered transformers, and of number of hashing rounds and sequence length for Reformer. Note that the vanilla transformer solves the task perfectly for all sequence lengths. We observe that both clustered (Fig. 4(b)) and Reformer (Fig. 4(c)) require more clusters or more hashing rounds as the sequence length increases. However, improved clustered achieves the same performance as the vanilla transformer, namely perfect accuracy, for every combination of number of clusters and sequence length. This result increases our confidence that the required number of clusters for our method is not a function of the sequence length but of the task at hand.
C.3 Automatic Speech Recognition
In this section, we present the details of the ASR experiments, such as the transformer architecture, the optimizer, and the learning rate schedule. As mentioned in the main paper, for i-clustered, unless otherwise specified, the number of top-$k$ keys is set to 32. Furthermore, all transformers have heads with an embedding dimension of on each head and a feed-forward dimension of . Other architectural details specific to each experiment are described below.
C.3.1 Wall Street Journal
Convergence Behaviour:
For this experiment, we train transformers with the full, clustered, and Reformer attention variants. All models consist of layers. For Reformer, we train two variants with and rounds of hashing, with the chunk size fixed to as suggested. For clustered and improved clustered attention we set the number of clusters to . We also set the number of Lloyd iterations for the K-means to and the number of bits for the LSH to . All models are trained to convergence using the RAdam optimizer (Liu et al., 2020) with a learning rate of , max gradient norm set to , and weight decay of . The learning rate is dropped when the validation loss plateaus. For each model we select the largest batch size that fits the GPU. The full attention model was trained with a batch size of , while the clustered variants, clustered and i-clustered, could fit batch sizes of and , respectively. For the Reformer variants, lsh and lsh, batch sizes of and were used.
In figure 6(a), we show the training loss convergence for the different transformer variants. It can be seen that i-clustered converges much faster than clustered attention, which shows that the improved clustered attention indeed approximates the full attention better. More importantly, only the i-clustered attention has comparable wall-clock convergence. Given that full has a much smaller batch size, it makes many more updates per epoch. We think that a slightly smaller batch size with more updates would have been a better choice for the clustered transformers with respect to wall-clock convergence. This is reflected in the Switchboard experiments, where the batch sizes for the clustered variants were smaller due to more layers. Finally, as can be seen from the wall-clock convergence, the clustered transformers significantly outperform the Reformer variants.
Speed-Accuracy Trade-off:
As described in the main paper, for this task we additionally train full with and layers. Similarly, we train clustered with 9 layers and and clusters. We also train an i-clustered model with 9 layers and clusters, and smaller models with 6 layers and and clusters.
For the clustered and i-clustered variants with 9 layers, we fine-tuned the previously described models trained with clusters. We fine-tuned for epochs with a learning rate of . We train full with and layers to convergence in the same fashion as the full model with layers described previously. Finally, for i-clustered, we first trained a model with layers and clusters using the training strategy used for layers and clusters. We then fine-tuned this model for epochs using clusters and a learning rate of .
C.3.2 Switchboard
Convergence Behaviour:
For this experiment, we train transformers with the full and clustered attention variants. All models consist of layers. For clustered and improved clustered attention we set the number of clusters to . We also set the number of Lloyd iterations for the K-means to and the number of bits for the LSH to .
Following common practice for flat-start lattice-free MMI training, we train over multiple GPUs with weight averaging for synchronization, as described in (Povey et al., 2015). Specifically, we modify the e2e training recipe for the Wall Street Journal in Kaldi (Povey et al., 2011) with the following two key differences: first, the acoustic model training is done in PyTorch, and second, we use the RAdam optimizer instead of natural gradient stochastic gradient descent.
All models are trained using the RAdam optimizer with a learning rate of , max gradient norm set to , and weight decay of . The learning rate is dropped when the validation loss plateaus. We use the word error rate (WER) on the validation set for early stopping and model selection. The full attention model is trained with a batch size of , while the clustered variants, clustered and i-clustered, are trained with a batch size of .
In figure 6(b), we show the training loss convergence for the different transformer variants. It can be seen that i-clustered has the fastest convergence in this setup. Note that the overall training time for clustered attention is still less than that of full, as it starts to overfit early in terms of the validation set WER.
Speed-Accuracy Trade-off:
For this task we additionally train full with and layers. Similarly, we train clustered with 12 layers and and clusters. We also train i-clustered with 12 layers and clusters, and smaller models with 8 layers and and clusters.
For the clustered and i-clustered variants with 12 layers, we fine-tuned the previously described models trained with clusters. We fine-tuned for epochs with a learning rate of . Once again, full with and layers was trained to convergence in the same way as the full model with layers described previously. Finally, for i-clustered with layers, we first trained a model with clusters using the training strategy used for layers and clusters. We then fine-tuned this model for epochs using clusters and a learning rate of .
C.4 RoBERTa Approximation
In this section we provide a qualitative comparison between the full attention and the clustered attention variants, clustered and i-clustered, used to approximate it. As described in the main paper, we use clusters for both attention variants. In Figure 8 we show the attention distribution for the question tokens for a randomly selected question-context tuple from the SQuAD dataset. For each token in the question we show the attention distribution over the input sequence, formed by concatenating the question and context tokens with the CLS and SEP tokens appended. It can be seen that, with only a few clusters, improved clustered approximates the full attention very closely, even when the attention distribution has complicated and sparse patterns. In contrast, clustered attention fails to capture such attention distributions. Moreover, for almost all question tokens, both full and improved clustered assign the highest attention weights to the same tokens. This further strengthens our belief that improved clustered attention can approximate a wide range of complicated attention patterns.