1 Introduction
The Transformer model proposed in (Vaswani et al., 2017)
is the architecture of choice for a number of tasks in natural language processing (NLP) as well as vision. The "self-attention" mechanism within Transformers, and specifically its extension referred to as Multi-Head Self-Attention, enables modeling dynamic token dependencies and plays a key role in the overall performance profile of the model. Despite these advantages, the $O(n^2)$ complexity of self-attention (where $n$ is the sequence length) is a bottleneck, especially for longer sequences. Consequently, over the last year or so, extensive effort has been devoted to deriving methods that mitigate this quadratic cost – a wide choice of algorithms with linear (in the sequence length) complexity are now available (Wang et al., 2020; Choromanski et al., 2021; Zeng et al., 2021; Beltagy et al., 2020; Zaheer et al., 2020). The aforementioned body of work on efficient transformers leverages the observation that the self-attention matrix has a parsimonious representation – the mechanics of how this is modeled and exploited at the algorithm level vary from one method to another. For instance, we may ask that self-attention have a pre-specified form of sparsity (for instance, diagonal or banded), see (Beltagy et al., 2020; Zaheer et al., 2020). Alternatively, we may model self-attention globally as a low-rank matrix, successfully utilized in (Wang et al., 2020; Xiong et al., 2021; Choromanski et al., 2021). Clearly, each modeling choice entails its own form of approximation error, which we can measure empirically on existing benchmarks. Progress has been brisk in improving these approximations: recent proposals have investigated a hybrid global+local strategy based on invoking a robust PCA style model (Chen et al., 2021) and hierarchical (or H) matrices (Zhu and Soricut, 2021), with which we will draw a contrast.
MRA. Recall that the hybrid global+local intuition above has a rich classical treatment formalized under Multiresolution Analysis (MRA) methods, of which Wavelets are a prominent example (Mallat, 1999). While the use of wavelets in signal processing goes back at least three decades (if not longer), their use in machine learning – especially for graph-based datasets, as a numerical preconditioner, and even for matrix decomposition problems – has seen a resurgence (Kondor et al., 2014; Ithapu et al., 2017; Gavish et al., 2010; Lee and Nadler, 2007; Hammond et al., 2011; Coifman and Maggioni, 2006). The extent to which classical MRA ideas (or even their heuristic forms) can guide efficiency in Transformers is largely unknown. Fig. 1 shows that, given a representative self-attention matrix, only a small fraction of the coefficients is sufficient for a high fidelity reconstruction. At a minimum, we see that the hypothesis of evaluating an MRA-based self-attention within a Transformer model may have merit.
Contributions. The goal of this paper is to investigate the specific modifications, adjustments and approximations needed to make the MRA idea operational in Transformers. It turns out that, modulo some small compromises (on the theoretical side), MRA-based self-attention shows excellent performance across the board – it is competitive with standard self-attention and outperforms most baselines while maintaining significantly better time and memory efficiency on both short and long sequences.
2 Preliminaries: Self-attention and Wavelets
We briefly review self-attention and the wavelet decomposition, two concepts that we will use throughout the paper.
2.1 Self-attention
Given embedding matrices $Q, K, V \in \mathbb{R}^{n \times d}$ representing $d$-dimensional feature vectors for queries, keys, and values, respectively, self-attention is defined as

  $\mathcal{A}(Q, K, V) = D^{-1} \exp(QK^\top)\, V$  (1)

where $D$ is a diagonal matrix that normalizes each row of the matrix $\exp(QK^\top)$ such that the row entries sum up to $1$. For notational simplicity, the scaling factor $1/\sqrt{d}$ and the linear projections applied to $Q, K, V$ are omitted. An explicit calculation of $\exp(QK^\top)$ and taking its product with $V$ incurs an $O(n^2)$ cost (if $d$ is treated as fixed/constant for complexity analysis purposes), a serious resource bottleneck and the core focus of the existing literature on efficient Transformers.
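As a concrete toy illustration of (1), the following sketch (ours; the function name and shapes are not from any released implementation) materializes the full $n \times n$ matrix and therefore exhibits the quadratic cost:

```python
import numpy as np

def self_attention(Q, K, V):
    """Eq. (1): D^{-1} exp(QK^T) V, where D holds the row sums of
    exp(QK^T). Dividing by D makes each row sum to 1, so this is
    exactly a row-wise softmax. The O(n^2) cost comes from
    materializing the n x n matrix exp(QK^T)."""
    A = np.exp(Q @ K.T)                            # n x n bottleneck
    D_inv = 1.0 / A.sum(axis=1, keepdims=True)     # D^{-1} as a column
    return (D_inv * A) @ V

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.normal(size=(3, n, d))
out = self_attention(Q, K, V)
```

Scaling by $1/\sqrt{d}$ and per-head projections are omitted here, mirroring the simplification in the text.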
Related Work on Efficient Transformers. A number of efficient self-attention methods have been proposed to reduce the $O(n^2)$ cost. Much of this literature can be, roughly speaking, divided into two categories: low rank and sparsity. Linformer (Wang et al., 2020) shows that self-attention matrices are low rank and proposes to learn a projection that maps the sequence length dimension to a lower dimension. Performer (Choromanski et al., 2021) and Random Feature Attention (Peng et al., 2021) view self-attention matrices as kernel matrices induced by infinite-dimensional feature maps and propose using finite random feature maps to approximate them. Nyströmformer (Xiong et al., 2021) and SOFT (Lu et al., 2021) use the Nyström method to approximate the self-attention matrices.
A number of methods also leverage the sparsity of self-attention matrices. By exploiting the high dependency within a local context, Longformer (Beltagy et al., 2020) proposes a sliding window attention with global attention on manually selected tokens. In addition to the sparsity used in Longformer, Big Bird (Zaheer et al., 2020) adds random sparse attention to further refine the approximation. Instead of a pre-specified sparsity support, Reformer (Kitaev et al., 2020) uses locality sensitive hashing (LSH) to compute self-attention only among approximately nearby tokens, and YOSO (Zeng et al., 2021) uses the collision probability of LSH as attention weights and an LSH-based sampler to achieve linear complexity.
The discussion in (Chen et al., 2021) suggests that approximations relying solely on low rank or sparsity are limited and that a hybrid model via robust PCA offers a better approximation. Scatterbrain (Chen et al., 2021) uses a sparse + low-rank attention strategy to avoid the cost of robust PCA. In §A.2, we discuss some limitations of low rank and sparsity for self-attention approximation, and show that a special form of our MRA approximation can offer a good solution to a relaxation of robust PCA.
Independent of our work, H-Transformer-1D (Zhu and Soricut, 2021) recently proposed a hierarchical self-attention where the self-attention matrix has a low-rank structure on the off-diagonal blocks and attention is computed precisely for the on-diagonal blocks. This is also a form of multiresolution approximation for self-attention, although the lower resolution for distant tokens may limit its ability to capture precise long range dependencies. While a pre-specified structure can indeed provide an effective approximation scheme in specific settings, it would be desirable to avoid restriction to a fixed structure, if possible.
2.2 Wavelet Transform
A wavelet transform decomposes a signal into different scales and locations, represented by a set of scaled and translated copies of a fixed function. This fixed function is called a mother wavelet $\psi$, and the scaled and translated copies are called child wavelets, specified by two factors, a scale $s$ and a translation $t$:

  $\psi_{s,t}(x) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{x - t}{s}\right)$  (2)

Here, $s$ controls the "dilation" or inverse of the frequency of the wavelet, while $t$ controls the location (e.g., time). These scaled/translated versions of the mother wavelet play a key role in MRA. Given a choice of $\psi$, the wavelet transform maps a function $f$ to coefficients

  $c_{s,t} = \int f(x)\, \psi_{s,t}(x)\, dx$  (3)

The coefficient $c_{s,t}$ captures the measurement of $f$ at scale $s$ and location $t$.
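For intuition, a single level of the discrete Haar transform (our own toy example, with a discrete signal standing in for the continuous transform in (3)) splits a signal into pairwise averages (coarse scale) and pairwise differences (detail):

```python
import numpy as np

def haar_step(f):
    """One level of the discrete Haar transform: scaled pairwise sums
    give the coarse approximation, scaled pairwise differences give the
    detail coefficients at the current scale. The 1/sqrt(2) scaling
    makes the transform orthonormal (energy preserving)."""
    f = np.asarray(f, dtype=float)
    approx = (f[0::2] + f[1::2]) / np.sqrt(2)
    detail = (f[0::2] - f[1::2]) / np.sqrt(2)
    return approx, detail

signal = [4.0, 4.0, 2.0, 2.0, 5.0, 1.0, 3.0, 3.0]
approx, detail = haar_step(signal)
```

Recursing on `approx` yields the full multi-scale decomposition; note that piecewise-constant stretches of the signal produce zero detail coefficients, which is the parsimony that MRA exploits.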
3 MRA view of Self-attention
To motivate the use of MRA, in §1, we used Fig. 1 to check how a 2D Haar wavelet basis decomposes a target self-attention matrix $A$ into terms involving different scales and translations, and how the terms with larger coefficients suffice for a good approximation of $A$. But the reader will notice that the calculation of the coefficients requires access to the full matrix $A$. Our discussion below will start from a formulation which still needs the full matrix $A$. Later, in §4, by exploiting the locality of $Q$ and $K$, we will be able to derive an approximation with reduced complexity (without access to $A$).
For simplicity, we assume that the sequence length is $n = 2^m$ for some integer $m$. Inspired by the Haar basis and its ability to adaptively approximate while preserving locality, we apply a pyramidal MRA scheme. We consider a decomposition of the (unnormalized) self-attention matrix $A = \exp(QK^\top)$ using a set of simpler unnormalized components $b^{(s)}_{x,y} \in \mathbb{R}^{n \times n}$ defined as

  $b^{(s)}_{x,y}[i,j] = 1$ if $(i,j) \in \Omega^{(s)}_{x,y}$ and $0$ otherwise, where $\Omega^{(s)}_{x,y} = \{(i,j) : 2^s(x-1) < i \le 2^s x,\; 2^s(y-1) < j \le 2^s y\}$  (4)

for $s \in \{0, 1, \ldots, m\}$ and $x, y \in \{1, \ldots, n/2^s\}$. Here, $\Omega^{(s)}_{x,y}$ is the support of $b^{(s)}_{x,y}$, and $s$ represents the scale of the components, i.e., a smaller $s$ denotes higher resolutions and vice versa. Also, $x$ and $y$ denote the translation of the components.
Why not Haar basis? The main reason for using the form in (4) instead of a 2D Haar basis directly is implementation driven, and will be discussed shortly in Remark 3.1. For the moment, we can observe that (4) is an overcomplete frame for $\mathbb{R}^{n \times n}$. As shown in Fig. 2, the frame (4) has one extra scale (with support on a single entry) compared to the Haar basis (4 rows versus 3 rows). Except for this extra scale, (4) has the same supports as the Haar basis at the different scales. In addition, (4) provides scaled and translated copies of the "mother" component, similar to Haar.
Let $\{b^{(s)}_{x,y}\}$ be the set of components for all possible scales and translations; then we decompose
  $A = \sum_{s,x,y} \gamma^{(s)}_{x,y}\, b^{(s)}_{x,y}$  (5)

for some set of coefficients $\gamma^{(s)}_{x,y}$. Since (4) is overcomplete, the coefficients are not unique. We specifically compute the coefficients as follows. Let $R^{(m)} = A$ and, for $s = m, m-1, \ldots, 0$,

  $\gamma^{(s)}_{x,y} = \frac{1}{4^s} \sum_{(i,j)\in\Omega^{(s)}_{x,y}} R^{(s)}[i,j], \qquad R^{(s-1)} = R^{(s)} - \sum_{x,y} \gamma^{(s)}_{x,y}\, b^{(s)}_{x,y}$  (6)

Here, the residual $R^{(s)}$ contains the higher frequencies not captured by the coarser scales. At each scale $s$, $\gamma^{(s)}_{x,y}$ is the optimal solution of the least squares problem minimizing $\|R^{(s)} - \sum_{x,y} \gamma^{(s)}_{x,y} b^{(s)}_{x,y}\|_F$. Intuitively, the approximation procedure starts from the coarsest approximation of $A$, which only consists of the lowest frequency; then the procedure refines the approximation by adding residuals of higher frequencies.
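The coarse-to-fine recursion in (6) can be sketched as follows; this is our own toy implementation (dictionary-based, not efficient), which also checks the key property that the coefficients accumulated along nested supports sum to block averages of $A$:

```python
import numpy as np

def mra_decompose(A):
    """Coefficients gamma[s][(x, y)] per Eq. (6): at each scale s
    (coarsest first), gamma is the block average of the current
    residual over the 2^s x 2^s support, which is then subtracted
    before moving to the next finer scale."""
    n = A.shape[0]
    m = int(np.log2(n))
    residual = A.astype(float).copy()
    gamma = {}
    for s in range(m, -1, -1):               # coarsest scale first
        blk = 2 ** s
        gamma[s] = {}
        for x in range(n // blk):
            for y in range(n // blk):
                sl = (slice(x * blk, (x + 1) * blk),
                      slice(y * blk, (y + 1) * blk))
                c = residual[sl].mean()      # least-squares coefficient
                gamma[s][(x, y)] = c
                residual[sl] -= c            # pass residual to finer scales
    return gamma

def mra_reconstruct(gamma, n):
    """Sum gamma * b over all components (Eq. (5))."""
    A_hat = np.zeros((n, n))
    for s, coeffs in gamma.items():
        blk = 2 ** s
        for (x, y), c in coeffs.items():
            A_hat[x * blk:(x + 1) * blk, y * blk:(y + 1) * blk] += c
    return A_hat

rng = np.random.default_rng(0)
A = np.exp(rng.normal(size=(8, 8)))
gamma = mra_decompose(A)
A_hat = mra_reconstruct(gamma, 8)
```

Using all components reconstructs $A$ exactly; the cumulative coefficient over the chain of nested blocks containing an entry equals the block average of $A$ at the finest of those scales, which is the telescoping behind the equivalent form derived next.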
Parsimony. We empirically observe that the coefficients of most components are near zero, so we can represent $A$ with only a few components while maintaining the accuracy of the approximation. Specifically, we can find a small subset $S$ of scale/translation indices, compute the corresponding coefficients following (6), and form the resulting approximation

  $\hat{A} = \sum_{(s,x,y)\in S} \gamma^{(s)}_{x,y}\, b^{(s)}_{x,y}$  (7)

with a negligible approximation error $\|A - \hat{A}\|_F$.
The form in (7) does not, by itself, suggest any interesting property of the approximation, but we can examine an equivalent form of $\hat{A}$. Denote the average of $A$ over the support of $b^{(s)}_{x,y}$ by

  $\mu^{(s)}_{x,y} = \frac{1}{4^s} \sum_{(i,j)\in\Omega^{(s)}_{x,y}} A[i,j]$  (8)

It turns out that the entries of $\hat{A}$ in (7) can be rewritten as

  $\hat{A}[i,j] = \mu^{(s^*)}_{x^*,y^*}$  (9)

where $(s^*, x^*, y^*)$ is the index in $S$ whose component has the smallest support region and is supported on $(i,j)$. If no component in $S$ is supported on $(i,j)$, then $\hat{A}[i,j] = 0$. The entry $(i,j)$ of $A$ is thus approximated by the average of $A$ over the smallest support region in $S$ containing $(i,j)$. In other words, the procedure uses the highest resolution possible for a given $S$ as an approximation. We discuss and show how we obtain (9) from (7) in §A.3. The reader will notice that rewriting $\hat{A}$ as (9) is possible due to our modifications of the Haar basis in (4).
Remark 3.1.
Consider using a Haar decomposition and let $S$ be the subset of basis elements with nonzero coefficients. The approximated entry $\hat{A}[i,j]$ then depends on all elements of $S$ which are supported on $(i,j)$; in the worst case, the coefficients of all basis elements supported on $(i,j)$ need to be nonzero to have $\hat{A}[i,j] = A[i,j]$. We find that a hardware-friendly and efficient approximation scheme is challenging to construct in this case. On the other hand, when using the decomposition (6) over the overcomplete frame (4), $\hat{A}[i,j]$ depends on only the one component in $S$ that has the smallest support region and is supported on $(i,j)$. This makes constructing the set $S$ easier and more flexible.
4 A Practical Approximation Scheme
Given that we now understand all relevant modules, we can focus on practical considerations. Notice that each $\mu^{(s)}_{x,y}$ requires averaging over $4^s$ entries of the matrix $A$, so in the worst case, we would need access to the entire matrix $A$ to compute all $\mu^{(s)}_{x,y}$ for $(s,x,y) \in S$. Nonetheless, suppose that we still computed all the coefficients and then post-hoc truncated the small coefficients to construct the set $S$: this approach would clearly be inefficient. In this section, we discuss two strategies where the main goal is efficiency.
4.1 Can we approximate $\mu^{(s)}_{x,y}$ quickly?
We first discuss calculating $\mu^{(s)}_{x,y}$. To avoid accessing the full matrix $A$, instead of computing the average of exponentials (8), we compute a lower bound (due to the convexity of the exponential), i.e., the exponential of the average (10), as an approximation.

  $\hat{\mu}^{(s)}_{x,y} = \exp\left(\frac{1}{4^s} \sum_{(i,j)\in\Omega^{(s)}_{x,y}} (QK^\top)[i,j]\right)$  (10)
We can verify that the expression in (10) can be computed efficiently as follows. Define $\tilde{Q}^{(s)}, \tilde{K}^{(s)} \in \mathbb{R}^{(n/2^s) \times d}$ where $\tilde{Q}^{(0)} = Q$, $\tilde{K}^{(0)} = K$, and

  $\tilde{Q}^{(s)}_x = \frac{1}{2}\left(\tilde{Q}^{(s-1)}_{2x-1} + \tilde{Q}^{(s-1)}_{2x}\right), \qquad \tilde{K}^{(s)}_y = \frac{1}{2}\left(\tilde{K}^{(s-1)}_{2y-1} + \tilde{K}^{(s-1)}_{2y}\right)$  (11)

Here, $\tilde{Q}^{(s)}_x$ and $\tilde{K}^{(s)}_y$ denote the $x$th and $y$th rows of the matrices $\tilde{Q}^{(s)}$ and $\tilde{K}^{(s)}$, respectively. Interestingly, (10) is simply

  $\hat{\mu}^{(s)}_{x,y} = \exp\left(\tilde{Q}^{(s)}_x \left(\tilde{K}^{(s)}_y\right)^\top\right)$  (12)
Then, the approximation using (10) is

  $\hat{A}[i,j] = \hat{\mu}^{(s^*)}_{x^*,y^*}$  (13)

where $(s^*, x^*, y^*)$ is the same index as in (9), and $\hat{A}[i,j] = 0$ otherwise. Each $\hat{\mu}^{(s)}_{x,y}$ only requires an inner product between one row of $\tilde{Q}^{(s)}$ and one row of $\tilde{K}^{(s)}$ followed by an exponential, so the cost of a single $\hat{\mu}^{(s)}_{x,y}$ is $O(d)$ when $\tilde{Q}^{(s)}$ and $\tilde{K}^{(s)}$ are provided. We will discuss the overall complexity in §4.4.
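A small sketch (ours) of the mechanics behind (10)–(12): the pooled rows of $Q$ and $K$ are computed by repeated pairwise averaging, and the resulting exp-of-average estimate lower-bounds the true block average of $\exp(QK^\top)$ by Jensen's inequality:

```python
import numpy as np

def pooled(M, s):
    """Rows of M averaged over 2^s consecutive positions, computed by
    s rounds of pairwise averaging as in Eq. (11)."""
    for _ in range(s):
        M = 0.5 * (M[0::2] + M[1::2])
    return M

def mu_hat(Q, K, s, x, y):
    """Eq. (12): exp of the block average of QK^T over block (x, y)
    at scale s, computed from pooled rows only."""
    Qs, Ks = pooled(Q, s), pooled(K, s)
    return np.exp(Qs[x] @ Ks[y])

rng = np.random.default_rng(1)
n, d, s = 8, 4, 2
Q, K = rng.normal(size=(2, n, d))
P = Q @ K.T
blk = 2 ** s
mu_true = np.exp(P[:blk, :blk]).mean()   # average of exponentials, Eq. (8)
mu_est = mu_hat(Q, K, s, 0, 0)           # exponential of average, Eq. (10)
```

The block average of $QK^\top$ equals the inner product of the averaged rows, which is exactly the identity behind (12); `mu_est` never exceeds `mu_true`.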
While efficient, this modification will incur a small amount of error. However, by using properties of $QK^\top$ inherited from $Q$ and $K$, we can quantify the error.
Lemma 4.1.
Fix a component $b^{(s)}_{x,y}$ and suppose the entries of $QK^\top$ within its support satisfy $|(QK^\top)[i,j] - (QK^\top)[i',j']| \le \delta$ for all $(i,j), (i',j') \in \Omega^{(s)}_{x,y}$ and some $\delta \ge 0$. Then

  $\hat{\mu}^{(s)}_{x,y} \le \mu^{(s)}_{x,y} \le e^{\delta}\, \hat{\mu}^{(s)}_{x,y}$

where the lower bound follows from Jensen's inequality, so that $|\mu^{(s)}_{x,y} - \hat{\mu}^{(s)}_{x,y}| \le (e^{\delta} - 1)\, \hat{\mu}^{(s)}_{x,y}$.
Lemma 4.1 suggests that the approximation error depends on the "spread" or numerical range of values (range, for short) of the entries of $QK^\top$ within a region $\Omega^{(s)}_{x,y}$, and on $\hat{\mu}^{(s)}_{x,y}$. If the range $\delta$ is small or $\hat{\mu}^{(s)}_{x,y}$ is small, then the approximation error is small. The range of a region is influenced by properties of $Q$ and $K$: it is bounded by the norm and spread of the rows $Q_i$ and $K_j$ with $(i,j) \in \Omega^{(s)}_{x,y}$. This relies on the locality assumption that spatially nearby tokens should also be semantically similar, which commonly holds in many applications. Of course, this assumption can be avoided if needed – it is easy to reduce the spread of $Q$ and $K$ in local regions simply by permuting the order of their rows. For example, we can use Locality Sensitive Hashing (LSH) to reorder $Q$ and $K$ such that similar vectors are in nearby positions, e.g., see (Kitaev et al., 2020). While the range is data/problem dependent, we can control it by using a smaller $s$, since the range of a smaller region will be smaller. In the extreme case, when $s = 0$, the range is $0$. So, this offers guidance: when $\hat{\mu}^{(s)}_{x,y}$ is large, we should approximate the region at a higher resolution so that the range is smaller.
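To illustrate the reordering option, here is our own sketch (not the paper's implementation) using a single round of random-hyperplane hashing, a simple LSH family for angular similarity:

```python
import numpy as np

def lsh_reorder(Q, n_planes=8, seed=0):
    """Return a permutation that sorts the rows of Q by a random-
    hyperplane hash code, so rows with similar directions tend to land
    in nearby positions. In practice the same permutation would be
    applied to K and V (and inverted on the output) to keep attention
    consistent."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(Q.shape[1], n_planes))
    bits = (Q @ planes > 0).astype(np.int64)    # sign pattern per row
    codes = bits @ (1 << np.arange(n_planes))   # pack bits into an int
    return np.argsort(codes, kind="stable")

rng = np.random.default_rng(4)
Q = rng.normal(size=(16, 4))
perm = lsh_reorder(Q)
Q_reordered = Q[perm]
```

Rows sharing a hash code become contiguous after sorting, which tends to reduce the spread of $Q$ (and $K$) within local regions.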
Remark 4.2.
Observe that the numerical range, which is defined as a bound on finite differences over sets of indices, is closely related to the concept of smoothness, which is defined using finite differences amongst adjacent indices. Indeed, it is possible to adapt Lemma 4.1 and its proof to the theory of wavelets, which are useful for characterizing signal smoothness. Please see §A.5 for more details.
Remark 4.3.
The underlying assumption of the diagonal attention structure in Longformer, Big Bird, and H-Transformer-1D is that tokens depend strongly on nearby tokens, and only on nearby tokens, so that attention within a local window matters more than attention to distant tokens. This might appear similar to the locality assumption discussed earlier, but it is not the same. Our locality assumption does not require that semantically similar tokens be spatially close, i.e., we allow strong and precisely computed dependence on distant tokens.
4.2 Can we construct $S$ quickly?
So far, we have assumed that the set $S$ is given, which is not true in practice. We now place a mild restriction on the set $S$ as a simplification heuristic: each entry of $A$ may be included in the support of exactly one $b^{(s)}_{x,y}$ with $(s,x,y) \in S$. Consider a $(s,x,y) \in S$: if each entry of the support region of $b^{(s)}_{x,y}$ is already included in the support of some $b^{(s')}_{x',y'}$ with $(s',x',y') \in S$ and $s' < s$, then $(s,x,y)$ can be safely removed from $S$ without affecting the approximation, by construction. This restriction allows us to avoid searching for the component with the smallest $s$ among multiple candidates. Then, the overall approximation can be written as

  $\hat{A} = \sum_{(s,x,y)\in S} \hat{\mu}^{(s)}_{x,y}\, b^{(s)}_{x,y}$  (14)
Remark 4.4.
Under this restriction, $\{b^{(s)}_{x,y} : (s,x,y) \in S\}$ is a subset of an orthogonal basis of $\mathbb{R}^{n \times n}$, since the supports of distinct components are disjoint.
Mechanics of constructing $S$. Now, we can discuss how the set $S$ is constructed. Let us first consider the approximation error $\|A - \hat{A}\|_F^2$; by factoring $\hat{\mu}^{(s)}_{x,y}$ out of the error on each block,

  $\|A - \hat{A}\|_F^2 = \sum_{(s,x,y)\in S}\; \sum_{(i,j)\in\Omega^{(s)}_{x,y}} \left(\hat{\mu}^{(s)}_{x,y}\right)^2 \left(\frac{A[i,j]}{\hat{\mu}^{(s)}_{x,y}} - 1\right)^2 \;+\; \sum_{(i,j)\,\text{uncovered}} A[i,j]^2$  (15)
Since the goal is to minimize the error, the optimal solution is to fix the computation budget $B = |S|$ and solve the optimization problem which minimizes $\|A - \hat{A}\|_F$ over all possible $S$ with $|S| \le B$. However, this might not be efficiently solvable.
Instead, we consider finding a good solution greedily. Consider the error (15). We can analyze the term $A[i,j] / \hat{\mu}^{(s)}_{x,y}$ to get insight into how the approximation error can be reduced. Note that

  $\frac{A[i,j]}{\hat{\mu}^{(s)}_{x,y}} = \exp\!\left((QK^\top)[i,j] - \tilde{Q}^{(s)}_x \left(\tilde{K}^{(s)}_y\right)^\top\right)$  (16)

is governed by the deviation of $(QK^\top)[i,j]$ from the mean of the support region, so the approximation error (15) is determined by $\hat{\mu}^{(s)}_{x,y}$ and the deviation of $QK^\top$ within the region, which coincides with the conclusion of Lemma 4.1. Computing this deviation would incur an $O(n^2)$ cost, so we avoid using it as a criterion for construction. We instead make the reasonable assumption that the deviations within the supports of components at the same scale $s$ are similar, and that the deviation within a region at a smaller $s$ is smaller. Then, a sensible heuristic is to use $\hat{\mu}^{(s)}_{x,y}$ as the criterion: if $\hat{\mu}^{(s)}_{x,y}$ is large, then we approximate the region using a higher resolution. The approximation procedure is described in Alg. 1, and the approximation result is shown in Fig. 3. Broadly, this approximation starts with a coarse approximation of a self-attention matrix; then, for regions with a large $\hat{\mu}^{(s)}_{x,y}$, it successively refines them to a higher resolution.
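Our own simplified sketch of this greedy refinement heuristic (the actual Alg. 1 batches the selection via top-k): start from the coarsest block and repeatedly split the block with the largest $\hat{\mu}$ into its four children until the budget is met:

```python
import numpy as np

def greedy_blocks(Q, K, budget):
    """Greedily refine blocks with large exp-of-average estimates.
    Returns a list of (s, x, y) blocks (0-indexed) with disjoint
    supports covering the whole n x n matrix, as in Sec. 4.2."""
    n = Q.shape[0]
    m = int(np.log2(n))
    # pooled rows for every scale, per Eq. (11)
    Qs, Ks = {0: Q}, {0: K}
    for s in range(1, m + 1):
        Qs[s] = 0.5 * (Qs[s - 1][0::2] + Qs[s - 1][1::2])
        Ks[s] = 0.5 * (Ks[s - 1][0::2] + Ks[s - 1][1::2])

    def mu_hat(s, x, y):                 # Eq. (12)
        return float(np.exp(Qs[s][x] @ Ks[s][y]))

    S = [(m, 0, 0)]                      # coarsest: one block covers A
    while len(S) < budget:
        cand = [blk for blk in S if blk[0] > 0]   # still refinable
        if not cand:
            break
        s, x, y = max(cand, key=lambda blk: mu_hat(*blk))
        S.remove((s, x, y))              # replace parent by 4 children
        S.extend((s - 1, 2 * x + dx, 2 * y + dy)
                 for dx in (0, 1) for dy in (0, 1))
    return S

rng = np.random.default_rng(2)
Q, K = rng.normal(size=(2, 8, 4))
S = greedy_blocks(Q, K, budget=10)
```

Each split removes one block and adds four, so the block count grows by three per refinement; the supports remain disjoint and always cover the full matrix.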
With the approximation procedure in place, we can quantify the error of this multiresolution approximation. We only state the approximation error for a single entry $A[i,j]$, but the analysis easily extends beyond a single entry.
Proposition 4.5.
Let $B = |S|$ be the budget used by Alg. 1 and let $\hat{\mu}_{(k)}$ denote the $k$th largest value among the $\hat{\mu}^{(s)}_{x,y}$ it evaluates. Assume, as in Lemma 4.1, that the entries of $QK^\top$ within each evaluated region have numerical range at most $\delta$. Then, for an entry $(i,j)$ covered by a component $(s,x,y) \in S$,

  $\left|A[i,j] - \hat{A}[i,j]\right| \le (e^{\delta} - 1)\, \hat{\mu}^{(s)}_{x,y}$

while for an entry $(i,j)$ not covered by any component in $S$ (so that $\hat{A}[i,j] = 0$), $A[i,j] \le e^{\delta}\, \hat{\mu}_{(B)}$.
Proposition 4.5 again emphasizes the relation between the numerical range of $QK^\top$ and the quality of the approximation. With some knowledge of the range $\delta$ and the magnitudes $\hat{\mu}_{(k)}$, we can control the error using an appropriate budget $B$.
Remark 4.6.
The procedure shares some commonalities with the correction component of Geometric Multigrid methods (Saad, 2003; Hackbusch, 1985). Coarsening is similar to our low resolution approximation, but the prolongation step is different: rather than interpolating the entire coarse grid to finer grids, our method replaces some regions of the coarse grid with their higher resolution approximations.
4.3 How do we compute $D^{-1}\hat{A}V$?
We obtained an approximation $\hat{A}$, but we should not instantiate this matrix (to avoid the $O(n^2)$ cost). So, we discuss a simple procedure for computing $D^{-1}\hat{A}V$ without constructing the matrix. Define $\tilde{V}^{(s)} \in \mathbb{R}^{(n/2^s) \times d}$ where $\tilde{V}^{(0)} = V$ and

  $\tilde{V}^{(s)}_y = \frac{1}{2}\left(\tilde{V}^{(s-1)}_{2y-1} + \tilde{V}^{(s-1)}_{2y}\right)$  (17)

similar to (11). Then, the steps follow Alg. 2. We again start by multiplying the coarse components of $\hat{A}$ with the corresponding rows of $\tilde{V}^{(s)}$, then successively add the products of the higher resolution components and $\tilde{V}^{(s)}$, and finally apply the row normalization $D^{-1}$.
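Our own compact sketch of the logic in Alg. 2 (without the pooled $\tilde{V}^{(s)}$ bookkeeping of (17), for clarity): since $\hat{A}$ is constant on each block in $S$, both $\hat{A}V$ and the row sums needed for $D^{-1}$ can be accumulated block by block without ever forming the $n \times n$ matrix:

```python
import numpy as np

def block_attention(Q, K, V, S):
    """Compute D^{-1} A_hat V where A_hat is constant on each block in
    S (a list of (s, x, y) blocks, 0-indexed, with disjoint supports).
    The block value is exp of the pooled-row inner product, Eq. (12)."""
    n, d = V.shape
    out = np.zeros((n, d))
    row_sum = np.zeros(n)                       # diagonal of D
    for s, x, y in S:
        blk = 2 ** s
        rows = slice(x * blk, (x + 1) * blk)
        cols = slice(y * blk, (y + 1) * blk)
        mu = np.exp(Q[rows].mean(axis=0) @ K[cols].mean(axis=0))
        out[rows] += mu * V[cols].sum(axis=0)   # mu * (sum of V rows)
        row_sum[rows] += mu * blk               # block contributes mu per column
    return out / row_sum[:, None]               # apply D^{-1}

rng = np.random.default_rng(3)
Q, K, V = rng.normal(size=(3, 4, 2))
S_fine = [(0, x, y) for x in range(4) for y in range(4)]
out = block_attention(Q, K, V, S_fine)
```

When $S$ consists of all scale-0 blocks, this reduces exactly to standard softmax attention, which is a convenient correctness check.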
4.4 What is the overall complexity?
We have now described the overall procedure of our approximation approach. In this section, we analyze its complexity. Following convention in efficient self-attention papers, we treat $d$ as a constant, so it does not influence the complexity.
We first need to compute $\tilde{Q}^{(s)}$ and $\tilde{K}^{(s)}$ for $s \in \{0, \ldots, m\}$. Since each row of $\tilde{Q}^{(s)}$ requires averaging two rows of $\tilde{Q}^{(s-1)}$, the total cost of computing all $\tilde{Q}^{(s)}$ and $\tilde{K}^{(s)}$ is simply $O(n)$.
Given all $\tilde{Q}^{(s)}$ and $\tilde{K}^{(s)}$, Alg. 1 evaluates only a limited number of candidate entries of $\hat{\mu}$: each refinement of a region at scale $s$ introduces four candidate regions at scale $s-1$, so with a budget of $B$ components, only $O(B)$ candidates are evaluated. Note that computing each $\hat{\mu}^{(s)}_{x,y}$ takes $O(d) = O(1)$, and selecting the top-k elements is linear in the input size. Therefore, the cost of constructing $S$ is $O(B)$. Once $S$ is constructed, obtaining the coefficients is simple, since each coefficient is a single $\hat{\mu}^{(s)}_{x,y}$ for a $(s,x,y) \in S$.
Finally, multiplying $\hat{A}$ and $V$ in Alg. 2 takes $O(B)$ as well. The cost of creating one $\tilde{V}^{(s)}$ from $\tilde{V}^{(s-1)}$ is $O(n/2^s)$, so the cost of creating all $\tilde{V}^{(s)}$ for $s \in \{0, \ldots, m\}$ is $O(n)$. Then, for each $(s,x,y) \in S$, adding the contribution $\hat{\mu}^{(s)}_{x,y}\, \tilde{V}^{(s)}_y$ to the (pooled) output takes $O(1)$ for constant $d$. The size of $S$ is $B$, so the total complexity of Alg. 2 is $O(n + B)$, as stated.
Therefore, the total complexity of our approach is $O(n + B)$. For example, when the budget $B$ grows linearly with $n$, the complexity becomes $O(n)$. The budget $B$ adjusts the trade-off between approximation accuracy and runtime, similar to other efficient methods, e.g., the window size in Longformer (Beltagy et al., 2020) and the projection size in Linformer (Wang et al., 2020) or Performer (Choromanski et al., 2021).
5 Experiments
We perform a broad set of experiments to evaluate the practical performance profile of our MRA-based self-attention module. First, we compare our approximation accuracy with several other baselines. Then, we evaluate our method on RoBERTa language model pretraining (Liu et al., 2019) and downstream tasks on both short and long sequences. Finally, as is commonly reported in most evaluations of efficient self-attention methods, we discuss our evaluations on the Long Range Arena (LRA) benchmark (Tay et al., 2021). All hyperparameters are reported in §A.4.
Overview. Since efficiency is a core focus of efficient self-attention methods, time and memory efficiency are taken into account when evaluating performance. Whenever possible, we include the runtime and memory consumption of a single instance for each method alongside the accuracy it achieves (in each table). Since the models are exactly the same (except for which self-attention module is used), we only profile the efficiency of one training step of these modules. See §A.4 for more details on profiling.
Baselines. For a rigorous comparison, we use an extensive list of baselines, including Linformer (Wang et al., 2020), Performer (Choromanski et al., 2021), Nyströmformer (Xiong et al., 2021), SOFT (Lu et al., 2021), YOSO (Zeng et al., 2021), Reformer (Kitaev et al., 2020), Longformer (Beltagy et al., 2020), Big Bird (Zaheer et al., 2020), H-Transformer-1D (Zhu and Soricut, 2021), and Scatterbrain (Chen et al., 2021). Since Nyströmformer, SOFT, and YOSO also have a variant which involves convolution, we perform evaluations for both cases. We denote our multiresolution approximation in the experiments as MRA2. Further, we found that in tasks with limited dataset sizes, sparsity provides a regularization towards better performance. So, we include a variant, MRA2s, which only computes the highest resolution components

  $\hat{A} = \sum_{(s,x,y)\in S,\; s=0} \hat{\mu}^{(s)}_{x,y}\, b^{(s)}_{x,y}$  (18)

after finding $S$. We use different method-specific hyperparameters for some methods to better understand their efficiency-performance trade-off. Takeaway: these detailed comparisons suggest that our MRA-based self-attention offers top performance and top efficiency among the baselines.
5.1 How good is the approximation accuracy?
We show that our method gives the best trade-off between approximation accuracy and efficiency, by a significant margin, compared to other baselines. The approximation accuracy of each method, compared to standard self-attention, provides a direct indication of the quality of the approximation. To evaluate accuracy, we use $Q$, $K$, and $V$ from a pretrained model at several sequence lengths and compute the relative error of the approximated output with respect to the standard self-attention output. As shown in Fig. 4, our MRA2(s) has the lowest approximation error while maintaining the fastest runtime and smallest memory consumption, by a large margin compared to other baselines, on both short and long sequences. See §A.4 for more details, sequence lengths, and baselines.
Next, we evaluate the effect of the spread (or entropy) of self-attention on the approximation quality for different methods. The results are shown in Fig. 5. We see one limitation of low-rank or sparsity-based schemes (discussed in §A.2 and Chen et al. (2021)). Our MRA2 performs well across attention instances with different entropy levels and significantly better than Scatterbrain (Chen et al., 2021).
5.2 RoBERTa Language Modeling
Table 1: Time (ms) and memory (MB) per instance, validation MLM accuracy before and after finetuning, and MNLI (m/mm) accuracy when substituting the self-attention of a pretrained 512-length RoBERTa-base. Rows without a method name are the preceding method with a different hyperparameter setting.

Method  Time (ms)  Mem (MB)  MLM Before  MLM After  MNLI-m  MNLI-mm
Transformer  0.86  71.0  73.1  74.0  87.4  87.3 
Performer  1.29  62.8  6.8  63.1  32.7  33.0 
Linformer  0.74  54.5  1.0  5.6  35.4  35.2 
SOFT  0.86  34.0  10.9  25.0  32.7  33.0 
SOFT + Conv  1.02  35.5  1.0  65.5  74.9  75.0 
Nystromformer  0.71  34.8  17.2  68.2  35.4  35.2 
Nystrom + Conv  0.88  37.2  1.4  70.9  85.1  84.6 
YOSO  0.97  29.8  13.0  68.4  35.4  35.2 
YOSO + Conv  1.20  32.9  3.0  69.0  83.2  83.1 
Reformer  1.23  59.4  0.7  69.5  84.9  85.0 
Longformer  1.30  43.3  66.0  71.2  85.6  85.4 
2.31  62.5  71.9  73.2  87.0  87.1  
Big Bird  2.03  63.9  71.6  73.3  87.1  87.0 
HTransformer1D  0.97  29.3  0.5  6.1  35.4  35.2 
Scatterbrain  2.23  78.7  60.6       
MRA2  0.73  28.1  68.9  73.1  86.8  87.1 
0.86  34.3  71.9  73.8  87.1  87.2  
MRA2s  0.66  23.8  67.2  72.8  87.0  87.0 
0.80  29.1  71.8  73.8  87.4  87.4 
Here, we use RoBERTa language modeling (Liu et al., 2019) to assess the performance and efficiency trade-off of our method and the baselines. We use a pretrained RoBERTa-base to evaluate the compatibility of each method with existing Transformer models and the overall feasibility of direct deployment. For fair comparisons, we also check the performance of models trained from scratch. Then, MNLI (Williams et al., 2018) is used to test the models' ability on downstream tasks. Further, we extend the 512-length models to 4096-length for a set of best performing methods and use the WikiHop (Welbl et al., 2018) task as an assessment of long sequence language models.
Standard Sequence Length. Since efficient self-attention approximates standard self-attention, we could simply substitute it for the standard self-attention of a trained model. This would allow us to minimize the training cost of new methods. To evaluate compatibility with existing models, we use a pretrained 512-length RoBERTa-base model (Liu et al., 2019), replace its self-attention module with efficient alternatives, and measure the validation Masked Language Modeling (MLM) accuracy. Then, we check accuracy after finetuning the model on English Wikipedia and BookCorpus (Zhu et al., 2015). Finally, we finetune the model on the downstream task MNLI (Williams et al., 2018).
Table 2: Small models trained from scratch (512 length): time (ms) and memory (MB) per instance, validation MLM accuracy, and MNLI (m/mm) accuracy (mean±std). Rows without a method name are the preceding method with a different hyperparameter setting.

Method  Time (ms)  Mem (MB)  MLM  MNLI-m  MNLI-mm
Transformer  0.41  35.47  57.0  72.7±0.6  73.8±0.2
Performer  0.63  31.38  48.6  69.8±0.4  70.5±0.1
Linformer  0.35  27.23  53.5  72.5±0.8  73.2±0.4
SOFT  0.43  17.02  42.8  63.8±2.2  64.7±2.6
SOFT + Conv  0.53  17.77  56.7  70.8±0.5  71.8±0.4
Nystromformer  0.34  17.40  53.1  71.4±0.6  72.0±0.3
Nystrom + Conv  0.45  18.60  57.3  73.0±0.4  73.9±0.6
YOSO  0.47  14.91  53.4  72.9±0.8  73.2±0.4
YOSO + Conv  0.58  16.42  57.2  72.5±0.4  72.9±0.5
Reformer  0.39  16.43  52.4  73.7±0.4  74.6±0.3
  0.61  29.65  55.6  75.0±0.2  75.6±0.3
Longformer  0.61  21.60  54.7  72.0±0.4  73.5±0.2
  1.10  31.44  57.4  75.8±0.5  76.7±0.6
Big Bird  1.02  31.91  57.6  75.0±0.5  75.6±0.6
HTransformer1D  0.47  14.65  43.7  62.9±2.7  63.4±3.9
Scatterbrain  1.04  78.66  20.5  42.6±8.1  43.4±9.5
MRA2  0.36  14.05  56.4  73.2±0.2  74.1±0.5
  0.43  17.15  57.3  73.0±1.0  73.9±0.8
MRA2s  0.31  11.93  56.7  73.6±1.6  74.3±1.1
  0.38  14.57  57.5  73.9±0.6  74.6±0.8
Only a handful of schemes, including Longformer, Big Bird, and MRA2(s), are fully compatible with pretrained models. Scatterbrain has a reasonable accuracy without further finetuning, but training diverges when finetuning the model. The other methods cannot reach a satisfactory level of accuracy. These statements also hold for the downstream finetuning results shown in Tab. 1. Our method has the best performance among the baselines for both MLM and MNLI, while having much better time and memory efficiency.
Table 3: 4096-length base models: time (ms) and memory (GB) per instance, MLM accuracy, and WikiHop accuracy. Rows without a method name are the preceding method with a different hyperparameter setting.

Method  Time (ms)  Mem (GB)  MLM  WikiHop
Transformer  30.88  3.93  74.3  74.6 
Longformer  10.20  0.35  71.1  60.8 
Big Bird  17.53  0.59     
MRA2  7.03  0.28  73.1  71.2 
9.25  0.38  73.7  73.4  
MRA2s  6.37  0.23  73.0  71.8 
8.62  0.38  73.8  74.1 
Since many baselines are not compatible with trained model weights (performance degrades when substituting the self-attention module), to make the comparison fair for all methods, we also evaluate models trained from scratch. Due to the large number of baselines, we train a small variant of RoBERTa on English Wikipedia and BookCorpus (Zhu et al., 2015) to keep the training cost reasonable. Then, we again finetune the model on a downstream task (MNLI). Results are summarized in Tab. 2. Only a few methods (including ours) achieve both good performance and efficiency.
Table 4: 4096-length small models: time (ms) and memory (GB) per instance, MLM accuracy, and WikiHop accuracy (mean±std). Rows without a method name are the preceding method with a different hyperparameter setting.

Method  Time (ms)  Mem (GB)  MLM  WikiHop
Transformer  15.36  1.96  55.8  54.6±1.6
Performer  5.13  0.24  23.2  43.7±0.6
Linformer  2.85  0.21  13.8  11.0±0.4
SOFT  2.46  0.11  25.9  14.0±8.6
  5.92  0.24  31.0  12.1±1.9
SOFT + Conv  3.33  0.11  52.8  30.8±29.3
Nystromformer  2.38  0.11  34.7  44.0±0.2
  4.34  0.27  46.8  46.0±0.8
Nystrom + Conv  3.23  0.12  53.1  54.6±0.8
YOSO  4.15  0.12  47.8  52.4±0.1
  5.07  0.17  49.9  52.8±0.5
YOSO + Conv  5.45  0.13  55.1  53.2±0.7
Reformer  5.04  0.24  52.2  53.7±0.9
Longformer  4.88  0.17  52.4  52.3±0.7
Big Bird  8.68  0.29  54.4  54.3±0.7
HTransformer1D  3.93  0.12  41.1  43.7±0.7
Scatterbrain  8.83  0.31  35.8  12.1±0.9
MRA2  3.43  0.14  54.2  52.6±0.9
  4.52  0.19  55.2  54.0±0.9
MRA2s  3.12  0.12  53.8  51.8±0.9
  4.13  0.19  55.1  53.6±0.8
Longer Sequences. To evaluate the performance of our MRA2(s) on longer sequences, we extend the 512-length models to 4096-length. We extend the positional embeddings and further train the models on English Wikipedia, BookCorpus (Zhu et al., 2015), one third of the Stories dataset (Trinh and Le, 2018), and one third of the RealNews dataset (Zellers et al., 2019), following (Beltagy et al., 2020). Then, the 4096-length models are finetuned on the WikiHop dataset (Welbl et al., 2018) to assess their performance on downstream tasks. The results are summarized in Tab. 3 for base models and Tab. 4 for small models. Our MRA2 is again one of the top performing methods with high efficiency among the baselines. Note that the difference in the WikiHop performance of Longformer (Beltagy et al., 2020) from the original paper is due to a much larger window size there, which has an even slower runtime. Linformer (Wang et al., 2020) does not seem to be able to adapt the weights from its 512-length model to a 4096-length model. It is interesting that the convolution in Nyströmformer (Xiong et al., 2021) seems to play an important role in boosting performance.
5.3 Long Range Arena
Table 5: Long Range Arena (LRA) accuracy (mean±std).

Method  Listops  Text  Retrieval  Image  Pathfinder  Avg
Transformer  37.1±0.4  65.2±0.6  79.6±1.7  38.5±0.7  72.8±1.1  58.7±0.3
Performer  36.7±0.2  65.2±0.9  79.5±1.4  38.6±0.7  71.4±0.7  58.3±0.1
Linformer  37.4±0.3  57.0±1.1  78.4±0.1  38.1±0.3  67.2±0.1  55.6±0.3
SOFT  36.3±1.4  65.2±0.0  83.3±1.0  35.3±1.3  67.7±1.1  57.5±0.5
SOFT + Conv  37.1±0.4  65.2±0.4  82.9±0.0  37.1±4.7  68.1±0.4  58.1±0.9
Nystromformer  24.7±17.5  65.7±0.1  80.2±0.3  38.8±2.9  73.1±0.1  56.5±2.8
Nystrom + Conv  30.6±8.9  65.7±0.2  78.9±1.2  43.2±3.4  69.1±1.0  57.5±1.5
YOSO  37.0±0.3  63.1±0.2  78.3±0.7  40.8±0.8  72.9±0.6  58.4±0.3
YOSO + Conv  37.2±0.5  64.9±1.2  78.5±0.9  44.6±0.7  69.5±3.5  59.0±1.1
Reformer  18.9±2.4  64.9±0.4  78.2±1.6  42.4±0.4  68.9±1.1  54.7±0.2
Longformer  37.2±0.3  64.1±0.1  79.7±1.1  42.6±0.1  70.7±0.8  58.9±0.1
Big Bird  37.4±0.3  64.3±1.1  79.9±0.1  40.9±1.1  72.6±0.7  59.0±0.3
HTransformer1D  30.4±8.8  66.0±0.2  80.1±0.4  42.1±0.8  70.7±0.1  57.8±1.8
Scatterbrain  37.5±0.1  64.4±0.3  79.6±0.1  38.0±0.9  54.8±7.8  54.9±1.4
MRA2  37.2±0.3  65.4±0.1  79.6±0.6  39.5±0.9  73.6±0.4  59.0±0.3
MRA2s  37.4±0.5  64.3±0.8  80.3±0.1  41.1±0.4  73.8±0.6  59.4±0.2
The Long Range Arena (LRA) (Tay et al., 2021) has been proposed as a lightweight benchmark to quickly compare the long-sequence modeling capability of Transformers. Due to consistency and code compatibility issues with the official LRA benchmark (see Issue34, Issue35, and Lee-Thorp et al. (2021)), we use the LRA code provided by (Xiong et al., 2021) and follow exactly the same hyperparameter settings. The results are shown in Tab. 5. Our method has the best average performance compared to the others.
Caveats. A reader may ask why Longformer, Big Bird, and MRA2s perform better than standard Transformers (Vaswani et al., 2017) despite being approximations. The performance difference is most obvious on the image task, and we also found that Longformer with smaller local attention windows tends to offer better performance on this task. One reason is that standard self-attention needs larger datasets to compensate for its lack of locality bias (Xu et al., 2021; d'Ascoli et al., 2021). Hence, due to the small datasets (i.e., its lightweight nature), the LRA accuracy metrics should be interpreted with caution.
Method  Time (ms)  Mem (MB)  Top-1  Top-5

Transformer  1.24  45.5  48.7  73.7
Reformer  1.14  19.1  39.6  65.5
Longformer  1.12  13.7  49.1  73.9
H-Transformer-1D  1.03  9.8  48.7  73.9
MRA-2  1.00  11.8  48.9  73.6
MRA-2-s  0.98  9.7  49.2  73.9
Summary of ImageNet results trained on 4-layer Transformers. We report both top-1 and top-5 accuracy.
ImageNet. To test performance on large datasets, we use ImageNet (Russakovsky et al., 2015) as a large-scale alternative to CIFAR-10 (Krizhevsky et al., 2009), which is used in the image task of LRA. Further, data augmentation is used to increase the effective dataset size. As with LRA, we focus on small models and use a 4-layer Transformer (see §A.4 for more details). Model-specific hyperparameters are the same as those used on LRA. The results are shown in Tab. 6. MRA-2-s is the top performing approach, and standard self-attention and MRA-2 clearly perform better on this larger dataset.

6 Conclusion
We show that multiresolution analysis (MRA) provides fresh ideas for efficiently approximating self-attention and subsumes several piecemeal approaches in the literature. We expect that exploiting the links to MRA will allow leveraging a vast body of technical results developed over many decades, but we also show that tangible practical benefits are available immediately. When some consideration is given to which design choices or heuristics of an MRA-based self-attention scheme interface well with mature software stacks and modern hardware, we obtain a procedure with strong advantages in both performance/accuracy and efficiency. Further, our implementation can be directly plugged into existing Transformers, a feature missing in some existing efficient transformer implementations. We show use cases on longer sequence tasks and in resource-limited settings, but believe that various other applications of Transformers will also benefit in the short term. Finally, we note the lack of integrated software support for MRA, as well as for our specialized model, in current deep learning libraries. Overcoming this limitation required implementing custom CUDA kernels for some generic block sparsity operators; extending our algorithm to other use cases may therefore involve re-implementing these kernels. We hope that with broader use of MRA-based methods, software support will improve, reducing this implementation barrier.
Acknowledgments
This work was supported by the UW AmFam Data Science Institute through funds from American Family Insurance. VS was supported by NIH RF1 AG059312. We thank Sathya Ravi for discussions regarding multigrid methods, Karu Sankaralingam for suggestions regarding hardware support for sparsity/block sparsity, and Pranav Pulijala for integrating our algorithm within the HuggingFace library.
References
Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
Robust principal component analysis? Journal of the ACM (JACM) 58 (3), pp. 1–37.
Scatterbrain: unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS).
Rethinking attention with Performers. In International Conference on Learning Representations (ICLR).
Diffusion wavelets. Applied and Computational Harmonic Analysis 21 (1), pp. 53–94.
ConViT: improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning (ICML).
Ten lectures on wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, Society for Industrial and Applied Mathematics.
BERT: pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association 90 (432), pp. 1200–1224.
De-noising by soft-thresholding. IEEE Transactions on Information Theory 41 (3), pp. 613–627.
Multiscale wavelets on trees, graphs and high dimensional data: theory and applications to semi-supervised learning. In International Conference on Machine Learning (ICML).
Multi-grid methods and applications. Vol. 4, Springer Science & Business Media.
Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30 (2), pp. 129–150.
Relations of the nuclear norm of a tensor and its matrix flattenings. Linear Algebra and its Applications 478, pp. 188–199.
The incremental multiresolution matrix factorization algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Reformer: the efficient transformer. In International Conference on Learning Representations (ICLR).
Multiresolution matrix factorization. In International Conference on Machine Learning (ICML).
Revealing the dark secrets of BERT. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Learning multiple layers of features from tiny images.
Treelets: a tool for dimensionality reduction and multi-scale analysis of unstructured data. In International Conference on Artificial Intelligence and Statistics (AISTATS).
FNet: mixing tokens with Fourier transforms. arXiv preprint arXiv:2105.03824.
RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
SOFT: softmax-free transformer with linear complexity. In Advances in Neural Information Processing Systems (NeurIPS).
A wavelet tour of signal processing. Elsevier.
Random feature attention. In International Conference on Learning Representations (ICLR).
Connections between nuclear-norm and Frobenius-norm-based representations. IEEE Transactions on Neural Networks and Learning Systems 29 (1), pp. 218–224.
ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV).
Iterative methods for sparse linear systems. Second edition, Society for Industrial and Applied Mathematics.
Jensen's inequality and new entropy bounds. Applied Mathematics Letters 22 (8), pp. 1262–1265.
Long Range Arena: a benchmark for efficient transformers. In International Conference on Learning Representations (ICLR).
A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).
Linformer: self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics 6, pp. 287–302.
A broad-coverage challenge corpus for sentence understanding through inference. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
Transformers: state-of-the-art natural language processing. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
Nyströmformer: a Nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence.
ViTAE: vision transformer advanced by exploring intrinsic inductive bias. In Advances in Neural Information Processing Systems (NeurIPS).
Big Bird: transformers for longer sequences. In Advances in Neural Information Processing Systems (NeurIPS).
Defending against neural fake news. In Advances in Neural Information Processing Systems (NeurIPS).
You only sample (almost) once: linear cost self-attention via Bernoulli sampling. In International Conference on Machine Learning (ICML).
Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In International Conference on Computer Vision (ICCV).
H-Transformer-1D: fast one-dimensional hierarchical attention for sequences. In Annual Meeting of the Association for Computational Linguistics (ACL).
Appendix A Appendix
The appendix includes more details regarding the formulation, analysis, and experiments.
A.1 Visualizing Approximation Procedure in Linear Scale
We also show a visualization of our approximation procedure on a linear scale in Fig. 6. This figure better illustrates how the approximation quality increases as the procedure proceeds.
A.2 Link to Sparsity and Low Rank
Low rank and sparsity are two popular directions for efficient self-attention. To explore the potential of these two types of approximation, we set aside efficiency considerations and use the best possible method of each type. Specifically, subject to a fixed parameter budget, we use the sparse approximation that minimizes the Frobenius-norm error by placing its support on the largest entries of the self-attention matrix, and the low-rank approximation that minimizes the same error via truncated SVD. As shown in Fig. 7, both types of methods are limited in their ability to approximate self-attention. The low-rank method requires superlinear cost to maintain approximation accuracy and fails when the entropy of the self-attention matrix is smaller. In many cases, a sparse approximation is sufficient, but when the self-attention matrices are less sparse and have larger entropy, the sparse approximation fails as well. This motivates the use of a sparse + low rank approximation. Such a decomposition can be obtained via robust PCA, which splits the matrix into a sparse component plus a low rank component by solving an optimization objective. A convex relaxation (Candès et al., 2011) is used to make the optimization tractable, but the cost of finding a good solution is still too high to be suitable for efficient self-attention. Scatterbrain (Chen et al., 2021) proposes to combine an existing sparse attention with an existing low rank attention to obtain a sparse + low rank approximation and avoid the expensive cost of robust PCA.
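The two baselines above can be sketched in a few lines of NumPy. This is a minimal illustration under a shared parameter budget, not the code used in our experiments; the helper names (`softmax_attention`, `best_sparse`, `best_low_rank`) are ours:

```python
import numpy as np

def softmax_attention(n, d, seed=0):
    # Toy softmax self-attention matrix A = softmax(Q K^T / sqrt(d)).
    rng = np.random.default_rng(seed)
    Q, K = rng.standard_normal((n, d)), rng.standard_normal((n, d))
    S = Q @ K.T / np.sqrt(d)
    S -= S.max(axis=-1, keepdims=True)   # numerical stability
    P = np.exp(S)
    return P / P.sum(axis=-1, keepdims=True)

def best_sparse(A, k):
    # Optimal k-sparse approximation in Frobenius norm: keep the k
    # largest-magnitude entries (ties may keep a few extra), zero the rest.
    flat = np.abs(A).ravel()
    thresh = np.partition(flat, flat.size - k)[flat.size - k]
    return np.where(np.abs(A) >= thresh, A, 0.0)

def best_low_rank(A, r):
    # Optimal rank-r approximation in Frobenius norm via truncated SVD
    # (Eckart-Young theorem).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

n, d = 256, 32
A = softmax_attention(n, d)
budget = n * d                            # same parameter budget for both
err_sparse = np.linalg.norm(A - best_sparse(A, budget), "fro")
err_lowrank = np.linalg.norm(A - best_low_rank(A, budget // (2 * n)), "fro")
print(err_sparse, err_lowrank)
```

A rank-r factorization stores about 2nr parameters, hence the budget is converted to `r = budget // (2 * n)` for the low-rank baseline.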
Interestingly, a special form of our method offers an alternative to Scatterbrain's approach for sparse + low rank approximation: under a suitable parameter choice, our MRA-2 can be viewed as a sparse + low rank approximation. Specifically, let
(19) 
for a given resolution; this serves as a reasonably good solution for a relaxed version of the sparse and low rank decomposition.
Let us consider an alternative relaxation of the robust PCA objective,
(20) 
Note that the relaxed objective is easier to compute, and the two objectives are related by standard norm inequalities (Hu, 2015). In fact, Peng et al. (2018) show that the solutions obtained by minimizing the two objectives both solve a low rank recovery problem and are identical in some situations. The optimal solution of objective (20) can be obtained easily: for a suitable choice of the regularization weight, the optimal sparse component has its support on the largest entries of the self-attention matrix. However, for practical use, the recovered sparsity cannot be exploited efficiently on a GPU due to its scattered support. Further, the complexity remains quadratic, since we still need to locate the largest entries of the full matrix. Suppose instead that we restrict the sparse component to be a block sparse matrix, i.e., supported on a subset of blocks; then a GPU can exploit this block sparsity structure to significantly accelerate computation. Moreover, the optimal block sparse component is supported on the regions with the largest score, defined as
(21) 
which is similar to (8). As a result, similar to approximating (8) with (10), we use the lower bound as a proxy for (21); the cost of locating the support blocks is then low. Consequently, the resulting solution is supported on the regions with the largest proxy scores. Note that this is exactly our approximation with an appropriate budget, and it serves as a reasonable solution for (20).
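To make the block selection concrete, here is a minimal NumPy sketch (illustrative only; not our CUDA kernels and not the exact scoring of (21)): each b×b block is scored by the sum of its entries, and only the highest-scoring blocks are kept.

```python
import numpy as np

def block_scores(A, b):
    # Sum of entries within each b-by-b block: a cheap proxy score for
    # deciding which blocks of the attention matrix to keep.
    # Assumes the side length of A is divisible by b.
    n = A.shape[0]
    return A.reshape(n // b, b, n // b, b).sum(axis=(1, 3))

def block_sparse_approx(A, b, budget):
    # Keep the `budget` blocks with the largest scores; zero everything else.
    # GPU kernels can then operate only on the kept b-by-b tiles.
    scores = block_scores(A, b)
    keep = np.argsort(scores.ravel())[::-1][:budget]
    mask = np.zeros_like(scores, dtype=bool)
    mask.ravel()[keep] = True
    # Expand the block-level mask to entry level and apply it.
    full_mask = np.kron(mask, np.ones((b, b), dtype=bool))
    return np.where(full_mask, A, 0.0)
```

Because the kept entries form contiguous tiles rather than scattered positions, this layout maps directly onto block sparse matrix kernels.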
We empirically evaluate the quality of this sparse solution. Kovaleva et al. (2019) showed that the BERT model (Devlin et al., 2019) has multiple self-attention patterns capturing different semantic information. We investigate the self-attention patterns of a pretrained RoBERTa model (Liu et al., 2019) and show the optimal sparsity supports for self-attention matrices generated from RoBERTa-base with sequence length 4096 in the top plots of Fig. 8. Our MRA-2, as shown in the bottom plots of Fig. 8, is able to find a reasonably good sparse solution for (20). Notice that while many self-attention matrices turn out to be diagonally banded, which Longformer and Big Bird can approximate well, this is not the only possible structure: a diagonally banded pattern is not sufficient to approximate the last two self-attention patterns well.
A.3 Analysis
In this section, we discuss in more detail some manipulations described in the main paper.
Observation A.1.
We can rewrite as where is the index of that has the smallest support region and is supported on .
Proof.
We describe the details next. First, notice that at each scale ,
(22) 
If for all , then and thus . Further, at each scale , the supports of are disjoint, and there is exactly one whose support includes . Thus, if