
Multi Resolution Analysis (MRA) for Approximate Self-Attention

by Zhanpeng Zeng, et al.

Transformers have emerged as a preferred model for many tasks in natural language processing and vision. Recent efforts on training and deploying Transformers more efficiently have identified many strategies to approximate the self-attention matrix, a key module in a Transformer architecture. Effective ideas include various prespecified sparsity patterns, low-rank basis expansions and combinations thereof. In this paper, we revisit classical Multiresolution Analysis (MRA) concepts such as Wavelets, whose potential value in this setting remains underexplored thus far. We show that simple approximations based on empirical feedback and design choices informed by modern hardware and implementation challenges, eventually yield an MRA-based approach for self-attention with an excellent performance profile across most criteria of interest. We undertake an extensive set of experiments and demonstrate that this multi-resolution scheme outperforms most efficient self-attention proposals and is favorable for both short and long sequences. Code is available at <>.





1 Introduction

The Transformer model proposed in (Vaswani et al., 2017) is the architecture of choice for a number of tasks in natural language processing (NLP) as well as vision. The "self-attention" mechanism within Transformers, and specifically its extension referred to as Multi-Head Self-Attention, enables modeling dynamic token dependencies and plays a key role in the overall performance profile of the model. Despite these advantages, the O(n^2) complexity of self-attention (where n is the sequence length) is a bottleneck, especially for longer sequences. Consequently, over the last year or so, extensive effort has been devoted to deriving methods that mitigate this quadratic cost – a wide choice of algorithms with linear (in the sequence length) complexity are now available (Wang et al., 2020; Choromanski et al., 2021; Zeng et al., 2021; Beltagy et al., 2020; Zaheer et al., 2020).

The aforementioned body of work on efficient transformers leverages the observation that the self-attention matrix has a parsimonious representation – the mechanics of how this is modeled and exploited at the algorithm level, varies from one method to the other. For instance, we may ask that self-attention has a pre-specified form of sparsity (for instance, diagonal or banded), see (Beltagy et al., 2020; Zaheer et al., 2020). Alternatively, we may model self-attention globally as a low-rank matrix, successfully utilized in (Wang et al., 2020; Xiong et al., 2021; Choromanski et al., 2021). Clearly, each modeling choice entails its own form of approximation error, which we can measure empirically on existing benchmarks. Progress has been brisk in improving these approximations: recent proposals have investigated a hybrid global+local strategy based on invoking a robust PCA style model (Chen et al., 2021) and hierarchical (or H-) matrices (Zhu and Soricut, 2021), which we will draw a contrast with.

MRA. Recall that the hybrid global+local intuition above has a rich classical treatment formalized under Multiresolution Analysis (MRA) methods, and Wavelets are a prominent example (Mallat, 1999). While the use of wavelets for signal processing goes back at least three decades (if not longer), their use in machine learning, especially for graph-based datasets, as a numerical preconditioner, and even for matrix decomposition problems, has seen a resurgence (Kondor et al., 2014; Ithapu et al., 2017; Gavish et al., 2010; Lee and Nadler, 2007; Hammond et al., 2011; Coifman and Maggioni, 2006). The extent to which classical MRA ideas (or even their heuristic forms) can guide efficiency in Transformers is largely unknown. Fig. 1 shows that, given a representative self-attention matrix, only a small fraction of the coefficients are sufficient for a high-fidelity reconstruction. At a minimum, we see that the hypothesis of evaluating an MRA-based self-attention within a Transformer model may have merit.

Contributions. The goal of this paper is to investigate the specific modifications, adjustments, and approximations needed to make the MRA idea operational in Transformers. It turns out that, modulo some small compromises (on the theoretical side), MRA-based self-attention shows excellent performance across the board – it is competitive with standard self-attention and outperforms most baselines while maintaining significantly higher time and memory efficiency on both short and long sequences.

2 Preliminaries: Self-attention and Wavelets

We briefly review self-attention and wavelet decomposition, two concepts that we will use throughout the paper.

2.1 Self-attention

Given embedding matrices Q, K, V in R^{n x d} representing d-dimensional feature vectors for queries, keys, and values, respectively, self-attention is defined as

A(Q, K, V) = D^{-1} exp(QK^T) V,

where D is a diagonal matrix that normalizes each row of the matrix exp(QK^T) such that the row entries sum up to 1. For notational simplicity, the scaling factor 1/sqrt(d) and the linear projections applied to Q, K, V are omitted. An explicit calculation of exp(QK^T) and taking its product with V incurs an O(n^2) cost (if d is treated as fixed/constant for complexity analysis purposes), a serious resource bottleneck and the core focus of the existing literature on efficient Transformers.
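As a point of reference, the unnormalized-softmax form above can be sketched in a few lines of NumPy. The shapes, the stabilizing max-shift, and all variable names are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def self_attention(Q, K, V):
    """Compute D^{-1} exp(QK^T) V with an O(n^2) cost in the sequence length n."""
    S = Q @ K.T                                  # n x n score matrix
    S = S - S.max(axis=1, keepdims=True)         # shift scores for numerical stability
    E = np.exp(S)
    D_inv = 1.0 / E.sum(axis=1, keepdims=True)   # inverse row sums (diagonal of D^{-1})
    return (D_inv * E) @ V                       # rows of D^{-1} E sum to 1

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.standard_normal((3, n, d))
A = self_attention(Q, K, V)
```

Note that the max-shift changes E and D but not the final output, since it cancels in the row normalization.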

Related Work on Efficient Transformers. A number of efficient self-attention methods have been proposed to reduce this cost. Much of this literature can be, roughly speaking, divided into two categories: low rank and sparsity. Linformer (Wang et al., 2020) shows that self-attention matrices are low rank and proposes to learn a projection – projecting the sequence length dimension to lower dimensions. Performer (Choromanski et al., 2021) and Random Feature Attention (Peng et al., 2021) view self-attention matrices as kernel matrices of infinite feature maps and propose using finite random feature maps to approximate the kernel matrices. Nyströmformer (Xiong et al., 2021) and SOFT (Lu et al., 2021) adapt the Nyström method for matrix approximation to self-attention matrices.

A number of methods also leverage the sparsity of self-attention matrices. By exploiting a high dependency within a local context, Longformer (Beltagy et al., 2020) proposes a sliding window attention with global attention on manually selected tokens. In addition to the sparsity used in Longformer, Big Bird (Zaheer et al., 2020) adds random sparse attention to further refine the approximation. Instead of a prespecified sparsity support, Reformer (Kitaev et al., 2020) uses hashing (LSH) to compute self-attention only within approximately nearby tokens, and YOSO (Zeng et al., 2021) uses the collision probability of LSH as attention weights and then, an LSH-based sampler to achieve linear complexity.

The discussion in (Chen et al., 2021) suggests that approximations relying solely on low rank or sparsity are limited and a hybrid model via robust PCA offers better approximation. Scatterbrain (Chen et al., 2021) uses a sparse attention + low rank attention strategy to avoid the cost of robust PCA. In §A.2, we discuss some limitations of low rank and sparsity for self-attention approximation, and show that a special form of our MRA approximation can offer a good solution for a relaxation of robust PCA.

Independent of our work, recently, H-Transformer-1D (Zhu and Soricut, 2021) proposed a hierarchical self-attention where the self-attention matrices have a low rank structure on the off-diagonal entries and attention is precisely calculated for the on-diagonal entries. This is also a form of multiresolution approximation for self-attention although the lower resolution for distant tokens may limit its ability to capture precise long range dependencies. While a prespecified structure can indeed provide an effective approximation scheme in specific settings, it would be desirable to avoid restriction to a fixed structure, if possible.

2.2 Wavelet Transform

A wavelet transform decomposes a signal into different scales and locations, represented by a set of scaled and translated copies of a fixed function. This fixed function is called a mother wavelet ψ, and the scaled and translated copies are called child wavelets, specified by two factors, a scale s > 0 and a translation t:

ψ_{s,t}(x) = (1/sqrt(s)) ψ((x − t)/s).

Here, s controls the "dilation", or the inverse of the frequency, of the wavelet, while t controls the location (e.g., time). These scaled/translated versions of the mother wavelet play a key role in MRA. Given a choice of ψ, the wavelet transform maps a function f to coefficients

W(s, t) = ∫ f(x) ψ_{s,t}(x) dx.

The coefficient W(s, t) captures the measurement of f at scale s and location t.
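To make the discrete analogue concrete, here is a toy Haar-style multiresolution decomposition of a length-2^L signal into pairwise averages (coarse part) and pairwise differences (detail coefficients). The plain average/difference normalization, rather than the orthonormal 1/sqrt(2), is an illustrative choice:

```python
import numpy as np

def haar_decompose(f):
    """Split a length-2^L signal into per-scale detail coefficients plus its mean."""
    coeffs = []
    a = np.asarray(f, dtype=float)
    while a.size > 1:
        pair = a.reshape(-1, 2)
        coeffs.append((pair[:, 0] - pair[:, 1]) / 2.0)  # detail (high frequency) at this scale
        a = pair.mean(axis=1)                           # coarser approximation
    coeffs.append(a)                                    # overall mean (coarsest scale)
    return coeffs

def haar_reconstruct(coeffs):
    """Invert haar_decompose exactly."""
    a = coeffs[-1]
    for detail in reversed(coeffs[:-1]):
        up = np.repeat(a, 2)
        up[0::2] += detail
        up[1::2] -= detail
        a = up
    return a

f = np.array([4.0, 2.0, 5.0, 7.0, 1.0, 1.0, 3.0, 9.0])
coeffs = haar_decompose(f)
f_hat = haar_reconstruct(coeffs)
```

Truncating the small detail coefficients before reconstructing is the parsimony idea exploited in the rest of the paper.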

3 MRA view of Self-attention

To motivate the use of MRA, in §1, we used Fig. 1 to check how a 2D Haar wavelet basis decomposes a target self-attention matrix into terms involving different scales and translations, and a small number of terms with larger coefficients suffice for a good approximation of the matrix. But the reader will notice that the calculation of the coefficients requires access to the full matrix. Our discussion below will start from a formulation which will still need the full matrix. Later, in §4, by exploiting the locality of Q and K, we will be able to derive an approximation with reduced complexity (without access to the full matrix).

For simplicity, we assume that the sequence length is n = 2^L for some integer L. Inspired by the Haar basis and its ability to adaptively approximate while preserving locality, we apply a pyramidal MRA scheme. We consider a decomposition of exp(QK^T) using a set of simpler unnormalized components B^{x,y}_s defined as

[B^{x,y}_s]_{i,j} = 1 if 2^s x ≤ i < 2^s (x+1) and 2^s y ≤ j < 2^s (y+1), and 0 otherwise,    (4)

for s in {0, 1, ..., L} and x, y in {0, ..., n/2^s − 1}. Here, the 2^s x 2^s block indexed by (x, y) is the support of B^{x,y}_s, and s represents the scale of the components, i.e., a smaller s denotes higher resolutions and vice versa. Also, x and y denote the translation of the components.

Why not Haar basis? The main reason for using the form in (4) instead of a 2D Haar basis directly is implementation driven, and will be discussed shortly in Remark 3.1. For the moment, we can observe that (4) is an overcomplete frame. As shown in Fig. 2, the frame (4) has one extra scale (with support on a single entry) compared to the Haar basis (4 rows versus 3 rows). Except for this extra scale, (4) has the same support as the Haar basis at different scales. In addition, (4) provides scaled and translated copies of the "mother" component, similar to Haar.

Figure 2: The left plot is the overcomplete frame defined in (4), which consists of block-indicator matrices at every scale and translation. The right plot is a 2D generalization of the Haar basis, which consists of three groups of self-similar matrices plus a constant matrix (all entries equal). The colors red, blue, and gray mean positive, negative, and zero entries, respectively. Notice that (4) does not include negative entries: it involves more components but makes the formulation simpler.
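Under our reading of the figure, the components of the frame can be sketched as 0/1 matrices supported on dyadic blocks. The function below is a hypothetical construction for illustration, not the paper's code:

```python
import numpy as np

def block_components(n):
    """Build 0/1 dyadic block-indicator matrices at all scales (finest first)."""
    comps = []
    s = 1
    while s <= n:
        for x in range(0, n, s):
            for y in range(0, n, s):
                B = np.zeros((n, n))
                B[x:x + s, y:y + s] = 1.0  # nonnegative support, no sign pattern
                comps.append(B)
        s *= 2
    return comps

comps = block_components(4)
```

For n = 4 this yields 16 + 4 + 1 = 21 components: the finest scale tiles the matrix entry by entry (the "extra" scale relative to Haar), and the coarsest component covers the whole matrix.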

Let S be the set of components B^{x,y}_s over all possible scales and translations; then we decompose

exp(QK^T) = Σ_{B in S} c_B B    (5)

for some set of coefficients {c_B}. Since (4) is overcomplete, the coefficients are not unique. We specifically compute the coefficients scale by scale, from the coarsest to the finest: let the residual at the coarsest scale be R_L = exp(QK^T), and for each scale s,

c_B = average of R_s over the support of B (for each scale-s component B),   R_{s−1} = R_s − Σ_{scale-s B} c_B B.    (6)

Here, R_s denotes the residuals of the higher frequencies. At each scale s, the block average c_B is the optimal solution of the least squares problem minimizing the residual over the scale-s coefficients. Intuitively, the approximation procedure starts from the coarsest approximation of exp(QK^T), which only consists of the lowest frequency; then the procedure refines the approximation by adding residuals of higher frequencies.
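The coarse-to-fine residual fitting described above can be sketched as follows, using block means of the current residual as coefficients (a toy dense implementation for illustration only):

```python
import numpy as np

def pyramid_approx(A, finest=1):
    """Coarse-to-fine block-mean approximation: each scale fits the residual left
    by the previous (coarser) scale; the least-squares coefficient of a 0/1 block
    component is just the block mean of the current residual."""
    n = A.shape[0]
    approx = np.zeros_like(A, dtype=float)
    s = n
    while s >= finest:
        R = A - approx                                     # residual of higher frequencies
        for x in range(0, n, s):
            for y in range(0, n, s):
                approx[x:x+s, y:y+s] += R[x:x+s, y:y+s].mean()  # optimal block coefficient
        s //= 2
    return approx

rng = np.random.default_rng(1)
A = rng.random((8, 8))
```

Refining all the way down to single-entry blocks reproduces the matrix exactly; stopping at the coarsest scale leaves only the global mean.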

Parsimony. We empirically observe that the coefficients of most components are near zero, so we can represent exp(QK^T) with only a few components while maintaining the accuracy of the approximation. Specifically, we can find a small subset S* of the components, compute the corresponding coefficients following (6), and form the resulting approximation

Ã = Σ_{B in S*} c_B B    (7)

with a negligible approximation error.

The form (7) does not, by itself, suggest any interesting property of the approximation, but we can check an equivalent form of it. Denote the average of exp(QK^T) over the support of a component B to be

μ_B = (1 / |supp(B)|) Σ_{(i,j) in supp(B)} [exp(QK^T)]_{i,j}.    (8)

It turns out that the entries of Ã in (7) can be rewritten as

[Ã]_{i,j} = μ_{B*}, where B* is the component in the selected subset S* with the smallest support region among those supported on (i, j),    (9)

and [Ã]_{i,j} = 0 if no selected component is supported on (i, j). The (i, j) entry of exp(QK^T) is thus precisely approximated by the average of exp(QK^T) over the smallest selected support region containing (i, j). In other words, the procedure uses the highest resolution possible for a given S* as an approximation. We discuss and show how we obtain (9) from (7) in §A.3. The reader will notice that rewriting Ã as (9) is possible due to our modifications to the Haar basis in (4).

Remark 3.1.

Consider using a Haar decomposition and let S_H be the subset of basis elements with nonzero coefficients. The approximation of an entry (i, j) depends on all elements of S_H which are supported on (i, j). In the worst case, the coefficients of all basis elements supported on (i, j) need to be nonzero to recover the entry. We find that a hardware-friendly and efficient approximation scheme in this case is challenging. On the other hand, when using the decomposition (6) over the overcomplete frame (4), an entry depends on only the one component that has the smallest support region and is supported on (i, j). This makes constructing the selected set easier and more flexible.

4 A Practical Approximation scheme

Given that we now understand all relevant modules, we can focus on practical considerations. Notice that each coefficient requires averaging over the entries of the matrix exp(QK^T) within a support region, so in the worst case, we would need access to the entire matrix to compute all the coefficients. Nonetheless, suppose that we still compute all the coefficients and then post-hoc truncate the small coefficients to construct the selected set. This approach would clearly be inefficient. In this section, we discuss two strategies where the main goal is efficiency.

4.1 Can we approximate the coefficients quickly?

We first discuss calculating the coefficients. To avoid accessing the full matrix exp(QK^T), instead of computing the average of the exponential as in (8), we compute a lower bound (due to convexity of the exponential), i.e., the exponential of the average (10), as an approximation:

c̃_B = exp( (1 / |supp(B)|) Σ_{(i,j) in supp(B)} [QK^T]_{i,j} ).    (10)

We can verify that the expression in (10) can be computed efficiently as follows. Define Q̃_s and K̃_s, where the x-th row of Q̃_s averages the 2^s rows of Q in the x-th segment, and likewise for K̃_s:

[Q̃_s]_x = (1/2^s) Σ_{i = 2^s x}^{2^s (x+1) − 1} Q_i,   [K̃_s]_y = (1/2^s) Σ_{j = 2^s y}^{2^s (y+1) − 1} K_j.    (11)

Here, Q_i and K_j denote the i-th row of the matrix Q and the j-th row of K, respectively. Interestingly, (10) is simply

c̃_B = exp([Q̃_s]_x [K̃_s]_y^T) for B = B^{x,y}_s.    (12)

Then, the approximation using (10) is

[Ã]_{i,j} = c̃_{B*},    (13)

where B* is the same as in (9), and [Ã]_{i,j} = 0 otherwise. Each c̃_B only requires an inner product between one row of Q̃_s and one row of K̃_s and applying an exponential, so the cost of a single c̃_B is O(d) when Q̃_s and K̃_s are provided. We will discuss the overall complexity in §4.4.
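The exponential-of-average idea can be checked numerically: averaging the block's Q and K rows first and exponentiating once gives a Jensen lower bound on the block average of exp(QK^T), because the block average of the inner products equals the inner product of the averages. All names and shapes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, b = 8, 4, 4                          # one b x b block starting at (0, 0)
Q, K = rng.standard_normal((2, n, d))

q_bar = Q[:b].mean(axis=0)                 # averaged query rows of the block
k_bar = K[:b].mean(axis=0)                 # averaged key rows of the block
approx = np.exp(q_bar @ k_bar)             # exponential of the block average: O(d)

exact = np.exp(Q[:b] @ K[:b].T).mean()     # block average of the exponential: needs the block
```

By convexity of the exponential, `approx` never exceeds `exact`, and the gap shrinks as the spread of the scores within the block shrinks.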

While efficient, this modification will incur a small amount of error. However, by using the property of QK^T inherited from Q and K, we can quantify the error.

Lemma 4.1.

Assume that |[QK^T]_{i,j} − m| ≤ δ for all (i, j) in the support of a component, where m denotes the average of the entries of QK^T over that support, and let μ denote the average of exp(QK^T) over the same support. Then the exponential of the average is a lower bound on the average of the exponential, and

0 ≤ μ − exp(m) ≤ exp(m) (e^δ − 1).

Lemma 4.1 suggests that the approximation error depends on the "spread" or numerical range of values (range, for short) of the entries of QK^T within a region, and on the magnitude of the coefficient. If the range δ is small or the coefficient is small, then the approximation error is small. The range of a region is influenced by properties of Q and K: it is bounded by the norms and the spread of the rows Q_i and K_j in the region. This relies on the locality assumption that spatially nearby tokens should also be semantically similar, which commonly holds in many applications. Of course, this can be avoided if needed – it is easy to reduce the spread of Q and K in local regions simply by permuting the order of their rows. For example, we can use Locality Sensitive Hashing (LSH) to reorder Q and K such that similar vectors are in nearby positions, e.g., see (Kitaev et al., 2020). While the range is data/problem dependent, we can control it by using a smaller scale s, since the range of a smaller region will be smaller. In the extreme case, when the support is a single entry, the range is 0. So, this offers guidance: when the range is large, we should approximate the region at a higher resolution such that the range is smaller.
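The LSH-style reordering mentioned above can be sketched with random hyperplanes (a SimHash-style sign code; this is an illustration of the idea, not the implementation used in the paper or in Reformer):

```python
import numpy as np

def lsh_order(X, n_bits=4, seed=0):
    """Return a permutation that groups rows with the same random-hyperplane sign code."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes > 0).astype(int)        # sign pattern of each row
    codes = bits @ (1 << np.arange(n_bits))    # pack the bits into an integer code
    return np.argsort(codes, kind="stable")    # similar rows end up in nearby positions

rng = np.random.default_rng(3)
X = rng.standard_normal((16, 8))
perm = lsh_order(X)
```

Applying `perm` to the rows of Q and K (and undoing it on the output) tends to reduce the spread of scores within local blocks, which is exactly what Lemma 4.1 rewards.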

Remark 4.2.

Observe that the numerical range, which is defined as a bound on finite differences over sets of indices, is closely related to the concept of smoothness, which is defined using finite differences amongst adjacent indices. Indeed, it is possible to adapt Lemma 4.1 and its proof to the theory of wavelets, which are useful for characterizing signal smoothness. Please see §A.5 for more details.

Remark 4.3.

The underlying assumption of the diagonal attention structure in Longformer, Big Bird, and H-Transformer-1D is that tokens depend strongly on nearby tokens and only nearby tokens, i.e., that attention to nearby tokens matters more than attention to distant tokens. This might appear similar to the locality assumption discussed earlier, but the two are different. Our locality assumption does not require that semantically similar tokens be spatially close, i.e., we allow strong and precise dependencies on distant tokens.

4.2 Can we construct the set quickly?

Figure 3: Illustration of our approximation scheme for a representative self-attention matrix. A log scale is used for better visualization. A linear-scale visualization is shown in §A.1.

So far, we have assumed that the selected set is given, which is not true in practice. We now place a mild restriction on the set as a simplification heuristic: we allow each entry of exp(QK^T) to be included in the support of exactly one selected component. Consider a component B; if each entry of the support region of B is included in the support of some selected component with a smaller scale, then B can be safely removed from the set without affecting the approximation, by construction. This restriction allows us to avoid searching for the component with the smallest scale among multiple candidates, and the overall approximation can be written as a sum of selected components with pairwise disjoint supports.

Remark 4.4.

Under this restriction, the selected set is a subset of an orthogonal basis.

Mechanics of constructing the set. Now, we can discuss how the set is constructed. Let us first consider the approximation error; by factoring out the coefficient of each selected component, the error decomposes into a sum of per-region terms.

Since the goal is to minimize the error, the optimal solution is to fix the computation budget and solve the optimization problem which minimizes the error over all admissible sets. However, this might not be efficiently solvable.

Instead, we consider finding a good solution greedily. We can analyze a specific per-region term to get insight into how the approximation error can be reduced. The term measures the deviation of exp(QK^T) from the mean of the support region, so the approximation error is determined by the magnitude of the coefficient and the deviation of exp(QK^T) within the region, which coincides with the conclusion of Lemma 4.1. Computing this deviation would itself incur an O(n^2) cost, so we avoid using it as a criterion for construction. We found that we can make a reasonable assumption that the deviations within supports at the same scale are similar, and that the deviation of a region at a smaller scale is smaller. Then, a sensible heuristic is to use the approximate coefficient from (10) as the criterion: if it is large, then we approximate the region at a higher resolution. The approximation procedure is described in Alg. 1, and the approximation result is shown in Fig. 3. Broadly, this approximation starts with a coarse approximation of a self-attention matrix; then, for regions with a large coefficient, it successively refines the regions to a higher resolution.

  Input: scales s = log2(n), ..., 1 in descending order
  Input: budget b_s for each scale s
  Input: initial set (empty or prespecified via priors)
  Compute the approximate coefficient (10) for all translations at the coarsest scale and add these components to the set
  for s = log2(n) down to 1 do
     Pop the b_s scale-s elements of the set with the largest approximate coefficients
     for each popped component do
        Compute the approximate coefficient (10) for each of its four scale-(s−1) sub-blocks
        Add these sub-blocks to the set
     end for
  end for
Algorithm 1 Constructing the set
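A runnable sketch of the coarse-to-fine refinement in Alg. 1, using block means of a dense matrix as stand-in coefficients (the real procedure scores blocks with the cheap approximate coefficients and never forms the dense matrix; the single per-level budget is also an illustrative simplification):

```python
import numpy as np

def greedy_blocks(A, budget):
    """Greedily pick a disjoint set of (size, row, col) dyadic blocks covering A."""
    n = A.shape[0]
    frontier = [(n, 0, 0)]          # start from the single coarsest block
    selected = []
    s = n
    while s > 1:
        # refine the blocks whose block-mean coefficient is largest
        frontier.sort(key=lambda b: -abs(A[b[1]:b[1]+b[0], b[2]:b[2]+b[0]].mean()))
        refine, keep = frontier[:budget], frontier[budget:]
        selected += keep            # unrefined blocks stay at their current scale
        h = s // 2
        frontier = [(h, x + dx, y + dy)
                    for _, x, y in refine for dx in (0, h) for dy in (0, h)]
        s = h
    return selected + frontier      # finest-scale survivors

rng = np.random.default_rng(4)
A = rng.random((8, 8))
blocks = greedy_blocks(A, budget=2)
```

Since each refinement replaces a block by its four children, the selected blocks always form a partition of the matrix, matching the disjoint-support restriction of §4.2.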

With the approximation procedure in place, we can quantify the error of this multiresolution approximation. We only state the approximation error for a representative setting of the budgets, but the analysis easily extends beyond it.

Proposition 4.5.

(Informal.) Suppose that, within the support of each candidate component, the entries of QK^T deviate from their block mean by at most δ, and let the per-scale budgets in Alg. 1 total b. Then the overall approximation error is bounded by a quantity that decreases as b grows and as δ shrinks.

Proposition 4.5 again emphasizes the relation between the numerical range of QK^T and the quality of the approximation. With some knowledge of the range and of the decay of the coefficients, we can control the error using an appropriate budget b.

Remark 4.6.

The procedure shares some commonalities with the correction component of Geometric Multigrid methods (Saad, 2003; Hackbusch, 1985). Coarsening is similar to our low-resolution approximation, but the prolongation step is different: rather than interpolating the entire coarse grid to finer grids, our method replaces some regions of the coarse grid with their higher-resolution approximations.

4.3 How do we compute the output?

We obtained an approximation of the self-attention matrix, but we should not instantiate this matrix (to avoid the O(n^2) cost). So, we discuss a simple procedure for computing its product with V without constructing the matrix. Define Ṽ_s, whose y-th row aggregates the 2^s rows of V in the y-th segment,

[Ṽ_s]_y = Σ_{j = 2^s y}^{2^s (y+1) − 1} V_j,

similar to (11). Then, the steps follow Alg. 2. We again start with multiplying the coarse components of the approximation with V, then successively add the products of the higher-resolution components with V, and finally apply the row normalization.

  Input: the selected components grouped by scale s, in descending order
  Initialize the (coarsest-resolution) output to be the zero matrix
  for each scale s, from the coarsest to the finest do
     Duplicate the rows of the current output to create the output at the next finer resolution
     for each selected component at scale s do
        Add its coefficient times the corresponding aggregated rows of V to the output rows it covers
     end for
  end for
Algorithm 2 Computing the product with V
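The multiplication in Alg. 2 can be sketched as follows: each selected block contributes its coefficient times the sum of the V rows it covers, added into the output rows it covers, so the n x n matrix is never formed. The coefficients here come from a dense matrix purely for illustration:

```python
import numpy as np

def block_matvec(A, blocks, V):
    """Compute (block approximation of A) @ V without materializing the n x n matrix."""
    out = np.zeros((A.shape[0], V.shape[1]))
    for s, x, y in blocks:                       # disjoint (size, row, col) blocks
        c = A[x:x+s, y:y+s].mean()               # the block's coefficient
        out[x:x+s] += c * V[y:y+s].sum(axis=0)   # constant block times summed V rows
    return out

rng = np.random.default_rng(5)
A = rng.random((4, 4))
V = rng.random((4, 3))
```

With only single-entry blocks, the routine reduces to the exact product A @ V; with a single coarsest block, every output row is the global mean of A times the column sums of V.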

4.4 What is the overall complexity?

We have now described the overall procedure of our approximation approach. In this section, we analyze the complexity of our procedure. Following the convention in efficient self-attention papers, we treat the feature dimension d as a constant, so it does not influence the complexity.

We first need to compute Q̃_s and K̃_s for all scales s. Since each row of Q̃_s can be computed by averaging two rows of Q̃_{s−1}, the total cost of computing these matrices for all s is simply O(n).

Given these averaged matrices, in Alg. 1, the number of candidate coefficients examined is controlled by the budgets: at each scale, only the sub-blocks of the refined regions need to be scored. Computing each candidate coefficient takes O(1) (with d constant), and selecting the top-k elements is linear in the input size. Therefore, the cost of constructing the set is linear in the total budget b. Once the set is constructed, obtaining the selected coefficients is simple, since each is a single exponential of an inner product.

Finally, multiplying the approximation with V in Alg. 2 is also linear in the budget. The cost of creating each Ṽ_s is proportional to its number of rows, so the cost of creating all of them is O(n). Then, for each selected component, adding its contribution to the output takes O(1) (with d constant). The size of the selected set is at most the total budget b, so the total complexity of Alg. 2 is O(n + b).

Therefore, the total complexity of our approach is O(n + b). For example, when the budget b grows linearly with n, the complexity becomes O(n). The budget adjusts the trade-off between approximation accuracy and runtime, similar to other efficient methods, e.g., the window size in Longformer (Beltagy et al., 2020) and the projection size in Linformer (Wang et al., 2020) or Performer (Choromanski et al., 2021).

5 Experiments

We perform a broad set of experiments to evaluate the practical performance profile of our MRA-based self-attention module. First, we compare our approximation accuracy with several other baselines. Then, we evaluate our method on RoBERTa language model pretraining (Liu et al., 2019) and downstream tasks on both short and long sequences. Finally, as is commonly reported in most evaluations of efficient self-attention methods, we discuss our evaluations on the Long Range Arena (LRA) benchmark (Tay et al., 2021). All hyperparameters are reported in §A.

Overview. Since efficiency is a core focus of efficient self-attention methods, time and memory efficiency are taken into account when evaluating performance. Whenever possible, we include the runtime and memory consumption of a single instance for each method alongside the accuracy it achieves (in each table). Since the models are exactly the same (except for which self-attention module is used), we only profile the efficiency of one training step of these modules. See §A.4 for more details on profiling.

Baselines. For a rigorous comparison, we use an extensive list of baselines, including Linformer (Wang et al., 2020), Performer (Choromanski et al., 2021), Nyströmformer (Xiong et al., 2021), SOFT (Lu et al., 2021), YOSO (Zeng et al., 2021), Reformer (Kitaev et al., 2020), Longformer (Beltagy et al., 2020), Big Bird (Zaheer et al., 2020), H-Transformer-1D (Zhu and Soricut, 2021), and Scatterbrain (Chen et al., 2021). Since Nyströmformer, SOFT, and YOSO also have a variant which involves convolution, we perform evaluations for both cases. Our multiresolution approximation is denoted in experiments as MRA-2. Further, we found that in tasks with limited dataset sizes, sparsity provides a regularization towards better performance, so we also include MRA-2-s, a sparser variant which skips part of the computation after constructing the set. We use different method-specific hyperparameters for some methods to better understand their efficiency-performance trade-off. Takeaway: these detailed comparisons suggest that our MRA-based self-attention offers top performance and top efficiency among the baselines.

5.1 How good is the approximation accuracy?

Figure 4: Approximation Error vs Runtime vs Memory. Red/dotted vertical line is the runtime of standard self-attention. Note that for any points to the right of the red vertical line, the approximation is slower than computing the true self-attention.

We show that our method gives the best trade-off between approximation accuracy and efficiency by a significant margin compared to other baselines. The approximation accuracy of each method, compared to standard self-attention, provides a direct indication of the quality of the approximation. To evaluate accuracy, we use Q and K from a pretrained model at multiple sequence lengths and compute the relative error of the approximated attention matrix. As shown in Fig. 4, our MRA-2(-s) has the lowest approximation error while maintaining the fastest runtime and smallest memory consumption by a large margin compared to other baselines for both short and long sequences. See §A.4 for more details, sequence lengths, and baselines.

Figure 5: Entropy vs approximation error plots. The hyperparameters of each method are set such that the runtimes match two fixed budgets, respectively. The fastest setting of Big Bird is still slower than the second budget, so it is omitted from the second plot. Note that the runtime of standard self-attention is roughly 30ms.

Next, we evaluate the effect of the spread (or entropy) of self-attention on the approximation quality of different methods. The results are shown in Fig. 5. We see one limitation of low rank or sparsity-based schemes (discussed in §A.2 and Chen et al. (2021)). Our MRA-2 performs well across attention instances with different entropies and significantly better than Scatterbrain (Chen et al., 2021).

5.2 RoBERTa Language Modeling

Method Time Mem MLM MNLI
ms MB Before After m mm
Transformer 0.86 71.0 73.1 74.0 87.4 87.3
Performer 1.29 62.8 6.8 63.1 32.7 33.0
Linformer 0.74 54.5 1.0 5.6 35.4 35.2
SOFT 0.86 34.0 10.9 25.0 32.7 33.0
SOFT + Conv 1.02 35.5 1.0 65.5 74.9 75.0
Nystromformer 0.71 34.8 17.2 68.2 35.4 35.2
Nystrom + Conv 0.88 37.2 1.4 70.9 85.1 84.6
YOSO 0.97 29.8 13.0 68.4 35.4 35.2
YOSO + Conv 1.20 32.9 3.0 69.0 83.2 83.1
Reformer 1.23 59.4 0.7 69.5 84.9 85.0
Longformer 1.30 43.3 66.0 71.2 85.6 85.4
2.31 62.5 71.9 73.2 87.0 87.1
Big Bird 2.03 63.9 71.6 73.3 87.1 87.0
H-Transformer-1D 0.97 29.3 0.5 6.1 35.4 35.2
Scatterbrain 2.23 78.7 60.6 - - -
MRA-2 0.73 28.1 68.9 73.1 86.8 87.1
0.86 34.3 71.9 73.8 87.1 87.2
MRA-2-s 0.66 23.8 67.2 72.8 87.0 87.0
0.80 29.1 71.8 73.8 87.4 87.4
Table 1: Summary of 512 length RoBERTa-base models: runtime and memory efficiency, MLM accuracy, and MNLI accuracy. Unit for time (and memory) is ms (and MB). Before/After denotes accuracy before/after finetuning. The m/mm give the matched/mismatched MNLI. Some methods have more than one row for different model-specific hyperparameters. We divide measurements into three ranked groups for visualization (bold, normal, transparent).

Here, we use RoBERTa language modeling (Liu et al., 2019) to assess the performance and efficiency trade-off of our method and the baselines. We use a pretrained RoBERTa-base to evaluate the compatibility of each method with existing Transformer models and the overall feasibility of direct deployment. For fair comparison, we also check the performance of models trained from scratch. Then, MNLI (Williams et al., 2018) is used to test each model's ability on downstream tasks. Further, we extend the 512-length models to 4096 length for a set of best-performing methods and use the WikiHop (Welbl et al., 2018) task as an assessment of long-sequence language models.

Standard Sequence Length. Since efficient self-attention approximates standard self-attention, we could simply substitute the standard self-attention of a trained model, which would allow us to minimize the training cost for new methods. To evaluate compatibility with existing models, we use a pretrained 512-length RoBERTa-base model (Liu et al., 2019), replace its self-attention module with efficient alternatives, and measure the validation Masked Language Modeling (MLM) accuracy. Then, we check accuracy after finetuning the model on English Wikipedia and BookCorpus (Zhu et al., 2015). Finally, we finetune the model on the downstream task MNLI (Williams et al., 2018).

Method Time Mem MLM MNLI
ms MB m mm
Transformer 0.41 35.47 57.0 72.7±0.6 73.8±0.2
Performer 0.63 31.38 48.6 69.8±0.4 70.5±0.1
Linformer 0.35 27.23 53.5 72.5±0.8 73.2±0.4
SOFT 0.43 17.02 42.8 63.8±2.2 64.7±2.6
SOFT + Conv 0.53 17.77 56.7 70.8±0.5 71.8±0.4
Nystromformer 0.34 17.40 53.1 71.4±0.6 72.0±0.3
Nystrom + Conv 0.45 18.60 57.3 73.0±0.4 73.9±0.6
YOSO 0.47 14.91 53.4 72.9±0.8 73.2±0.4
YOSO + Conv 0.58 16.42 57.2 72.5±0.4 72.9±0.5
Reformer 0.39 16.43 52.4 73.7±0.4 74.6±0.3
0.61 29.65 55.6 75.0±0.2 75.6±0.3
Longformer 0.61 21.60 54.7 72.0±0.4 73.5±0.2
1.10 31.44 57.4 75.8±0.5 76.7±0.6
Big Bird 1.02 31.91 57.6 75.0±0.5 75.6±0.6
H-Transformer-1D 0.47 14.65 43.7 62.9±2.7 63.4±3.9
Scatterbrain 1.04 78.66 20.5 42.6±8.1 43.4±9.5
MRA-2 0.36 14.05 56.4 73.2±0.2 74.1±0.5
0.43 17.15 57.3 73.0±1.0 73.9±0.8
MRA-2-s 0.31 11.93 56.7 73.6±1.6 74.3±1.1
0.38 14.57 57.5 73.9±0.6 74.6±0.8
Table 2: Summary of 512 length RoBERTa-small models. We also include 95% error bars for experiments that have a small compute burden.

Only a handful of schemes, including Longformer, Big Bird, and MRA-2(-s), are fully compatible with pretrained models. Scatterbrain has reasonable accuracy without further finetuning, but training diverges when finetuning the model. The other methods cannot reach a satisfactory level of accuracy. These statements also hold for the downstream finetuning results, shown in Tab. 1. Our method has the best performance among the baselines for both MLM and MNLI, while also having much better time and memory efficiency.

Method Time (ms) Mem (GB) MLM WikiHop
Transformer 30.88 3.93 74.3 74.6
Longformer 10.20 0.35 71.1 60.8
Big Bird 17.53 0.59 - -
MRA-2 7.03 0.28 73.1 71.2
9.25 0.38 73.7 73.4
MRA-2-s 6.37 0.23 73.0 71.8
8.62 0.38 73.8 74.1
Table 3: Summary of 4096 length RoBERTa-base models. Since Big Bird is slow and we are not able to reduce its training time using multiple GPUs, we cannot test Big Bird on 4096-length sequences.

Since many baselines are not compatible with trained model weights (performance degrades when substituting the self-attention module), to make the comparison fair for all methods, we also evaluate models trained from scratch. Due to the large number of baselines, we train a small variant of RoBERTa on English Wikipedia and BookCorpus (Zhu et al., 2015) to keep the training cost reasonable. Then, we again finetune the models on the downstream task (MNLI). Results are summarized in Tab. 2. Only a few methods (including ours) achieve both good performance and efficiency.

Method Time (ms) Mem (GB) MLM WikiHop
Transformer 15.36 1.96 55.8 54.6±1.6
Performer 5.13 0.24 23.2 43.7±0.6
Linformer 2.85 0.21 13.8 11.0±0.4
SOFT 2.46 0.11 25.9 14.0±8.6
  5.92 0.24 31.0 12.1±1.9
SOFT + Conv 3.33 0.11 52.8 30.8±29.3
Nystromformer 2.38 0.11 34.7 44.0±0.2
  4.34 0.27 46.8 46.0±0.8
Nystrom + Conv 3.23 0.12 53.1 54.6±0.8
YOSO 4.15 0.12 47.8 52.4±0.1
  5.07 0.17 49.9 52.8±0.5
YOSO + Conv 5.45 0.13 55.1 53.2±0.7
Reformer 5.04 0.24 52.2 53.7±0.9
Longformer 4.88 0.17 52.4 52.3±0.7
Big Bird 8.68 0.29 54.4 54.3±0.7
H-Transformer-1D 3.93 0.12 41.1 43.7±0.7
Scatterbrain 8.83 0.31 35.8 12.1±0.9
MRA-2 3.43 0.14 54.2 52.6±0.9
  4.52 0.19 55.2 54.0±0.9
MRA-2-s 3.12 0.12 53.8 51.8±0.9
  4.13 0.19 55.1 53.6±0.8
Table 4: Summary of 4096-length RoBERTa-small models.

Longer Sequences. To evaluate the performance of our MRA-2(-s) on longer sequences, we extend the 512-length models to 4096 length. We extend the positional embeddings and further train the models on English Wikipedia, BookCorpus (Zhu et al., 2015), one third of the Stories dataset (Trinh and Le, 2018), and one third of the RealNews dataset (Zellers et al., 2019), following (Beltagy et al., 2020). Then, the 4096-length models are finetuned on the WikiHop dataset (Welbl et al., 2018) to assess their performance on downstream tasks. The results are summarized in Tab. 3 for base models and Tab. 4 for small models. Our MRA-2 is again among the top performing methods and is highly efficient relative to the baselines. Note that the difference in WikiHop performance of Longformer (Beltagy et al., 2020) from the original paper is because the original paper uses a much larger window size, which has an even slower runtime. Linformer (Wang et al., 2020) does not seem to be able to adapt the weights from its 512-length model to a 4096-length model. It is interesting that the convolution in Nyströmformer (Xiong et al., 2021) seems to play an important role in boosting performance.
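The positional embedding extension above can be done with a simple copy initialization, as in Longformer; the snippet below is an illustrative sketch only (the function name, array sizes, and the tiling strategy are our assumptions for exposition, not the exact training code):

```python
import numpy as np

def extend_positional_embeddings(short_emb: np.ndarray, new_len: int) -> np.ndarray:
    """Initialize embeddings for a longer context by repeatedly copying
    the pretrained short-context embedding table (copy initialization)."""
    short_len, dim = short_emb.shape
    reps = -(-new_len // short_len)               # ceiling division
    return np.tile(short_emb, (reps, 1))[:new_len]

# extend a 512-position table to 4096 positions
emb_512 = np.random.randn(512, 768)
emb_4096 = extend_positional_embeddings(emb_512, 4096)
assert emb_4096.shape == (4096, 768)
# each 512-position segment starts as an exact copy of the original table
assert np.allclose(emb_4096[512:1024], emb_512)
```

The copied embeddings are then refined during the additional pretraining on long documents.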

5.3 Long Range Arena

Method Listops Text Retrieval Image Pathfinder Avg
Transformer 37.1±0.4 65.2±0.6 79.6±1.7 38.5±0.7 72.8±1.1 58.7±0.3
Performer 36.7±0.2 65.2±0.9 79.5±1.4 38.6±0.7 71.4±0.7 58.3±0.1
Linformer 37.4±0.3 57.0±1.1 78.4±0.1 38.1±0.3 67.2±0.1 55.6±0.3
SOFT 36.3±1.4 65.2±0.0 83.3±1.0 35.3±1.3 67.7±1.1 57.5±0.5
SOFT + Conv 37.1±0.4 65.2±0.4 82.9±0.0 37.1±4.7 68.1±0.4 58.1±0.9
Nystromformer 24.7±17.5 65.7±0.1 80.2±0.3 38.8±2.9 73.1±0.1 56.5±2.8
Nystrom + Conv 30.6±8.9 65.7±0.2 78.9±1.2 43.2±3.4 69.1±1.0 57.5±1.5
YOSO 37.0±0.3 63.1±0.2 78.3±0.7 40.8±0.8 72.9±0.6 58.4±0.3
YOSO + Conv 37.2±0.5 64.9±1.2 78.5±0.9 44.6±0.7 69.5±3.5 59.0±1.1
Reformer 18.9±2.4 64.9±0.4 78.2±1.6 42.4±0.4 68.9±1.1 54.7±0.2
Longformer 37.2±0.3 64.1±0.1 79.7±1.1 42.6±0.1 70.7±0.8 58.9±0.1
Big Bird 37.4±0.3 64.3±1.1 79.9±0.1 40.9±1.1 72.6±0.7 59.0±0.3
H-Transformer-1D 30.4±8.8 66.0±0.2 80.1±0.4 42.1±0.8 70.7±0.1 57.8±1.8
Scatterbrain 37.5±0.1 64.4±0.3 79.6±0.1 38.0±0.9 54.8±7.8 54.9±1.4
MRA-2 37.2±0.3 65.4±0.1 79.6±0.6 39.5±0.9 73.6±0.4 59.0±0.3
MRA-2-s 37.4±0.5 64.3±0.8 80.3±0.1 41.1±0.4 73.8±0.6 59.4±0.2
Table 5: Test set accuracy on LRA tasks. Since the benchmark consists of multiple tasks with different sequence lengths, we do not include efficiency measurements in the table.

The Long Range Arena (LRA) (Tay et al., 2021) was proposed as a lightweight benchmark to quickly compare the long sequence modeling capability of Transformers. Due to consistency and code compatibility issues with the official LRA benchmark (see Issue-34, Issue-35, and Lee-Thorp et al. (2021)), we use the LRA code provided by Xiong et al. (2021) and follow exactly the same hyperparameter settings. The results are shown in Tab. 5. Our method has the best performance compared to the others.

Caveats. A reader may ask why Longformer, Big Bird, and MRA-2-s perform better than the standard Transformer (Vaswani et al., 2017) despite being approximations. The performance difference is most pronounced on the image task. We also found that Longformer with a smaller local attention window tends to offer better performance on the image task. One reason is that standard self-attention needs larger datasets to compensate for its lack of a locality bias (Xu et al., 2021; d’Ascoli et al., 2021). Hence, due to the small datasets (i.e., the benchmark's lightweight nature), the LRA accuracy metrics should be interpreted with caution.

Method Time (ms) Mem (MB) Top-1 Top-5
Transformer 1.24 45.5 48.7 73.7
Reformer 1.14 19.1 39.6 65.5
Longformer 1.12 13.7 49.1 73.9
H-Transformer-1D 1.03 9.8 48.7 73.9
MRA-2 1.00 11.8 48.9 73.6
MRA-2-s 0.98 9.7 49.2 73.9
Table 6: Summary of ImageNet results trained on 4-layer Transformers. We report both top-1 and top-5 accuracy.


ImageNet. To test performance on large datasets, we use ImageNet (Russakovsky et al., 2015) as a large scale alternative to the CIFAR-10 dataset (Krizhevsky et al., 2009) used in the image task of LRA. Further, data augmentation is used to increase the dataset size. As with LRA, we focus on small models and use a 4-layer Transformer (see §A.4 for more details). Model-specific hyperparameters are the same as the ones used on LRA. The results are shown in Tab. 6. MRA-2-s is the top performing approach, and both standard self-attention and MRA-2 clearly perform better on this larger dataset.

6 Conclusion

We show that Multiresolution Analysis (MRA) provides fresh ideas for efficiently approximating self-attention, subsuming many piecemeal approaches in the literature. We expect that exploiting the links to MRA will allow leveraging a vast body of technical results developed over many decades, but we show that tangible practical benefits are available immediately. When some consideration is given to which design choices or heuristics for an MRA-based self-attention scheme will interface well with mature software stacks and modern hardware, we obtain a procedure with strong advantages in both performance/accuracy and efficiency. Further, our implementation can be directly plugged into existing Transformers, a feature missing from some existing efficient Transformer implementations. We show use cases on longer sequence tasks and in resource-limited settings, but believe that various other applications of Transformers will also benefit in the short term. Finally, we note the lack of integrated software support for MRA, as well as for our specialized model, in current deep learning libraries. Overcoming this limitation required implementing custom CUDA kernels for some generic block sparsity operators; therefore, extending our algorithm to other use cases may involve reimplementing these kernels. We hope that with broader use of MRA-based methods, software support will improve, thereby reducing this implementation barrier.


This work was supported by the UW AmFam Data Science Institute through funds from American Family Insurance. VS was supported by NIH RF1 AG059312. We thank Sathya Ravi for discussions regarding multigrid methods, Karu Sankaralingam for suggestions regarding hardware support for sparsity/block sparsity, and Pranav Pulijala for integrating our algorithm within the HuggingFace library.


  • I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150. Cited by: §1, §1, §2.1, §4.4, §5.2, §5.
  • E. J. Candès, X. Li, Y. Ma, and J. Wright (2011) Robust principal component analysis? Journal of the ACM (JACM) 58 (3), pp. 1–37. Cited by: §A.2.
  • B. Chen, T. Dao, E. Winsor, Z. Song, A. Rudra, and C. Ré (2021) Scatterbrain: unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Figure 7, §A.2, §1, §2.1, §5.1, §5.
  • K. M. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Q. Davis, A. Mohiuddin, L. Kaiser, D. B. Belanger, L. J. Colwell, and A. Weller (2021) Rethinking attention with performers. In International Conference on Learning Representations (ICLR), Cited by: §1, §1, §2.1, §4.4, §5.
  • R. R. Coifman and M. Maggioni (2006) Diffusion wavelets. Applied and computational harmonic analysis 21 (1), pp. 53–94. Cited by: §1.
  • S. d’Ascoli, H. Touvron, M. Leavitt, A. Morcos, G. Biroli, and L. Sagun (2021) ConViT: improving vision transformers with soft convolutional inductive biases. In International Conference on Machine Learning (ICML), Cited by: §5.3.
  • I. Daubechies (1992) Ten lectures on wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, Society for Industrial and Applied Mathematics. Cited by: §A.5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Cited by: §A.2.
  • D. L. Donoho and I. M. Johnstone (1995) Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association 90 (432), pp. 1200–1224. Cited by: §A.5.
  • D. L. Donoho (1995) De-noising by soft-thresholding. IEEE Transactions on Information Theory 41 (3), pp. 613–627. Cited by: §A.5.
  • M. Gavish, B. Nadler, and R. R. Coifman (2010) Multiscale wavelets on trees, graphs and high dimensional data: theory and applications to semi supervised learning. In International Conference on Machine Learning (ICML), Cited by: §1.
  • W. Hackbusch (1985) Multi-grid methods and applications. Vol. 4, Springer Science & Business Media. Cited by: Remark 4.6.
  • D. K. Hammond, P. Vandergheynst, and R. Gribonval (2011) Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30 (2), pp. 129–150. Cited by: §1.
  • S. Hu (2015) Relations of the nuclear norm of a tensor and its matrix flattenings. Linear Algebra and its Applications 478, pp. 188–199. Cited by: §A.2.
  • V. K. Ithapu, R. Kondor, S. C. Johnson, and V. Singh (2017) The incremental multiresolution matrix factorization algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
  • N. Kitaev, L. Kaiser, and A. Levskaya (2020) Reformer: the efficient transformer. In International Conference on Learning Representations (ICLR), Cited by: §2.1, §4.1, §5.
  • R. Kondor, N. Teneva, and V. Garg (2014) Multiresolution matrix factorization. In International Conference on Machine Learning (ICML), Cited by: §1.
  • O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky (2019) Revealing the dark secrets of bert. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §A.2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §5.3.
  • A. B. Lee and B. Nadler (2007) Treelets: a tool for dimensionality reduction and multi-scale analysis of unstructured data. In International Conference on Artificial Intelligence and Statistics, Cited by: §1.
  • J. Lee-Thorp, J. Ainslie, I. Eckstein, and S. Ontanon (2021) FNet: mixing tokens with fourier transforms. arXiv preprint arXiv:2105.03824. Cited by: §5.3.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §A.2, §5.2, §5.2, §5.
  • J. Lu, J. Yao, J. Zhang, X. Zhu, H. Xu, W. Gao, C. Xu, T. Xiang, and L. Zhang (2021) SOFT: softmax-free transformer with linear complexity. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.1, §5.
  • S. Mallat (1999) A wavelet tour of signal processing. Elsevier. Cited by: §1.
  • H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. Smith, and L. Kong (2021) Random feature attention. In International Conference on Learning Representations (ICLR), Cited by: §2.1.
  • X. Peng, C. Lu, Z. Yi, and H. Tang (2018) Connections between nuclear-norm and frobenius-norm-based representations. IEEE Transactions on Neural Networks and Learning Systems 29 (1), pp. 218–224. Cited by: §A.2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. In International Journal of Computer Vision (IJCV), Cited by: §5.3.
  • Y. Saad (2003) Iterative methods for sparse linear systems. Second edition, Society for Industrial and Applied Mathematics. Cited by: Remark 4.6.
  • S. Simic (2009) Jensen’s inequality and new entropy bounds. Applied Mathematics Letters 22 (8), pp. 1262–1265. Cited by: §A.3.
  • Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler (2021) Long range arena : a benchmark for efficient transformers. In International Conference on Learning Representations (ICLR), Cited by: §5.3, §5.
  • T. H. Trinh and Q. V. Le (2018) A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847. Cited by: §5.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §5.3.
  • S. Wang, B. Li, M. Khabsa, H. Fang, and H. Ma (2020) Linformer: self-attention with linear complexity. arXiv preprint arXiv:2006.04768. Cited by: §1, §1, §2.1, §4.4, §5.2, §5.
  • J. Welbl, P. Stenetorp, and S. Riedel (2018) Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics 6, pp. 287–302. Cited by: §5.2, §5.2.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Cited by: §5.2, §5.2.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §A.4.
  • Y. Xiong, Z. Zeng, R. Chakraborty, M. Tan, G. Fung, Y. Li, and V. Singh (2021) Nyströmformer: a nyström-based algorithm for approximating self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §1, §2.1, §5.2, §5.3, §5.
  • Y. Xu, Q. Zhang, J. Zhang, and D. Tao (2021) ViTAE: vision transformer advanced by exploring intrinsic inductive bias. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §5.3.
  • M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed (2020) Big bird: transformers for longer sequences. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §1, §2.1, §5.
  • R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi (2019) Defending against neural fake news. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §5.2.
  • Z. Zeng, Y. Xiong, S. Ravi, S. Acharya, G. M. Fung, and V. Singh (2021) You only sample (almost) once: linear cost self-attention via bernoulli sampling. In International Conference on Machine Learning (ICML), Cited by: §1, §2.1, §5.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In International Conference on Computer Vision (ICCV), Cited by: §5.2, §5.2, §5.2.
  • Z. Zhu and R. Soricut (2021) H-transformer-1D: fast one-dimensional hierarchical attention for sequences. In Annual Meeting of the Association for Computational Linguistics, Cited by: §1, §2.1, §5.

Appendix A Appendix

The appendix includes more details regarding the formulation, analysis, and experiments.

A.1 Visualizing Approximation Procedure in Linear Scale

Figure 6: Illustration of the approximation scheme on a linear scale.

We also show a visualization of our approximation procedure on a linear scale in Fig. 6. This figure gives a better illustration of how the approximation quality increases as the procedure proceeds.

A.2 Link to Sparsity and Low Rank

Figure 7: The solid and dashed lines in the left plot show the theoretical workload, without considering overhead, necessary to achieve a relative error below two fixed thresholds (solid and dashed lines, respectively), for different sequence lengths. Ideally, the workload should be linear in the sequence length. The right plot shows the approximation error vs. the entropy of the self-attention matrices, similar to (Chen et al., 2021), at a fixed fraction of the workload of standard self-attention (keeping the corresponding percentage of the rank and of the nonzero entries for the low rank and sparse approximations, respectively). The entropy of the softmax is used as a proxy for the spread of attention. Relative error is defined as the norm of the approximation error divided by the norm of the self-attention matrix.

Low rank and sparsity are two popular directions for efficient self-attention. To explore the potential of these two types of approximation, we set aside efficiency considerations and use the best possible method of each type. Specifically, for a fixed budget, we use the sparse approximation that minimizes the error by placing its support on the largest entries of the self-attention matrix, and the low rank approximation that minimizes the error via truncated SVD. As shown in Fig. 7, both types of methods are limited in approximating self-attention. The low rank method requires superlinear cost to maintain the approximation accuracy, and it fails when the entropy of the self-attention is smaller. In many cases, the sparse approximation is sufficient, but when the self-attention matrices are less sparse and have larger entropy, the sparse approximation fails as well. This motivates a sparse + low rank approximation. Such an approximation can be obtained via robust PCA, which decomposes a matrix into sparse and low rank components by solving an optimization objective. A convex relaxation (Candès et al., 2011) makes the optimization tractable, but the cost of finding a good solution is still too high to be suitable for efficient self-attention. Scatterbrain (Chen et al., 2021) instead combines an existing sparse attention with an existing low rank attention to obtain a sparse + low rank approximation while avoiding the expensive cost of robust PCA.
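To make the two "best possible" baselines concrete, the sketch below builds a toy softmax attention matrix and compares the optimal sparse approximation (support on the largest entries) against the optimal low rank approximation (truncated SVD); the sizes and budgets here are hypothetical, chosen only for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable row-wise softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_topk(A, k):
    # best Frobenius-norm sparse approximation with k nonzeros:
    # keep the k largest-magnitude entries of A, zero the rest
    S = np.zeros_like(A)
    idx = np.argsort(np.abs(A), axis=None)[-k:]
    S.flat[idx] = A.flat[idx]
    return S

def low_rank(A, r):
    # best Frobenius-norm rank-r approximation via truncated SVD
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

rng = np.random.default_rng(0)
n, d = 64, 16
A = softmax(rng.standard_normal((n, d)) @ rng.standard_normal((d, n)) / np.sqrt(d))

# 10% budget for each approximation type
for name, approx in [("sparse", sparse_topk(A, n * n // 10)),
                     ("low rank", low_rank(A, n // 10))]:
    rel_err = np.linalg.norm(A - approx) / np.linalg.norm(A)
    print(f"{name} relative error: {rel_err:.3f}")
```

The relative error here matches the metric used in Fig. 7; sweeping the budget reproduces the qualitative trade-off between the two approximation types.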

Interestingly, a special form of our work offers an alternative to Scatterbrain's approach for sparse + low rank approximation. When the budget takes a suitable form, our MRA-2 can be viewed as a sparse + low rank approximation. Specifically, for a given resolution, the resulting decomposition serves as a reasonably good solution to a relaxed version of the sparse and low rank decomposition problem.

Let us consider an alternative relaxation of the robust PCA objective. This relaxed objective is easier to compute, and the two formulations are closely related (Hu, 2015). In fact, Peng et al. (2018) show that the solutions obtained by minimizing the two objectives solve the same low rank recovery problem and are identical in some situations. The optimal solution for objective (20) can be obtained easily: there exists a threshold such that the optimal sparse component is supported on the largest entries of the self-attention matrix. However, for practical use, the recovered sparsity cannot be exploited efficiently on a GPU due to its scattered support. Further, the complexity remains high since we still need to locate the largest entries. Suppose instead we restrict the sparse component to be a block sparse matrix, i.e., supported on a subset of blocks; then a GPU can exploit the block sparsity structure to significantly accelerate computation. In this case, the optimal sparse component is supported on the regions with the largest block mass, which is similar to (8). As a result, similar to approximating (8) with (10), we use the lower bound as a proxy for (21), so the cost of locating the support blocks is low. Consequently, the resulting solution is supported on the regions with the largest block scores. Note that this is exactly our scheme with an appropriate budget, and the residual serves as a reasonable solution for the low rank component since it is small.
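As a concrete illustration of the block sparse restriction, the sketch below (block size and budget are hypothetical) partitions a matrix into fixed-size tiles, scores each tile by its total absolute mass as a cheap proxy, and keeps only the highest-scoring tiles, so whole blocks are either kept or dropped and the resulting support is GPU-friendly:

```python
import numpy as np

def topk_block_support(A: np.ndarray, block: int, num_blocks: int) -> np.ndarray:
    """Block sparse approximation: keep the num_blocks (block x block)
    tiles of A with the largest absolute mass, zero out the rest."""
    n = A.shape[0]
    assert n % block == 0
    nb = n // block
    # tiles[i, p, j, q] == A[i*block + p, j*block + q]
    tiles = A.reshape(nb, block, nb, block)
    scores = np.abs(tiles).sum(axis=(1, 3))              # (nb, nb) per-tile scores
    keep = np.argsort(scores, axis=None)[-num_blocks:]   # flat indices of kept tiles
    mask = np.zeros(nb * nb, dtype=bool)
    mask[keep] = True
    mask = mask.reshape(nb, nb)
    out = np.where(mask[:, None, :, None], tiles, 0.0)   # zero the dropped tiles
    return out.reshape(n, n)

rng = np.random.default_rng(0)
A = rng.random((64, 64)) * 0.1
A[:16, :16] += 5.0                    # one strongly attended region
B = topk_block_support(A, block=16, num_blocks=4)
# the dominant 16x16 tile must be among the kept blocks
assert np.allclose(B[:16, :16], A[:16, :16])
```

Scoring tiles instead of individual entries reduces the cost of locating the support from scanning all entries to scanning the much smaller grid of tile scores.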

Figure 8: The top 3 plots show the optimal supports for 3 typical self-attention matrices at a fixed sparsity level. The bottom 3 plots show the supports found via our MRA-2 at the same sparsity level.

We empirically evaluate the quality of this sparse solution. Kovaleva et al. (2019) showed that the BERT model (Devlin et al., 2019) exhibits multiple self-attention patterns capturing different semantic information. We investigate the self-attention patterns of a pretrained RoBERTa model (Liu et al., 2019) and show the optimal sparsity supports for self-attention matrices generated from RoBERTa-base with 4096-length sequences in the top plots of Fig. 8. Our MRA-2, as shown in the bottom plots of Fig. 8, is able to find a reasonably good sparse solution for (20). Notice that while many self-attention matrices turn out to be diagonally banded, which Longformer and Big Bird can approximate well, this is not the only possible structure. A diagonally banded structure is not sufficient to approximate the last two self-attention patterns well.

A.3 Analysis

In this section, we discuss in more detail some manipulations described in the main paper.

Observation A.1.

We can re-write as where is the index of that has the smallest support region and is supported on .


We describe the details next. First, notice that at each scale ,


If for all , then and thus . Further, at each scale , the supports of are disjoint, and there is exactly one whose support includes . Thus, if