1 Introduction
The Transformer model (Vaswani et al., 2017) is highly effective across natural language processing (NLP) applications including machine translation (Vaswani et al., 2017), language inference (Devlin et al., 2019) and paraphrasing (Raffel et al., 2020). Transformer-based models such as BERT (Devlin et al., 2019) are pretrained in an unsupervised manner and later finetuned on different downstream tasks, often providing state-of-the-art performance on standard benchmarks. While such models have strong empirical performance, their computational/memory requirements remain high. Consequently, in the NLP setting, many current models impose constraints on the sequence length, e.g., BERT and other transformer-based language models (Yang et al., 2019; Liu et al., 2019) limit the sentence length to at most 512 tokens, although recent results have reported success with longer sequences based on interesting efficiency-focused strategies (Beltagy et al., 2020; Zaheer et al., 2020; Xiong et al., 2021).

Multi-head self-attention is central to Transformer-based models and provides a flexible global receptive field to exchange information among input tokens. While self-attention provides various benefits, it is also a bottleneck when training with long sequences. In particular, the output of self-attention is a combination of all tokens, where the coefficients are determined by the similarities among tokens. This is beneficial, but involves a sizable resource footprint. When the sequence length is n, one incurs an O(n²) complexity in both time and memory to compute the pairwise similarities among all input tokens. This quadratic cost restricts the use of self-attention in applications where capturing long-term context dependencies is important, and it has motivated many ongoing efforts to mitigate the resource needs of such models.
Our work also seeks to address the aforementioned issues, and is inspired by ideas of importance sampling via hashing-based sampling strategies (Spring and Shrivastava, 2018; Charikar and Siminelakis, 2017). We propose a Bernoulli-based sampling scheme to approximate self-attention which scales linearly with the input sequence length. We view self-attention as a sum of individual tokens associated with Bernoulli random variables whose success probability is determined by the similarities among tokens. In principle, we can sample all Bernoulli random variables at once with a single hash. It turns out that the resulting strategy (You Only Sample Almost Once, YOSO-Attention) is more amenable to an efficient, backpropagation-friendly implementation, and exhibits a favorable performance profile in experiments.
2 Related Work
We first review a commonly used form of self-attention and then the most relevant ideas that focus on the efficiency aspects of Transformer models. Then, we summarize a distinct line of work based on sampling which directly inspires our strategy for approximating the softmax calculation.
2.1 Self-Attention
Self-attention is a scaled dot-product attention mechanism for capturing token dependencies in the input sequence, and can be defined as,
A(Q, K, V) = D⁻¹ exp(QK^⊤ / √d_h) V,   Q = XW_Q, K = XW_K, V = XW_V    (1)
where Q, K, V are embedding matrices computed from the input sequence X, called the queries, keys and values respectively. Here, n is the input sequence length, d is the embedding dimension of each token, W_Q, W_K, W_V ∈ R^{d×d_h} are learned parameter matrices, d_h is the head dimension, and D is a diagonal matrix which normalizes each row of exp(QK^⊤/√d_h) such that the row entries sum up to 1. For simplicity, we overload the notation exp(QK^⊤) to denote exp(QK^⊤/√d_h) in our description below.
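As a concrete reference, the computation in (1) can be sketched in a few lines of NumPy (an illustrative sketch, not the paper's implementation; the shapes n, d, d_h are arbitrary):

```python
import numpy as np

def softmax_self_attention(X, W_Q, W_K, W_V):
    """Plain O(n^2) self-attention: D^{-1} exp(QK^T / sqrt(d_h)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # each is n x d_h
    d_h = Q.shape[1]
    S = np.exp(Q @ K.T / np.sqrt(d_h))             # n x n unnormalized similarities
    D_inv = 1.0 / S.sum(axis=1, keepdims=True)     # row normalizer (the diagonal D^{-1})
    return (D_inv * S) @ V                         # rows are convex combinations of rows of V

rng = np.random.default_rng(0)
n, d, d_h = 8, 16, 4
X = rng.standard_normal((n, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d_h)) for _ in range(3))
out = softmax_self_attention(X, W_Q, W_K, W_V)
```

The n×n matrix S is exactly the quadratic-cost object discussed throughout the paper.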
Multi-Head Self-Attention. Multi-head self-attention in Transformers runs the scaled dot-product attention multiple times in parallel, and the attention outputs are concatenated to help the model capture information from multiple representation subspaces (Vaswani et al., 2017). Multi-head self-attention can be formally written as,
MultiHead(Q, K, V) = [head_1, …, head_h] W_O,   head_i = A(QW_i^Q, KW_i^K, VW_i^V)    (2)
where h is the number of heads, head_1, …, head_h are attention heads computed with different parameter matrices W_i^Q, W_i^K, W_i^V, and W_O is a learnable output transformation.
Self-Attention Bottleneck. A bottleneck in self-attention is calculating the softmax matrix, D⁻¹ exp(QK^⊤), which requires computing all pairwise input token similarities.
2.2 Efficient Transformers
Recent proposals have identified a number of ways to reduce the quadratic cost of self-attention. Linformer (Wang et al., 2020) shows that, under a low-rank assumption, self-attention can be approximated via random projections along the sequence-length dimension. The authors replace random projections with learnable linear projections and achieve an O(n) complexity via a fixed projection dimension. Linear Transformers
(Katharopoulos et al., 2020) replace the softmax activation with a separable activation applied to the queries and keys. Based on the connection between the softmax activation and the Gaussian kernel, Performer (Choromanski et al., 2021) and Random Feature Attention (Peng et al., 2021) approximate softmax as the dot product of finite-dimensional random feature vectors, with a guarantee of convergence (to softmax). Nyströmformer
(Xiong et al., 2021), on the other hand, uses a landmark-based Nyström method to approximate the attention matrix. These methods achieve O(n) complexity by avoiding direct calculation of the n×n attention matrix. Multiple approaches have been developed to exploit the sparsity and structured patterns of attention matrices observed empirically. This line of work includes Sparse Transformer (Child et al., 2019), Longformer (Beltagy et al., 2020), and Big Bird (Zaheer et al., 2020), which involve time and memory complexity of O(n√n), O(n), and O(n), respectively. The Reformer method (Kitaev et al., 2020), which is related to our work, also utilizes the sparsity of self-attention. But instead of predetermining a sparsity pattern, it uses Locality Sensitive Hashing (LSH) as a tool for approximate nearest-neighbor search, and dynamically determines the sparsity pattern to achieve O(n log n) complexity. In contrast, our approach takes advantage of the connection between query-key similarity and the LSH collision probability, which is used to directly estimate self-attention without relying on sparsity.

2.3 Importance Sampling
We focus on developing an efficient, low-variance estimator of self-attention. Since self-attention can be thought of as integrating tokens over a softmax distribution, one could estimate this integral using importance sampling, see (Press et al., 2007). Using importance sampling allows drawing samples from a uniform distribution and avoids sampling from the softmax distribution directly (which is harder). But this leads to a high-variance estimate, since the softmax distribution is usually concentrated in a small region.
LSH-based Importance Sampling. Consider the case when the angular distance between a key and a query is small. In this case, the similarity (between the key and the query) as well as the softmax probability will be large. When viewed through the lens of nearest-neighbor retrieval, this property coincides with a large collision probability of high-similarity key-query pairs, assuming that the retrieval is implemented via LSH. Motivated by this link between softmax probability and LSH collision probability, Spring and Shrivastava (2018) and Charikar and Siminelakis (2017) suggest using LSH as an efficient sampler for low-variance softmax estimators, which can be adopted for self-attention approximation.
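The agreement between angular similarity and collision probability can be checked with a small SimHash experiment (an illustrative sketch; the hyperplane count tau and the trial budget are arbitrary choices, not taken from the paper):

```python
import numpy as np

# For unit vectors q, k at angle theta, a single random-hyperplane bit agrees with
# probability 1 - theta/pi, so tau concatenated bits collide with probability
# (1 - theta/pi)^tau. We verify this empirically.
rng = np.random.default_rng(1)
d, tau, trials = 32, 4, 20000

q = rng.standard_normal(d); q /= np.linalg.norm(q)
k = rng.standard_normal(d); k /= np.linalg.norm(k)
theta = np.arccos(np.clip(q @ k, -1.0, 1.0))
p_theory = (1 - theta / np.pi) ** tau

hits = 0
for _ in range(trials):
    H = rng.standard_normal((tau, d))          # tau random hyperplanes = one hash code
    hits += np.array_equal(np.sign(H @ q), np.sign(H @ k))
p_empirical = hits / trials
print(p_empirical, p_theory)
```

The empirical collision rate matches the closed-form probability up to Monte-Carlo noise.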
(a) Spring and Shrivastava (2018) propose approximating softmax by sampling, for each query, a set S of neighboring keys formed by the union of colliding keys across hash tables. The estimator reweights each value vector in S by the ratio of the softmax probability to the collision probability of the corresponding query-key pair. This procedure involves importance sampling without replacement, which leads to a dependency among the samples. Deduplication (avoiding double counting) requires extra memory to store the keys in each hash table and extra runtime to deduplicate keys for each query. If the sizes of the hash buckets are skewed, the (GPU) memory needs depend on the size of the largest hash bucket, and the runtime depends on the size of S.

(b) Charikar and Siminelakis (2017) provide a hash-based estimator to simulate a proposal distribution for importance sampling via LSH, which can be easily applied in the context of softmax. For each hash table, a key is uniformly selected from the bucket that the query is hashed to, simulating a draw from a proposal distribution; the sampled value is then weighted by the size of that hash bucket. This simulates samples drawn with replacement from the proposal distribution. However, the probability of one key being sampled depends not only on (i) its angular distance to the query but also on (ii) the number of keys within the hash bucket, leading to a sampling dependency among all keys. Further, using this scheme for self-attention ties the sparsity of the softmax matrix to the number of hashes used. Specifically, the number of tokens that each query can attend to is bounded by the number of hashes: the procedure samples at most one distinct key per hash table, so each hash table adds at most one additional nonzero to the softmax matrix.
LSH-based Importance Sampling: practical considerations. While LSH-based importance sampling exploits the agreement between high softmax probability and high collision probability, the alignment is not perfect. Samples from the proposal distribution must be reweighted to compensate for the difference. Further, for different queries, the likelihood ratios between the softmax distribution and the proposal distribution w.r.t. a single key are different. Therefore, the reweighting has to be done during querying. Although maintaining hash tables for storing keys is not a major problem in general, the high memory cost of the hash tables and the computation time for reweighting noticeably affect efficiency when applied to self-attention. To summarize, we find that a direct application of LSH-based importance sampling in the deep learning context may not lead to an efficient self-attention scheme.
3 YOSO-Attention
3.1 YOSO-Attention
While the softmax computation bottleneck can be alleviated through LSH-based importance sampling, these approaches are not very efficient on GPUs. Our LSH-based Bernoulli sampling offers benefits here. Instead of using LSH to simulate sampling from a proposal distribution over tokens, we view attention as a sum of tokens associated with Bernoulli random variables. This modification relates more closely to LSH itself than to LSH-based importance sampling: the probability of a query colliding with a key does not depend on the other keys. This avoids the sampling-dependency problem in LSH-based importance sampling and gives us an opportunity to develop a strategy more amenable to GPUs.
Remark 1. We assume that the input keys and queries of self-attention are unit length, which allows treating dot-product similarity in self-attention and cosine similarity in LSH interchangeably. This is simple to arrange using Neyshabur and Srebro (2015): a constant bounding the squared norms of all queries and keys is used to construct new unit-length keys and queries while preserving their pairwise similarities. We can then work with the softmax matrix in the angular distance metric and derive our algorithm.

Self-Attention via Bernoulli Sampling. We aim to approximate self-attention, which uses a softmax matrix to capture the context dependency among tokens via their pairwise similarities. Assume that we can represent this context dependency directly using the LSH collision probability; then no reweighting is required, since the proposal and target distributions are the same. This means that the challenges discussed for LSH-based importance sampling do not arise. So the coincidence of softmax probability and LSH collision probability makes a sensible starting point for approximating self-attention. Specifically, to model dependency based on similarity, the collision probability aligns well with the exponential function in softmax over the domain of interest (Figure 2): both functions have positive zeroth, first and second order derivatives.
Note that (a) a positive zeroth-order derivative indicates that the dependency is positive, (b) a positive first-order derivative ensures that the dependency based on similarity is monotonic, and (c) a positive second-order derivative means that the attention weight rapidly increases and dominates the others as the similarity increases. This leads us to hypothesize that a collision-based self-attention may be as effective as softmax-based self-attention. It can be formulated as,
A(q, K, V) ≈ ∑_{i=1}^{n} B(q, k_i) v_i    (3)
where B(q, k_i) is a Bernoulli random variable whose success probability is given by the collision probability of the query q with the key k_i. Hence, it is determined by the similarity between q and k_i.
In a single hash, each B(q, k_i) generates a realization that determines whether the corresponding token will be a part of the attention output or not. Conceptually, when sampling from the softmax distribution, only one token is sampled as the attention output. In contrast, Bernoulli sampling determines, for each individual token, whether it is a part of the attention output. In principle, to determine the context dependency among tokens, you only need to sample once (YOSO), using a single hash to generate realizations of all Bernoulli random variables B(q, k_1), …, B(q, k_n). Specifically, when the keys are hashed to a hash table using a single hash function f, the realization of B(q, k_i) for each query q is 1 if q collides with k_i, and 0 otherwise. To our knowledge, using the LSH collision probability to replace softmax dependencies in self-attention in this way has not been described before.
YOSO-Attention. By replacing the softmax dependency with Bernoulli random variables and using LSH as an efficient sampler to estimate the success probabilities, we obtain an efficient self-attention (YOSO-Attention) that approximates softmax-based self-attention,
YOSO(Q, K, V) = B(Q, K) V    (4)
where B(Q, K) is the Bernoulli random matrix, with each entry realized via a single hash:
B(Q, K)_{i,j} = 1{f(q_i) = f(k_j)}    (5)
where f is a hash function (a concatenation of τ random hyperplane hashes). The expectation of B(Q, K)_{i,j} is the collision probability,
E[B(Q, K)_{i,j}] = p(q_i, k_j) = (1 − arccos(q_i^⊤ k_j)/π)^τ    (6)
The variance of each entry is simply that of a Bernoulli random variable:
Var[B(Q, K)_{i,j}] = p(q_i, k_j)(1 − p(q_i, k_j))    (7)
While a single sample suffices to estimate the attention output, in practice, the actual output of YOSO-Attention can be the average of the outputs from m samples to lower the estimation variance, where m is a small constant. A high-level overview of our method is shown in Figure 1. For LSH, each sample (hash) is a space partitioning of the input space. The values v_i associated with keys k_i in the same partition are summed together, and the partitions give a coarse representation of V. As m increases, the average of these representations converges to the expected attention output E[B(Q, K)]V.
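This averaging behavior can be checked with a small Monte-Carlo sketch (illustrative only; it estimates B(Q,K)V by averaging m single-hash draws and compares against the closed-form expectation, assuming the hyperplane collision probability in (6)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, tau, m = 6, 16, 3, 3000

def unit_rows(A):
    return A / np.linalg.norm(A, axis=1, keepdims=True)

Q, K = unit_rows(rng.standard_normal((n, d))), unit_rows(rng.standard_normal((n, d)))
V = rng.standard_normal((n, d))

# Closed-form expectation: E[B(Q,K)] = (1 - arccos(QK^T)/pi)^tau, then E[B]V.
P = (1 - np.arccos(np.clip(Q @ K.T, -1, 1)) / np.pi) ** tau
expected = P @ V

est = np.zeros_like(expected)
for _ in range(m):
    H = rng.standard_normal((tau, d))                 # one hash = tau hyperplanes
    codes_q = (H @ Q.T > 0)                           # tau-bit code per query
    codes_k = (H @ K.T > 0)
    # Collision matrix: codes agree on all tau bits.
    B = np.all(codes_q.T[:, None, :] == codes_k.T[None, :, :], axis=2)
    est += B.astype(float) @ V
est /= m

print(np.abs(est - expected).max())
```

The maximum entrywise deviation shrinks as m grows, matching the variance bound in (7).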
Remark 2. Our proposed method enjoys multiple advantages explicitly noted recently in Performer (Choromanski et al., 2021) as desirable for self-attention: (a) The attention weights are always positive, which makes for a robust self-attention mechanism. In YOSO, the attention weights always lie in [0, 1], which means the method is also numerically stable. (b) The variance goes to zero as the attention weight approaches zero. In YOSO, the variance of an attention weight is always upper bounded by the attention weight itself, making the approximation error easy to control.
Normalizing Attention. In softmax self-attention, each row of the softmax matrix is normalized so that the dependencies sum up to 1. We discussed above how the pairwise query-key dependencies can be estimated using Bernoulli sampling. We now describe how to normalize the dependencies as in softmax self-attention. We can first estimate the dependencies and then divide them by their sum, estimated as B(Q, K)1, where 1 is the vector of all ones; B(Q, K)1 can be computed via (4) by plugging 1 in place of V. To make the estimation of self-attention more efficient, we instead adopt an ℓ2 normalization on the attention output, similar to the use of ℓ2 normalization for word embeddings in Levy et al. (2015). Attention outputs are then invariant to a positive rescaling of B(Q, K)V under this normalization. Therefore, we have,
YOSO(Q, K, V) = B(Q, K) V / ‖B(Q, K) V‖₂  (applied row-wise)    (8)
Empirically, as expected, we find that this normalization does not affect the performance of our method (discussed in the experiments).
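The scaling invariance that motivates this choice is easy to see in code (a minimal sketch; the helper name is ours, not the paper's):

```python
import numpy as np

# Row-wise l2 normalization, as in the normalized attention output above:
# rescaling the pre-normalization output by any c > 0 leaves the result unchanged.
def l2_normalize_rows(Y):
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

rng = np.random.default_rng(6)
Y = rng.standard_normal((5, 8))          # stand-in for B(Q, K) V
Z = l2_normalize_rows(Y)
assert np.allclose(Z, l2_normalize_rows(3.7 * Y))   # invariant to positive rescaling
assert np.allclose(np.linalg.norm(Z, axis=1), 1.0)  # unit-length output rows
```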
3.2 LSH-based Bernoulli Sampling
Now, we discuss how to actually implement the idea of using Bernoulli sampling to approximate self-attention. While a standard LSH procedure could be used, maintaining hash tables that store keys is inefficient on a GPU: the GPU memory required for the hash tables cannot be predetermined, and the workload may be skewed due to skewed bucket sizes. Due to how our Bernoulli sampling is set up, it turns out that simply saving the summation of the values corresponding to the hashed keys is sufficient (instead of storing a full collection of hashed keys).
Overview. An outline of our algorithm is shown in Figure 3. To compute B(Q, K)V, the procedure proceeds as follows. We sample a hash function f and create a hash table H of 2^τ d-dimensional buckets. For each key k_j, we add the value v_j to the bucket whose index is the hash code f(k_j), denoted H_{f(k_j)},
H_{f(k_j)} ← H_{f(k_j)} + v_j,   j = 1, …, n    (9)
Note that the size of H is O(2^τ d) and is independent of which buckets the keys are hashed to. Once all keys are processed, for each query q_i, we maintain an output vector o_i initialized to 0; we then look up the bucket in H using the hash code f(q_i) and use its contents as the attention output for q_i. Therefore, each final output can be computed as,
o_i = H_{f(q_i)} = ∑_{j : f(k_j) = f(q_i)} v_j    (10)
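The bucket-accumulation step in (9)-(10) can be sketched as follows (an illustrative NumPy version assuming hyperplane hashes; the paper's GPU implementation differs):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, tau = 512, 16, 6

def unit_rows(A):
    return A / np.linalg.norm(A, axis=1, keepdims=True)

Q, K = unit_rows(rng.standard_normal((n, d))), unit_rows(rng.standard_normal((n, d)))
V = rng.standard_normal((n, d))

H = rng.standard_normal((tau, d))                  # one sampled hash function f
bits_k = (H @ K.T > 0).astype(int)                 # tau x n sign bits
bits_q = (H @ Q.T > 0).astype(int)
codes_k = (2 ** np.arange(tau)) @ bits_k           # integer bucket index f(k_j)
codes_q = (2 ** np.arange(tau)) @ bits_q

# Each of the 2^tau buckets stores only the SUM of the values hashed to it,
# so a query's single-hash attention output is one table lookup.
table = np.zeros((2 ** tau, d))                    # O(2^tau * d) memory
np.add.at(table, codes_k, V)                       # H_b = sum of v_j with f(k_j) = b
out = table[codes_q]                               # o_i = H_{f(q_i)}

# Same quantity computed naively with the explicit n x n collision matrix.
B = (codes_q[:, None] == codes_k[None, :]).astype(float)
assert np.allclose(out, B @ V)
```

No keys are stored, and the table size is fixed in advance, independent of how skewed the buckets are.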
Remark 3. The memory and time complexity of this algorithm are O(2^τ d) and O(nd), respectively. In addition, both time and memory are independent of the sizes of the hash buckets. We can further improve the memory complexity by reusing the hash table and processing a few dimensions of the values at a time, without increasing the time complexity. The constant τ (the number of hyperplanes per hash) is a hyperparameter that controls the decay rate of the attention weights with respect to the angular distance between query and key.
Speedup. While not essential, we find that fast random projections for computing the LSH hash codes are beneficial, since this step takes a large portion of the overall runtime. As suggested by Andoni et al. (2015), we use an approximated random projection to reduce the cost of computing each hash code to O(d log d), allowing fast computation of hash codes (details in the appendix).
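One common way to realize such an approximated projection (a sketch in the spirit of Andoni et al. (2015), not necessarily their exact construction) is to replace a dense Gaussian matrix by a few rounds of random sign flips followed by fast Walsh-Hadamard transforms, each round costing O(d log d):

```python
import numpy as np

def fwht(x):
    """Iterative fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

def structured_projection(x, signs):
    """Apply (H D1)(H D2)(H D3) x: three sign-flip + Hadamard rounds, O(d log d) each."""
    for D in signs:
        x = fwht(D * x)
    return x / len(x) ** 1.5          # each H scales norms by sqrt(d); undo d^{3/2}

rng = np.random.default_rng(4)
d = 64
signs = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]
x = rng.standard_normal(d)
y = structured_projection(x, signs)
```

Since each sign-flip and (rescaled) Hadamard round is orthogonal, the transform preserves norms while behaving like a random rotation for hashing purposes.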
3.3 Backpropagation
For training, we also need to show that the backward propagation steps for YOSO-Attention are feasible. Here, we discuss this last component of YOSO-Attention, which enables efficient end-to-end training.
Table 1: Time and memory complexity of the forward and backward passes for softmax self-attention versus YOSO-Attention.
For backpropagation, the gradient of the loss L with respect to V (writing A = B(Q, K)V for the attention output) can be estimated similarly to (4),
∂L/∂V = B(Q, K)^⊤ (∂L/∂A) = B(K, Q) (∂L/∂A)    (11)
The gradients of L with respect to Q and K are similar, so we only provide the expression for Q,
∂L/∂Q = [((∂L/∂A) V^⊤) ⊙ (τ/π)(1 − arccos(QK^⊤)/π)^{τ−1} ⊘ √(1 − (QK^⊤)²)] K    (12)
where ⊘ and ⊙ denote elementwise division and multiplication, and arccos, powers and the square root are applied elementwise. One issue with the true gradient is that it goes to infinity as the alignment score between a query and a key approaches 1, which might lead to divergence. To avoid this numerical issue, we use a lower bound of the actual derivative of the collision probability, obtained by dropping the elementwise division,
∂L/∂Q ≈ [((∂L/∂A) V^⊤) ⊙ (τ/π)(1 − arccos(QK^⊤)/π)^{τ−1}] K    (13)
The empirical behavior of this bound is shown in Figure 2, and it can be efficiently estimated via a variation of LSH-based Bernoulli sampling. Specifically, note that the matrix (1 − arccos(QK^⊤)/π)^{τ−1} is itself a collision probability matrix (with τ−1 hyperplanes), so the approximation can be decomposed into a sum of LSH-based Bernoulli sampling runs, one per dimension of the values,
[((∂L/∂A) V^⊤) ⊙ B(Q, K)] K = ∑_{c=1}^{d_h} diag((∂L/∂A)_{:,c}) B(Q, K) diag(V_{:,c}) K    (14)
Since this needs d_h runs of a subroutine whose complexity is O(2^τ d_h) in memory and O(n d_h) in time, its memory complexity is O(2^τ d_h²) and its time complexity is O(n d_h²). The extra factor of d_h in the memory complexity can be eliminated by repeatedly reusing the same hash tables d_h times without increasing the runtime, which improves the memory complexity to O(2^τ d_h). The overall complexity of our method relative to softmax self-attention is shown in Table 1. To address the quadratic dependence on d_h, we describe in the appendix a scheme to estimate the quantity (13) with a cost that is linear in d_h, and similarly for estimating (12). The models trained using the latter estimate are identified as *YOSO in our experiments.
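The clipping that leads from (12) to (13) can be checked numerically: under the collision probability in (6), dropping the 1/√(1−t²) factor from dp/dt yields a finite lower bound, since 1/√(1−t²) ≥ 1 on (−1, 1). A quick sketch:

```python
import numpy as np

# p(t) = (1 - arccos(t)/pi)^tau, so
# dp/dt = tau * (1 - arccos(t)/pi)^(tau-1) / (pi * sqrt(1 - t^2)),
# which blows up as t -> 1. The bound drops the 1/sqrt(1 - t^2) factor.
tau = 4
t = np.linspace(-0.999, 0.999, 1001)
true_grad = tau * (1 - np.arccos(t) / np.pi) ** (tau - 1) / (np.pi * np.sqrt(1 - t ** 2))
lower_bound = tau / np.pi * (1 - np.arccos(t) / np.pi) ** (tau - 1)
print(np.all(lower_bound <= true_grad + 1e-12))  # True: a lower bound everywhere
```

Unlike the true derivative, the bound stays finite as the alignment score approaches 1.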
4 Experiments
In this section, we analyze YOSO experimentally and evaluate its performance. In the previous section, we assumed that queries and keys are unit length and described how to arrange this. In the experiments, we found that simply applying an ℓ2 normalization to the queries and keys does not degrade performance and is more efficient to compute, so we use this simpler version in the experiments.
For empirical evaluations, we evaluate YOSO-Attention on BERT language-model pretraining followed by finetuning on GLUE downstream tasks. Then, we compare our method with other efficient Transformer baselines using a small version of BERT and the LRA benchmark. As a sanity check, we also include YOSO-E, a variant where the expected attention weights are directly computed from the collision probabilities. This represents the behavior of YOSO as the number of hashes tends to infinity. We verified that in all tasks, YOSO-E behaves similarly to softmax self-attention. Further, we demonstrate that the performance of YOSO-m (YOSO-Attention using m hashes) generally converges to YOSO-E as m increases. Also, training using the backpropagation estimate in (13) (denoted YOSO) converges in all tasks, but the backpropagation estimate based on (12) (denoted *YOSO) is slightly better. When compared to other efficient Transformer baselines, we show that our proposal performs favorably while maintaining high efficiency in both time and memory. Finally, we empirically verify that the approximation error of YOSO-m stays relatively flat as the sequence length increases.
4.1 Language Modeling
Method  MLM  SOP  MRPC  SST-2  QNLI  QQP  MNLI-m/mm
Softmax  4.65  94.2  88.3  91.1  90.3  87.3  82.4/82.4
YOSO-E  4.54  94.4  88.1  92.3  90.1  87.3  82.2/82.9
YOSO-64  4.79  94.2  88.1  91.5  89.5  87.0  81.6/81.6
YOSO-32  4.89  93.5  87.3  90.9  89.0  86.3  80.5/80.7
YOSO-16  5.14  92.8  87.1  90.7  88.3  85.3  79.6/79.5
*YOSO-32  4.89  93.5  87.6  91.4  90.0  86.8  80.5/80.9
*YOSO-16  5.02  93.4  87.7  90.8  88.9  86.7  80.6/80.5
To evaluate YOSO, we follow the BERT language-model pretraining procedure (Devlin et al., 2019) and evaluate the performance of our method on both intrinsic tasks and multiple downstream tasks in the GLUE benchmark.
BERT Pretraining. Following Devlin et al. (2019), the model is pretrained on BookCorpus (Zhu et al., 2015) and English Wikipedia. To evaluate the capacity of the model to capture sentence-level information, instead of using Next Sentence Prediction (NSP) as the sentence-level loss as in the original BERT, we adopt Sentence Order Prediction (SOP) from ALBERT (Lan et al., 2020), which is more difficult than NSP. All models are trained with Masked Language Modeling (MLM) and SOP objectives. We use the same hyperparameters for pretraining as Devlin et al. (2019); however, to keep the compute needs manageable, all models are trained for fewer steps than in the original recipe.
Number of Hashes during Pretraining. Since the estimation variance decreases as the number of hashes increases, to evaluate the trade-off between efficiency and performance in YOSO, we test multiple hash settings: (*)YOSO-16, (*)YOSO-32, YOSO-64, and finally, YOSO-E (to simulate infinitely many hashes). We plot the MLM validation perplexity and SOP validation loss curves of models pretrained with softmax self-attention and YOSO-Attention (Fig. 4, right) and report the MLM validation perplexity and SOP accuracy in Table 2. The curve of YOSO-E agrees with and slightly exceeds softmax self-attention, indicating that YOSO is indeed as effective as self-attention. As expected, as the number of hashes increases, the performance of YOSO approaches YOSO-E, since the approximation becomes more accurate. For both MLM and SOP, we confirm that YOSO is as effective as softmax self-attention.
Number of Hashes during Validation. YOSO-Attention is a stochastic model. To make inference deterministic, as in dropout (Srivastava et al., 2014), we would ideally output the expectation. However, directly computing the expectation involves an O(n²) cost, so we experiment with different hash settings at validation time and simulate the expectation as the number of hashes increases. We plot the MLM perplexity and SOP loss of the same pretrained models using different numbers of hashes at validation in Figure 5. We observe that as the number of hashes increases, the MLM perplexity and SOP loss generally decrease for all pretraining hash settings.
GLUE. We examine the effectiveness of our method on diverse downstream tasks and ask how YOSO compares with softmax self-attention after finetuning. We finetuned all pretrained BERT-base models on the MRPC (Dolan and Brockett, 2005), SST-2 (Socher et al., 2013), QNLI (Rajpurkar et al., 2016), QQP (Chen et al., 2018), and MNLI (Williams et al., 2018) tasks in the GLUE benchmark and report the corresponding dev metrics. For the large datasets, QNLI, QQP, and MNLI, due to extensive resource needs, we did not perform a hyperparameter search; we used a batch size of 32 and a learning rate of 3e-5, and finetuned our models for 4 epochs. For MRPC and SST-2, we follow the BERT finetuning procedure and search over batch sizes {8, 16, 32} and learning rates {2e-5, 3e-5, 4e-5, 5e-5}, selecting the best dev set result. Results are listed in Table 2. We observe that YOSO's performance on downstream tasks is comparable with softmax self-attention, and even slightly better in some hash settings. Further, the downstream performance of YOSO generally increases with more hashes, providing an adjustable trade-off between efficiency and accuracy.

4.2 Performance considerations relative to baselines
Method  ListOps  Text  Retrieval  Image  Pathfinder  Avg
Sequence Length  2K  4K  4K  1K  1K  
None  19.20  61.11  74.78  33.86  66.51  51.09
Softmax  37.10  65.02  79.35  38.20  74.16  58.77
YOSO-E  37.30  64.71  81.16  39.78  72.90  59.17
Nyströmformer  37.15  65.52  79.56  41.58  70.94  58.95
Longformer  37.20  64.60  80.97  39.06  73.01  58.97
Linformer  37.25  55.91  79.37  37.84  67.60  55.59
Reformer  19.05  64.88  78.64  43.29  69.36  55.04
Performer  18.80  63.81  78.62  37.07  69.87  53.63
YOSO-32  37.25  63.12  78.69  40.21  72.33  58.32
YOSO-C-16  37.40  64.28  77.61  44.67  71.86  59.16
*YOSO-16  37.20  62.97  79.02  40.50  72.05  58.35
*YOSO-C-16  37.35  65.89  78.80  45.93  71.39  59.87
We also evaluate how well our method performs compared to other efficient Transformer baselines. We compare YOSO with Nyströmformer (Xiong et al., 2021), Longformer (Beltagy et al., 2020), Linformer (Wang et al., 2020), Reformer (Kitaev et al., 2020), and Performer (Choromanski et al., 2021) on a small version of BERT and on the LRA benchmark. The same model-specific hyperparameters are also used in the efficiency profiles in Figure 7. Further, inspired by Nyströmformer, we also test adding a depthwise convolution, referred to as YOSO-C-x (where x is the number of hashes). The experimental results indicate that the depthwise convolution improves the performance of our method on some tasks.
BERT-Small. For the BERT pretraining task, due to the large resource needs of running all baselines in the BERT-base setting, we evaluate all methods in a BERT-small setting (4 layers, 512 dimensions, 8 heads). Since the attention window size of Longformer is the same as the maximal sequence length of the input, it provides full self-attention, similar to softmax, in this setting. Softmax self-attention achieves 7.05 MLM validation perplexity and 91.3% SOP validation accuracy on this task, and YOSO (with convolution) achieves 7.34 MLM validation perplexity and 89.6% SOP validation accuracy. Here, YOSO-C is comparable to softmax self-attention and Nyströmformer while performing favorably relative to Reformer and Performer. For the GLUE tasks, we found that 99% of the instances in MRPC, SST-2, QNLI, QQP, and MNLI are short enough that the chunked attention window in Reformer can capture full attention across all tokens, so we expect Reformer to perform similarly to softmax self-attention in this setting. We provide results for QNLI, QQP, and MNLI for all baselines in the appendix (which also includes results on the MLM and SOP pretraining tasks).
LRA Benchmark. To evaluate the generalization of YOSO to diverse tasks and its viability on longer-sequence tasks, we run our method on the LRA benchmark (Tay et al., 2021) and compare it with the standard Transformer as well as other efficient Transformer baselines. This benchmark consists of five tasks: ListOps (Nangia and Bowman, 2018), byte-level IMDb review classification (Text) (Maas et al., 2011), byte-level document matching (Retrieval) (Radev et al., 2013), pixel-level CIFAR-10 classification (Image) (Krizhevsky et al., 2009), and pixel-level Pathfinder (Linsley et al., 2018). These tasks are designed to assess different aspects of an efficient Transformer and provide a comprehensive analysis of its generalization on longer-sequence tasks.

Since the code release from Tay et al. (2021) only runs on TPUs, and the hyperparameters are not known, we follow the experimental settings in Xiong et al. (2021). We include a model without self-attention, labeled "None", as a reference to show how much each baseline helps in modeling long sequences. The results are shown in Table 3. The performance of YOSO compared to softmax self-attention on LRA tasks is consistent with the results we reported for language modeling. For baseline comparisons, YOSO is comparable to Longformer and Nyströmformer and outperforms all other baselines by 3% average accuracy across the five tasks. Further, with a depthwise convolution, YOSO outperforms all baselines. These results provide direct evidence for the applicability of YOSO to longer sequences.
4.3 Efficiency considerations relative to baselines
The overall thrust of efficient Transformer models is to retain the capacity of a standard Transformer while reducing the time and memory cost of self-attention. In this section, we profile the running time and memory consumption of our method, the vanilla Transformer, and other efficient Transformers for different sequence lengths. We use a Transformer model with 6 layers, 256 embedding dimensions, 1024 hidden dimensions and 4 attention heads, and measure runtime and peak memory consumption using random inputs. To achieve the best efficiency for each baseline, for each method and each sequence length, we use the largest batch size that fits into memory, run training for 10 steps, and average the results to estimate the time and memory cost of a single instance. The experiments were performed on a single NVIDIA 2080 Ti. The results are shown in Figure 7. While Longformer scales linearly with respect to the sequence length, the benefit only materializes for longer sequences, which is consistent with (Beltagy et al., 2020). Detailed results and further experiments on efficiency are provided in the appendix. The profiling results indicate that YOSO scales efficiently with the input sequence length. Further, the results suggest that YOSO is highly efficient in terms of runtime and offers the best memory efficiency among the baselines.
4.4 How large is the approximation error?
To assess the estimation error of YOSO, we generate attention matrices of YOSO using queries and keys from a trained model and compare them against softmax self-attention. In Figure 6, visually, our method produces attention patterns similar to softmax self-attention, and the estimate of the attention matrix becomes more accurate as the number of hashes increases. Further, in the formulation of YOSO, each output of YOSO-Attention is a weighted sum of random variables, as shown in (3); so one may suspect that as the sequence length increases, the variance of the YOSO-Attention output might also increase. To assess this, we use queries, keys, and values from a trained model and measure the average angle between the outputs of YOSO-E and YOSO-m for increasing sequence lengths. Since YOSO outputs unit vectors, only the vector direction is meaningful, so we use the radian between the outputs of YOSO-E and YOSO-m to assess the approximation error. The result is shown in Figure 8. The x-axis uses a log scale to show that the approximation error increases at a much slower rate than the sequence length. One explanation is that most attention weights are near zero, see Figure 6; this means that the variances of the corresponding attention weights are also near zero. In most cases, the number of strong dependencies (large attention weights) is relatively independent of the sequence length. As a result, an increase in sequence length does not introduce a commensurate amount of approximation error.
5 Conclusion
We presented YOSO-Attention, a Transformer-based model whose self-attention scales linearly in the number of tokens, making YOSO applicable to long-document tasks. Via a randomized sampling scheme, YOSO approximates self-attention as a sum of individual tokens associated with Bernoulli random variables which, in principle, can all be sampled at once with a single hash. With specific modifications of LSH, YOSO-Attention can be efficiently deployed within a deep learning framework, and we expect various aspects of this idea and our implementation to find use in novel settings (e.g., point cloud modeling and vision). Our preliminary work suggests that YOSO has potential applications beyond Transformers, e.g., scalability benefits for kernel density estimation and differentiable LSH.
Acknowledgments
This work was supported by the University of Wisconsin Data Science Institute through funding provided by American Family Insurance. YX and VS were also supported by the University of Wisconsin Center for Predictive Computational Phenotyping (CPCP), funded by NIH U54 AI117924. SNR was supported by UIC-ICR start-up funds. The authors thank Rudrasis Chakraborty for many helpful discussions.
References
Practical and optimal LSH for angular distance. In Advances in Neural Information Processing Systems, Vol. 28.
Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (STOC '02), New York, NY, USA, pp. 380–388.
Hashing-based estimators for kernel density in high dimensions. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 1032–1043.
Quora question pairs. Quora.
Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
Rethinking attention with Performers. In International Conference on Learning Representations.
BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
Transformers are RNNs: fast autoregressive transformers with linear attention. In Proceedings of the 37th International Conference on Machine Learning, PMLR Vol. 119, pp. 5156–5165.
Reformer: the efficient Transformer. In International Conference on Learning Representations.
Learning multiple layers of features from tiny images.
ALBERT: a lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.
Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, pp. 211–225.
Learning long-range spatial dependencies with horizontal gated recurrent units. In Advances in Neural Information Processing Systems, Vol. 31.
RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 142–150.
ListOps: a diagnostic dataset for latent tree learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, New Orleans, Louisiana, USA, pp. 92–99.
On symmetric and asymmetric LSHs for inner product search. In Proceedings of the 32nd International Conference on Machine Learning, PMLR Vol. 37, Lille, France, pp. 1926–1934.
Random feature attention. In International Conference on Learning Representations.
Numerical Recipes: The Art of Scientific Computing, 3rd edition. Cambridge University Press. ISBN 9780521706858.
The ACL Anthology network corpus. Language Resources and Evaluation 47(4), pp. 919–944.
Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), pp. 1–67.
SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392.
Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642.
Scalable estimation via LSH samplers (LSS). In International Conference on Learning Representations, Workshop Track Proceedings.
Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(56), pp. 1929–1958.
Long Range Arena: a benchmark for efficient Transformers. In International Conference on Learning Representations.
Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30.
Linformer: self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1112–1122.
Nyströmformer: a Nyström-based algorithm for approximating self-attention. Proceedings of the AAAI Conference on Artificial Intelligence 35(16), pp. 14138–14148.
XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, Vol. 32.
Big Bird: transformers for longer sequences. In Advances in Neural Information Processing Systems, Vol. 33, pp. 17283–17297.
Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 19–27.