1. Introduction
Since the introduction of the self-attention mechanism in Transformers (Vaswani et al., 2017), it has seen incredible success in sequence modeling tasks across a variety of fields, such as machine translation (Chen et al., 2018a), object detection (Wang et al., 2018), music generation (Huang et al., 2018) and bioinformatics (Madani et al., 2020). Recently, self-attention has also demonstrated its formidable power in recommendation (Kang and McAuley, 2018; Zhang et al., 2018; Sun et al., 2019).
However, despite the impressive performance attributable to its ability to identify complex dependencies between elements in input sequences, self-attention-based models suffer from soaring computational and memory costs as sequences grow longer. Because attention scores are computed over the entire sequence for each token, self-attention takes $O(L^2)$ operations to process an input sequence of length $L$. This hinders the scalability of models built on self-attention in many settings.
Recently, a number of solutions have been proposed to address this issue. The majority of these approaches (Kitaev et al., 2019; Zaheer et al., 2020; Roy et al., 2020; Tay et al., 2020; Child et al., 2019; Ainslie et al., 2020; Beltagy et al., 2020) leverage sparse attention patterns, limiting the number of keys that each query can attend to. Although these sparse patterns can be established in a variety of content-dependent ways, such as LSH (Kitaev et al., 2019), sorting (Tay et al., 2020) and k-means clustering (Roy et al., 2020), crucial information may be lost by clipping the receptive field for each query. While these methods successfully reduce the cost of computing attention weights from $O(L^2)$ to $O(Lb)$, where $b$ is the fixed bucket size, extra cost is incurred in assigning the keys/values to buckets. This cost is typically still quadratic with respect to the bucket size, and it may cause significant overhead on shorter sequences. We observe that Reformer (Kitaev et al., 2019) can be 7.6x slower than the vanilla Transformer on sequences of length 128. Other techniques have also been employed to improve the efficiency of self-attention. For instance, low-rank approximations of the attention weight matrix are used in (Wang et al., 2020). This method, however, only supports a bidirectional attention mode and assumes a fixed length of input sequences.

We observe that self-attention essentially computes a weighted average of the input sequence for each query, where the weights are computed based on the inner products between the query and the keys. For each query, keys with larger inner products receive more attention. We relate this to the Maximum Inner Product Search (MIPS) problem. The MIPS problem is of great importance in many machine learning problems
(Koren et al., 2009; Felzenszwalb et al., 2009; Shrivastava and Li, 2014), and fast approximate MIPS algorithms are well studied. Among them, vector (product) quantization (Gray and Neuhoff, 1998; Guo et al., 2016; Dai et al., 2020) has been a popular and successful approach. Armed with vector quantization, we no longer have to exhaustively compute the inner product between a given query and all the points in the database: we only need to compute it for the $K$ centroids (i.e., codewords), where $K$ is a budget hyperparameter. We thereby avoid redundant computation, since the points belonging to the same centroid share the same inner product with the query.
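To make the savings concrete, the following NumPy sketch (with illustrative sizes and random data; all names and the max-inner-product assignment rule are our own toy stand-ins, not taken from any MIPS library) contrasts exhaustive scoring with quantized scoring:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 1000, 16, 32          # database size, dimension, centroid budget

points = rng.normal(size=(N, D))
# Assign each point to the centroid maximizing the inner product
# (a toy stand-in for a learned codebook).
centroids = rng.normal(size=(K, D))
assign = np.argmax(points @ centroids.T, axis=1)

query = rng.normal(size=D)

# Exhaustive MIPS: N inner products.
exact_scores = points @ query

# Quantized MIPS: only K inner products; points sharing a centroid
# share the same (approximate) score.
centroid_scores = centroids @ query
approx_scores = centroid_scores[assign]
```

Only $K$ inner products are computed instead of $N$; all $N$ approximate scores are recovered by indexing, which is exactly the redundancy-avoidance argument above.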
The idea of vector quantization has also been applied to compress the item embedding matrix and improve the memory and search efficiency of recommender systems (Chen et al., 2018b; Lian et al., 2020b). In the state-of-the-art lightweight recommendation model LightRec (Lian et al., 2020b), a set of $B$ differentiable codebooks are used to encode items, each of which is composed of $W$ codewords. An item is represented by a composition of the most similar codeword within each codebook. Hence we only need to store the indices of its corresponding codewords, instead of its embedding vector. Since a codeword index in a codebook can be compactly encoded with $\log_2 W$ bits, the overall memory requirement to store item representations can be reduced from $4ND$ bytes to $NB\log_2 W / 8 + 4BWD$ bytes (Lian et al., 2020b).
Inspired by the benefit that redundant inner product computations can be circumvented in MIPS algorithms based on vector quantization, and by the ability of codebooks to quantize any embedding matrix, we propose LISA (LInear-time Self-Attention), an efficient attention mechanism based on computing codeword histograms. Equipped with a series of codebooks to encode items (or any form of tokens), LISA can dramatically reduce the cost of inner product computation in a similar vein. Since each item (token) is represented as a composition of codewords, the entire input sequence can be compressed into a histogram of codewords for each codebook (illustrated in Figure 1), and we are essentially performing attention over codewords. The histograms are used to compute the attention weight matrix in $O(LBW)$ time. We then pool over the codewords with the attention weights to get the outputs. To enable self-attention in a unidirectional setting (i.e., with causal masking (Kitaev et al., 2019)), we resort to prefix-sums and compute a histogram at each position of the sequence.
Compared to efficient attention methods that rely on sparse patterns, our proposed method performs full contextual attention over the input sequence, with computational and memory complexity linear in the sequence length. Our proposed method also enjoys the compression of item embeddings brought by LightRec. In particular, in an online recommendation setting, our method can encapsulate a user's entire history with a fixed-size histogram, greatly reducing the storage costs.
Our contributions can be summarized as follows:


We propose LISA (LInear-time Self-Attention), a novel attention mechanism for efficient recommendation that reduces the complexity of computing attention scores from $O(L^2)$ to $O(LBW)$, while simultaneously enabling model compression. The total number of codewords $BW$ is a budget hyperparameter balancing performance and speed.

We also propose two variants of LISA: one allows soft codeword assignments, and the other uses a separate codebook to encode sequences. These techniques allow us to use much smaller codebooks, resulting in further efficiency improvements.

We conduct extensive experiments on four real-world datasets. Our proposed method obtains performance similar to vanilla self-attention, while significantly outperforming the state-of-the-art efficient attention baselines in both effectiveness and efficiency.
2. Related Work
2.1. Applications of Self-Attention Mechanisms
The scaled dot-product self-attention introduced in Transformers (Vaswani et al., 2017) has been extensively used in natural language understanding (Devlin et al., 2019; Xu et al., 2020). As a powerful mechanism that connects all tokens in the input with a relevance-based pooling operation, self-attention has also made tremendous impact in various other domains like computer vision (Xiang et al., 2020; Zhang et al., 2019) and graph learning (Veličković et al., 2017).

Recently, self-attention networks have been successfully applied to sequential recommendation. Kang and McAuley (Kang and McAuley, 2018) adapted a Transformer architecture by optimizing a binary cross-entropy loss based on inner product preference scores, while Zhang et al. (Zhang et al., 2018) proposed to optimize a triplet margin loss based on Euclidean distance preferences. Self-attention has also been used for geographical modeling in location recommendation (Lian et al., 2020c, a). These models have demonstrated significant performance improvements over RNN-based models.
2.2. Improving Efficiency of Attention
Considerable effort has been made to scale Transformers to long sequences. Transformer-XL (Dai et al., 2019) captures longer-term dependencies by employing a segment-level recurrence mechanism, which splits the inputs into segments to perform attention. Sukhbaatar et al. (Sukhbaatar et al., 2019) limited the self-attention context to the closest samples. However, these techniques do not improve the asymptotic complexity of self-attention.
In another line of work, attempts have been made to reduce the asymptotic complexity. Child et al. (Child et al., 2019) proposed to factorize the attention computation into local and strided patterns. Tay et al. (Tay et al., 2020), on the other hand, improved local attention by introducing a differentiable sorting network to re-sort the buckets. Reformer (Kitaev et al., 2019) hashes the queries/keys into buckets via hashing functions based on random projections, and attention is computed within each bucket. In a similar manner, Roy et al. (Roy et al., 2020) assign tokens to buckets through clustering. Built on top of ETC (Ainslie et al., 2020), Big Bird (Zaheer et al., 2020) considers a mixture of various sparse patterns, including sliding window attention and random attention. Clustered Attention, introduced in (Vyas et al., 2020), instead groups queries into clusters and performs attention on centroids. Linformer (Wang et al., 2020) resorts to a low-rank projection on the length dimension; however, it can only operate in a bidirectional mode without causal masking.

Most of the aforementioned approaches rely on sparse attention patterns, while our method performs full contextual attention over the whole sequence. Besides, Linformer and Sinkhorn Transformer assume a fixed sequence length due to their use of projection and sorting networks, while our method poses no such constraint. Our method is also notably faster than the existing approaches, enjoying an asymptotic complexity of $O(LBW)$, while the codeword inner products can be precomputed and stored in lookup tables.
3. Methodology
In this section, we first quickly go through the underlying preliminaries. Then we introduce our proposed method step by step, starting from a simple case, and propose two more variants for further efficiency improvements. Finally, we analyze the complexity of our method.
3.1. Preliminaries
3.1.1. Regular Self-Attention Mechanism
The vanilla dot-product attention, introduced in (Vaswani et al., 2017), accepts matrices $Q$, $K$, $V$ representing queries, keys and values, and computes the following outputs:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{D}}\right) V \qquad (1)$
In the self-attention setting, we let the input sequence attend to itself. Concretely, given an input sequence $X \in \mathbb{R}^{L \times D}$, we linearly project $X$ via three matrices $W^Q$, $W^K$ and $W^V$ to get $Q$, $K$ and $V$. The results are then computed using Eq. (1). This operation can be interpreted as computing, for every position in the sequence, a weighted average over all other positions.
Self-attention has already been widely used in recommendation (Kang and McAuley, 2018; Sun et al., 2019; Xu et al., 2019; Yu et al., 2019). Kang and McAuley (Kang and McAuley, 2018) used self-attention along with the feed-forward network from (Vaswani et al., 2017) to encode users' sequential behaviors, and recommend the next item by computing the inner product between the encoded representation and the target items' embeddings.
However, the computation of Eq. (1) suffers from quadratic computational and memory complexity, as computing the attention scores (the softmax term) and performing the weighted average both require $O(L^2)$ operations.
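For reference, the vanilla mechanism can be sketched in a few lines of NumPy (a minimal illustration with random data; the variable names and sizes are ours). The $L \times L$ score matrix is what makes both time and memory quadratic:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Vanilla scaled dot-product self-attention, Eq. (1); O(L^2) time and memory."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # L x L attention score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted average of values

rng = np.random.default_rng(0)
L, D = 64, 32
X = rng.normal(size=(L, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```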
3.1.2. Embedding Quantization with Codebooks
Our efficient attention method is motivated by the idea of using codebooks to compress the embedding matrix (Lian et al., 2020b; Jegou et al., 2010; Ge et al., 2013; Chen et al., 2018b). LightRec, proposed in (Lian et al., 2020b), encodes items with a set of $B$ codebooks, each containing $W$ $D$-dimensional codewords that serve as a basis of the latent space. An item embedding $e_i$ can be approximately encoded as:
$\tilde{e}_i = \sum_{b=1}^{B} c^b_{w_i^b}, \qquad w_i^b = \arg\max_{j} \, s(e_i, c^b_j) \qquad (2)$
where $s(\cdot, \cdot)$ is a similarity metric between two vectors; in LightRec, a bilinear similarity function $s(x, y) = x^\top \mathbf{M} y$ is adopted, where $\mathbf{M}$ contains learnable weights. $c^b_j$ denotes the $j$-th codeword in the $b$-th codebook.
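A toy NumPy sketch of this encoding (random data; for simplicity we use a plain inner product in place of LightRec's learned bilinear similarity, and all names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, B, W = 500, 16, 4, 8      # items, dim, codebooks, codewords per codebook

embeddings = rng.normal(size=(N, D))
codebooks = rng.normal(size=(B, W, D))

# Hard assignment: per codebook, pick the codeword with the highest
# inner product (stand-in for the learned similarity s).
indices = np.stack([np.argmax(embeddings @ codebooks[b].T, axis=1)
                    for b in range(B)], axis=1)     # N x B codeword indices

# Reconstruction as in Eq. (2): sum the selected codeword from each codebook.
reconstructed = sum(codebooks[b][indices[:, b]] for b in range(B))

# Storage per item: B * log2(W) bits of indices instead of 4 * D bytes.
bits_per_item = B * int(np.log2(W))
```

With $B = 4$ and $W = 8$, each item needs only 12 bits of indices, versus 64 bytes for a float32 embedding.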
At training time, the codebooks and the item embeddings can be jointly trained using a softmax relaxation and the straight-through estimator (Bengio et al., 2013). At the inference stage, the item embeddings can be discarded completely. For each item $i$, we only store its corresponding codeword indices in each codebook, i.e., $(w_i^1, \dots, w_i^B)$. Because each codeword index can be encoded with $\log_2 W$ bits, the memory cost of storing $N$ items is reduced from $4ND$ bytes to $NB\log_2 W / 8 + 4BWD$ bytes, where the first term is for the codeword indices, and the second term is for the codebooks.

3.2. Motivation: A Simple Case
To illustrate the motivation behind our proposed method, we first look at a simple case where a single codebook is used to encode items.
In this case, an item is directly represented by the codeword with the maximum relevance score to it. The $i$-th item in the sequence is therefore given by $x_i = c_{w_i}$, where $w_i = \arg\max_j s(x_i, c_j)$ and $c_j$ denotes the $j$-th codeword in the codebook. Then, to perform dot-product attention for a query $q$ (with keys and values being the sequence $X$), we compute the inner product between $q$ and the corresponding codeword for every item in the sequence. The output of the attention is computed as follows:
$o = \sum_{i=1}^{L} \frac{\exp(q^\top c_{w_i})}{\sum_{j=1}^{L} \exp(q^\top c_{w_j})} \, c_{w_i} \qquad (3)$
where $L$ is the sequence length. For the sake of simplicity, we omit the projection matrices at this moment. From the above equation, we observe that we may repeatedly compute the inner product of $q$ with the same codeword $c_k$, since a number of items in the sequence may all share $c_k$ as their representation. This redundant computation significantly hampers efficiency, especially when $|\mathcal{K}| \ll L$, where $\mathcal{K}$ is the set of unique codeword indices that the items in the sequence correspond to, i.e., $\mathcal{K} = \{w_i \mid i = 1, \dots, L\}$.

To address this issue, we note that $o$ is just a weighted average of all the codewords in $\mathcal{K}$, and the weight of each codeword depends only on its inner product with $q$ and its number of occurrences. Therefore, we only need to count how many times each codeword in $\mathcal{K}$ is used in the sequence, and compute the inner product of $q$ with each codeword once. The computation of Eq. (3) can be reformulated as:
$o = \sum_{k \in \mathcal{K}} \frac{n_k \exp(q^\top c_k)}{\sum_{k' \in \mathcal{K}} n_{k'} \exp(q^\top c_{k'})} \, c_k \qquad (4)$

where $n_k$ is the number of occurrences of codeword $c_k$ in the sequence.
We illustrate this idea in Figure 2.
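The two formulations are exactly equivalent, which the following NumPy sketch verifies numerically (random toy data; sizes and names are ours). The exhaustive version computes $L$ inner products, the histogram version only $W$:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, W = 200, 16, 8            # sequence length, dim, codebook size

codewords = rng.normal(size=(W, D))
w = rng.integers(0, W, size=L)  # codeword index of each item in the sequence
q = rng.normal(size=D)

# Eq. (3): exhaustive attention over all L positions (redundant inner products).
logits = codewords[w] @ q
weights = np.exp(logits) / np.exp(logits).sum()
out_exhaustive = weights @ codewords[w]

# Eq. (4): attention over the codeword histogram -- only W inner products.
counts = np.bincount(w, minlength=W)        # n_k: occurrences of each codeword
scores = counts * np.exp(codewords @ q)     # n_k * exp(q . c_k)
out_histogram = (scores / scores.sum()) @ codewords
```

Grouping identical terms in the softmax numerator and denominator is all that changes, so the outputs match to floating-point precision.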
3.3. Linear-Time Self-Attention
As we can see, the codebook mechanism allows us to obtain the exact result of dot-product attention with less computation (both in computing the attention scores and in computing the final weighted average), at least in the case of a single codebook. Now we turn to the case where multiple codebooks are used. The items in the sequence are represented by an additive composition of codewords from all codebooks, as given by Eq. (2). The result of dot-product attention for a given query is then:
$o = \sum_{i=1}^{L} \frac{\exp\!\big(q^\top \sum_{b=1}^{B} c^b_{w_i^b}\big)}{\sum_{j=1}^{L} \exp\!\big(q^\top \sum_{b=1}^{B} c^b_{w_j^b}\big)} \sum_{b=1}^{B} c^b_{w_i^b} \qquad (5)$
Unlike the single-codebook scenario, although within each codebook many items may correspond to the same codeword, their representations diverge after the additive composition. Hence we would still have to compute the inner product between $q$ and every item in the sequence.
To tackle this problem, we propose to relax the attention operation. We split the computation, perform the attention in each codebook separately, and then take the sum:
$o = \sum_{b=1}^{B} \sum_{i=1}^{L} \frac{\exp(q^\top c^b_{w_i^b})}{\sum_{j=1}^{L} \exp(q^\top c^b_{w_j^b})} \, c^b_{w_i^b} \qquad (6)$
This additive compositional formulation can be considered a form of "multi-head" attention, where each attention head correlates with a codebook. Since different codebooks form different latent spaces, Eq. (6) in fact aggregates information from different representational subspaces of the items, using independent attention weights.
Equipped with the above relaxation, we can once again reuse the inner products by counting the frequency of each codeword appearing in the sequence, separately for every codebook. We reformulate the computation of Eq. (6) as follows:
$o = \sum_{b=1}^{B} \sum_{k \in \mathcal{K}^b} \frac{n_k^b \exp(q^\top c_k^b)}{\sum_{k' \in \mathcal{K}^b} n_{k'}^b \exp(q^\top c_{k'}^b)} \, c_k^b \qquad (7)$
where $\mathcal{K}^b$ is the set of unique codeword indices of the $b$-th codebook appearing in the sequence, and $n_k^b$ is the number of occurrences of codeword $c_k^b$ in the sequence.
However, the cardinality of $\mathcal{K}^b$ varies across different sequences and different codebooks. The computation of Eq. (7) therefore operates on tensors of different sizes, which is suboptimal for efficient batching on GPUs and TPUs (Kochura et al., 2019). For batching purposes, we perform the attention over all codewords in each codebook, fixing the "context size" of the attention to $W$:

$o = \sum_{b=1}^{B} \sum_{k=1}^{W} \frac{n_k^b \exp(q^\top c_k^b)}{\sum_{k'=1}^{W} n_{k'}^b \exp(q^\top c_{k'}^b)} \, c_k^b \qquad (8)$
For a codeword that is not used by any item in the sequence, the occurrence count $n_k^b = 0$, so it does not contribute to the weighted average. Hence Eq. (7) and Eq. (8) are equivalent.
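The per-codebook histogram attention of Eq. (8) can be sketched as follows (toy NumPy code with random data; sizes and names are ours). Note the histogram is taken over all $W$ codewords, giving every codebook a fixed context size:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, B, W = 100, 16, 4, 8      # sequence length, dim, codebooks, codewords

codebooks = rng.normal(size=(B, W, D))
w = rng.integers(0, W, size=(L, B))   # per-codebook codeword index of each item
q = rng.normal(size=D)

out = np.zeros(D)
for b in range(B):
    counts = np.bincount(w[:, b], minlength=W)   # histogram over all W codewords
    scores = counts * np.exp(codebooks[b] @ q)   # unused codewords get weight 0
    out += (scores / scores.sum()) @ codebooks[b]
```

Each codebook contributes an independently normalized weighted average of its codewords, matching the relaxed "multi-head" view of Eq. (6).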
Now we put this in the self-attention setting, where the input sequence attends to itself: the $t$-th query is just the $t$-th item, i.e., $q_t = \sum_{b=1}^{B} c^b_{w_t^b}$. Since we regard the attention in different codebooks as independent heads that attend in different latent spaces, we further reduce the inner product computation by considering only the terms within the same codebook, replacing $q_t$ with $c^b_{w_t^b}$ in the $b$-th head. This gives us:
$o_t = \sum_{b=1}^{B} \sum_{k=1}^{W} \frac{n_k^b \exp\!\big((c^b_{w_t^b})^\top c_k^b\big)}{\sum_{k'=1}^{W} n_{k'}^b \exp\!\big((c^b_{w_t^b})^\top c_{k'}^b\big)} \, c_k^b \qquad (9)$
where $o_t$ is the $t$-th output of the attention operation.
Eq. (9) computes bidirectional attention (each position can attend over all positions in the input sequence), since $n_k^b$ indicates the frequency of $c_k^b$ in the entire sequence. However, in the recommendation setting, the model should consider only the first $t$ items when making the $t$-th prediction (Kang and McAuley, 2018); we therefore favor a unidirectional setting (each position can only attend to positions up to and including itself). This requires us to compute the codeword histogram of every codebook up to the $t$-th position, for each $t$, which can be implemented via prefix-sums. We first transform the codeword indices at position $i$ into a one-hot representation $\delta_i \in \{0,1\}^{B \times W}$, where $\delta_i[b, k] = 1$ if and only if $w_i^b = k$. The one-hot vectors for each codebook at each position form a tensor of shape $L \times B \times W$; we compute the prefix-sum along the first dimension to get the histograms up to each position in the sequence:
$H_t = \sum_{i=1}^{t} \delta_i, \qquad t = 1, \dots, L \qquad (10)$
There exist efficient algorithms (Cormen et al., 2009; Ladner and Fischer, 1980) for computing prefix-sums with $O(\log L)$ depth when computed in parallel.
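In NumPy, the causal histograms of Eq. (10) are a one-hot encoding followed by a cumulative sum along the length dimension (toy sizes and names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
L, B, W = 100, 4, 8
w = rng.integers(0, W, size=(L, B))   # codeword indices: L x B

# One-hot encode the indices: tensor of shape L x B x W.
one_hot = np.eye(W)[w]

# Prefix-sum along the length dimension: H[t, b, k] counts how often
# codeword k of codebook b occurs among the first t+1 positions (Eq. (10)).
H = np.cumsum(one_hot, axis=0)
```

Each `H[t]` is exactly the histogram needed to evaluate Eq. (9) causally at position `t`.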
As mentioned earlier, in the vanilla self-attention (Vaswani et al., 2017), linear projections are applied to the input sequence to get queries, keys and values. Similarly, we can directly apply the projection matrices to the codebooks, since every item in the input sequence is just a composition of codewords. Combining this with Eq. (10), we obtain the following unidirectional attention mechanism:
$o_t = \sum_{b=1}^{B} \sum_{k=1}^{W} \frac{H_t[b, k] \exp\!\big((W^Q c^b_{w_t^b})^\top (W^K c_k^b)\big)}{\sum_{k'=1}^{W} H_t[b, k'] \exp\!\big((W^Q c^b_{w_t^b})^\top (W^K c_{k'}^b)\big)} \, W^V c_k^b \qquad (11)$
As we only need to compute inner products between codewords within the same codebook, we can store them (after taking the exponent) in lookup tables, and retrieve the required terms via table lookup at inference time. We achieve this by storing $B$ tables with $W^2$ entries each, resulting in a memory cost of $O(BW^2)$. This would not be feasible without embedding quantization via codebooks, which would lead to a memory complexity of $O(N^2)$, where $N$ is the number of items. We present the workflow of LISA in Figure 3, and outline the main algorithm formally in Algorithm 1.
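Precomputing the tables is straightforward (toy NumPy sketch with random projections; the names `Wq`/`Wk` stand in for the $W^Q$/$W^K$ of Eq. (11) and are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
B, W, D = 4, 8, 16
codebooks = rng.normal(size=(B, W, D))
Wq = rng.normal(size=(D, D)) * 0.1
Wk = rng.normal(size=(D, D)) * 0.1

# Precompute exp of the inner products between projected codewords within
# each codebook: B tables of shape W x W, O(B * W^2) memory in total.
tables = np.exp(np.einsum('bwd,bvd->bwv', codebooks @ Wq, codebooks @ Wk))

# At inference, the score between query codeword i and key codeword j of
# codebook b is a single table lookup instead of a projected inner product.
score = tables[2, 3, 5]
```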
3.4. Variants
We notice that the computational cost of LISA is determined by the fixed context size $BW$ (i.e., the total number of codewords). To further increase efficiency, especially on shorter sequences, we propose to use a separate set of codebooks to encode the sequence with a much smaller $BW$. In our experiments, we find that a total of 128/256 codewords is enough to obtain decent performance, compared to the 1024/2048 used in our base model. We investigate the following two variants:


LISA-Soft: Instead of assigning a unique codeword to each item in every codebook, we allow a soft codeword assignment. In this case, the one-hot vector $\delta_i[b, :]$ becomes the vector of softmax scores of the item over the $W$ codewords of codebook $b$. With a soft assignment, we can no longer compress the embedding matrix by storing discrete codeword indices at inference time; hence we directly use the original embeddings for target items.

LISA-Mini: To enable embedding compression, we still use a hard codeword assignment. We adopt two separate sets of codebooks: a smaller one (i.e., with a smaller $BW$) to encode the sequence, and a larger one to encode target items.
3.4.1. Extensions
The vanilla Transformer can stack multiple self-attention layers to improve performance. However, we find that using multiple attention layers is not particularly helpful in recommendation, consistent with (Kang and McAuley, 2018); therefore we only employ a single layer. Our method nevertheless extends easily to the multi-layer case: a straightforward solution is to use a different set of codebooks to remap the attention outputs to codewords in a different set of latent spaces. Our method can also be adapted beyond self-attention, as long as the queries, keys and values can be encoded via codebooks. And since codebooks can quantize any embedding matrix, we can employ LISA in domains beyond recommendation; for example, the inputs in NLP tasks are just token embeddings, to which our method can readily be applied.
3.5. Complexity Analysis
Computing the codeword histograms takes $O(LBW)$ steps, as we compute prefix-sums along the sequence length dimension for every codeword in all codebooks. The time complexity of computing the final outputs (the weighted sum of values) is $O(LBWD)$, as this operation is essentially a batched matrix multiplication between an attention score tensor of shape $B \times L \times W$ and a tensor of shape $B \times W \times D$ representing the codebooks. Computing the inner product tables requires $O(BW^2D)$ time, but at inference time we can save this cost via table lookup; at training time, it is still a negligible term compared to the attention cost over a batch of sequences. Hence, our method has an overall asymptotic time complexity of $O(LBWD)$, i.e., linear in the sequence length.
4. Experiments
In this section, we empirically analyze the recommendation performance of our proposed method, compared to the vanilla Transformer and existing efficient attention methods. Following that, we present the computational and memory costs of LISA with respect to different sequence lengths. We also investigate how the number of codewords affects the performance of our method. Finally, we show the efficiency improvement brought by LISA in an online setting. Our code is published at: https://github.com/libertyeagle/LISA.
Table 1. Statistics of the processed datasets.

Dataset      #users   #items  #ratings  avg. length
Alibaba      99,979   80,000  25M       252.93
ML-1M        6,040    3,416   1M        165.50
Video Games  59,766   33,487  0.5M      8.82
ML-25M       162,541  32,720  25M       153.47
4.1. Datasets
We use four real-world datasets for sequential recommendation that vary in platform, domain and sparsity:


Alibaba: A dataset sampled from user click logs on the Alibaba e-commerce platform, collected from September 2019 to September 2020. It contains relatively longer behavior sequences than the other datasets used in the experiments.

Amazon Video Games (Ni et al., 2019): Product review data crawled from Amazon, spanning from 1996 to 2018. The data is split into separate datasets according to the top-level product categories; in this work, we consider the "Video Games" category. This dataset is notable for its sparsity.

MovieLens (Harper and Konstan, 2016): A widely used benchmark dataset of movie ratings for evaluating recommendation algorithms. We adopt two versions: MovieLens 1M (ML-1M) and MovieLens 25M (ML-25M), which include 1 million and 25 million ratings, respectively.
Following the common preprocessing practice in (Kang and McAuley, 2018; Sun et al., 2019; Tang and Wang, 2018), we treat the presence of a rating as implicit feedback. Users and items with fewer than five interactions are discarded. Table 1 shows the statistics of the processed datasets.
Table 2. Codebook settings and achieved embedding compression ratios (LISA-Soft does not compress item embeddings).

                   LISA-Base  LISA-Soft  LISA-Mini
#codebooks (B)     8          8          8
#codewords (W)     128/256    16         32
Compression ratio:
  Alibaba          24.26      –          18.45
  ML-1M            3.19       –          2.51
  Video Games      13.02      –          10.62
  ML-25M           12.78      –          10.44
4.2. Compared Methods
We evaluate our proposed base model, denoted LISA-Base, as well as its two variants, LISA-Soft and LISA-Mini. We compare these methods with the vanilla Transformer (Vaswani et al., 2017), as well as the following efficient attention methods:


Reformer (Kitaev et al., 2019): Utilizes LSH to restrict queries to only attend to keys that fall in the same hash bucket, reducing the computational complexity to $O(L \log L)$. We do not use the reversible layers, since this technique can be applied to all methods, including ours.

Sinkhorn Transformer (Tay et al., 2020): Extends local attention by learning a differentiable sorting of buckets. Queries can then attend to keys in the corresponding sorted bucket. This model has a computational complexity of $O(Lb)$, where $b$ is the bucket size.

Routing Transformer (Roy et al., 2020): A clustering-based attention mechanism. K-means clustering is applied to the input queries and keys; the attention context for a query is restricted to the keys assigned to the same cluster. The computational complexity is $O(Lk + L^2/k)$, where $k$ is the number of clusters.

Improved Clustered Attention (Vyas et al., 2020): Another clustering-based attention method. This approach, however, only groups queries into clusters, and attends the cluster centroids over all keys. The top keys for each cluster centroid are then extracted to compute exact attention scores with the queries in that cluster. This results in a computational complexity of $O(Lk)$, where $k$ is the number of clusters.

Linformer (Wang et al., 2020): An efficient attention mechanism based on low-rank approximation. Linformer projects the keys and values of shape $L \times D$ to $k \times D$, effectively reducing the context size to a tunable hyperparameter $k$. This leads to a complexity of $O(Lk)$. We note that it is the only baseline that does not support unidirectional attention.
For simplicity, we ignore the terms involving the latent dimension size $D$ in the above asymptotic complexities.
Table 3. Recommendation performance on Alibaba, ML-1M and Video Games.

                  |           Alibaba                |            ML-1M                 |          Video Games
Method            | HR@5    NDCG@5  HR@10   NDCG@10 | HR@5    NDCG@5  HR@10   NDCG@10 | HR@5    NDCG@5  HR@10   NDCG@10
Transformer  0.6597  0.5528  0.7569  0.5843  0.6841  0.5376  0.7914  0.5725  0.5525  0.4337  0.6583  0.4680 
Linformer  0.3829  0.3007  0.4929  0.3360  0.4171  0.2899  0.5704  0.3394  0.4643  0.3605  0.5671  0.3937 
Reformer (LSH-1)  0.6209  0.5189  0.7212  0.5513  0.6753  0.5248  0.7806  0.5590  0.5637  0.4429  0.6694  0.4771 
Reformer (LSH-4)  0.6184  0.5156  0.7199  0.5484  0.6492  0.5040  0.7627  0.5408  0.5648  0.4446  0.6685  0.4781 
Sinkhorn (32)  0.6298  0.5278  0.7260  0.5589  0.6743  0.5256  0.7796  0.5599  0.5479  0.4289  0.6557  0.4638 
Sinkhorn (64)  0.6331  0.5319  0.7289  0.5629  0.6775  0.5310  0.7844  0.5656  0.5469  0.4258  0.6541  0.4605 
Routing (32)  0.5742  0.4789  0.6724  0.5106  0.6623  0.5186  0.7704  0.5537  0.5615  0.4412  0.6657  0.4750 
Routing (64)  0.6037  0.5037  0.7023  0.5356  0.6535  0.5100  0.7616  0.5452  0.5570  0.4369  0.6604  0.4704 
Clustered (100)  0.5924  0.4937  0.6941  0.5266  0.6573  0.5127  0.7697  0.5492  0.5591  0.4394  0.6642  0.4734 
Clustered (200)  0.5934  0.4936  0.6962  0.5268  0.6538  0.5095  0.7712  0.5478  0.5578  0.4384  0.6633  0.4725 
LISA-Base  0.6660  0.5460  0.7702  0.5798  0.6940  0.5406  0.7962  0.5740  0.6203  0.4788  0.7338  0.5157 
LISA-Soft  0.6575  0.5393  0.7622  0.5732  0.6795  0.5229  0.7887  0.5587  0.5951  0.4592  0.7035  0.4944 
LISA-Mini  0.6430  0.5146  0.7559  0.5511  0.6853  0.5308  0.7886  0.5644  0.5917  0.4497  0.7102  0.4881 
Table 4. Recommendation performance on ML-25M.

Method  HR@5  NDCG@5  HR@10  NDCG@10 
Transformer  0.9338  0.8073  0.9752  0.8209 
Linformer  0.8627  0.7086  0.9367  0.7329 
Reformer (LSH-1)  0.9214  0.7847  0.9694  0.8005 
Reformer (LSH-4)  0.9150  0.7765  0.9667  0.7935 
Sinkhorn (32)  0.9195  0.7836  0.9682  0.7995 
Sinkhorn (64)  0.9161  0.7820  0.9649  0.7980 
Routing (32)  0.9167  0.7829  0.9658  0.7990 
Routing (64)  0.9215  0.7890  0.9685  0.8044 
Clustered (100)  0.9215  0.7830  0.9700  0.7989 
Clustered (200)  0.9199  0.7818  0.9692  0.7980 
LISA-Base  0.9254  0.7933  0.9713  0.8083 
LISA-Soft  0.9269  0.7964  0.9710  0.8109 
LISA-Mini  0.9243  0.7900  0.9701  0.8050 
4.3. Settings & Metrics
4.3.1. Parameter Settings
We use the SASRec (Kang and McAuley, 2018) architecture as the building block for our experimental setup, as SASRec relies purely on self-attention to perform sequential recommendation. Hence we can simply replace the regular Transformer self-attention with our method or the aforementioned baselines to compare performance. We find that the number of attention layers has a negligible impact on recommendation performance, and that using multiple attention heads is consistently worse than a single head (Kang and McAuley, 2018); multiple attention layers and heads only lead to greater computational cost. Hence we use a single layer and a single head for all compared methods.
All methods are implemented in PyTorch and trained with the Adam optimizer with a learning rate of 0.001 and a batch size of 128. We use an embedding dimension of 128, and the dropout rate is set to 0.1 on all datasets. We train all methods for a maximum of 200 epochs. Following the settings in the original papers, we consider two configurations for Reformer, LSH-1 and LSH-4, which use one and four parallel hashes, respectively. For Sinkhorn Transformer and Routing Transformer, we consider bucket (window) sizes of 32 and 64. We set the number of clusters to 100 and 200 for Clustered Attention. We use a low-rank projection size of 128 for Linformer. We apply causal masking for all methods except Linformer.
We report the codebook settings used for all three versions of our proposed method in Table 2. Since LISA-Base and LISA-Mini simultaneously compress the embedding matrix, we also report the achieved compression ratios on all four datasets. The item embeddings can be compressed by up to 24x.
4.3.2. Metrics
Following (Kang and McAuley, 2018; Sun et al., 2019; Lian et al., 2020c), we apply two widely used ranking metrics: Hit Rate (HR) and NDCG (Weimer et al., 2008). HR@$k$ counts the fraction of times that the target item is among the top $k$ recommendations. NDCG@$k$ rewards methods that rank positive items in the first few positions of the top-$k$ list. We report both metrics at $k=5$ and $k=10$. The last item of each user's behavior sequence is used for evaluation, while the remaining items are used for training. For each user, we randomly sample 100 negative items that the user has not interacted with, pairing them with the positive item for the compared methods to rank.
4.4. Recommendation Performance
We report the recommendation performance compared with the baselines in Table 3 and Table 4. Since we care about efficiency as well as performance, we use bold font to denote the best-performing method among the efficient attention baselines and the two more efficient variants of our approach, excluding LISA-Base.
From the two tables, we have the following important findings:


LISA-Base consistently outperforms all the state-of-the-art efficient attention baselines on all four datasets. It attains improvements of up to 8.78% and 7.29% over the best-performing baseline in terms of HR@10 and NDCG@10, respectively. This demonstrates the effectiveness of our attention method based on codeword histograms: we compute full contextual attention, in contrast to the sparse attention mechanisms most baselines build upon. Since it uses more codewords, LISA-Base also outperforms LISA-Soft and LISA-Mini on all datasets except ML-25M, where it performs similarly to LISA-Soft. On some metrics and datasets, LISA-Base even obtains higher performance than the Transformer. This makes sense, considering that LISA-Base attends over a broader context, encapsulating relevant information from a large number of codewords in each codebook (providing diverse views).

LISA-Soft and LISA-Mini achieve decent performance with much smaller codebooks. Even with 16 codewords per codebook, LISA-Soft still outperforms the best-performing baseline by 2.46% and 1.16% in terms of HR@10 and NDCG@10, on average. Only on ML-1M is it slightly worse than Sinkhorn Transformer (64) in terms of NDCG. We suspect the issue is that we only use the soft codeword assignment scores when computing the codeword histograms, while we still use a unique codeword per codebook to approximate the query; handling the cross terms between different codebooks when computing the inner products would otherwise be challenging. This could create a mismatch between queries and keys/values, leading to the performance gap on this dataset. However, in most cases, LISA-Soft achieves performance comparable to LISA-Base while using 94% fewer codewords. Even when model compression is desired, LISA-Mini still improves over the best-performing baseline by 2.46% in terms of HR@10, on average.

Our proposed method, and the baselines that allocate items to buckets based on similarity, even lead to increased recommendation performance on the Video Games dataset. With an average length of only 8.8, the user sequences in Video Games tend to be noisy for next-item recommendation, and full-context attention in this scenario can confuse the model with noise. Reformer, Routing Transformer and Clustered Attention remedy this issue by attending only to the informative items selected through hashing or clustering (note that the number of buckets/clusters is predetermined by the maximum sequence length in the dataset and the desired bucket/cluster size). Meanwhile, LISA addresses this issue by summarizing information from different codebooks, which can be regarded as a form of denoising.

In general, sparse attention via sorting buckets seems to be more effective than learning the bucket assignments. We observe that Sinkhorn Transformer is a strong baseline: it considerably outperforms Reformer, Routing Transformer and Clustered Attention on Alibaba and ML-1M, while performing almost identically to them on ML-25M. Only on Video Games does it perform slightly worse, due to the aforementioned intrinsic noise in this dataset. In that instance, Sinkhorn Transformer effectively performs full contextual attention, as it divides the sequence into consecutive blocks of fixed size.

LSH is better than clustering for bucket assignment. Reformer and Routing Transformer are both content-based sparse attention methods that differ mostly in the technique used to infer sparsity patterns: Reformer employs LSH, while Routing Transformer resorts to online k-means clustering. We see that Reformer consistently outperforms Routing Transformer. The latter sorts tokens by their distances to each cluster centroid and assigns membership via a top-k threshold, with the centroids updated by an exponential moving average of training examples. Unlike Reformer, this approach does not guarantee that each item belongs to a single cluster, which may partially explain Routing Transformer's worse performance.
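To make the contrast concrete, the following is a minimal sketch of random-hyperplane (sign-based) LSH bucketing, the family of techniques Reformer's hashing belongs to (Reformer's actual scheme uses rotations and multiple rounds; the sizes here are illustrative assumptions). The key property relevant to the discussion above is that each item hashes to exactly one bucket per round:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_bits = 32, 16, 3                    # items, embedding dim, hyperplanes (assumed)
x = rng.normal(size=(n, d))                 # item embeddings

planes = rng.normal(size=(n_bits, d))       # random hyperplanes shared by all items
bits = (x @ planes.T > 0).astype(int)       # sign pattern of each item, shape (n, n_bits)
bucket_ids = bits @ (1 << np.arange(n_bits))  # each item gets exactly one of 2**n_bits buckets
```

Nearby vectors tend to fall on the same side of each hyperplane and thus share a bucket id, whereas the top-k thresholding in Routing Transformer's clustering can leave an item in zero or several clusters.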

Unidirectional attention is vital for satisfactory performance in recommendation. This conclusion follows from the results of Linformer: because its projection is applied along the length dimension, mixing information across the sequence, it is non-trivial to apply causal masking, and the resulting bidirectional attention leads to significant performance degradation. Our attempts with other methods in bidirectional mode corroborate this finding. The designs of certain baselines also induce issues when enforcing causal masking. For example, in the unidirectional mode, Sinkhorn Transformer sorts the buckets according only to the first token in each bucket. Bidirectional Clustered Attention can first approximate the full contextual attention scores with those of the cluster centroid each query belongs to, and then separately recompute attention on the top keys; this technique, however, is not viable in a unidirectional setting.

Using a larger bucket size does not necessarily improve performance. We observe this phenomenon in the results of Sinkhorn Transformer and Routing Transformer. As the bucket size, and hence the context size for each query, is increased, fewer buckets/clusters are used, which makes it harder for k-means clustering and Sinkhorn sorting to group relevant items together. One therefore has to carefully tune the bucket size to achieve ideal performance, as it balances the size of the attention context against the quality of sorting/bucket assignment. Surprisingly, we find that using multiple rounds of hashing in Reformer does not enhance performance either.
4.5. Computational Cost
4.5.1. Settings
To evaluate the computational efficiency of our proposed method, we compare its inference speed against the vanilla Transformer and the aforementioned efficient attention baselines. Following (Kitaev et al., 2019; Wang et al., 2020), we use synthetic inputs with lengths varying from 128 to 64K and perform a full forward pass. The batch size is scaled inversely with the sequence length, to keep the total number of items (tokens) fixed. We report the average running time over 100 batches. For each baseline, we only consider its less time-consuming variant; for example, we only report the LSH-1 variant of Reformer, as the LSH-4 version is far more computationally intensive. Since the asymptotic complexity of our proposed method is linear in the sequence length, the inference speed of all three versions of LISA depends only on the total number of codewords used to encode sequences. We evaluate two settings of LISA that use a total of 128 and 256 codewords (denoted LISA-128 and LISA-256), corresponding to the settings used for LISA-Soft and LISA-Mini in Section 4.4. We only measure the cost of self-attention, since the other components are the same for all compared models. We consider latent dimension sizes of 128 and 1024. All experiments are conducted on a single Tesla V100 GPU with 16GB memory. The results are shown in Figure 4.
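The benchmarking protocol above can be sketched as follows. This is a minimal CPU/NumPy illustration of the "fixed token budget" idea, with a toy quadratic attention stand-in; the helper names, sizes and trial counts are our own assumptions, not the paper's harness:

```python
import time
import numpy as np

def time_forward(attn_fn, seq_lens, total_tokens=2048, d=64, trials=3):
    """Time a forward pass at several lengths; the batch size scales
    inversely with the sequence length so that every run processes the
    same total number of tokens (hypothetical helper)."""
    results = {}
    for n in seq_lens:
        batch = max(1, total_tokens // n)
        x = np.random.randn(batch, n, d).astype(np.float32)
        attn_fn(x)                           # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(trials):
            attn_fn(x)
        results[n] = (time.perf_counter() - start) / trials
    return results

def vanilla_attention(x):
    """Toy O(n^2) stand-in: full softmax(QK^T)V with Q = K = V = x."""
    s = np.einsum('bnd,bmd->bnm', x, x) / np.sqrt(x.shape[-1])
    s = np.exp(s - s.max(axis=-1, keepdims=True))
    return np.einsum('bnm,bmd->bnd', s / s.sum(axis=-1, keepdims=True), x)

times = time_forward(vanilla_attention, [64, 128, 256])
```

Under this protocol, a truly linear method shows roughly constant timings across lengths, while a quadratic one grows with the per-sequence length despite the fixed token budget.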
4.5.2. Findings


Our method consistently and dramatically outperforms the Transformer and all efficient attention baselines in inference speed. Only on sequences of length 128 with an embedding size of 128 is LISA slightly slower than the Transformer. LISA-128 is 3.1x faster than Reformer on 64K sequences. Benefiting from precomputed inner product tables, our method pulls even further ahead at the larger latent dimension, achieving a speed boost of 57x over the Transformer on 16K sequences; all other methods take considerably longer there, as the cost of computing inner products dominates in that scenario. Linformer has an almost identical speed to LISA-128, but its recommendation performance is notably worse than ours. Figure 4 also verifies the linear complexity of LISA: the inference time remains constant when the total number of items in a batch is constant.

Sinkhorn Transformer and Routing Transformer still suffer from substantial computational cost as the sequence length grows: the inference time increases by as much as 5x for Sinkhorn and 27x for Routing when moving from sequences of length 128 to 64K. Both methods compute query/key dot products within each bucket, at a cost that grows with the bucket size. In addition, Sinkhorn Transformer spends time sorting the buckets, while Routing Transformer spends time performing cluster assignments. With the bucket size fixed, the cost of sorting/clustering becomes dominant; increasing the bucket size, on the other hand, incurs an extra price in computing attention scores within each bucket.

Though its extra overhead dominates on short sequences, Reformer tends toward linear scaling on longer ones. Hashing items into buckets via LSH is exceptionally time-consuming: Reformer is significantly slower than the vanilla Transformer on sequences shorter than 512, and even slower there than on 64K sequences, owing to the larger batch size. Our method, on the contrary, does not suffer from this issue, being up to 6.5x faster than Reformer on sequences of length 128. On longer sequences, Reformer scales almost linearly, since the logarithmic term in its asymptotic complexity is quite small.

Clustered Attention fails to demonstrate its advantage of linear complexity even on sequences of 64K. From Figure 4, we observe that Clustered Attention is indeed linear (although it bears the same overhead problem on short sequences as Reformer). A substantial computational cost apparently underlies its procedure of first computing full contextual attention using the cluster centroids and then improving the approximation for each query on the top keys. Clustered Attention is still slower than Reformer on 64K sequences.
4.6. Memory Consumption
4.6.1. Settings
We also evaluate the memory efficiency of the different methods by measuring peak GPU memory usage. The settings of the compared methods are the same as in the previous section. The latent dimensionality is set to 128. For a given sequence length, we choose the batch size to be the maximum that all compared models can fit in memory. We report results on sequences of up to 16K, as the vanilla Transformer cannot fit longer sequences even with a batch size of 1. The compression ratios with respect to the Transformer are shown in Table 5.
sequence length  512  1024  2048  4096  8192  16384
Linformer  2.46x  4.48x  8.49x  16.51x  32.65x  65.53x
Reformer  0.66x  1.16x  2.15x  4.16x  8.31x  11.26x
Sinkhorn  0.97x  1.70x  3.15x  6.09x  12.14x  25.74x
Routing  1.27x  2.22x  4.08x  7.73x  14.87x  29.45x
Clustered  2.32x  4.17x  7.85x  15.26x  30.63x  64.91x
LISA-128  2.94x  5.14x  9.55x  18.45x  36.86x  78.26x
LISA-256  1.50x  2.62x  4.87x  9.40x  18.78x  39.93x
4.6.2. Findings
All the efficient attention baselines greatly reduce memory consumption on longer sequences. Among them, LISA-128 is the most efficient, requiring only 1.3% of the memory needed by the Transformer in the best case. Although Reformer enjoys faster inference on long sequences, it is more memory-hungry than the other baselines, which again reflects its LSH bucketing overhead.
4.7. Sensitivity w.r.t. Number of Codewords
4.7.1. Settings
We investigate the impact of the number of codewords per codebook used in LISA-Soft and LISA-Mini on recommendation performance. We fix the number of codebooks to 8 and vary the number of codewords from 16 to 96, leaving the settings of the codebooks used to encode target items in LISA-Mini unchanged. We show the results on Alibaba and Video Games in Figure 5.
4.7.2. Findings
The performance of LISA-Mini on Alibaba consistently improves as the number of codewords increases. Due to the sparsity of Video Games, it is challenging to learn two large codebooks well simultaneously; hence the performance drops slightly when using a large number of codewords on this dataset. The performance of LISA-Soft, on the other hand, is relatively stable with respect to the number of codewords on both datasets, indicating that we can attain desirable performance with only a small number of codewords, greatly boosting inference efficiency.
4.8. Improving Efficiency for Online Recommendation
Here we consider a practical setting in which users and the recommender interact dynamically. The recommender makes recommendations based on a user's historical behaviors; the user then interacts with the recommendations, and the response is appended to the user's history. This process repeats as the recommender makes new recommendations using the updated user sequence.
A particular advantage of our method emerges in this setting. In our method, the computation of the attention scores depends only on the codeword histograms and the codebooks themselves. For each user, instead of storing the entire history sequence at a cost that grows linearly with its length, we can simply save the codeword histogram and the last item's codeword indices to represent the user's state. The histogram and the indices can be updated dynamically, resulting in a constant storage cost. At each inference step, our method can use the stored histogram to compute a weighted average of the codebooks in constant time, whereas the per-step cost of vanilla self-attention grows linearly with the sequence length.
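The constant-state update above can be sketched as follows. This is a simplified single-head NumPy illustration: the sizes (8 codebooks, 16 codewords, sub-dimension 32), the random codebooks, and the exact softmax-over-codewords form are our own assumptions rather than the paper's precise formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, W, d = 8, 16, 32                     # codebooks, codewords per book, sub-dim (assumed)
codebooks = rng.normal(size=(m, W, d))  # learned in practice; random here
hist = np.zeros((m, W))                 # codeword histograms: the whole user "state"

def update_state(hist, item_codes):
    """O(m) per-item update: bump the count of the item's codeword in each codebook."""
    for b, w in enumerate(item_codes):
        hist[b, w] += 1
    return hist

def attend(hist, query_codes):
    """Constant-time attention: per codebook, a softmax over codeword inner
    products weighted by the histogram counts, then a weighted average."""
    parts = []
    for b in range(m):
        q = codebooks[b, query_codes[b]]             # query approximated by its codeword
        scores = codebooks[b] @ q                     # inner products with all W codewords
        w = hist[b] * np.exp(scores - scores.max())   # histogram-weighted exp scores
        parts.append(w @ codebooks[b] / w.sum())      # weighted average of codewords
    return np.concatenate(parts)

# simulate a stream: each item is encoded by one codeword index per codebook
for _ in range(100):
    hist = update_state(hist, rng.integers(0, W, size=m))
out = attend(hist, rng.integers(0, W, size=m))
```

Note that neither function touches the raw history: the cost of a step depends only on m and W, never on how many items have been consumed.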
We simulate this scenario with randomly generated data and measure the total time required to make step-wise inferences from scratch up to a given length. Since most efficient attention baselines face challenges with variable sequence lengths (recall that Sinkhorn Transformer and Linformer assume a fixed sequence length, as their model parameters depend on it), we only compare LISA-256 with the Transformer.
We show the results in Figure 6. Our method is considerably faster than the Transformer in this setting, especially at larger numbers of steps. Concretely, our method takes about 0.11ms per step, no matter how long the sequence is. The Transformer, by contrast, takes 0.98ms to compute attention for a single query when the sequence reaches length 2K, 1.50ms at 64K, and 11.01ms at 1024K, roughly 100x slower than our method.
4.9. Migrating Codebooks from Vanilla SelfAttention
4.9.1. Settings
We note that the codebooks serve as a plug-and-play module that can replace any embedding matrix, so we can also train a model based on vanilla self-attention with codebooks. The pretrained codebooks are then directly applied to LISA-Base and kept frozen. We evaluate the performance of this model and report the results in Table 6.
4.9.2. Findings
We find that directly using codebooks trained with regular dot-product attention does not cause performance degradation; it actually improves the performance of the LISA-Base model slightly. This implies that our method can indeed approximate dot-product attention to some extent.
Dataset  HR@5  NDCG@5  HR@10  NDCG@10
Alibaba  0.6697  0.5492  0.7711  0.5821 
ML1M  0.7002  0.5456  0.7945  0.5763 
Video Games  0.6188  0.4800  0.7333  0.5172 
ML25M  0.9287  0.7991  0.9725  0.8135 
5. Conclusions and Future Work
In this paper, we propose LISA, an efficient attention mechanism for recommendation built upon embedding quantization with codebooks. In LISA, codeword histograms are computed over the input sequences for each codebook; we then use the histograms and the inner products between codewords to compute the attention weights, in time linear in the sequence length. Our method performs on par with the vanilla Transformer in terms of recommendation performance, while being up to 57x faster. Future work includes extending LISA to other domains such as language modeling.
Acknowledgments
The work was supported by grants from the National Natural Science Foundation of China (No. 62022077 and 61976198), and the Fundamental Research Funds for the Central Universities.
References
Encoding long and structured data in transformers. arXiv preprint arXiv:2004.08483.
Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150.
Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
The best of both worlds: combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 76–86.
Learning K-way D-dimensional discrete codes for compact embedding representations. In International Conference on Machine Learning, pp. 854–863.
Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
Introduction to algorithms. MIT Press.
Norm-explicit quantization: improving vector quantization for maximum inner product search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 51–58.
Transformer-XL: attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988.
BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645.
Optimized product quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (4), pp. 744–755.
Quantization. IEEE Transactions on Information Theory 44 (6), pp. 2325–2383.
Quantization based fast inner product search. In Artificial Intelligence and Statistics, pp. 482–490.
The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5 (4), pp. 19.
Visualizing music self-attention.
Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 117–128.
Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 197–206.
Reformer: the efficient transformer. In International Conference on Learning Representations.
Batch size influence on performance of graphic and tensor processing units during training and inference phases. In International Conference on Computer Science, Engineering and Education Applications, pp. 658–668.
Matrix factorization techniques for recommender systems. Computer 42 (8), pp. 30–37.
Parallel prefix computation. Journal of the ACM (JACM) 27 (4), pp. 831–838.
Personalized ranking with importance sampling. In Proceedings of The Web Conference 2020, pp. 1093–1103.
LightRec: a memory and search-efficient recommender system. In Proceedings of The Web Conference 2020, pp. 695–705.
Geography-aware sequential location recommendation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2009–2019.
ProGen: language modeling for protein generation. arXiv preprint arXiv:2004.03497.
Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 188–197.
Efficient content-based sparse attention with routing transformers. arXiv preprint arXiv:2003.05997.
Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems, pp. 2321–2329.
Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 331–335.
BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1441–1450.
Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 565–573.
Sparse Sinkhorn attention. In International Conference on Machine Learning.
Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
Graph attention networks. arXiv preprint arXiv:1710.10903.
Fast transformers with clustered attention.
Linformer: self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
Non-local neural networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7794–7803.
CoFiRank: maximum margin matrix factorization for collaborative ranking. In Advances in Neural Information Processing Systems, pp. 1593–1600.
Learning to stop: a simple yet effective approach to urban vision-language navigation. arXiv preprint arXiv:2009.13112.
Graph contextualized self-attention network for session-based recommendation. In IJCAI, pp. 3940–3946.
LayoutLM: pre-training of text and layout for document image understanding. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1192–1200.
NAIRS: a neural attentive interpretable recommendation system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 790–793.
Big Bird: transformers for longer sequences. arXiv preprint arXiv:2007.14062.
Self-attention generative adversarial networks. In International Conference on Machine Learning, pp. 7354–7363.
Next item recommendation with self-attention. arXiv preprint arXiv:1808.06414.