have played a key role in the recent advancement of the state of the art in many natural language processing tasks, such as machine translation, language modeling [20, 21], and question answering [10, 26, 16]. The key component of these networks is the self-attention layer, which updates the embeddings of the input tokens based on their context. Clark et al. [6] and Coenen et al. [7] use probing tasks and show that self-attention layers in trained Transformer models compute embeddings that carry substantial linguistic knowledge and excel at several semantic and syntactic tasks. Naturally, the self-attention layer also plays the key role in the analysis of Transformers [17, 3, 12, 28, 2]; for example, Yun et al. [28] show that Transformers can approximate any continuous sequence-to-sequence function (i.e., universal approximation), by proving that self-attention layers can compute contextual mappings of the input embeddings.
On the other hand, the self-attention layer is also the main bottleneck in scaling these models. It involves computation of pairwise inner products between input tokens, which results in quadratic computational complexity in the length $n$ of the input sequence [24]. To mitigate this issue, researchers have developed methods to sparsify the pairwise interactions/connections in self-attention layers to reduce the computational complexity and/or improve model interpretability, and have shown successful empirical results on tasks with long sequence lengths [11, 5, 23, 9, 8, 19, 27, 29, 22, 15, 1]. For example, Child et al. [5] propose sparse Transformers for sequence generation. One of the sparsity patterns considered in [5] is the Strided pattern, where the sparse attention layers alternate between two patterns: each token attends to only i) $w$ local neighbors, and then ii) one token after every $w$ tokens in a strided manner. By choosing $w = O(\sqrt{n})$, they propose sparse attention layers with $O(n\sqrt{n})$ connections and show improvements on both speed and performance over the dense Transformer.
In the existing results, the rule of thumb while designing sparsity patterns (e.g., Strided) is connectivity; the intuition is that if each token can attend to the other tokens in multiple “hops,” then the resulting sparse Transformers do not lose much expressive power. However, there has been no formal justification for this intuition. How does sparsifying the interaction in the self-attention layers affect the model’s expressive power and ability to learn? What are the sparsity levels at which the model still retains its rich expressive power, and how is it affected by the sparsity pattern? Such fundamental questions about sparse attention models still remain unanswered.
1.1 Summary of contributions
In this paper, we take the first step towards a theoretical understanding of sparse Transformers.
We propose a unified framework to analyze sparse Transformers, which generalizes the existing approaches that sparsify attention layers (§ 3.1).
We propose a set of intuitive conditions on the sparsity pattern (Assumption 3.2) and the probability map (Assumption 3.2). Then, in Theorem 3.3, we show that sparse Transformers of fixed width and arbitrary depth that satisfy these conditions are universal approximators of any continuous sequence-to-sequence function (§ 3.2 and § 3.3).
We next show some examples of existing sparse Transformers [5, 11, 1, 9, 8, 29] that satisfy these conditions, and hence have universal approximability (§ 3.4). Surprisingly, we show that there are sparse Transformers with only $O(n)$ connections per self-attention layer (instead of $n^2$) that have enough expressive power to approximate arbitrary continuous functions (Corollary 3.4).
We report experimental results on standard NLP tasks using sparse Transformers, comparing different sparsity patterns/levels (§ 5).
2 Preliminaries and related works
In this section, we summarize the notation we will use throughout the paper, give a brief overview of Transformers, and then discuss existing efforts to sparsify the self-attention mechanism.
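2.1 Notation
For a positive integer $a$, we denote $[a] = \{1, 2, \dots, a\}$. For any vector $v \in \mathbb{R}^d$, let $v_j$ denote its $j$-th coordinate. For any matrix $A \in \mathbb{R}^{d \times n}$, let $A_j$ denote its $j$-th column, and $A_{\mathcal{S}}$ denote the submatrix consisting of columns of $A$ in an index set $\mathcal{S} \subseteq [n]$. We use $\|A\|_p$ to denote the entry-wise $\ell^p$ norm of $A$. Let $\sigma_S[\cdot]$ be the softmax operator, which takes a matrix as input and applies the softmax operation to each column of the matrix, which results in a column stochastic matrix.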
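As a small illustration of the column-wise softmax described above, the following sketch uses plain Python with a matrix stored as a list of rows. The helper name `softmax_columns` and the example matrix are ours, not from the paper; the snippet only checks that the output is column stochastic.

```python
import math

def softmax_columns(matrix):
    """Apply softmax to each column of a matrix given as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [matrix[i][j] for i in range(n_rows)]
        shift = max(col)  # subtract the column max for numerical stability
        exps = [math.exp(v - shift) for v in col]
        total = sum(exps)
        for i in range(n_rows):
            out[i][j] = exps[i] / total
    return out

# Hypothetical 3 x 2 input matrix.
A = [[1.0, 0.0],
     [2.0, 0.0],
     [3.0, 5.0]]
P = softmax_columns(A)
column_sums = [sum(P[i][j] for i in range(3)) for j in range(2)]
```

Every entry of `P` is strictly positive and each column sums to one (up to floating-point error), which is what "column stochastic" means here.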
2.2 Transformers and their universal approximation power
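A Transformer network, consisting of multiple layers of Transformer blocks, implements a sequence-to-sequence function that maps $\mathbb{R}^{d \times n}$ to $\mathbb{R}^{d \times n}$. A Transformer Block (TB) consists of two layers: a self-attention layer and a token-wise feed-forward layer, and both layers have an identity skip connection. More concretely, for an input $X \in \mathbb{R}^{d \times n}$ consisting of $d$-dimensional embeddings of $n$ tokens, a Transformer block consists of the following two layers:
$$\text{Attn}(X) = X + W_O \begin{bmatrix} \text{Head}^1(X) \\ \vdots \\ \text{Head}^h(X) \end{bmatrix}, \qquad \text{Head}^i(X) = W_V^i X \cdot \sigma_S[(W_K^i X)^T W_Q^i X], \tag{1a}$$
$$\text{TB}(X) = \text{Attn}(X) + W_2 \cdot \text{ReLU}(W_1 \cdot \text{Attn}(X)), \tag{1b}$$
where $W_O \in \mathbb{R}^{d \times mh}$, $W_V^i, W_K^i, W_Q^i \in \mathbb{R}^{m \times d}$, $W_2 \in \mathbb{R}^{d \times r}$, and $W_1 \in \mathbb{R}^{r \times d}$. Although our analysis and experiments rely on bias vectors, we omit those in (1) for simplicity.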
To endow the network with information about the position of input tokens, it is common to add a positional embedding $E \in \mathbb{R}^{d \times n}$ to the input before feeding it to the network. The positional embedding $E$ can be fixed [24] or trainable [10]; we consider the latter. Using a trainable $E$, $\mathcal{T}^{h,m,r}$ is defined to be a class of functions of the form $X \mapsto t(X + E)$, where $t$ is a composition of any number of Transformer blocks with $h$ attention heads of head size $m$, and hidden layers of width $r$. Thus, $\mathcal{T}^{h,m,r}$ is a class of Transformers with a fixed width while the depth can be arbitrary.
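Further, let $\mathcal{F}$ be the class of continuous functions $f : \mathcal{D} \to \mathbb{R}^{d \times n}$ defined on any compact domain $\mathcal{D} \subset \mathbb{R}^{d \times n}$, where continuity is defined with respect to the entry-wise $\ell^p$ norm ($1 \le p < \infty$). Yun et al. [28, Theorem 3] show that $\mathcal{T}^{2,1,4}$ can universally approximate $\mathcal{F}$. More precisely, for any $f \in \mathcal{F}$, $\epsilon > 0$, and $1 \le p < \infty$, there exists a function $g \in \mathcal{T}^{2,1,4}$ such that $d_p(f, g) := \left( \int_{\mathcal{D}} \|f(X) - g(X)\|_p^p \, dX \right)^{1/p} \le \epsilon$.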
2.3 Sparse Transformers
As seen in Eq. (1a), the self-attention layer involves computing the inner product between each pair of tokens, which we will refer to as the attention score matrix $A^i := (W_K^i X)^T W_Q^i X \in \mathbb{R}^{n \times n}$. This leads to quadratic computational complexity in $n$, which makes it expensive to apply Transformers to tasks with long sequence lengths. To mitigate this problem, one of the predominant approaches is to sparsify the interactions in the self-attention layers, which can be sub-classified into three categories.
The first category reduces computation by making $A^i$ sparse in a pre-determined manner. Each token in the sequence only attends to a fixed smaller set of other tokens instead of the whole sequence [5, 19, 1]. In some papers, auxiliary tokens are added to improve connectivity between existing tokens while maintaining sparsity [11, 27]. One drawback of these approaches is that the sparsity pattern is independent of the input, so it cannot adapt to the data. To remedy this issue, Sukhbaatar et al. [23] propose to learn the local attention span from data.
The second category studies making $A^i$ sparse after the full $A^i$ has been computed [9, 8, 29]. Here, the focus is not on the computational gain via sparsity, because the full score matrix $A^i$ has to be computed first; rather, the goal is to make attention layers more interpretable, as well as to improve performance. This line of work modifies $\sigma_S$ in (1a) to other probability maps, by using top-$k$ elements or adopting sparser variants such as sparselin-gen or $\alpha$-entmax [14, 18]. Compared to the first category, this approach has the advantage that sparsity patterns are adaptive to data.
The last category attempts to get the best of both worlds. This line of work tries to learn sparsity patterns from data using extra components that predict the connections between tokens, e.g., $k$-means clustering [22], LSTM [15], or locality-sensitive hashing [13]. This way, one can adaptively determine the sparsity patterns before computing the score matrix. However, the drawback of this approach is that one needs extra computation to train/run these additional components, which may be expensive.
3 Universal approximation theorem for sparse Transformers
In this section, we derive a unifying framework to study sparse Transformers. We then propose a set of conditions on the sparse self-attention layers, and prove that the sparse Transformers satisfying these conditions are universal approximators of any continuous sequence-to-sequence function. Finally, we show some examples of existing sparse Transformers that satisfy these conditions.
3.1 A unifying framework for sparse Transformers
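We modify the Transformer block in (1) to the following sparse Transformer block (STB):
$$\text{SAttn}^l(X) = X + W_O \begin{bmatrix} \text{SHead}^{1,l}(X) \\ \vdots \\ \text{SHead}^{h,l}(X) \end{bmatrix}, \qquad \text{SHead}^{i,l}(X)_k = W_V^i X_{\mathcal{A}_k^l} \cdot \rho[(W_K^i X_{\mathcal{A}_k^l})^T W_Q^i X_k], \tag{2a}$$
$$\text{STB}^l(X) = \text{SAttn}^l(X) + W_2 \cdot \text{ReLU}(W_1 \cdot \text{SAttn}^l(X)), \tag{2b}$$
where the sets $\mathcal{A}_k^l \subseteq [n]$, for $k \in [n]$ and $l \in [p]$, define the $p$ sparsity patterns (formally defined below), which are indexed by $l \in [p]$. Moreover, the parameter dimensions stay the same as in (1).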
Note that there are three main modifications from the dense Transformer.
(Cycling blocks) There are superscripts $l \in [p]$ added to the symbols such as $\text{SAttn}$. Unlike dense Transformers, some sparse Transformers cycle through $p$ different patterns. For example, the Strided pattern [5] described in § 1 alternates between two different patterns, which corresponds to $p = 2$. We add the superscript $l$ to include such cases in our formulation. We assume that the layers in a sparse Transformer cycle through $l = 1, \dots, p$.
(Sparsity patterns) Note that $\text{SHead}^{i,l}(X)_k$ denotes the $k$-th column of the $i$-th sparse attention head. Unlike dense Transformers, the inner product of the $k$-th query vector $W_Q^i X_k$ is taken only with $W_K^i X_{\mathcal{A}_k^l}$, the key vectors of tokens in the set $\mathcal{A}_k^l$. Hence, instead of all $n$ tokens, the $k$-th token computes attention scores with only the tokens in $\mathcal{A}_k^l$. For $l \in [p]$, we refer to the collection of the index sets $\{\mathcal{A}_k^l\}_{k \in [n]}$, or simply $\{\mathcal{A}_k^l\}$, as a sparsity pattern. As a result, $\text{SHead}^{i,l}(X)_k$ is a linear combination of columns in $W_V^i X_{\mathcal{A}_k^l}$, rather than the whole sequence.
(Probability map) After computing the attention score matrix, the dense Transformer (1) uses the softmax operator $\sigma_S$ to get a column stochastic matrix. In the sparse Transformers, we generalize $\sigma_S$ to $\rho$. The probability map $\rho$ is any map that takes a matrix as input and outputs a column stochastic matrix.
As a sanity check, by choosing $p = 1$, $\mathcal{A}_k^1 = [n]$ for all $k \in [n]$, and $\rho = \sigma_S$, we recover the dense Transformer. Note also that the sparse Transformer formulation covers the first and second categories of existing results discussed in § 2.3. The first category corresponds to choosing predetermined sparsity pattern(s) $\{\mathcal{A}_k^l\}$, while setting $\rho = \sigma_S$. The second category corresponds to opting for a probability map $\rho$ other than softmax $\sigma_S$, while maintaining $\mathcal{A}_k^1 = [n]$ for all $k \in [n]$.
In this paper, we assume for simplicity that all sparse attention heads in a single layer have identical sparsity patterns $\{\mathcal{A}_k^l\}$. However, since our result only requires two sparse attention heads per layer (as we will see in Theorem 3.3), our result can be easily extended to the case that allows multiple sparsity patterns in a single layer.
Similar to $\mathcal{T}^{h,m,r}$ in § 2.2, we define the class of functions represented by sparse Transformers. We hide the dependence of this class on the sparsity patterns and probability map to simplify the notation.
$$\mathcal{ST}^{h,m,r} := \{ X \mapsto t(X + E) \mid t \text{ is a composition of cycling sparse Transformer blocks } \text{STB}^l, \text{ each with } h \text{ heads of head size } m \text{ and hidden layer size } r, \text{ and the positional embedding } E \in \mathbb{R}^{d \times n} \text{ is trainable} \}. \tag{3}$$
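The sparse attention head of this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: tokens are 0-indexed, `X_cols` stores the columns of the input, each token's output is the value vectors of its attended set weighted by a probability map applied to the query-key scores, and the function names, weights, and patterns are all hypothetical.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_head(X_cols, W_Q, W_K, W_V, pattern, prob_map=softmax):
    """One sparse attention head.

    X_cols[k] is the embedding of token k; pattern[k] lists the (0-based)
    tokens that token k attends to; prob_map turns scores into weights.
    """
    out = []
    for k, x_k in enumerate(X_cols):
        q = matvec(W_Q, x_k)
        keys = [matvec(W_K, X_cols[j]) for j in pattern[k]]
        scores = [sum(a * b for a, b in zip(key, q)) for key in keys]
        probs = prob_map(scores)
        values = [matvec(W_V, X_cols[j]) for j in pattern[k]]
        m = len(values[0])
        out.append([sum(p * v[i] for p, v in zip(probs, values)) for i in range(m)])
    return out

# Hypothetical toy instance: n = 4 tokens, d = 2, head size m = 1.
X_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
W_Q, W_K, W_V = [[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]
n = len(X_cols)
dense_pattern = [list(range(n)) for _ in range(n)]   # every token attends to all tokens
local_pattern = [[j for j in (k - 1, k, k + 1) if 0 <= j < n] for k in range(n)]
self_pattern = [[k] for k in range(n)]               # each token attends only to itself

dense_out = sparse_head(X_cols, W_Q, W_K, W_V, dense_pattern)
local_out = sparse_head(X_cols, W_Q, W_K, W_V, local_pattern)
self_out = sparse_head(X_cols, W_Q, W_K, W_V, self_pattern)
```

With `dense_pattern` and `softmax` the function reduces to an ordinary dense head, mirroring the sanity check above; with `self_pattern` each output is simply the token's own value vector.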
3.2 Conditions on sparsity patterns and probability map
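In this section, we define a set of conditions on the sparsity patterns $\{\mathcal{A}_k^l\}$ and the probability map $\rho$ that ensure that sparse Transformers universally approximate the function class $\mathcal{F}$ (cf. § 2.2).
For $p \in \mathbb{N}$ and the index sets $\{\mathcal{A}_k^l\}_{k \in [n], l \in [p]}$, we define a sequence of sets $\{\mathcal{S}_k^t\}_{t \ge 1}$ in a recursive way:
$$\mathcal{S}_k^1 := \mathcal{A}_k^1, \qquad \mathcal{S}_k^t := \bigcup_{j \in \mathcal{A}_k^{(t-1) \bmod p + 1}} \mathcal{S}_j^{t-1}.$$
The set $\mathcal{S}_k^t$ is the set of all tokens that the $k$-th token can directly/indirectly attend to, after $t$ sparse attention layers with sparsity patterns cycling through $\{\mathcal{A}_k^1\}, \{\mathcal{A}_k^2\}, \dots, \{\mathcal{A}_k^p\}$. We now state our conditions on sparsity patterns. The sparsity patterns $\{\mathcal{A}_k^l\}$ satisfy the following:
1. For all $k \in [n]$ and $l \in [p]$, we have $k \in \mathcal{A}_k^l$.
2. There exists a permutation $\gamma : [n] \to [n]$ such that, for all $i \in [n-1]$, $\gamma(i) \in \bigcup_{l=1}^{p} \mathcal{A}_{\gamma(i+1)}^l$.
3. There exists a finite $s \in \mathbb{N}$ such that $s = \min\{u \mid \mathcal{S}_k^u = [n] \text{ for all } k \in [n]\}$.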
Assumption 3.2.1 is equivalent to saying that every token always attends to itself. Assumption 3.2.2 requires that there is a chain of direct connections that covers all $n$ tokens; note that the set $\bigcup_{l=1}^{p} \mathcal{A}_{\gamma(i+1)}^l$ is the set of all tokens that the $\gamma(i+1)$-th token directly attends to. To elaborate more about the chain, consider a directed graph with $n$ vertices corresponding to the $n$ tokens. For any $j \in \bigcup_{l=1}^{p} \mathcal{A}_k^l$, we add a directed edge $j \to k$. Given a graph constructed this way, Assumption 3.2.2 requires that the graph has a Hamiltonian path $\gamma(1) \to \gamma(2) \to \dots \to \gamma(n)$. Assumption 3.2.3 requires that after $s$ sparse attention layers, every token can attend to all the other tokens, either directly or indirectly.
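The three conditions on sparsity patterns can be checked mechanically for a concrete pattern. The sketch below is ours and makes several assumptions: tokens are 0-indexed, `patterns[l][k]` stores the set of tokens that token `k` attends to under the `l`-th pattern, and the reachable sets follow the recursion in which the set at step $t$ is the union of the step-$(t-1)$ sets of the tokens attended to under pattern $(t-1) \bmod p + 1$. Both example patterns (a sliding window and a window plus one global token) are hypothetical toys.

```python
def reachable_sets(patterns, t):
    """Tokens each token can reach after t layers that cycle through the patterns."""
    n, p = len(patterns[0]), len(patterns)
    S = [set(patterns[0][k]) for k in range(n)]           # step 1 uses the first pattern
    for step in range(2, t + 1):
        layer = patterns[(step - 1) % p]                   # pattern used at this step
        S = [set().union(*(S[j] for j in layer[k])) for k in range(n)]
    return S

def check_conditions(patterns, perm, max_layers=64):
    """Return (self-attention holds, chain along perm holds, smallest s or None)."""
    n, p = len(patterns[0]), len(patterns)
    self_attend = all(k in patterns[l][k] for l in range(p) for k in range(n))
    direct = [set().union(*(patterns[l][k] for l in range(p))) for k in range(n)]
    chain = all(perm[i] in direct[perm[i + 1]] for i in range(n - 1))
    s = None
    for t in range(1, max_layers + 1):
        if all(S_k == set(range(n)) for S_k in reachable_sets(patterns, t)):
            s = t
            break
    return self_attend, chain, s

n = 6
identity = list(range(n))

# Toy pattern 1: each token attends to itself and its immediate neighbors.
window = [{j for j in (k - 1, k, k + 1) if 0 <= j < n} for k in range(n)]
window_result = check_conditions([window], identity)

# Toy pattern 2: same window on tokens 0..4, plus a global token 5 that
# attends to everything and is attended to by everyone.
with_global = [{j for j in (k - 1, k, k + 1) if 0 <= j < n - 1} | {n - 1}
               for k in range(n - 1)]
with_global.append(set(range(n)))
global_result = check_conditions([with_global], identity)
```

For the plain window, full reachability takes `n - 1 = 5` layers, while adding one global token brings it down to two layers; both toys satisfy the self-attention and chain conditions with the identity permutation.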
We now state the assumption on the probability map $\rho[\cdot]$. For this, we define $\sigma_H[\cdot]$ to be the hardmax operator, which outputs the one-hot representation of the $\arg\max$ entry for each column of the input matrix. Since both $\rho$ and $\sigma_H$ are column-wise operators that output column-stochastic matrices, we state the assumption for the operation of $\rho$ on a single column.
For any $\zeta > 0$ and $\eta \in (0, 1]$, there exists $t > 0$ such that, for any column input $v$ satisfying $v_{j^*} - \max_{j \neq j^*} v_j \ge \zeta$ (where $j^* = \arg\max_j v_j$), we have $\rho[tv]_{j^*} \ge 1 - \eta$ and $\sum_{j \neq j^*} \rho[tv]_j \le \eta$.
Assumption 3.2 requires that, for inputs that have some margin between the unique maximum entry and the other entries, $\rho$ can closely approximate the behavior of the hardmax operator by scaling its input by a positive factor $t$. This assumption is satisfied by softmax and other sparse variants such as sparselin-gen and $\alpha$-entmax, as we show in § B of the supplementary material.
3.3 Sparse Transformers are universal approximators
We now state our main theorem, which shows that if the sparsity patterns $\{\mathcal{A}_k^l\}$ and the probability map $\rho$ satisfy Assumptions 3.2 and 3.2, sparse Transformers with $h = 2$ attention heads of size $m = 1$, and hidden layer width $r = 4$ are universal approximators of continuous sequence-to-sequence functions on any compact domain (recall that $\mathcal{F}$ denotes the class of such continuous functions).
Consider any $f \in \mathcal{F}$, and the class of sparse Transformers $\mathcal{ST}^{2,1,4}$ (cf. (3)) with the underlying sparse attention layers satisfying Assumptions 3.2 and 3.2. Then, for any $\epsilon > 0$ and $1 \le p < \infty$, there exists a function $g \in \mathcal{ST}^{2,1,4}$ such that
$$d_p(f, g) := \left( \int_{\mathcal{D}} \|f(X) - g(X)\|_p^p \, dX \right)^{1/p} \le \epsilon.$$
As discussed earlier, dense Transformers do satisfy Assumptions 3.2 and 3.2. Thus, Theorem 3.3 subsumes the existing result [28] for dense Transformers. Also, we note that the required width parameters $h = 2$, $m = 1$, and $r = 4$ in Theorem 3.3 are completely independent of $d$, $n$, or the sparsity patterns.
The key justifying intuition for adopting sparse attention layers is that, if each token can attend to the other tokens in multiple hops (this corresponds to our Assumption 3.2.3), then these models do not lose too much expressive power. However, there has been no formal justification for this intuition. Our theorem provides the first formal evidence that well-designed sparse attention layers do not limit the Transformer's universal approximation power. In § 3.4, we show a surprising result that there exist sparse self-attention layers with only $O(n)$ connections (as opposed to $n^2$ connections in regular self-attention layers) that retain enough expressive power to approximate $\mathcal{F}$. This advantage of sparse Transformers over their dense counterpart becomes even stronger with increasing sequence length $n$, providing theoretical support for the adoption of sparsity for tasks with long sequence lengths.
We provide a high-level proof sketch of Theorem 3.3 in § 4.1. Although the outline of the proof is similar to [28], the sparsity in the attention mechanism and the choice of a general probability map $\rho$ pose nontrivial challenges in extending the existing result to our setting. We detail these challenges and briefly describe how we overcome them in § 4.2.
3.4 Analysis of existing sparse Transformers
Child et al. [5] propose two kinds of 2-step sparsity patterns (i.e., $p = 2$) for sequence generation tasks, namely Strided and Fixed patterns. We consider the extension of their autoregressive patterns (i.e., attending only to past tokens) to the whole sequence. In the Strided pattern, a token first attends to its $w$ neighbors and then attends to one token after every $w$ tokens in a strided manner. The sparsity pattern for the $k$-th token reads
$$\mathcal{A}_k^1 = [n] \cap \{k - \lceil w/2 \rceil, \dots, k - 1, k, k + 1, \dots, k + \lfloor w/2 \rfloor\}, \qquad \mathcal{A}_k^2 = [n] \cap \{\dots, k - 2w, k - w, k, k + w, k + 2w, \dots\}. \tag{4}$$
In the Fixed pattern, we divide the tokens into segments of length $w$. A token in a segment has access to other tokens in the same segment, and then the last tokens of the other segments:
$$\mathcal{A}_k^1 = [n] \cap \{\lceil k/w \rceil \cdot w - w + 1, \dots, \lceil k/w \rceil \cdot w\}, \qquad \mathcal{A}_k^2 = [n] \cap (\{k\} \cup \{w, 2w, 3w, \dots\}). \tag{5}$$
The Strided and Fixed patterns satisfy both Assumption 3.2 and 3.2 for all values of $w$. Specifically, Assumption 3.2.3 holds with $s = 2$, because any token can directly/indirectly access all the tokens in two hops. As for Assumption 3.2.2, the identity permutation $\gamma(i) = i$ suffices to satisfy the assumption for both patterns. By choosing $w = O(\sqrt{n})$, sparse Transformers with the Strided and Fixed patterns achieve universal approximation power with $O(n\sqrt{n})$ connections per attention layer.
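To make the connection counts concrete, the sketch below builds Strided and Fixed index sets and counts their connections. It assumes specific forms for the index sets that are our reading of the description above rather than something confirmed by the surrounding text. For Strided, a window from $k - \lceil w/2 \rceil$ to $k + \lfloor w/2 \rfloor$ followed by every $w$-th token counted from $k$. For Fixed, the length-$w$ segment containing $k$ followed by $k$ together with the last token of every segment. Tokens are 1-indexed and all function names are hypothetical.

```python
def strided(n, w):
    """Assumed Strided pattern: local window, then every w-th token from k."""
    half_up, half_down = (w + 1) // 2, w // 2
    A1 = [{j for j in range(k - half_up, k + half_down + 1) if 1 <= j <= n}
          for k in range(1, n + 1)]
    A2 = [{j for j in range(1, n + 1) if (j - k) % w == 0}
          for k in range(1, n + 1)]
    return A1, A2

def fixed(n, w):
    """Assumed Fixed pattern: own segment, then the last token of every segment."""
    A1, A2 = [], []
    for k in range(1, n + 1):
        seg_end = ((k + w - 1) // w) * w                  # ceil(k / w) * w
        A1.append({j for j in range(seg_end - w + 1, seg_end + 1) if j <= n})
        A2.append({k} | set(range(w, n + 1, w)))
    return A1, A2

def connections(pattern):
    """Total number of (query, key) pairs kept by a sparsity pattern."""
    return sum(len(A_k) for A_k in pattern)

n, w = 16, 4                                              # w = sqrt(n)
s1, s2 = strided(n, w)
f1, f2 = fixed(n, w)
counts = {
    "strided": (connections(s1), connections(s2)),
    "fixed": (connections(f1), connections(f2)),
    "dense": n * n,
}
```

With `n = 16` and `w = 4`, every layer keeps between 64 and 76 connections, compared with 256 for dense attention; each token keeps on the order of $w$ or $n/w$ connections, which is where the $O(n\sqrt{n})$ total for $w = O(\sqrt{n})$ comes from.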
Guo et al. [11] consider the Star sparsity pattern where they add an auxiliary relay token that attends to all the tokens, and the other tokens attend only to $2w$ neighboring tokens and the relay token. There is only one sparsity pattern, so $p = 1$. The Star sparsity pattern can be written as
$$\mathcal{A}_k^1 = \{n\} \cup \{(i - 1) \bmod (n - 1) + 1 \mid i \in \{k - w, \dots, k + w\}\} \text{ for } k \in [n-1], \qquad \mathcal{A}_n^1 = [n], \tag{6}$$
where $w \ge 1$. For any fixed $w$, this sparse Transformer has $O(n)$ connections per attention layer, and it satisfies both assumptions. Specifically, Assumption 3.2.2 is satisfied with the identity permutation, i.e., $\gamma(i) = i$ for $i \in [n]$. Since any token can access other tokens within two hops, Assumption 3.2.3 is satisfied with $s = 2$. This demonstrates that $O(n)$ connections per layer suffice for sparse attention layers to have universal approximation power. One can similarly check that the sliding window sparsity patterns with/without global attention, proposed in Longformer [1], also satisfy the assumptions with $O(n)$ connections. We state this interesting observation as a corollary below.
There exist sparse Transformers with $O(n)$ connections per self-attention layer that are universal approximators in the sense of Theorem 3.3.
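The linear growth in the number of connections can be seen on a simplified Star-like pattern. The sketch below is a hypothetical simplification, not the exact pattern of Guo et al.: token `n` is the relay and attends to everything, and every other token attends to itself, its `w` nearest neighbors on each side (without wrap-around), and the relay. Tokens are 1-indexed.

```python
def star_like_pattern(n, w=1):
    """Simplified Star-like pattern with token n as the relay token."""
    relay = n
    pattern = {}
    for k in range(1, n):
        neighbors = {j for j in range(k - w, k + w + 1) if 1 <= j <= n - 1}
        pattern[k] = neighbors | {relay}      # local window plus the relay
    pattern[relay] = set(range(1, n + 1))     # the relay attends to all tokens
    return pattern

def num_connections(pattern):
    return sum(len(A_k) for A_k in pattern.values())

counts = {n: num_connections(star_like_pattern(n)) for n in (8, 16, 32)}
```

For `w = 1` this toy pattern has `5n - 6` connections, so doubling `n` roughly doubles the count, in contrast to the fourfold increase of dense attention.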
4 Proof sketch and discussion
4.1 Sketch of proof of Theorem 3.3
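Now, we sketch the proof of Theorem 3.3, which consists of three steps. Throughout the proof, we assume without loss of generality that $\mathcal{D} \subset [0, 1)^{d \times n}$.
In the first step, we approximate $f \in \mathcal{F}$ with a piecewise constant function. Towards this, consider a class of piecewise constant functions $\overline{\mathcal{F}}(\delta)$ that map $\mathcal{D}$ to $\mathbb{R}^{d \times n}$, where $\delta > 0$ and $\delta^{-1}$ is an integer. Any function in $\overline{\mathcal{F}}(\delta)$ maps cubes of the form $G + [0, \delta)^{d \times n}$ to matrices $A_G \in \mathbb{R}^{d \times n}$, where $G \in \{0, \delta, \dots, 1 - \delta\}^{d \times n}$. We approximate $f$ with a function $\bar{f} \in \overline{\mathcal{F}}(\delta)$ such that $d_p(f, \bar{f}) \le \epsilon/2$, by choosing small enough $\delta$. We defer the statement and the proof to § C of the supplementary material.
We then approximate $\bar{f} \in \overline{\mathcal{F}}(\delta)$ with a sparse Transformer network with a slightly different architecture. In this architecture, we replace $\text{ReLU}$ in the feed-forward layer with any piecewise linear activation $\phi \in \Phi$, where $\Phi$ denotes the class of piecewise linear functions with three pieces. We also replace $\rho$ in the sparse attention layer with the hardmax operator $\sigma_H$. We refer to the function class represented by the modified sparse Transformer as $\overline{\mathcal{ST}}^{h,m,r}$. The next key result shows that any $\bar{f} \in \overline{\mathcal{F}}(\delta)$ can be exactly represented by the modified Transformer. See § D and § E in the supplementary material for the proof.
For any $\bar{f} \in \overline{\mathcal{F}}(\delta)$, there exists $\bar{g} \in \overline{\mathcal{ST}}^{2,1,1}$ such that $\bar{f}(X) = \bar{g}(X)$ for all $X \in \mathcal{D}$.
The final step is to approximate the function $\bar{g} \in \overline{\mathcal{ST}}^{2,1,1}$ with a sparse Transformer $g \in \mathcal{ST}^{2,1,4}$. This is done by approximating $\phi$ and $\sigma_H$ with $\text{ReLU}$ and $\rho$, respectively, while carefully bounding the accumulation of errors introduced by the approximation. See § F in the supplementary material for the details.
For $\bar{g} \in \overline{\mathcal{ST}}^{2,1,1}$ in Lemma 4.1, there exists $g \in \mathcal{ST}^{2,1,4}$ such that $d_p(\bar{g}, g) \le \epsilon/2$.
Combining these three steps, we establish that $d_p(f, g) \le d_p(f, \bar{f}) + d_p(\bar{f}, \bar{g}) + d_p(\bar{g}, g) \le \epsilon$. ∎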
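The first step can be illustrated on a one-dimensional toy problem. The sketch below is ours and is not the construction used in the proof: it replaces the matrix domain by the interval $[0, 1)$, defines the piecewise constant function by evaluating the target at the left corner of each grid cell, and estimates the $\ell^2$ distance with a midpoint rule. The target function and all names are hypothetical.

```python
import math

def piecewise_constant(f, delta_inv):
    """Return a function that is constant on each cell [g, g + 1/delta_inv)."""
    def f_bar(x):
        g = math.floor(x * delta_inv) / delta_inv   # left corner of the cell containing x
        return f(g)
    return f_bar

def lp_distance(f, g, p=2, samples=4096):
    """Midpoint-rule estimate of (integral over [0, 1) of |f - g|^p)^(1/p)."""
    total = sum(abs(f(x) - g(x)) ** p
                for x in ((i + 0.5) / samples for i in range(samples)))
    return (total / samples) ** (1.0 / p)

def target(x):
    return math.sin(2.0 * math.pi * x)

# Approximation error for grid sizes delta = 1/8 and delta = 1/64.
errors = {N: lp_distance(target, piecewise_constant(target, N)) for N in (8, 64)}
```

Refining the grid from `delta = 1/8` to `delta = 1/64` shrinks the estimated distance, which is the mechanism that lets a small enough $\delta$ meet any prescribed error budget for a continuous target on a compact domain.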
4.2 Key challenges in the proof
While the high-level outline of the proof is similar to the one for dense Transformers [28], the proof in [28] crucially relies on having all connections for computing attention in each layer, which we do not have in sparse Transformers. Instead, we rely on the sparsity conditions (cf. Assumptions 3.2 and 3.2) to establish Theorem 3.3. We highlight the key differences below.
Establishing Step 2 of the dense result [28] relies on constructing a contextual mapping using attention layers. A contextual mapping is a function that maps tokens in different sequences to unique values, thereby allowing Transformers to distinguish the same token appearing in different contexts. A crucial ingredient in the construction of such a mapping is a shift operation implemented with two attention heads in an attention layer. This shift operation involves each token taking the maximum and minimum over the entire sequence, which obviously cannot be done with sparse Transformers as it would require each token to attend to every other token in the sequence. We circumvent this issue by carefully choosing the positional embedding $E$ dependent on $\gamma$ (cf. Assumption 3.2.2), and ensuring that a similar shift operation is applied in a desired order even under sparsity.
As the final phase of the contextual mapping in [28], a single attention layer shifts the entire sequence by the maximum over the sequence. Again, this cannot be directly implemented due to sparsity. Using Assumption 3.2.3, we instead prove that by stacking $s$ sparse layers, one can successfully implement a similar operation that shifts the entire sequence by the maximum over the whole sequence, up to some controlled errors. This way, we overcome the difficulties posed by the sparsity and construct a new version of contextual mappings. The details can be found in § E.2 of the supplementary material.
Moreover, the proof of Step 3 in [28] uses the simple fact that softmax can approximate hardmax arbitrarily closely. Since we do not restrict ourselves to softmax and generalize the probability map, a more careful argument is required. Since there are many layers in the network $\bar{g}$, it turns out that approximating it with an original sparse Transformer in $\mathcal{ST}^{2,1,4}$ requires carefully controlling the approximation errors accumulated over layers. The proof of Lemma 4.1 in § F of the supplementary material shows that this is indeed possible by utilizing Assumption 3.2.
5 Experiments
We now present our experimental study comparing different design and implementation choices, including sparsity patterns and levels, on three tasks: i) a synthetic copying task, ii) language modeling, and iii) translation. Our goal is to understand the effect of such choices while employing sparse Transformers on tasks with small sequence lengths, complementing the existing results for sparse Transformers on long sequence tasks [5].
5.1 Experiment Settings
We consider four sparsity patterns: Strided (4), Fixed (5), Star (6) and Random. The first three patterns are proposed in [5] and [11]; we test them for different values of $w$. In the case of the Random pattern, given a sparsity level, we make connections uniformly at random. Following [5], Strided and Fixed patterns are tested for three different head configurations: i) Sequential, where the sparse attention layers alternate between $\{\mathcal{A}_k^1\}$ and $\{\mathcal{A}_k^2\}$, as described in the previous sections; ii) Union, where all sparse attention layers use the sparsity pattern $\{\mathcal{A}_k^1 \cup \mathcal{A}_k^2\}$; and iii) Multihead, where half of the attention heads in every attention layer use $\{\mathcal{A}_k^1\}$ and the other half use $\{\mathcal{A}_k^2\}$. Note that, given the same sequence length, Union is less sparse than the other two configurations. Thus, to ensure fair comparisons, we compare different configurations based on their sparsity levels.
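The three head configurations can be written down as a small lookup from (layer, head) to a sparsity pattern. The sketch below is illustrative only: the two toy patterns, the function names, and the use of "fraction of the $n^2$ possible connections that are kept" as a stand-in for the sparsity level are our assumptions, since the exact definition of the sparsity level is not given here.

```python
def union_pattern(A1, A2):
    """Token-wise union of two sparsity patterns."""
    return [a | b for a, b in zip(A1, A2)]

def density(pattern):
    """Fraction of the n * n possible connections kept by a pattern."""
    n = len(pattern)
    return sum(len(A_k) for A_k in pattern) / (n * n)

def head_patterns(config, A1, A2, num_layers, num_heads):
    """Return patterns[layer][head] for the Sequential, Union and Multihead setups."""
    layers = []
    for layer in range(num_layers):
        if config == "sequential":          # layers alternate between the two patterns
            heads = [A1 if layer % 2 == 0 else A2] * num_heads
        elif config == "union":             # every head uses the union pattern
            heads = [union_pattern(A1, A2)] * num_heads
        elif config == "multihead":         # half of the heads per pattern
            half = num_heads // 2
            heads = [A1] * half + [A2] * (num_heads - half)
        else:
            raise ValueError(config)
        layers.append(heads)
    return layers

# Hypothetical toy patterns on n = 4 tokens (0-indexed): a local window and a stride of 2.
n, w = 4, 2
A1 = [{j for j in (k - 1, k, k + 1) if 0 <= j < n} for k in range(n)]
A2 = [{j for j in range(n) if (j - k) % w == 0} for k in range(n)]
densities = (density(A1), density(A2), density(union_pattern(A1, A2)))
```

On this toy example the union keeps 14 of 16 connections while the individual patterns keep 10 and 8, which illustrates why Union has to be compared with the other configurations at matched sparsity levels rather than at matched $w$.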
We use maximum sequence length 256 in all our experiments. For the copying task, we experiment with only one sparse Transformer block (cf. Eq (2)), with varying numbers of attention layers, where each attention layer has attention heads. For the language modeling and the translation, we use the Tensor2Tensor  framework and employ 12-block and 6-block (respectively) Transformers with attention heads per block. For more details of the setup, see § G of the supplementary material.
We consider a synthetic copying task proposed in [13], where the input sequence has the format $0s0s$, where $s$ is a length-127 sequence of symbols. The models have to predict (copy) the second half, given the first half of the input. This task tests the ability of sparse Transformers to communicate information across the sequence. Table 1 presents the results for this task. Except for the Star and Random patterns, we can see that the networks learn to copy the sequences with four sparse attention layers. One possible explanation for the bad performance of Star is that, except for the relay token, each token attends only to local neighbors, while the task requires copying distant tokens.
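A data generator for a copying task of this kind might look as follows. This is a sketch under assumptions: the sequence layout (a separator `0`, a length-127 sequence, the separator again, and a copy of the sequence, giving 256 tokens in total), the vocabulary size, and the function name are all ours and are not taken from the experimental code.

```python
import random

def make_copy_example(rng, half_len=127, vocab_size=128):
    """Return a sequence 0, s, 0, s where s has half_len symbols in [1, vocab_size)."""
    s = [rng.randrange(1, vocab_size) for _ in range(half_len)]
    return [0] + s + [0] + s

rng = random.Random(0)           # fixed seed so the example is reproducible
seq = make_copy_example(rng)
inputs, targets = seq[:128], seq[128:]   # the model sees the first half, predicts the second
```

Each target symbol sits exactly 128 positions after its source, so a purely local sparsity pattern has to relay the information through several layers before the copy becomes possible.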
We conduct the language modeling experiments on the One Billion Word Benchmark [4], which has almost one billion tokens and a vocabulary of more than 800K unique tokens. In Figure 1(a), we plot the perplexity against the sparsity level. We observe that the Strided pattern and the Star pattern achieve the best performance across all sparsity levels. For both the Strided and Fixed patterns, the Union configuration shows the best performance.
For the translation task, we train the model on the WMT18 English-Czech (en-cs) dataset and test it on the Newstest 2015 dataset. We plot the BLEU score against the sparsity level in Figure 1(b). We apply the same sparsity pattern to both the encoder and the decoder. The Strided and Fixed patterns with the Union configuration show the best scores, which are similar to those of dense attention. The Union configuration is also the least sensitive to the sparsity levels.
In both language modeling and translation, the Random pattern performs significantly worse than the deterministic patterns, demonstrating the need for a careful design of sparsity patterns. Overall, our experiments suggest that the design of the optimal sparsity patterns is dependent on specific tasks. For example, the Star pattern shows the best performance on the language modeling task, while having trouble with copying and translation. Among the three head configurations tested for Strided and Fixed, Union performs the best in all tasks and is also insensitive to sparsity levels, possibly because it suffers less from the “unbalance” between $\{\mathcal{A}_k^1\}$ and $\{\mathcal{A}_k^2\}$ (cf. Eqs. (4) and (5)).
Recently, sparse Transformers have received a lot of attention as they enable more efficient/faster attention mechanisms for the tasks with very long sequence lengths. We take an initial step to provide a theoretical understanding of these models. We provide a unifying framework that captures existing sparse attention models, and prove a universal approximation theorem for sparse Transformers which holds under intuitive conditions on sparsity patterns and probability maps. We also carry out experiments comparing different sparsity patterns and levels on standard NLP tasks. We hope that this work will shed light on the understanding of sparsity in attention layers, and provide guidance for the design of sparse attention models.
Chulhee Yun acknowledges partial support as a graduate Research Assistant from the NSF Grant (CAREER 1846088). Chulhee Yun also acknowledges Korea Foundation for Advanced Studies for their support.
References
[1] I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document Transformer. arXiv preprint arXiv:2004.05150, 2020.
[2] S. Bhojanapalli, C. Yun, A. S. Rawat, S. J. Reddi, and S. Kumar. Low-rank bottleneck in multi-head attention models. arXiv preprint arXiv:2002.07028, 2020.
[3] G. Brunner, Y. Liu, D. Pascual, O. Richter, M. Ciaramita, and R. Wattenhofer. On identifiability in Transformers. arXiv preprint arXiv:1908.04211, 2019.
[4] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, and P. Koehn. One billion word benchmark for measuring progress in statistical language modeling. CoRR, abs/1312.3005, 2013. URL http://arxiv.org/abs/1312.3005.
[5] R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse Transformers. arXiv preprint arXiv:1904.10509, 2019.
[6] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning. What does BERT look at? An analysis of BERT’s attention. arXiv preprint arXiv:1906.04341, 2019.
[7] A. Coenen, E. Reif, A. Yuan, B. Kim, A. Pearce, F. Viégas, and M. Wattenberg. Visualizing and measuring the geometry of BERT. arXiv preprint arXiv:1906.02715, 2019.
[8] G. M. Correia, V. Niculae, and A. F. Martins. Adaptively sparse Transformers. arXiv preprint arXiv:1909.00015, 2019.
[9] B. Cui, Y. Li, M. Chen, and Z. Zhang. Fine-tune BERT with sparse self-attention mechanism. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3539–3544, 2019.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional Transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[11] Q. Guo, X. Qiu, P. Liu, Y. Shao, X. Xue, and Z. Zhang. Star-Transformer. arXiv preprint arXiv:1902.09113, 2019.
[12] M. Hahn. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171, 2020.
[13] N. Kitaev, Ł. Kaiser, and A. Levskaya. Reformer: The efficient Transformer. arXiv preprint arXiv:2001.04451, 2020.
[14] A. Laha, S. A. Chemmengath, P. Agrawal, M. Khapra, K. Sankaranarayanan, and H. G. Ramaswamy. On controllable sparse alternatives to softmax. In Advances in Neural Information Processing Systems, pages 6422–6432, 2018.
[15] X. Li, Y. Meng, Q. Han, F. Wu, and J. Li. SAC: Accelerating and structuring self-attention via sparse adaptive connection. arXiv preprint arXiv:2003.09833, 2020.
[16] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[17] J. Pérez, J. Marinković, and P. Barceló. On the Turing completeness of modern neural network architectures. arXiv preprint arXiv:1901.03429, 2019.
[18] B. Peters, V. Niculae, and A. F. Martins. Sparse sequence-to-sequence models. arXiv preprint arXiv:1905.05702, 2019.
[19] J. Qiu, H. Ma, O. Levy, S. W.-t. Yih, S. Wang, and J. Tang. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972, 2019.
[20] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding by generative pre-training. Technical Report, OpenAI, 2018.
[21] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. Technical Report, OpenAI, 2019.
[22] A. Roy, M. Saffar, A. Vaswani, and D. Grangier. Efficient content-based sparse attention with routing Transformers. arXiv preprint arXiv:2003.05997, 2020.
[23] S. Sukhbaatar, E. Grave, P. Bojanowski, and A. Joulin. Adaptive attention span in Transformers. arXiv preprint arXiv:1905.07799, 2019.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[25] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar, et al. Tensor2Tensor for neural machine translation. arXiv preprint arXiv:1803.07416, 2018.
[26] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
[27] Z. Ye, Q. Guo, Q. Gan, X. Qiu, and Z. Zhang. BP-Transformer: Modelling long-range context via binary partitioning. arXiv preprint arXiv:1911.04070, 2019.
[28] C. Yun, S. Bhojanapalli, A. S. Rawat, S. J. Reddi, and S. Kumar. Are Transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020.
[29] G. Zhao, J. Lin, Z. Zhang, X. Ren, Q. Su, and X. Sun. Explicit sparse Transformer: Concentrated attention through explicit selection. arXiv preprint arXiv:1912.11637, 2019.
Appendix A Outline and notation
The supplementary material is organized as follows. First, § B proves that the softmax operator as well as its sparse versions indeed satisfy Assumption 3.2. Next, § C provides formal statements of Step 1 in the proof sketch (§ 4.1). The outline of proof of Lemma 4.1 (Step 2 in the proof sketch) is presented in § D, followed by a separate section (§ E) proving the three key sublemmas in the proof. The proof of Step 3, Lemma 4.1, is given in § F. Lastly, § G presents the detailed setup of our experiments.
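We next review some of the notation and also introduce additional notation used throughout the supplementary material. For a positive integer $a$, let $[a] := \{1, 2, \dots, a\}$. For $a, b, c \in \mathbb{R}$ where $b - a > 0$ is an integer multiple of $c > 0$, we write $[a : c : b] := \{a, a + c, a + 2c, \dots, b - c, b\}$. For any matrix $A \in \mathbb{R}^{d \times n}$, let $A_j$ denote its $j$-th column, and $A_{\mathcal{S}}$ denote the submatrix consisting of columns of $A$ in the index set $\mathcal{S} \subseteq [n]$. We also use $A_{i,j}$ to denote its $(i,j)$-th entry. Let $\mathbb{1}\{\cdot\}$ be the 0-1 indicator for an event. Let $\mathbf{1}_n \in \mathbb{R}^n$ be a vector whose components are all 1.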
Appendix B Sparse probability maps satisfy Assumption 3.2
In this section, we show that the softmax operator as well as the probability maps used to replace softmax in the existing approaches, namely softmax with only top-$k$ inputs, sparselin-gen, and $\alpha$-entmax, all satisfy Assumption 3.2. The precise statement is given in Assumption 3.2 of the main text. As in the assumption, we only consider the operation of these probability maps on a single vector, as they are applied column-wise. For each of the probability maps, we will show that for any $\zeta > 0$ and $\eta \in (0, 1]$, we can choose $t > 0$ that satisfies the conditions of Assumption 3.2.
B.1 Softmax & softmax with top-$k$ inputs
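Given an input vector $v \in \mathbb{R}^n$, the $j$-th coordinate of the output of softmax $\sigma_S[v]$ is defined as
$$\sigma_S[v]_j := \frac{\exp(v_j)}{\sum_{i=1}^{n} \exp(v_i)}.$$
We assume without loss of generality that the entries of $v$ are in decreasing order, where the first two entries satisfy $v_1 - v_2 \geq \zeta$. For any such $v$ and any $0 < \eta \leq 1$, our aim is to show the existence of $t > 0$ such that $\sigma_S[tv]_1 \geq 1 - \eta$. Then, $\sum_{j=2}^{n} \sigma_S[tv]_j \leq \eta$ follows.
Now, since $v_i \leq v_1 - \zeta$ for $i = 2, \dots, n$, note that
$$\sigma_S[tv]_1 = \frac{\exp(tv_1)}{\sum_{i=1}^{n} \exp(tv_i)} \geq \frac{\exp(tv_1)}{\exp(tv_1) + (n-1)\exp(tv_1 - t\zeta)} = \frac{1}{1 + (n-1)\exp(-t\zeta)}.$$
Since $\frac{1}{1 + (n-1)\exp(-t\zeta)}$ is an increasing function in $t > 0$, one can increase $t$ sufficiently large to make it greater than $1 - \eta$.
The same argument holds for the softmax with top-$k$ inputs. By the assumption on $v$, entries $v_1, \dots, v_k$ are the top $k$ components. Thus,
$$\rho[tv]_1 = \frac{\exp(tv_1)}{\sum_{i=1}^{k} \exp(tv_i)} \geq \frac{1}{1 + (k-1)\exp(-t\zeta)} \geq 1 - \eta$$
can be satisfied by choosing large enough $t$.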
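As a numerical sanity check of the argument above, the following Python sketch solves the inequality $1/(1 + (n-1)\exp(-t\zeta)) \geq 1 - \eta$ for $t$ and confirms that both the scaled softmax and a top-$k$ variant put at least $1 - \eta$ of the mass on the largest entry. The helper names (`softmax`, `top_k_softmax`, `temperature_for`) and the example vector are illustrative assumptions, not part of the original construction.

```python
import math

def softmax(v):
    # Subtract the maximum for numerical stability.
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_softmax(v, k):
    # Softmax restricted to the k largest entries; all other outputs are 0.
    top = sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k]
    probs = softmax([v[i] for i in top])
    out = [0.0] * len(v)
    for i, p in zip(top, probs):
        out[i] = p
    return out

def temperature_for(n, zeta, eta):
    # Solve 1 / (1 + (n - 1) * exp(-t * zeta)) >= 1 - eta for t > 0.
    if n == 1 or eta >= 1.0:
        return 1.0  # the bound holds for any positive t
    return max(math.log((n - 1) * (1.0 - eta) / eta) / zeta, 1e-9)

# Hypothetical example: the gap between the top two entries is zeta = 0.5.
v = [1.0, 0.5, 0.5, -2.0]
zeta, eta = 0.5, 0.01
t = temperature_for(len(v), zeta, eta)
dense = softmax([t * x for x in v])
sparse = top_k_softmax([t * x for x in v], k=2)
print(t, dense[0], sparse[0])
```

Because the bound depends on $v$ only through the gap $\zeta$ and the length $n$, the same $t$ works for every input with that gap, which is what the assumption requires.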
where is the probability simplex. Then, the solution for optimization problem above can be written as
where is a threshold function that chooses the threshold such that .
Now, assume without loss of generality that the entry of is in decreasing order, where the first two entries satisfy . For any such and any , our aim is to show the existence of such that . This is done by choosing . To see this, notice that if ’s are in decreasing order, then are also in decreasing order. Now consider
If , then for all , and . If , then
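Next, consider $\alpha$-entmax with parameter $\alpha \geq 1$, defined as
$$\rho[v] := \arg\max_{p \in \Delta} \; p^T v + H_\alpha(p),$$
where $\Delta$ is the probability simplex and $H_\alpha$ is the Tsallis continuous family of entropies
$$H_\alpha(p) := \begin{cases} \frac{1}{\alpha(\alpha-1)} \sum_j \left(p_j - p_j^\alpha\right) & \text{if } \alpha > 1, \\ -\sum_j p_j \log p_j & \text{if } \alpha = 1. \end{cases}$$
The solution of $\alpha$-entmax is equal to softmax if $\alpha = 1$, and otherwise ($\alpha > 1$) it is given in the form
$$\rho[v]_j = \left[\max\{0, (\alpha-1)v_j - \tau(v)\}\right]^{\frac{1}{\alpha-1}},$$
where $\tau$ is a threshold function that chooses the threshold $\tau(v)$ such that $\sum_j \rho[v]_j = 1$. Since softmax ($\alpha = 1$) is already covered above, we focus on $\alpha > 1$.
Again, assume without loss of generality that the entries of $v$ are in decreasing order, where the first two entries satisfy $v_1 - v_2 \geq \zeta$. For any such $v$ and any $0 < \eta \leq 1$, our aim is to show the existence of $t > 0$ such that $\rho[tv]_1 \geq 1 - \eta$. This is done by choosing $t = \frac{1}{\zeta(\alpha-1)}$.
Note that $(\alpha-1)t(v_1 - v_2) \geq 1$ due to our choice of $t$. Then, we will show that with such a $t$, $\rho[tv]_1 = 1$ must hold. For the sake of contradiction, suppose not: $\rho[tv]_1 < 1$. Then, since the entries of $\rho[tv]$ are in decreasing order and sum to one, we have $\rho[tv]_2 > 0$. This means
$$\rho[tv]_2 = \left[(\alpha-1)tv_2 - \tau(tv)\right]^{\frac{1}{\alpha-1}} > 0;$$
in particular, we have $(\alpha-1)tv_2 - \tau(tv) > 0$. However, recall that $(\alpha-1)t(v_1 - v_2) \geq 1$, which implies $(\alpha-1)tv_1 - \tau(tv) > 1$. This results in
$$\rho[tv]_1 = \left[(\alpha-1)tv_1 - \tau(tv)\right]^{\frac{1}{\alpha-1}} > 1,$$
thus contradicting $\rho[tv]_1 < 1$. Therefore, $\rho[tv]_1 = 1$ must hold.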
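The thresholding behaviour of $\alpha$-entmax can be checked numerically. The sketch below assumes the standard closed form $\rho[v]_j = [\max\{0, (\alpha-1)v_j - \tau\}]^{1/(\alpha-1)}$ for $\alpha > 1$ and finds the threshold $\tau$ by bisection; the function name `entmax`, the bisection routine, the scaling $t = 1/(\zeta(\alpha-1))$, and the example vector are all illustrative assumptions rather than the authors' implementation.

```python
def entmax(v, alpha, iters=100):
    """alpha-entmax for alpha > 1, with the threshold tau found by bisection."""
    z = [(alpha - 1.0) * x for x in v]
    power = 1.0 / (alpha - 1.0)
    # At tau = max(z) - 1 the largest entry alone contributes 1, so the sum is >= 1.
    # At tau = max(z) every entry is clipped to 0, so the sum is 0.
    lo, hi = max(z) - 1.0, max(z)
    for _ in range(iters):
        tau = (lo + hi) / 2.0
        total = sum(max(0.0, x - tau) ** power for x in z)
        if total >= 1.0:
            lo = tau
        else:
            hi = tau
    return [max(0.0, x - lo) ** power for x in z]

# Hypothetical example: the gap between the top two entries is zeta = 0.5.
v = [0.9, 0.4, 0.1]
zeta, alpha = 0.5, 1.5

# Without scaling, the output is spread over several coordinates.
unscaled = entmax(v, alpha)

# With t = 1 / (zeta * (alpha - 1)), the scaled gap (alpha - 1) * t * (v1 - v2) is >= 1,
# so all the mass moves to the largest coordinate.
t = 1.0 / (zeta * (alpha - 1.0))
scaled = entmax([t * x for x in v], alpha)
print(unscaled, scaled)
```

Unlike softmax, the scaled output here is exactly one-hot up to bisection tolerance, so the bound holds for every $\eta$ at once.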
Appendix C Details of the Step 1 in the proof sketch (§ 4.1)
We start by formally defining the function class .
where . We now state and prove the lemma. For any and , there exists a small enough such that there exists such that .
Since $f$ is a continuous function on a compact domain, it is uniformly continuous. Also, since continuity is defined with respect to the entry-wise $\ell^p$ norm, which is equivalent to the entry-wise $\ell^\infty$ norm, uniform continuity leads to
Then, suppose we create a set of cube grid points , and define a piece-wise constant approximation
Note that for any we have , so we have
This implies that
finishing the proof of the lemma. ∎
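To illustrate the idea behind this lemma, the following Python sketch builds a piecewise constant approximation of a continuous function on a grid of width $\delta$ and shows the error shrinking as $\delta$ decreases. It is a one-dimensional simplification: the scalar domain $[0, 1)$, the test function, and the use of the sup error over sample points (instead of the distance used in the lemma) are assumptions made for illustration only.

```python
import math

def piecewise_constant(f, delta):
    """Return f_bar, which is constant on each cell [k*delta, (k+1)*delta)."""
    def f_bar(x):
        # Evaluate f at the lower grid point of the cell containing x.
        return f(math.floor(x / delta) * delta)
    return f_bar

def sup_error(f, g, samples=1000):
    # Largest deviation between f and g over evenly spaced points in [0, 1).
    return max(abs(f(i / samples) - g(i / samples)) for i in range(samples))

def f(x):
    # A smooth test function with Lipschitz constant 2*pi.
    return math.sin(2.0 * math.pi * x)

errors = {delta: sup_error(f, piecewise_constant(f, delta)) for delta in (0.1, 0.01)}
print(errors)
```

For a Lipschitz function the error on each cell is at most the Lipschitz constant times $\delta$, so choosing $\delta$ small enough meets any target accuracy, which mirrors the role of uniform continuity in the proof.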
In this section, we describe in further detail how modified sparse Transformers are able to exactly express arbitrary piecewise constant functions. We show that we can compute a contextual mapping of the entire input sequence without relying on dense self-attention layers. The token-wise feed-forward layers then transform these contextual mappings to the desired output sequence.
To give a high-level summary of the proof, we want to show that given a piecewise constant function, there exists a modified Transformer network that exactly represents it. Recall first that the function class has an additive positional embedding matrix that is added to the input before it is fed to the network. We start by choosing the positional embedding and construct a Transformer network that implements quantization of the input, contextual mapping of the quantized input, and value mapping of the context ids.
1. Given the input, a series of modified feed-forward layers quantizes it so that each entry of the quantized input takes a value in a finite set of grid points (Lemma D.2).
2. Next, a series of modified sparse self-attention layers takes the quantized input and implements a contextual mapping such that, across different quantized input sequences, all the elements of the resulting context ids are distinct (Lemma D.3).
3. Finally, a series of modified feed-forward layers maps each element in the context id to the desired output value of the target function at the input (Lemma D.4).
Before discussing the details of each step, we note that although a Transformer network stacks self-attention and feed-forward layers in an alternating manner, we can use a series of an arbitrary number of layers of the same type, thanks to skip connections. The outline of the proof is similar to that of Yun et al., but a key component in their proof, called the selective shift operation, relies on the fact that each token can attend to the entire sequence; this is not true in sparse Transformers, which poses a nontrivial challenge. We overcome this issue by a more careful construction of the positional embedding and sparse self-attention layers.
D.1 Choosing the positional embedding
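Recall from Assumption 3.2.2 that there exists a permutation $\gamma : [n] \to [n]$ such that for all $i \in [n-1]$, $\gamma(i)$ is one of the tokens that the $\gamma(i+1)$-th token directly attends to. Using this permutation $\gamma$, we choose the columns of the positional embedding $E$ in the following way:
$$E_{\gamma(1)} = (n-1)\mathbf{1}_d, \quad \text{and} \quad E_{\gamma(i)} = (i-2)\mathbf{1}_d \ \text{ for } i = 2, \dots, n.$$
As a result, the $\gamma(1)$-th column of $X + E$ will be in the range $[n-1, n)^d$, and similarly $X_{\gamma(i)} + E_{\gamma(i)} \in [i-2, i-1)^d$ for $i = 2, \dots, n$. This means that the entries corresponding to different tokens lie in disjoint intervals of the form $[j, j+1)$, where $j \in \{0, 1, \dots, n-1\}$.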
D.2 Quantization by feed-forward layers
Note from the previous step that each entry of $X + E$ must be in $[0, n)$. Next, we quantize this interval $[0, n)$ of the input to a set of $\delta$-grid points $[0 : \delta : n - \delta]$. This allows us to deal with a finite set of values, which proves useful in the later stages of the proof. The next lemma shows that the quantization can be carried out using a series of the modified feed-forward layers. Consider the following entry-wise quantization map:
There exists a function composed of token-wise feed-forward layers with and an activation , which implements the entry-wise quantization to each entry of its input.
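The effect of the positional embedding followed by quantization can be sketched in a few lines of Python. The example below assumes that each token receives a distinct integer offset and that the quantization map rounds each entry down to the nearest multiple of $\delta$; the function names, the floor form of the map, and the specific offsets are hypothetical choices for illustration, and the sketch applies the map directly rather than building it from feed-forward layers.

```python
import math

def quantize(x, delta):
    # Map x to the lower endpoint of its delta-cell (assumed floor-style quantizer).
    return math.floor(x / delta) * delta

def embed_and_quantize(columns, offsets, delta):
    # columns[i] holds the entries of token i, each in [0, 1).
    # offsets[i] is the integer shift added to every entry of token i.
    return [[quantize(x + off, delta) for x in col]
            for col, off in zip(columns, offsets)]

# Hypothetical input with n = 3 tokens and d = 2 entries per token.
X = [[0.1, 0.9], [0.5, 0.3], [0.76, 0.0]]
offsets = [2, 0, 1]          # distinct integers, one per token
delta = 0.25                 # grid width; a power of two keeps the arithmetic exact

Q = embed_and_quantize(X, offsets, delta)
print(Q)
```

After this step every entry lies on the finite grid, and the entries of token $i$ stay inside the unit interval starting at that token's offset, so different tokens never share a value.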
D.3 Contextual mapping by sparse self-attention layers
After the input is quantized, the output of must be in the following set :
where was defined to be the -cubic grid points of . Using this finite set of sequences, we construct a contextual mapping that maps each sequence in to unique numbers. Recall that the sparse attention layer has sparsity patterns that rotate in cycles, and Assumption 3.2.3 assumes that one token directly or indirectly accesses all the other tokens after such sparse attention layers. We now state the lemma. Assume that , and is an integer satisfying . Suppose that the sparse self-attention layers () satisfy Assumption 3.2 and employ the hardmax operator, and that the positional embedding was chosen as described in § D.1. Then, there exist a function composed of sparse self-attention layers, and a vector , such that satisfies the following properties:
For any , the entries of are all distinct.