# Are Transformers universal approximators of sequence-to-sequence functions?

Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other simpler alternatives to self-attention layers and empirically evaluate them.

## 1 Introduction

Self-attention based Transformer networks (Vaswani et al., 2017) have been at the center of the recent progress on various natural language processing (NLP) tasks, including machine translation (Vaswani et al., 2017), language modeling (Radford et al., 2018, 2019), and question answering (Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019). All these tasks involve learning models that map an input sequence of tokens to an output sequence of tokens. Transformers make it feasible to train large models to approximate these sequence-to-sequence functions due to their ability to process the input tokens in parallel, as opposed to the sequential nature of RNNs and LSTMs.

A Transformer block consists of two kinds of layers: a self-attention layer and a token-wise feed-forward layer, with skip connections present in both layers. The self-attention layer transforms each input token embedding using a weighted combination of the embeddings of all tokens in the input sequence, where weights are generated by pairwise dot-products among the input token embeddings. The token-wise feed-forward layer then independently processes each of these modified input token embeddings without any interaction among them. Notably, Transformers employ parameter reuse across tokens, as both layers use the same parameters to process each token. Moreover, Transformers have to rely solely on the pairwise dot-products to capture interaction between the input tokens.

Given the parameter sharing and limited interactions between tokens, it is natural to wonder: what class of sequence-to-sequence functions can the Transformer networks represent? Also, what is the role of the two different kinds of layers? Are both layers needed to obtain the representation power of Transformers? In the existing literature, the advantage of Transformers has often been attributed to their capability of computing contextual embeddings/mappings of the input, as opposed to fixed word embeddings as in word2vec (Mikolov et al., 2013). Is it possible to formalize the notion of contextual mappings? If yes, can Transformers actually compute such mappings? Such questions still remain elusive.

In this paper, we provide a mathematical definition of contextual mappings and surprisingly show that multi-head self-attention layers can indeed compute contextual mappings of the input sequences. We further show that this ability to compute contextual mappings coupled with the value mapping ability of the feed-forward layers makes Transformers universal approximators of any permutation equivariant sequence-to-sequence function. We also improve this result using positional encodings, and show that Transformers can represent any sequence-to-sequence function; i.e., the restriction of permutation equivariance can be removed by positional encodings.

These results on universal approximation of sequence-to-sequence functions raise a natural question: is it possible to have a more efficient architecture to compute contextual mappings, consequently, preserving the ability to universally approximate sequence-to-sequence functions? Towards this, we discuss other simpler architectures that can implement contextual mappings (to some extent), and experimentally evaluate their performance. In our experiments, we notice that the models that combine these simpler architectures with Transformers have better performance, compared to the standalone Transformers. We conclude the paper by presenting more discussion and interesting future research directions along these lines.

### 1.1 Summary of our contributions

• We prove that Transformers are universal approximators of continuous and permutation equivariant sequence-to-sequence functions with compact support (the first theorem of Section 3). We also show that, if Transformers have trainable positional encodings added to the input, then they are universal approximators of continuous sequence-to-sequence functions on a compact domain (the second theorem of Section 3).

• We formalize the notion of contextual mappings and show that the attention layers can compute contextual mappings, where each unique context is mapped to a unique vector (Lemma 4.2).

• We experimentally evaluate other simpler layers that can compute contextual mappings to some extent, such as bi-linear projections and separable convolutions, and show that substituting some of the self-attention layers with these layers can result in better performance (Section 5).

### 1.2 Related works & notation

Analysis of attention-based models. Given the popularity of Transformers, there have been numerous works trying to understand the role of attention layers in natural language processing models. One such line of work focuses on probing the output of attention layers to understand the attention mechanism and internal language representation (Hewitt & Manning, 2019; Clark et al., 2019; Coenen et al., 2019; Vig & Belinkov, 2019). Although these results give valuable insights, a consistent theoretical analysis corroborating these findings is missing.

Universal approximation theorems. Universal approximation theorems are classical results in neural network theory, dating back many decades (Cybenko, 1989; Hornik, 1991). These results show that given unbounded width, a shallow neural network can approximate an arbitrary continuous function with compact support, up to any accuracy. Other results focusing on depth appeared more recently (Lu et al., 2017; Hanin & Sellke, 2017; Lin & Jegelka, 2018). In particular, Lu et al. (2017) and Hanin & Sellke (2017) consider fully-connected ReLU networks whose input dimension is $d$, and show that networks with width $d+1$ and unbounded depth are universal approximators of scalar-valued continuous functions. Lin & Jegelka (2018) show that a residual network with one hidden neuron per residual block is a universal approximator of scalar-valued functions, given unbounded depth. Although Transformer networks do have residual connections, due to their heavy parameter sharing, the existing analyses for residual networks do not extend to Transformers.

Sannai et al. (2019) consider universally approximating permutation invariant/equivariant functions using fully-connected ReLU networks.

Turing completeness results on Transformers. Recently, Pérez et al. (2019) have shown that Transformers with infinite precision are Turing complete, which is not the case in the finite-precision setting (Dehghani et al., 2018). We note that Turing completeness deals with computation on formal languages (thus discrete objects), while universal approximation focuses on functions on a continuum. In other words, these are two different concepts, and neither implies the other.

Notation. We use the following notation in the paper. Given a matrix $A$, let $A_{i,j}$, $A_{i,:}$, and $A_{:,j}$ denote its $(i,j)$-th entry, $i$-th row, and $j$-th column, respectively. We use $\|A\|_p$ to denote the entry-wise $\ell^p$ norm of $A$. Let $\sigma[\cdot]$ be the softmax operator, which takes a matrix as input and applies the softmax operation to each column of the matrix, which results in a column stochastic matrix, i.e., a matrix that has non-negative entries with all columns summing to 1. We similarly define $\sigma_H[\cdot]$ to be the hardmax operator, which outputs the one-hot representation of the $\arg\max$ entry for each column of the input matrix. If there are $k$ $\arg\max$ entries, then the output is $1/k$ for such entries. We use $1_n$ to denote a vector of length $n$ whose entries are all $1$. We denote the 0-1 indicator function by $\mathbb{1}\{\cdot\}$. We use $d$ and $n$ to denote the embedding dimension and the sequence length, respectively. We assume throughout that $n \geq 2$, as the Transformers reduce to residual networks when $n = 1$.

## 2 Transformer networks

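A Transformer block is a sequence-to-sequence function mapping $\mathbb{R}^{d \times n}$ to $\mathbb{R}^{d \times n}$. It consists of two layers: a self-attention layer and a token-wise feed-forward layer, with both layers having a skip connection. More concretely, for an input $X \in \mathbb{R}^{d \times n}$ consisting of $d$-dimensional embeddings of $n$ tokens, a Transformer block with multiplicative or dot-product attention (Luong et al., 2015) consists of the following two layers:

$$\mathrm{Attn}(X) = X + \sum_{i=1}^{h} W_O^i W_V^i X \cdot \sigma\big[(W_K^i X)^T W_Q^i X\big], \tag{1}$$

$$\mathrm{FF}(X) = \mathrm{Attn}(X) + W_2 \cdot \mathrm{ReLU}\big(W_1 \cdot \mathrm{Attn}(X) + b_1 1_n^T\big) + b_2 1_n^T, \tag{2}$$

where $W_O^i \in \mathbb{R}^{d \times m}$, $W_V^i, W_K^i, W_Q^i \in \mathbb{R}^{m \times d}$, $W_2 \in \mathbb{R}^{d \times r}$, $W_1 \in \mathbb{R}^{r \times d}$, $b_2 \in \mathbb{R}^{d}$, $b_1 \in \mathbb{R}^{r}$, and $\mathrm{FF}(X)$ is the output of the Transformer block. The number of heads $h$ and the head size $m$ are two main parameters of the attention layer; and $r$ denotes the hidden layer size of the feed-forward layer. (In our proof we use bias vectors for query projections in attention layers. We omit them here for brevity.)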
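As a rough illustration of Eqs. (1) and (2), the block can be sketched in NumPy as below. This is only a minimal sketch under the shape conventions stated above; the function names (`softmax_columns`, `attn`, `feed_forward`, `transformer_block`) and the tuple layout of `heads` are assumptions made for this example and do not come from the paper.

```python
import numpy as np


def softmax_columns(a):
    """Column-wise softmax: every column of the result sums to 1."""
    a = a - a.max(axis=0, keepdims=True)  # stabilise the exponentials
    e = np.exp(a)
    return e / e.sum(axis=0, keepdims=True)


def attn(x, heads):
    """Eq. (1). `heads` is a list of (W_O, W_V, W_K, W_Q) tuples with
    shapes (d, m), (m, d), (m, d), (m, d); x has shape (d, n)."""
    out = x.copy()
    for w_o, w_v, w_k, w_q in heads:
        scores = (w_k @ x).T @ (w_q @ x)  # pairwise dot-products, shape (n, n)
        out = out + w_o @ w_v @ x @ softmax_columns(scores)
    return out


def feed_forward(a, w1, b1, w2, b2):
    """Eq. (2), applied to a = Attn(X); identical weights for every column."""
    hidden = np.maximum(w1 @ a + b1[:, None], 0.0)  # ReLU, shape (r, n)
    return a + w2 @ hidden + b2[:, None]


def transformer_block(x, heads, w1, b1, w2, b2):
    return feed_forward(attn(x, heads), w1, b1, w2, b2)
```

Note that the same weights act on every column of `x`, which is the parameter sharing across tokens discussed in the introduction.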

Here, we would like to point out that our definition of the self-attention layer (1) is an equivalent reformulation of (Vaswani et al., 2017), where they concatenate attention heads and multiply a matrix to the concatenation. One difference in our setup is the absence of layer normalization, which simplifies our analysis while preserving the basic architecture of the Transformer.

We define the Transformer networks as the composition of Transformer blocks. The family of the sequence-to-sequence functions corresponding to the Transformers can be defined as:

$$\mathcal{T}^{h,m,r} := \big\{ g : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n} \mid g \text{ is a composition of Transformer blocks } t^{h,m,r}\text{'s} \big\}, \tag{3}$$

where $t^{h,m,r} : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ denotes a Transformer block defined by an attention layer with $h$ heads of size $m$ each, and a feed-forward layer with $r$ hidden nodes.

We say that a function $f : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ is permutation equivariant if for any permutation matrix $P$, we have $f(XP) = f(X)P$; i.e., if we permute the columns of $X$, then the columns of $f(X)$ are permuted in the same way. A Transformer block is permutation equivariant, which we formally prove in Section A: a Transformer block $t^{h,m,r}$ defines a permutation equivariant map from $\mathbb{R}^{d \times n}$ to $\mathbb{R}^{d \times n}$. This consequently establishes the permutation equivariance of the class $\mathcal{T}^{h,m,r}$.
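The equivariance property $f(XP) = f(X)P$ can also be checked numerically on a randomly initialised block. The sketch below is illustrative only: it uses a single attention head for brevity, and the helper names (`block`, `random_params`, `equivariance_gap`) and the dictionary of weights are assumptions of this example, not notation from the paper.

```python
import numpy as np


def softmax_columns(a):
    a = a - a.max(axis=0, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=0, keepdims=True)


def block(x, p):
    """One Transformer block (Eqs. (1)-(2)) with a single head, for brevity."""
    scores = (p["wk"] @ x).T @ (p["wq"] @ x)
    a = x + p["wo"] @ p["wv"] @ x @ softmax_columns(scores)
    hidden = np.maximum(p["w1"] @ a + p["b1"][:, None], 0.0)
    return a + p["w2"] @ hidden + p["b2"][:, None]


def random_params(d, m, r, rng):
    """Hypothetical helper: small random weights with the required shapes."""
    shapes = {"wo": (d, m), "wv": (m, d), "wk": (m, d), "wq": (m, d),
              "w1": (r, d), "b1": (r,), "w2": (d, r), "b2": (d,)}
    return {name: 0.5 * rng.normal(size=shape) for name, shape in shapes.items()}


def equivariance_gap(x, perm, p):
    """Largest entry of |f(XP) - f(X)P| for the column permutation `perm`."""
    n = x.shape[1]
    perm_matrix = np.eye(n)[:, perm]  # right-multiplying permutes columns
    return np.max(np.abs(block(x @ perm_matrix, p) - block(x, p) @ perm_matrix))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))
params = random_params(d=4, m=1, r=4, rng=rng)
gap = equivariance_gap(x, rng.permutation(6), params)  # zero up to rounding
```

The gap is zero up to floating-point rounding because the column-wise softmax commutes with simultaneous row and column permutations of the score matrix.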

## 3 Transformers are universal approximators of sequence-to-sequence functions

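In this section, we present our results showing that the Transformer networks are universal approximators of sequence-to-sequence functions. Let us start by defining the target function class $\mathcal{F}_{PE}$, which consists of all continuous permutation equivariant functions with compact support that map $\mathbb{R}^{d \times n}$ to $\mathbb{R}^{d \times n}$. Here, continuity is defined with respect to any entry-wise $\ell^p$ norm, $1 \leq p < \infty$. Given two functions $f_1, f_2 : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$, for $1 \leq p < \infty$, we define a distance between them as

$$d_p(f_1, f_2) := \Big( \int \|f_1(X) - f_2(X)\|_p^p \, dX \Big)^{1/p}.$$

The following result shows that a Transformer network with a constant number of heads $h$, head size $m$, and hidden layer of size $r$ can approximate any function in $\mathcal{F}_{PE}$.

###### Theorem (permutation equivariant functions).

Let $1 \leq p < \infty$ and $\epsilon > 0$, then for any given $f \in \mathcal{F}_{PE}$, there exists a Transformer network $g \in \mathcal{T}^{2,1,4}$, such that $d_p(f, g) \leq \epsilon$.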

Next, we present our theorem on Transformers with positional encodings. In order to endow the Transformer networks with the ability to capture the information about the position of tokens in the input sequence, it is a common practice to add positional encodings to the input sequence before feeding it to the Transformer network  (Vaswani et al., 2017; Devlin et al., 2018). Consider the functions represented by Transformers with positional encodings:

$$\mathcal{T}_P^{h,m,r} := \big\{ g_P(X) = g(X + E) \mid g \in \mathcal{T}^{h,m,r} \text{ and } E \in \mathbb{R}^{d \times n} \big\}. \tag{4}$$

Here we show that if $E$ is trainable, these positional encodings are sufficient to remove the permutation equivariance restriction of the Transformers. Towards this, we define $\mathcal{F}_{CD}$ to be the set of all continuous functions that map a compact domain in $\mathbb{R}^{d \times n}$ to $\mathbb{R}^{d \times n}$. Note that $\mathcal{F}_{CD}$ does not have the restriction of permutation equivariance as in $\mathcal{F}_{PE}$, but any $f \in \mathcal{F}_{CD}$ is defined on a compact domain instead of the whole $\mathbb{R}^{d \times n}$. The following result states that, equipped with the trainable positional encodings, Transformers can approximate any sequence-to-sequence function in $\mathcal{F}_{CD}$.

###### Theorem (with positional encodings).

Let $1 \leq p < \infty$ and $\epsilon > 0$, then for any given $f \in \mathcal{F}_{CD}$, there exists a Transformer network $g \in \mathcal{T}_P^{2,1,4}$ such that we have $d_p(f, g) \leq \epsilon$.

The two theorems above provide an interesting characterization of the representation power of fixed-width Transformer networks. Since the function classes $\mathcal{T}^{h,m,r}$ and $\mathcal{T}_P^{h,m,r}$ become richer as we increase the values of $(h, m, r)$, our results establish that general Transformer networks are also universal approximators of sequence-to-sequence functions. Remarkably, none of the parameters $(h, m, r)$ depend on the input sequence length $n$ or embedding dimension $d$.

Here, we would like to point out that these two theorems appear quite surprising at first glance, given the parameter sharing across all the tokens in a sequence, e.g., feed-forward layers are applied token-wise and the projection matrices in the self-attention layers are the same across different tokens. Furthermore, attention layers can only capture pairwise interaction between different tokens in the sequence. In the next subsection, we briefly describe one of our key steps in overcoming the aforementioned restrictions and proving universal approximation power of Transformers.

### 3.1 A key step: self-attention layers can implement contextual mappings

Let us consider a setting where we are interested in embedding two sentences: 1) I am happy; and 2) I am Bob. These sentences are fed to a sequence-to-sequence model as

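$$X = [X_{:,1}, X_{:,2}, X_{:,3}] = [v_{\mathrm{I}}, v_{\mathrm{am}}, v_{\mathrm{happy}}] \quad \text{and} \quad \tilde{X} = [\tilde{X}_{:,1}, \tilde{X}_{:,2}, \tilde{X}_{:,3}] = [v_{\mathrm{I}}, v_{\mathrm{am}}, v_{\mathrm{Bob}}],$$

where $v_{\mathrm{I}}$, $v_{\mathrm{am}}$, $v_{\mathrm{happy}}$, and $v_{\mathrm{Bob}}$ denote the $d$-dimensional embeddings for the tokens ‘I’, ‘am’, ‘happy’, and ‘Bob’, respectively. Since the word ‘I’ occurs in different contexts in these sentences, in order to implement arbitrary sequence-to-sequence functions, the sequence-to-sequence model should map the two occurrences of ‘I’ to different values. We formally define this requirement below.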

###### Definition 3.1 (Contextual mapping).

Consider a finite set $\mathbb{L} \subset \mathbb{R}^{d \times n}$. A map $q : \mathbb{L} \to \mathbb{R}^{1 \times n}$ defines a contextual mapping if the map satisfies the following:

1. For any $L \in \mathbb{L}$, the $n$ entries in $q(L)$ are all distinct.

2. For any $L, L' \in \mathbb{L}$, with $L \neq L'$, all entries of $q(L)$ and $q(L')$ are distinct.

In other words, a contextual mapping maps each token (column) of $L \in \mathbb{L}$ to a unique value which depends on the entire $L$, thereby capturing the precise context of $L$. This allows the subsequent token-wise function (e.g., defined by the feed-forward layers in case of Transformer networks) to realize the outputs of any arbitrary sequence-to-sequence function.

At first thought, we can consider getting a contextual mapping by simply averaging all the tokens, because this can capture the one-word difference (e.g., “happy” vs. “Bob”) in two different contexts. However, if there are multiple words that are different, it is not guaranteed that the average will be different. Indeed, requiring unique mappings for all the tokens for any change in any number of tokens, is a steep requirement.
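The failure of plain averaging can be seen on a toy case. The embeddings below are made-up 2-dimensional vectors chosen purely for illustration (they are not from the paper), and `mean_context` is a hypothetical helper name.

```python
def mean_context(tokens):
    """Average a list of equal-length token embeddings, coordinate by coordinate."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / n for i in range(dim)]


# Made-up embeddings; only the arithmetic matters here.
v_i = [2.0, 2.0]
context_a = [v_i, [1.0, 0.0], [0.0, 1.0]]
context_b = [v_i, [0.5, 0.5], [0.5, 0.5]]  # differs from context_a in two tokens
context_c = [v_i, [1.0, 0.0], [0.0, 3.0]]  # differs from context_a in one token

same_average = mean_context(context_a) == mean_context(context_b)
different_average = mean_context(context_a) != mean_context(context_c)
```

Here `context_a` and `context_b` are different sequences with the same average, so an average alone cannot give the shared token `v_i` a different value in the two contexts, whereas the one-token change in `context_c` is detected.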

While the self-attention layer does consider pair-wise interactions among different input tokens, it is not clear if this weak form of pair-wise interaction with shared projection weights is sufficient to extract the underlying context. The following result, which we sketch here, shows that self-attention layers can implement a permutation equivariant contextual mapping over almost all elements of a grid in $[0,1]^{d \times n}$. We defer the full statement to Section 4.2.

###### Lemma 4.2 (informal).

Consider the grid $\mathbb{G}_\delta := \{0, \delta, \dots, 1 - \delta\}^{d \times n}$. Then, there exist a function $g_c : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ composed of $\delta^{-d} + 1$ self-attention layers ($h = 2$, $m = 1$) and a vector $u \in \mathbb{R}^d$ such that $q(L) := u^T g_c(L)$ satisfies the following properties, for a subset $\tilde{\mathbb{G}}_\delta \subset \mathbb{G}_\delta$ that contains almost all elements of $\mathbb{G}_\delta$:

1. For any $L \in \tilde{\mathbb{G}}_\delta$, the entries of $q(L)$ are all distinct.

2. For any $L, L' \in \tilde{\mathbb{G}}_\delta$ such that $L$ is not a permutation of $L'$, all entries of $q(L)$, $q(L')$ are distinct.

Lemma 4.2 shows that a series of self-attention layers can implement contextual mappings, despite the apparent restriction that each of them can only capture pair-wise interaction. However, the restriction of permutation equivariance still exists because attention layers are inherently permutation equivariant. Coupled with the ability of token-wise feed-forward layers to map different values in $q(L)$ to arbitrary output values, we can prove universal approximation capability of Transformers.

### 3.2 Proof of the universal approximation theorem (Theorem 3)

Next, we outline the proof of the permutation equivariant theorem in greater detail. We refer the reader to Section C for the proof of the theorem with positional encodings, since it is a modification of the same argument. Even though neither theorem explicitly states the depth required for approximation, our proof techniques do characterize it, and we show that our construction is tight in the number of parameters. We defer the discussion of depth to Section 4.4.

Recall that we want to show that given a function $f \in \mathcal{F}_{PE}$, we can find a Transformer network $g \in \mathcal{T}^{2,1,4}$ such that $d_p(f, g) \leq \epsilon$. Without loss of generality, we can assume that the compact support of $f$ is contained in $[0,1]^{d \times n}$. We achieve our desired objective in three key steps:

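Step 1. Approximate $\mathcal{F}_{PE}$ with piece-wise constant functions. We first use (a variant of) the classical result that any continuous function can be approximated up to arbitrary accuracy by piece-wise constant functions. For $\delta > 0$, we define the following class of piece-wise constant functions.

$$\overline{\mathcal{F}}_{PE}(\delta) := \Big\{ f : X \mapsto \sum_{L \in \mathbb{G}_\delta} A_L \mathbb{1}\{X \in \mathbb{S}_L\} \;\Big|\; f \text{ is permutation equivariant}, \; A_L \in \mathbb{R}^{d \times n} \Big\},$$

where $\mathbb{G}_\delta := \{0, \delta, \dots, 1 - \delta\}^{d \times n}$ and, for a grid point $L \in \mathbb{G}_\delta$, $\mathbb{S}_L := \prod_{j=1}^{d} \prod_{k=1}^{n} [L_{j,k}, L_{j,k} + \delta) \subset [0,1]^{d \times n}$ denotes the associated cube of width $\delta$. Let $\overline{f} \in \overline{\mathcal{F}}_{PE}(\delta)$ be such that $d_p(f, \overline{f}) \leq \epsilon/3$.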

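Step 2. Approximate $\overline{\mathcal{F}}_{PE}(\delta)$ with modified Transformers. We then consider a slightly modified architecture for Transformer networks, where the softmax operator $\sigma[\cdot]$ and $\mathrm{ReLU}(\cdot)$ are replaced by the hardmax operator $\sigma_H[\cdot]$ and an activation function $\phi \in \Phi$, respectively. Here, the set of allowed activations $\Phi$ consists of all piece-wise linear functions with at most three pieces, where at least one piece is constant. Let $\overline{\mathcal{T}}^{h,m,r}$ denote the function class corresponding to the sequence-to-sequence functions defined by the modified Transformer networks. The following result establishes that the modified Transformer networks in $\overline{\mathcal{T}}^{2,1,1}$ can closely approximate functions in $\overline{\mathcal{F}}_{PE}(\delta)$.

###### Proposition 3.2.

For each $\overline{f} \in \overline{\mathcal{F}}_{PE}(\delta)$ and $1 \leq p < \infty$, there exists $\overline{g} \in \overline{\mathcal{T}}^{2,1,1}$ such that $d_p(\overline{f}, \overline{g}) = O(\delta^{d/p})$.

Step 3. Approximate modified Transformers with (original) Transformers. Finally, we show that $\overline{g} \in \overline{\mathcal{T}}^{2,1,1}$ can be approximated by $\mathcal{T}^{2,1,4}$. Let $g \in \mathcal{T}^{2,1,4}$ be such that $d_p(\overline{g}, g) \leq \epsilon/3$.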

Theorem 3 now follows from these three steps, because we have

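$$d_p(f, g) \leq d_p(f, \overline{f}) + d_p(\overline{f}, \overline{g}) + d_p(\overline{g}, g) \leq 2\epsilon/3 + O(\delta^{d/p}).$$

Choosing $\delta$ small enough ensures that $d_p(f, g) \leq \epsilon$. ∎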
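To make "small enough" concrete, suppose the $O(\delta^{d/p})$ term is bounded by $C\,\delta^{d/p}$ for some constant $C > 0$. The constant $C$ is introduced here only for illustration and is not named in the paper. Under that assumption, the last step can be worked out as:

```latex
d_p(f, g) \;\le\; \frac{2\epsilon}{3} + C\,\delta^{d/p} \;\le\; \epsilon
\qquad \text{whenever} \qquad
C\,\delta^{d/p} \le \frac{\epsilon}{3}
\;\Longleftrightarrow\;
\delta \le \Big(\frac{\epsilon}{3C}\Big)^{p/d}.
```

So the admissible cube width shrinks polynomially in $\epsilon$, with exponent $p/d$.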

We refer the reader to Sections B.1 and B.2 in the supplementary material for the formal statements and proofs of Steps 1 and 3, respectively. As for Step 2, which is the most critical step in establishing the universal approximation property of Transformers, we provide a sketch of the proof of Proposition 3.2 in the next section, and refer the reader to Section B.3 for the complete proof.

## 4 Proof sketch of Proposition 3.2: different roles of two layers

As mentioned earlier, the heavy parameter sharing in Transformers makes the goal of universally approximating sequence-to-sequence functions seemingly difficult. Both the self-attention and the feed-forward layer weights inside a Transformer block are fixed across tokens. In this section, we show that Transformers are able to overcome this architectural constraint, and compute contextual mappings of the entire input sequence just based on the pair-wise interactions. The token-wise feedforward layers then transform these contextual mappings to the desired output sequence.

We highlight these inner workings of Transformers en route to proving Proposition 3.2. We want to show that given a piece-wise constant function $\overline{f} \in \overline{\mathcal{F}}_{PE}(\delta)$, there exists a modified Transformer network $\overline{g} \in \overline{\mathcal{T}}^{2,1,1}$ that closely approximates $\overline{f}$. We achieve this goal by establishing the following three claims, which correspond to Lemmas 4.1, 4.2, and 4.3.

1. Given an input $X \in \mathbb{R}^{d \times n}$, a series of feed-forward layers in the modified Transformer network can quantize $X$ to an element $L$ on the extended grid $\mathbb{G}_\delta^+ := \{-\delta^{-nd}, 0, \delta, \dots, 1 - \delta\}^{d \times n}$.

2. Next, a series of self-attention layers in the modified Transformer network can take the input $L$ and implement a contextual mapping $q$ such that, for $L$ and $L'$ that are not permutations of each other, all the elements in $q(L)$ and $q(L')$ are distinct.

3. Finally, a series of feed-forward layers in the modified Transformer network can map elements of the contextual embedding $q(L)$ to the desired output value of $\overline{f} \in \overline{\mathcal{F}}_{PE}$ at the input $X$.

Before discussing these three claims in detail, we note that even though a Transformer network stacks self-attention and feed-forward layers in an alternate manner, the skip connections enable these networks to employ a composition of multiple self-attention or feed-forward layers. Furthermore, as alluded earlier, these three steps clearly highlight the different roles that self-attention and feed-forward layers play in realizing the ability to universally approximate sequence-to-sequence functions: 1) self-attention layers compute precise contextual maps; and 2) feed-forward layers then assign the results of these contextual maps to the desired output values.

### 4.1 Quantization by feed-forward layers

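Since our objective in Proposition 3.2 is to approximate the function $\overline{f}$, which takes a constant value on the cubes $\mathbb{S}_L$'s, the (modified) Transformer network approximating $\overline{f}$ first quantizes the input $X$ according to these cubes. In particular, we want each input $X \in \mathbb{S}_L$ to be mapped to the point $L$. The following result shows that a modified Transformer network can indeed implement this quantization map with a composition of multiple feed-forward layers.

###### Lemma 4.1.

Consider a scalar quantization map $g_q^{\mathrm{ent}} : \mathbb{R} \to \{-\delta^{-nd}, 0, \delta, \dots, 1 - \delta\}$:

$$g_q^{\mathrm{ent}}(t) = \begin{cases} k\delta & \text{if } k\delta \leq t < (k+1)\delta, \;\; k = 0, \dots, 1/\delta - 1, \\ -\delta^{-nd} & \text{otherwise.} \end{cases}$$

There exists a function $g_q : \mathbb{R}^{d \times n} \to \mathbb{G}_\delta^+$ composed of $d/\delta + d$ token-wise feed-forward layers with $r = 1$ and activations in $\Phi$, which applies the scalar quantization $g_q^{\mathrm{ent}}$ to each entry of its input.

As desired, the function $g_q$ maps any $X \in \mathbb{S}_L$ to $L$. Furthermore, if any element of $X$ is not in $[0,1]$, the element is mapped to $-\delta^{-nd}$, indicating that $X$ is outside the compact support of $\overline{f} \in \overline{\mathcal{F}}_{PE}(\delta)$.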
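The quantization map itself (not the feed-forward construction that realises it) is simple to write down. The sketch below assumes that $1/\delta$ is an integer and that out-of-range entries are sent to $-\delta^{-nd}$, the extra value of the extended grid in Section 4; the function names are made up for this example.

```python
import math


def quantize_entry(t, delta, n, d):
    """Scalar quantization: round t down to a multiple of delta when t is in [0, 1);
    otherwise return the out-of-range marker -delta**(-n*d)."""
    if 0 <= t < 1:
        return math.floor(t / delta) * delta
    return -(delta ** (-n * d))


def quantize(x, delta):
    """Apply the scalar map entry-wise to a d x n input given as a list of rows."""
    d = len(x)
    n = len(x[0])
    return [[quantize_entry(t, delta, n, d) for t in row] for row in x]


x = [[0.3, 0.99, 1.2],
     [0.0, 0.5, -0.1]]
grid_point = quantize(x, delta=0.25)
```

With `delta = 0.25`, `d = 2` and `n = 3`, in-range entries land on the grid `{0, 0.25, 0.5, 0.75}`, and the two out-of-range entries are mapped to `-0.25 ** -6`, which is far below every grid value.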

### 4.2 Contextual mapping by self-attention layers

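In this subsection, we show that the (modified) Transformer network can compute contextual mappings (cf. Definition 3.1) from the output $L = g_q(X)$ of the map $g_q$ (cf. Section 4.1) by using a composition of self-attention layers. The following lemma, sketched earlier in Section 3.1, shows that the (modified) Transformer networks can implement a permutation equivariant contextual mapping over almost all elements of $\mathbb{G}_\delta$, while mapping the rest of the elements in $\mathbb{G}_\delta^+$ to a disjoint set.

###### Lemma 4.2.

Consider the following subset of $\mathbb{G}_\delta = \{0, \delta, \dots, 1 - \delta\}^{d \times n}$:

$$\tilde{\mathbb{G}}_\delta := \{ L \in \mathbb{G}_\delta \mid L_{:,i} \neq L_{:,j} \text{ for all } i \neq j \}.$$

Assume that $n \geq 2$ and $\delta^{-1} \geq 2$. Then, there exist a function $g_c : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ composed of $\delta^{-d} + 1$ self-attention layers ($h = 2$, $m = 1$) that employ the $\sigma_H$ operator, a vector $u \in \mathbb{R}^d$, and constants $t_l, t_r \in \mathbb{R}$ ($0 < t_l < t_r$), such that $q(L) := u^T g_c(L)$ satisfies the following properties:

1. For any $L \in \tilde{\mathbb{G}}_\delta$, the entries of $q(L)$ are all distinct.

2. For any $L, L' \in \tilde{\mathbb{G}}_\delta$ such that $L$ is not a permutation of $L'$, all entries of $q(L)$, $q(L')$ are distinct.

3. For any $L \in \tilde{\mathbb{G}}_\delta$, all the entries of $q(L)$ are in $[t_l, t_r]$.

4. For any $L \in \mathbb{G}_\delta^+ \setminus \tilde{\mathbb{G}}_\delta$, all the entries of $q(L)$ are outside $[t_l, t_r]$.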

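At this point, a few remarks about the result in Lemma 4.2 are in order. First, since the Transformer networks are bound to implement permutation equivariant maps, we require Property 4.2.2 to hold only for pairs of sequences that cannot be mapped to each other via a permutation of columns. Furthermore, the self-attention layers implement the desirable contextual map only for $\tilde{\mathbb{G}}_\delta \subseteq \mathbb{G}_\delta$, where all columns of $L$ are distinct. Note that for small $\delta$, $\mathbb{G}_\delta \setminus \tilde{\mathbb{G}}_\delta$ constitutes a negligible fraction of $\mathbb{G}_\delta$ because $|\mathbb{G}_\delta \setminus \tilde{\mathbb{G}}_\delta| = O(\delta^d |\mathbb{G}_\delta|)$. The function $g_c$ in Lemma 4.2 maps the elements of $\mathbb{G}_\delta^+ \setminus \tilde{\mathbb{G}}_\delta$ outside $[t_l, t_r]$, the interval where the outputs of the contextual mapping for $\tilde{\mathbb{G}}_\delta$ reside.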

#### 4.2.1 Proof sketch of Lemma 4.2

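Since Lemma 4.2 is one of the major technical contributions of this paper, we provide a short sketch of its proof. The complete proof is presented in Section B.5. For simplicity, we consider the case $d = 1$, so the input $L \in \mathbb{G}_\delta$ is a row vector of length $n$.

The key idea of the proof is that, using two attention heads of size $1$, one can implement a self-attention layer that shifts up input entries that are in a specific interval, while leaving all other entries intact. We call this the selective shift operation. Since the entries in $L$ are quantized, we apply the selective shift operation to $0, \delta, \dots, 1 - \delta$ using $1/\delta$ attention layers. Interestingly, the value of the largest output entry after these operations is unique for each $L \in \tilde{\mathbb{G}}_\delta$ up to permutations. Using the largest entry, one can add one last layer that shifts up the entire matrix and outputs $q(L)$ that satisfies Properties 4.2.1 and 4.2.2 of the lemma.

More concretely, the following function $\Psi : \mathbb{R}^{1 \times n} \to \mathbb{R}^{1 \times n}$, parametrized by $b, b' \in \mathbb{R}$ satisfying $b < b'$, can be implemented with two attention heads of size $1$ with the hardmax ($\sigma_H$) operator:

$$\Psi(Z; b, b')_{1,j} = \begin{cases} \max_k Z_{1,k} - \min_k Z_{1,k} & \text{if } b < Z_{1,j} < b', \\ 0 & \text{if } Z_{1,j} < b \text{ or } Z_{1,j} > b'. \end{cases}$$

If we define an attention layer of the form $Z \mapsto Z + \Psi(Z; b, b')$, then any entry $Z_{1,j}$ in $(b, b')$ is shifted up by $\max_k Z_{1,k} - \min_k Z_{1,k}$, while all the other entries stay untouched. We can choose $b$ and $b'$ to selectively shift certain entries, hence the name selective shift operation.

We stack $1/\delta$ self-attention layers, with attention parts $\delta^{-1} \Psi(\cdot\,; l - \delta/2, l + \delta/2)$ for each $l \in \{0, \delta, \dots, 1 - \delta\}$, in increasing order of $l$. With these layers, we can apply the selective shift operations to input entries of values $0, \delta, \dots, 1 - \delta$. To see how the shift operations modify the input, now consider $n = 2$ for simplicity, and let $L = [l_1 \;\, l_2] \in \tilde{\mathbb{G}}_\delta$. Without loss of generality, we can assume $l_1 < l_2$. The selective shift operation is applied to $l_1$ first, shifting it by $\delta^{-1}(\max(L) - \min(L)) = \delta^{-1}(l_2 - l_1)$, resulting in $\tilde{l}_1 = l_1 + \delta^{-1}(l_2 - l_1) > l_2$. After that, the operation on $l_2$ shifts it up by $\delta^{-1}(\tilde{l}_1 - l_2)$. Thus, the first $1/\delta$ layers map $L = [l_1 \;\, l_2]$ ($l_1 < l_2$) to

$$\tilde{L} = [\tilde{l}_1 \;\, \tilde{l}_2] := \big[\, l_1 + \delta^{-1}(l_2 - l_1) \quad l_2 + (\delta^{-2} - \delta^{-1})(l_2 - l_1) \,\big].$$

We can show that the map from $[l_1 \;\, l_2] \in \{L \in \tilde{\mathbb{G}}_\delta \mid l_1 < l_2\}$ to $\tilde{l}_2$ is one-to-one, and that $0 < \tilde{l}_1 < \tilde{l}_2 < \delta^{-2}$. We then add one last layer that shifts all positive entries of $\tilde{L}$ by $\delta^{-3} \max_k \tilde{L}_{1,k} = \delta^{-3} \tilde{l}_2$, whose output we denote by $q(L) = [\, \tilde{l}_1 + \delta^{-3} \tilde{l}_2 \quad \tilde{l}_2 + \delta^{-3} \tilde{l}_2 \,]$. All entries of $q(L)$ are in $[\delta^{-3} \tilde{l}_2, \delta^{-3} \tilde{l}_2 + \delta^{-2})$, and this interval is disjoint for different $L$'s because $L \mapsto \tilde{l}_2$ is one-to-one. Thus, $q(L)$ satisfies Properties 4.2.1 and 4.2.2 of the lemma. The remaining details are in Section B.5.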
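The $d = 1$, $n = 2$ case of this sketch can be simulated directly. The code below applies the function $\Psi$ as a plain arithmetic rule rather than building it from hardmax attention heads, and it uses exact rational arithmetic so that distinctness checks are not affected by rounding. The names `psi`, `selective_shifts` and `contextual_map`, and the choice $\delta = 1/4$, are assumptions of this example.

```python
from fractions import Fraction
from itertools import product


def psi(z, lo, hi):
    """Psi(Z; b, b'): (max - min) for entries strictly inside (lo, hi), else 0."""
    spread = max(z) - min(z)
    return [spread if lo < v < hi else Fraction(0) for v in z]


def selective_shifts(row, delta):
    """Apply Z -> Z + (1/delta) * Psi(Z; l - delta/2, l + delta/2)
    for l = 0, delta, ..., 1 - delta, in increasing order of l."""
    z = list(row)
    for k in range(int(1 / delta)):
        level = k * delta
        shift = psi(z, level - delta / 2, level + delta / 2)
        z = [v + s / delta for v, s in zip(z, shift)]
    return z


def contextual_map(row, delta):
    """Last layer for n = 2: shift every positive entry by delta**-3 * (largest entry)."""
    z = selective_shifts(row, delta)
    top = max(z)
    return [v + top / delta**3 if v > 0 else v for v in z]


delta = Fraction(1, 4)
grid = [k * delta for k in range(4)]
# All rows of length 2 with distinct entries (the d = 1 analogue of the restricted grid).
rows = [r for r in product(grid, repeat=2) if r[0] != r[1]]
q = {r: contextual_map(r, delta) for r in rows}
```

For instance, the row $[0 \;\, 1/4]$ is first moved to $[1 \;\, 13/4]$ and then to $[209 \;\, 845/4]$. Across the twelve rows, outputs of rows that are not permutations of each other never share an entry, which is the behaviour Properties 4.2.1 and 4.2.2 ask for.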

### 4.3 Function value mapping by feed-forward layers

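This brings us to the final step, which demonstrates the key utility of the feed-forward layers. After the contextual mapping by self-attention layers, each token captures the entire context available in the input sequence. The following result shows that token-wise application of a composition of feed-forward layers can map these tokens to the desired output values required by the function $\overline{f}$.

###### Lemma 4.3.

Let $g_c : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ be the function from Lemma 4.2. Then, there exists a function $g_v : \mathbb{R}^{d \times n} \to \mathbb{R}^{d \times n}$ composed of $O(n (1/\delta)^{dn} / n!)$ token-wise feed-forward layers ($r = 1$) with activations in $\Phi$ such that $g_v$ is defined by a token-wise function $g_v^{\mathrm{tkn}} : \mathbb{R}^d \to \mathbb{R}^d$ on each column,

$$g_v(Z) = \big[\, g_v^{\mathrm{tkn}}(Z_{:,1}) \;\cdots\; g_v^{\mathrm{tkn}}(Z_{:,n}) \,\big],$$

where for all $j \in \{1, \dots, n\}$,

$$g_v^{\mathrm{tkn}}\big(g_c(L)_{:,j}\big) = \begin{cases} (A_L)_{:,j} & \text{if } L \in \tilde{\mathbb{G}}_\delta, \\ 0_d & \text{if } L \in \mathbb{G}_\delta^+ \setminus \tilde{\mathbb{G}}_\delta. \end{cases}$$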

### 4.4 Tightness of constructions

We showed in this section that the permutation equivariant theorem requires $O(n (1/\delta)^{dn} / n!)$ Transformer blocks for approximation, where $\delta$ is the width of the cubes. Each Transformer block is of constant width, so it has $O(d)$ parameters; this means that the total number of parameters is $O(dn (1/\delta)^{dn} / n!)$. We note that this exponential dependence cannot be avoided in the worst case. If we assume continuity without any additional smoothness, quantizing the domain to cubes and approximating the function with constants requires memorizing $(\text{output dim}) \times (\text{num cubes}) / n!$ real numbers, where the factor of $1/n!$ is due to permutation equivariance. Thus, the construction is optimal in the order of parameters.

If we compare with the residual network result (Lin & Jegelka, 2018), we can consider “flattening” $X$ into a $dn$-dimensional vector and fitting the function. The proof technique in (Lin & Jegelka, 2018) requires $O((1/\delta)^{dn})$ layers, where each layer has $O(dn)$ parameters: the total parameter requirement is $O(dn (1/\delta)^{dn})$. This shows that Transformers can approximate permutation equivariant functions in a more efficient way than residual networks.
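For a feel of the sizes involved, the two orders of growth can be evaluated for small values. The sketch below takes the Transformer count to be of order $dn(1/\delta)^{dn}/n!$ and the flattened residual-network count to be of order $dn(1/\delta)^{dn}$, and it ignores the constants hidden in the $O(\cdot)$ notation, so the numbers are orders of magnitude rather than exact parameter counts. The function names are made up for this example.

```python
from math import factorial


def num_cubes(d, n, inv_delta):
    """Number of cubes of width delta in [0, 1]^(d x n), with inv_delta = 1/delta."""
    return inv_delta ** (d * n)


def transformer_param_order(d, n, inv_delta):
    """Order of the parameter count for the permutation equivariant construction."""
    return d * n * num_cubes(d, n, inv_delta) / factorial(n)


def resnet_param_order(d, n, inv_delta):
    """Order of the parameter count when X is flattened into a dn-vector."""
    return d * n * num_cubes(d, n, inv_delta)


d, n, inv_delta = 2, 3, 4
ratio = resnet_param_order(d, n, inv_delta) / transformer_param_order(d, n, inv_delta)
```

Under these assumptions the ratio between the two orders is exactly $n!$, the saving attributed to permutation equivariance.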

In Section C, our proof of the theorem with positional encodings shows that we require $O(n (1/\delta)^{dn})$ layers to approximate continuous (not permutation equivariant) sequence-to-sequence functions. As seen from the argument above, this construction is also optimal in the order of parameters.

## 5 Discussion and Experiments

As detailed in Section 4, the ability of the self-attention layers to compute contextual mappings plays a crucial role in the universal approximation property. Interestingly, our analysis shows that replacing the dot-product attention in Transformers with any other component capable of computing contextual mappings should preserve this universal approximation property. This leads naturally to questions about the alternative architectures that realize certain kinds of contextual mappings at different computational and memory costs. We explore and discuss some examples of such alternatives in this section. Our preliminary empirical study demonstrates their practical utility.

### 5.1 Bi-linear projection

Given token embeddings $X$ as input, the bi-linear projection layer computes the following update:

$$\mathrm{BProj}(X) = X + W_O \cdot X \cdot W_P, \tag{5}$$

where $W_O \in \mathbb{R}^{d \times d}$ and $W_P \in \mathbb{R}^{n \times n}$. The bi-linear projection layer (Gong et al., 2013) is motivated by the ability of random (Gaussian) matrices to map sparse differences to dense vectors (Ailon & Chazelle, 2009). If there are two input contexts $X_1$ and $X_2$ that differ in one token, their difference $X_1 - X_2$ is sparse; however, after random projection, the difference $(X_1 - X_2) W_P$ will be dense, and the numbers are distinct with high probability, implementing a form of “pair-wise contextual mapping,” although different from the contextual mapping in Definition 3.1. (This guarantee only holds for a finite set, which can be exponential in $n$, of fixed vectors in $\mathbb{R}^n$.)

This layer advantageously incurs a smaller number of matrix multiplications than dot-product attention. That said, the number of parameters in this layer depends on the sequence length, making it harder to reuse the model across tasks with different input sequence lengths. Moreover, the weights used to compute the contextual embeddings ($W_P$) are independent of the inputs ($X$), whereas in self-attention the weights $\sigma[(W_K^i X)^T W_Q^i X]$ depend on $X$. The first drawback can be addressed by replacing the linear projection with a depth-wise separable convolution layer, which is discussed in the next subsection.
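The sparse-to-dense effect of the projection can be seen in a few lines of NumPy. This is a minimal sketch: the shapes follow Eq. (5) with a $d \times d$ output matrix and an $n \times n$ projection, the Gaussian weights and the seed are arbitrary choices for illustration, and `bproj` is a name introduced here.

```python
import numpy as np


def bproj(x, w_o, w_p):
    """Eq. (5): residual update with a projection along the sequence axis."""
    return x + w_o @ x @ w_p


rng = np.random.default_rng(0)
d, n = 3, 6
w_o = rng.normal(size=(d, d))
w_p = rng.normal(size=(n, n))  # size depends on the sequence length n

x1 = rng.normal(size=(d, n))
x2 = x1.copy()
x2[:, 2] += 1.0  # the two contexts differ in a single token

sparse_diff = x1 - x2           # non-zero in one column only
dense_diff = sparse_diff @ w_p  # every column is now affected
```

Because `w_p` has shape `(n, n)`, the same weights cannot be applied to a sequence of a different length, which is the first drawback mentioned above.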

### 5.2 Depth-wise separable convolutions

A depth-wise convolution layer (Sifre & Mallat, 2014; Chollet, 2017; Kaiser et al., 2017) involves convolving each dimension of $X$ with a corresponding convolution filter of size $k$:

$$\mathrm{SepConv}(X) = X + W_O (X * W_C), \tag{6}$$

where $W_C \in \mathbb{R}^{d \times k}$ and $W_O \in \mathbb{R}^{d \times d}$. Unlike bi-linear projection, this layer can be used across tasks with different input sequence lengths, as the number of parameters is independent of the sequence length. While a single layer is unable to compute contextual mappings when the filter size is small, stacking multiple such layers can potentially provide a cheaper way to compute contextual mappings. In fact, based on depth-wise separable convolutions, Wu et al. (2019) proposed a light-weight dynamic convolution architecture that performs competitively with Transformers on machine translation.
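A minimal NumPy sketch of Eq. (6) follows. The padding scheme is an assumption of this example (zero padding via `mode="same"`, which keeps the sequence length when the filter is no longer than the sequence); the paper does not specify it here. `np.convolve` also flips the filter, i.e., it computes a true convolution rather than a cross-correlation. The name `sep_conv` is introduced for illustration.

```python
import numpy as np


def sep_conv(x, w_o, w_c):
    """Eq. (6): convolve row i of x (shape (d, n)) with its own filter w_c[i]
    (shape (d, k), k <= n), then mix the d dimensions with w_o (shape (d, d))."""
    d = x.shape[0]
    conv = np.stack([np.convolve(x[i], w_c[i], mode="same") for i in range(d)])
    return x + w_o @ conv


rng = np.random.default_rng(0)
d, k = 3, 3
w_o = rng.normal(size=(d, d))
w_c = rng.normal(size=(d, k))

# The same parameters apply to sequences of different lengths.
short_out = sep_conv(rng.normal(size=(d, 5)), w_o, w_c)
long_out = sep_conv(rng.normal(size=(d, 8)), w_o, w_c)
```

Each output column only sees a window of $k$ neighbouring tokens, which is why a single layer with a small filter cannot capture the whole context.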

### 5.3 Experiments

We now present our experiments with these other architectures, with the goal of understanding the extent to which computing contextual mappings can capture the performance of Transformers. As discussed earlier, the bi-linear projection and depth-wise separable convolution layers do not implement contextual mappings (cf. Definition 3.1), so we do not expect models based on either layer to match the performance of the more expensive Transformers. These models do not use input-dependent weights to compute attention, and hence have weaker representation power. Instead, our goal is to see if we can use these cheaper layers to replace (some of) the expensive self-attention layers.

We follow the experimental setting of Devlin et al. (2018) to train the Transformers, with masked language model pre-training followed by task-specific fine-tuning, and work with an architecture based on their model. We present our results on a question answering task (SQuAD) (Rajpurkar et al., 2016) and a sentence entailment task (MNLI) (Williams et al., 2018). In our first set of experiments we train models that employ bi-linear projection and depth-wise separable convolution layers instead of the self-attention layer in eq. (1). We notice that, as expected, these simpler models have weaker performance than models with self-attention layers. See Table 1 in Section D for a comparison of these models on MNLI.

Next, we swap a varying number of the first few self-attention layers of the model with depth-wise separable convolution layers, implemented with filter reuse across dimensions (Wu et al., 2019); we refer to Section D for a complete description of the setup. Fig. 1 illustrates the performance of these hybrid models. Interestingly, some models that use convolution layers for the first few layers and self-attention layers for the rest perform better than models with only self-attention layers. Note that replacing a self-attention layer with a convolution layer also reduces the computational cost and the number of parameters. One possible explanation is that the first few attention layers tend to attend broadly to the whole sequence (as empirically observed by Clark et al. (2019)), and the cheaper convolution layers can perform this job more efficiently. A detailed evaluation of such hybrid architectures would be interesting future research.

Our experiments also call for a deeper understanding of the exact nature of the embeddings computed by practical attention models. Since Transformers in practice have fixed depth, we believe that they might not be able to exactly implement contextual mappings as we defined in Definition 3.1. However, there is some preliminary empirical evidence that Transformers do implement some sort of “contextual mappings.” For example, Fig. 4 of Coenen et al. (2019) presents visualizations of embeddings of a single word in different contexts (sentences). They experimentally notice that Transformers, in addition to computing contextual mappings, also map a word into semantic clusters. Formalizing and evaluating this property of Transformers is an interesting direction for future work. We again note that Wu et al. (2019) have proposed an alternative way to compute such embeddings based on dynamic convolution layers. Evaluating the mappings computed by these models should shed more light on the workings of attention models and inspire efficient and better performing architectures.

## Appendix A Proof of Claim 2

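Suppose $XP$ was given as input, where $P$ is a permutation matrix. First note that

$$(W_K^i X P)^T (W_Q^i X P) = P^T (W_K^i X)^T (W_Q^i X) P.$$

After the softmax operation, we get

$$\sigma\left[P^T (W_K^i X)^T (W_Q^i X) P\right] = P^T \sigma\left[(W_K^i X)^T (W_Q^i X)\right] P.$$

Then,

$$\mathrm{Attn}(XP) = XP + \sum_{i=1}^{h} W_O^i (W_V^i X P) \cdot P^T \sigma\left[(W_K^i X)^T (W_Q^i X)\right] P = \mathrm{Attn}(X) P,$$

where we used $P P^T = I$. Permutation equivariance of the token-wise feed-forward layer can be shown similarly:

$$\begin{aligned}
\mathrm{FF}(XP) &= \mathrm{Attn}(X) P + W_2 \cdot \mathrm{ReLU}\left(W_1 \cdot \mathrm{Attn}(X) P + b_1 \mathbf{1}_n^T P\right) + b_2 \mathbf{1}_n^T P \\
&= \mathrm{Attn}(X) P + W_2 \cdot \mathrm{ReLU}\left(W_1 \cdot \mathrm{Attn}(X) + b_1 \mathbf{1}_n^T\right) P + b_2 \mathbf{1}_n^T P = \mathrm{FF}(X) P,
\end{aligned}$$

where $\mathrm{ReLU}(XP) = \mathrm{ReLU}(X) P$ was used (ReLU acts entry-wise, so it commutes with a permutation of columns), together with $\mathbf{1}_n^T P = \mathbf{1}_n^T$. This analysis shows that the function class represented by Transformer blocks is restricted to permutation equivariant functions.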
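The identities above can also be checked numerically. The following is a small sketch, not the paper's code: the helper names `softmax_cols`, `attn`, and `ff`, the column-wise softmax convention, and all shapes (tokens as columns of a `(d, n)` matrix, head size `m`, hidden width `r`) are assumptions chosen to mirror the formulas in this appendix.

```python
import numpy as np

def softmax_cols(A):
    """Column-wise softmax (assumed convention for sigma)."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def attn(X, heads):
    """X + sum_i W_O^i (W_V^i X) sigma[(W_K^i X)^T (W_Q^i X)]."""
    out = X.copy()
    for W_O, W_V, W_K, W_Q in heads:
        out = out + W_O @ (W_V @ X) @ softmax_cols((W_K @ X).T @ (W_Q @ X))
    return out

def ff(Z, W1, b1, W2, b2):
    """Token-wise feed-forward layer applied to the attention output Z."""
    ones = np.ones((1, Z.shape[1]))
    return Z + W2 @ np.maximum(W1 @ Z + b1 @ ones, 0.0) + b2 @ ones

rng = np.random.default_rng(0)
d, n, m, r, h = 4, 6, 2, 8, 2
X = rng.normal(size=(d, n))
heads = [(rng.normal(size=(d, m)), rng.normal(size=(m, d)),
          rng.normal(size=(m, d)), rng.normal(size=(m, d))) for _ in range(h)]
W1, b1 = rng.normal(size=(r, d)), rng.normal(size=(r, 1))
W2, b2 = rng.normal(size=(d, r)), rng.normal(size=(d, 1))

# Random permutation matrix P: right-multiplication reorders the tokens (columns).
P = np.eye(n)[:, rng.permutation(n)]

attn_equivariant = np.allclose(attn(X @ P, heads), attn(X, heads) @ P)
block_equivariant = np.allclose(
    ff(attn(X @ P, heads), W1, b1, W2, b2),
    ff(attn(X, heads), W1, b1, W2, b2) @ P,
)
```

Both flags should evaluate to `True` for any choice of weights, since the argument never uses the particular values of the parameters.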

## Appendix B Proof details of Theorem 3

We first define some additional notation. For where , let and . For where is an integer multiple of , we write .

### B.1 Approximating $\mathcal{F}_{PE}$ with $\overline{\mathcal{F}}_{PE}(\delta)$

For any given and , one can find a such that which satisfies .

###### Proof.

Since $f$ is a continuous function with compact support, it is uniformly continuous. Since continuity is defined using the entry-wise $\ell^p$ norm, and the entry-wise $\ell^p$ norm is equivalent to the entry-wise $\ell^\infty$ norm when the number of entries is finite, uniform continuity implies that

$$\forall \epsilon > 0,\ \exists \delta > 0 \text{ such that } \forall X, Y,\ \|X - Y\|_\infty < \delta \implies \|f(X) - f(Y)\|_p < \epsilon.$$

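This means that given any target accuracy, we have such a $\delta$. Using this $\delta$, we can create a grid and corresponding cubes, as described in the main text. For each cube, we define $C_L$ to be its center point. Then, we can define a piece-wise constant approximation $\bar{f}$ that takes the value $f(C_L)$ on the cube with center $C_L$. Note that, for any $X$ in that cube, we have $\|X - C_L\|_\infty < \delta$, so by uniform continuity, $\|f(X) - \bar{f}(X)\|_p = \|f(X) - f(C_L)\|_p$ is below the target accuracy. This proves the claimed approximation bound.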

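As for permutation equivariance, since $f$ is permutation equivariant, we have $f(XP) = f(X)P$ for any permutation matrix $P$. For any $X$ in the cube with center $C_L$, the permuted input $XP$ lies in the cube with center $C_{LP} = C_L P$, so

$$\bar{f}(XP) = f(C_{LP}) = f(C_L P) = f(C_L) P = \bar{f}(X) P.$$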

Thus, the approximation is also permutation equivariant. This proves the lemma. ∎

### B.2 Approximating $\overline{\mathcal{T}}^{2,1,1}$ with $\mathcal{T}^{2,1,4}$

For each and , such that .

###### Proof.

Recall that $\mathcal{T}^{h,m,r}$ refers to the class of functions representable with a composition of Transformer blocks with $h$ heads of size $m$ in self-attention layers and $r$ hidden nodes in feed-forward layers. The same notation holds for the modified Transformers $\overline{\mathcal{T}}^{h,m,r}$.

Note that the softmax operator on a matrix $A$ can be made arbitrarily close to hardmax by scaling up $A$. That is,

$$\sigma[\lambda A] \to \sigma_H[A] \quad \text{as } \lambda \to \infty.$$

This means that by scaling up the parameters inside $\sigma$, we can approximate $\sigma_H$ arbitrarily closely. Thus, the modified self-attention layers can be approximated with the original self-attention layers of the same number of heads $h$ and head size $m$.
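The limit can be illustrated numerically. This is a sketch under assumptions: `softmax_cols` and `hardmax_cols` are hypothetical helper names, both operate column-wise, and the hardmax here splits mass equally among tied maxima, which is one possible convention.

```python
import numpy as np

def softmax_cols(A):
    """Numerically stable column-wise softmax."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

def hardmax_cols(A):
    """Column-wise hardmax: all mass on the maximal entries of each column.

    Ties share the mass equally (an assumed convention for this sketch).
    """
    is_max = (A == A.max(axis=0, keepdims=True)).astype(float)
    return is_max / is_max.sum(axis=0, keepdims=True)

A = np.array([[1.0, 0.2],
              [0.5, 0.9],
              [-1.0, 0.1]])

# Largest entry-wise gap between softmax(lambda * A) and hardmax(A) as lambda grows.
scales = [1.0, 10.0, 100.0, 1000.0]
errors = [np.abs(softmax_cols(lam * A) - hardmax_cols(A)).max() for lam in scales]
```

The entries of `errors` shrink as the scale grows, which is the sense in which scaling the parameters inside the softmax recovers the hardmax.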

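Also, any arbitrary (possibly discontinuous) piecewise linear function $\phi$ of the kind used in the modified feed-forward layers can be approximated arbitrarily closely by four ReLU's. Note that $\phi$ has at most three pieces, and at least one of the pieces is constant. For example, consider the following function $\phi$:

$$\phi(t) = \begin{cases} b_1 & \text{if } t < c_1, \\ a_2 t + b_2 & \text{if } c_1 \le t < c_2, \\ a_3 t + b_3 & \text{if } c_2 \le t. \end{cases}$$

This function can be approximated by four ReLU's, as claimed by the lemma:

$$\begin{aligned}
\tilde{\phi}(t) ={}& b_1 + \frac{a_2 c_1 + b_2 - b_1}{\epsilon}\,\mathrm{ReLU}(t - c_1 + \epsilon) + \left(a_2 - \frac{a_2 c_1 + b_2 - b_1}{\epsilon}\right)\mathrm{ReLU}(t - c_1) \\
&+ \left(\frac{a_3 c_2 + b_3 - a_2 (c_2 - \epsilon) - b_2}{\epsilon} - a_2\right)\mathrm{ReLU}(t - c_2 + \epsilon) \\
&+ \left(a_3 - \frac{a_3 c_2 + b_3 - a_2 (c_2 - \epsilon) - b_2}{\epsilon}\right)\mathrm{ReLU}(t - c_2) \\
={}& \begin{cases}
b_1 & \text{if } t < c_1 - \epsilon, \\
\frac{a_2 c_1 + b_2 - b_1}{\epsilon}(t - c_1) + a_2 c_1 + b_2 & \text{if } c_1 - \epsilon \le t < c_1, \\
a_2 t + b_2 & \text{if } c_1 \le t < c_2 - \epsilon, \\
\frac{a_3 c_2 + b_3 - a_2 (c_2 - \epsilon) - b_2}{\epsilon}(t - c_2) + a_3 c_2 + b_3 & \text{if } c_2 - \epsilon \le t < c_2, \\
a_3 t + b_3 & \text{if } c_2 \le t.
\end{cases}
\end{aligned}$$

Also, as we make $\epsilon \to 0$, we can approximate $\phi$ as closely as possible using $\tilde{\phi}$. The cases where the second or third piece is constant can be shown similarly. This means that the modified feed-forward layers (whose activation is $\phi$) with a single hidden node can be approximated with the original feed-forward layers (ReLU) with four hidden nodes.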
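The four-ReLU construction can be sanity-checked numerically. In the sketch below, the function names `phi` and `phi_tilde` and the example parameter values are assumptions for illustration; the coefficients follow the expression for the approximation given in this proof, with a constant first piece.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def phi(t, b1, a2, b2, a3, b3, c1, c2):
    """Three-piece (possibly discontinuous) target with a constant first piece."""
    t = np.asarray(t, dtype=float)
    return np.where(t < c1, b1, np.where(t < c2, a2 * t + b2, a3 * t + b3))

def phi_tilde(t, b1, a2, b2, a3, b3, c1, c2, eps):
    """Four-ReLU approximation; the jumps are replaced by ramps of width eps."""
    t = np.asarray(t, dtype=float)
    s1 = (a2 * c1 + b2 - b1) / eps                    # ramp slope just before c1
    s2 = (a3 * c2 + b3 - a2 * (c2 - eps) - b2) / eps  # ramp slope just before c2
    return (b1
            + s1 * relu(t - c1 + eps) + (a2 - s1) * relu(t - c1)
            + (s2 - a2) * relu(t - c2 + eps) + (a3 - s2) * relu(t - c2))

# Example parameters (arbitrary values chosen for this sketch).
params = dict(b1=1.0, a2=2.0, b2=-1.0, a3=-0.5, b3=4.0, c1=0.0, c2=1.0)
eps = 0.01
t = np.random.default_rng(0).uniform(-2.0, 3.0, size=2000)

# Points outside the two ramps [c1 - eps, c1) and [c2 - eps, c2).
outside = ((t < params["c1"] - eps)
           | ((t >= params["c1"]) & (t < params["c2"] - eps))
           | (t >= params["c2"]))
max_err_outside = np.abs(phi_tilde(t, eps=eps, **params) - phi(t, **params))[outside].max()
```

Outside the two ramps the two functions coincide up to floating-point error, and the ramps shrink as `eps` decreases, so the set where they differ can be made arbitrarily small.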

Thus, given any , there exists a function arbitrarily close to