Theoretical Limitations of Self-Attention in Neural Sequence Models

06/16/2019 ∙ by Michael Hahn, et al. ∙ Stanford University

Transformers are emerging as the new workhorse of NLP, showing great success across tasks. Unlike LSTMs, transformers process input sequences entirely through self-attention. Previous work has suggested that the computational capabilities of self-attention to process hierarchical structures are limited. In this work, we mathematically investigate the computational power of self-attention to model formal languages. Across both soft and hard attention, we show strong theoretical limitations of the computational abilities of self-attention, finding that it cannot model periodic finite-state languages, nor hierarchical structure, unless the number of layers or heads increases with input length. Our results precisely describe theoretical limitations of the techniques underlying recent advances in NLP.




1 Related Work

Prior Work on Self-Attention

Transformers were proposed by Vaswani et al. (2017); previous related work using self-attention includes Cheng et al. (2016), Parikh et al. (2016), Paulus et al. (2017), and Lin et al. (2017). It has been a recurrent suggestion in the literature that transformers, relying entirely on self-attention, are restricted computationally, as they cannot process their input sequentially. Dehghani et al. (2018) suggested that transformers cannot compute functions that require sequential processing of input, without providing further details or proofs. Similarly, Shen et al. (2018a), Chen et al. (2018), and Hao et al. (2019) have introduced extensions of transformers with recurrence, citing similar intuitions about limitations of transformers. Our results provide an explicit formalization of these limitations.

Other studies have experimentally tested the abilities of transformers to learn structures. Most related to our work, Tran et al. (2018) compared the ability of transformers and LSTMs to learn hierarchical structure, specifically English subject-verb agreement and the evaluation of logical formulas. Based on their experimental results, they suggested that LSTMs are better at learning hierarchical structure.

Yang et al. (2019) experimentally investigated the power of self-attention to extract word order information, finding differences between recurrent and self-attention models; however, these were modulated by the training objective.

Lin et al. (2019) and Tenney et al. (2019) show that BERT Devlin et al. (2018) encodes syntactic information.

Theoretical study of transformers was initiated by Pérez et al. (2019), who theoretically studied the ability of Seq2Seq transformers to emulate the computation of Turing machines. While we consider incremental modeling of sequences, where the number of computation steps is bounded by the input length $n$, they study the setting in which the transformer computes an unbounded number of autoregressive decoding steps, not bounded in the input length $n$.

Investigating the Power of Sequence Modeling Architectures

The computational power of recurrent neural networks has been a focus of study. A particular focus has been on their ability to learn non-regular context-free languages, often thought to provide simple models of recursion and hierarchical structure as found in natural language.

A range of studies has experimentally examined the ability of recurrent networks to model counter languages such as $a^n b^n$ (Kalinke and Lehmann, 1998; Gers and Schmidhuber, 2001; Cartling, 2008; Weiss et al., 2018; Suzgun et al., 2019). Other work has experimentally studied the performance of recurrent architectures on learning to recognize well-bracketed strings, a similar but more challenging problem (Sennhauser and Berwick, 2018; Skachkova et al., 2018; Bernardy, 2018). Beyond modeling formal languages, another line of work has studied the ability of LSTMs to model hierarchical structure as occurring in realistic natural language data (Linzen et al., 2016; Gulordava et al., 2018).

Recently, Merrill (2019) theoretically studied several types of recurrent networks, showing that – in the finite precision setting – LSTMs recognize a subset of the counter languages, whereas GRUs and simple RNNs recognize regular languages.

A related, though different, strand of research has investigated the power of neural networks to model Turing machines. A classical result (Siegelmann and Sontag, 1995) states that, given unlimited computation time, recurrent networks can emulate the computation of Turing machines. Very recently, Pérez et al. (2019) have shown the same result for both (argmax-attention) Transformers and Neural GPUs. The crucial difference between these studies and studies of language recognition is that, in these studies, the networks are allowed to perform unbounded recurrent computations, arbitrarily longer than the input length.

2 Self-Attention

Here we define self-attention as used in Transformers, following Vaswani et al. (2017). We have an input $x = x_1 \dots x_n$, where the $x_i$ are input embeddings, and we assume that $x_n$ encodes an end-of-sequence symbol. We furthermore have a sequence $p_1, p_2, \dots$ of positional embeddings. These are independent of the input $x$, and can be computed through some predefined scheme, or could be learned for each position occurring in the training data (Vaswani et al., 2017). Input and positional embeddings are combined (e.g., via addition or concatenation) into vectors $y_i^{(0)} = f(x_i, p_i)$ ($i = 1, \dots, n$), which we will refer to as Layer 0.

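A transformer has a fixed number $L$ of layers; the activations $y_i^{(k)}$ at position $i$ of the $k$-th layer ($k = 1, \dots, L$) are defined as follows. Each layer has a set of $H$ attention heads; we first compute attention scores for the $h$-th head:

$$a_{i,j}^{(k,h)} = f^{att}_{k,h}\left(y_i^{(k-1)}, y_j^{(k-1)}\right)$$

where $f^{att}_{k,h}$ is a function combining the activations from the previous level into an attention score. In practice, this can be implemented e.g. using dot product or additive attention. The final activation of the head is computed by weighting according to attention weights $\hat{a}_{i,j}^{(k,h)}$:

$$b_{i,k,h} = \sum_{j=1}^{n} \hat{a}_{i,j}^{(k,h)} \, y_j^{(k-1)}$$

In the soft attention version, the weights are obtained by a softmax over positions: $\hat{a}_{i,\cdot}^{(k,h)} = \mathrm{softmax}\left(a_{i,\cdot}^{(k,h)}\right)$.

In the hard attention variant (Pérez et al., 2019), we take the actual maximum of attention values instead of softmax: $\hat{a}_{i,j}^{(k,h)} = \delta_{j,\, \arg\max_{j'} a_{i,j'}^{(k,h)}}$. [Footnote 1: This requires an arbitrary choice when there are multiple positions with maximal attention weight; we will assume that the one occurring first in the sequence is chosen. Our analysis would also work under other schemes of resolving ties, such as random selection.] The final activation at each position is then computed as

$$y_i^{(k)} = f^{act}\left(y_i^{(k-1)}, b_{i,k,1}, \dots, b_{i,k,H}\right)$$

where $f^{act}$ is implemented as a fully-connected feedforward network with a skip-connection (Vaswani et al., 2017).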
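As a minimal sketch of the head computation just described, the following pure-Python function computes the activation of one head at one position, under either soft or hard attention. The dot-product scoring function, the 0-indexed positions, and the names `dot` and `attention_head` are assumptions made for illustration; they are not taken from the paper.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attention_head(prev, i, hard=False):
    """Activation of one head at position i (0-indexed).

    prev: list of previous-layer activation vectors, one per position.
    The scoring function is a plain dot product (an illustrative assumption).
    """
    scores = [dot(prev[i], y_j) for y_j in prev]
    if hard:
        # Hard attention: all weight on the first position with maximal score.
        j_star = scores.index(max(scores))
        weights = [1.0 if j == j_star else 0.0 for j in range(len(prev))]
    else:
        # Soft attention: softmax over scores (shifted for numerical stability).
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
    dim = len(prev[0])
    # Weighted sum of the previous-layer activations.
    return [sum(w * y_j[d] for w, y_j in zip(weights, prev)) for d in range(dim)]

prev = [[1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
hard_b = attention_head(prev, 1, hard=True)
soft_b = attention_head(prev, 1, hard=False)
```

In this toy input, positions 1 and 2 tie for the maximal score, so the hard variant picks position 1 (the first one), while the soft variant spreads a small amount of weight onto position 0 as well.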

Hard and Soft Attention

There is a choice between soft attention and hard attention Shen et al. (2018b); Pérez et al. (2019). The one prior theoretical study of transformers Pérez et al. (2019) assumes hard attention. In practice, soft attention is easier to train with gradient descent methods; however, analysis studies suggest that attention often concentrates on one or a few positions Voita et al. (2019); Clark et al. (2019) and that the most important heads tend to be those that clearly focus on a few positions Voita et al. (2019), suggesting that attention often behaves like hard attention in practice. We will examine both hard (Section 4) and soft (Section 5) attention settings.

Formalizing Language Recognition

We consider the problem of language recognition, the task of classifying input strings as either belonging to or not belonging to a formal language. Following Weiss et al. (2018), we formalize this as the sequence-to-sequence task of transducing words to labels 1 (‘in the language’) and 0 (‘not in the language’). Following the construction of transformers in sequence-to-sequence tasks (Vaswani et al., 2017), we compute a softmax probability vector for this label from the last activation $y_n^{(L)}$, obtained after reading the end-of-sequence symbol.

3 Regular and Context-Free Languages

We will analyze the ability of transformers to recognize regular and context-free languages, using two prominent representatives (Parity and 2Dyck).


Parity is the set of bit strings such that the number of 1s is even. This is a very simple regular language, recognized by a finite-state automaton with two states. The regular languages form the lowest layer of the Chomsky hierarchy, and even simple RNNs can compute all regular languages. Within the regular languages, a particularly basic class is formed by the counter-free or star-free languages (McNaughton and Papert, 1971), which can be expressed by regular expressions using only union, complementation, and concatenation. In some sense, Parity is the simplest non-counter-free, or periodic, regular language. This means that, if transformers cannot compute Parity, they cannot recognize (almost) any regular language that is not counter-free. [Footnote 2: More precisely, inability to compute Parity entails that they cannot recognize any regular language whose syntactic morphism is not quasi-aperiodic (Barrington et al., 1992, p. 488).]
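For concreteness, the two-state automaton for Parity can be sketched in a few lines of Python. The function name `in_parity` and the string encoding of bit strings are illustrative choices, not part of the paper.

```python
def in_parity(bits):
    """Return True iff the bit string contains an even number of '1's."""
    state = 0  # 0 = even number of ones seen so far, 1 = odd
    for b in bits:
        if b not in "01":
            raise ValueError("expected a string over the alphabet {0, 1}")
        if b == "1":
            state ^= 1  # each 1 toggles the automaton state
    return state == 0
```

Note that flipping any single input bit toggles the final state, which is the sensitivity property the later arguments rely on.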

In the context of natural language, Parity naturally arises in the context of evaluating logical formulas: evaluating iterated negations is tantamount to counting whether the number of nested negations is even or odd. If we know that transformers cannot compute Parity, this entails that they also cannot evaluate logical formulas accurately.


2Dyck is the language of correctly bracketed words consisting of ‘(’, ‘[’ and ‘)’, ‘]’. This language is a very simple model of hierarchical structure. The Chomsky-Schützenberger theorem (Chomsky and Schützenberger, 1963) states that any context-free language arises from a variant of 2Dyck with multiple types of parentheses through intersection with a regular language and homomorphisms. Consequently, the ability of LSTMs to model languages such as 2Dyck has been an object of experimental study (Sennhauser and Berwick, 2018; Skachkova et al., 2018; Bernardy, 2018). Our theoretical results will show that transformers are strongly limited in their ability to model 2Dyck, including variants with fewer or more types of parentheses.
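As a reference point, membership in 2Dyck can be decided with a stack. The sketch below is a standard recognizer, shown only to make the language concrete; the names `PAIRS` and `in_2dyck` are illustrative.

```python
PAIRS = {")": "(", "]": "["}  # closing bracket -> matching opening bracket

def in_2dyck(word):
    """Return True iff word is a correctly bracketed string over ( ) [ ]."""
    stack = []
    for ch in word:
        if ch in "([":
            stack.append(ch)
        elif ch in PAIRS:
            # A closing bracket must match the most recent unmatched opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            raise ValueError("expected only the symbols ( ) [ ]")
    return not stack  # every opener must have been closed
```

The recognizer needs unbounded memory (the stack), which is what makes 2Dyck a minimal model of hierarchical structure.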

4 Results for Hard Attention

We will start our analysis with the study of hard attention (Pérez et al., 2019). We show that hard attention transformers cannot represent Parity or 2Dyck. To keep the results maximally general, our analysis will use combinatorial arguments and make no assumption about, e.g., activation functions and the norms of parameter matrices. In fact, we do not even assume that the internal position-wise representations $y_j^{(k)}$ in each layer are vector-valued, as opposed to, say, discrete structures.

Figure 1, panels (a) to (d): Iteratively reducing the layers of a transformer by fixing a few input symbols. (a) We fix a small number of input symbols, ‘attracting’ attention from the first layer to a few inputs. (b) After this step, each activation in the first layer only depends on a small number of input symbols. (c) We again fix a few input symbols in such a way as to ‘attract’ attention of layer-1 heads to some layer-0 activations. As a result, each layer-1 activation only depends on a small number of layer-0 activations. (d) After this step, each layer-1 activation only depends on a few inputs, and we can remove layer 1. In this example, one of the inputs ends up being ignored by the transformer after applying the restriction.

The basic idea (see Figure 1) behind the proof is that, through fixing a small fraction of the input symbols in a particular way, we can ‘capture the attention’ of the transformer in such a way that it ends up ignoring almost all remaining input symbols. This shows that the transformer could not have solved a problem such as Parity, where every single input bit matters. The idea of using such input restrictions has been successful in the theory of Boolean circuits (Furst et al., 1984; Yao, 1985; Hastad et al., 1994). In particular, Furst et al. (1984) famously used it to prove that polynomial-size bounded-depth Boolean circuits with $\wedge$, $\vee$, and $\neg$ gates cannot compute Parity. We describe a new method to prove existence of suitable restrictions appropriate to transformers, as the proof approaches from the Boolean circuit literature do not seem to carry over to networks with continuous real-valued activations.

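A restriction $\rho$ is a family of maps $\rho_n : \{1, \dots, n\} \to \{*, 0, 1\}$ for $n \in \mathbb{N}$. A restriction $\rho$ is applied to a transformer by restricting, when the input size is $n$, inputs $x_i$ to the value $\rho_n(i)$ if it is $0$ or $1$. The output value of the resulting transformer only depends on those inputs $x_i$ such that $\rho_n(i) = *$.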
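To make the notion of a restriction concrete, here is a small sketch for a single input length. It assumes a restriction is encoded as a string over the symbols `*`, `0`, `1` (one symbol per input position); this encoding and the names `apply_restriction` and `free_positions` are illustrative assumptions.

```python
STAR = "*"  # marks a position that the restriction leaves free

def apply_restriction(rho, bits):
    """Overwrite each input bit with rho's value wherever rho is '0' or '1'."""
    if len(rho) != len(bits):
        raise ValueError("restriction and input must have the same length")
    return "".join(b if r == STAR else r for r, b in zip(rho, bits))

def free_positions(rho):
    """Positions (0-indexed) on which the restricted function can still depend."""
    return [i for i, r in enumerate(rho) if r == STAR]

rho = "1*0**"
restricted_a = apply_restriction(rho, "00000")
restricted_b = apply_restriction(rho, "11111")
```

Whatever the original input, positions 0 and 2 are forced to `1` and `0` here, so any function of the restricted input can only depend on positions 1, 3, and 4.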

The following technical result formalizes the intuition described above:

Theorem 1.

Let any transformer be given, and let $C \in (0, 1)$. Then there is a restriction $\rho$ and an integer $c > 0$ such that

$$|\{i \le n : \rho_n(i) = *\}| \ge Cn$$

(for all sufficiently large $n$) and such that the function computed by the transformer on the restricted input depends only on $\le c$ inputs, independent of input length $n$.

We first show how this entails that transformers do not recognize the two formal languages:

Corollary 2.

Transformers with hard attention cannot model Parity or 2Dyck.


Proof. Fix a constant $C \in (0, 1)$ and apply Theorem 1. For Parity, after applying the restriction, the output of the transformer depends on at most $c$ inputs. An input of sufficiently large size $n$ thus has unrestricted inputs that do not influence the output. But flipping a single input bit changes membership in Parity, so the transformer’s output cannot match membership in Parity beyond chance as soon as $n$ is sufficiently large.

For 2Dyck, we show that hard attention transformers cannot even solve the simpler variant 1Dyck with a single bracket type (‘(’, ‘)’). We first restrict an initial segment of the input positions, a constant fraction of the input, to ‘(’. After applying the restriction provided by the theorem, the resulting restricted input will still be compatible with both well-bracketed and non-well-bracketed inputs, but the prediction will depend only on a bounded number of positions. This shows that the transformer could not recognize 1Dyck, and thus also not 2Dyck. ∎
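The Parity half of this argument can be illustrated with a toy check: any classifier that reads only a bounded set of positions gives the same answer on two inputs that differ in an unread position, although exactly one of them is in Parity. The classifier `bounded_model` below is a hypothetical stand-in, not a transformer; it and `fooling_pair` are names introduced only for this sketch.

```python
def parity(bits):
    """True iff the list of 0/1 integers contains an even number of ones."""
    return sum(bits) % 2 == 0

def bounded_model(bits, read_positions):
    """Hypothetical classifier that only looks at the positions in read_positions."""
    return sum(bits[i] for i in read_positions) % 2 == 0

def fooling_pair(n, read_positions):
    """Two length-n inputs that differ only at a position the model never reads."""
    ignored = next(i for i in range(n) if i not in read_positions)
    x = [0] * n
    y = list(x)
    y[ignored] = 1  # flip one unread bit
    return x, y

read = {0, 3, 7}
x, y = fooling_pair(20, read)
same_prediction = bounded_model(x, read) == bounded_model(y, read)
different_label = parity(x) != parity(y)
```

The same construction works for any function of the read positions, since the two inputs agree on everything the model sees.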


It may be instructive to compare to similar languages that can be modeled by hard-attention transformers. First, $1^*$ (over the alphabet $\{0, 1\}$) is the regular language of words that have only ones and no zeroes; its minimal automaton has two states, like Parity. A transformer can recognize this by having an attention head that attends to a position with zero input if it exists, and rejects if the head found such a position. Second, $a^n b^n$ is a very basic context-free language. It can be recognized using suitable positional embeddings by (1) having one head attend to the largest position, (2) using this information to attend to any $b$ at a position in the first half or any $a$ at a position in the second half. If such a symbol is found, the model rejects, else it accepts. A crucial difference between these languages and Parity / 2Dyck is that fixing a few inputs can easily force nonmembership, e.g. a single 0 for $1^*$, and an $a$ in the second half for $a^n b^n$. Therefore, such simple languages are immune to the depth reduction method, and indeed can be modeled perfectly with self-attention.
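The first of these constructions can be sketched directly. The code below mimics a single hard-attention head whose score is higher on zeros than on ones, with ties resolved in favor of the first position; the scoring values, the handling of the empty word, and the name `recognize_all_ones` are assumptions for illustration, and the end-of-sequence symbol is omitted for brevity.

```python
def recognize_all_ones(word):
    """Accept iff word (over '0' and '1') contains no '0', using one hard-attention head."""
    if not word:
        return True  # the empty word has only ones; there is nothing to attend to
    # Attention score: zeros attract attention, ones do not.
    scores = [1.0 if ch == "0" else 0.0 for ch in word]
    attended = scores.index(max(scores))  # hard attention, first maximal position
    # Reject exactly when the head landed on a zero.
    return word[attended] != "0"
```

A single zero anywhere captures the head, which is why fixing one input can force nonmembership for this language, in contrast to Parity.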

In general, the depth reduction method applies to languages that are sufficiently sensitive: If, for some $C \in (0, 1)$, fixing $Cn$ input symbols cannot force a word to be inside or outside of the language, then hard-attention transformers cannot recognize this language. Sensitivity of functions has been studied in computational complexity (Boppana, 1997; Gilmer et al., 2015; Gopalan et al., 2016; Rossman, 2018) and more recently linked to generalization in feedforward networks (De Palma et al., 2018). We intend to investigate these connections in future work.

Proof Idea of the Theorem

Our approach for proving Theorem 1 will be to iteratively remove the lowest layer of a given transformer by constructing input restrictions. This process is illustrated in Figure 1. After the first step, each of the heads in the first layer will only depend on a bounded number of input positions. In the second step, we apply the same argument to the heads in the second layer, so that each head in the second layer only depends on a bounded number of heads in the first layer. After this step, we can collapse the first layer into a collection of feedforward networks that transform a bounded number of input positions into an activation of the lowest layer. After this step, the first layer has been entirely removed. Iterating this argument, we remove all layers until the prediction output only depends on a bounded number of input positions, bounded independently of input length.

After the removal of a layer, the resulting structure is not a transformer any more, as each head in the lowest layer now depends on a combination of input positions. We introduce a technical definition to make this concept precise:

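Definition 3 ($c$-Transformer).

Let $c$ be a positive integer. A $c$-transformer of depth $L$ is one in which the layer-0 activations $y_j^{(0)}$ depend on the embeddings not just at one position $j$, but are a function of the embeddings at $\le c$ input positions:

$$y_j^{(0)} = f^{inp}_{n,j}\left((x_{i^{j,n}_1}, p_{i^{j,n}_1}), \dots, (x_{i^{j,n}_c}, p_{i^{j,n}_c})\right)$$

Here, the indices $i^{j,n}_s$ depend on the input length $n$.

Note that an ordinary transformer of depth $L$ is also a $1$-transformer of depth $L$.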

With this technical notion, we show that we can reduce layers, iteratively removing the lowest layer until no self-attention layer is left:

Lemma 4 (Depth Reduction Lemma).

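Given a $c$-transformer with $L$ layers, and some restriction $\rho$ such that

$$|\{i \le n : \rho_n(i) = *\}| \ge Cn$$

($C \in (0, 1]$) for all sufficiently large $n$. Choose any $C' < C$. Then there is a restriction $\rho'$ such that

$$|\{i \le n : \rho'_n(i) = *\}| \ge C'n$$

for all sufficiently large $n$, and such that the resulting function is computed by a $\left(c \cdot (2^c k H + 1)\right)$-transformer with $L - 1$ layers, for some integer $k$ (depending on $C'$), where $H \ge 1$ is the number of attention heads at each layer and position.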

Before proving this lemma, we note that it implies Theorem 1.

Proof of Theorem 1.

The output of the transformer is determined by the last activation $y_n^{(L)}$. Apply the Depth Reduction Lemma iteratively, choosing the constants $C'$ in the lemma appropriately, until only the zero-th layer remains. Then, after applying the resulting restriction, the final activation is now computed by $y_n^{(0)}$, which is determined by a bounded number of input bits. ∎

4.1 Proving the Depth Reduction Lemma

The rest of this section will be devoted to proving the Depth Reduction Lemma. We will do the first part of the argument for any integer $k$. In the second part, we will select a sufficiently large $k$.

Part 1: Preliminaries

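We construct the restrictions $\rho'_n$ separately for each $n$. For each layer-1 attention head $h$ at position $i$ and each position $j$, we compute the maximum possible attention value that can be achieved for this pair:

$$\max_{x_1 \dots x_n \,:\, x \text{ agrees with } \rho_n} f^{att}_{1,h}\left(y_i^{(0)}, y_j^{(0)}\right)$$

We order the positions $\{1, \dots, n\}$ downwards by this value, obtaining an ordered sequence of positions for each layer-1 attention head $h$ at a position $i$ (in the case of ties, we order by position).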

For each layer-1 attention head, we then select a sequence such that (1) for each , there is at least one input that only feeds into the element and no other , (2) is minimal, i.e. there is no subsequence with smaller that also satisfies (1). There is only one case in which we cannot find such a subsequence, which is if all together depend only on inputs, in which case this head already satisfies the condition we’re aiming for.

A layer-1 head -depends on some input if and appears as an input to some for . A layer-1 head -depends on an input if and only if that input appears as an input to some (). (This is since is minimal).

Two layer-1 head are -neighbors if some for one and for the other both -depend on some input .

We will construct input restrictions using probabilistic arguments over i.i.d. random restrictions. For this approach to succeed, we require a sufficient amount of independence between the activations of different heads in layer 1. We thus need to ensure that the number of neighbors of each head is bounded.

Fix some (to be chosen later).

Let $H$ be the number of attention heads in each position of layer 1. First, assume there is some input that has many $k$-depending layer-1 heads. Assume the number of such inputs is larger than for some . Then, the number of pairs of inputs and depending layer-1 heads would be more than for this . On the other hand, there are only many pairs of inputs and depending layer-1 heads, a contradiction. Thus, the number of such inputs is at most for all . We can therefore modify the restriction so that these inputs are set to some fixed value (it does not matter which one), and the restriction is unchanged for the others. After this manipulation, every input has at most many $k$-depending layer-1 heads, independent of $n$.

Furthermore, assume the number of inputs with depending layer-0 activations is . Then the number of pairs of inputs and layer-0 activations is . But there are at most such pairs, contradiction. So the number of inputs with depending layer-0 heads is . We can again restrict these inputs to some fixed value (again, it doesn’t matter which one).

After these preparations,


for all sufficiently large .

Part 2: Constructing Restrictions

After the previous part, we are in the setting where every input has many depending layer-1 heads, and consequently every layer-1 head has at most many -neighbors (for any ). Also, every input has depending layer-0 heads.

In order to prove the existence of suitable input restrictions, we apply the “probabilistic method”: we define a probability distribution over restrictions of inputs, and show that the probability mass assigned to restrictions of the type we require is strictly greater than zero, showing that such a restriction exists.

For each input length , define the distribution that independently assigns to each input position the symbol with probability (to be chosen later), and or with equal probability else. On those inputs where , we restrict this random restriction to agree with . For the -th layer-1 attention head (there are at most many such), define to be the event that, for the -th head, none of are assigned the value that produces the highest attention weight (each of these is assigned either the other value, or ). Define to be the event that more than many inputs are set to .

We first show that the probability of decays exponentially in . Let () be the event that the layer-0 activation is not assigned the value that produces the highest attention weight, for the given attention head . Note that . We have . Any can be statistically dependent on at most other events . Therefore, there is a set of independent events among these. Call these . Then , and thus


Furthermore, a Chernoff bound gives Mitzenmacher and Upfal (2005)


since the number of independent multinomial samples in the sampling of is at least . We want to show that we can find , such that the following holds:


where . Once we have shown this, then by the asymmetric Lovász Local Lemma Mitzenmacher and Upfal (2005), there is some input restriction that avoids all events .
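For reference, the asymmetric Lovász Local Lemma in its standard general form is stated below. The symbols $A_i$, $\Gamma(i)$, $x_i$, and $m$ are generic and chosen for this statement only; they are not the specific events and constants of the present proof.

```latex
\textbf{Asymmetric Lov\'asz Local Lemma.}
Let $A_1, \dots, A_m$ be events, and for each $i$ let
$\Gamma(i) \subseteq \{1, \dots, m\} \setminus \{i\}$ be such that $A_i$ is
mutually independent of the family $\{A_j : j \notin \Gamma(i) \cup \{i\}\}$.
If there exist real numbers $x_1, \dots, x_m \in [0, 1)$ with
\[
  \Pr[A_i] \;\le\; x_i \prod_{j \in \Gamma(i)} (1 - x_j)
  \qquad \text{for all } i,
\]
then
\[
  \Pr\Big[\, \bigcap_{i=1}^{m} \overline{A_i} \,\Big]
  \;\ge\; \prod_{i=1}^{m} (1 - x_i) \;>\; 0 .
\]
```

In the argument here, the role of the $A_i$ is played by the bad events to be avoided, and a positive probability of avoiding all of them yields the existence of a suitable restriction.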

Choose , . First, we need to satisfy


For the RHS,


Also, equals


where we choose . So, if we choose large enough (independently of ), the RHS can be made arbitrarily close to , in particular, greater than the LHS.

In order to also satisfy


make , large enough to satisfy the inequality (again, choosing independent of ). In conclusion, there exists, for each sufficiently-large , a restriction that avoids all events .

With such a , is at least


for all sufficiently large . Then choose small, small, and (such that ) in such a way to achieve .

We now note that, since every layer-1 head depends only on many inputs after applying , each layer-1 activation only depends on many inputs. We can thus collapse layers 0 and 1, converting layer-1 activations into layer-0 activations , and obtain a -transformer performing the same computation as before when is applied. This concludes the proof of the Depth Reduction Lemma.

5 Results for Soft Attention

In the previous section, we showed that transformers using hard attention are not able to recognize a range of core formal languages. In this section, we study soft attention. Here, we will use the smoothness of the operations used in transformers to show that any transformer, as inputs get longer, will not be able to robustly separate members of the languages from other inputs. The idea behind the proof is that the impact of any single input on the output of the transformer is small if the input is long:

Lemma 5.

If we exchange one input symbol, then the change in the resulting activations at the decoder layers is bounded as $O(1/n)$ with constants depending on the parameter matrices, where $n$ is the input length.

This contrasts with recurrent networks: Changing a single input can have nonnegligible impact on the final state even for very long input. E.g., an RNN recognizing Parity through a hidden state that encodes parity of the current prefix will flip its hidden state if a single input bit is flipped.
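The contrast can be seen numerically in a toy setting. The sketch below uses a single soft-attention readout whose logits are bounded (each logit is just the bit value, an assumption made for illustration) and measures how much flipping one bit changes the output for growing input lengths, alongside a parity-tracking recurrent state. The names `soft_attention_output`, `flip_effect`, and `rnn_parity_state` are hypothetical.

```python
import math

def soft_attention_output(bits):
    """Softmax-weighted average of the bits, with logit = bit value (bounded in [0, 1])."""
    exps = [math.exp(b) for b in bits]
    total = sum(exps)
    return sum(e * b for e, b in zip(exps, bits)) / total

def rnn_parity_state(bits):
    """Hidden state of a recurrent parity tracker: XOR of all bits read so far."""
    state = 0
    for b in bits:
        state ^= b
    return state

def flip_effect(n):
    """Absolute change in the soft-attention output when the first bit is flipped."""
    x = [i % 2 for i in range(n)]
    y = list(x)
    y[0] ^= 1
    return abs(soft_attention_output(x) - soft_attention_output(y))

effects = {n: flip_effect(n) for n in (10, 100, 1000)}

x = [i % 2 for i in range(1000)]
y = [1] + x[1:]  # same input with the first bit flipped
rnn_flips = rnn_parity_state(x) != rnn_parity_state(y)
```

In this toy setting the effect of the flip on the attention output shrinks roughly in proportion to $1/n$, while the recurrent parity state changes completely regardless of length. This is only an illustration of the lemma's statement for one bounded-logit readout, not a proof.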

This result entails that, as inputs become longer, soft attention transformers cannot robustly model many formal languages, and cannot even achieve good cross-entropies on a large class of prediction problems. To make this precise, we take a more quantitative angle and assign probability distributions over inputs for each input length $n$, and consider the behavior of cross-entropy as $n \to \infty$. For Parity, we simply consider the distribution over i.i.d. bitstrings of length $n$, and consider the task of predicting whether the next symbol can be the EndOfSequence symbol, which is true if and only if the prefix has an even number of ones. For 2Dyck, we take the uniform distribution over well-bracketed prefixes of length $n$, and ask the model to classify the set of valid next characters, which is a subset of the available symbols. [Footnote 3: Only an exponentially small subset of the strings over the bracket symbols are well-bracketed; thus, cross-entropy on the classification task is less meaningful for this language. Considering prediction of the next symbol sidesteps this issue.] We consider the problem of predicting the label from the input separately for each input length $n$, and consider cross-entropy as $n \to \infty$.

Theorem 6.

Let a soft attention transformer be given for one of the languages Parity and 2Dyck. As $n \to \infty$, cross-entropy converges to chance level.


Proof. For Parity, exchanging a single bit flips membership. Thus, for any member of the language, there is an equally likely non-member such that the output activations differ by only $O(1/n)$.

For 2Dyck, the probability that a well-bracketed prefix of length $n$ is exactly balanced vanishes as $n \to \infty$. In all other cases, the correct label includes one of the two closing parentheses, which depends on one single previous symbol, and both possibilities are equally likely if that symbol is unknown. ∎

5.1 Proof of the Lemma

Having established the main theorem, we proceed to proving the technical lemma:

Proof of Lemma.

Let us compare the activations at the decoder layer for two inputs that only differ in the input at the $i$-th position. Let $D$ be the norm of the difference of the input embeddings at this position.

We show by induction over $k$ that, for some constant (depending on the parameter matrices), the differences between the corresponding activations for the two inputs are bounded as:


Once we have shown this, we have found that the influence of any individual input on the final prediction is $O(1/n)$, with constants depending on the norms of parameter matrices and the number of layers.

For , ,444We are assuming that is addition or concatenation Vaswani et al. (2017); for operations such as an MLP, there would be an additional Lipschitz constant depending on parameters and activation functions. and for .

For , we first note that for activations , where is the Lipschitz constant of . As

is implemented as a ReLU MLP

Vaswani et al. (2017),

depends on the norms of the parameter matrices. Attention logits are bounded by

in the case of multiplicative/bilinear attention, and in the case of additive attention. Then any attention weight is upper bounded by .

Choose .

Recall , where . We first calculate

Then, is at most


If , this is bounded (as )


If , this is bounded above by

This proves the inductive step for . ∎

6 Discussion

We have shown that, even with infinite precision, transformers cannot robustly model non-counter-free regular languages, nor basic hierarchical structure.

How do our results compare to what is known about LSTMs? Recurrent networks such as SRNNs and LSTMs can perfectly emulate finite-state automata, and therefore can model any finite state language with optimal cross-entropy, as long as the state transition and symbol emission distributions are Markovian. In particular, Parity of i.i.d. bitstrings can be predicted with zero cross-entropy, independent of the input length.

Infinite-precision RNNs and LSTMs can model stacks Tabor (2000); Grüning (2006); Kirov and Frank (2012) and thus are theoretically capable of modeling 2Dyck and other deterministic context-free languages perfectly. This clear contrast between infinite-precision LSTMs and our findings for infinite-precision transformers may provide a theoretical explanation for the empirical finding that LSTMs seem to outperform transformers in modeling hierarchical structure (Tran et al., 2018).

Hierarchical structure, at least at the level of context-free languages, is widely thought to be essential to modeling the structure (Everaert et al., 2015) and meaning (Montague, 1973) of natural language. Our results entail that self-attention is strongly limited in its ability to model context-free languages or evaluate logical formulas. How should we reconcile this with the success of transformers at many natural language processing tasks? One possibility is that strong quantitative performance on NLP tasks can be achieved without genuine understanding of linguistic structure, as has been argued for other neural network models used in NLP that show limited knowledge of syntax despite delivering strong perplexity (Linzen et al., 2016; Marvin and Linzen, 2018). This would suggest that models such as transformers may not be capable, in principle, of understanding language in a fully human-like way. Another possibility is that humans also have limited capacity to solve such problems, which means that human-like language understanding may not require the full computational power needed to solve them. For instance, it has long been noted that center embeddings, syntactic structures exhibiting iterated bracketing, are very challenging for humans to process (Miller and Chomsky, 1963; Gibson and Thomas, 1999). Intriguingly, self-attention bears some resemblance to psycholinguistic models of memory in human sentence processing that assume that humans, while processing a word, attend to chunks that were stored in memory when processing some previous words (Lewis and Vasishth, 2005; Parker et al., 2017). Such processing models predict difficulty with center embedding because they cannot count brackets (Lewis and Vasishth, 2005), akin to what we have shown theoretically for neural network models based on self-attention.

While our hard attention results hold under extremely general assumptions, the analysis of soft attention builds on regularity properties of the operations that existing transformer models are composed of. It would be interesting to investigate to what extent computational power increases when other operations (e.g., discontinuous activations, or infinite attention logits) are allowed. One can show that the use of discontinuous activation functions such as the Heaviside function would enable perfect modeling of Parity; however, we do not know whether such changes would help with context-free languages such as 2Dyck. [Footnote 5: Analogy with Boolean circuits suggests that such results might be extremely hard to obtain: If transformers with soft attention were shown unable to model context-free languages such as the set of true logical formulas, even when allowing arbitrary activation functions, this would separate the functions computed by linear-size $TC^0$ circuits from the class $NC^1$. Separation of $TC^0$ and $NC^1$ is a widely believed but long-open conjecture whose solution would be a major breakthrough in computational complexity (Arora and Barak, 2009).]

7 Conclusion

We formally investigated the capabilities of self-attention in modeling regular languages and hierarchical structure. We showed that transformers cannot model periodic regular languages or basic recursion, either with hard or soft attention, and even if infinite precision is allowed. Our results theoretically confirm the idea that self-attention, by avoiding recurrence, has quite limited computational power.