1 Related Work
Prior Work on Self-Attention
Transformers were proposed by Vaswani et al. (2017); previous related work using self-attention includes Cheng et al. (2016); Parikh et al. (2016); Paulus et al. (2017); Lin et al. (2017). It has been a recurrent suggestion in the literature that transformers, relying entirely on self-attention, are computationally restricted because they cannot process their input sequentially. Dehghani et al. (2018) suggested that transformers cannot compute functions that require sequential processing of input, without providing further details or proofs. Similarly, Shen et al. (2018a); Chen et al. (2018); Hao et al. (2019) have introduced extensions of transformers with recurrence, citing similar intuitions about the limitations of transformers. Our results provide an explicit formalization of these limitations.
Other studies have experimentally tested the ability of transformers to learn structure. Most related to our work, Tran et al. (2018) compared the ability of transformers and LSTMs to learn hierarchical structure, specifically English subject–verb agreement and the evaluation of logical formulas. Based on their experimental results, they suggested that LSTMs are better at learning hierarchical structure.
Yang et al. (2019) experimentally investigated the power of self-attention to extract word-order information, finding differences between recurrent and self-attention models; however, these were modulated by the training objective. Lin et al. (2019) and Tenney et al. (2019) show that BERT (Devlin et al., 2018) encodes syntactic information.
Theoretical study of transformers was initiated by Pérez et al. (2019), who studied the ability of seq2seq transformers to emulate the computation of Turing machines. While we consider incremental modeling of sequences, where the number of computation steps is bounded by the input length, they study the setting in which the transformer performs an unbounded number of autoregressive decoding steps, not bounded by the input length.
Investigating the Power of Sequence Modeling Architectures
The computational power of recurrent neural networks has long been a subject of study. A particular focus has been on their ability to learn non-regular context-free languages, which are often thought to provide simple models of the recursion and hierarchical structure found in natural language.
A range of studies has experimentally examined the ability of recurrent networks to model counter languages (Kalinke and Lehmann, 1998; Gers and Schmidhuber, 2001; Cartling, 2008; Weiss et al., 2018; Suzgun et al., 2019). Other work has experimentally studied the performance of recurrent architectures on learning to recognize well-bracketed strings, a similar but more challenging problem (Sennhauser and Berwick, 2018; Skachkova et al., 2018; Bernardy, 2018). Beyond modeling formal languages, another line of work has studied the ability of LSTMs to model hierarchical structure as it occurs in realistic natural language data (Linzen et al., 2016; Gulordava et al., 2018).
Recently, Merrill (2019) theoretically studied several types of recurrent networks, showing that – in the finite precision setting – LSTMs recognize a subset of the counter languages, whereas GRUs and simple RNNs recognize regular languages.
A related, though different, strand of research has investigated the power of neural networks to model Turing machines. A classical result (Siegelmann and Sontag, 1995) states that – given unlimited computation time – recurrent networks can emulate the computation of Turing machines. Very recently, Pérez et al. (2019) have shown the same result for both (argmax-attention) Transformers and Neural GPUs. The crucial difference between these studies and studies of language recognition is that, in these studies, the networks are allowed to perform unbounded recurrent computations, arbitrarily longer than the input length.
2 Self-Attention
Here we define self-attention as used in transformers, following Vaswani et al. (2017). We have an input $x_1, \dots, x_n$, where the $x_i$ are input embeddings, and we assume that $x_n$ encodes an end-of-sequence symbol. We furthermore have a sequence $p_1, \dots, p_n$ of positional embeddings. These are independent of the input, and can be computed through some predefined scheme, or could be learned for each position occurring in the training data (Vaswani et al., 2017). Input and positional embeddings are combined (e.g., by addition or concatenation) to vectors $y^{(0)}_i = f^0(x_i, p_i)$ ($i = 1, \dots, n$), which we will refer to as Layer 0.
A transformer has a fixed number $L$ of layers; the activations $y^{(k)}_i$ at position $i$ of the $k$-th layer ($k = 1, \dots, L$) are defined as follows. Each layer has a set of attention heads; we first compute attention scores for the $h$-th head:
$$a^{(k,h)}_{i,j} = f^{att}_{k,h}\big(y^{(k-1)}_i, y^{(k-1)}_j\big)$$
where $f^{att}_{k,h}$ is a function combining the activations from the previous layer into an attention score. In practice, this can be implemented e.g. using dot-product or additive attention. The final activation of the head is computed by weighting the previous layer's activations according to attention weights:
$$b^{(k,h)}_i = \sum_{j=1}^{n} \hat{a}^{(k,h)}_{i,j} \, y^{(k-1)}_j$$
In the soft attention version, $\hat{a}^{(k,h)}_{i,\cdot} = \mathrm{softmax}\big(a^{(k,h)}_{i,\cdot}\big)$. In the hard attention variant (Pérez et al., 2019), we take the actual maximum of attention values instead of softmax: $\hat{a}^{(k,h)}_{i,j} = 1$ if $j = \arg\max_{j'} a^{(k,h)}_{i,j'}$, and $0$ otherwise. (This requires an arbitrary choice when there are multiple positions with maximal attention weight; we will assume that the one occurring first in the sequence is chosen. Our analysis would also work under other schemes of resolving ties, such as random selection.) The final activation at each position is then computed as
$$y^{(k)}_i = f^{act}_k\big(b^{(k,1)}_i, \dots, b^{(k,H)}_i\big)$$
where $f^{act}_k$ is implemented as a fully-connected feedforward network with a skip connection (Vaswani et al., 2017).
Hard and Soft Attention
There is a choice between soft attention and hard attention (Shen et al., 2018b; Pérez et al., 2019). The one prior theoretical study of transformers (Pérez et al., 2019) assumes hard attention. In practice, soft attention is easier to train with gradient-descent methods; however, analyses suggest that attention often concentrates on one or a few positions (Voita et al., 2019; Clark et al., 2019) and that the most important heads tend to be those that clearly focus on a few positions (Voita et al., 2019), suggesting that attention often behaves like hard attention in practice. We will examine both hard (Section 4) and soft (Section 5) attention settings.
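The distinction can be made concrete with a small sketch; the function name `attention_head` and the toy scores and values are illustrative assumptions, not the paper's construction:

```python
import numpy as np

def attention_head(scores, values, hard=False):
    """One attention head at a fixed query position (illustrative sketch).

    scores: attention scores against every position, shape (n,)
    values: previous-layer activations, shape (n, d)
    """
    if hard:
        # Hard attention: return the value at the position with maximal
        # score; np.argmax breaks ties toward the first position, matching
        # the tie-breaking convention assumed in the text.
        return values[np.argmax(scores)]
    # Soft attention: softmax-weighted average over all positions.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

scores = np.array([0.1, 2.0, 2.0, -1.0])
values = np.eye(4)
hard_out = attention_head(scores, values, hard=True)  # picks position 1
soft_out = attention_head(scores, values)             # mixes all positions
```

Note how hard attention commits to a single position even under a tie, while soft attention spreads mass over all positions with nonzero weight.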
Formalizing Language Recognition
We consider the problem of language recognition, the task of classifying input strings as either belonging or not belonging to a formal language. Following Weiss et al. (2018), we formalize this as the sequence-to-sequence task of transducing words to a label indicating membership ('in the language' or 'not in the language'). Following the construction of transformers in sequence-to-sequence tasks (Vaswani et al., 2017), we compute a softmax probability vector for this label from the last activation, obtained after reading the end-of-sequence symbol.
3 Regular and Context-Free Languages
We will analyze the ability of transformers to recognize regular and context-free languages, using two prominent representatives (Parity and 2Dyck).
Parity is the set of bit strings such that the number of 1s is even. This is a very simple regular language, recognized by a finite-state automaton with two states. The regular languages form the lowest layer of the Chomsky hierarchy, and even simple RNNs can compute all regular languages. Within the regular languages, a particularly basic class is formed by the counter-free or star-free languages (McNaughton and Papert, 1971), which can be expressed by regular expressions using only union, complementation, and concatenation. In some sense, Parity is the simplest non-counter-free, or periodic, regular language. This means that, if transformers cannot compute Parity, they cannot recognize (almost) any regular language that is not counter-free. (More precisely, inability to compute Parity entails that they cannot recognize any regular language whose syntactic morphism is not quasi-aperiodic (Barrington et al., 1992, p. 488).)
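For reference, the two-state recognizer can be sketched in a few lines (the function name `parity` is assumed for illustration):

```python
def parity(bits):
    """Recognize Parity with a two-state automaton: the state is the
    parity of the 1s seen so far."""
    state = 0          # 0: even number of 1s so far (accepting state)
    for b in bits:
        if b == '1':
            state ^= 1  # flipping on every 1 is the essential sequential step
    return state == 0

# Every single input bit matters: flipping any one bit flips membership.
```

The state flip on every 1 is exactly the kind of unbounded sequential dependence that the results below show self-attention cannot reproduce.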
In the context of natural language, Parity naturally arises in the evaluation of logical formulas: evaluating iterated negations is tantamount to tracking whether the number of nested negations is even or odd. If transformers cannot compute Parity, it follows that they also cannot evaluate logical formulas accurately.
2Dyck is the language of correctly bracketed words over the brackets '(', '[' and ')', ']'. This language is a very simple model of hierarchical structure. The Chomsky–Schützenberger theorem (Chomsky and Schützenberger, 1963) states that any context-free language arises from a variant of 2Dyck with multiple types of parentheses through intersection with a regular language and homomorphisms. Consequently, the ability of LSTMs to model languages such as 2Dyck has been an object of experimental study (Sennhauser and Berwick, 2018; Skachkova et al., 2018; Bernardy, 2018). Our theoretical results will show that transformers are strongly limited in their ability to model 2Dyck, including variants with fewer or more types of parentheses.
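A stack-based membership check makes the hierarchical nature of 2Dyck concrete (a minimal sketch; `is_2dyck` is an assumed name):

```python
def is_2dyck(word):
    """Check membership in 2Dyck using a stack of open brackets."""
    pairs = {')': '(', ']': '['}
    stack = []
    for ch in word:
        if ch in '([':
            stack.append(ch)
        elif ch in ')]':
            # a closing bracket must match the most recent open bracket
            if not stack or stack.pop() != pairs[ch]:
                return False
        else:
            return False  # symbol outside the alphabet
    return not stack  # well-bracketed iff every bracket was closed
```

The unbounded stack here is the capability that, by the results below, self-attention cannot robustly emulate.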
4 Results for Hard Attention
We will start our analysis with the study of hard attention (Pérez et al., 2019). We show that hard-attention transformers cannot represent Parity or 2Dyck. To keep the results maximally general, our analysis will use combinatorial arguments and make no assumptions about, e.g., activation functions or the norms of parameter matrices. In fact, we do not even assume that the internal position-wise representations in each layer are vector-valued, as opposed to, say, discrete structures.
The basic idea (see Figure 1) behind the proof is that, by fixing a small fraction of the input symbols in a particular way, we can 'capture the attention' of the transformer so that it ends up ignoring almost all remaining input symbols. This shows that the transformer could not have solved a problem such as Parity, where every single input bit matters. The idea of using such input restrictions has been successful in the theory of Boolean circuits (Furst et al., 1984; Yao, 1985; Hastad et al., 1994). In particular, Furst et al. (1984) famously used it to prove that polynomial-size bounded-depth Boolean circuits with AND, OR, and NOT gates cannot compute Parity. We describe a new method to prove the existence of suitable restrictions appropriate to transformers, as the proof approaches from the Boolean circuit literature do not seem to carry over to networks with continuous real-valued activations.
A restriction $\rho$ is a family of maps $\rho_n : \{1, \dots, n\} \to \{0, 1, *\}$, one for each input length $n$. A restriction is applied to a transformer by fixing, when the input size is $n$, the input at position $i$ to the value $\rho_n(i)$ whenever $\rho_n(i)$ is 0 or 1. The output value of the resulting transformer only depends on the inputs at those positions $i$ such that $\rho_n(i) = *$.
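The definition can be illustrated with a small sketch, representing the map for a fixed length $n$ as a list over {'0', '1', '*'} (names and encoding are assumptions for illustration):

```python
def apply_restriction(rho, x):
    """Apply a restriction to an input string over {0,1}.

    rho: list over {'0', '1', '*'} of the same length as x; positions
    marked '*' stay free, all others are forced to the restricted value.
    """
    assert len(rho) == len(x)
    return ''.join(x[i] if rho[i] == '*' else rho[i] for i in range(len(x)))

# The restricted function ignores all non-'*' positions of the original input:
rho = ['1', '*', '0', '*']
restricted_a = apply_restriction(rho, '0000')  # only positions 1 and 3 matter
restricted_b = apply_restriction(rho, '0101')
```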
The following technical result formalizes the intuition described above:
Theorem 1. Let any transformer be given, and let $c < 1$. Then there is a restriction $\rho$ and an integer $k$ such that $|\{i \le n : \rho_n(i) = *\}| \ge c \cdot n$ (for all sufficiently large $n$) and such that the function computed by the transformer on the restricted input depends only on $\le k$ inputs, independent of the input length $n$.
We first show how this entails that transformers do not recognize the two formal languages:
Transformers with hard attention cannot model Parity or 2Dyck.
Take any $c < 1$. For Parity, after applying the restriction provided by Theorem 1, the output of the transformer depends on at most $k$ inputs. An input of sufficiently large size $n$ thus has unrestricted inputs that do not influence the output. But flipping a single input bit flips membership, so the transformer's output cannot match membership in Parity beyond chance once $n$ is sufficiently large.
For 2Dyck, we show that hard-attention transformers cannot even solve the simpler variant 1Dyck with a single bracket type ('(', ')'). We first restrict an initial segment of the input positions to '('. After additionally applying the restriction provided by the theorem, the resulting restricted input is still compatible with both well-bracketed and non-well-bracketed inputs, but the prediction depends only on a bounded number of positions. Since membership in 1Dyck cannot be determined from a bounded number of positions, the transformer cannot recognize 1Dyck, and thus also not 2Dyck. ∎
It may be instructive to compare with similar languages that can be modeled by hard-attention transformers. First, $1^*$ (over the alphabet $\{0, 1\}$) is the regular language of words that contain only ones and no zeroes; its minimal automaton has two states, like that of Parity. A transformer can recognize this language by having an attention head attend to a position with input 0 if one exists, and rejecting if the head found such a position. Second, $a^n b^n$ is a very basic context-free language. It can be recognized using suitable positional embeddings by (1) having one head attend to the last position to determine the length, and (2) using this information to attend to any 'b' in the first half or any 'a' in the second half of the word. If such a symbol is found, the model rejects; otherwise it accepts. A crucial difference between these languages and Parity / 2Dyck is that fixing a few inputs can easily force non-membership, e.g., a single 0 for $1^*$, or an 'a' in the second half for $a^n b^n$. Such simple languages are therefore immune to the depth reduction method, and indeed can be modeled perfectly with self-attention.
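The described recognition strategy for $a^n b^n$ can be mimicked as an ordinary program; this is a hypothetical reference check mirroring the two attention tests, not a transformer implementation:

```python
def in_anbn(word):
    """Mimic the attention-based check for a^n b^n sketched above: reject
    iff the word has odd length, a 'b' in the first half, or an 'a' in
    the second half."""
    n = len(word)
    if n % 2 != 0 or any(c not in 'ab' for c in word):
        return False
    half = n // 2
    # one "head" looks for a misplaced 'b', another for a misplaced 'a'
    return 'b' not in word[:half] and 'a' not in word[half:]
```

Note that each test is falsified by a single misplaced symbol, which is exactly why fixing a few inputs can force non-membership.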
In general, the depth reduction method applies to languages that are sufficiently sensitive: if, for some $c < 1$, fixing $(1-c) \cdot n$ input symbols cannot force a word of length $n$ to be inside or outside of the language, then hard-attention transformers cannot recognize this language. Sensitivity of functions has been studied in computational complexity (Boppana, 1997; Gilmer et al., 2015; Gopalan et al., 2016; Rossman, 2018) and has more recently been linked to generalization in feedforward networks (De Palma et al., 2018). We intend to investigate these connections in future work.
Proof Idea of the Theorem
Our approach for proving Theorem 1 will be to iteratively remove the lowest layer of a given transformer by constructing input restrictions. This process is illustrated in Figure 1. After the first step, each of the heads in the first layer will only depend on a bounded number of input positions. In the second step, we apply the same argument to the heads in the second layer, so that each head in the second layer only depends on a bounded number of heads in the first layer. We can then collapse the first layer into a collection of feedforward networks that transform a bounded number of input positions into an activation of the lowest layer; at this point, the first layer has been entirely removed. Iterating this argument, we remove all layers until the prediction output depends only on a bounded number of input positions, bounded independently of the input length.
After the removal of a layer, the resulting structure is not a transformer any more, as each head in the lowest layer now depends on a combination of input positions. We introduce a technical definition to make this concept precise:
Definition 3 ($k$-Transformer).
Let $k$ be a positive integer. A $k$-transformer of depth $L$ is one in which the layer-0 activations depend on the input embeddings not just at one position $i$, but are a function of the embeddings at $k$ input positions:
$$y^{(0)}_i = f^0\big(x_{j_1(i)}, \dots, x_{j_k(i)}\big)$$
Here, the indices $j_1(i), \dots, j_k(i)$ may depend on the input length $n$.
Note that an ordinary transformer of depth $L$ is also a 1-transformer of depth $L$.
With this technical notion, we show that we can reduce layers, iteratively removing the lowest layer until no self-attention layer is left:
Lemma 4 (Depth Reduction Lemma).
Let a $k$-transformer with $L$ layers be given, together with a restriction $\rho$ such that $|\{i \le n : \rho_n(i) = *\}| \ge c \cdot n$ for all sufficiently large $n$. Choose any $c' < c$. Then there is a restriction $\bar\rho$ such that $|\{i \le n : \bar\rho_n(i) = *\}| \ge c' \cdot n$ for all sufficiently large $n$, and such that the resulting function is computed by a $k'$-transformer with $L-1$ layers, for some integer $k'$ (depending on $k$, $c$, $c'$, and $H$), where $H$ is the number of attention heads at each layer and position.
Before proving this lemma, we note that it implies Theorem 1.
Proof of Theorem 1.
The output of the transformer is determined by the last activation $y^{(L)}_n$. Apply the Depth Reduction Lemma iteratively, choosing the constants in the lemma appropriately, until only the zeroth layer remains. After applying the resulting restriction, the final activation is computed by a function of the form $f^0$, which is determined by a bounded number of input bits. ∎
4.1 Proving the Depth Reduction Lemma
The rest of this section is devoted to proving the Depth Reduction Lemma. We will carry out the first part of the argument for an arbitrary integer parameter; in the second part, we will choose this parameter sufficiently large.
Part 1: Preliminaries
We construct the restrictions $\rho_n$ separately for each input length $n$. For each layer-1 attention head at a position $i$ and each position $j$, we compute the maximum attention value that this head can achieve for the pair $(i, j)$ over all possible values of the relevant inputs. We order the positions $j$ downwards by this value, obtaining a sequence of positions for each layer-1 attention head (in the case of ties, we order by position).
For each layer-1 attention head, we then select a subsequence of these positions such that (1) for each element of the subsequence, there is at least one input that feeds only into that element and into no other element of the subsequence, and (2) the subsequence is minimal, i.e., there is no shorter subsequence that also satisfies (1). There is only one case in which no such subsequence can be found, namely when the positions together depend on only a bounded number of inputs; in that case, this head already satisfies the condition we are aiming for.
We say that a layer-1 head depends on an input if that input appears as an input to some element of its selected subsequence (this characterization uses the minimality of the subsequence). Two layer-1 heads are neighbors if they both depend on some common input.
We will construct input restrictions using probabilistic arguments over i.i.d. random restrictions. For this approach to succeed, we require a sufficient amount of independence between the activations of different heads in layer 1. We thus need to ensure that the number of neighbors of each head is bounded.
Fix some constant bound (to be chosen later). Let $H$ be the number of attention heads at each position of layer 1. First, suppose some inputs each have a large number of depending layer-1 heads, and assume the number of such inputs grows linearly in $n$. Then the number of pairs of inputs and depending layer-1 heads would grow faster than linearly in $n$. On the other hand, since each of the $nH$ layer-1 heads depends on only a bounded number of inputs, there are only $O(n)$ such pairs, a contradiction. Thus, the number of such inputs is sublinear in $n$, and we can modify the restriction so that these inputs are set to some fixed value (it does not matter which one), leaving the other inputs unchanged. After this manipulation, every input has at most a bounded number of depending layer-1 heads, independent of $n$.
Similarly, suppose the number of inputs feeding into many layer-0 activations grew linearly in $n$. Then the number of pairs of inputs and depending layer-0 activations would grow faster than linearly, whereas there are only $O(n)$ such pairs, a contradiction. So the number of such inputs is sublinear, and we can again restrict them to some fixed value.
After these preparations, the number of unrestricted inputs is still at least a constant fraction of $n$, for all sufficiently large $n$.
Part 2: Constructing Restrictions
After the previous part, we are in the setting where every input has at most a bounded number of depending layer-1 heads, and consequently every layer-1 head has at most a bounded number of neighbors. Also, every input feeds into at most a bounded number of layer-0 activations.
In order to prove the existence of suitable input restrictions, we apply the 'probabilistic method': we define a probability distribution over restrictions of inputs, and show that the probability mass assigned to restrictions of the required type is strictly greater than zero, which shows that such a restriction exists.
For each input length $n$, define the distribution that independently assigns to each input position the symbol $*$ with probability $q$ (to be chosen later), and 0 or 1 with equal probability otherwise. On those inputs where the given restriction $\rho$ is already fixed, we constrain this random restriction to agree with $\rho$. For the $i$-th layer-1 attention head (there are at most $nH$ many such heads), define $A_i$ to be the event that none of the positions in its selected subsequence are assigned the value that produces the highest attention weight (each of these positions is assigned either the other value, or $*$). Define $B$ to be the event that more inputs are set to fixed values than the lemma allows.
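The random restriction can be sketched as follows; the parameter names are assumptions, and the concentration check previews the Chernoff-bound step used next:

```python
import random

def sample_restriction(n, q, seed=None):
    """Sample a random restriction as in the proof: each position becomes
    '*' (free) with probability q, else '0' or '1' with equal probability."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        if rng.random() < q:
            out.append('*')
        else:
            out.append(rng.choice('01'))
    return out

# By a Chernoff bound, the fraction of free ('*') positions concentrates
# sharply around q, so enough inputs remain unrestricted with high probability.
rho = sample_restriction(10_000, q=0.3, seed=0)
star_fraction = rho.count('*') / len(rho)
```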
We first show that the probability of each $A_i$ decays exponentially in $n$. Let $E_j$ be the event that the $j$-th element of the head's selected subsequence is not assigned the value that produces the highest attention weight for the given head. Note that $A_i$ is the intersection of the events $E_j$. Each $E_j$ can be statistically dependent on at most a bounded number of other events $E_{j'}$, since each element of the subsequence shares inputs with only boundedly many others. Therefore, there is a linearly sized set of mutually independent events among them; since each of these has probability bounded away from one, the probability of $A_i$ decays exponentially in $n$.
Furthermore, a Chernoff bound (Mitzenmacher and Upfal, 2005) shows that the probability of $B$ also decays exponentially in $n$, since the number of independent samples in the construction of the random restriction is at least a constant fraction of $n$. We want to show that the parameters can be chosen so that the conditions of the asymmetric Lovász Local Lemma (Mitzenmacher and Upfal, 2005) are satisfied; once this is shown, the Local Lemma guarantees that there is some input restriction that avoids all events $A_i$ and $B$.
The required inequalities can be satisfied by choosing the parameters appropriately, independently of $n$: for the first condition, choosing the relevant constant large enough makes the right-hand side arbitrarily close to one, in particular greater than the left-hand side; the second condition is then satisfied by making the remaining parameters large enough (again, independently of $n$). In conclusion, for each sufficiently large $n$, there exists a restriction that avoids all events $A_i$ and $B$.
With such a restriction, the number of unrestricted inputs is at least a constant fraction of $n$ for all sufficiently large $n$. Choosing the parameters of the construction appropriately, this fraction can be made at least $c'$, as required by the lemma.
We now note that, since every layer-1 head depends on only a bounded number of inputs after applying the restriction, each layer-1 activation depends on only a bounded number of inputs. We can thus collapse layers 0 and 1, converting layer-1 activations into layer-0 activations, and obtain a $k'$-transformer with one layer fewer that performs the same computation as before when the restriction is applied. This concludes the proof of the Depth Reduction Lemma.
5 Results for Soft Attention
In the previous section, we showed that transformers using hard attention are not able to recognize a range of core formal languages. In this section, we study soft attention. Here, we will use the smoothness of the operations used in transformers to show that any transformer, as inputs get longer, will not be able to robustly separate members of the languages from other inputs. The idea behind the proof is that the impact of any single input on the output of the transformer is small if the input is long:
If we exchange one input symbol, then the change in the resulting activations at the decoder layers is bounded as $O(1/n)$, with constants depending on the parameter matrices, where $n$ is the input length.
This contrasts with recurrent networks: Changing a single input can have nonnegligible impact on the final state even for very long input. E.g., an RNN recognizing Parity through a hidden state that encodes parity of the current prefix will flip its hidden state if a single input bit is flipped.
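The contrast can be illustrated numerically; the uniform-weight average stands in for a soft-attention head in the regime where no position dominates (a simplifying assumption), while a two-state update stands in for the recurrent network:

```python
import numpy as np

def uniform_attention_output(values):
    """A single soft-attention average with uniform weights, the regime in
    which each of the n positions contributes only O(1/n) to the output."""
    return values.mean()

def parity_state(bits):
    """Hidden state of a minimal recurrent 'network' tracking prefix parity."""
    state = 0
    for b in bits:
        state ^= int(b)
    return state

n = 1000
x = np.zeros(n, dtype=int)
y = x.copy()
y[0] = 1  # flip a single input bit

# Soft attention: the output moves by only 1/n ...
attention_delta = abs(uniform_attention_output(y) - uniform_attention_output(x))
# ... while the recurrent parity state flips completely.
state_changed = parity_state(x) != parity_state(y)
```

As $n$ grows, `attention_delta` vanishes while the recurrent state change does not, which is the mechanism behind the cross-entropy result below.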
This result entails that, as inputs become longer, soft-attention transformers cannot robustly model many formal languages, and cannot even achieve good cross-entropies on a large class of prediction problems. To make this precise, we take a more quantitative angle: we assign a probability distribution over inputs for each input length $n$, and consider the behavior of the cross-entropy as $n \to \infty$. For Parity, we simply consider the distribution over i.i.d. bitstrings of length $n$, and consider the task of predicting whether the next symbol can be the end-of-sequence symbol, which is true if and only if the prefix has an even number of ones. For 2Dyck, we take the uniform distribution over well-bracketed prefixes of length $n$, and ask the model to classify the set of valid next characters, which is a subset of the four bracket symbols. (Only an exponentially small subset of the strings over these symbols are well-bracketed; thus, cross-entropy on the classification task is less meaningful for this language. Considering prediction of the next symbol sidesteps this issue.) We consider the problem of predicting the label from the input separately for each input length $n$, and consider the cross-entropy as $n \to \infty$.
Let a soft-attention transformer be given for one of the languages Parity and 2Dyck. As $n \to \infty$, its cross-entropy converges to chance level.
For Parity, exchanging a single bit flips membership. Thus, for any member of the language, there is an equally likely non-member such that the output activations differ by only $O(1/n)$.
For 2Dyck, the probability that a well-bracketed prefix of length $n$ is exactly balanced vanishes as $n \to \infty$. In all other cases, the correct label includes one of the two closing parentheses, which depends on a single previous symbol, and both possibilities are equally likely if that symbol is unknown. ∎
5.1 Proof of the Lemma
Having established the main theorem, we proceed to proving the technical lemma:
Proof of Lemma.
Let us compare the activations at the decoder layer for two inputs that differ only at the $j$-th position. Let $\epsilon$ be the norm of the difference of the input embeddings at this position.
We show by induction over the layer index $k$ that, for some constants $c_k$ (depending on the parameter matrices), the difference between the activations $y^{(k)}_i$ computed for the two inputs is bounded by $c_k \cdot \epsilon$ at the position $i = j$ where the inputs differ, and by $c_k \cdot \epsilon / n$ at all other positions. Once we have shown this, we have found that the influence of any individual input on the final prediction is $O(1/n)$, with constants depending on the norms of the parameter matrices and the number of layers.
For $k = 0$, the difference is $\epsilon$ at position $j$ and zero at all other positions. (We are assuming that the combination of input and positional embeddings is addition or concatenation (Vaswani et al., 2017); for operations such as an MLP, there would be an additional Lipschitz constant depending on the parameters and activation functions.)
For $k > 0$, we first note that the difference in $y^{(k)}_i$ is bounded by the difference in the head activations $b^{(k,h)}_i$ times the Lipschitz constant of $f^{act}_k$. As $f^{act}_k$ is implemented as a ReLU MLP (Vaswani et al., 2017), this Lipschitz constant depends on the norms of the parameter matrices. Attention logits are bounded in terms of the norms of the activations and parameter matrices, both for multiplicative/bilinear and for additive attention; if the logits lie in $[-A, A]$, then any attention weight after the softmax is upper bounded by $e^{2A}/n$.
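The bound on individual attention weights can be checked numerically; $A$, $n$, and the sampled logits are illustrative assumptions:

```python
import numpy as np

def max_softmax_weight(logits):
    """Largest softmax weight for a given vector of attention logits."""
    w = np.exp(logits - logits.max())
    return (w / w.sum()).max()

# If all logits lie in [-A, A], the largest weight is at most
# e^A / (n * e^{-A}) = e^{2A} / n, so every weight is O(1/n).
A, n = 2.0, 500
rng = np.random.default_rng(0)
logits = rng.uniform(-A, A, size=n)
bound = np.exp(2 * A) / n
```

With uniform logits, every weight is exactly $1/n$; bounded logits can concentrate the weights only by a constant factor, never onto a single position.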
Recall that $b^{(k,h)}_i = \sum_j \hat{a}^{(k,h)}_{i,j} y^{(k-1)}_j$, where the weights $\hat{a}^{(k,h)}_{i,j}$ sum to one. Changing the input changes both the weights and the summands. By the inductive hypothesis, the summand at position $j$ changes by at most a constant multiple of $\epsilon$, but its weight is $O(1/n)$; every other summand changes by at most a constant multiple of $\epsilon/n$. The change in the attention weights contributes at most $O(\epsilon/n)$ as well, since the softmax is Lipschitz in logits that change by $O(\epsilon)$ at one position and by $O(\epsilon/n)$ elsewhere, and each individual weight is $O(1/n)$. For $i = j$, the total change is therefore bounded by a constant multiple of $\epsilon$; for $i \ne j$, it is bounded by a constant multiple of $\epsilon/n$. This proves the inductive step. ∎
We have shown that, even with infinite precision, transformers cannot robustly model non-counter-free regular languages, nor basic hierarchical structure.
How do our results compare to what is known about LSTMs? Recurrent networks such as SRNNs and LSTMs can perfectly emulate finite-state automata, and therefore can model any finite state language with optimal cross-entropy, as long as the state transition and symbol emission distributions are Markovian. In particular, Parity of i.i.d. bitstrings can be predicted with zero cross-entropy, independent of the input length.
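The emulation of a finite-state automaton by a recurrent state can be sketched directly; the DFA encoding and names are illustrative assumptions:

```python
def run_dfa(transitions, start, accepting, word):
    """Emulate a DFA step by step, the way a recurrent network can: the
    hidden state stores the automaton state, updated once per symbol."""
    state = start
    for symbol in word:
        state = transitions[(state, symbol)]
    return state in accepting

# Two-state automaton for Parity: the state tracks the number of 1s mod 2.
parity_dfa = {
    ('even', '0'): 'even', ('even', '1'): 'odd',
    ('odd', '0'): 'odd', ('odd', '1'): 'even',
}
```

Because the update is applied sequentially at every symbol, the recurrent state can track Parity exactly, in contrast to the bounded per-symbol influence shown for soft attention above.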
Infinite-precision RNNs and LSTMs can model stacks Tabor (2000); Grüning (2006); Kirov and Frank (2012) and thus are theoretically capable of modeling 2Dyck and other deterministic context-free languages perfectly. This clear contrast between infinite-precision LSTMs and our findings for infinite-precision transformers may provide a theoretical explanation for the empirical finding that LSTMs seem to outperform transformers in modeling hierarchical structure (Tran et al., 2018).
Our results entail that self-attention is strongly limited in its ability to model context-free languages or evaluate logical formulas. How should we reconcile this with the success of transformers at many natural language processing tasks? One possibility is that strong quantitative performance on NLP tasks can be achieved without genuine understanding of linguistic structure, as has been argued for other neural network models used in NLP that show limited knowledge of syntax despite delivering strong perplexity (Linzen et al., 2016; Marvin and Linzen, 2018). This would suggest that models such as transformers may not be capable in principle of understanding language in a fully human-like way. Another possibility is that humans also have limited capacity to solve such problems, so that human-like language understanding may not require the full computational power needed to solve them. For instance, it has long been noted that center embeddings, syntactic structures exhibiting iterated bracketing, are very challenging for humans to process (Miller and Chomsky, 1963; Gibson and Thomas, 1999). Intriguingly, self-attention bears some resemblance to psycholinguistic models of memory in human sentence processing, which assume that humans, while processing a word, attend to chunks that were stored in memory when processing previous words (Lewis and Vasishth, 2005; Parker et al., 2017). Such processing models predict difficulty with center embedding because they cannot count brackets (Lewis and Vasishth, 2005), akin to what we have shown theoretically for neural network models based on self-attention.
While our hard-attention results hold under extremely general assumptions, the analysis of soft attention builds on regularity properties of the operations that existing transformer models are composed of. It would be interesting to investigate to what extent computational power increases when other operations, e.g., discontinuous activations or infinite attention logits, are allowed. One can show that the use of discontinuous activation functions such as the Heaviside function would enable perfect modeling of Parity; however, we do not know whether such changes would help with context-free languages such as 2Dyck. (Analogy with Boolean circuits suggests that such results might be extremely hard to obtain: if transformers with soft attention were shown unable to model context-free languages such as the set of true logical formulas, even when allowing arbitrary activation functions, this would separate the functions computed by a class of linear-size circuits from a larger circuit class; such separations are widely believed but long-open conjectures whose solution would be a major breakthrough in computational complexity (Arora and Barak, 2009).)
We formally investigated the capabilities of self-attention in modeling regular languages and hierarchical structure. We showed that transformers cannot model periodic regular languages or basic recursion, either with hard or soft attention, and even if infinite precision is allowed. Our results theoretically confirm the idea that self-attention, by avoiding recurrence, has quite limited computational power.
- Arora and Barak (2009) Sanjeev Arora and Boaz Barak. 2009. Computational complexity: a modern approach. Cambridge University Press.
- Barrington et al. (1992) David A. Mix Barrington, Kevin Compton, Howard Straubing, and Denis Thérien. 1992. Regular languages in NC¹. Journal of Computer and System Sciences, 44(3):478–499.
- Bengio et al. (1994) Yoshua Bengio, Patrice Simard, Paolo Frasconi, et al. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166.
- Bernardy (2018) Jean-Philippe Bernardy. 2018. Can recurrent neural networks learn nested recursion? LiLT (Linguistic Issues in Language Technology), 16(1).
- Boppana (1997) Ravi B Boppana. 1997. The average sensitivity of bounded-depth circuits. Information processing letters, 63(5):257–261.
- Cartling (2008) Bo Cartling. 2008. On the implicit acquisition of a context-free grammar by a simple recurrent neural network. Neurocomputing, 71(7-9):1527–1537.
- Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. 2018. The best of both worlds: Combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849.
- Cheng et al. (2016) Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733.
- Child et al. (2019) Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- Chomsky and Schützenberger (1963) Noam Chomsky and Marcel P Schützenberger. 1963. The algebraic theory of context-free languages. In Studies in Logic and the Foundations of Mathematics, volume 35, pages 118–161. Elsevier.
- Clark et al. (2019) Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT’s attention. In Proceedings of BlackboxNLP 2019.
- Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
- De Palma et al. (2018) Giacomo De Palma, Bobak Toussi Kiani, and Seth Lloyd. 2018. Deep neural networks are biased towards simple functions. arXiv preprint arXiv:1812.10156.
- Dehghani et al. (2018) Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. 2018. Universal transformers. arXiv preprint arXiv:1807.03819.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Everaert et al. (2015) Martin BH Everaert, Marinus AC Huybregts, Noam Chomsky, Robert C Berwick, and Johan J Bolhuis. 2015. Structures, not strings: linguistics as part of the cognitive sciences. Trends in cognitive sciences, 19(12):729–743.
- Furst et al. (1984) Merrick Furst, James B Saxe, and Michael Sipser. 1984. Parity, circuits, and the polynomial-time hierarchy. Mathematical systems theory, 17(1):13–27.
- Gers and Schmidhuber (2001) Felix A Gers and E Schmidhuber. 2001. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340.
- Gibson and Thomas (1999) Edward Gibson and James Thomas. 1999. Memory limitations and structural forgetting: The perception of complex ungrammatical sentences as grammatical. Language and Cognitive Processes, 14(3):225–248.
- Gilmer et al. (2015) Justin Gilmer, Michal Kouckỳ, and Michael E Saks. 2015. A new approach to the sensitivity conjecture. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 247–254. ACM.
- Gopalan et al. (2016) Parikshit Gopalan, Noam Nisan, Rocco A Servedio, Kunal Talwar, and Avi Wigderson. 2016. Smooth boolean functions are easy: Efficient algorithms for low-sensitivity functions. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, pages 59–70. ACM.
- Grüning (2006) André Grüning. 2006. Stack-like and queue-like dynamics in recurrent neural networks. Connection Science, 18(1):23–42.
- Gulordava et al. (2018) Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. arXiv preprint arXiv:1803.11138.
- Hao et al. (2019) Jie Hao, Xing Wang, Baosong Yang, Longyue Wang, Jinfeng Zhang, and Zhaopeng Tu. 2019. Modeling recurrence for transformer. arXiv preprint arXiv:1904.03092.
- Hastad et al. (1994) Johan Hastad, Ingo Wegener, Norbert Wurm, and Sang-Zin Yi. 1994. Optimal depth, very small size circuits for symmetrical functions in AC0. Information and Computation, 108(2):200–211.
- Kalinke and Lehmann (1998) Yvonne Kalinke and Helko Lehmann. 1998. Computation in recurrent neural networks: From counters to iterated function systems. In Australian Joint Conference on Artificial Intelligence, pages 179–190. Springer.
- Kirov and Frank (2012) Christo Kirov and Robert Frank. 2012. Processing of nested and cross-serial dependencies: an automaton perspective on srn behaviour. Connection Science, 24(1):1–24.
- Lewis and Vasishth (2005) Richard L Lewis and Shravan Vasishth. 2005. An activation-based model of sentence processing as skilled memory retrieval. Cognitive science, 29(3):375–419.
- Lin et al. (2019) Yongjie Lin, Yi Chern Tan, and Robert Frank. 2019. Open sesame: Getting inside BERT's linguistic knowledge. arXiv preprint arXiv:1906.01698.
- Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
- Linzen et al. (2016) Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.
- Marvin and Linzen (2018) Rebecca Marvin and Tal Linzen. 2018. Targeted syntactic evaluation of language models. arXiv preprint arXiv:1808.09031.
- McNaughton and Papert (1971) Robert McNaughton and Seymour A Papert. 1971. Counter-Free Automata (MIT research monograph no. 65). The MIT Press.
- Merrill (2019) William Merrill. 2019. Sequential neural networks as automata. arXiv preprint arXiv:1906.01615.
- Miller and Chomsky (1963) George A Miller and Noam Chomsky. 1963. Finitary models of language users. In Handbook of Mathematical Psychology.
- Miller and Hardt (2018) John Miller and Moritz Hardt. 2018. When recurrent models don’t need to be recurrent. arXiv preprint arXiv:1805.10369.
- Mitzenmacher and Upfal (2005) Michael Mitzenmacher and Eli Upfal. 2005. Probability and Computing. Cambridge University Press, Cambridge.
- Montague (1973) Richard Montague. 1973. The proper treatment of quantification in ordinary English. In Approaches to natural language, pages 221–242. Springer.
- Parikh et al. (2016) Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.
- Parker et al. (2017) Dan Parker, Michael Shvartsman, and Julie A Van Dyke. 2017. The cue-based retrieval theory of sentence comprehension: New findings and new challenges. Language processing and disorders, pages 121–144.
- Paulus et al. (2017) Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
- Pérez et al. (2019) Jorge Pérez, Javier Marinković, and Pablo Barceló. 2019. On the Turing completeness of modern neural network architectures. arXiv preprint arXiv:1901.03429.
- Rossman (2018) Benjamin Rossman. 2018. The average sensitivity of bounded-depth formulas. computational complexity, 27(2):209–223.
- Sennhauser and Berwick (2018) Luzi Sennhauser and Robert Berwick. 2018. Evaluating the ability of LSTMs to learn context-free grammars. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 115–124.
- Shen et al. (2018a) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018a. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Shen et al. (2018b) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Sen Wang, and Chengqi Zhang. 2018b. Reinforced self-attention network: a hybrid of hard and soft attention for sequence modeling. arXiv preprint arXiv:1801.10296.
- Siegelmann and Sontag (1995) Hava T Siegelmann and Eduardo D Sontag. 1995. On the computational power of neural nets. Journal of Computer and System Sciences, 50:132–150.
- Skachkova et al. (2018) Natalia Skachkova, Thomas Trost, and Dietrich Klakow. 2018. Closing brackets with recurrent neural networks. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 232–239.
- Suzgun et al. (2019) Mirac Suzgun, Yonatan Belinkov, and Stuart M Shieber. 2019. On evaluating the generalization of LSTM models in formal languages. Proceedings of the Society for Computation in Linguistics (SCiL), pages 277–286.
- Tabor (2000) Whitney Tabor. 2000. Fractal encoding of context-free grammars in connectionist networks. Expert Systems, 17(1):41–56.
- Tenney et al. (2019) Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950.
- Tran et al. (2018) Ke Tran, Arianna Bisazza, and Christof Monz. 2018. The importance of being recurrent for modeling hierarchical structure. arXiv preprint arXiv:1803.03585.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Voita et al. (2019) Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418.
- Weiss et al. (2018) Gail Weiss, Yoav Goldberg, and Eran Yahav. 2018. On the practical computational power of finite precision rnns for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745.
- Yang et al. (2019) Baosong Yang, Longyue Wang, Derek F Wong, Lidia S Chao, and Zhaopeng Tu. 2019. Assessing the ability of self-attention networks to learn word order. arXiv preprint arXiv:1906.00592.
- Yao (1985) A. C. Yao. 1985. Separating the polynomial-time hierarchy by oracles. In 26th Annual Symposium on Foundations of Computer Science (sfcs 1985), pages 1–10.