Transformer [vaswani2017attention] is a recent self-attention based sequence-to-sequence architecture which has led to state of the art results across various NLP tasks including machine translation [ott-etal-2018-scaling, vaswani2017attention], language modeling [radford2018improving], question answering [devlin-etal-2019-bert, liu2019roberta] and semantic role labeling [strubell-etal-2018-linguistically]. Although a number of variants of Transformers have been proposed for different tasks, the original architecture still underlies these variants.
While the training and generalization of machine learning models such as Transformers are the central goals in their analysis, an essential prerequisite to this end is characterization of the computational power of the model: training a model for a certain task cannot succeed if the model is computationally incapable of carrying out the task. While the computational capabilities of recurrent networks (RNNs) have been studied for decades [kolen2001field, siegelmann2012neural], for Transformers we are still in the early stages. The celebrated work of [siegelmann1992computational] showed, assuming arbitrary precision, that RNNs are Turing-complete, meaning that they are capable of carrying out any algorithmic task formalized by Turing machines. Recently, [perez2019turing] have shown that vanilla Transformers with hard-attention can also simulate Turing machines given arbitrary precision.
The role of various components in the efficacy of Transformers is an important question for further improvements. Recently, some researchers have sought to empirically analyze different aspects of Transformers. [voita-etal-2019-analyzing], [michel2019sixteen] investigate the effect of pruning attention heads and their results suggest that different types of attention heads may have different roles in the functioning of the network. Since the Transformer doesn’t process the input sequentially, it requires some form of positional information. Various positional encoding schemes have been proposed to capture order information [shaw-etal-2018-self, dai-etal-2019-transformer, Huang2018AnIR]. At the same time, on machine translation, [yang2019assessing] showed that the performance of Transformers with only positional masking is comparable to that with positional encodings. [kerneltsai2019transformer] raised the question of whether explicit encoding is necessary if positional masking is used. Since [perez2019turing]’s Turing-completeness proof relied heavily on residual connections, they asked whether these connections are essential for Turing-completeness. In this paper, we take a step towards answering such questions.
Our primary goal is to better understand the role of different components with respect to the computational power of the network. The main contributions of our paper are:
(i) We provide an alternate proof to show that Transformers are Turing-complete by directly relating Transformers to RNNs. More importantly, we prove that Transformers with positional masking and without positional encoding are also Turing-complete.
(ii) We analyze the necessity of various components such as self attention blocks, residual connections and feedforward networks for Turing-completeness. Figure 2 provides an overview of our results.
(iii) We explore the implications of our results via experiments on machine translation and synthetic datasets.
2 Related Work
Self Attention Models became popular after the success of Transformers. Several variants were introduced to address the shortcomings of vanilla Transformers on other seq-to-seq tasks. Variants such as universal Transformers [universaldehghani2018] were proposed to improve the performance on learning general algorithms. Efforts were made to combine recurrent and self-attention models [hao-etal-2019-modeling, chen-etal-2018-best]. A positional masking based approach to incorporate order information was advocated by [shen2018disan].
Computational Power of neural networks has been studied since the foundational paper [mcculloch1943logical]; in particular, this aspect of RNNs has long been studied [kolen2001field]. The seminal work by [siegelmann1992computational] showed that RNNs can simulate a Turing machine by using unbounded precision. [chen2017recurrent] showed that RNNs with ReLU activations are also Turing-complete. Many recent works have explored the computational power of RNNs in practical settings. The ability of RNNs to recognize counter-like languages has been studied by several works [gers2001lstm, suzgun2018evaluating]. Recently, [weiss2018practical] and [suzgun2019lstm] showed that LSTMs can learn to exhibit counting-like behavior in practice. The ability of RNNs to recognize well-bracketed strings has also been recently studied [sennhauser-berwick-2018-evaluating, skachkova-etal-2018-closing]. However, such analyses of Transformers have been relatively scarce.
Theoretical work on Transformers was initiated by [perez2019turing], who formalized the notion of Transformers and showed that they can simulate a Turing machine given arbitrary precision. [hahn2019theoretical] showed some limitations of Transformer encoders in modeling regular and context-free languages; however, these are not applicable to the complete Transformer architecture. It has been recently shown that Transformers are universal approximators of sequence-to-sequence functions given arbitrary precision [anonymous2020are]. With a goal similar to ours, [kerneltsai2019transformer] attempted to study the attention mechanism via a kernel formulation. However, a systematic study of the various components of the Transformer had not been done.
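All the numbers used in our computations will be rational, denoted $\mathbb{Q}$. For a sequence $X = (x_1, \ldots, x_n)$, we set $X_j := (x_1, \ldots, x_j)$ for $1 \le j \le n$. We will work with an alphabet $\Sigma$ of size $m$, with special symbols $\#$ and $\$$ signifying the beginning and end of the input sequence, resp. The symbols are mapped to vectors via a given ‘base’ embedding $f_b : \Sigma \to \mathbb{Q}^{d_b}$, where $d_b$ is the dimension of the embedding. E.g., this embedding could be the one used for processing the symbols by the RNN. We set $f_b(\#) = f_b(\$) = \mathbf{0}_{d_b}$. Positional encoding is a function $\mathrm{pos} : \mathbb{N} \to \mathbb{Q}^{d_b}$. Together, these provide embedding for a symbol $s$ at position $i$ given by $f(f_b(s), \mathrm{pos}(i))$, often taken to be simply $f_b(s) + \mathrm{pos}(i)$. Vector $[\![s]\!] \in \mathbb{Q}^m$ denotes one-hot encoding of $s \in \Sigma$.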
We follow [siegelmann1992computational] in our definition of RNNs. To feed the sequences to the RNN, these are converted to the vectors where . The RNN is given by the following recurrence for
where is a multilayer feedforward network (FFN) with activation, matrices and , and the hidden state with given initial hidden state , is the hidden state dimension. In their construction, is a 4-layer FFN. Note that it is not equivalent to 4-layer stacked RNNs.
After the last symbol has been fed, we continue to feed the RNN with the terminal symbol until it halts. This allows the RNN to carry out computation after having read the input.
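The recurrence and the "keep feeding the terminal symbol until it halts" convention can be sketched as follows. This is a minimal illustration, assuming the recurrence has the standard form h_t = g(W_h h_{t-1} + W_x x_t + b) and that g is a single saturated-linear layer (the construction cited above uses a 4-layer FFN). The names `sat`, `rnn_step`, `run_rnn`, the bias `b`, the `halted` predicate and the toy weights are all hypothetical. Exact rational arithmetic is obtained with `fractions.Fraction`.

```python
from fractions import Fraction


def sat(x):
    """Saturated linear activation: 0 below 0, identity on [0, 1], 1 above 1."""
    return min(max(x, Fraction(0)), Fraction(1))


def matvec(W, v):
    """Matrix-vector product over exact rationals."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def rnn_step(W_h, W_x, b, h_prev, x):
    """One step h_t = sat(W_h h_{t-1} + W_x x_t + b); a single-layer g is assumed."""
    pre = [p + q + c for p, q, c in zip(matvec(W_h, h_prev), matvec(W_x, x), b)]
    return [sat(z) for z in pre]


def run_rnn(W_h, W_x, b, h0, inputs, terminal, halted, max_steps=1000):
    """Feed all inputs, then keep feeding `terminal` until `halted(h)` holds."""
    h, steps = h0, 0
    for x in inputs:
        h = rnn_step(W_h, W_x, b, h, x)
        steps += 1
    while not halted(h) and steps < max_steps:
        h = rnn_step(W_h, W_x, b, h, terminal)
        steps += 1
    return h, steps


# Toy instance (hypothetical weights): the state grows by 1/4 per step and
# the network "halts" once the state saturates at 1.
W_h, W_x, b = [[Fraction(1)]], [[Fraction(1, 4)]], [Fraction(0)]
h, steps = run_rnn(W_h, W_x, b, [Fraction(0)],
                   inputs=[[Fraction(1)], [Fraction(1)]],
                   terminal=[Fraction(1)],
                   halted=lambda state: state[0] == 1)
```

In the toy run, two steps consume the input and two further steps on the terminal vector are needed before the halting condition is met, which is the extra computation time the convention is meant to allow.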
[siegelmann1992computational] Any seq-to-seq function computable by a Turing machine can also be computed by an RNN.
For details please see section B.1 in appendix.
3.2 Transformer Architecture
Vanilla Transformer. We describe the original Transformer architecture with positional encoding [vaswani2017attention] as formalized by [perez2019turing], with some modifications. All vectors in this subsection are from .
The transformer, denoted , is a seq-to-seq architecture. Its input consists of (i) a sequence of vectors, (ii) a seed vector . The output is a sequence of vectors. The sequence is obtained from the sequence of symbols by using the embedding mentioned earlier: . The transformer consists of composition of transformer encoder and a transformer decoder. For the feedforward networks in the transformer layers we use the activation as in [siegelmann1992computational]
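, namely the saturated linear activation function $\sigma$, which takes value $0$ for $x < 0$, value $x$ for $0 \le x \le 1$, and value $1$ for $x > 1$. This activation can be easily replaced by the standard ReLU activation via $\sigma(x) = \mathrm{ReLU}(x) - \mathrm{ReLU}(x - 1)$.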
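The replacement of the saturated linear activation by ReLU can be checked numerically. The sketch below assumes the usual definition of the saturated linear function (0 below 0, identity on [0, 1], 1 above 1) and compares it with a difference of two ReLUs on a grid of rationals. The function names are illustrative.

```python
from fractions import Fraction


def relu(x):
    """Standard ReLU on exact rationals."""
    return max(x, Fraction(0))


def sat(x):
    """Saturated linear activation, defined piecewise."""
    return min(max(x, Fraction(0)), Fraction(1))


def sat_via_relu(x):
    """The same function written as a difference of two ReLUs."""
    return relu(x) - relu(x - 1)


# Compare the two forms on a grid covering all three pieces of the function.
grid = [Fraction(k, 4) for k in range(-8, 13)]
agree = all(sat(x) == sat_via_relu(x) for x in grid)
```

The identity holds piece by piece: both ReLUs vanish below 0, only the first is active on [0, 1], and above 1 their difference is the constant 1.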
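Self-attention. The self-attention mechanism takes as input (i) a query vector $q$, (ii) a sequence of key vectors $K = (k_1, \ldots, k_n)$, and (iii) a sequence of value vectors $V = (v_1, \ldots, v_n)$. The $q$-attention over $K$ and $V$, denoted $\mathrm{Att}(q, K, V)$, is a vector $a = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_n v_n$, where
(i) $(\alpha_1, \ldots, \alpha_n) = \rho(f^{\mathrm{att}}(q, k_1), \ldots, f^{\mathrm{att}}(q, k_n))$, with $\rho$ the normalization function and $f^{\mathrm{att}}$ the scoring function.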
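(ii) The normalization function $\rho$ is $\mathrm{hardmax}$: for $x = (x_1, \ldots, x_n)$, if the maximum value occurs $r$ times among $x_1, \ldots, x_n$, then $\mathrm{hardmax}(x)_i := 1/r$ if $x_i$ is a maximum value and $\mathrm{hardmax}(x)_i := 0$ otherwise. In practice, the $\mathrm{softmax}$ is often used but its output values are in general not rational.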
(iii) For vanilla transformers, the scoring function used is a combination of multiplicative attention [vaswani2017attention] and a non-linear function: $f^{\mathrm{att}}(q, k_i) = -|\langle q, k_i \rangle|$. This was also used by [perez2019turing].
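The hard-attention mechanism described in this list can be sketched in a few lines. This is an illustrative implementation, not the authors' code: `hardmax` follows the definition in item (ii), and the scoring function `score_vanilla` is assumed to be the negated absolute inner product, as in the hard-attention setting of [perez2019turing].

```python
from fractions import Fraction


def hardmax(scores):
    """Uniform weight 1/r on the r maximal entries, 0 elsewhere."""
    m = max(scores)
    r = scores.count(m)
    return [Fraction(1, r) if s == m else Fraction(0) for s in scores]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def score_vanilla(q, k):
    """Assumed vanilla scoring function: -|<q, k>|."""
    return -abs(dot(q, k))


def attention(q, keys, values, score=score_vanilla):
    """Hard attention: average of the values whose keys maximize the score."""
    weights = hardmax([score(q, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]


weights_demo = hardmax([1, 3, 3, 0])
out_demo = attention([1, 0], [[0, 1], [2, 0], [0, 5]], [[2], [10], [4]])
```

In the demo call, the first and third keys are orthogonal to the query, so they tie for the maximal score of 0 and the output is the average of their values. All weights stay rational, which is the reason hardmax is preferred to softmax in this setting.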
Transformer encoder. A single-layer encoder is a function , with input a sequence of vector in , and parameters . The output is another sequence of vectors in . The parameters specify functions , and , all of type . The functions and
are linear transformations andan FFN. The output is computed by ():
The addition operations and are the residual connections. The operation in (5) is called the encoder-encoder attention block.
The complete -layer transformer encoder has the same input as the single-layer encoder.
By contrast, its output and contains two sequences.
is obtained by composition of single-layer encoders: let , and for , let
Transformer decoder. The input to a single-layer decoder is (i) output by the encoder, and (ii) sequence of vectors for . The output is another sequence .
Similar to the single-layer encoder, a single-layer decoder is parameterized by functions and and is defined by
The operation in (6) will be referred to as the decoder-decoder attention block and the operation in (7) as the decoder-encoder attention block. In (6), positional masking is applied to prevent the network from attending over symbols which are ahead of them.
An -layer Transformer decoder is obtained by repeated application of single-layer decoders each with its own parameters and a transformation function applied to the last vector in the sequence of vectors output by the final decoder. Formally, for and we have Note that while the output of a single-layer decoder is a sequence of vectors, the output of an -layer Transformer decoder is a single vector.
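Positional masking in the decoder-decoder attention block can be illustrated with a small sketch. It assumes hard attention with an inner-product score, and restricts position i to keys at positions 0 through i. The function names and the toy keys and values are hypothetical.

```python
from fractions import Fraction


def hardmax(scores):
    """Uniform weight 1/r on the r maximal entries, 0 elsewhere."""
    m = max(scores)
    r = scores.count(m)
    return [Fraction(1, r) if s == m else Fraction(0) for s in scores]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def masked_attention(i, query, keys, values):
    """Hard attention for position i, restricted to positions 0..i."""
    visible = range(i + 1)
    weights = hardmax([dot(query, keys[j]) for j in visible])
    dim = len(values[0])
    return [sum(w * values[j][c] for w, j in zip(weights, visible))
            for c in range(dim)]


keys = [[1], [2], [5]]
values = [[10], [20], [50]]
outputs = [masked_attention(i, [1], keys, values) for i in range(3)]
```

Without the mask, every position would select the last value, since its key has the largest score. With the mask, position 0 sees only itself and position 1 cannot see position 2, so the outputs differ by position even though the query is the same.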
The complete Transformer. The output is computed by the recurrence , for . We get by adding positional encoding: .
Directional Transformer. For the directional case, standard multiplicative attention is used, that is, the scoring function is the inner product $\langle q, k \rangle$ of the query and key vectors. The general architecture is the same as for the vanilla case; the differences due to positional masking are the following.
4 Primary Results
4.1 Turing-Completeness Results
The class of Transformer networks with positional encodings is Turing-complete.
In light of Theorem B.1, it suffices to show that the Transformer can simulate RNNs. The input is provided to the transformer as the sequence of vectors , where , which has as sub-vector the given base embedding and the positional encoding , along with extra coordinates set to constant values and will be used later.
The basic observation behind our construction of the simulating Transformer is that the transformer decoder can naturally implement the recurrence operations of the type used by RNNs. To this end, the FFN of the decoder, which plays the same role as the FFN component of the RNN, needs sequential access to the input in the same way as RNN. But the Transformer receives the whole input at the same time. We utilize positional encoding along with the attention mechanism to isolate at time and feed it to , thereby simulating the RNN.
As stated earlier, we append the input of the RNN with terminal symbols until it halts. Since the Transformer takes its input all at once, appending terminal symbols in this way is not possible (in particular, we do not know how long the computation would take). Instead, we append the input with a single terminal symbol. After encountering the terminal symbol once, the Transformer will feed (the encoding of) it to in subsequent steps until termination. Here we confine our discussion to the case ; the case is slightly different but simpler.
The construction is simple: it has only one head, one encoder layer and one decoder layer; moreover, the attention mechanisms in the encoder and the decoder-decoder attention layer of the decoder are trivial, as described below.
The encoder attention layer does trivial computation in that it merely computes the identity function: , which can be easily achieved, e.g. by using the residual connection and setting the value vectors to by setting the function to identically . The final and functions bring into useful forms by appropriate linear transformations: and . Thus, the key vectors only encode the positional information and the value vectors only encode the input symbols.
The output sequence of the decoder is . Our construction will ensure, by induction on , that contains the hidden states of the RNN as a sub-vector along with positional information: . This is easy to arrange for , and assuming it for we prove it for . As for the encoder, the decoder-decoder attention layer acts as the identity: . Now, using the last but one coordinate in representing the time , the attention mechanism can pick out the embedding of the -th input symbol . This is possible because in the key vector mentioned above, almost all coordinates other than the one representing the position are set to , allowing the mechanism to only focus on the positional information and not be distracted by the other contents of : the scoring function has value . For a given , it is maximized at for and at for . This use of scoring function is similar to [perez2019turing].
At this point, has at its disposal the hidden state (coming from via and the residual connection) and the input symbol (coming via the attention mechanism and the residual connection). Hence can act just like the FFN (refer to Lemma D.3) underlying the RNN to compute and thus , proving the induction hypothesis. The complete construction can be found in Sec. D in the appendix. ∎
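The way the decoder-encoder attention isolates one input position can be sketched as follows. This is a simplified illustration under stated assumptions, not the construction from the appendix: the key for position j is taken to be (j, -1), the query at step t is taken to be (1, t), and the score is assumed to be the negated absolute inner product, so that it equals -|j - t|. The exact encodings and index offsets in the actual proof may differ.

```python
from fractions import Fraction


def hardmax(scores):
    """Uniform weight 1/r on the r maximal entries, 0 elsewhere."""
    m = max(scores)
    r = scores.count(m)
    return [Fraction(1, r) if s == m else Fraction(0) for s in scores]


def select_input(symbols, t):
    """Return the symbol that hard attention picks out at decoder step t."""
    keys = [(j, -1) for j in range(len(symbols))]   # assumed positional keys
    query = (1, t)                                  # assumed query carrying the time t
    scores = [-abs(query[0] * k[0] + query[1] * k[1]) for k in keys]
    weights = hardmax(scores)
    picked = [s for s, w in zip(symbols, weights) if w > 0]
    return picked[0] if len(picked) == 1 else picked


symbols = ["#", "a", "b", "$"]   # begin symbol, two input symbols, terminal symbol
trace = [select_input(symbols, t) for t in range(1, 8)]
```

While t is within the input, the score is maximized exactly at position t. Once t exceeds the last position, the closest key is that of the terminal symbol, so the terminal symbol is selected at every later step, matching the behaviour described in the proof sketch.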
The class of Transformer networks with positional masking and no explicit positional encodings is Turing complete.
As before, by Theorem B.1 it suffices to show that Transformers can simulate RNNs. The input is provided to the transformer as the sequence of vectors , where . The general idea for the directional case is similar to the vanilla case, namely we would like the FFN of the decoder to directly simulate the RNN. In the vanilla case, positional encoding and the attention mechanism helped us feed input at the -th iteration of the decoder to . We will implement this scheme but now work with positional masking. We no longer have explicit positional information in the input such as a coordinate with value . The key insight is that in fact we do not need the positional information as such to recover at step : in our construction, the attention mechanism will recover in an indirect manner even though it’s not able to “zero-in” on the -th position.
Let us first explain this without details of the construction. We maintain in vector , with a coordinate each for symbols in , the fraction of times the symbol has occurred up to step . Now, at a step , for the difference (which is part of the query vector), it can be shown easily that only the coordinate corresponding to is positive. Thus after applying the linearized sigmoid , we can isolate the coordinate corresponding to . Now using this query vector, the (hard) attention mechanism will be able to pick out the value vectors for all indices such that and output their average. Crucially, the value vector for an index is essentially which depends only on . Thus, all these vectors are equal to , and so is their average. This recovers , which can now be fed to , simulating the RNN.
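The idea above can be checked on a small example. The sketch below is a simplification under stated assumptions: positions are indexed from 0 with a begin symbol at position 0, the running vector holds the fraction of positions 0 through t at which each symbol occurs, and both the key and the value for index j are taken to be the one-hot encoding of the symbol at j. The special treatment of the last coordinate mentioned in the construction is ignored, and all names are hypothetical.

```python
from fractions import Fraction


def sat(x):
    """Saturated linear activation (the linearized sigmoid)."""
    return min(max(x, Fraction(0)), Fraction(1))


def hardmax(scores):
    """Uniform weight 1/r on the r maximal entries, 0 elsewhere."""
    m = max(scores)
    r = scores.count(m)
    return [Fraction(1, r) if s == m else Fraction(0) for s in scores]


def one_hot(s, alphabet):
    """One-hot encoding of symbol s over the given alphabet."""
    return [Fraction(int(s == a)) for a in alphabet]


def omega(symbols, t, alphabet):
    """Fraction of positions 0..t at which each symbol of the alphabet occurs."""
    prefix = symbols[: t + 1]
    return [Fraction(prefix.count(a), t + 1) for a in alphabet]


def recover_symbol(symbols, t, alphabet):
    """Recover the symbol at step t (t >= 1) without any positional index."""
    diff = [a - b for a, b in zip(omega(symbols, t, alphabet),
                                  omega(symbols, t - 1, alphabet))]
    query = [sat(d) for d in diff]                   # only one coordinate stays positive
    keys = [one_hot(s, alphabet) for s in symbols]   # assumed keys: one-hot of each symbol
    values = keys                                    # the value depends only on the symbol
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = hardmax(scores)
    avg = [sum(w * v[c] for w, v in zip(weights, values))
           for c in range(len(alphabet))]
    return alphabet[avg.index(1)]


alphabet = ["#", "a", "b", "$"]
symbols = ["#", "a", "b", "a", "b", "$"]
recovered = [recover_symbol(symbols, t, alphabet) for t in range(1, len(symbols))]
```

The reason the difference is positive only at the current symbol: if that symbol occurred c times among the first t positions, its fraction moves from c/t to (c+1)/(t+1), an increase of (t-c)/(t(t+1)), which is positive because the begin symbol guarantees c < t. Every other symbol's fraction moves from c/t to c/(t+1) and cannot increase.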
We now outline the construction and relate it to the above discussion. As before, for simplicity we restrict to the case . We use only one head, one single-layer encoder and two single-layer decoders. The encoder, as in the vanilla case, does very little other than pass information along. The vectors in are obtained by the trivial attention mechanism followed by simple linear transformations: and .
Our construction ensures that at step we have . As before, the proof is by induction on .
In the first single-layer decoder, the decoder-decoder attention block is trivial: . In the decoder-encoder attention block, we give equal attention to all the values, which along with , leads to , where essentially , except with a change for the last coordinate due to the special status of the last symbol in the processing of the RNN.
In the second layer, the decoder-decoder attention block is again trivial with . We remark that in this construction the scoring function is the standard multiplicative attention. Now , which is positive if and only if , as mentioned earlier. Thus attention weights in satisfy , where is a normalization constant and is the indicator. Refer to Lemma E.3 for more details.
At this point, has at its disposal the hidden state (coming from via and the residual connection) and the input symbol (coming via the attention mechanism and the residual connection). Hence can act just like the FFN underlying the RNN to compute and thus , proving the induction hypothesis. The complete construction can be found in Sec. E in the appendix. ∎
In practice, [yang2019assessing] found that for NMT, Transformers with only positional masking perform comparably to those with positional encodings. Similar evidence was found by [kerneltsai2019transformer]. Our proof for directional Transformers entails that there is no loss of order information if positional information is only provided in the form of masking. However, we do not recommend using masking as a replacement for explicit encodings. In practice, one should explore each of the methods, and possibly combinations of encoding and masking. The computational equivalence of encoding and masking given by our results implies that any differences in their performance must come from differences in learning dynamics.
4.2 Analysis of Components
The results for various components follow from our construction in Theorem 4.1. Note that in both the encoder and decoder attention blocks, we need to compute the identity function. We can nullify the role of the attention heads by setting the value vectors to zero and making use of only the residual connections to implement the identity function. Thus, even if we remove those attention heads, the model is still Turing-complete. On the other hand, we can remove the residual connections around the attention blocks and make use of the attention heads to implement the identity function by using positional encodings. Hence, either the attention head or the residual connection is sufficient to achieve Turing-completeness. A similar argument can be made for the FFN in the encoder layer: either one of the residual connection or the FFN is sufficient for Turing-completeness. The decoder-encoder attention head, being the only way for the decoder to obtain information about the input, is necessary for Turing-completeness. Likewise, since the FFN in the decoder is the only component that can perform computations based on the input and on the computations performed earlier via recurrence, the model is not Turing-complete without it. Figure 2 summarizes the role of different components with respect to the computational expressiveness of the network.
In practice, [michel2019sixteen] found that removing different kinds of attention heads has different degrees of impact on a trained Transformer model. Their results show that removing encoder-decoder attention heads has a more significant impact than removing the self-attention heads. They suggest that the former heads might be doing most of the heavy lifting. This is in line with our results, namely that the decoder-encoder attention block is indispensable, as opposed to the other attention blocks.
The class of Transformer networks without residual connection around the decoder-encoder block is not Turing-complete.
The result follows from the observation that without the residual connection, the decoder-encoder attention block gives , which leads to for some ’s such that . Since is produced from the encoder, the vector will have no information about its previous hidden state values. Since the previous hidden state information was computed and stored in , without the residual connection, the information in depends solely on the output of the encoder. One could argue that since the attention weights ’s depend on the query vector , one could still glean information from the vectors ’s. To see that it’s not always the case, consider any task with a single input and the number of outputs greater than one. Since there is a single input the vector will be a constant () at any step and hence the output of the network will always also be constant at all steps. Hence, a model cannot perform such a task. More details can be found in section C.2 in the appendix.
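The single-input argument can be made concrete with a short sketch. It assumes hard attention with an inner-product score. With only one key-value pair coming from the encoder, the normalized weight on it is always 1, so the attention output is the same for every query. Without a residual connection, that constant vector is all the subsequent FFN receives. The key, value and queries below are arbitrary illustrative numbers.

```python
from fractions import Fraction


def hardmax(scores):
    """Uniform weight 1/r on the r maximal entries, 0 elsewhere."""
    m = max(scores)
    r = scores.count(m)
    return [Fraction(1, r) if s == m else Fraction(0) for s in scores]


def attend(query, keys, values):
    """Hard attention with inner-product scores and no residual connection."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = hardmax(scores)
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]


# A single encoder output: one key and one value (arbitrary numbers).
key, value = [Fraction(3), Fraction(-1)], [Fraction(7), Fraction(2)]

# Queries that change from step to step, standing in for the decoder state.
outputs = [attend([Fraction(t), Fraction(1)], [key], [value]) for t in range(5)]
constant = all(o == value for o in outputs)
```

Whatever the query encodes about earlier steps is lost, because the attention weights over a single element carry no information. A task such as producing several different outputs from one input is therefore out of reach for this variant.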
Discussion. It’s perhaps surprising that the residual connection, originally proposed to assist in the learning of very deep networks, plays a vital role in the computational expressiveness of the network. Without it, the model is limited in its capability to make decisions based on predictions in the previous steps.
In this section, we explore practical implications of our results. Our experiments are geared towards answering the following questions:
Q1. Are there any practical implications of the limitation of Transformers without decoder-encoder residual connections? What kinds of tasks can they perform, and which can they not perform, compared to vanilla Transformers?
Q2. Is there any additional benefit of using positional masking as opposed to absolute positional encoding [vaswani2017attention]?
Although we showed that Transformers without the decoder-encoder residual connection are not Turing-complete, this does not imply that they are incapable of performing any task. Our results suggest that they are limited in their capability to make inferences based on their previous computations, which is required for tasks such as counting and language modeling. However, it can be shown that the model is capable of performing tasks which rely only on information provided at a given step, such as copying and mapping. For such tasks, given positional information at a particular step, the model can look up the corresponding input and map it via the FFN. We evaluate these hypotheses via our experiments.
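Table 1: Results on the synthetic tasks.
| Model | Copy Task | Counting Task |
|---|---|---|
| - Dec-Enc Residual | 99.7 | 0.0 |
| - Dec-Dec Residual | 99.7 | 99.8 |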
For our experiments on synthetic data, we consider two tasks, namely the copy task and the counting task. For the copy task, the goal of a model is to reproduce the input sequence. We sample sentences of lengths between 5 and 12 words from Penn Treebank and create the train-test splits with all sentences belonging to the same range of length. In the counting task, we create a very simple dataset where the model is given one number between 0 and 100 as input and its goal is to predict the next five numbers. Since only a single input is provided to the encoder, it is necessary for the decoder to be able to make inferences based on its previous predictions to perform this task. The benefit of conducting these experiments on synthetic data is that they isolate the phenomena we wish to evaluate. We then assess the influence of the limitation on machine translation, which requires a model both to perform mapping and to draw inferences from computations in previous timesteps. We evaluate the models on the IWSLT’14 German-English dataset and the IWSLT’15 English-Vietnamese dataset. For each of these tasks we compare the vanilla Transformer with the one without the decoder-encoder residual connection. As a baseline we also consider the model without the decoder-decoder residual connection, since according to our results, that connection does not influence the computational power of the model. Specifications of the models, experimental setup, datasets and sample outputs can be found in section F in the appendix.
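The two synthetic tasks are simple enough to sketch as data generators. This is a hypothetical illustration of the task format only: the token representation (numbers as strings), the inclusive range 0 to 100, and the function names are assumptions, and the sampling of Penn Treebank sentences is not reproduced here.

```python
def counting_example(start, horizon=5):
    """Source is a single number; target is the next `horizon` numbers."""
    source = [str(start)]
    target = [str(start + k) for k in range(1, horizon + 1)]
    return source, target


def copy_example(tokens):
    """Source and target are the same token sequence."""
    return list(tokens), list(tokens)


# One counting example per starting number (inclusive range assumed).
counting_data = [counting_example(n) for n in range(0, 101)]

# A copy example built from an arbitrary placeholder sentence.
copy_pair = copy_example("the model copies this short sentence".split())
```

In the counting format the encoder sees exactly one token, so each target token after the first can only be produced correctly if the decoder uses its own earlier predictions. In the copy format each target token can be read off the source at the matching position.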
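Table 2: Results on machine translation.
| Model | IWSLT’14 De-En | IWSLT’15 En-Vi |
|---|---|---|
| - Dec-Enc Residual | 24.1 | 21.8 |
| - Dec-Dec Residual | 30.6 | 27.2 |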
Results on the effect of residual connections on synthetic tasks can be found in Table 1. As per our hypothesis, all the variants are able to perfectly perform the copy task. For the counting task, the one without the decoder-encoder residual connection is theoretically incapable of performing it, since the final FFN at the end of the decoder receives a constant input at every step. However, the other two are able to accomplish it by learning to make decisions based on their prior predictions. For the machine translation task, the results can be found in Table 2. While the drop from removing the decoder-encoder residual connection is significant, the model is still able to perform reasonably well since the task can be largely fulfilled by mapping different words from one sentence to another.
For positional masking, our proof technique suggests that due to the lack of positional encodings, the model must come up with its own mechanism to make order-related decisions. Our hypothesis is that, if it is able to develop such a mechanism, it should be able to generalize to higher lengths and not overfit on the data it is provided. To evaluate this claim, we simply extend the copy task. We consider two models, one which is provided absolute positional encodings and one where only positional masking is applied. We train both models for the copy task on sentences of lengths 5-12 and evaluate them on various lengths going from 5 to 30. Figure 3 shows the performance of these models across various lengths. The model with positional masking clearly generalizes up to higher lengths, although its performance too degrades at extreme lengths. We found that the model with absolute positional encodings overfits during training on the fact that the 13th token is always the terminal symbol. Hence, when evaluated on higher lengths it never produces a sentence of length greater than 12. Other encoding schemes such as relative positional encodings [shaw-etal-2018-self, dai-etal-2019-transformer] can generalize better, since they are inherently designed to address this particular issue. However, our goal is not to propose masking as a replacement for positional encodings; rather, it is to determine whether the mechanism that the model develops during training is helpful in generalizing to higher lengths. Note that positional masking was not devised with generalization or any other benefit in mind. Our claim is only that the use of masking does not limit the model’s expressiveness and that it may be beneficial in other ways; in practice, one should explore each of the mechanisms and even a combination of both. [yang2019assessing] showed that a combination of both masking and encodings is better able to learn order information as compared to explicit encodings.
6 Discussion and Final Remarks
We showed that the classes of languages recognized by Transformers and RNNs are exactly the same. This implies that the difference in performance of the two networks across different tasks can be attributed only to their learning capabilities. In contrast to RNNs, Transformers are composed of multiple components which are not essential for their computational expressiveness. However, in practice they may play a crucial role. Recently, [voita-etal-2019-analyzing] showed that the decoder-decoder attention heads in the lower layers of the decoder do play a significant role in the NMT task and suggest that they may be helping in language modeling. This indicates that components which are not essential for the computational power may play a vital role in improving the learning and generalization ability.
Take-Home Messages. We showed that order information can be provided either in the form of explicit encodings or masking without any loss of information. The decoder-encoder attention block plays a necessary role in conditioning the computation on the input sequence while the residual connection around it is necessary to keep track of previous computations. The feedforward network in the decoder is the only component capable of performing computations based on the input and prior computations. Our experimental results show that removing components essential for computational power inhibits the model’s ability to perform certain tasks. At the same time, the components which do not play a role in the computational power may be vital to the learning ability of the network.
Although our proofs rely on arbitrary precision, which is common practice while studying the computational power of neural networks in theory [siegelmann1992computational, perez2019turing, hahn2019theoretical, anonymous2020are], implementations in practice work over fixed precision settings. However, our construction provides a starting point to analyze Transformers under finite precision. Since RNNs with ReLU activation can recognize all regular languages in finite precision [korsky2019computational], it follows from our construction that Transformers can also recognize a large class of regular languages in finite precision. At the same time, this does not imply that they can recognize all regular languages, given the limitation due to the precision required to encode positional information. We leave the study of Transformers in finite precision for future work.
Appendix A Roadmap
We begin with various definitions and results. We define simulation of Turing machines by RNNs and state the Turing-completeness result for RNNs. We define vanilla and directional Transformers and what it means for Transformers to simulate RNNs. Many of the definitions from the main paper are reproduced here, but in more detail. In Sec. C we discuss the effect on computational power of removing various components of Transformers. Sec. D contains the proof of Turing-completeness of vanilla Transformers and Sec. E the corresponding proof for directional Transformers. Finally, Sec. F has further details of experiments.
Appendix B Definitions
Denote the set by . Functions defined for scalars are extended to vectors in the natural way: for a function defined on a set , for a sequence of elements in , we set . Indicator is , if predicate is true and is otherwise. For a sequence for some , we set for . We will work with an alphabet , with and . The special symbols and correspond to the beginning and end of the input sequence, resp. For a vector , by we mean the all- vector of the same dimension as . Let
B.1 RNNs and Turing-completeness
Here we summarize, somewhat informally, the Turing-completeness result for RNNs due to [siegelmann1992computational]. We recall basic notions from computability theory. In the main paper, for simplicity we stated the results for total recursive functions , i.e. a function that is defined on every and whose values can be computed by a Turing machine. While total recursive functions form a satisfactory formalization of seq-to-seq tasks, here we state the more general result for partial recursive functions. Let be partial recursive. A partial recursive function is one that need not be defined for every , and there exists a Turing Machine with the following property. The input is initially written on the tape of the Turing Machine and the output is the content of the tape upon acceptance which is indicated by halting in a designated accept state. On for which is undefined, does not halt.
We now specify how Turing machine is simulated by RNN . In the RNNs in [siegelmann1992computational] the hidden state has the form
where denotes the state in one-hot form. Numbers , called stacks, store the contents of the tape at each step in a certain Cantor-set-like encoding (which is similar to, but slightly more involved than, binary representation). The simulating RNN gets as input encodings of in the first steps, and from then on receives the vector as input in each step. If is defined on , then halts and accepts, with the content of the tape as its output. In this case, enters a special accept state, and encodes and . If does not halt, then also does not enter the accept state.
[siegelmann1992computational] further show that from one can explicitly produce the as its output. In the present paper, we will not deal with explicit production of the output but rather work with the definition of simulation in the previous paragraph. This is for simplicity of exposition, and the main ideas are already contained in our results. If the Turing machine computes in time , the simulation takes time to encode the input sequence and to compute .
Theorem B.1 ([siegelmann1992computational]).
Given any partial recursive function computed by Turing machine , there exists a simulating RNN .
In view of the above theorem, for establishing Turing-completeness of Transformers, it suffices to show that RNNs can be simulated by Transformers. Thus, in the sequel we will only talk about simulating RNNs.
B.2 Vanilla Transformer Architecture
Here we describe the original transformer architecture due to [vaswani2017attention] as formalized by [perez2019turing]. While our notation and definitions largely follow [perez2019turing], they are not identical. The transformer here makes use of positional encoding; later we will discuss the transformer variant using directional attention but without using positional encoding.
The transformer, denoted , is a sequence-to-sequence architecture. Its input consists of (i) a sequence of vectors in , (ii) a seed vector . The output is a sequence of vectors in . The sequence is obtained from the sequence of symbols by using the embedding mentioned earlier: for . The transformer consists of the composition of a transformer encoder and a transformer decoder. The transformer encoder is obtained by composing one or more single-layer encoders, and similarly the transformer decoder is obtained by composing one or more single-layer decoders. For the feed-forward networks in the transformer layers we use the activation as in [siegelmann1992computational], namely the saturated linear activation function:
As mentioned in the main paper, we can easily work with the standard activation via . In the following, after defining these components, we will put them together to specify the full transformer architecture. We begin with the self-attention mechanism, which is the central feature of the transformer.
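As a quick sanity check on the two activations mentioned above, the following Python sketch assumes the usual definition of the saturated linear function (0 below 0, the identity on [0, 1], and 1 above 1) and assumes that the ReLU form intended here is σ(x) = ReLU(x) − ReLU(x − 1). The function names are ours, and the use of exact rationals is only illustrative.

```python
from fractions import Fraction


def relu(x):
    """Standard ReLU on a scalar."""
    return max(x, 0)


def saturated_linear(x):
    """Saturated linear activation (assumed definition): clamp x to [0, 1]."""
    return min(max(x, 0), 1)


def saturated_linear_via_relu(x):
    """The same function written as a difference of two ReLUs."""
    return relu(x) - relu(x - 1)


if __name__ == "__main__":
    # Compare the two forms on a few rational inputs.
    for x in [Fraction(-3, 2), Fraction(0), Fraction(1, 3), Fraction(1), Fraction(5, 2)]:
        print(x, saturated_linear(x), saturated_linear_via_relu(x))
```

Under this identity, a feed-forward layer with saturated linear units can be rewritten with ReLU units at the cost of doubling the number of units in that layer.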
The self-attention mechanism takes as input (i) a query vector , (ii) a sequence of key vectors , and (iii) a sequence of value vectors . All vectors are in .
The -attention over keys and values , denoted by , is a vector given by
The above definition uses two functions and which we now describe. For the normalization function we will use : for , if the maximum value occurs times among , then if is a maximum value and otherwise. In practice, the is often used but its output values are in general not rational. The names soft-attention and hard-attention are used for the attention mechanism depending on which normalization function is used.
For the Turing-completeness proof of vanilla transformers, the scoring function used is a combination of multiplicative attention [vaswani2017attention] and a non-linear function: . For directional transformers, the standard multiplicative attention is used, that is, .
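The hard-attention mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that hardmax assigns weight 1/r to each of the r maximal scores and 0 to the rest, and using plain dot-product scoring (the scoring the text associates with directional transformers); the extra non-linear scoring used for vanilla transformers is not modeled. All function names are hypothetical.

```python
from fractions import Fraction


def hardmax(scores):
    """Weight 1/r on each of the r maximal scores, 0 elsewhere."""
    m = max(scores)
    r = sum(1 for s in scores if s == m)
    return [Fraction(1, r) if s == m else Fraction(0) for s in scores]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hard_attention(query, keys, values):
    """Attention output: hardmax-weighted average of the value vectors."""
    weights = hardmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


if __name__ == "__main__":
    # The first two keys tie for the maximal score, so their values are averaged.
    print(hard_attention([1, 0], [[1, 0], [1, 0], [0, 1]], [[2, 0], [4, 2], [9, 9]]))
```

Because the weights are rationals of the form 1/r, the output stays rational whenever the inputs are, which is the property the text contrasts with softmax.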
A single-layer encoder is a function , where is the parameter vector and the input is a sequence of vectors in . The output is another sequence of vectors in . The parameters specify functions , and , all of type . The functions and are usually linear transformations and this will be the case in our constructions:
where . The function is a feed-forward network. The single-layer encoder is then defined by
The addition operations and are the residual connections. The operation in (5) is called the encoder-encoder attention block.
The complete -layer transformer encoder has the same input as the single-layer encoder. By contrast, its output consists of two sequences , each a sequence of vectors in . The encoder is obtained by repeated application of single-layer encoders, each with its own parameters; and at the end, two transformation functions and are applied to the sequence of output vectors at the last layer. Functions and are linear transformations in our constructions. Formally, for and , we have
The output of the -layer Transformer encoder is fed to the Transformer decoder which we describe next.
The input to a single-layer decoder is (i) , the sequences of key and value vectors output by the encoder, and (ii) a sequence of vectors in . The output is another sequence of vectors in .
Similar to the single-layer encoder, a single-layer decoder is parameterized by functions and and is defined by
The operation in (6) will be referred to as the decoder-decoder attention block and the operation in (7) as the decoder-encoder attention block. In the decoder-decoder attention block, positional masking is applied to prevent each position from attending to symbols that are ahead of it.
An -layer Transformer decoder is obtained by repeated application of single-layer decoders each with its own parameters and a transformation function applied to the last vector in the sequence of vectors output by the final decoder. Formally, for and we have
We use to denote an -layer Transformer decoder. Note that while the output of a single-layer decoder is a sequence of vectors, the output of an -layer Transformer decoder is a single vector.
The complete Transformer.
A Transformer network receives an input sequence , a seed vector , and . For its output is a sequence defined by
We get by adding positional encoding: . We denote the complete Transformer by . The Transformer “halts” when , where is a prespecified halting set.
Simulation of RNNs by Transformers.
We say that a Transformer simulates an RNN (as defined in Sec. B.1) if on input , at each step , the vector contains the hidden state as a subvector: , and halts at the same step as the RNN.
Appendix C Components of Transformers and their effect on computational power
Before moving on to the complete Transformer architecture, we first take a look at the encoder. The Transformer encoder without any positional information is order invariant: given two sequences that are permutations of each other, the output of the network remains the same. Since it is order invariant, it cannot recognize regular languages such as . [perez2019turing] showed that, even though it is permutation invariant, it can recognize non-regular languages such as the language . However, it is easy to see that even though it can compare the number of occurrences of two symbols, it cannot recognize well-balanced parenthesis languages such as Dyck-1. Let , then the language Dyck-1 denoted by is defined as all prefixes of contain no more ’s than ’s and the number of ’s in equals the number of ’s . Essentially, the number of open brackets must be at least the number of closed brackets at every point, and the total numbers of open and closed brackets must be equal at the end.
There exists a Transformer Encoder with positional masking that can recognize the language (Dyck-1)
Let denote a sequence . Let and . We use a single-layer Encoder network where we set the weight matrix for key vectors to be the null matrix, that is, . The weight matrices corresponding to the query and value vectors are set to the identity. Thus,
for . Thus if in the first inputs there are open brackets and closed brackets, then where . This implies that if is greater than zero, then the number of open brackets is higher, and if it is less than zero, then the number of closed brackets is higher. We then use a simple feedforward network of the form where . Let , then . The first coordinate of will be nonzero only when the number of open brackets is greater than the number of closed brackets, and zero otherwise. Similarly, the second coordinate will be nonzero only when the number of closed brackets is greater than the number of open brackets, and zero otherwise. Thus, for a word to be in language , the second coordinate must never be nonzero, and the first coordinate of should be zero to ensure that the numbers of open and closed brackets are the same.
For an input sequence , the encoder will produce based on the construction specified above. A word belongs to language if for all ( over ) and . Else, the word does not belong to the language . ∎
The task of recognizing the language Dyck-1 is similar to the task of comparing the occurrences of symbols and in where . The main difference is that the order also matters. For an Encoder without any positional information, the sequences and will lead to the same outputs, whereas an Encoder with positional masking will be able to distinguish the two and recognize the one that belongs to Dyck-1.
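The Dyck-1 construction above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the embeddings (1, −1) for an open bracket and (−1, 1) for a closed bracket are our guess at a suitable choice, the null key matrix is modeled directly as uniform averaging over the visible positions, residual connections are ignored, and the feed-forward network is taken to be a coordinate-wise ReLU. The empty word is accepted by convention. All names are hypothetical.

```python
from fractions import Fraction

# Assumed embeddings: open bracket -> (1, -1), closed bracket -> (-1, 1).
EMBED = {"(": (1, -1), ")": (-1, 1)}


def masked_uniform_attention(vectors):
    """With a null key matrix every score is equal, so under positional
    masking position i outputs the average of the first i value vectors."""
    outputs, running = [], [0, 0]
    for i, v in enumerate(vectors, start=1):
        running = [running[0] + v[0], running[1] + v[1]]
        outputs.append((Fraction(running[0], i), Fraction(running[1], i)))
    return outputs


def unmasked_uniform_attention(vectors):
    """Without masking every position sees the average of all value vectors."""
    n = len(vectors)
    return (Fraction(sum(v[0] for v in vectors), n),
            Fraction(sum(v[1] for v in vectors), n))


def relu_vec(v):
    """Coordinate-wise ReLU, standing in for the feed-forward network."""
    return tuple(max(c, 0) for c in v)


def is_dyck1(word):
    """Accept iff the second coordinate is zero at every position and the
    first coordinate is zero at the last position."""
    if not word:
        return True  # convention: the empty word is balanced
    z = [relu_vec(a) for a in masked_uniform_attention([EMBED[c] for c in word])]
    return all(zi[1] == 0 for zi in z) and z[-1][0] == 0


if __name__ == "__main__":
    for w in ["()", ")(", "(()())", "(()", "())("]:
        print(w, is_dyck1(w))
    # Without masking, "()" and ")(" are indistinguishable.
    print(unmasked_uniform_attention([EMBED[c] for c in "()"])
          == unmasked_uniform_attention([EMBED[c] for c in ")("]))
```

The final comparison shows the point made in the paragraph above: the unmasked average is identical for the two orderings, while the masked prefix averages differ at the first position.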
C.2 Residual Connections
The Transformer without residual connection around the Encoder-Decoder Attention block in the Decoder is not Turing Complete.
Recall that the vector is produced from the Encoder-Decoder Attention block in the following way:
The result follows from the observation that without the residual connections, , which leads to for some s such that . Since is produced from the encoder, the vector will have no information about its previous hidden state values. Since the previous hidden state information was computed and stored in , without the residual connection, the information in depends solely on the output of the encoder.
One could argue that since the attention weights s depend on the query vector , the network could still use it to gain the necessary information from the vectors s. To see that this is not true, consider the following simple problem: given a value , where , the network has to produce the values . That is, it has to produce values of the form for all such that . Note that, since there is only one input, the encoder will produce only one particular output vector and, regardless of what the value of the query vector is, the vector will be constant at every timestep. Since is fed to a feedforward network which maps it to , the output of the decoder will remain the same at every timestep and it cannot produce distinct values. Thus the model cannot perform the task defined above, which RNNs and vanilla Transformers can easily do with a simple counting mechanism via their recurrent connection.
In the case of hard attention, the network without the Encoder-Decoder residual connection with inputs can have at most distinct values and hence will be unable to perform a task that takes inputs and has to produce more than outputs.
This implies that the model without the Encoder-Decoder residual connection is limited in its capability to perform tasks that require it to make inferences based on previously generated outputs.
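The single-input argument above can be made concrete with a toy Python sketch. It is a deliberately stripped-down decoder, not the architecture defined in Sec. B.2: the decoder-decoder attention block is omitted, the feed-forward network is the identity, the encoder emits one key and one value, and the previous output is used directly as the next query. The names and the choice of counting in steps of 1/4 are hypothetical.

```python
from fractions import Fraction


def hardmax(scores):
    """Weight 1/r on each of the r maximal scores, 0 elsewhere."""
    m = max(scores)
    r = sum(1 for s in scores if s == m)
    return [Fraction(1, r) if s == m else Fraction(0) for s in scores]


def hard_attention(query, keys, values):
    """Hardmax-weighted average of the values, with dot-product scoring."""
    scores = [sum(a * b for a, b in zip(query, k)) for k in keys]
    weights = hardmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def decode(enc_keys, enc_values, seed, steps, residual):
    """Toy decoder loop; `residual` toggles the connection around the
    encoder-decoder attention block."""
    y, outputs = list(seed), []
    for _ in range(steps):
        a = hard_attention(y, enc_keys, enc_values)
        if residual:
            a = [ai + yi for ai, yi in zip(a, y)]
        y = a  # identity feed-forward network, for illustration only
        outputs.append(y[0])
    return outputs


if __name__ == "__main__":
    delta = Fraction(1, 4)
    keys, values, seed = [[1]], [[delta]], [Fraction(0)]
    print(decode(keys, values, seed, 4, residual=False))  # the same value each step
    print(decode(keys, values, seed, 4, residual=True))   # a running count
```

With a single encoder vector the attention weight is always 1, so without the residual connection the output cannot depend on the query, whereas the residual connection lets the previous state pass through and accumulate.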
Appendix D Completeness of Vanilla Transformers
D.1 Simulation of RNNs by Transformers with positional encoding
RNNs can be simulated by vanilla Transformers and hence the class of vanilla Transformers is Turing-complete.
The construction of the simulating transformer is simple: it uses a single head and both the encoder and decoder have one layer. Moreover, the encoder does very little and most of the action happens in the decoder. The main task for the simulation is to design the input embedding (building on the given base embedding ), the feedforward network and the matrices corresponding to functions .
The input embedding is obtained by summing the symbol and positional encodings which we next describe. These encodings have dimension , where