1 Introduction
There is an increasing interest in designing neural network architectures capable of learning algorithms from examples (Graves et al., 2014; Grefenstette et al., 2015; Joulin & Mikolov, 2015; Kaiser & Sutskever, 2016; Kurach et al., 2016; Dehghani et al., 2018). A key requirement for any such architecture is thus the capacity to implement arbitrary algorithms, that is, to be Turing complete. Turing completeness often follows for these networks as they can be seen as a control unit with access to an unbounded memory; as such, they are capable of simulating any Turing machine.
On the other hand, the work by Siegelmann & Sontag (1995) has established a different way of looking at the Turing completeness of neural networks. In particular, their work establishes that recurrent neural networks (RNNs) are Turing complete even if only a bounded number of resources (i.e., neurons and weights) is allowed. This is based on two conditions: (1) the ability of RNNs to compute internal dense representations of the data, and (2) the mechanisms they use for accessing such representations. Hence, the view proposed by Siegelmann & Sontag shows that it is possible to unlock the full computational power of RNNs without arbitrarily increasing their model complexity.
Most of the early neural architectures proposed for learning algorithms correspond to extensions of RNNs (e.g., Neural Turing Machines; Graves et al., 2014), and hence they are Turing complete in the sense of Siegelmann & Sontag. However, a recent trend has shown the benefits of designing networks that manipulate sequences but do not directly apply a recurrence to sequentially process their input symbols. Architectures based on attention or convolutions are two prominent examples of this approach. In this work we look at the problem of Turing completeness à la Siegelmann & Sontag for two of the most paradigmatic models exemplifying these features: the Transformer (Vaswani et al., 2017) and the Neural GPU (Kaiser & Sutskever, 2016).
The main contribution of our paper is to show that the Transformer and the Neural GPU are Turing complete based on their capacity to compute and access internal dense representations of the data. In particular, neither the Transformer nor the Neural GPU requires access to an additional external memory to become Turing complete. Thus the completeness holds for bounded architectures (bounded number of neurons and parameters). To prove this we assume that internal activations are represented as rational numbers with arbitrary precision. For the case of the Transformer we provide a direct simulation of a Turing machine, while for the case of the Neural GPU our result follows by simulating standard sequence-to-sequence RNNs. Our study also reveals some minimal sets of elements needed to obtain these completeness results. The computational power of Transformers and of Neural GPUs has been compared in the current literature (Dehghani et al., 2018), but only informally. Our paper provides a formal way of approaching this comparison.
For the sake of space, we only include sketches of some proofs in the body of the paper. The details of every proof can be found in the appendix.
Background work
The study of the computational power of neural networks can be traced back to McCulloch & Pitts (1943), who established an analogy between neurons with hard-threshold activations and first-order logic sentences, and Kleene (1956), who drew a connection between neural networks and finite automata. As mentioned earlier, the first work showing the Turing completeness of finite neural networks with linear connections was carried out by Siegelmann & Sontag (1992, 1995). Since being Turing complete does not ensure the ability to actually learn algorithms in practice, there has been an increasing interest in enhancing RNNs with mechanisms for supporting this task. One strategy has been the addition of inductive biases in the form of external memory, with the Neural Turing Machine (NTM) (Graves et al., 2014) being a paradigmatic example. To ensure that NTMs are differentiable, their memory is accessed via a soft attention mechanism (Bahdanau et al., 2014). Other examples of architectures that extend RNNs with memory are the Stack-RNN (Joulin & Mikolov, 2015) and the (De)Queue-RNNs (Grefenstette et al., 2015). By Siegelmann & Sontag's results, all these architectures are Turing complete.
The Transformer architecture (Vaswani et al., 2017) is almost exclusively based on the attention mechanism, and it has achieved state-of-the-art results on many language-processing tasks. While the Transformer was not initially designed to learn general algorithms, Dehghani et al. (2018) have advocated enriching its architecture with several new features as a way to learn general procedures in practice. This enrichment is motivated by the empirical observation that the original Transformer architecture struggles to generalize to inputs of lengths not seen during training. We, in contrast, show that the original Transformer architecture is Turing complete, based on different considerations. These results do not contradict each other, but show the differences that may arise between theory and practice. For instance, Dehghani et al. (2018) assume fixed precision, while we allow arbitrary internal precision during computation. We think that both approaches can be complementary, as our theoretical results can shed light on the intricacies of the original architecture, on which aspects of it are candidates for change or improvement, and on which others are strictly needed. For instance, our proof uses hard attention while the Transformer is often trained with soft attention (Vaswani et al., 2017). See Section 3.3 for a discussion of these differences.
The Neural GPU is an architecture that mixes convolutions and gated recurrences over tridimensional tensors. It has been shown that Neural GPUs are powerful enough to learn decimal multiplication from examples (Freivalds & Liepins, 2018), being the first neural architecture capable of solving this problem end-to-end. The similarity between Neural GPUs and cellular automata has been used as an argument to state the Turing completeness of the architecture (Kaiser & Sutskever, 2016; Price et al., 2016). Cellular automata are Turing complete (Smith III, 1971; Ollinger, 2012) and their completeness is established assuming an unbounded number of cells. In the Neural GPU architecture, in contrast, the number of cells that can be used during a computation is proportional to the size of the input sequence (Kaiser & Sutskever, 2016). One can cope with the need for more cells by padding the Neural GPU input with additional (dummy) symbols, as many as needed for a particular computation. Nevertheless, this is only a partial solution, as for a Turing-complete model of computation one cannot decide a priori how much memory is needed to solve a particular problem. Our results in this paper are somewhat orthogonal to the previous argument; we show that one can leverage the dense representations of the Neural GPU cells to obtain Turing completeness without adding cells beyond the ones used to store the input.
2 Preliminaries
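We assume all weights and activations to be rational numbers of arbitrary precision. Moreover, we only allow the use of rational functions with rational coefficients. Most of our positive results make use of the piecewise-linear sigmoidal activation function $\sigma : \mathbb{Q} \to \mathbb{Q}$, which is defined as
$$\sigma(x) = \begin{cases} 0 & \text{if } x < 0, \\ x & \text{if } 0 \le x \le 1, \\ 1 & \text{if } x > 1. \end{cases} \tag{1}$$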
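As a small illustration of the activation in Equation (1), the following sketch evaluates it over exact rationals. It assumes the usual reading of the piecewise-linear sigmoid as a clamp to the interval [0, 1]; the function name `sigma` and the use of Python's `fractions.Fraction` to model arbitrary-precision rationals are choices made for this example only.

```python
from fractions import Fraction


def sigma(x):
    """Piecewise-linear sigmoid (assumed form): clamp a rational to [0, 1]."""
    x = Fraction(x)  # exact rational arithmetic, no rounding
    if x < 0:
        return Fraction(0)
    if x > 1:
        return Fraction(1)
    return x


if __name__ == "__main__":
    for value in (Fraction(-3, 2), Fraction(1, 3), 5):
        print(value, "->", sigma(value))
```

Working with `Fraction` rather than floats mirrors the assumption that weights and activations are rationals of arbitrary precision.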
We are mostly interested in sequence-to-sequence (seq-to-seq) neural network architectures, which we formalize next. A seq-to-seq network $N$ receives as input a sequence $X = (x_1, \ldots, x_n)$ of vectors $x_i \in \mathbb{Q}^d$, for some $d > 0$, and produces as output a sequence $Y = (y_1, \ldots, y_m)$ of vectors $y_i \in \mathbb{Q}^d$. Most architectures of this type require a seed vector $s$ and some stopping criterion for determining the length of the output. The latter is usually based on the generation of a particular output vector called an end-of-sequence mark. In our formalization we instead allow a network to produce a fixed number $r \ge 0$ of output vectors. Thus, for convenience we see a general seq-to-seq network as a function $N$ such that the value $N(X, s, r)$ corresponds to an output sequence of the form $Y = (y_1, y_2, \ldots, y_r)$. With this definition, we can view every seq-to-seq network as a language recognizer of strings as follows.
Definition 2.1.
A seq-to-seq language recognizer is a tuple $A = (\Sigma, f, N, s, F)$, where $\Sigma$ is a finite alphabet, $f : \Sigma \to \mathbb{Q}^d$ is an embedding function, $N$ is a seq-to-seq network, $s \in \mathbb{Q}^d$ is a seed vector, and $F \subseteq \mathbb{Q}^d$ is a set of final vectors. We say that $A$ accepts the string $w \in \Sigma^*$ if there exists an integer $r \in \mathbb{N}$ such that $N(f(w), s, r) = (y_1, \ldots, y_r)$ and $y_r \in F$. The language accepted by $A$, denoted by $L(A)$, is the set of all strings accepted by $A$.
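The acceptance condition of Definition 2.1 can be sketched as a search over output lengths. The definition places no bound on the number of output vectors, so the cap `max_r` below is an assumption added only to make the sketch terminate. The names `accepts` and `countdown_network` are hypothetical, and the toy network works over one-dimensional "vectors" (plain integers) purely for illustration.

```python
def accepts(network, embed, seed, is_final, word, max_r):
    """Bounded search for an output length r whose last vector is final.

    Definition 2.1 asks for the existence of some r; the cap max_r is an
    assumption of this sketch, not part of the definition.
    """
    X = [embed(a) for a in word]
    for r in range(1, max_r + 1):
        Y = network(X, seed, r)  # Y = (y_1, ..., y_r)
        if is_final(Y[-1]):
            return True
    return False


def countdown_network(X, seed, r):
    """Toy stand-in for a seq-to-seq network: y_t = seed + sum(X) - t."""
    total = seed + sum(X)
    return [total - t for t in range(1, r + 1)]


if __name__ == "__main__":
    embed = lambda a: 1          # every symbol is embedded as 1
    is_final = lambda y: y == 0  # the set of final vectors is {0}
    print(accepts(countdown_network, embed, 0, is_final, "aaa", max_r=5))
    print(accepts(countdown_network, embed, 0, is_final, "aaa", max_r=2))
```

The second call shows why the bound matters: with too small a cap the search misses the accepting output length, which is why acceptance is defined through an unbounded existential.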
We impose two additional restrictions on recognizers. The embedding function $f : \Sigma \to \mathbb{Q}^d$ should be computable by a Turing machine in time linear w.r.t. the size of $\Sigma$. This covers the two most typical ways of computing input embeddings from symbols: the one-hot encoding, and embeddings computed by fixed feed-forward networks. Moreover, the set $F$ should also be recognizable in linear time; given a vector $y$, the membership $y \in F$ should be decided by a Turing machine working in linear time with respect to the size (in bits) of $y$. This covers the usual way of checking equality with a fixed end-of-sequence vector. We impose these restrictions to disallow the possibility of cheating by encoding arbitrary computations in the input embedding or the stop condition, while being permissive enough to construct meaningful embeddings and stopping criteria.
Finally, a class $\mathcal{N}$ of seq-to-seq neural network architectures defines the class $\mathcal{L}_{\mathcal{N}}$ composed of all the languages accepted by language recognizers that use networks in $\mathcal{N}$. From these notions, the formalization of Turing completeness of a class $\mathcal{N}$ naturally follows.
Definition 2.2.
A class $\mathcal{N}$ of seq-to-seq neural network architectures is Turing complete if $\mathcal{L}_{\mathcal{N}}$ is exactly the class of languages recognized by Turing machines.
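Given an input sequence $X = (x_1, \ldots, x_n)$, a seed vector $y_0$, and $r \in \mathbb{N}$, an encoder-decoder RNN is given by the following two recursions
$$h_0 = 0, \qquad h_i = f_1(x_i W + h_{i-1} V + b_1) \quad (1 \le i \le n) \tag{2}$$
$$g_0 = h_n, \qquad g_t = f_2(g_{t-1} U + y_{t-1} R + b_2), \qquad y_t = O(g_t) \quad (1 \le t \le r) \tag{3}$$
where $V, W, U, R$ are matrices, $b_1$ and $b_2$ are vectors, $O(\cdot)$ is an output function, and $f_1$ and $f_2$ are activation functions. Equation (2) is called the RNN-encoder and (3) the RNN-decoder.
The next theorem follows by inspection of the proof by Siegelmann & Sontag (1992, 1995) after adapting it to our formalization of encoder-decoder RNNs.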
Theorem 2.3 (Siegelmann & Sontag (1992, 1995)).
The class of encoder-decoder RNNs is Turing complete. Turing completeness holds even if we restrict to the class in which $R$ is the zero matrix, $b_1$ and $b_2$ are the zero vector, $O$ is the identity function, and $f_1$ and $f_2$ are the piecewise-linear sigmoidal activation $\sigma$.
3 The Transformer architecture
In this section we present a formalization of the Transformer architecture (Vaswani et al., 2017), abstracting away from specific choices of functions and parameters. Our formalization is not meant to produce an efficient implementation of the Transformer, but to provide a simple setting over which its mathematical properties can be established in a formal way.
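The Transformer is heavily based on the attention mechanism introduced next. Consider a scoring function $\mathrm{score} : \mathbb{Q}^d \times \mathbb{Q}^d \to \mathbb{Q}$ and a normalization function $\rho : \mathbb{Q}^n \to \mathbb{Q}^n$, for $d, n > 0$. Assume that $q \in \mathbb{Q}^d$, and that $K = (k_1, \ldots, k_n)$ and $V = (v_1, \ldots, v_n)$ are tuples of elements in $\mathbb{Q}^d$. A $q$-attention over $(K, V)$, denoted by $\mathrm{Att}(q, K, V)$, is a vector $a \in \mathbb{Q}^d$ defined as follows.
$$(s_1, \ldots, s_n) = \rho(\mathrm{score}(q, k_1), \mathrm{score}(q, k_2), \ldots, \mathrm{score}(q, k_n)) \tag{4}$$
$$a = s_1 v_1 + s_2 v_2 + \cdots + s_n v_n \tag{5}$$
Usually, $q$ is called the query, $K$ the keys, and $V$ the values. We do not pose any restriction on the scoring and normalization functions, as some of our results hold in general. We only require the normalization function to satisfy that there is a function $f_\rho$ from $\mathbb{Q}$ to $\mathbb{Q}^+$ such that for each $x = (x_1, \ldots, x_n) \in \mathbb{Q}^n$ the $i$-th component $\rho_i(x)$ of $\rho(x)$ is equal to $f_\rho(x_i) / \sum_{j=1}^{n} f_\rho(x_j)$. Thus, $a$ in Equation (5) is a convex combination of the vectors in $V$.
When proving possibility results, we will need to pick specific scoring and normalization functions. A usual choice for the scoring function is a feed-forward network with input $(q, k)$, sometimes called additive attention (Bahdanau et al., 2014). Another possibility is to use the dot product $\langle q, k \rangle$, called multiplicative attention (Vaswani et al., 2017). We use a combination of both: multiplicative attention plus a non-linear function. For the normalization function, $\mathrm{softmax}$ is a standard choice. Nevertheless, in our proofs we use the $\mathrm{hardmax}$ function, which is obtained by setting $f_{\mathrm{hardmax}}(x_i) = 1$ if $x_i$ is the maximum value, and $f_{\mathrm{hardmax}}(x_i) = 0$ otherwise. Thus, for a vector $x$ in which the maximum value occurs $r$ times, we have that $\mathrm{hardmax}_i(x) = 1/r$ if $x_i$ is the maximum value of $x$, and $\mathrm{hardmax}_i(x) = 0$ otherwise. We call it hard attention whenever $\mathrm{hardmax}$ is used as the normalization function. As customary, for a function $F : \mathbb{Q}^d \to \mathbb{Q}^d$ and a sequence $X = (x_1, \ldots, x_n)$, with $x_i \in \mathbb{Q}^d$, we write $F(X)$ to denote the sequence $(F(x_1), \ldots, F(x_n))$.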
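The attention of Equations (4) and (5) with hard attention can be sketched as follows. This is a minimal illustration that assumes the plain dot product as scoring function, whereas the proofs combine the dot product with a non-linear function. The names `dot`, `hardmax` and `attention` are chosen for this example.

```python
from fractions import Fraction


def dot(q, k):
    """Multiplicative (dot-product) score between a query and a key."""
    return sum(a * b for a, b in zip(q, k))


def hardmax(scores):
    """Weight 1/r on each of the r maximal scores, 0 elsewhere."""
    m = max(scores)
    r = scores.count(m)
    return [Fraction(1, r) if s == m else Fraction(0) for s in scores]


def attention(q, keys, values, score=dot, normalize=hardmax):
    """Convex combination of the values, weighted by normalized scores."""
    weights = normalize([score(q, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


if __name__ == "__main__":
    q = [1, 0]
    keys = [[1, 0], [0, 1], [1, 0]]
    values = [[2, 0], [0, 5], [4, 2]]
    # Keys 1 and 3 tie for the maximum score, so their values are averaged.
    print(attention(q, keys, values))
```

With a tie between two keys the result is the average of the two corresponding values, which is the behavior described for a maximum that occurs several times.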
Transformer Encoder and Decoder
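A single-layer encoder of the Transformer is a parametric function $\mathrm{Enc}(X; \theta)$ receiving a sequence $X = (x_1, \ldots, x_n)$ of vectors in $\mathbb{Q}^d$ and returning a sequence $Z = (z_1, \ldots, z_n)$ of the same length of vectors in $\mathbb{Q}^d$. In general, we consider the parameters in $\mathrm{Enc}(X; \theta)$ as functions $Q(\cdot)$, $K(\cdot)$, $V(\cdot)$, and $O(\cdot)$, all of them from $\mathbb{Q}^d$ to $\mathbb{Q}^d$. The single-layer encoder is then defined as follows
$$a_i = \mathrm{Att}(Q(x_i), K(X), V(X)) + x_i \tag{6}$$
$$z_i = O(a_i) + a_i \tag{7}$$
In practice $Q(\cdot)$, $K(\cdot)$, $V(\cdot)$ are typically matrix multiplications, and $O(\cdot)$ a feed-forward network. The $+ x_i$ and $+ a_i$ summands are usually called residual connections (He et al., 2016a, b). When the particular functions used as parameters are not important, we simply write $Z = \mathrm{Enc}(X)$.
The Transformer encoder is defined simply as the repeated application of single-layer encoders (with independent parameters), plus two final transformation functions $K(\cdot)$ and $V(\cdot)$ applied to every vector in the output sequence of the final layer. Thus the $L$-layer Transformer encoder is defined by the following recursion (with $1 \le \ell \le L - 1$ and $X^1 = X$).
$$X^{\ell+1} = \mathrm{Enc}(X^{\ell}; \theta_\ell), \qquad K = K(X^L), \qquad V = V(X^L) \tag{8}$$
We use $(K, V) = \mathrm{TEnc}_L(X)$ to denote an $L$-layer Transformer encoder over the sequence $X$.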
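A single-layer decoder is similar to a single-layer encoder but with additional attention to an external pair of key-value vectors $(K^e, V^e)$. The input for the single-layer decoder is a sequence $Y = (y_1, \ldots, y_k)$ plus the external pair $(K^e, V^e)$, and the output is a sequence $Z = (z_1, \ldots, z_k)$. When defining a decoder layer we denote by $Y_i$ the sequence $(y_1, \ldots, y_i)$, for $1 \le i \le k$. The layer is also parameterized by four functions $Q(\cdot)$, $K(\cdot)$, $V(\cdot)$ and $O(\cdot)$ and is defined as follows.
$$p_i = \mathrm{Att}(Q(y_i), K(Y_i), V(Y_i)) + y_i \tag{9}$$
$$a_i = \mathrm{Att}(p_i, K^e, V^e) + p_i \tag{10}$$
$$z_i = O(a_i) + a_i \tag{11}$$
Notice that the first (self) attention over $(K(Y_i), V(Y_i))$ considers the subsequence of $Y$ only until index $i$ and is used to generate a query $p_i$ to attend the external pair $(K^e, V^e)$. We denote the single-layer decoder by $\mathrm{Dec}((K^e, V^e), Y; \theta)$.
The Transformer decoder is a repeated application of single-layer decoders, plus a transformation function $F : \mathbb{Q}^d \to \mathbb{Q}^d$ applied to the final vector of the decoded sequence. Thus, the output of the decoder is a single vector $z \in \mathbb{Q}^d$. Formally, the $L$-layer Transformer decoder is defined as
$$Y^{\ell+1} = \mathrm{Dec}((K^e, V^e), Y^{\ell}; \theta_\ell), \qquad z = F(y^L_k) \qquad (1 \le \ell \le L - 1 \text{ and } Y^1 = Y) \tag{12}$$
We use $z = \mathrm{TDec}_L((K^e, V^e), Y)$ to denote an $L$-layer Transformer decoder.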
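The complete Transformer
A Transformer network receives an input sequence $X$, a seed vector $y_0$, and a value $r \in \mathbb{N}$. Its output is a sequence $Y = (y_1, \ldots, y_r)$ defined as
$$y_{t+1} = \mathrm{TDec}(\mathrm{TEnc}(X), (y_0, y_1, \ldots, y_t)), \quad \text{for } 0 \le t \le r - 1. \tag{13}$$
We denote the output sequence of the Transformer as $Y = (y_1, \ldots, y_r) = \mathrm{Trans}(X, y_0, r)$.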
3.1 Invariance under proportions
The Transformer, as defined above, is order-invariant: two input sequences that are permutations of each other produce exactly the same output. This is a consequence of the following property of the attention function: if $K = (k_1, \ldots, k_n)$, $V = (v_1, \ldots, v_n)$, and $\pi : \{1, \ldots, n\} \to \{1, \ldots, n\}$ is a permutation, then $\mathrm{Att}(q, K, V) = \mathrm{Att}(q, \pi(K), \pi(V))$ for every query $q$. This weakness has motivated the need for including information about the order of the input sequence by other means; in particular, this is often achieved by using the so-called positional encodings (Vaswani et al., 2017; Shaw et al., 2018), which we study below.
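The permutation property of attention stated above can be checked exhaustively on a small example. The sketch below uses exact rational arithmetic and, as an assumption made for illustration, the normalization induced by the positive rational function x ↦ x² + 1 rather than hardmax or softmax; any normalization of the required form behaves the same way. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def normalize(scores, f_rho=lambda x: x * x + 1):
    """rho_i(x) = f_rho(x_i) / sum_j f_rho(x_j), with f_rho > 0 (assumed choice)."""
    total = sum(f_rho(s) for s in scores)
    return [Fraction(f_rho(s)) / total for s in scores]


def attention(q, keys, values):
    """Dot-product scores, then a convex combination of the values."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = normalize(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def is_permutation_invariant(q, keys, values):
    """Compare attention on every joint reordering of keys and values."""
    base = attention(q, keys, values)
    indices = range(len(keys))
    return all(
        attention(q, [keys[i] for i in p], [values[i] for i in p]) == base
        for p in permutations(indices)
    )


if __name__ == "__main__":
    keys = [[1, 0], [0, 2], [3, 1]]
    values = [[1, 2], [3, 4], [5, 6]]
    print(is_permutation_invariant([1, 1], keys, values))
```

Because the arithmetic is exact, equality holds on the nose rather than up to floating-point error. Keys and values must be permuted jointly for the property to apply.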
But before going into positional encodings, a natural question is what languages the Transformer can recognize without them. As a standard yardstick we use the well-studied class of regular languages, i.e., languages recognized by finite automata. Order-invariance implies that not every regular language can be recognized by a Transformer network. As an example, there is no Transformer network that can recognize the regular language $(ab)^*$, as the latter is not order-invariant. A reasonable question then is whether the Transformer can express all regular languages which are order-invariant. It is possible to show that this is not the case by proving that the Transformer actually satisfies a stronger invariance property, which we call proportion invariance.
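For a string $w \in \Sigma^*$ and a symbol $a \in \Sigma$, we use $\mathrm{prop}(a, w)$ to denote the ratio between the number of times that $a$ appears in $w$ and the length of $w$. Consider now the set $\mathrm{PropInv}(w) = \{u \in \Sigma^* \mid \mathrm{prop}(a, w) = \mathrm{prop}(a, u) \text{ for every } a \in \Sigma\}$.
Proposition 3.1.
Let $\mathrm{Trans}$ be a Transformer, $s$ a seed, $r \in \mathbb{N}$, and $f : \Sigma \to \mathbb{Q}^d$ an embedding function. Then $\mathrm{Trans}(f(w), s, r) = \mathrm{Trans}(f(u), s, r)$, for each $u, w \in \Sigma^*$ with $u \in \mathrm{PropInv}(w)$.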
As an immediate corollary we obtain the following.
Corollary 3.2.
Consider the order-invariant regular language $L = \{w \in \{a, b\}^* \mid w \text{ has an even number of } a \text{ symbols}\}$. Then $L$ cannot be recognized by a Transformer network.
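The symbol proportions used in the proportion-invariance argument, and the reason the corollary follows from it, can be illustrated with a short sketch. The function names `prop` and `in_prop_inv` are chosen for this example, and strings are assumed to be non-empty so that the ratio is defined.

```python
from fractions import Fraction


def prop(a, w):
    """Ratio between the number of occurrences of symbol a in w and len(w)."""
    return Fraction(w.count(a), len(w))


def in_prop_inv(u, w, alphabet):
    """True if u and w have the same proportion of every symbol."""
    return all(prop(a, u) == prop(a, w) for a in alphabet)


def has_even_number_of_a(w):
    """Membership in the language of strings with an even number of a's."""
    return w.count("a") % 2 == 0


if __name__ == "__main__":
    w, u = "a", "aa"
    # Same proportions, so a Transformer without positional encodings
    # produces the same output on both strings ...
    print(in_prop_inv(u, w, "ab"))
    # ... yet only one of them has an even number of a's.
    print(has_even_number_of_a(w), has_even_number_of_a(u))
```

The pair "a" and "aa" is one witness among many: any two strings with equal symbol proportions but different parity of the number of a's would serve equally well.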
On the other hand, languages recognized by Transformer networks are not necessarily regular.
Proposition 3.3.
There is a Transformer network that recognizes the non-regular language $S = \{w \in \{a, b\}^* \mid w \text{ has strictly more symbols } a \text{ than symbols } b\}$.
That is, the computational power of Transformer networks without positional encoding is both rather weak (they cannot even recognize all order-invariant regular languages) and not so easy to capture (as they can express counting properties that go beyond regularity). As we show in the next section, the inclusion of positional encodings radically changes the picture.
3.2 Positional Encodings and Completeness of the Transformer
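Positional encodings remedy the order-invariance issue by providing information about the absolute positions of the symbols in the input. A positional encoding is just a function $\mathrm{pos} : \mathbb{N} \to \mathbb{Q}^d$. The function $\mathrm{pos}$ combined with an embedding function $f : \Sigma \to \mathbb{Q}^d$ gives rise to a new embedding function $f_{\mathrm{pos}} : \Sigma \times \mathbb{N} \to \mathbb{Q}^d$ such that $f_{\mathrm{pos}}(a, i) = f(a) + \mathrm{pos}(i)$. Thus, given an input string $w = a_1 a_2 \cdots a_n \in \Sigma^*$, the result of the embedding function $f_{\mathrm{pos}}(w)$ provides a “new” input
$$\big(f_{\mathrm{pos}}(a_1, 1), f_{\mathrm{pos}}(a_2, 2), \ldots, f_{\mathrm{pos}}(a_n, n)\big)$$
to the Transformer encoder. Similarly, instead of receiving the sequence $Y = (y_0, y_1, \ldots, y_t)$ as input, the Transformer decoder now receives the sequence
$$Y' = \big(y_0 + \mathrm{pos}(1), y_1 + \mathrm{pos}(2), \ldots, y_t + \mathrm{pos}(t + 1)\big).$$
As for the case of the embedding functions, we require the positional encoding $\mathrm{pos}(i)$ to be computable by a Turing machine working in linear time w.r.t. the size (in bits) of $i$.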
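How a positional encoding is added to a symbol embedding can be sketched as follows. The concrete choices here are assumptions for illustration only and are not the encoding used in the proof: a one-hot embedding over the alphabet {a, b} with one extra coordinate, and a hypothetical encoding that writes 1/i into that extra coordinate.

```python
from fractions import Fraction

ALPHABET = "ab"  # illustrative alphabet


def one_hot(a):
    """One-hot embedding f(a), padded with one coordinate for the position."""
    return [Fraction(int(a == s)) for s in ALPHABET] + [Fraction(0)]


def pos(i):
    """Hypothetical positional encoding: 1/i in the last coordinate (i >= 1)."""
    return [Fraction(0)] * len(ALPHABET) + [Fraction(1, i)]


def f_pos(a, i):
    """Combined embedding f_pos(a, i) = f(a) + pos(i)."""
    return [x + y for x, y in zip(one_hot(a), pos(i))]


def embed(w):
    """Embed a string symbol by symbol, with positions counted from 1."""
    return [f_pos(a, i) for i, a in enumerate(w, start=1)]


if __name__ == "__main__":
    print(embed("ab"))
    print(embed("ba"))
    # Even as unordered collections the two embeddings now differ.
    print(sorted(embed("ab")) != sorted(embed("ba")))
```

Without the positional term the strings "ab" and "ba" yield the same multiset of vectors; with it they do not, which is what allows order-sensitive behavior.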
The main result of this section is the completeness of Transformers with positional encodings.
Theorem 3.4.
The class of Transformer networks with positional encodings is Turing complete.
Proof Sketch.
We show that for every Turing machine $M = (Q, \Sigma, \delta, q_{\mathrm{init}}, F)$ there exists a transformer that simulates the complete execution of $M$. We represent a string $w = s_1 s_2 \cdots s_n \in \Sigma^*$ as a sequence $X$ of one-hot vectors with their corresponding positional encodings. Denote by $q^{(t)}$ the state of $M$ at time $t$ when processing $w$, and by $s^{(t)}$ the symbol under $M$'s head at time $t$. Similarly, $v^{(t)}$ is the symbol written by $M$ and $m^{(t)}$ the head direction. We next describe how to construct a transformer $\mathrm{Trans}_M$ that with input $X$ produces a sequence $y_0, y_1, y_2, \ldots$ such that $y_i$ contains information about $q^{(i)}$ and $s^{(i)}$ (encoded as one-hot vectors).
The construction and proof go by induction. Assume the decoder receives $y_0, \ldots, y_t$ such that $y_i$ contains $q^{(i)}$ and $s^{(i)}$. To construct $y_{t+1}$, in the first layer we just implement $M$'s transition function $\delta$; note that $\delta(q^{(i)}, s^{(i)}) = (q^{(i+1)}, v^{(i)}, m^{(i)})$, and thus we use $(q^{(i)}, s^{(i)})$ to compute $(q^{(i+1)}, v^{(i)}, m^{(i)})$ for every $i$ and store them in the sequence $z^1_0, \ldots, z^1_t$. This computation can be done with a two-layer feed-forward network. For the next layer, let us denote by $c^{(i)}$ the index of the cell that $M$ is pointing to at time $i$. It can be proved that given $m^{(0)}, \ldots, m^{(i)}$ one can compute (a representation of) $c^{(i)}$ and $c^{(i+1)}$ for every $i$ with a self-attention layer, and store them in $z^2_0, \ldots, z^2_t$. In particular, $z^2_t$ contains $c^{(t+1)}$, which is the index that $M$ is going to point to in the next time step. By using the residual connections we also store $q^{(i+1)}$ and $v^{(i)}$ in $z^2_i$. The final piece of our construction is to compute the symbol that the tape holds at index $c^{(t+1)}$, that is, the symbol under $M$'s head at time $t + 1$. For this we use the following observation: the symbol at index $c^{(t+1)}$ at time $t + 1$ coincides with the last symbol written by $M$ at index $c^{(t+1)}$. Thus, we need to find the maximum value $\ell \le t$ such that $c^{(\ell)} = c^{(t+1)}$ and then copy $v^{(\ell)}$, which is the symbol that was written by $M$ at time step $\ell$. This last computation can also be done with a self-attention layer. Thus, we attend directly to position $\ell$ (hard attention plus positional encodings) and copy $v^{(\ell)}$, which is exactly $s^{(t+1)}$. We finally copy $q^{(t+1)}$ and $s^{(t+1)}$ into the output to construct $y_{t+1}$. Figure 1 shows a high-level diagram of the decoder computation.
There are several other details in the construction. In particular, at the beginning of the computation (first $n$ steps), the decoder needs to attend to the encoder and copy the input symbols so they can later be processed as described above. Another detail is that when $M$ reaches a cell that has not been visited before, the symbol under the head has to be set to $\#$ (the blank symbol). We show that all these decisions can be implemented with feed-forward networks plus attention. The complete construction uses one encoder layer, three decoder layers and vectors of dimension $d = 2|Q| + 4|\Sigma| + 11$ to store one-hot representations of states, symbols and some additional working space. All details can be found in the appendix. ∎
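The key observation of the proof sketch, that the symbol under the head can be recovered from the history of head moves and written symbols without storing a tape, can be checked with a small sketch. It is an illustration and not the Transformer construction itself: cells are indexed from 0, the head starts on the first input symbol, `#` is assumed to be the blank, and all function names are hypothetical.

```python
BLANK = "#"  # assumed blank symbol


def head_positions(moves):
    """Cell indices c(0), ..., c(t+1), with c(0) = 0 and c(i+1) = c(i) + m(i)."""
    cells = [0]
    for m in moves:
        cells.append(cells[-1] + m)
    return cells


def symbol_via_last_write(word, moves, written):
    """Symbol under the head after len(moves) steps, from the write history only."""
    cells = head_positions(moves)
    target = cells[-1]
    # Look for the latest earlier step at which the head was on the target cell.
    for step in range(len(moves) - 1, -1, -1):
        if cells[step] == target:
            return written[step]
    # Never visited: the cell still holds its input symbol, or a blank.
    return word[target] if 0 <= target < len(word) else BLANK


def symbol_via_tape(word, moves, written):
    """Reference answer obtained by simulating an explicit tape."""
    tape = dict(enumerate(word))
    head = 0
    for m, v in zip(moves, written):
        tape[head] = v
        head += m
    return tape.get(head, BLANK)


if __name__ == "__main__":
    word, moves, written = "ab", [1, 1, -1, -1], ["x", "y", "z", "w"]
    for t in range(len(moves) + 1):
        print(t, symbol_via_last_write(word, moves[:t], written[:t]),
              symbol_via_tape(word, moves[:t], written[:t]))
```

The backward scan plays the role that hard attention plays in the proof: it selects the single latest position whose cell index matches the target.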
3.3 Differences with Vaswani et al. (2017)’s framework
Although the general architecture that we presented closely follows that of Vaswani et al. (2017), some choices for functions and parameters in our positive results are different from the usual choices in practice. For instance, we use hard attention, which allows us to attend directly to specific positions. In contrast, Vaswani et al. (2017) use $\mathrm{softmax}$ to attend, plus $\sin$-$\cos$ functions as positional encodings. The $\mathrm{softmax}$, $\sin$ and $\cos$ are not rational functions, and thus are forbidden in our formalization. An interesting line for future work is to consider arbitrary functions but with additional restrictions, such as finite precision as done by Weiss et al. (2018). Another difference is that for the function $O(\cdot)$ in Equation (11) our proof uses a feed-forward network with various layers, while in Vaswani et al. (2017) only two layers are used.
The need for arbitrary precision
Our Turing-completeness proof relies on having arbitrary precision for internal representations, in particular for storing and manipulating positional encodings. Although having arbitrary precision is a standard assumption when studying the expressive power of neural networks (Cybenko, 1989; Siegelmann & Sontag, 1995), practical implementations rely on fixed-precision hardware. If fixed precision is used, then positional encodings can be seen as functions of the form $\mathrm{pos} : \mathbb{N} \to A$ where $A$ is a finite subset of $\mathbb{Q}^d$. Thus, the embedding function $f_{\mathrm{pos}}$ can be seen as a regular embedding function $f' : \Sigma' \to \mathbb{Q}^d$ where $\Sigma' = \Sigma \times A$. Hence, whenever fixed precision is used, the net effect of having positional encodings is just to increase the size of the input alphabet. Then from Proposition 3.1 we obtain that the Transformer with positional encodings and fixed precision is not Turing complete. Although such Transformers are no longer Turing complete, one can still study the computational power of fixed-precision Transformers. We leave this as future work.
4 Neural GPUs
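The Neural GPU (Kaiser & Sutskever, 2016) is an architecture that mixes convolutions and gated recurrences over tridimensional tensors. It is parameterized by three functions $U(\cdot)$ (update function), $R(\cdot)$ (reset function), and $F(\cdot)$. Given a tensor $S \in \mathbb{Q}^{h \times w \times d}$ and a value $r \in \mathbb{N}$, it produces a sequence $S^1, S^2, \ldots, S^r$ given by the following recursive definition (with $S^0 = S$).
$$U^t = U(S^{t-1}), \qquad R^t = R(S^{t-1}), \qquad S^t = U^t \odot S^{t-1} + (\mathbf{1} - U^t) \odot F(R^t \odot S^{t-1})$$
where $\odot$ denotes the element-wise product, and $\mathbf{1}$ is a tensor with only $1$'s. Neural GPUs force functions $U(\cdot)$ and $R(\cdot)$ to produce a tensor of the same shape as its input with all values in $[0, 1]$. Thus, a Neural GPU resembles a gated recurrent unit (Cho et al., 2014), with $U$ working as the update gate and $R$ as the reset gate. Functions $U(\cdot)$, $R(\cdot)$, and $F(\cdot)$ are defined as a convolution of their input with a 4-dimensional kernel bank with shape $(k_H, k_W, d, d)$ plus a bias tensor, followed by a point-wise transformation
$$f(K * S + B) \tag{14}$$
with different kernels and biases for $U(\cdot)$, $R(\cdot)$, and $F(\cdot)$.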
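The gated recurrence of the Neural GPU can be sketched independently of how the three functions are computed. The sketch assumes the GRU-like update described above: the update gate interpolates element-wise between the previous state and a candidate computed from the reset-gated state. The update, reset and candidate functions are passed in as arbitrary callables standing in for "convolution plus point-wise transformation". Since only element-wise operations are involved, the tensor is flattened to a plain list. The names `ngpu_step` and `ngpu_run` are hypothetical.

```python
from fractions import Fraction


def ngpu_step(S, U, R, F):
    """One step: S' = u * S + (1 - u) * F(r * S), all products element-wise."""
    u = U(S)  # update gate, entries assumed to lie in [0, 1]
    r = R(S)  # reset gate, entries assumed to lie in [0, 1]
    candidate = F([ri * si for ri, si in zip(r, S)])
    return [ui * si + (1 - ui) * ci for ui, si, ci in zip(u, S, candidate)]


def ngpu_run(S, U, R, F, steps):
    """Return the sequence of states produced after each of `steps` steps."""
    states = []
    for _ in range(steps):
        S = ngpu_step(S, U, R, F)
        states.append(S)
    return states


if __name__ == "__main__":
    S0 = [Fraction(1, 2), Fraction(1, 4)]
    ones = lambda T: [Fraction(1)] * len(T)
    zeros = lambda T: [Fraction(0)] * len(T)
    double = lambda T: [2 * x for x in T]
    # Update gate fully open to the old state: nothing changes.
    print(ngpu_run(S0, ones, ones, double, 3)[-1])
    # Update gate closed: the state is replaced by the candidate each step.
    print(ngpu_run(S0, zeros, ones, double, 2)[-1])
```

The two extreme gate settings show the roles of the gates: an update gate of all ones freezes a cell, while an update gate of all zeros overwrites it with the candidate value.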
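To gain intuition on how the convolution $K * S$ works, it is illustrative to think of $S$ as an $(h \times w)$-grid of (row) vectors and of $K$ as a $(k_H \times k_W)$-grid of $(d \times d)$ matrices. More specifically, let $s_{ij} = S_{i,j,:}$ and $K_{ij} = K_{i,j,:,:}$; then $K * S$ is a regular two-dimensional convolution in which scalar multiplication has been replaced by vector-matrix multiplication, as in the following expression
$$(K * S)_{i,j,:} = \sum_{u=1}^{k_H} \sum_{v=1}^{k_W} s_{i + \Delta_1(u),\, j + \Delta_2(v)} \, K_{uv} \tag{15}$$
where $\Delta_1(u) = u - \lfloor k_H / 2 \rfloor - 1$ and $\Delta_2(v) = v - \lfloor k_W / 2 \rfloor - 1$. This intuition makes evident the similarity between Neural GPUs and cellular automata: $S$ is a grid of cells, and in every iteration each cell is updated considering the values of its neighbors according to a fixed rule given by $K$ (Kaiser & Sutskever, 2016). As customary, we assume zero-padding when convolving outside $S$.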
4.1 The computational power of Neural GPUs
To study the computational power of Neural GPUs, we cast them as a standard seq-to-seq architecture. Given an input sequence, we put every vector in the first column of the tensor $S$. We also need to pick a special cell of $S$ as the output cell from which we read the output vector in every iteration. We pick the last cell of the first column of $S$. Formally, given a sequence $X = (x_1, \ldots, x_n)$ with $x_i \in \mathbb{Q}^d$, and a fixed value $w \in \mathbb{N}$, we construct the tensor $S \in \mathbb{Q}^{n \times w \times d}$ by letting $S_{i,1,:} = x_i$ and $S_{i,j,:} = 0$ for $j > 1$. The output of the Neural GPU, denoted by $\mathrm{NGPU}(X, r)$, is the sequence of vectors $Y = (y_1, \ldots, y_r)$ such that $y_t = S^t_{n,1,:}$. Given this definition, we can naturally view Neural GPUs as language recognizers (as formalized in Section 2).
Since the bias tensor $B$ in Equation (14) is of the same size as $S$, the number of parameters in a Neural GPU grows with the size of the input. Thus, a Neural GPU cannot be considered a fixed architecture. To tackle this issue we introduce the notion of a uniform Neural GPU, as one in which for every bias $B$ there exists a matrix $\tilde{B} \in \mathbb{Q}^{w \times d}$ such that $B_{i,:,:} = \tilde{B}$ for each $i$. Thus, uniform Neural GPUs can be finitely specified (as they have a constant number of parameters, not depending on the length of the input). We now establish the Turing completeness of this model.
Theorem 4.1.
The class of uniform Neural GPUs is Turing complete.
Proof sketch.
The proof is based on simulating a seq-to-seq RNN; thus, completeness follows from Theorem 2.3. Consider an RNN encoder-decoder language recognizer, such that $N$ is of dimension $d$ and its encoder and decoder are defined by the equations $h_i = \sigma(x_i W + h_{i-1} V)$ and $g_t = \sigma(g_{t-1} U)$, respectively, where $g_0 = h_n$ and $n$ is the length of the input. We use a Neural GPU with input tensor $S \in \mathbb{Q}^{n \times 1 \times (3d+3)}$. Let $E_i = S_{i,1,1:d}$ and $D_i = S_{i,1,d+1:2d}$. The idea is to use $E$ for the encoder and $D$ for the decoder. We use kernel banks of shape $(2, 1, 3d+3, 3d+3)$ with uniform bias tensors to simulate the following computation. In every step $t$, we first compute the value of $\sigma(E_t W + E_{t-1} V)$ and store it in $E_t$, and then reset $E_{t-1}$ to zero. Similarly, in step $t$ we update the vector in position $D_{t-1}$, storing in it the value $\sigma(D_{t-1} U + E_{t-1} V)$ (for the value of $E_{t-1}$ before the reset). We use the gating mechanism to ensure a sequential update of the cells such that at time $t$ we update only positions $E_i$ and $D_j$ for $i \le t$ and $j \le t - 1$. Thus the updates on $D$ are always one iteration behind the updates of $E$. Since the vectors in $D$ are never reset to zero, they keep being updated, which allows us to simulate an arbitrarily long computation. In particular we prove that at iteration $t$ it holds that $E_t = h_t$, and at iteration $n + t + 1$ it holds that $D_n = g_t$. We require $3d + 3$ components, as we need to implement several gadgets for properly using the update and reset gates. In particular, we need to store the value of $E_{t-1}$ before we reset it. The detailed construction and the correctness proof can be found in the appendix. ∎
The proof above makes use of kernels of shape $(2, 1, d, d)$ to obtain Turing completeness. This is, in a sense, optimal, as one can easily prove that Neural GPUs with kernels of shape $(1, 1, d, d)$ are not Turing complete, regardless of the size of $d$. In fact, for kernels of this shape the value of a cell of $S$ at time $t$ depends only on the value of the same cell at time $t - 1$.
Zero padding vs circular convolution
The proof of Theorem 4.1 requires the application of zero padding in convolution. This allows us to clearly differentiate internal cells from cells corresponding to the endpoints of the input sequence. Interestingly, Turing completeness is lost if we replace zero padding with circular convolution. Formally, given $S \in \mathbb{Q}^{h \times w \times d}$, a circular convolution is obtained by defining $S_{h+n,:,:} = S_{n,:,:}$ for $n \in \mathbb{Z}$. One can prove that uniform Neural GPUs with circular convolutions cannot differentiate among periodic sequences of different length; in particular, they cannot check if a periodic input sequence is of even or odd length. This yields the following:
Proposition 4.2.
Uniform Neural GPUs with circular convolutions are not Turing complete.
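The intuition behind this result, that circular convolution makes periodic inputs of different lengths indistinguishable at the output cell, can be illustrated with a simplified sketch. The model below is an assumption-laden stand-in rather than a full Neural GPU: a single column of scalar cells, a width-3 kernel with the same bias in every cell, the piecewise-linear sigmoid, and no gates. All names and the concrete kernel values are hypothetical.

```python
from fractions import Fraction


def sigma(x):
    """Piecewise-linear sigmoid: clamp to [0, 1]."""
    return min(max(x, Fraction(0)), Fraction(1))


def circular_step(cells, kernel, bias):
    """Update every cell from its circular neighborhood (indices wrap around)."""
    n = len(cells)
    left, mid, right = kernel
    return [
        sigma(left * cells[(i - 1) % n] + mid * cells[i]
              + right * cells[(i + 1) % n] + bias)
        for i in range(n)
    ]


def last_cell_outputs(cells, kernel, bias, steps):
    """Values read from the last cell after each of `steps` iterations."""
    outputs = []
    for _ in range(steps):
        cells = circular_step(cells, kernel, bias)
        outputs.append(cells[-1])
    return outputs


if __name__ == "__main__":
    pattern = [Fraction(1), Fraction(0)]
    kernel = (Fraction(1, 2), Fraction(1, 4), Fraction(1, 3))
    bias = Fraction(1, 10)
    short = last_cell_outputs(pattern * 2, kernel, bias, 5)  # length 4
    longer = last_cell_outputs(pattern * 3, kernel, bias, 5)  # length 6
    print(short == longer)
```

Because the indices wrap around and the same rule is applied everywhere, each cell only ever sees the periodic pattern, never an endpoint. The output sequence therefore carries no information about how many times the pattern was repeated.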
Related to this last result is the empirical observation by Price et al. (2016) that Neural GPUs that learn to solve hard problems, e.g., binary multiplication, and which generalize to most of the inputs, struggle with highly symmetric (and nearly periodic) inputs. Actually, Price et al. (2016) exhibit examples of the form $11111111 \times 11111111$ failing for all inputs with eight or more $1$s. We leave as future work the exploration of the implications of our theoretical results for this practical observation.
Bidimensional tensors and piecewise linear activations
Freivalds & Liepins (2018) simplified Neural GPUs and proved that, by considering piecewise linear activations and bidimensional input tensors instead of the original smooth activations and tridimensional tensors used by Kaiser & Sutskever (2016), it is possible to achieve substantially better results in terms of training time and generalization. Our Turing completeness proof also relies on a bidimensional tensor and uses piecewise linear activations, thus providing theoretical evidence that these simplifications actually retain the full expressiveness of Neural GPUs while simplifying its practical applicability.
5 Final Remarks and Future Work
We have presented an analysis of the Turing completeness of two popular neural architectures for sequence-processing tasks; namely, the Transformer, based on attention, and the Neural GPU, based on recurrent convolutions. We plan to further refine this analysis in the future. For example, our proof of Turing completeness for the Transformer requires the presence of residual connections, i.e., the $+ x_i$, $+ a_i$, $+ y_i$, and $+ p_i$ summands in Equations (6-11), while our proof for Neural GPUs heavily relies on the gating mechanism. We will study whether these features are actually essential to obtain completeness.
We presented general abstract versions of both architectures in order to prove our theoretical results. Although we closely follow their original definitions, some choices for functions and parameters in our positive results are different to the usual choices in practice, most notably, the use of hard attention for the case of the Transformer, and the piecewise linear activation functions for both architectures. As we have mentioned, Freivalds & Liepins (2018) showed that for Neural GPUs piecewise linear activations actually help in practice, but for the case of the Transformer architecture more experimentation is needed to have a conclusive response. This is part of our future work.
Although our results are mostly of theoretical interest, they might lead to observations of practical interest. For example, Chen et al. (2018) have established the undecidability of several practical problems related to probabilistic language modeling with RNNs. This means that such problems can only be approached in practice via heuristic solutions. Many of the results in Chen et al. (2018) are, in fact, a consequence of the Turing completeness of RNNs as established by Siegelmann & Sontag (1995). We plan to study to what extent our analogous undecidability results for Transformers and Neural GPUs imply undecidability for language modeling problems based on these architectures.
Finally, our results rely on being able to compute internal representations of arbitrary precision. It would be interesting to perform a theoretical study of the main properties of both architectures in a setting in which only finite precision is allowed, as has been recently carried out for RNNs (Weiss et al., 2018). We also plan to tackle this problem in our future work.
Acknowledgements
This work was supported by the Millennium Institute for Foundational Research on Data (IMFD).
References
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL http://arxiv.org/abs/1409.0473.
 Chen et al. (2018) Yining Chen, Sorcha Gilroy, Andreas Maletti, Jonathan May, and Kevin Knight. Recurrent neural networks as weighted language recognizers. In NAACL-HLT 2018, pp. 2261–2271, 2018.

 Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 1724–1734, 2014.
 Cybenko (1989) George Cybenko. Approximation by superpositions of a sigmoidal function. MCSS, 2(4):303–314, 1989. doi: 10.1007/BF02551274. URL https://doi.org/10.1007/BF02551274.
 Dehghani et al. (2018) Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. CoRR, abs/1807.03819, 2018. URL https://arxiv.org/abs/1807.03819.
 Freivalds & Liepins (2018) Karlis Freivalds and Renars Liepins. Improving the neural GPU architecture for algorithm learning. In Neural Abstract Machines & Program Induction (NAMPI), 2018.
 Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing Machines. arXiv preprint arXiv:1410.5401, 2014.
 Grefenstette et al. (2015) Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pp. 1828–1836, 2015.
 He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, pp. 770–778, 2016a.
 He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision – ECCV 2016 – 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV, pp. 630–645, 2016b.
 Joulin & Mikolov (2015) Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems, pp. 190–198, 2015.
 Kaiser & Sutskever (2016) Lukasz Kaiser and Ilya Sutskever. Neural GPUs learn algorithms. In ICLR, 2016.
 Kleene (1956) S. C. Kleene. Representation of events in nerve nets and finite automata. In Claude Shannon and John McCarthy (eds.), Automata Studies, pp. 3–41. Princeton University Press, 1956.
 Kurach et al. (2016) Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. In ICLR, 2016.
 McCulloch & Pitts (1943) Warren McCulloch and Walter Pitts. A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:127–147, 1943.
 Ollinger (2012) Nicolas Ollinger. Universalities in cellular automata. In Handbook of Natural Computing, pp. 189–229. 2012.
 Price et al. (2016) Eric Price, Wojciech Zaremba, and Ilya Sutskever. Extensions and limitations of the neural GPU. CoRR, abs/1611.00736, 2016. URL http://arxiv.org/abs/1611.00736.
 Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In NAACL-HLT, pp. 464–468, 2018.
 Siegelmann & Sontag (1992) Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. In Proceedings of the Fifth Annual ACM Conference on Computational Learning Theory, COLT, pp. 440–449, 1992.
 Siegelmann & Sontag (1995) Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. J. Comput. Syst. Sci., 50(1):132–150, 1995.
 Smith III (1971) Alvy Ray Smith III. Simple computation-universal cellular spaces. Journal of the ACM (JACM), 18(3):339–353, 1971.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pp. 5998–6008, 2017.
 Weiss et al. (2018) Gail Weiss, Yoav Goldberg, and Eran Yahav. On the practical computational power of finite precision RNNs for language recognition. In ACL 2018, pp. 740–745, 2018.
Appendix A Proofs for Section 2
A.1 Proof of Theorem 2.3
We first sketch the main idea of Siegelmann & Sontag’s proof and refer the reader to the original paper for details. Siegelmann & Sontag show how to simulate a two-stack machine $M$ (and subsequently, a Turing machine) with a single RNN $N$ with $\sigma$ as activation. They first construct a network $N_1$ that, with $\mathbf{0}$ as initial state ($\mathbf{h}_0 = \mathbf{0}$) and with a binary string $w \in \{0,1\}^*$ as input sequence, produces a representation of $w$ as a rational number and stores it as one of its internal values. Their internal representation of strings encodes every $w$ as a rational number between $0$ and $1$. In particular, they use base $4$ such that, for example, the string $w = 100110$ is encoded as $(0.311331)_4$, that is, its encoding is
$$3 \times 4^{-1} + 1 \times 4^{-2} + 1 \times 4^{-3} + 3 \times 4^{-4} + 3 \times 4^{-5} + 1 \times 4^{-6}.$$
This representation allows one to easily simulate stack operations as affine transformations plus $\sigma$ activations. For instance, if $x_w$ is the value representing the string $w = b_1 b_2 \cdots b_n$ seen as a stack, then the $\mathrm{top}(w)$ operation can be defined as simply $y = \sigma(4x_w - 2)$, since $y = 1$ if and only if $b_1 = 1$, and $y = 0$ if and only if $b_1 = 0$. Other stack operations can be similarly simulated. Using this representation, they construct a second network $N_2$ that simulates the two-stack machine by using one neuron value to simulate each stack. The input $w$ for the simulated machine $M$ is assumed to be at an internal value given to $N_2$ as its initial state. Thus, $N_2$ expects only zeros as input. Actually, to make $N_2$ work for $r$ steps, an input of the form $0^r$ should be provided.
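The stack encoding described above can be illustrated with exact rational arithmetic. The following is a minimal sketch, not the authors' construction: it assumes the base-4 encoding in which bit $b$ at position $i$ contributes $(2b+1)/4^i$, and the saturated linear activation that clamps its argument to $[0, 1]$. The function names (`encode`, `top`, `pop`, `push`) are ours and purely illustrative.

```python
from fractions import Fraction


def sigma(x):
    """Saturated linear activation: clamp x to the interval [0, 1]."""
    return max(Fraction(0), min(Fraction(1), x))


def encode(bits):
    """Encode a binary string as a rational in [0, 1) using base 4.

    Bit 0 becomes digit 1 and bit 1 becomes digit 3, so the first bit
    of the string is the most significant digit (the top of the stack).
    """
    return sum(
        (Fraction(2 * int(b) + 1, 4 ** i) for i, b in enumerate(bits, start=1)),
        Fraction(0),
    )


def top(x):
    """Read the top bit of a non-empty stack: sigma(4x - 2)."""
    return sigma(4 * x - 2)


def pop(x):
    """Remove the top bit: shift left by one digit and subtract that digit."""
    return 4 * x - (2 * top(x) + 1)


def push(x, b):
    """Add bit b on top: shift right by one digit and prepend digit 2b + 1."""
    return x / 4 + Fraction(2 * b + 1, 4)


x = encode("100110")
print(x, top(x), pop(x) == encode("00110"))
```

Each operation is an affine map of the stack value, with at most one application of the activation, which is what lets a single neuron value play the role of a stack.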
Finally, they combine $N_1$ and $N_2$ to construct a network $N$ which expects an input of the following form: $(b_1, 1, 0)(b_2, 1, 0) \cdots (b_n, 1, 0)(0, 0, 1)(0, 0, 0) \cdots (0, 0, 0)$. The idea is that the first component contains the input string $w = b_1 b_2 \cdots b_n$, the second component states when the input is active, and the third component is $1$ only when the input is inactive for the first time. Before the input vector $(0, 0, 1)$ the network $N_1$ is working. The input $(0, 0, 1)$ is used to simulate a change from $N_1$ to $N_2$, and the remaining input vectors $(0, 0, 0)$ are provided to continue with $N_2$ for as many steps as needed. The number of neurons that this construction needs to simulate a machine $M$ with $m$ states is $10m + 30$ (see footnote 1).
Footnote 1: The idea presented above allows one to linearly simulate $M$, that is, each step of $M$ is simulated with a constant number of steps of the corresponding RNN. Siegelmann & Sontag show that, with a refinement of the above encoding, one can simulate $M$ in real time, that is, a single step of $M$ is simulated with a single step of the recurrent network. The $10m + 30$ is the bound given by a simulation with a slowdown of two. See the original paper for details (Siegelmann & Sontag, 1995).
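The input layout described above (first component: the input bit; second component: whether the input is active; third component: a marker set only at the first inactive step) can be sketched as follows. The helper name `build_input` and its `extra_steps` parameter are hypothetical and serve only to illustrate the layout.

```python
def build_input(bits, extra_steps):
    """Build the sequence of (bit, active, switch) triples for a binary string.

    - one triple (b, 1, 0) per input bit b, while the input is active;
    - a single triple (0, 0, 1) marking the first inactive step;
    - extra_steps triples (0, 0, 0) that let the simulation keep running.
    """
    sequence = [(int(b), 1, 0) for b in bits]
    sequence.append((0, 0, 1))
    sequence.extend([(0, 0, 0)] * extra_steps)
    return sequence


print(build_input("101", 2))
```

The number of trailing all-zero triples bounds how many further steps the simulating network runs after the input has been read.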
It is clear that Siegelmann & Sontag’s proof resembles a modern encoder-decoder RNN architecture, where $N_1$ is the encoder and $N_2$ is the decoder. Thus it is straightforward to use the same construction to provide an RNN encoder-decoder $N'$ and a language recognizer $A$ that uses $N'$ and simulates the two-stack machine $M$. There are some details that are important to note. Assume that $N'$ is given by the formulas in Equations (2) and (3). First, since $N_2$ in the above construction expects no input, we can safely assume that the matrix multiplying the decoder input in Equation (3) is the null matrix. Moreover, since $A$ defines its own embedding function, we can ensure that every vector that we provide for the encoder part of $N'$ has a $1$ in a fixed component, and thus we do not need the bias in Equation (2), since it can be simulated with one row of the corresponding weight matrix. We can do a similar construction for the bias in Equation (3). Finally, Siegelmann & Sontag show that their construction can be modified such that a particular neuron of $N_2$, say $n^\star$, is always $0$ except for the first time an accepting state of $M$ is reached, in which case $n^\star = 1$. Thus, one can consider the output function $O(\cdot)$ (Equation (3)) as the identity function and add to $A$ a stopping criterion that just checks whether $n^\star$ is $1$. This completes the proof sketch of Theorem 2.3.
Appendix B Proofs for Section 3
B.1 Proof of Proposition 3.1
We extend the definition of the function to sequences of vectors. Given a sequence we use to denote the set of all vectors occurring in . Similarly as for strings, we use as the number of times that occurs in divided by the length of . Now we are ready to extend with the following definition:
Notice that for every embedding function and string , we have that if then . Thus in order to prove that for every , it is enough to prove that
(16) 
To further simplify the exposition of the proof we introduce another notation. We denote by as the number of times that vector occurs in . Thus we have that if and only if, there exists a value such that for every it holds that .
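The characterization above, that two sequences have the same proportions exactly when their occurrence counts differ by a single common factor, can be checked mechanically. The following is a minimal sketch under that reading; the function names `occurrence_counts` and `same_proportions` are hypothetical and do not appear in the paper.

```python
from collections import Counter
from fractions import Fraction


def occurrence_counts(sequence):
    """Count how many times each vector occurs in a sequence of vectors."""
    return Counter(tuple(v) for v in sequence)


def same_proportions(seq_a, seq_b):
    """Return True if both sequences contain the same vectors in the same proportions.

    Equivalently: the same set of vectors occurs in both, and there is a single
    positive rational factor relating every count in seq_a to the count in seq_b.
    """
    counts_a = occurrence_counts(seq_a)
    counts_b = occurrence_counts(seq_b)
    if set(counts_a) != set(counts_b):
        return False
    if not counts_a:  # both sequences are empty
        return True
    factors = {Fraction(counts_b[v], counts_a[v]) for v in counts_a}
    return len(factors) == 1


X = [(1, 0), (0, 1), (1, 0)]
Y = [(0, 1), (1, 0), (1, 0), (1, 0), (0, 1), (1, 0)]
Z = [(1, 0), (0, 1)]
print(same_proportions(X, Y), same_proportions(X, Z))
```

Here `Y` repeats every vector of `X` twice in a different order, so the common factor is 2, while `Z` changes the relative frequencies and is rejected.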
We now have everything necessary to proceed with the proof of Proposition 3.1. We will prove it by proving the property in (16). Let be an arbitrary sequence of vectors, and let . Moreover, let and . We first prove the following property:
(17) 
Let be a pair of indices such that . From Equations (6–7) we have that where . Thus, since , in order to prove it is enough to prove that . By Equations (4–5) and the restriction on the form of normalization functions we have that
where . The above equation can be rewritten as
with . By a similar reasoning we can write