Over the past few years, Recurrent Neural Networks (RNNs) have become the prominent machine learning architectures for modeling sequential data, having been successfully employed for language modeling [39, 33, 16], [4], online handwritten recognition, speech recognition [18, 2], and more. The success of recurrent networks in learning complex functional dependencies for sequences of varying lengths readily implies that long-term and elaborate dependencies in the given inputs are somehow supported by these networks. Though the contribution of connectivity to the performance of RNNs has been empirically investigated, formal understanding of the influence of a recurrent network’s structure on its expressiveness, and specifically on its ever-improving ability to integrate data throughout time (e.g. translating long sentences, answering elaborate questions), is lacking.
An ongoing empirical effort to successfully apply recurrent networks to tasks of increasing complexity and temporal extent includes augmentations of the recurrent unit such as Long Short Term Memory (LSTM) networks and their variants (e.g. [15, 7]). A parallel avenue, which we focus on in this paper, includes the stacking of layers to form deep recurrent networks. Deep recurrent networks, which exhibit empirical superiority over shallow ones, implement hierarchical processing of information at every time-step that accompanies their inherent time-advancing computation. Evidence for a time-scale related effect arises from experiments: deep recurrent networks appear to model dependencies which correspond to longer time-scales than shallow ones. These findings, which imply that depth brings forth a considerable advantage both in complexity and in temporal capacity of recurrent networks, have no adequate theoretical explanation.
In this paper, we theoretically address the above presented issues. Based on the relative maturity of depth efficiency results in neural networks, namely results that show that deep networks efficiently express functions that would require shallow ones to have a super-linear size (see e.g. [9, 14, 40]), it is natural to assume that depth has a similar effect on the expressiveness of recurrent networks. Indeed, we show that depth efficiency holds for recurrent networks.
However, the distinguishing attribute of recurrent networks is their inherent ability to cope with varying input sequence length. Thus, once the above depth efficiency in recurrent networks is established, a basic question arises, which relates to the apparent depth enhanced long-term memory in recurrent networks: Do the functions which are efficiently expressed by deep recurrent networks correspond to dependencies over longer time-scales? We answer this question affirmatively, by showing that depth provides a super-linear (combinatorial) boost to the ability of recurrent networks to model long-term dependencies in their inputs.
In order to take on the above question, we introduce in Section 2 a recurrent network referred to as a recurrent arithmetic circuit (RAC) that shares the architectural features of RNNs, and differs from them in the type of non-linearity used in the calculation. This type of connection between state-of-the-art machine learning algorithms and arithmetic circuits (also known as Sum-Product Networks) has well-established precedent in the context of neural networks. A depth efficiency result has been proved for such networks, and the class of Convolutional Arithmetic Circuits, which differ from common ConvNets in the exact same fashion in which RACs differ from more standard RNNs, has been theoretically analyzed. Conclusions drawn from such analyses were empirically shown to extend to common ConvNets ([36, 11, 12, 28]). Beyond their connection to theoretical models, RACs are similar to empirically successful recurrent network architectures. The modification which defines RACs resembles that of Multiplicative RNNs and of Multiplicative Integration networks, which provide a substantial performance boost over many of the existing RNN models. In order to obtain our results, we make a connection between RACs and the Tensor Train (TT) decomposition, which suggests that Multiplicative RNNs may be related to a generalized TT-decomposition, similar to the way ReLU ConvNets have been connected to generalized tensor decompositions.
We move on to introduce in Section 3 the notion of Start-End separation rank as a measure of the recurrent network’s ability to model elaborate long-term dependencies. In order to analyze the long-term dependencies modeled by a function defined over a sequential input which extends $T$ time-steps, we partition the inputs to those which arrive at the first $T/2$ time-steps (“Start”) and the last $T/2$ time-steps (“End”), and ask how far the function realized by the recurrent network is from being separable w.r.t. this partition. Distance from separability is measured through the notion of separation rank, which can be viewed as a surrogate of the distance from the closest separable function. For a given function, high Start-End separation rank implies that the function induces strong dependency between the beginning and end of the input sequence, and vice versa.
In Section 4 we directly address the depth enhanced long-term memory question above, by examining depth $L=2$ RACs and proving that functions realized by these deep networks enjoy Start-End separation ranks that are combinatorially higher than those of shallow networks, implying that indeed these functions can model more elaborate input dependencies over longer periods of time. An additional reinforcing result is that the Start-End separation rank of the deep recurrent network grows combinatorially with the sequence length, while that of the shallow recurrent network is independent of the sequence length. Informally, this implies that vanilla shallow recurrent networks are inadequate in modeling dependencies of long input sequences, since in contrast to the case of deep recurrent networks, the modeled dependencies achievable by shallow ones do not adapt to the actual length of the input. Finally, we present and motivate a quantitative conjecture by which the Start-End separation rank of recurrent networks grows combinatorially with the network depth. A proof of this conjecture, which provides an even deeper insight regarding the advantages of depth in recurrent networks, is left as an open problem.
Finally, in Section 5 we present numerical evaluations which support the above theoretical findings. Specifically, we perform two experiments that directly test the ability of recurrent networks to model complex long-term temporal dependencies. Our results exhibit a clear boost in memory capacity of deeper recurrent networks relative to shallower networks that are given the same amount of resources, and thus directly demonstrate the theoretical trends established in this paper.
2 Recurrent Arithmetic Circuits
In this section, we introduce a class of recurrent networks referred to as Recurrent Arithmetic Circuits (RACs), which shares the architectural features of standard RNNs. As demonstrated below, the operation of RACs on sequential data is identical to the operation of RNNs, where a hidden state mixes information from previous time-steps with new incoming data (see Figure 1). The two classes differ only in the type of non-linearity used in the calculation, as described by Equations (1)-(3). In the following sections, we utilize the algebraic properties of RACs for proving results regarding their ability to model long-term dependencies of their inputs.
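We present below the basic framework of shallow recurrent networks (top of Figure 1), which describes both the common RNNs and the newly introduced RACs. A recurrent network is a network that models a discrete-time dynamical system; we focus on an example of a sequence to sequence classification task into one of the categories $\{1, \ldots, C\} \equiv [C]$. Denoting the temporal dependence by $t$, the sequential input to the network is $\{\mathbf{x}^t \in \mathcal{X}\}_{t=1}^{T}$, and the output is a sequence of class scores vectors $\{\mathbf{y}^{t,L,\Theta} \in \mathbb{R}^C\}_{t=1}^{T}$, where $L$ is the network depth, $\Theta$ denotes the parameters of the recurrent network, and $T$ represents the extent of the sequence in time-steps. We assume the input lies in some input space $\mathcal{X}$ that may be discrete (e.g. text data) or continuous (e.g. audio data), and that some initial mapping $\mathbf{f}: \mathcal{X} \to \mathbb{R}^M$ is performed on the input, so that all input types are mapped to vectors $\mathbf{f}(\mathbf{x}^t) \in \mathbb{R}^M$. The function $\mathbf{f}(\cdot)$ may be viewed as an encoding, e.g. words to vectors or images to a final dense layer via some trained ConvNet. The output at time $t \in [T]$ of the shallow (depth $L=1$) recurrent network with $R$ hidden channels, depicted at the top of Figure 1, is given by:

$$\mathbf{h}^{t} = g\left(W^{\mathrm{H}} \mathbf{h}^{t-1},\, W^{\mathrm{I}} \mathbf{f}(\mathbf{x}^{t})\right), \qquad \mathbf{y}^{t,1,\Theta} = W^{\mathrm{O}} \mathbf{h}^{t}, \tag{1}$$

where $\mathbf{h}^t \in \mathbb{R}^R$ is the hidden state of the network at time $t$ ($\mathbf{h}^0$ is some initial hidden state), $\Theta$ denotes the learned parameters $W^{\mathrm{I}} \in \mathbb{R}^{R \times M}$, $W^{\mathrm{H}} \in \mathbb{R}^{R \times R}$, $W^{\mathrm{O}} \in \mathbb{R}^{C \times R}$, which are the input, hidden and output weights matrices respectively, and $g: \mathbb{R}^R \times \mathbb{R}^R \to \mathbb{R}^R$ is some non-linear operation. We omit a bias term for simplicity. For common RNNs, the non-linearity is given by:

$$g^{\mathrm{RNN}}(\mathbf{a}, \mathbf{b}) = \sigma(\mathbf{a} + \mathbf{b}), \tag{2}$$

where $\sigma(\cdot)$ is typically some point-wise non-linearity such as sigmoid, tanh etc. For the newly introduced class of RACs, $g$ is given by:

$$g^{\mathrm{RAC}}(\mathbf{a}, \mathbf{b}) = \mathbf{a} \odot \mathbf{b}, \tag{3}$$

where the operation $\odot$ stands for element-wise multiplication between vectors, for which the resultant vector upholds $(\mathbf{a} \odot \mathbf{b})_i = a_i \cdot b_i$. This form of merging the input and the hidden state by multiplication rather than addition is referred to as Multiplicative Integration.

The extension to deep recurrent networks is natural, and we follow the common approach where each layer acts as a recurrent network which receives the hidden state of the previous layer as its input. The output at time $t \in [T]$ of the depth $L$ recurrent network with $R$ hidden channels in each layer, depicted at the bottom of Figure 1, is constructed by the following:

$$\mathbf{h}^{t,l} = g\left(W^{\mathrm{H},l} \mathbf{h}^{t-1,l},\, W^{\mathrm{I},l} \mathbf{h}^{t,l-1}\right), \qquad \mathbf{h}^{t,0} \equiv \mathbf{f}(\mathbf{x}^{t}), \qquad \mathbf{y}^{t,L,\Theta} = W^{\mathrm{O}} \mathbf{h}^{t,L}, \tag{4}$$

where $\mathbf{h}^{t,l} \in \mathbb{R}^R$ is the state of the depth $l$ hidden unit at time $t$ ($\mathbf{h}^{0,l}$ is some initial hidden state per layer), and $\Theta$ denotes the learned parameters. Specifically, $W^{\mathrm{I},l} \in \mathbb{R}^{R \times R}$ and $W^{\mathrm{H},l} \in \mathbb{R}^{R \times R}$ are the input and hidden weights matrices at depth $l$, respectively. For $l = 1$, the weights matrix which multiplies the inputs vector has the appropriate dimensions: $W^{\mathrm{I},1} \in \mathbb{R}^{R \times M}$. The output weights matrix is $W^{\mathrm{O}} \in \mathbb{R}^{C \times R}$ as in the shallow case, representing a final calculation of the scores for all classes $1$ through $C$ at every time-step. The non-linear operation $g$ determines the type of the deep recurrent network, where a common deep RNN is obtained by choosing $g = g^{\mathrm{RNN}}$ [Equation (2)], and a deep RAC is obtained for $g = g^{\mathrm{RAC}}$ [Equation (3)].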
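The recurrences described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the helper `random_params`, the Gaussian initialization, and the choice of tanh as the point-wise non-linearity are all assumptions made for the example.

```python
import numpy as np

def g_rnn(a, b):
    """Common RNN merge: point-wise non-linearity (tanh assumed here) of the sum."""
    return np.tanh(a + b)

def g_rac(a, b):
    """RAC merge (Multiplicative Integration): element-wise product."""
    return a * b

def run_recurrent_net(encoded_inputs, W_in, W_hid, W_out, h0, g):
    """Return class scores of shape (T, C) for a depth-L recurrent network.

    encoded_inputs: (T, M) array, row t holds the encoded input f(x^t)
    W_in:  list of L input matrices; W_in[0] is (R, M), the others (R, R)
    W_hid: list of L hidden matrices, each (R, R)
    W_out: (C, R) output matrix
    h0:    list of L initial hidden states, each of shape (R,)
    g:     merge operation, e.g. g_rnn or g_rac
    """
    h = [state.copy() for state in h0]
    scores = []
    for fx in encoded_inputs:
        layer_input = fx
        for l in range(len(W_hid)):
            # Each layer mixes its own previous state with the layer below.
            h[l] = g(W_hid[l] @ h[l], W_in[l] @ layer_input)
            layer_input = h[l]
        scores.append(W_out @ h[-1])
    return np.stack(scores)

def random_params(L, M, R, C, rng):
    """Hypothetical helper: draw Gaussian weights and initial states."""
    W_in = [rng.standard_normal((R, M if l == 0 else R)) for l in range(L)]
    W_hid = [rng.standard_normal((R, R)) for _ in range(L)]
    W_out = rng.standard_normal((C, R))
    h0 = [rng.standard_normal(R) for _ in range(L)]
    return W_in, W_hid, W_out, h0

rng = np.random.default_rng(0)
T, M, R, C = 5, 4, 3, 2
inputs = rng.standard_normal((T, M))
shallow = random_params(1, M, R, C, rng)
deep = random_params(2, M, R, C, rng)
shallow_rac_scores = run_recurrent_net(inputs, *shallow, g=g_rac)
deep_rnn_scores = run_recurrent_net(inputs, *deep, g=g_rnn)
print(shallow_rac_scores.shape, deep_rnn_scores.shape)
```

Passing `g_rac` or `g_rnn` is the only change needed to switch between the two classes, which mirrors the statement that they differ only in the merge operation.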
We consider the newly presented class of RACs to be a good surrogate of common RNNs. Firstly, there is an obvious structural resemblance between the two classes, as the recurrent aspect of the calculation has the exact same form in both networks (Figure 1). In fact, recurrent networks that include Multiplicative Integration similarly to RACs (and include additional non-linearities), have been shown to outperform many of the existing RNN models [39, 43]. Secondly, as mentioned above, arithmetic circuits have been successfully used as surrogates of convolutional networks. The fact that the foundation has been laid for extending the proof methodologies of convolutional arithmetic circuits to common ConvNets with ReLU activations suggests that such adaptations may be made in the recurrent network analog, rendering the newly proposed class of recurrent networks all the more interesting. Finally, RACs have recently been shown to operate well in practical settings. In the following sections, we make use of the algebraic properties of RACs in order to obtain clear-cut observations regarding the benefits of depth in recurrent networks.
3 Temporal Dependencies Modeled by Recurrent Networks
In this section, we establish means for quantifying the ability of recurrent networks to model long-term temporal dependencies in the sequential input data. We begin by introducing the Start-End separation rank of the function realized by a recurrent network as a measure of the amount of information flow across time that can be supported by the network. We then tie the Start-End separation rank to the algebraic concept of grid tensors, which will allow us to employ tools and results from tensorial analysis in order to show that depth provides a powerful boost to the ability of recurrent networks to model elaborate long-term temporal dependencies.
3.1 The Start-End Separation Rank
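We define below the concept of the Start-End separation rank for functions realized by recurrent networks after $T$ time-steps, i.e. functions that take as input $(\mathbf{x}^1, \ldots, \mathbf{x}^T) \in \mathcal{X}^T$. The separation rank quantifies a function’s distance from separability with respect to two disjoint subsets of its inputs. Specifically, let $(S, E)$ be a partition of input indices, such that $S = \{1, \ldots, T/2\}$ and $E = \{T/2+1, \ldots, T\}$ (we consider even values of $T$ throughout the paper for convenience of presentation). This implies that $\{\mathbf{x}^s\}_{s \in S}$ are the first $T/2$ (“Start”) inputs to the network, and $\{\mathbf{x}^e\}_{e \in E}$ are the last $T/2$ (“End”) inputs to the network. For a function $y: \mathcal{X}^T \to \mathbb{R}$, the Start-End separation rank is defined as follows:

$$\mathrm{sep}_{(S,E)}(y) \equiv \min\left\{ K \in \mathbb{N} \cup \{0\} \;:\; \exists\, g^s_1, \ldots, g^s_K,\ g^e_1, \ldots, g^e_K : \mathcal{X}^{T/2} \to \mathbb{R} \ \text{ s.t. } \ y(\mathbf{x}^1, \ldots, \mathbf{x}^T) = \sum_{\nu=1}^{K} g^s_\nu(\mathbf{x}^1, \ldots, \mathbf{x}^{T/2})\, g^e_\nu(\mathbf{x}^{T/2+1}, \ldots, \mathbf{x}^T) \right\}. \tag{5}$$

In words, it is the minimal number of summands that together give $y$, where each summand is separable w.r.t. $(S, E)$, i.e. is equal to a product of two functions: one that intakes only inputs from the first $T/2$ time-steps, and another that intakes only inputs from the last $T/2$ time-steps.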
The separation rank w.r.t. a general partition of the inputs was introduced for high-dimensional numerical analysis, and was employed for various applications, e.g. chemistry, particle engineering, and machine learning. It has been connected to the distance of the function from the set of separable functions, and used to measure dependencies modeled by deep convolutional networks. It has also been tied to the family of quantum entanglement measures, which quantify dependencies in many-body quantum systems.
In our context, if the Start-End separation rank of a function realized by a recurrent network is equal to $1$, then the function is separable, meaning it cannot model any interaction between the inputs which arrive at the beginning of the sequence and the inputs that follow later, towards the end of the sequence. Specifically, if $\mathrm{sep}_{(S,E)}(y) = 1$ then there exist $g^s: \mathcal{X}^{T/2} \to \mathbb{R}$ and $g^e: \mathcal{X}^{T/2} \to \mathbb{R}$ such that $y(\mathbf{x}^1, \ldots, \mathbf{x}^T) = g^s(\mathbf{x}^1, \ldots, \mathbf{x}^{T/2})\, g^e(\mathbf{x}^{T/2+1}, \ldots, \mathbf{x}^T)$, and the function $y$ cannot take into account consistency between the values of $\{\mathbf{x}^1, \ldots, \mathbf{x}^{T/2}\}$ and those of $\{\mathbf{x}^{T/2+1}, \ldots, \mathbf{x}^T\}$. In a statistical setting, if $y$ were a probability density function, this would imply that $\{\mathbf{x}^1, \ldots, \mathbf{x}^{T/2}\}$ and $\{\mathbf{x}^{T/2+1}, \ldots, \mathbf{x}^T\}$ are statistically independent. The higher $\mathrm{sep}_{(S,E)}(y)$ is, the farther $y$ is from this situation, i.e. the more it models dependency between the beginning and the end of the inputs sequence. Stated differently, if the recurrent network’s architecture restricts the hypothesis space to functions with low Start-End separation ranks, a more elaborate long-term temporal dependence, which corresponds to a function with a higher Start-End separation rank, cannot be learned.
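As a toy illustration of separability, consider a hypothetical setting with $T = 2$ in which “Start” and “End” are one scalar input each. Evaluating a function on a grid gives a matrix whose rank cannot exceed the separation rank, since every separable summand contributes a rank-one term. The two example functions, the grid, and the helper `numerical_rank` below are assumptions chosen for the sketch: the first function is a single product, while the second expands into three separable terms.

```python
import numpy as np

def numerical_rank(matrix, rel_tol=1e-9):
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

# Hypothetical toy setting: T = 2, so "Start" and "End" are one scalar each.
points = np.linspace(-1.0, 1.0, 7)
start, end = np.meshgrid(points, points, indexing="ij")

separable = np.exp(start) * np.cos(end)  # a single product g_s(start) * g_e(end)
coupled = (start - end) ** 2             # start^2 - 2*start*end + end^2

rank_separable = numerical_rank(separable)
rank_coupled = numerical_rank(coupled)
print(rank_separable, rank_coupled)
```

The grid rank of the coupled function matches the three terms of its expansion, so in this toy case the grid-based lower bound and the explicit decomposition agree.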
In Section 4 we show that deep RACs support Start-End separation ranks which are combinatorially larger than those supported by shallow RACs, and are therefore much better fit to model long-term temporal dependencies. To this end, we employ in the following sub-section the algebraic tool of grid tensors that will allow us to evaluate the Start-End separation ranks of deep and shallow RACs.
3.2 Bounding the Start-End Separation Rank via Grid Tensors
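We begin by laying out basic concepts in tensor theory required for the upcoming analysis. The core concept of a tensor may be thought of as a multi-dimensional array. The order of a tensor is defined to be the number of indexing entries in the array, referred to as modes. The dimension of a tensor in a particular mode is defined as the number of values taken by the index in that mode. If $\mathcal{A}$ is a tensor of order $T$ and dimension $M_i$ in each mode $i \in [T]$, its entries are denoted $\mathcal{A}_{d_1 \ldots d_T}$, where the index in each mode takes values $d_i \in [M_i]$. A fundamental operator in tensor analysis is the tensor product, which we denote by $\otimes$. It is an operator that intakes two tensors $\mathcal{A} \in \mathbb{R}^{M_1 \times \cdots \times M_P}$ and $\mathcal{B} \in \mathbb{R}^{M_{P+1} \times \cdots \times M_{P+Q}}$ (orders $P$ and $Q$ respectively), and returns a tensor $\mathcal{A} \otimes \mathcal{B} \in \mathbb{R}^{M_1 \times \cdots \times M_{P+Q}}$ (order $P+Q$) defined by: $(\mathcal{A} \otimes \mathcal{B})_{d_1 \ldots d_{P+Q}} = \mathcal{A}_{d_1 \ldots d_P} \cdot \mathcal{B}_{d_{P+1} \ldots d_{P+Q}}$. An additional concept we will make use of is the matricization of $\mathcal{A}$ w.r.t. the partition $(S, E)$, denoted $[\![\mathcal{A}]\!]_{S,E}$, which is essentially the arrangement of the tensor elements as a matrix whose rows correspond to $S$ and columns to $E$ (formally presented in Appendix C).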
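The two tensor operations above can be sketched with NumPy as follows. The helper names are hypothetical, and the matricization shown covers only the Start-End partition, where the “Start” modes are the first half of the axes so that a plain reshape suffices; a general partition would first require permuting the axes.

```python
import numpy as np

def tensor_product(A, B):
    """(A ⊗ B)[d_1..d_{P+Q}] = A[d_1..d_P] * B[d_{P+1}..d_{P+Q}]."""
    return np.tensordot(A, B, axes=0)

def matricize_start_end(A):
    """Rows index the first half of the modes ("Start"), columns the rest ("End")."""
    half = A.ndim // 2
    n_rows = int(np.prod(A.shape[:half]))
    return A.reshape(n_rows, -1)

rng = np.random.default_rng(0)
M = 3
P = rng.standard_normal((M, M))
Q = rng.standard_normal((M, M))
A = tensor_product(P, Q)        # order-4 tensor of shape (3, 3, 3, 3)
A_mat = matricize_start_end(A)  # matrix of shape (9, 9)
print(A.shape, A_mat.shape, np.linalg.matrix_rank(A_mat))
```

Because `A` is a tensor product of a “Start” factor and an “End” factor, its Start-End matricization is an outer product of two vectors and therefore has rank one.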
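We consider the function realized by a shallow RAC with $R$ hidden channels, which computes the score of class $c \in [C]$ at time $T$. This function, which is given by a recursive definition in Equations (1) and (3), can be alternatively written in the following closed form:

$$y_c^{T,1,\Theta}(\mathbf{x}^1, \ldots, \mathbf{x}^T) = \sum_{d_1, \ldots, d_T = 1}^{M} \left(\mathcal{A}_c^{T,1,\Theta}\right)_{d_1 \ldots d_T} \prod_{i=1}^{T} f_{d_i}(\mathbf{x}^i), \tag{6}$$

where the order $T$ tensor $\mathcal{A}_c^{T,1,\Theta}$, which lies at the heart of the above expression, is referred to as the shallow RAC weights tensor, since its entries are polynomials in the network weights $\Theta$. Specifically, denoting the rows of the input weights matrix, $W^{\mathrm{I}}$, by $\mathbf{a}^{\mathrm{I},\alpha} \in \mathbb{R}^M$ (or element-wise: $a^{\mathrm{I},\alpha}_j = W^{\mathrm{I}}_{\alpha,j}$), the rows of the hidden weights matrix, $W^{\mathrm{H}}$, by $\mathbf{a}^{\mathrm{H},\beta} \in \mathbb{R}^R$ (or element-wise: $a^{\mathrm{H},\beta}_j = W^{\mathrm{H}}_{\beta,j}$), and the rows of the output weights matrix, $W^{\mathrm{O}}$, by $\mathbf{a}^{\mathrm{O},c} \in \mathbb{R}^R$ (or element-wise: $a^{\mathrm{O},c}_j = W^{\mathrm{O}}_{c,j}$), the shallow RAC weights tensor can be gradually constructed in the following fashion:

$$\phi^{1,\beta} = \mathbf{a}^{\mathrm{I},\beta}, \qquad \phi^{t,\beta} = \sum_{\alpha=1}^{R} a^{\mathrm{H},\beta}_{\alpha}\, \phi^{t-1,\alpha} \otimes \mathbf{a}^{\mathrm{I},\beta} \quad (2 \le t \le T), \qquad \mathcal{A}_c^{T,1,\Theta} = \sum_{\alpha=1}^{R} a^{\mathrm{O},c}_{\alpha}\, \phi^{T,\alpha}, \tag{7}$$

having set $\mathbf{h}^0 = (W^{\mathrm{H}})^{\dagger} \mathbf{1}$, where $\dagger$ is the pseudoinverse operation. In the above equation, the tensor products, which appear inside the sums, are directly related to the Multiplicative Integration property of RACs [Equation (3)]. The sums originate in the multiplication of the hidden states vector by the hidden weights matrix at every time-step [Equation (1)]. The construction of the shallow RAC weights tensor, presented in Equation (7), is referred to as a Tensor Train (TT) decomposition of TT-rank $R$ in the tensor analysis community and is analogously described by a Matrix Product State (MPS) Tensor Network in the quantum physics community. See Appendix A for the Tensor Networks construction of deep and shallow RACs, which provides graphical insight regarding the complexity brought forth by depth in recurrent networks.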
We now present the concept of grid tensors, which are a form of function discretization. Essentially, the function is evaluated for a set of points on an exponentially large grid in the input space and the outcomes are stored in a tensor. Formally, fixing a set of template vectors $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(M)} \in \mathcal{X}$, the points on the grid are the set $\{(\mathbf{x}^{(d_1)}, \ldots, \mathbf{x}^{(d_T)})\}_{d_1, \ldots, d_T = 1}^{M}$. Given a function $y(\mathbf{x}^1, \ldots, \mathbf{x}^T)$, the set of its values on the grid arranged in the form of a tensor are called the grid tensor induced by $y$, denoted $\mathcal{A}(y)_{d_1 \ldots d_T} \equiv y(\mathbf{x}^{(d_1)}, \ldots, \mathbf{x}^{(d_T)})$. The grid tensors of functions realized by recurrent networks will allow us to calculate their separation ranks and establish definitive conclusions regarding the benefits of depth in these networks. Having presented the tensorial structure of the function realized by a shallow RAC, as given by Equations (6) and (7) above, we are now in a position to tie its Start-End separation rank to its grid tensor, as formulated in the following claim:

Let $y_c^{T,1,\Theta}$ be a function realized by a shallow RAC (top of Figure 1) after $T$ time-steps, and let $\mathcal{A}_c^{T,1,\Theta}$ be its shallow RAC weights tensor, constructed according to Equation (7). Assume that the network’s initial mapping functions $\{f_d\}_{d=1}^{M}$ are linearly independent, and that they, as well as the functions $g^s_\nu, g^e_\nu$ in the definition of Start-End separation rank [Equation (5)], are measurable and square-integrable. Then, there exist template vectors $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(M)} \in \mathcal{X}$ such that the following holds:

$$\mathrm{sep}_{(S,E)}\left(y_c^{T,1,\Theta}\right) = \mathrm{rank}\left([\![\mathcal{A}(y_c^{T,1,\Theta})]\!]_{S,E}\right) = \mathrm{rank}\left([\![\mathcal{A}_c^{T,1,\Theta}]\!]_{S,E}\right),$$

where $\mathcal{A}(y_c^{T,1,\Theta})$ is the grid tensor of $y_c^{T,1,\Theta}$ with respect to the above template vectors.
We first note that though square-integrability may seem a limitation at first glance (for example, neurons with sigmoid or ReLU activation do not meet this condition), in practice our inputs are bounded (e.g. image pixels hold bounded intensity values). Therefore, we may view these functions as having compact support, which, as long as they are continuous (which holds in all cases of interest), ensures square-integrability.
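We begin by proving the equality $\mathrm{sep}_{(S,E)}(y_c^{T,1,\Theta}) = \mathrm{rank}([\![\mathcal{A}_c^{T,1,\Theta}]\!]_{S,E})$. It has been shown that for any function $y: \mathcal{X}^T \to \mathbb{R}$ which follows the structure of Equation (6) with a general weights tensor $\mathcal{A}$, assuming that $\{f_d\}_{d=1}^{M}$ are linearly independent, measurable, and square-integrable (as assumed in Claim 3.2), it holds that $\mathrm{sep}_{(S,E)}(y) = \mathrm{rank}([\![\mathcal{A}]\!]_{S,E})$. Specifically, for $y = y_c^{T,1,\Theta}$ and $\mathcal{A} = \mathcal{A}_c^{T,1,\Theta}$ the above equality holds.

It remains to prove that there exist template vectors for which $\mathrm{rank}([\![\mathcal{A}(y_c^{T,1,\Theta})]\!]_{S,E}) = \mathrm{rank}([\![\mathcal{A}_c^{T,1,\Theta}]\!]_{S,E})$. For any given set of template vectors $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(M)} \in \mathcal{X}$, we define the matrix $F \in \mathbb{R}^{M \times M}$ such that $F_{ij} = f_j(\mathbf{x}^{(i)})$, for which it holds that:

$$\mathcal{A}(y_c^{T,1,\Theta})_{k_1 \ldots k_T} = \sum_{d_1, \ldots, d_T = 1}^{M} \left(\mathcal{A}_c^{T,1,\Theta}\right)_{d_1 \ldots d_T} \prod_{i=1}^{T} F_{k_i d_i}.$$

The right-hand side in the above equation can be regarded as a linear transformation of $\mathcal{A}_c^{T,1,\Theta}$ specified by the tensor operator $F \otimes \cdots \otimes F$, which is more commonly denoted by $(F \otimes \cdots \otimes F)(\mathcal{A}_c^{T,1,\Theta})$. According to a known lemma (lemma 5.6 in the cited reference), if $F$ is non-singular then $\mathrm{rank}([\![(F \otimes \cdots \otimes F)(\mathcal{A}_c^{T,1,\Theta})]\!]_{S,E}) = \mathrm{rank}([\![\mathcal{A}_c^{T,1,\Theta}]\!]_{S,E})$. To conclude the proof, we simply note that it has been shown that if $\{f_d\}_{d=1}^{M}$ are linearly independent then there exist template vectors for which $F$ is non-singular. ∎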
The above claim establishes an equality between the Start-End separation rank and the rank of the matrix obtained by the corresponding grid tensor matricization, denoted $[\![\mathcal{A}(y_c^{T,1,\Theta})]\!]_{S,E}$, with respect to a specific set of template vectors. Note that the limitation to specific template vectors does not restrict our results, as grid tensors are merely a tool used to bound the separation rank. The additional equality to the rank of the matrix obtained by matricizing the shallow RAC weights tensor will be of use to us when proving our main results below (Theorem 4.1).
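The relation between a shallow RAC and the rank of its matricized grid tensor can be checked numerically on a small instance. The sketch below is an illustration under stated assumptions, not the authors' code: it takes one-hot encodings as the values of the initial mapping on the template vectors (so the matrix of mapping values on the templates is the identity), draws Gaussian weights, and uses hypothetical helper names. Under these assumptions the matricized grid tensor factors through the $R$-dimensional hidden state at the middle of the sequence, so its rank is at most $R$.

```python
import itertools
import numpy as np

def shallow_rac_score(encoded_inputs, W_in, W_hid, w_out, h0):
    """Score of one class for a shallow RAC: h <- (W_hid h) * (W_in f(x))."""
    h = h0
    for fx in encoded_inputs:
        h = (W_hid @ h) * (W_in @ fx)
    return w_out @ h

def grid_tensor(M, T, W_in, W_hid, w_out, h0):
    """Evaluate the network on every length-T sequence of one-hot templates."""
    basis = np.eye(M)  # assumption: the encoding maps template d to e_d
    A = np.empty((M,) * T)
    for idx in itertools.product(range(M), repeat=T):
        A[idx] = shallow_rac_score([basis[d] for d in idx], W_in, W_hid, w_out, h0)
    return A

def numerical_rank(matrix, rel_tol=1e-9):
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

M, R, T = 3, 2, 4
rng = np.random.default_rng(0)
W_in = rng.standard_normal((R, M))
W_hid = rng.standard_normal((R, R))
w_out = rng.standard_normal(R)
h0 = rng.standard_normal(R)

A = grid_tensor(M, T, W_in, W_hid, w_out, h0)
side = M ** (T // 2)
A_mat = A.reshape(side, side)  # rows: "Start" indices, columns: "End" indices
rank = numerical_rank(A_mat)
print(rank, "<=", R)
```

With $M = 3$ and $T = 4$ the matricized grid tensor is a $9 \times 9$ matrix, yet its rank stays bounded by the number of hidden channels.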
Due to the inherent use of data duplication in the computation performed by a deep RAC (see Appendix A.3 for further details), it cannot be written in a closed tensorial form similar to that of Equation (6). This in turn implies that the equality shown in the claim above for shallow RACs does not hold for functions realized by deep RACs. The following claim introduces a fundamental relation between a function’s Start-End separation rank and the rank of the matrix obtained by the corresponding grid tensor matricization. This relation, which holds for all functions, is formulated below for functions realized by deep RACs:

Let $y_c^{T,L,\Theta}$ be a function realized by a depth $L$ RAC (bottom of Figure 1) after $T$ time-steps. Then, for any set of template vectors $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(M)} \in \mathcal{X}$ it holds that:

$$\mathrm{sep}_{(S,E)}\left(y_c^{T,L,\Theta}\right) \geq \mathrm{rank}\left([\![\mathcal{A}(y_c^{T,L,\Theta})]\!]_{S,E}\right),$$

where $\mathcal{A}(y_c^{T,L,\Theta})$ is the grid tensor of $y_c^{T,L,\Theta}$ with respect to the above template vectors.
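If $\mathrm{sep}_{(S,E)}(y_c^{T,L,\Theta}) = \infty$ then the inequality is trivially satisfied. Otherwise, assume that $\mathrm{sep}_{(S,E)}(y_c^{T,L,\Theta}) = K \in \mathbb{N}$, and let $\{g^s_\nu, g^e_\nu\}_{\nu=1}^{K}$ be the functions of the respective decomposition to a sum of separable functions, i.e. that the following holds:

$$y_c^{T,L,\Theta}(\mathbf{x}^1, \ldots, \mathbf{x}^T) = \sum_{\nu=1}^{K} g^s_\nu(\mathbf{x}^1, \ldots, \mathbf{x}^{T/2})\, g^e_\nu(\mathbf{x}^{T/2+1}, \ldots, \mathbf{x}^T).$$

Then, by definition of the grid tensor, for any template vectors $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(M)} \in \mathcal{X}$ the following equality holds:

$$\mathcal{A}(y_c^{T,L,\Theta})_{d_1 \ldots d_T} = \sum_{\nu=1}^{K} g^s_\nu(\mathbf{x}^{(d_1)}, \ldots, \mathbf{x}^{(d_{T/2})})\, g^e_\nu(\mathbf{x}^{(d_{T/2+1})}, \ldots, \mathbf{x}^{(d_T)}) \equiv \sum_{\nu=1}^{K} \mathcal{A}(g^s_\nu)_{d_1 \ldots d_{T/2}}\, \mathcal{A}(g^e_\nu)_{d_{T/2+1} \ldots d_T},$$

where $\mathcal{A}(g^s_\nu)$ and $\mathcal{A}(g^e_\nu)$ are the tensors holding the values of $g^s_\nu$ and $g^e_\nu$, respectively, at the points defined by the template vectors. Under the matricization according to the $(S, E)$ partition, it holds that $[\![\mathcal{A}(g^s_\nu)]\!]_{S,E}$ and $[\![\mathcal{A}(g^e_\nu)]\!]_{S,E}$ are column and row vectors, respectively, which we denote by $\mathbf{u}_\nu$ and $\mathbf{v}_\nu^{\top}$. It follows that the matricization of the grid tensor is given by:

$$[\![\mathcal{A}(y_c^{T,L,\Theta})]\!]_{S,E} = \sum_{\nu=1}^{K} \mathbf{u}_\nu \mathbf{v}_\nu^{\top},$$

which means that $\mathrm{rank}([\![\mathcal{A}(y_c^{T,L,\Theta})]\!]_{S,E}) \leq K = \mathrm{sep}_{(S,E)}(y_c^{T,L,\Theta})$. ∎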
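The counting argument in this proof can be mirrored numerically: a sum of $K$ outer products, one per separable summand, is a matrix of rank at most $K$. The sketch below is only an illustration; the random vectors stand in for the values of the “Start” and “End” functions on the grid, and `grid_side` is a hypothetical stand-in for the number of grid points on each side of the partition.

```python
import numpy as np

def numerical_rank(matrix, rel_tol=1e-9):
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
K, grid_side = 3, 9  # K separable summands; grid_side grid points per side

# Row k of u (resp. v) holds the grid values of the k-th Start (resp. End) function.
u = rng.standard_normal((K, grid_side))
v = rng.standard_normal((K, grid_side))

# Matricized grid tensor of the sum of K separable functions.
grid_matrix = sum(np.outer(u[k], v[k]) for k in range(K))
rank = numerical_rank(grid_matrix)
print(rank, "<=", K)
```

Reading the inequality in the other direction gives the lower bound used later: a grid matrix of rank $r$ certifies that at least $r$ separable summands are needed.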
The claim above for deep RACs will allow us to provide a lower bound on the Start-End separation rank of functions realized by deep RACs, which we show to be significantly higher than the Start-End separation rank of functions realized by shallow RACs (to be obtained via the preceding claim for shallow RACs). Thus, in the next section, we employ the above presented tools to show that a compelling enhancement of the Start-End separation rank is brought forth by depth in recurrent networks.
4 Depth Enhanced Long-Term Memory in Recurrent Networks
In this section, we present the main theoretical contributions of this paper. In Section 4.1, we formally present a result which clearly separates between the memory capacity of a deep ($L=2$) recurrent network and a shallow ($L=1$) one. Following the formal presentation of results in Theorem 4.1, we discuss some of their implications and then conclude by sketching a proof outline for the theorem (full proof is relegated to Appendix B). In Section 4.2, we present a quantitative conjecture regarding the enhanced memory capacity of deep recurrent networks of general depth $L$, which relies on the inherent combinatorial properties of the recurrent network’s computation. We leave the formal proof of this conjecture for future work.
4.1 Separating Between Shallow and Deep Recurrent Networks
Theorem 4.1 states that the dependencies modeled between the beginning and end of the input sequence to a recurrent network, as measured by the Start-End separation rank (see Section 3.1), can be considerably more complex for deep networks than for shallow ones:
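Let $y_c^{T,L,\Theta}$ be the function computing the output after $T$ time-steps of an RAC with $L$ layers, $R$ hidden channels per layer, weights denoted by $\Theta$, and initial hidden states $\mathbf{h}^{0,l}$, $l \in [L]$ (Figure 1 with $g = g^{\mathrm{RAC}}$). Assume that the network’s initial mapping functions $\{f_d\}_{d=1}^{M}$ are linearly independent and square integrable. Let $\mathrm{sep}_{(S,E)}(y_c^{T,L,\Theta})$ be the Start-End separation rank of $y_c^{T,L,\Theta}$ [Equation (5)]. Then, the following holds almost everywhere, i.e. for all values of the parameters but a set of Lebesgue measure zero:
(1) $\mathrm{sep}_{(S,E)}(y_c^{T,L,\Theta}) = \min\{R, M^{T/2}\}$, for $L = 1$ (shallow network).
(2) $\mathrm{sep}_{(S,E)}(y_c^{T,L,\Theta}) \geq \left(\!\!\binom{\min\{M,R\}}{T/2}\!\!\right)$, for $L = 2$ (deep network),
where $\left(\!\!\binom{n}{k}\!\!\right)$ is the multiset coefficient, given in the binomial form by $\binom{n+k-1}{k}$.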
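A small numerical experiment can make the separation between the two depths concrete. The sketch below rests on several assumptions that are not taken from the text: it reads the shallow bound as $\min\{R, M^{T/2}\}$ and the deep bound as the multiset coefficient of $\min\{M,R\}$ and $T/2$, it uses one-hot encodings on the template vectors, it draws Gaussian weights, and all helper names are hypothetical. It computes the rank of the matricized grid tensor, which lower-bounds the Start-End separation rank.

```python
import itertools
from math import comb
import numpy as np

def multiset(n, k):
    """Multiset coefficient: number of size-k multisets drawn from n items."""
    return comb(n + k - 1, k)

def rac_score(index_seq, W_in, W_hid, w_out, h0):
    """Score of a depth-L RAC fed one-hot inputs e_{d_1}, ..., e_{d_T}."""
    basis = np.eye(W_in[0].shape[1])
    h = [s.copy() for s in h0]
    for d in index_seq:
        layer_input = basis[d]
        for l in range(len(W_hid)):
            h[l] = (W_hid[l] @ h[l]) * (W_in[l] @ layer_input)
            layer_input = h[l]
    return w_out @ h[-1]

def start_end_grid_rank(L, M, R, T, rng, rel_tol=1e-9):
    """Rank of the Start-End matricized grid tensor for random Gaussian weights."""
    W_in = [rng.standard_normal((R, M if l == 0 else R)) for l in range(L)]
    W_hid = [rng.standard_normal((R, R)) for _ in range(L)]
    w_out = rng.standard_normal(R)
    h0 = [rng.standard_normal(R) for _ in range(L)]
    grid = np.empty((M,) * T)
    for idx in itertools.product(range(M), repeat=T):
        grid[idx] = rac_score(idx, W_in, W_hid, w_out, h0)
    side = M ** (T // 2)
    s = np.linalg.svd(grid.reshape(side, side), compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

M, R, T = 2, 2, 4
rng = np.random.default_rng(0)
shallow_rank = start_end_grid_rank(1, M, R, T, rng)
deep_rank = start_end_grid_rank(2, M, R, T, rng)
shallow_value = min(R, M ** (T // 2))
deep_bound = multiset(min(M, R), T // 2)
print("shallow:", shallow_rank, "bound value:", shallow_value)
print("deep:", deep_rank, "lower bound:", deep_bound)
```

In this tiny setting the shallow rank is capped by the two hidden channels, whereas the deep network with the same number of channels per layer is expected to exceed that cap.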
The above theorem readily implies that depth entails an enhanced ability of recurrent networks to model long-term temporal dependencies in the sequential input. Specifically, Theorem 4.1 indicates depth efficiency: it ensures that upon randomizing the weights of a deep RAC with $R$ hidden channels per layer, with probability $1$ the function realized by it after $T$ time-steps may only be realized by a shallow RAC with a number of hidden channels that is combinatorially large. Stated alternatively, this means that almost all functional dependencies which lie in the hypothesis space of deep RACs with $R$ hidden channels per layer, calculated after $T$ time-steps, are inaccessible to shallow RACs with less than a super-linear number of hidden channels. Thus, a shallow recurrent network would require an impractical number of parameters if it is to implement the same function as a deep recurrent network.
The established role of the Start-End separation rank as a dependency measure between the beginning and the end of the sequence (see Section 3.1) implies that these functions, which are realized by almost any deep network and can never be realized by a shallow network of a reasonable size, represent more elaborate dependencies over longer periods of time. The above notion is strengthened by the fact that the Start-End separation rank of deep RACs increases with the sequence length $T$, while the Start-End separation rank of shallow RACs is independent of it. This indicates that shallow recurrent networks are much more restricted in modeling long-term dependencies than the deep ones, which enjoy a combinatorially increasing Start-End separation rank as time progresses. Below, we present an outline of the proof for Theorem 4.1 (see Appendix B for the full proof):
Proof sketch of Theorem 4.1.
For a shallow network, Claim 3.2 establishes that the Start-End separation rank of the function realized by a shallow ($L=1$) RAC is equal to the rank of the matrix obtained by matricizing the corresponding shallow RAC weights tensor [Equation (6)] according to the Start-End partition: $\mathrm{sep}_{(S,E)}(y_c^{T,1,\Theta}) = \mathrm{rank}([\![\mathcal{A}_c^{T,1,\Theta}]\!]_{S,E})$. Thus, it suffices to prove that $\mathrm{rank}([\![\mathcal{A}_c^{T,1,\Theta}]\!]_{S,E}) = R$ in order to satisfy bullet (1) of the theorem, as the rank is trivially upper-bounded by the dimension of the matrix, $M^{T/2}$. To this end, we call upon the TT-decomposition of $\mathcal{A}_c^{T,1,\Theta}$, given by Equation (7), which corresponds to the MPS Tensor Network presented in Appendix A. We rely on a recent result stating that the rank of the matrix obtained by matricizing any tensor according to a partition $(S, E)$ is equal to a minimal cut separating $S$ from $E$ in the Tensor Network graph representing this tensor. The required equality follows from the fact that the TT-decomposition in Equation (7) is of TT-rank $R$, which in turn implies that the min-cut in the appropriate Tensor Network graph is equal to $R$.
For a deep network, Claim 3.2 assures us that the Start-End separation rank of the function realized by a depth $L=2$ RAC is lower bounded by the rank of the matrix obtained by the corresponding grid tensor matricization: $\mathrm{sep}_{(S,E)}(y_c^{T,2,\Theta}) \geq \mathrm{rank}([\![\mathcal{A}(y_c^{T,2,\Theta})]\!]_{S,E})$. Thus, proving that $\mathrm{rank}([\![\mathcal{A}(y_c^{T,2,\Theta})]\!]_{S,E}) \geq \left(\!\!\binom{\min\{M,R\}}{T/2}\!\!\right)$ for all values of the parameters but a set of Lebesgue measure zero would satisfy the theorem. We use a known lemma, which states that since the entries of $\mathcal{A}(y_c^{T,2,\Theta})$ are polynomials in the deep recurrent network’s weights, it suffices to find a single example for which the rank of the matricized grid tensor is at least the desired lower bound. Finding such an example would indeed imply that for almost all of the values of the network parameters, the desired inequality holds.
We choose a weight assignment such that the resulting matricized grid tensor resembles a matrix obtained by raising a rank-$\min\{M,R\}$ matrix to the Hadamard power of degree $T/2$. This operation, which raises each element of the original rank-$\min\{M,R\}$ matrix to the power of $T/2$, was shown to yield a matrix with a rank upper-bounded by the multiset coefficient $\left(\!\!\binom{\min\{M,R\}}{T/2}\!\!\right)$. We show that our assignment results in a matricized grid tensor with a rank which is not only upper-bounded by this value, but actually achieves it. Under our assignment, the matricized grid tensor takes the form:
where the set can be viewed as the set of all possible states of a bucket containing balls of colors, where for specifies the number of balls of the ’th color. By definition: and , therefore the matrices and uphold: ; , and for the theorem to follow we must show that they both are of rank (note that ).
We observe the sub-matrix defined by the subset of the rows of such that we select the row only if it upholds that . Note that there are exactly such rows, thus is a square matrix. Similarly we observe a sub-matrix of denoted , for which we select the column only if it upholds that , such that it is also a square matrix. Finally, by employing a variety of technical lemmas, we show that the determinants of these square matrices are non vanishing under the given assignment, thus satisfying the theorem.
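The Hadamard-power bound invoked in this proof sketch can be observed numerically. A rank-$r$ matrix has entries of the form $\sum_{i=1}^{r} u_i v_i$; raising an entry to the power $p$ expands into monomials of degree $p$ in $r$ variables, and there are $\binom{r+p-1}{p}$ of those, which gives the multiset upper bound. The sizes, the random matrix, and the helper name below are assumptions chosen for the sketch and are unrelated to the specific weight assignment of the proof.

```python
from math import comb
import numpy as np

def numerical_rank(matrix, rel_tol=1e-9):
    """Count singular values above rel_tol times the largest one."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
n, r, p = 12, 2, 3

# A random n x n matrix of rank r, built as a product of thin factors.
base = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

# Hadamard power: every entry is raised to the power p separately.
powered = base ** p

bound = comb(r + p - 1, p)  # multiset coefficient of r and p
base_rank = numerical_rank(base)
powered_rank = numerical_rank(powered)
print(base_rank, powered_rank, "<=", bound)
```

The proof goes one step further than this bound: it shows that for a suitable weight assignment the upper bound is actually attained.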
4.2 Increase of Memory Capacity with Depth
Theorem 4.1 provides a lower bound of $\left(\!\!\binom{\min\{M,R\}}{T/2}\!\!\right)$ on the Start-End separation rank of depth $L=2$ recurrent networks, combinatorially separating deep recurrent networks from shallow ones. By a trivial assignment of weights in higher layers, the Start-End separation rank of even deeper recurrent networks ($L>2$) is also lower-bounded by this expression, which does not depend on $L$. In the following, we conjecture that a tighter lower bound holds for networks of depth $L>2$, the form of which implies that the memory capacity of deep recurrent networks grows combinatorially with the network depth:
Under the same conditions as in Theorem 4.1, for all values of but a set of Lebesgue measure zero, it holds for any that:
We motivate Conjecture 4.2 by investigating the combinatorial nature of the computation performed by a deep RAC. By constructing Tensor Networks which correspond to deep RACs, we attain an informative visualization of this combinatorial perspective. In Appendix A, we provide full details of this Tensor Networks construction and present the formal motivation for the conjecture in Appendix A.4. Below, we qualitatively outline our approach.
A Tensor Network is essentially a graphical tool for representing algebraic operations which resemble multiplications of vectors and matrices, between higher order tensors. Figure 2 (top) shows an example of the Tensor Network representing the computation of a deep RAC over several time-steps. This well-defined computation graph hosts the values of the weight matrices at its nodes. The inputs are marked by their corresponding time-step $t$, and are integrated in a depth dependent and time-advancing manner (see further discussion regarding this form in Appendix A.3), as portrayed in the example of Figure 2. We highlight in red the basic unit in the Tensor Network which connects “Start” inputs and “End” inputs. In order to estimate a lower bound on the Start-End separation rank of a deep recurrent network of general depth, we employ a similar strategy to that presented in the proof sketch of the $L=2$ case (see Section 4.1). Specifically, we rely on the fact that it is sufficient to find a specific instance of the network parameters for which the matricized grid tensor achieves a certain rank, in order for this rank to bound the Start-End separation rank of the network from below.
Indeed, we find a specific assignment of the network weights, presented in Appendix A.4, for which the Tensor Network effectively takes the form of the basic unit connecting “Start” and “End”, raised to the power of the number of its repetitions in the graph (bottom of Figure 2). This basic unit corresponds to a simple computation represented by a grid tensor with Start-End matricization of rank . Raising such a matrix to the Hadamard power of any , results in a matrix with a rank upper bounded by , and the challenge of proving the conjecture amounts to proving that the upper bound is tight in this case. In Appendix A.4, we prove that the number of repetitions of the basic unit connecting “Start” and “End” in the deep RAC Tensor Network graph, is exactly equal to for any depth . For example, in the network illustrated in Figure 2, the number of repetitions indeed corresponds to . It is noteworthy that for the bound in Conjecture 4.2 coincides with the bounds that were proved for these depths in Theorem 4.1.
Conjecture 4.2 indicates that beyond the proved combinatorial advantage in memory capacity of deep networks over shallow ones, a further combinatorial separation may be shown between recurrent networks of different depths. We leave the proof of this result, which can reinforce and refine the understanding of advantages brought forth by depth in recurrent networks, as an open problem. In the following, we empirically investigate the theoretical outcomes presented in this section.
In this section, we provide an empirical demonstration supporting the theoretical findings of this paper. The results above are formulated for the class of RACs (presented in Section 2), and the experiments presented hereinafter demonstrate their extension to more commonly used RNN architectures. As noted in Section 1, the advantage of deep recurrent networks over shallow ones is well established empirically, as the best results on various sequential tasks have been achieved by stacking recurrent layers [8, 30, 18]. Below, we focus on two tasks which highlight the ‘long-term memory’ demand of recurrent networks, and show how depth empowers the network’s ability to express the appropriate distant temporal dependencies.
We address two synthetic problems. The first is the Copying Memory Task, to be described in Section 5.1, which was previously used to test proposed solutions to the gradient issues of backpropagation through time [24, 29, 3, 42, 25]. We employ this task as a test for the recurrent network’s expressive ability to ‘remember’ information seen in the distant past. The second task is referred to as the Start-End Similarity Task, to be described in Section 5.2, which is closely related to the Start-End separation rank measure proposed in Section 3.1. In both experiments we use a successful RNN variant referred to as Efficient Unitary Recurrent Neural Network (EURNN), which was shown to enable efficient optimization without the need to use gating units such as in LSTM networks to overcome the vanishing gradient problem. Moreover, EURNNs are known to perform exceptionally well on the Copying Memory Task. Specifically, we use EURNN in its most basic form, with orthogonal hidden-to-hidden matrices, and with the tunable parameter (see Section 4.2) set to 2. Under the notations we presented in Section 2 and portrayed in Figure 1, EURNNs employ , where is the modReLU function ( Section 4.5), and the matrices are restricted to being orthogonal. Throughout both experiments we use RMSprop as the optimization algorithm, where we took the best of several moving average discount factor values between (in accordance with ) and the default value of , and with a learning rate of . We use a training set of size 100K, a test set of size 10K, and a mini-batch size of .
The methodology we employ in the experiments below is aimed at testing the following practical hypothesis, which is commensurate with the theoretical outcomes in Section 4: Given a certain resource budget for a recurrent network that is intended to solve a ‘long-term memory problem’, adding recurrent layers is significantly preferable to increasing the number of channels in existing layers. Specifically, we train RNNs of depths , , and over increasingly hard variants of each problem (requiring longer-term memory), and report the maximal amount of memory capabilities for each architecture in Figures 3 and 4.
5.1 Copying Memory Task
In the Copying Memory Task, the network is required to memorize a sequence of characters of fixed length , and then to reproduce it after a long lag of time-steps, known as the delay time. The input sequence is composed of characters drawn from a given alphabet , and two special symbols: a blank symbol denoted by ‘_’, and a trigger symbol denoted by ‘:’. The input begins with a string of data characters randomly drawn from the alphabet, and followed by occurrences of blank symbols. On the ’th before last time-step the trigger symbol is entered, signaling that the data needs to be presented. Finally the input ends with an additional blank characters. In total, the sequence length is . The correct sequential output of this task is referred to as the target. The target character in every time-step is always the blank character, except for the last time-steps, in which the target is the original data characters of the input. For example, if and , then a legal input-output pair could be “ABA_____:__” and “________ABA”, respectively.
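The input and target layout described above can be sketched as a small generator. This is a minimal illustration whose layout is inferred from the example pair “ABA_____:__” / “________ABA”. The function and parameter names (`copying_example`, `data_len`, `n_blanks`) are assumptions made for this sketch and do not come from the paper.

```python
import random

BLANK, TRIGGER = "_", ":"


def copying_example(data_len, n_blanks, alphabet, rng):
    """Return one (input, target) pair for the Copying Memory Task."""
    data = "".join(rng.choice(alphabet) for _ in range(data_len))
    # Data characters, the delay blanks, the trigger, then (data_len - 1)
    # blanks so that the trigger sits on the data_len'th before last step.
    inputs = data + BLANK * n_blanks + TRIGGER + BLANK * (data_len - 1)
    # The target is blank everywhere except the last data_len time-steps.
    targets = BLANK * (len(inputs) - data_len) + data
    return inputs, targets


rng = random.Random(0)
inputs, targets = copying_example(data_len=3, n_blanks=5, alphabet="AB", rng=rng)
print(inputs)
print(targets)
```

With `data_len=3` and `n_blanks=5` the generated strings have the same shape as the example in the text: both are 11 characters long and the trigger is the third character from the end of the input.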
In essence, the data length and alphabet size control the number of bits to be memorized, and the delay time controls the time these bits need to stay in memory – together these parameters control the hardness of the task. Previous works have used values such as and  or similar, which amount to memorizing bits of information, for which it was demonstrated that even shallow recurrent networks are able to solve this task for delay times as long as or more. To allow us to properly separate between the performance of networks of different depths, we consider a much harder variant with and , which requires memorizing bits of information.
We present the results for this task in Figure 3, where we compare the performance for networks of depths and of size in the range of - , measured in the number of multiply-accumulate operations (MACs). Our measure of performance in the Copying Memory Task is referred to as the data-accuracy, calculated as , where is the sample size, the correct output character at time for example , and the predicted character. The data-accuracy effectively reflects the per-character data reproduction ability, therefore it is defined only over the final time-steps when the memorized data is to be reproduced. In Figure 3, we display for each network the longest delay time for which it is able to solve the task, demonstrating a clear advantage of depth in this task. We measure the size of the network using MACs because, while orthogonal matrices have an effectively smaller number of parameters, EURNNs still require the same number of MACs at inference time, hence MACs are a better representation of the resources they demand. Clearly, given a fixed amount of resources, it is advantageous to allocate them in a stacked layer fashion for this long-term memory based task.
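The data-accuracy measure can be sketched as follows. This assumes it is the fraction of correctly predicted characters, averaged over all examples and over the final time-steps in which the data is reproduced, as the verbal description suggests. The function name `data_accuracy` and the string-based representation of targets and predictions are assumptions made for illustration.

```python
def data_accuracy(targets, predictions, data_len):
    """Per-character accuracy over the last data_len time-steps of each example."""
    if not targets or len(targets) != len(predictions):
        raise ValueError("targets and predictions must be non-empty and aligned")
    correct = 0
    for target, predicted in zip(targets, predictions):
        # Only the final data_len positions count; earlier blanks are ignored.
        correct += sum(
            t == p for t, p in zip(target[-data_len:], predicted[-data_len:])
        )
    return correct / (len(targets) * data_len)


targets = ["________ABA", "________BBA"]
predictions = ["________ABB", "___A____BBA"]
print(data_accuracy(targets, predictions, data_len=3))
```

In this toy case five of the six data characters are reproduced correctly, and the spurious character predicted during the delay period of the second example does not affect the score.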
5.2 Start-End Similarity Task
The Start-End Similarity Task directly tests the recurrent network’s expressive ability to integrate between the two halves of the input sequence. In this task, the network needs to determine how similar the two halves are. The input is a sequence of characters from an alphabet , where the first characters are denoted by ‘Start’ and the rest by ‘End’, similarly to previous sections. Considering pairs of characters in the same relative position in ‘Start’ and ‘End’, i.e. the pairs , we divide each input sequence into one of the following classes:
1-similar: ‘Start’ and ‘End’ are exactly the same string.
0.5-similar: ‘Start’ and ‘End’ have exactly matching pairs of characters (a randomly positioned half of the string is identical, and the other half is not).
0-similar: no pair of characters match.
The task we examine is a classification task of a dataset distributed uniformly over these three classes. Here, the recurrent networks are to produce a meaningful output only in the last time-step, determining in which class the input was, i.e. how similar the beginning of the input sequence is to its end. Figure 4 shows the performance for networks of depths and sizes - , measured in MACs as explained above, on the Start-End Similarity Task. The clear advantage of depth is portrayed in this task as well, empirically demonstrating the enhanced ability of deep recurrent networks to model long-term elaborate dependencies in the input string.
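The three classes can be illustrated with a small generator. The sketch assumes that the 0.5-similar class has matching characters in exactly half of the paired positions, chosen at random, as the parenthetical remark in its definition suggests. The names `start_end_example` and `matching_pairs` are hypothetical and used only here.

```python
import random


def start_end_example(half_len, alphabet, similarity, rng):
    """Build a sequence whose 'End' half matches 'Start' in a given fraction of positions."""
    if len(set(alphabet)) < 2:
        raise ValueError("need at least two distinct characters to create mismatches")
    start = [rng.choice(alphabet) for _ in range(half_len)]
    n_match = round(similarity * half_len)
    match_positions = set(rng.sample(range(half_len), n_match))
    end = []
    for i, ch in enumerate(start):
        if i in match_positions:
            end.append(ch)  # matching pair
        else:
            # Draw any character different from the one in 'Start'.
            end.append(rng.choice([c for c in alphabet if c != ch]))
    return "".join(start + end)


def matching_pairs(sequence):
    """Count positions where the first and second halves agree."""
    half = len(sequence) // 2
    return sum(a == b for a, b in zip(sequence[:half], sequence[half:]))


rng = random.Random(0)
for similarity in (1.0, 0.5, 0.0):
    seq = start_end_example(half_len=8, alphabet="ABCD", similarity=similarity, rng=rng)
    print(similarity, seq, matching_pairs(seq))
```

A uniformly distributed dataset over the three classes could then be obtained by drawing `similarity` uniformly from the three values for each example.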
Overall, the empirical results presented in this section reflect well our theoretical findings, presented in Section 4.
The notion of depth efficiency, by which deep networks efficiently express functions that would require shallow networks to have a super-linear size, is well established in the context of convolutional networks. However, recurrent networks differ from convolutional networks, as they are suited by design to tackle inputs of varying lengths. Accordingly, depth efficiency alone does not account for the remarkable performance of deep recurrent networks on long input sequences. In this paper, we identified a fundamental need for a quantifier of ‘time-series expressivity’, quantifying the memory capacity of recurrent networks, which can account for the empirically undisputed advantage of depth in hard sequential tasks. In order to meet this need, we proposed a measure of the ability of recurrent networks to model long-term temporal dependencies, in the form of the Start-End separation rank. The separation rank was used to quantify dependencies in convolutional networks, and has roots in the field of quantum physics. The proposed Start-End separation rank measure adjusts itself to the temporal extent of the input series, and quantifies the ability of the recurrent network to correlate the incoming sequential data as time progresses.
We analyzed the class of Recurrent Arithmetic Circuits, which are closely related to successful RNN architectures, and proved that the Start-End separation rank of deep RACs increases combinatorially with the number of channels and as the input sequence extends, while that of shallow RACs increases linearly with the number of channels and is independent of the input sequence length. These results, which demonstrate that depth brings forth an overwhelming advantage in the ability of recurrent networks to model long-term dependencies, were achieved by combining tools from the fields of measure theory, tensorial analysis, combinatorics, graph theory and quantum physics. The above presented empirical evaluations support our theoretical findings, and provide a demonstration of their relevance for commonly used classes of recurrent networks.
Such analyses may be readily extended to other architectural features employed in modern recurrent networks. Indeed, the same time-series expressivity question may now be applied to the different variants of LSTM networks, and the proposed notion of Start-End separation rank may be employed for quantifying their memory capacity. We have demonstrated that such a treatment can go beyond unveiling the origins of the success of a certain architectural choice, and leads to new insights. The above established observation that dependencies achievable by vanilla shallow recurrent networks do not adapt at all to the sequence length, is an exemplar of this potential.
Moreover, practical recipes may emerge from such theoretical analyses. The experiments performed in , suggest that shallow layers of recurrent networks are related to short time-scales, e.g. in speech: phonemes, syllables, words, while deeper layers appear to support dependencies of longer time-scales, e.g. full sentences, elaborate questions. These findings open the door to further depth related investigations in recurrent networks, and specifically the role of each layer in modeling temporal dependencies may be better understood. Theoretical observations have previously been established which translate into practical conclusions regarding the number of hidden channels to be chosen for each layer in a deep convolutional network. The conjecture presented in this paper, by which the Start-End separation rank of recurrent networks grows combinatorially with depth, can similarly entail practical recipes for enhancing their memory capacity. Such analyses can lead to a profound understanding of the contribution of deep layers to the recurrent network’s memory. Indeed, we view this work as an important step towards novel methods of matching the recurrent network architecture to the temporal dependencies in a given sequential dataset.
We acknowledge useful discussions with Nati Linial and Noam Weis, as well as the contribution of Nadav Cohen who originally suggested the connection between shallow Recurrent Arithmetic Circuits and the Tensor Train decomposition.
This work is supported by the European Research Council (TheoryDL project) and by ISF Center grant 1790/12. Y.L. is supported by the Adams Fellowship Program of the Israel Academy of Sciences and Humanities.
-  Amini, A., Karbasi, A. & Marvasti, F. (2012) Low-Rank Matrix Approximation Using Point-Wise Operators. IEEE Transactions on Information Theory, 58(1), 302–310.
-  Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G. et al. (2016) Deep speech 2: End-to-end speech recognition in english and mandarin. in International Conference on Machine Learning, pp. 173–182.
-  Arjovsky, M., Shah, A. & Bengio, Y. (2016) Unitary evolution recurrent neural networks. in International Conference on Machine Learning, pp. 1120–1128.
-  Bahdanau, D., Cho, K. & Bengio, Y. (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
-  Beylkin, G., Garcke, J. & Mohlenkamp, M. J. (2009) Multivariate regression and machine learning with sums of separable functions. SIAM Journal on Scientific Computing, 31(3), 1840–1857.
-  Beylkin, G. & Mohlenkamp, M. J. (2002) Numerical operator calculus in higher dimensions. Proceedings of the National Academy of Sciences, 99(16), 10246–10251.
-  Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
-  Cireşan, D. C., Meier, U., Gambardella, L. M. & Schmidhuber, J. (2010) Deep, big, simple neural nets for handwritten digit recognition. Neural computation, 22(12), 3207–3220.
-  Cohen, N., Sharir, O. & Shashua, A. (2016) On the Expressive Power of Deep Learning: A Tensor Analysis. Conference On Learning Theory (COLT).
-  Cohen, N. & Shashua, A. (2016) Convolutional Rectifier Networks as Generalized Tensor Decompositions. International Conference on Machine Learning (ICML).
-  Cohen, N. & Shashua, A. (2017) Inductive bias of deep convolutional networks through pooling geometry. in 5th International Conference on Learning Representations (ICLR).
-  Cohen, N., Tamari, R. & Shashua, A. (2017) Boosting Dilated Convolutional Networks with Mixed Tensor Decompositions. arXiv preprint arXiv:1703.06846.
-  Delalleau, O. & Bengio, Y. (2011) Shallow vs. Deep Sum-Product Networks. in Advances in Neural Information Processing Systems 24, ed. by J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger, pp. 666–674. Curran Associates, Inc.
-  Eldan, R. & Shamir, O. (2016) The power of depth for feedforward neural networks. in Conference on Learning Theory, pp. 907–940.
-  Gers, F. A. & Schmidhuber, J. (2000) Recurrent nets that time and count. in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, vol. 3, pp. 189–194. IEEE.
-  Graves, A. (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
-  Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H. & Schmidhuber, J. (2009) A novel connectionist system for unconstrained handwriting recognition. IEEE transactions on pattern analysis and machine intelligence, 31(5), 855–868.
-  Graves, A., Mohamed, A.-r. & Hinton, G. (2013) Speech recognition with deep recurrent neural networks. in Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pp. 6645–6649. IEEE.
-  Hackbusch, W. (2006) On the efficient evaluation of coalescence integrals in population balance models. Computing, 78(2), 145–159.
-  Hackbusch, W. (2012) Tensor spaces and numerical tensor calculus, vol. 42. Springer Science & Business Media.
-  Hardy, G. H., Littlewood, J. E. & Pólya, G. (1952) Inequalities. Cambridge university press.
-  Harrison, R. J., Fann, G. I., Yanai, T. & Beylkin, G. (2003) Multiresolution quantum chemistry in multiwavelet bases. in Computational Science-ICCS 2003, pp. 103–110. Springer.
-  Hermans, M. & Schrauwen, B. (2013) Training and analysing deep recurrent neural networks. in Advances in Neural Information Processing Systems, pp. 190–198.
-  Hochreiter, S. & Schmidhuber, J. (1997) Long short-term memory. Neural computation, 9(8), 1735–1780.
-  Jing, L., Shen, Y., Dubček, T., Peurifoy, J., Skirlo, S., LeCun, Y., Tegmark, M. & Soljačić, M. (2016) Tunable efficient unitary neural networks (EUNN) and their application to RNNs. arXiv preprint arXiv:1612.05231.
-  Khrulkov, V., Novikov, A. & Oseledets, I. (2018) Expressive power of recurrent neural networks. in 6th International Conference on Learning Representations (ICLR).
-  LeCun, Y., Cortes, C. & Burges, C. J. (1998) The MNIST database of handwritten digits.
-  Levine, Y., Yakira, D., Cohen, N. & Shashua, A. (2018) Deep Learning and Quantum Entanglement: Fundamental Connections with Implications to Network Design. in 6th International Conference on Learning Representations (ICLR).
-  Martens, J. & Sutskever, I. (2011) Learning recurrent neural networks with hessian-free optimization. in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1033–1040. Citeseer.
-  Mohamed, A.-r., Dahl, G. E. & Hinton, G. (2012) Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 14–22.
-  Orús, R. (2014) A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349, 117–158.
-  Oseledets, I. V. (2011) Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5), 2295–2317.
-  Pascanu, R., Mikolov, T. & Bengio, Y. (2013) On the difficulty of training recurrent neural networks. in International Conference on Machine Learning, pp. 1310–1318.
-  Poon, H. & Domingos, P. (2011) Sum-product networks: A new deep architecture. in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 689–690. IEEE.
-  Schmidhuber, J. H. (1992) Learning complex, extended sequences using the principle of history compression. Neural Computation.
-  Sharir, O. & Shashua, A. (2018) On the Expressive Power of Overlapping Architectures of Deep Learning. in 6th International Conference on Learning Representations (ICLR).
-  Sharir, O., Tamari, R., Cohen, N. & Shashua, A. (2016) Tractable Generative Convolutional Arithmetic Circuits.
-  Stoudenmire, E. & Schwab, D. J. (2016) Supervised Learning with Tensor Networks. in Advances in Neural Information Processing Systems 29, ed. by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett, pp. 4799–4807. Curran Associates, Inc.
-  Sutskever, I., Martens, J. & Hinton, G. E. (2011) Generating text with recurrent neural networks. in Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017–1024.
-  Telgarsky, M. (2015) Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101.
-  Tieleman, T. & Hinton, G. (2012) Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 26–31.
-  Wisdom, S., Powers, T., Hershey, J., Le Roux, J. & Atlas, L. (2016) Full-capacity unitary recurrent neural networks. in Advances in Neural Information Processing Systems, pp. 4880–4888.
-  Wu, Y., Zhang, S., Zhang, Y., Bengio, Y. & Salakhutdinov, R. R. (2016) On multiplicative integration with recurrent neural networks. in Advances in Neural Information Processing Systems, pp. 2856–2864.
-  Zhang, S., Wu, Y., Che, T., Lin, Z., Memisevic, R., Salakhutdinov, R. R. & Bengio, Y. (2016) Architectural complexity measures of recurrent neural networks. in Advances in Neural Information Processing Systems, pp. 1822–1830.
Appendix A Tensor Network Representation of Recurrent Arithmetic Circuits
In this section, we expand our algebraic view on recurrent networks and make use of a graphical approach to tensor decompositions referred to as Tensor Networks (TNs). The tool of TNs is mainly used in the many-body quantum physics literature for a graphical decomposition of tensors, and has been recently connected to the deep learning field by , who constructed a deep convolutional network in terms of a TN. The use of TNs in machine learning has appeared in an empirical context, where trained a Matrix Product State (MPS) TN to perform supervised learning tasks on the MNIST dataset . The constructions presented in this section suggest a separation in expressiveness between recurrent networks of different depths, as formulated by Conjecture 4.2.
We begin in Appendix A.1 by providing a brief introduction to TNs. Next, we present in Appendix A.2 the TN which corresponds to the calculation of a shallow RAC, and tie it to a common TN architecture referred to as a Matrix Product State (MPS) (see overview in e.g. ), and equivalently to the tensor train (TT) decomposition . Subsequently, we present in Appendix A.3 a TN construction of a deep RAC, and emphasize the characteristics of this construction that are the origin of the enhanced ability of deep RACs to model elaborate temporal dependencies. Finally, in Appendix A.4, we make use of the above TNs construction in order to formally motivate Conjecture 4.2, according to which the Start-End separation rank of RACs grows combinatorially with depth.
A.1 Introduction to Tensor Networks
A TN is a weighted graph, where each node corresponds to a tensor whose order is equal to the degree of the node in the graph. Accordingly, the edges emanating out of a node, also referred to as its legs, represent the different modes of the corresponding tensor. The weight of each edge in the graph, also referred to as its bond dimension, is equal to the dimension of the appropriate tensor mode. In accordance with the relation between mode, dimension and index of a tensor presented in Section 3.2, each edge in a TN is represented by an index that runs between 1 and its bond dimension. Figure 5a shows three examples: (1) A vector, which is a tensor of order 1, is represented by a node with one leg. (2) A matrix, which is a tensor of order 2, is represented by a node with two legs. (3) Accordingly, a tensor of order N is represented in the TN as a node with N legs.
We move on to present the connectivity properties of a TN. Edges which connect two nodes in the TN represent an operation between the two corresponding tensors. An index which represents such an edge is called a contracted index, and the operation of contracting that index is in fact a summation over all of the values it can take. An index representing an edge with one loose end is called an open index. The tensor represented by the entire TN, whose order is equal to the number of open indices, can be calculated by summing over all of the contracted indices in the network. An example of a contraction of a simple TN is depicted in Figure 5b. There, a TN corresponding to the operation of multiplying a vector by a matrix is contracted by summing over the only contracted index. As there is only one open index, the result of contracting the network is an order 1 tensor (a vector). Though we use below the contraction of indices in more elaborate TNs, this operation can be essentially viewed as a generalization of matrix multiplication.
A.2 Shallow RAC Tensor Network
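As a concrete illustration of contracting a single index, the following sketch computes the matrix-vector product of the simple TN described above by explicitly summing over the shared index. The function name and the example values are illustrative assumptions.

```python
def contract_matrix_vector(matrix, vector):
    """Sum over the contracted index k; the row index d stays open."""
    return [
        sum(matrix[d][k] * vector[k] for k in range(len(vector)))
        for d in range(len(matrix))
    ]


matrix = [[1, 2], [3, 4], [5, 6]]  # node with two legs (open index d, contracted index k)
vector = [10, 1]                   # node with one leg (contracted index k)
print(contract_matrix_vector(matrix, vector))  # -> [12, 34, 56]
```

The result has one remaining open index, so it is an order 1 tensor, in line with the rule that the order of the contracted network equals the number of open indices.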
The computation of the output at time that is performed by the shallow recurrent network given by Equations (1) and (3), or alternatively by Equations (6) and (7), can be written in terms of a TN. Figure 6a shows this TN, which given some initial hidden state , is essentially a temporal concatenation of a unit cell that performs a similar computation at every time-step, as depicted in Figure 6b. For any time , this unit cell is composed of the input weights matrix, , contracted with the inputs vector, , and the hidden weights matrix, , contracted with the hidden state vector of the previous time-step, . The final component in each unit cell is the legged triangle representing the order tensor , referred to as the tensor, defined by:
with , i.e. its entries are equal to only on the super-diagonal and are zero otherwise. The use of a triangular node in the TN is intended to remind the reader of the restriction given in Equation (10). The recursive relation that is defined by the unit cell, is given by the TN in Figure 6b:
where . In the first equality, we simply follow the TN prescription and write a summation over all of the contracted indices in the left hand side of Figure 6b, in the second equality we use the definition of matrix multiplication, and in the last equality we use the definition of the tensor. The component-wise equality of Equation (11) readily implies , reproducing the recursive relation in Equations (1) and (3), which defines the operation of the shallow RAC. From the above treatment, it is evident that the restricted tensor is in fact the component in the TN that yields the element-wise multiplication property. After repetitions of the unit cell calculation with the sequential input , a final multiplication of the hidden state vector by the output weights matrix yields the output vector .
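The role of the restricted tensor can be checked with a short sketch. It assumes, following the description above, that this tensor is of order 3 with entries equal to 1 on its super-diagonal and 0 elsewhere, and it verifies that contracting it with the two matrix-vector products reproduces their element-wise product. All function names and the example weights are hypothetical.

```python
def delta_tensor(r):
    """Order-3 tensor with ones on the super-diagonal (i == j == k), zeros elsewhere."""
    return [[[1 if i == j == k else 0 for k in range(r)] for j in range(r)]
            for i in range(r)]


def matvec(m, v):
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


def unit_cell_contraction(w_in, x, w_hid, h_prev):
    """Follow the TN prescription: sum over both contracted legs of the delta tensor."""
    a = matvec(w_in, x)        # input weights contracted with the input vector
    b = matvec(w_hid, h_prev)  # hidden weights contracted with the previous state
    r = len(a)
    delta = delta_tensor(r)
    return [
        sum(delta[i][j][k] * a[j] * b[k] for j in range(r) for k in range(r))
        for i in range(r)
    ]


def unit_cell_elementwise(w_in, x, w_hid, h_prev):
    """The same update written directly as an element-wise product."""
    return [p * q for p, q in zip(matvec(w_in, x), matvec(w_hid, h_prev))]


w_in = [[1, 0, 2], [0, 1, 1]]   # 2 hidden channels, 3 input features
w_hid = [[1, 1], [2, 0]]
x, h_prev = [1, 2, 3], [1, 3]
print(unit_cell_contraction(w_in, x, w_hid, h_prev))
print(unit_cell_elementwise(w_in, x, w_hid, h_prev))
```

Both functions return the same vector, which is the sense in which the restricted tensor yields the element-wise multiplication property of the unit cell.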
The tensor network which represents the order shallow RAC weights tensor , which appears in Equations (6) and (7), is given by the TN in the upper part of Figure 6a. In Figure 6c, we show that by a simple contraction of indices, the TN representing the shallow RAC weights tensor can be drawn in the form of a standard MPS TN. This TN allows the representation of an order tensor with a linear (in ) amount of parameters, rather than the regular exponential amount ( has entries). The decomposition which corresponds to this MPS TN is known as the Tensor Train (TT) decomposition of rank in the tensor analysis community, its explicit form given in Equation (7).
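The gap between a linear and an exponential parameter count can be made concrete with a small calculation. The sketch counts the entries of a generic tensor-train with a uniform internal rank and boundary ranks of 1, and compares them with the number of entries of the full tensor. The function names and the example sizes are assumptions. Weight sharing across time-steps, as in the shallow RAC, would reduce the count further and is not modelled here.

```python
def tt_parameter_count(order, mode_dim, tt_rank):
    """Entries in a tensor-train with uniform internal rank and boundary ranks of 1."""
    ranks = [1] + [tt_rank] * (order - 1) + [1]
    # Core k has shape (ranks[k], mode_dim, ranks[k + 1]).
    return sum(ranks[k] * mode_dim * ranks[k + 1] for k in range(order))


def full_parameter_count(order, mode_dim):
    """Entries in an unconstrained tensor of the same order and mode dimension."""
    return mode_dim ** order


for order in (5, 10, 20):
    print(order, tt_parameter_count(order, 4, 3), full_parameter_count(order, 4))
```

The first count grows by a fixed amount per additional mode, while the second is multiplied by the mode dimension each time.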
The presentation of the shallow recurrent network in terms of a TN allows the employment of the min-cut analysis, which was introduced by in the context of convolutional networks, for quantification of the information flow across time modeled by the shallow recurrent network. This was indeed performed in our proof of the shallow case of Theorem 4.1 (see Appendix B.1 for further details). We now move on to present the computation performed by a deep recurrent network in the language of TNs.
A.3 Deep RAC Tensor Network
The construction of a TN which matches the calculation of a deep recurrent network is far less trivial than that of the shallow case, due to the seemingly innocent property of reusing information which lies at the heart of the calculation of deep recurrent networks. Specifically, all of the hidden states of the network are reused, since the state of each layer at every time-step is duplicated and sent as an input to the calculation of the same layer in the next time-step, and also as an input to the next layer up in the same time-step (see bottom of Figure 1). The required operation of duplicating a vector and sending it to be part of two different calculations, which is simply achieved in any practical setting, is actually impossible to represent in the framework of TNs. We formulate this notion in the following claim: Let be a vector. is represented by a node with one leg in the TN notation. The operation of duplicating this node, i.e. forming two separate nodes of degree , each equal to , cannot be achieved by any TN.
We assume by contradiction that there exists a Tensor Network which operates on any vector and clones it to two separate nodes of degree , each equal to , to form an overall TN representing . Component wise, this implies that upholds . By our assumption, duplicates the standard basis elements of , denoted , meaning that :