Recurrent Neural Networks are firmly established as one of the best deep learning techniques when the task at hand requires processing sequential data, such as text, audio, or video (Graves et al., 2013; Mikolov et al., 2011; Gers et al., 1999). The ability of these neural networks to efficiently represent a rich class of functions with a relatively small number of parameters is often referred to as depth efficiency, and the theory behind this phenomenon is not yet fully understood. A recent line of work (Cohen & Shashua, 2016; Cohen et al., 2016; Khrulkov et al., 2018; Cohen et al., 2018) focuses on comparing various deep learning architectures in terms of their expressive power.
It was shown in (Cohen et al., 2016) that ConvNets with product pooling are exponentially more expressive than shallow networks; that is, there exist functions realized by ConvNets which require an exponentially large number of parameters in order to be realized by shallow nets. A similar result also holds for RNNs with multiplicative recurrent cells (Khrulkov et al., 2018). We aim to extend this analysis to RNNs with rectifier nonlinearities, which are often used in practice. The main challenge of such an analysis is that the tools used for analyzing multiplicative networks, namely, properties of standard tensor decompositions and ideas from algebraic geometry, cannot be applied in this case, and thus some other approach is required. Our objective is to apply the machinery of generalized tensor decompositions and to show universality and the existence of depth efficiency in such RNNs.
2 Related work
Tensor methods have a rich history of successful application in machine learning. (Vasilescu & Terzopoulos, 2002), in their framework of TensorFaces, proposed to treat facial image data as multidimensional arrays and analyze them with tensor decompositions, which led to a significant boost in face recognition accuracy. (Bailey & Aeron, 2017) employed higher-order co-occurrence data and tensor factorization techniques to improve on word embedding models. Tensor methods also make it possible to produce more accurate and robust recommender systems by taking into account the multifaceted nature of real environments (Frolov & Oseledets, 2017).
In recent years a great deal of work was done in applications of tensor calculus to both theoretical and practical aspects of deep learning algorithms. (Lebedev et al., 2015) represented filters in a convolutional network with CP decomposition (Harshman, 1970; Carroll & Chang, 1970) which allowed for much faster inference at the cost of a negligible drop in performance. (Novikov et al., 2015) proposed to use Tensor Train (TT) decomposition (Oseledets, 2011) to compress fully–connected layers of large neural networks while preserving their expressive power. Later on, TT was exploited to reduce the number of parameters and improve the performance of recurrent networks in long–term forecasting (Yu et al., 2017) and video classification (Yang et al., 2017) problems.
In addition to the practical benefits, tensor decompositions were used to analyze theoretical aspects of deep neural nets. (Cohen et al., 2016) investigated a connection between various network architectures and tensor decompositions, which made it possible to compare their expressive power. Specifically, it was shown that CP and Hierarchical Tucker (Grasedyck, 2010) decompositions correspond to shallow networks and convolutional networks respectively. Recently, this analysis was extended by (Khrulkov et al., 2018), who showed that TT decomposition can be represented as a recurrent network with multiplicative connections. This specific form of RNNs was also shown empirically to provide a substantial performance boost over standard RNN models (Wu et al., 2016).
The first results on the connection between tensor decompositions and neural networks were obtained for rather simple architectures; later on, they were extended in order to analyze more practical deep neural nets. It was shown that the theoretical results can be generalized to a large class of CNNs with ReLU nonlinearities (Cohen & Shashua, 2016) and dilated convolutions (Cohen et al., 2018), providing valuable insights on how they can be improved. However, there is a missing piece in the whole picture, as theoretical properties of more complex nonlinear RNNs have yet to be analyzed. In this paper, we elaborate on this problem and present new tools for conducting a theoretical analysis of such RNNs, specifically when rectifier nonlinearities are used.
3 Architectures inspired by tensor decompositions
Let us now recall the known results about the connection of tensor decompositions and multiplicative architectures, and then show how they are generalized in order to include networks with ReLU nonlinearities.
3.1 Score functions and feature tensor
Suppose that we are given a dataset of objects with a sequential structure, i.e. every object in the dataset can be written as
We also introduce a parametric feature map which essentially preprocesses the data before it is fed into the network. Assumption 1 holds for many types of data, e.g. in the case of natural images we can cut them into rectangular patches which are then arranged into vectors. A typical choice for the feature map in this particular case is an affine map followed by a nonlinear activation: . To draw the connection between tensor decompositions and feature tensors we consider score functions (logits). By logits we mean the immediate outputs of the last hidden layer before a nonlinearity is applied; the term is adopted from classification tasks, where a neural network usually outputs logits and a subsequent softmax nonlinearity transforms them into valid probabilities. The score functions have the following form:
where is a trainable –way weight tensor and is a rank 1 feature tensor, defined as
where we have used the operation of outer product , which is important in tensor calculus. For a tensor of order and a tensor of order their outer product is a tensor of order defined as:
3.2 Tensor Decompositions
Working with the entire weight tensor in eq. 2 is impractical for large and , since it requires exponential in number of parameters. Thus, we compactly represent it using tensor decompositions, which in turn lead to different neural network architectures, referred to as tensor networks (Cichocki et al., 2017).
In the equation above, outer products are taken between scalars and coincide with the ordinary products between two numbers. However, we would like to keep this notation as it will come in handy later, when we generalize tensor decompositions to include various nonlinearities.
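The outer product notation and the score function of Section 3.1 can be illustrated numerically. The NumPy sketch below is only an illustration under stated assumptions: the names `feature_map`, `A`, `b`, `W` and the sizes `T`, `N`, `M` are hypothetical, and the ReLU-activated affine map is just one possible choice of feature map. It builds a rank 1 feature tensor, evaluates the score as an inner product with a full weight tensor, and checks that the outer product of two scalars is their ordinary product.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, M = 3, 5, 4  # hypothetical sequence length, input size, feature size

# Hypothetical feature map: affine map followed by a ReLU activation.
A = rng.standard_normal((M, N))
b = rng.standard_normal(M)

def feature_map(x):
    return np.maximum(A @ x + b, 0.0)

def outer(p, q):
    # Outer product of tensors: the order of the result is the sum of the orders.
    return np.multiply.outer(p, q)

# A sequence of T input vectors and its rank 1 feature tensor.
X = rng.standard_normal((T, N))
features = [feature_map(x) for x in X]
Phi = features[0]
for f in features[1:]:
    Phi = outer(Phi, f)

# Score as the inner product of a T-way weight tensor with the feature tensor.
# The full weight tensor has M ** T entries, which is what makes it impractical.
W = rng.standard_normal((M,) * T)
score = float(np.sum(W * Phi))

# The same score obtained by contracting W with one feature vector per mode.
contracted = W
for f in features:
    contracted = np.tensordot(f, contracted, axes=(0, 0))
score_contracted = float(contracted)

# Between scalars the outer product reduces to the ordinary product.
scalar_outer = outer(2.0, 3.0)
```

The second way of computing the score never forms the feature tensor explicitly, which is the pattern that the decompositions in this section exploit.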
Another tensor decomposition is Tensor Train (TT) decomposition (Oseledets, 2011) which is defined as follows
where and by definition. If we gather vectors for all corresponding indices and we will obtain three–dimensional tensors (for and we will get matrices and ). The set of all such tensors is called TT–cores and minimal values of such that decomposition equation 7 exists are called TT–ranks. In the case of TT decomposition, the score function has the following form:
3.3 Connection between TT and RNN
Now we want to show that the score function for the Tensor Train decomposition exhibits a particular recurrent structure similar to that of an RNN. We define the following hidden states:
Such a definition of hidden states allows for a more compact form of the score function.
Under the notation introduced in eq. 9, the score function can be written as
Note that with the help of TT–cores we can rewrite eq. 9 in a more convenient index form:
where the operation of tensor contraction is used. Combining all weights from and into a single variable and denoting the composition of feature map, outer product, and contraction as we arrive at the following vector form:
This equation can be considered as a generalization of hidden state equation for Recurrent Neural Networks as here all hidden states may in general have different dimensionalities and weight tensors depend on the time step. However, if we set and we will get simplified hidden state equation used in standard recurrent architectures:
Note that this equation is applicable to all hidden states except for the first and for the last , due to the two–dimensional nature of the corresponding TT–cores. However, we can always pad the input sequence with two auxiliary vectors and to get full compliance with the standard RNN structure. Figure 1 depicts the tensor network induced by the TT decomposition with cores .
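The recurrent structure described in this section can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: it assumes TT–cores stored as arrays of shape `(r_prev, M, r_next)` with boundary ranks equal to 1, uses random vectors as stand-ins for the feature vectors of one sequence, and all names (`cores`, `feats`, `tt_score_recurrent`, `tt_score_full`) are hypothetical. It compares the score computed through a chain of hidden states with the score computed from the full weight tensor.

```python
import numpy as np

rng = np.random.default_rng(1)
T, M, R = 4, 3, 2  # hypothetical sequence length, feature size, TT-rank
ranks = [1] + [R] * (T - 1) + [1]
cores = [rng.standard_normal((ranks[t], M, ranks[t + 1])) for t in range(T)]
feats = rng.standard_normal((T, M))  # stand-ins for the feature vectors

def tt_score_recurrent(cores, feats):
    h = np.ones(1)  # initial hidden state of size 1
    for core, f in zip(cores, feats):
        # Contract the previous hidden state and the current feature vector
        # with the current core to obtain the next hidden state.
        h = np.einsum("a,i,aib->b", h, f, core)
    return float(h[0])  # the last hidden state has size 1

def tt_score_full(cores, feats):
    # Reference computation: build the full weight tensor from the cores ...
    W = cores[0]
    for core in cores[1:]:
        W = np.tensordot(W, core, axes=([W.ndim - 1], [0]))
    W = W.reshape(W.shape[1:-1])  # drop the boundary rank indices of size 1
    # ... and take its inner product with the rank 1 feature tensor.
    Phi = feats[0]
    for f in feats[1:]:
        Phi = np.multiply.outer(Phi, f)
    return float(np.sum(W * Phi))

n_full = M ** T                    # entries of the full weight tensor
n_tt = sum(c.size for c in cores)  # entries stored in the TT-cores
```

The recurrent evaluation touches one core per step, so its cost grows linearly with the sequence length, whereas the reference computation stores a tensor whose size is exponential in it.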
4 Generalized tensor networks
4.1 Generalized outer product
In the previous section we showed that tensor decompositions correspond to neural networks of a specific structure, which are simplified versions of those used in practice as they contain multiplicative nonlinearities only. One possible way to introduce more practical nonlinearities is to replace the outer product in eq. 6 and eq. 10 with a generalized operator in analogy to kernel methods, where the scalar product is replaced by a nonlinear kernel function. Let be an associative and commutative binary operator ( and ). Note that this operator easily generalizes to an arbitrary number of operands due to associativity. For a tensor of order and a tensor of order we define their generalized outer product as an order tensor with entries given by:
Now we can replace in eqs. 6 and 10 with and get networks with various nonlinearities. For example, if we take we will get an RNN with rectifier nonlinearities; if we take we will get an RNN with softplus nonlinearities; if we take we will get a simple RNN defined in the previous section. Concretely, we will analyze the following networks.
Generalized shallow network with –nonlinearity
Parameters of the network:
Generalized RNN with –nonlinearity
Parameters of the network:
Note that in eq. 16 we have introduced the matrices acting on the input states. The purpose of this modification is to obtain the plausible property of generalized shallow networks being able to be represented as generalized RNNs of width (i.e., with all ) for an arbitrary nonlinearity . In the case of , the matrices were not necessary, since they can be simply absorbed by via tensor contraction (see Appendix A for further clarification on these points).
Initial hidden state
Note that generalized RNNs require some choice of the initial hidden state . We find that it is convenient both for theoretical analysis and in practice to initialize as unit of the operator , i.e. such an element that . Henceforth, we will assume that such an element exists (e.g., for we take , for we take ), and set . For example, in eq. 9 it was implicitly assumed that .
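The constructions of this section can be sketched in a few lines of NumPy. Everything below rests on assumptions that the text above does not spell out and should be read as a hedged illustration only: the rectifier operator is taken to be `max(x, y, 0)` in the spirit of (Cohen & Shashua, 2016), the hidden state update is assumed to contract a three-way core with the generalized outer product of a linearly transformed feature vector and the previous hidden state, and all names (`generalized_outer`, `generalized_rnn_score`, `generalized_shallow_score`, `Cs`, `Gs`, `unit`) are hypothetical. The example checks that a rank 1 generalized shallow network coincides with a generalized RNN whose hidden states all have size 1.

```python
import numpy as np

def rectifier(x, y):
    # Assumed rectifier operator: max(x, y, 0).
    return np.maximum(np.maximum(x, y), 0.0)

def product(x, y):
    # Ordinary product, which recovers the standard outer product.
    return x * y

def generalized_outer(a, b, xi):
    # Entry (i..., j...) of the result equals xi(a[i...], b[j...]).
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return xi(a.reshape(a.shape + (1,) * b.ndim), b)

def generalized_rnn_score(feats, Cs, Gs, xi, unit):
    # The hidden state starts as the unit of xi (0 for the rectifier, 1 for the product).
    h = np.full(Gs[0].shape[1], unit, dtype=float)
    for f, C, G in zip(feats, Cs, Gs):
        z = generalized_outer(C @ f, h, xi)  # shape (L, R_prev)
        h = np.einsum("ij,ijk->k", z, G)     # shape (R_next,)
    return float(h[0])

def generalized_shallow_score(feats, lambdas, Vs, xi):
    # Vs[t] has shape (R, M); row r is the weight vector of step t in term r.
    acc = Vs[0] @ feats[0]
    for V, f in zip(Vs[1:], feats[1:]):
        acc = xi(acc, V @ f)  # xi applied elementwise across the R terms
    return float(lambdas @ acc)

rng = np.random.default_rng(0)
T, M = 4, 3
feats = rng.standard_normal((T, M))  # stand-ins for the feature vectors

# A rank 1 shallow network rewritten as a generalized RNN with unit ranks.
lam = 0.7
vs = rng.standard_normal((T, M))
Cs = [v.reshape(1, M) for v in vs]
Gs = [np.ones((1, 1, 1)) for _ in range(T)]
Gs[-1] = lam * np.ones((1, 1, 1))
rnn = generalized_rnn_score(feats, Cs, Gs, rectifier, unit=0.0)
shallow = generalized_shallow_score(
    feats, np.array([lam]), [v.reshape(1, M) for v in vs], rectifier
)
```

Passing `product` with `unit=1.0` instead of `rectifier` with `unit=0.0` gives the multiplicative recurrence of Section 3.3 under the same assumptions.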
4.2 Grid tensors
The introduction of the generalized outer product allows us to investigate RNNs with a wide class of nonlinear activation functions, especially ReLU. While this change looks appealing from the practical viewpoint, it complicates the subsequent theoretical analysis, as the transition from the obtained networks back to tensors is not straightforward.
In the discussion above, every tensor network had a corresponding weight tensor and we could compare the expressivity of the associated score functions by comparing some properties of these tensors, such as ranks (Khrulkov et al., 2018; Cohen et al., 2016). This method enabled a comprehensive analysis of score functions, as it allows us to calculate and compare their values for all possible input sequences . Unfortunately, we cannot apply it in the case of generalized tensor networks, as the replacement of the standard outer product with its generalized version leads to the loss of conformity between tensor networks and weight tensors. Specifically, it is no longer true that for every generalized tensor network with corresponding score function there exists a weight tensor such that . Also, such properties as universality no longer hold automatically and we have to prove them separately. Indeed, as was noticed in (Cohen & Shashua, 2016), shallow networks with no longer have the universal approximation property. In order to conduct a proper theoretical analysis, we adopt the apparatus of so-called grid tensors, first introduced in (Cohen & Shashua, 2016).
Given a set of fixed vectors referred to as templates, the grid tensor of is defined to be the tensor of order and dimension in each mode, with entries given by:
where each index can take values from , i.e. we evaluate the score function on every possible input assembled from the template vectors . To put it simply, we previously considered the equality of score functions represented by a tensor decomposition and a tensor network on the set of all possible input sequences , and now we restrict this set to an exponentially large but finite grid of sequences consisting of template vectors only.
Define the matrix which holds the values taken by the representation function on the selected templates :
Using the matrix we note that the grid tensor of generalized shallow network has the following form (see Appendix A for derivation):
Construction of the grid tensor for generalized RNN is a bit more involved. We find that its grid tensor can be computed recursively, similar to the hidden state in the case of a single input sequence. The exact formulas turned out to be rather cumbersome and we moved them to Appendix A.
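The definition of a grid tensor translates directly into a brute-force evaluation loop. The sketch below is a minimal illustration: the function name `grid_tensor`, the toy score function, and the template values are all hypothetical. It evaluates a score function on every sequence assembled from a fixed set of templates and stores the values in a tensor with one mode per position in the sequence.

```python
import itertools
import numpy as np

def grid_tensor(score_fn, templates, T):
    """Evaluate score_fn on every length-T sequence assembled from the templates."""
    n = len(templates)
    grid = np.empty((n,) * T)
    for idx in itertools.product(range(n), repeat=T):
        grid[idx] = score_fn([templates[i] for i in idx])
    return grid

# Toy score function (an assumption for illustration): product of coordinate sums.
def toy_score(sequence):
    return float(np.prod([x.sum() for x in sequence]))

templates = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([2.0, 1.0])]
T = 3
grid = grid_tensor(toy_score, templates, T)

# For this toy score the grid tensor is rank 1: an outer product of one vector.
s = np.array([x.sum() for x in templates])
rank_one = np.multiply.outer(np.multiply.outer(s, s), s)
```

The loop visits every one of the `len(templates) ** T` sequences, so it is only practical for small examples; this is the exponentially large but finite grid mentioned above.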
5 Main results
With grid tensors at hand we are ready to compare the expressive power of generalized RNNs and generalized shallow networks. In the further analysis, we will assume that , i.e., we analyze RNNs and shallow networks with rectifier nonlinearity. However, we need to make two additional assumptions. First of all, similarly to (Cohen & Shashua, 2016) we fix some templates such that values of the score function outside of the grid generated by are irrelevant for classification and call them covering templates. It was argued that for image data values of of order are sufficient (corresponding covering template vectors may represent Gabor filters). Secondly, we assume that the feature matrix is invertible, which is a reasonable assumption and in the case of for any distinct template vectors the parameters and can be chosen in such a way that the matrix is invertible.
As was discussed in section 4.2, we can no longer use standard algebraic techniques to verify universality of tensor based networks. Thus, our first result states that generalized RNNs with are universal in the sense that any tensor of order and size of each mode being can be realized as a grid tensor of such an RNN (and similarly of a generalized shallow network).
Theorem 5.1 (Universality).
Let be an arbitrary tensor of order . Then there exist a generalized shallow network and a generalized RNN with rectifier nonlinearity such that grid tensor of each of the networks coincides with .
Part of Theorem 5.1 which corresponds to generalized shallow networks readily follows from (Cohen & Shashua, 2016, Claim 4). In order to prove the statement for the RNNs the following two lemmas are used.
Lemma 5.1.
Given two generalized RNNs with grid tensors , , and arbitrary -nonlinearity, there exists a generalized RNN with grid tensor satisfying
This lemma essentially states that the collection of grid tensors of generalized RNNs with any nonlinearity is closed under taking arbitrary linear combinations. Note that the same result clearly holds for generalized shallow networks because they are linear combinations of rank shallow networks by definition.
Lemma 5.2.
Let be an arbitrary one–hot tensor, defined as
Then there exists a generalized RNN with rectifier nonlinearities such that its grid tensor satisfies
This lemma states that in the special case of rectifier nonlinearity any basis tensor can be realized by some generalized RNN.
Proof of Theorem 5.1.
By Lemma 5.2 for each one–hot tensor there exists a generalized RNN with rectifier nonlinearities, such that its grid tensor coincides with this tensor. Thus, by Lemma 5.1 we can construct an RNN with
For generalized shallow networks with rectifier nonlinearities see the proof of (Cohen & Shashua, 2016, Claim 4). ∎
We see that at least with such nonlinearities as and all the networks under consideration are universal and can represent any possible grid tensor. Now let us head to a discussion of expressivity of these networks.
As was discussed in the introduction, expressivity refers to the ability of some class of networks to represent the same functions as some other class much more compactly. In our case the parameters defining size of networks are ranks of the decomposition, i.e. in the case of generalized RNNs ranks determine the size of the hidden state, and in the case of generalized shallow networks rank determines the width of a network. It was proven in (Cohen et al., 2016; Khrulkov et al., 2018) that ConvNets and RNNs with multiplicative nonlinearities are exponentially more expressive than the equivalent shallow networks: shallow networks of exponentially large width are required to realize the same score functions as computed by these deep architectures. Similarly to the case of ConvNets (Cohen & Shashua, 2016), we find that expressivity of generalized RNNs with rectifier nonlinearity holds only partially, as discussed in the following two theorems. For simplicity, we assume that is even.
Theorem 5.2 (Expressivity I).
For every value of there exists a generalized RNN with ranks and rectifier nonlinearity which is exponentially more efficient than shallow networks, i.e., the corresponding grid tensor may be realized only by a shallow network with rectifier nonlinearity of width at least .
This result states that at least for some subset of generalized RNNs expressivity holds: exponentially wide shallow networks are required to realize the same grid tensor. Proof of the theorem is rather straightforward: we explicitly construct an example of such RNN which satisfies the following description. Given an arbitrary input sequence assembled from the templates, these networks (if ) produce if has the property that , and in every other case, i.e. they measure pairwise similarity of the input vectors. A precise proof is given in Appendix A.
In the case of multiplicative RNNs (Khrulkov et al., 2018) almost every network possessed this property. This is not the case, however, for generalized RNNs with rectifier nonlinearities.
Theorem 5.3 (Expressivity II).
For every value of there exists an open set (which thus has positive measure) of generalized RNNs with rectifier nonlinearity , such that for each RNN in this open set the corresponding grid tensor can be realized by a rank shallow network with rectifier nonlinearity.
In other words, for every rank we can find a set of generalized RNNs of positive measure such that the property of expressivity does not hold. In the numerical experiments in Section 6 and Appendix A we validate whether this can be observed in practice, and find that the probability of obtaining CP–ranks of polynomial size becomes negligible with large and . Proof of Theorem 5.3 is provided in Appendix A.
Note that all the RNNs used in practice have shared weights, which allows them to process sequences of arbitrary length. So far in the analysis we have not made such assumptions about RNNs (i.e., ). By imposing this constraint, we lose the property of universality; however, we believe that the statements of Theorems 5.3 and 5.2 still hold (without requiring that shallow networks also have shared weights). Note that the example constructed in the proof of Theorem 5.3 already has this property, and for Theorem 5.2 we provide numerical evidence in Appendix A.
6 Experiments
In this section, we study whether our theoretical findings are supported by experimental data. First, we investigate whether generalized tensor networks can be used in practical settings, especially in problems typically solved by RNNs (such as natural language processing problems). Second, according to Theorem 5.3, for some subset of RNNs the equivalent shallow network may have a low rank. To get a grasp of how strong this effect might be in practice, we numerically compute an estimate for this rank in various settings.
For the first experiment, we use two computer vision datasets, MNIST (LeCun et al., 1990) and CIFAR–10 (Krizhevsky & Hinton, 2009), and a natural language processing dataset for sentiment analysis, IMDB (Maas et al., 2011). For the first two datasets, we cut natural images into rectangular patches which are then arranged into vectors (similar to (Khrulkov et al., 2018)), and for the IMDB dataset the input data already has the desired sequential structure.
Figure 2 depicts test accuracy on the IMDB dataset for generalized shallow networks and RNNs with rectifier nonlinearity. We see that a generalized shallow network of much higher rank is required to reach a level of performance close to that achievable by a generalized RNN. Due to limited space, we have moved the results of the experiments on the visual datasets to Appendix B.
For the second experiment we generate a number of generalized RNNs with different values of TT-rank and calculate a lower bound on the rank of the shallow network necessary to realize the same grid tensor (to estimate the rank we use the same technique as in the proof of Theorem 5.2). Figure 3 shows that for different values of and generalized RNNs of the corresponding rank there exist shallow networks of rank realizing the same grid tensor, which agrees well with Theorem 5.3. This result looks discouraging; however, there is also a positive observation. As the rank of generalized RNNs increases, more and more of the corresponding shallow networks will necessarily have exponentially higher rank. In practice we usually deal with RNNs of (dimension of hidden states), thus we may expect that effectively any function, besides a negligible set, realized by generalized RNNs can be implemented only by exponentially wider shallow networks. The numerical results for the case of shared cores and other nonlinearities are given in Appendix B.
7 Conclusion
In this paper, we sought a more complete picture of the connection between Recurrent Neural Networks and Tensor Train decomposition, one that involves various nonlinearities applied to hidden states. We showed how these nonlinearities could be incorporated into network architectures and provided a complete theoretical analysis of the particular case of rectifier nonlinearity, elaborating on points of generality and expressive power. We believe our results will be useful to advance theoretical understanding of RNNs. In future work, we would like to extend the theoretical analysis to the architectures for processing sequential data that are most competitive in practice, such as LSTMs and attention mechanisms.
We would like to thank Andrzej Cichocki for constructive discussions during the preparation of the manuscript and anonymous reviewers for their valuable feedback. This work was supported by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001).
- Bailey & Aeron (2017) Eric Bailey and Shuchin Aeron. Word embeddings via tensor factorization. arXiv preprint arXiv:1704.02686, 2017.
- Carroll & Chang (1970) J Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition. Psychometrika, 1970.
- Cichocki et al. (2017) Andrzej Cichocki, Anh-Huy Phan, Qibin Zhao, Namgil Lee, Ivan Oseledets, Masashi Sugiyama, Danilo P Mandic, et al. Tensor networks for dimensionality reduction and large-scale optimization: Part 2 applications and future perspectives. Foundations and Trends® in Machine Learning, 9(6):431–673, 2017.
- Cohen & Shashua (2016) Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning, pp. 955–963, 2016.
- Cohen et al. (2016) Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pp. 698–728, 2016.
- Cohen et al. (2018) Nadav Cohen, Ronen Tamari, and Amnon Shashua. Boosting dilated convolutional networks with mixed tensor decompositions. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1JHhv6TW.
- Frolov & Oseledets (2017) Evgeny Frolov and Ivan Oseledets. Tensor methods and recommender systems. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(3):e1201, 2017.
- Gers et al. (1999) Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.
- Girosi & Poggio (1990) Federico Girosi and Tomaso Poggio. Networks and the best approximation property. Biological cybernetics, 63(3):169–176, 1990.
- Grasedyck (2010) Lars Grasedyck. Hierarchical singular value decomposition of tensors. SIAM Journal on Matrix Analysis and Applications, 31(4):2029–2054, 2010.
- Graves et al. (2013) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649. IEEE, 2013.
- Harshman (1970) Richard A Harshman. Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multimodal factor analysis. 1970.
- Khrulkov et al. (2018) Valentin Khrulkov, Alexander Novikov, and Ivan Oseledets. Expressive power of recurrent neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1WRibb0Z.
- Kolda & Bader (2009) Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
- Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
- Lebedev et al. (2015) Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations, 2015.
- LeCun et al. (1990) Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pp. 396–404, 1990.
- Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.
- Mikolov et al. (2011) Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. Extensions of recurrent neural network language model. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pp. 5528–5531. IEEE, 2011.
- Novikov et al. (2015) Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pp. 442–450, 2015.
- Oseledets (2011) Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
- Vasilescu & Terzopoulos (2002) M Alex O Vasilescu and Demetri Terzopoulos. Multilinear analysis of image ensembles: Tensorfaces. In European Conference on Computer Vision, pp. 447–460. Springer, 2002.
- Wu et al. (2016) Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R Salakhutdinov. On multiplicative integration with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 2856–2864, 2016.
- Yang et al. (2017) Yinchong Yang, Denis Krompass, and Volker Tresp. Tensor-train recurrent neural networks for video classification. arXiv preprint arXiv:1707.01786, 2017.
- Yu et al. (2017) Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term forecasting using tensor-train RNNs. arXiv preprint arXiv:1711.00073, 2017.
Appendix A Proofs
If we replace the generalized outer product in eq. 16 with the standard outer product , we can subsume matrices into tensors without loss of generality.
Grid tensor of generalized shallow network has the following form (eq. 20):
Let denote an arbitrary sequence of templates. Corresponding element of the grid tensor defined in eq. 20 has the following form:
Grid tensor of a generalized RNN has the following form:
Let these RNNs be defined by the weight parameters
We claim that the desired grid tensor is given by the RNN with the following weight settings.
It is straightforward to verify that the network defined by these weights possesses the following property:
concluding the proof. We also note that these formulas generalize the well–known formulas for addition of two tensors in the Tensor Train format (Oseledets, 2011). ∎
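For reference, the classical addition formulas in the Tensor Train format (Oseledets, 2011) can be sketched as follows. This covers only the standard multiplicative case, not the generalized construction of the lemma, and it assumes cores stored as arrays of shape `(r_prev, M, r_next)` with boundary ranks equal to 1 and at least two cores; the names `tt_full` and `tt_add` are hypothetical.

```python
import numpy as np

def tt_full(cores):
    # Contract TT-cores of shape (r_prev, M, r_next) into the full tensor.
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([full.ndim - 1], [0]))
    return full.reshape(full.shape[1:-1])

def tt_add(cores_a, cores_b, alpha=1.0, beta=1.0):
    """Cores of alpha * A + beta * B for tensors A and B given in TT format."""
    T = len(cores_a)  # assumed to be at least 2
    out = []
    for t, (ga, gb) in enumerate(zip(cores_a, cores_b)):
        if t == 0:
            # First core: concatenate along the outgoing rank index, with weights.
            core = np.concatenate([alpha * ga, beta * gb], axis=2)
        elif t == T - 1:
            # Last core: stack along the incoming rank index.
            core = np.concatenate([ga, gb], axis=0)
        else:
            # Middle cores: block-diagonal in the two rank indices.
            ra, m, sa = ga.shape
            rb, _, sb = gb.shape
            core = np.zeros((ra + rb, m, sa + sb))
            core[:ra, :, :sa] = ga
            core[ra:, :, sa:] = gb
        out.append(core)
    return out

def random_tt(rng, T, M, R):
    ranks = [1] + [R] * (T - 1) + [1]
    return [rng.standard_normal((ranks[t], M, ranks[t + 1])) for t in range(T)]

rng = np.random.default_rng(0)
cores_a = random_tt(rng, T=4, M=3, R=2)
cores_b = random_tt(rng, T=4, M=3, R=3)
cores_sum = tt_add(cores_a, cores_b, alpha=2.0, beta=-0.5)
```

The ranks of the result are the sums of the ranks of the two operands, which mirrors how the width of the combined network grows when two networks are merged.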
For any associative and commutative binary operator , an arbitrary generalized rank shallow network with –nonlinearity can be represented in a form of generalized RNN with unit ranks () and –nonlinearity.
Let be the parameters specifying the given generalized shallow network. Then the following weight settings provide the equivalent generalized RNN (with being the unity of the operator ).
Indeed, in the notation defined above, hidden states of generalized RNN have the following form:
The score function of generalized RNN is given by eq. 16:
which coincides with the score function of rank 1 shallow network defined by parameters