Introduction
Recurrent Neural Networks (RNNs) have achieved great success in analyzing sequential data in various applications, such as computer vision [Byeon et al. 2015, Liang et al. 2016, Theis and Bethge 2015] and natural language processing, thanks to their ability to capture long-range dependencies in input sequences [Rumelhart, Hinton, and Williams 1988, Sutskever, Vinyals, and Le 2014]. To address the gradient vanishing issue, which often leads to the failure of long-term memory in vanilla RNNs, advanced variants such as the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) have been proposed and applied in many learning tasks [Byeon et al. 2015, Liang et al. 2016, Theis and Bethge 2015].

Despite this success, LSTMs and GRUs suffer from a huge number of parameters, which makes training notoriously difficult and prone to overfitting. In particular, in the task of action recognition from videos, a video frame usually forms a high-dimensional input, which makes the input-to-hidden matrix extremely large. For example, in UCF11 [Liu, Luo, and Shah 2009], a video action recognition dataset, each step of an RGB video clip is a frame of size $160 \times 120 \times 3$, so the dimension of the input vector fed to a vanilla RNN can be over 57,000. Assume that the length of the hidden-layer vector is 256. Then the input-to-hidden matrix alone carries over 14 million parameters. Although a preprocessing feature-extraction step via deep convolutional neural networks can be utilized to obtain static feature maps as inputs to RNNs [Donahue et al. 2015], the over-parameterization problem is still not fully solved.

A promising direction for reducing the parameter size is to exploit low-rank structure in the weight matrices. Inspired by the success of tensor decomposition methods in CNNs [Novikov et al. 2015, Li et al. 2017], various tensor decomposition methods have been explored in RNNs [Yang, Krompass, and Tresp 2017, Ye et al. 2018]. In particular, in [Yang, Krompass, and Tresp 2017], the tensor train (TT) decomposition has been applied to RNNs in an end-to-end way to replace the input-to-hidden matrix, achieving state-of-the-art performance in finding low-rank structure in RNNs. However, the restricted setting of the ranks and the constrained ordering of the core tensors make TT-RNN models sensitive to parameter selection. In detail, the optimal setting of the TT-ranks is that they are small in the border cores and large in the middle cores, e.g., like an olive [Zhao et al. 2016].
To address this issue, we propose to use the tensor ring decomposition (TRD) [Zhao et al. 2016] to extract the low-rank structure of the input-to-hidden matrix in RNNs. Specifically, the input-to-hidden matrices are reshaped into high-dimensional tensors and then factorized using TRD. Since the corresponding tensor ring layer automatically models the inter-parameter correlations, the number of parameters can be much smaller than the original size of the linear projection layer in standard RNNs. In this way, we present a new TR-RNN model with a similar representation power but with several orders of magnitude fewer parameters. In addition, since TRD can alleviate the strict constraints of the tensor train decomposition by interconnecting the first and the last core tensors circularly [Zhao et al. 2016], we expect TR-RNNs to have more expressive power. It is important to note that the tensor ring layer can be optimized in end-to-end training, and can also be utilized as a building block in current LSTM variants. For illustration, we implement an LSTM with the tensor ring layer, named TR-LSTM. (Note that the tensor ring layer can also be plugged into the vanilla RNN and GRU.)
We have conducted empirical evaluations on two real-world action recognition datasets, i.e., UCF11 and HMDB51 [Kuehne et al. 2011]. For a fair comparison with the standard LSTM, TT-LSTM [Yang, Krompass, and Tresp 2017], and BT-LSTM [Ye et al. 2018], we conduct experiments with end-to-end training. As shown in Figure 7, the proposed TR-LSTM obtains an accuracy of 0.869, which outperforms the standard LSTM (0.681), TT-LSTM (0.803), and BT-LSTM (0.856). Meanwhile, the compression ratio over the standard LSTM is over 34,000, which is also much higher than the compression ratios given by TT-LSTM and BT-LSTM. Moreover, with the output of a pre-trained CNN as the input to the LSTMs, TR-LSTM outperforms most previous competitors, including LSTM, TT-LSTM, BT-LSTM, and other recently proposed action recognition methods. Since the TR layer can be used as a building block in other LSTM-based approaches, such as the two-stream LSTM [Gammulle et al. 2017], we believe the proposed TR decomposition can be a promising approach for action recognition when combined with the tricks used in other state-of-the-art methods.
Model
To handle the high-dimensional inputs of RNNs, we introduce the tensor ring decomposition to represent the input-to-hidden layer of RNNs in a compact structure. In the following, we first present preliminaries and background on tensor decomposition, including graphical illustrations of the tensor train decomposition and the tensor ring decomposition, followed by our proposed LSTM model, namely TR-LSTM.
Preliminaries and Background
Notation
In this paper, a $d$-order tensor, e.g., $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_d}$, is denoted by a boldface Euler script letter. With all subscripts fixed, each element of a tensor is expressed as $\mathcal{T}(i_1, i_2, \ldots, i_d) \in \mathbb{R}$. Given a subset of subscripts, we can get a subtensor; for example, fixing the first subscript $i_1$ yields the subtensor $\mathcal{T}(i_1) \in \mathbb{R}^{I_2 \times \cdots \times I_d}$. Specifically, we denote a vector by a bold lowercase letter, e.g., $\mathbf{x}$, and matrices by bold uppercase letters, e.g., $\mathbf{W}$. We regard vectors and matrices as 1-order tensors and 2-order tensors, respectively. Figure 1 draws the tensor diagrams presenting the graphical notations and the essential operations.
Tensor Contraction
Tensor contraction can be performed between two tensors when some of their dimensions match. For example, given two 3-order tensors $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ and $\mathcal{B} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$, when $I_3 = J_1$, the contraction between these two tensors results in a tensor of size $I_1 \times I_2 \times J_2 \times J_3$, where the matched dimension is reduced, as shown in Equation (1):

$$(\mathcal{A} \times \mathcal{B})(i_1, i_2, j_2, j_3) = \sum_{k=1}^{I_3} \mathcal{A}(i_1, i_2, k)\,\mathcal{B}(k, j_2, j_3). \qquad (1)$$
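A minimal numerical sketch of the contraction in Equation (1), using NumPy's `einsum`; the tensor sizes below are arbitrary choices of ours:

```python
import numpy as np

# Contraction of two 3-order tensors A (I1 x I2 x I3) and B (J1 x J2 x J3)
# over the matched pair I3 == J1, yielding a 4-order tensor of size
# I1 x I2 x J2 x J3, as in Equation (1).
A = np.random.rand(2, 3, 4)
B = np.random.rand(4, 5, 6)

C = np.einsum('abk,kcd->abcd', A, B)  # sum over the shared index k
assert C.shape == (2, 3, 5, 6)
```

The same operation can be written as `np.tensordot(A, B, axes=([2], [0]))`; the repeated index in the `einsum` subscript string is what marks the reduced dimension.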
Tensor Train Decomposition
Through the tensor train decomposition (TTD), a high-order tensor can be decomposed as the product of a sequence of low-order core tensors. For example, a $d$-order tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times \cdots \times I_d}$ can be decomposed as follows:

$$\mathcal{T}(i_1, i_2, \ldots, i_d) = \mathcal{G}_1(i_1)\,\mathcal{G}_2(i_2)\cdots\mathcal{G}_d(i_d), \qquad (2)$$

where each $\mathcal{G}_k \in \mathbb{R}^{R_{k-1} \times I_k \times R_k}$ is called a core tensor and $\mathcal{G}_k(i_k) \in \mathbb{R}^{R_{k-1} \times R_k}$ denotes its $i_k$-th slice. The tensor train ranks (TT-ranks) $\{R_0, R_1, \ldots, R_d\}$ correspond to the complexity of the tensor train decomposition. Generally, in tensor train decomposition, the constraint $R_0 = R_d = 1$ should be satisfied and the other ranks are chosen manually. Figure 2 illustrates the form of tensor train decomposition.
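Equation (2) can be checked numerically by absorbing the cores from left to right; the helper below is our own illustrative sketch, not code from the paper:

```python
import numpy as np

def tt_to_full(cores):
    """Reconstruct a d-order tensor from TT cores G_k of shape
    (R_{k-1}, I_k, R_k), with boundary ranks R_0 = R_d = 1 (Equation (2))."""
    full = cores[0]                        # (1, I_1, R_1)
    for core in cores[1:]:
        # contract the running right rank with the next core's left rank
        full = np.einsum('...a,aib->...ib', full, core)
    return full.squeeze(axis=(0, -1))      # drop the two boundary ranks of size 1
```

For two cores with all ranks equal to 1, the reconstruction reduces to an outer product of the two mode vectors, which is a quick sanity check of the rank-1 case.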
Tensor Ring Decomposition
The main drawback of the tensor train decomposition is the restriction on its rank setting, which hinders the representation ability and flexibility of TT-based models. At the same time, a strict order must be followed when multiplying the TT cores, so the alignment of the tensor dimensions is extremely important for obtaining optimized TT cores, but finding the best alignment remains a challenging issue [Zhao et al. 2016].
In the tensor ring decomposition (TRD), an important modification is to interconnect the first and the last core tensors circularly, constructing a ring-like structure that alleviates the aforementioned limitations of the tensor train. Formally, we set $R_0 = R_d = R$ and conduct the decomposition as:

$$\mathcal{T}(i_1, i_2, \ldots, i_d) = \mathrm{Trace}\big(\mathcal{G}_1(i_1)\,\mathcal{G}_2(i_2)\cdots\mathcal{G}_d(i_d)\big). \qquad (3)$$
For a $d$-order tensor, by fixing an index $r \in \{1, \ldots, R\}$ in the ring dimension, the first order of the beginning core tensor $\mathcal{G}_1$ and the last order of the ending core tensor $\mathcal{G}_d$ can be reduced to matrices. Thus, along each of the $R$ slices, we can separate the tensor ring structure into a summation of $R$ tensor trains. For example, by fixing $r$, the product $\mathcal{G}_1(r, i_1, :)\,\mathcal{G}_2(i_2)\cdots\mathcal{G}_d(:, i_d, r)$ has the form of a tensor train decomposition. Therefore, the tensor ring model is essentially a linear combination of $R$ different tensor train models. Figure 3 demonstrates the tensor ring structure, and the alternative interpretation as a summation of multiple tensor train structures.
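A minimal sketch of Equation (3): reconstruct the full tensor from TR cores and close the ring with a trace. The helper name and the small shapes are our own choices:

```python
import numpy as np

def tr_to_full(cores):
    """Reconstruct a d-order tensor from TR cores G_k of shape
    (R_{k-1}, I_k, R_k), with R_0 = R_d = R; the trace closes the ring
    (Equation (3))."""
    full = cores[0]                        # (R, I_1, R_1)
    for core in cores[1:]:
        full = np.einsum('a...b,bic->a...ic', full, core)
    # full now has shape (R, I_1, ..., I_d, R); trace out the ring index
    return np.einsum('a...a->...', full)
```

Fixing a single value of the ring index instead of tracing yields exactly one of the $R$ tensor trains in the summation view described above.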
TR-RNN Model
The core concept of our model is elaborated in this section. By transforming the input-to-hidden weight matrices into TR form and applying them to the RNN and its variants, we obtain our TR-RNN models.
Tensorizing $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{W}$
Without loss of generality, we tensorize the input vector $\mathbf{x} \in \mathbb{R}^{I}$, output vector $\mathbf{y} \in \mathbb{R}^{O}$, and weight matrix $\mathbf{W} \in \mathbb{R}^{I \times O}$ into tensors $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_n}$, $\mathcal{Y} \in \mathbb{R}^{O_1 \times \cdots \times O_m}$, and $\mathcal{W} \in \mathbb{R}^{I_1 \times \cdots \times I_n \times O_1 \times \cdots \times O_m}$, shown in Equation (4):

$$\mathcal{X} = \mathrm{reshape}(\mathbf{x}), \quad \mathcal{Y} = \mathrm{reshape}(\mathbf{y}), \quad \mathcal{W} = \mathrm{reshape}(\mathbf{W}), \qquad (4)$$

where $I = \prod_{k=1}^{n} I_k$ and $O = \prod_{j=1}^{m} O_j$.
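As a concrete sketch of the tensorization in Equation (4): folding a flat input of length $I = \prod_k I_k$ into an $n$-order tensor is a plain reshape. The factorization $8 \times 20 \times 20 \times 18$ for a 57,600-dimensional frame is an illustrative choice on our part, not one fixed by the text:

```python
import numpy as np

# Tensorize x in R^I into a tensor of shape (I_1, ..., I_n) with I = prod I_k.
x = np.arange(57600, dtype=np.float64)
X = x.reshape(8, 20, 20, 18)       # 8 * 20 * 20 * 18 = 57,600

assert X.size == x.size            # reshaping preserves every entry
assert X[0, 0, 0, 1] == x[1]       # row-major (C-order) folding
```

Any factorization of 57,600 would do; in practice the factors are chosen to be roughly balanced so the TR cores stay small.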
Decomposing $\mathcal{W}$
For an $n$-order input and $m$-order output, we decompose the weight tensor $\mathcal{W}$ into the TR form with $n+m$ core tensors multiplied one by one, each of which corresponds to an input dimension or an output dimension, referring to Equation (5). Without loss of generality, the core tensors corresponding to the input dimensions and output dimensions are grouped respectively, as shown in Figure 4.

$$\mathcal{W}(i_1, \ldots, i_n, o_1, \ldots, o_m) = \mathrm{Trace}\big(\mathcal{G}_1(i_1)\cdots\mathcal{G}_n(i_n)\,\mathcal{G}_{n+1}(o_1)\cdots\mathcal{G}_{n+m}(o_m)\big). \qquad (5)$$
The tensor contraction from input to hidden layer in TR form is shown in Equation (6). We multiply the input tensor with the $n$ input core tensors and the $m$ output core tensors sequentially. The complexity analysis of the forward and backward processes is elaborated in the appendix.

$$\mathcal{Y}(o_1, \ldots, o_m) = \sum_{i_1, \ldots, i_n} \mathcal{X}(i_1, \ldots, i_n)\,\mathrm{Trace}\big(\mathcal{G}_1(i_1)\cdots\mathcal{G}_n(i_n)\,\mathcal{G}_{n+1}(o_1)\cdots\mathcal{G}_{n+m}(o_m)\big). \qquad (6)$$
Compared with the redundant input-to-hidden weight matrix, the compression ratio of the TR form is shown in Equation (7):

$$\text{compression ratio} = \frac{\prod_{k=1}^{n} I_k \prod_{j=1}^{m} O_j}{\sum_{k=1}^{n} R_{k-1} I_k R_k + \sum_{j=1}^{m} R_{n+j-1} O_j R_{n+j}}, \qquad (7)$$

where $R_{n+m} = R_0$ due to the ring closure.
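Equation (7) is easy to evaluate numerically. The sketch below assumes an illustrative folding of a $57{,}600 \to 256$ mapping (input as $8 \times 20 \times 20 \times 18$, output as $4 \times 4 \times 4 \times 4$, all TR-ranks 5); these shapes are our assumptions, not values fixed by the text:

```python
import numpy as np

def tr_param_count(dims, ranks):
    """Parameters of a TR layer whose k-th core has shape
    (R_{k-1}, dim_k, R_k), with R_0 = R_d (ring closure)."""
    d = len(dims)
    # ranks[k-1] with k = 0 picks ranks[-1], i.e. R_0 = R_d
    return sum(ranks[k - 1] * dims[k] * ranks[k] for k in range(d))

in_dims, out_dims = [8, 20, 20, 18], [4, 4, 4, 4]
dims = in_dims + out_dims
ranks = [5] * len(dims)

dense = int(np.prod(in_dims)) * int(np.prod(out_dims))  # 57,600 * 256 = 14,745,600
tr = tr_param_count(dims, ranks)                        # 25 * sum(dims) = 2,050
ratio = dense / tr                                      # roughly 7,000x here
```

Since every core costs $R^2$ times its mode size when all ranks equal $R$, the TR parameter count grows with the *sum* of the mode sizes rather than their product, which is where the compression comes from.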
Tensor Ring Layer (TRL)
After reshaping the input vector $\mathbf{x}$ and the weight matrix $\mathbf{W}$ into tensors, and decomposing the weight tensor into its TR representation, we can get the output tensor $\mathcal{Y}$ by contracting $\mathcal{X}$ with the core tensors of $\mathcal{W}$. The final output vector $\mathbf{y}$ can be obtained by reshaping the output tensor back into a vector. Because the weight matrix is factorized with TRD, we denote the whole calculation from the input vector to the output vector as the tensor ring layer (TRL):

$$\mathbf{y} = \mathrm{TRL}(\mathbf{W}, \mathbf{x}), \qquad (8)$$
which is illustrated in Figure 4.
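The whole TRL forward pass can be sketched as below. This is our own illustration of Equation (8) under assumed shapes (the function name and the demo sizes are ours); it contracts the input with the input cores, then the output cores, and closes the ring with a trace:

```python
import numpy as np

def trl_forward(x, cores, in_dims, out_dims):
    """Tensor ring layer: reshape x, contract with the n input cores and
    m output cores in turn, trace out the ring ranks, and flatten.
    Core k has shape (R_{k-1}, dim_k, R_k), with R_0 equal to the last rank."""
    n = len(in_dims)
    X = x.reshape(in_dims)
    # Absorb the first input dimension: result has shape (R_0, I_2..I_n, R_1).
    h = np.einsum('aib,i...->a...b', cores[0], X)
    # Absorb the remaining input dimensions one by one.
    for core in cores[1:n]:
        h = np.einsum('ai...b,bic->a...c', h, core)
    # h is now the (R_0 x R_n) "hidden matrix"; attach the output dimensions.
    for core in cores[n:]:
        h = np.einsum('a...b,bjc->a...jc', h, core)
    # Trace over the two open ring ranks and flatten to a vector.
    return np.einsum('a...a->...', h).reshape(-1)

# Tiny demo with assumed shapes: a 6-dimensional input mapped to 4 outputs.
rng = np.random.default_rng(0)
dims, ranks = [2, 3, 2, 2], [2, 3, 2, 3]   # (I_1, I_2, O_1, O_2), ranks R_1..R_4
cores, prev = [], ranks[-1]                # R_0 = R_4 closes the ring
for d, r in zip(dims, ranks):
    cores.append(rng.standard_normal((prev, d, r)))
    prev = r
y = trl_forward(rng.standard_normal(6), cores, (2, 3), (2, 2))
assert y.shape == (4,)
```

The contraction order matters only for cost, not for the result: any order yields the same output, but absorbing the input dimensions first collapses the large input into a small rank-by-rank matrix early.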
TR-RNN
By replacing the multiplication between the weight matrix and the input vector with the TRL in the vanilla RNN, we obtain our TR-RNN model. The hidden state $\mathbf{h}_t$ at time $t$ can be expressed as:

$$\mathbf{h}_t = \sigma\big(\mathrm{TRL}(\mathbf{W}, \mathbf{x}_t) + \mathbf{U}\mathbf{h}_{t-1} + \mathbf{b}\big), \qquad (9)$$

where $\sigma(\cdot)$ denotes the sigmoid function and the hidden state is denoted by $\mathbf{h}_t$. The input-to-hidden layer weight matrix is denoted by $\mathbf{W}$, and $\mathbf{U}$ denotes the hidden-to-hidden layer matrix.

TR-LSTM
By applying the TRL to the standard LSTM, the state-of-the-art variant of the RNN, we obtain the TR-LSTM model as follows:
$$\begin{aligned}
\mathbf{i}_t &= \sigma\big(\mathrm{TRL}(\mathbf{W}^{i}, \mathbf{x}_t) + \mathbf{U}^{i}\mathbf{h}_{t-1} + \mathbf{b}^{i}\big),\\
\mathbf{f}_t &= \sigma\big(\mathrm{TRL}(\mathbf{W}^{f}, \mathbf{x}_t) + \mathbf{U}^{f}\mathbf{h}_{t-1} + \mathbf{b}^{f}\big),\\
\mathbf{o}_t &= \sigma\big(\mathrm{TRL}(\mathbf{W}^{o}, \mathbf{x}_t) + \mathbf{U}^{o}\mathbf{h}_{t-1} + \mathbf{b}^{o}\big),\\
\mathbf{g}_t &= \tanh\big(\mathrm{TRL}(\mathbf{W}^{g}, \mathbf{x}_t) + \mathbf{U}^{g}\mathbf{h}_{t-1} + \mathbf{b}^{g}\big),\\
\mathbf{c}_t &= \mathbf{f}_t \circ \mathbf{c}_{t-1} + \mathbf{i}_t \circ \mathbf{g}_t,\\
\mathbf{h}_t &= \mathbf{o}_t \circ \tanh(\mathbf{c}_t),
\end{aligned} \qquad (10)$$

where $\circ$, $\sigma(\cdot)$, and $\tanh(\cdot)$ denote the element-wise product, the sigmoid function, and the hyperbolic tangent function, respectively. The weight matrices $\mathbf{W}^{*}$ (where $*$ can be $i$, $f$, $o$, or $g$) denote the mappings from the input to the hidden units, for the input gate $\mathbf{i}_t$, the forget gate $\mathbf{f}_t$, the output gate $\mathbf{o}_t$, and the cell update vector $\mathbf{g}_t$, respectively. The weight matrices $\mathbf{U}^{*}$ are defined similarly for the hidden state $\mathbf{h}_{t-1}$.
Remark. As shown in Equation (6) and demonstrated in Figure 4, the multiplication between the input tensor $\mathcal{X}$ and the input core tensors $\mathcal{G}_k$ (for $k = 1, \ldots, n$) produces a hidden matrix of size $R_0 \times R_n$. It is important to note that the size of this hidden matrix is much smaller than the original data size. In some sense, the “compressed” hidden matrix can be regarded as an information bottleneck [Tishby, Pereira, and Bialek 2000, Shwartz-Ziv and Tishby 2017], which seeks a balance between maximally compressing the input information and preserving the prediction information of the output. Thus the proposed TR-LSTM has high potential to reduce the redundant information in the high-dimensional input while achieving good performance compared with the standard LSTM.
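As a quick check of the sizes involved in this remark, the sketch below (our own illustration; the folded input shape and the ranks are assumptions) contracts a tensorized input with the input cores and compares the resulting hidden matrix against the raw input size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fold a 57,600-dimensional frame as 8x20x20x18 (an assumed factorization)
# and contract it with four input cores of shape (R_{k-1}, I_k, R_k), R = 5.
in_dims, R = (8, 20, 20, 18), 5
X = rng.standard_normal(in_dims)
cores = [rng.standard_normal((R, d, R)) for d in in_dims]

h = np.einsum('aib,i...->a...b', cores[0], X)    # absorb I_1
for core in cores[1:]:
    h = np.einsum('ai...b,bic->a...c', h, core)  # absorb I_2 .. I_n

assert h.shape == (R, R)  # the R_0 x R_n "hidden matrix"
# 57,600 input entries are summarized by R * R = 25 numbers here, which is
# the bottleneck behavior the remark describes.
```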
Experiments
To evaluate the proposed TR-LSTM model, we first design a synthetic experiment to validate the advantage of the tensor ring decomposition over the tensor train decomposition. On two real-world action recognition datasets, i.e., UCF11 (the YouTube action dataset) [Liu, Luo, and Shah 2009] and HMDB51 [Kuehne et al. 2011], we then evaluate our model in two settings: (1) end-to-end training, where video frames are directly fed into the TR-LSTM; and (2) pre-training with a CNN, where a pre-trained CNN is used to extract meaningful low-dimensional features that are then forwarded to the TR-LSTM. For a fair comparison, we first compare our proposed method with the standard LSTM and previous low-rank decomposition methods, and then with state-of-the-art action recognition methods.
Synthetic Experiment
To verify the effectiveness of tensor decomposition methods in recovering the original weights, we design a synthetic dataset. We are given a low-rank weight matrix $\mathbf{W}$, illustrated in Figure 5(a). We first sample 3200 examples $\mathbf{x}$, where each dimension follows a normal distribution, i.e., $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, where $\mathbf{I}$ is the identity matrix. We then calculate the response $\mathbf{y} = \mathbf{W}\mathbf{x} + \boldsymbol{\epsilon}$ for each $\mathbf{x}$, where $\boldsymbol{\epsilon}$ is random Gaussian noise with variance $\sigma^2$. Since $\mathbf{y}$ is generated from $\mathbf{W}$, the recovered weight matrix should be similar to the ground truth; we use the root mean square error (RMSE) to measure the performance. It should be noted that, since the purpose of this experiment is to provide a qualitative and intuitive comparison, we do not add any regularization to the models.

Based on the input data and responses, we estimate the weight matrix $\widehat{\mathbf{W}}$ by running linear regression, tensor train decomposition, and tensor ring decomposition, respectively. For tensor train and tensor ring, we first reshape the input data into a higher-order tensor and reshape the weight matrix into a tensor of the corresponding size. For illustration, Figure 5 shows one of the recovered $\widehat{\mathbf{W}}$ (reshaped as a matrix) for each of the three models when the noise variance is set to 0.05. Clearly, the proposed tensor ring model performs the best among the three; the tensor train model is even worse than the linear regression model. We further illustrate the recovery error of $\widehat{\mathbf{W}}$ under different levels of noise in Figure 6, which demonstrates that the weights recovered by the tensor ring model have the best tolerance to the injected noise.

Experiments on the UCF11 Dataset
The UCF11 dataset contains 1600 video clips divided into 11 action categories (e.g., basketball shooting, biking/cycling, diving, etc.). Each category consists of 25 groups of videos, with more than 4 clips in each group. It is a challenging dataset due to large variations in camera motion, object appearance and pose, object scale, cluttered background, and so on.
In this part, we conduct two experiments on this dataset, described as “End-to-End Training” and “Pre-train with CNN”. In “End-to-End Training”, we compare the proposed TR-LSTM model with other decomposition models (e.g., TT-LSTM [Yang, Krompass, and Tresp 2017] and BT-LSTM [Ye et al. 2018]) to show its superior performance. In the other experiment, we apply the decomposition to a more general model, achieving better performance with fewer parameters.
EndtoEnd Training
In recent years, several tensor decomposition models have been proposed to classify videos, such as TT-LSTM [Yang, Krompass, and Tresp 2017] and BT-LSTM [Ye et al. 2018]. Because they use end-to-end training, we set up this experiment to compare with them. In this experiment, we scale the original resolution down to $160 \times 120$ and randomly sample 6 frames from each video clip as the input data. Since every frame is RGB, the input vector at each step has length $160 \times 120 \times 3 = 57{,}600$, and there are 6 steps in every sample. We set the hidden layer size to 256, so the standard LSTM needs fully-connected input-to-hidden mappings with $57{,}600 \times 256 \times 4 \approx 59$ million parameters.

Table 1: Results of end-to-end training on UCF11.

Method  #Params  Accuracy
LSTM  59M  0.697
TT-LSTM  3,360  0.796
BT-LSTM  3,387  0.853
TR-LSTM  1,725  0.869
We compare our model with BT-LSTM and TT-LSTM, using a standard LSTM as a baseline. The hyper-parameters in BT-LSTM and TT-LSTM are set as reported in their papers. Figure 7(c) shows that all decomposition methods converge faster than the LSTM. The accuracy of BT-LSTM reaches 0.856, which is much higher than TT-LSTM with 0.803, while the LSTM only gains an accuracy of 0.681. In our TR-LSTM, the input and output tensors are reshaped to fixed shapes, and all the TR-ranks are set to 5 except one. Results are compared in Table 1. With 1,725 parameters, our model uses about half as many as TT-LSTM and BT-LSTM (3,360 and 3,387 parameters, respectively) and gains the top accuracy of 0.869, showing the outstanding performance of our model in this experiment.
Table 2: Comparison with state-of-the-art methods on UCF11.

Method  Accuracy

[Hasan and RoyChowdhury2014]  54.5% 
[Liu, Luo, and Shah2009]  71.2% 
[IkizlerCinbis and Sclaroff2010]  75.2% 
[Liu, Shyu, and Zhao2013]  76.1% 
[Sharma, Kiros, and Salakhutdinov2015]  85.0% 
[Wang et al.2011]  84.2% 
[Sharma, Kiros, and Salakhutdinov2015]  84.9% 
[Cho et al.2014]  88.0% 
[Gammulle et al.2017]  94.6% 
CNN + LSTM  92.3% 
CNN + TRLSTM  93.8% 
Pretrain with CNN
Recently, some RNN-based methods in computer vision have achieved higher accuracy by using extracted features as input vectors [Donahue et al. 2015]. Compared with raw frames, extracted features are more compact, but there is still room for improving the models: the over-parameterization problem is only partially solved. To get better performance, we use features extracted via the CNN model Inception-V3 as input data to the LSTM.
We set the size of the hidden layer to 2048, which is consistent with the size of the output of Inception-V3. Using the extracted features as the inputs of the LSTM, the accuracy of the vanilla LSTM attains 0.923, while our TR-LSTM model, with fixed TR-ranks, achieves 0.938. By replacing the standard LSTM with our model, a compression ratio of 25 can be obtained. We compare against several state-of-the-art methods on UCF11 in Table 2. The two-stream LSTM [Gammulle et al. 2017], which has the highest accuracy, has more than 141M parameters. The TR-LSTM can be used to replace the vanilla LSTMs in the two-stream LSTM model to reduce the parameters.
Experiments on the HMDB51 Dataset
The HMDB51 dataset is a large collection of realistic videos from various sources, such as movies and web videos. The dataset is composed of 6766 video clips from 51 action categories.
Table 3: Comparison with state-of-the-art methods on HMDB51.

Method  Accuracy

[Cai et al.2014]  55.9% 
[Wang and Schmid2013]  57.2% 
[Tran et al.2015]  56.8% 
[Feichtenhofer, Pinz, and Zisserman2016]  56.8% 
[Wang, Qiao, and Tang2015]  63.2% 
[Carreira and Zisserman2017]  66.4% 
[Ni et al.2015]  65.5% 
[Jain, Jegou, and Bouthemy2013]  52.1% 
[Zhu et al.2013]  54.0% 
CNN + LSTM  62.9% 
CNN + TRLSTM  63.8% 
In this experiment, we again use the features extracted via Inception-V3 as the input vectors and reshape them into a higher-order tensor. We randomly sample 12 frames from each video clip and process them through the CNN to form the input data. The hidden layer is likewise tensorized, with fixed TR-ranks. Several state-of-the-art models, such as I3D [Carreira and Zisserman 2017], are presented in Table 3. The I3D model with the highest accuracy, which is based on 3D ConvNets rather than RNNs, has 25M parameters, while the TR-LSTM model has only 0.7M parameters. The TR-LSTM gains a higher accuracy of 63.8% than the standard LSTM, with a compression ratio of 25.
Related Work
In the past decades, a number of variants of recurrent neural networks (RNNs) have been proposed to capture sequential information more accurately [Liu et al. 2018b]. However, when dealing with large input data in the field of computer vision, the input-to-hidden weight matrix carries a huge number of parameters, so limited computational resources and severe overfitting become pressing problems. Some methods use CNNs as feature extractors to preprocess the input data into a more compact form [Donahue et al. 2015]. These methods have improved classification accuracy, but the over-parameterization problem is still only partially solved. In this work, we focus on designing a low-rank structure to replace the redundant input-to-hidden weight matrix in RNNs, compressing the whole model while maintaining its performance.
The most straightforward way to apply a low-rank constraint is to implement matrix decomposition on the weight matrices. Singular Value Decomposition (SVD) has been applied in convolutional neural networks to reduce parameters [Denton et al. 2014], but incurred a loss in model performance. Besides, the compression ratio is limited because the rank in matrix decomposition remains relatively large.

Compared with matrix decomposition, tensor decomposition [Xu, Yan, and Qi 2015, Zhe et al. 2016, Hao et al. 2018, He et al. 2018, Liu et al. 2018a] operates in higher dimensions, capturing higher-order correlations while requiring orders of magnitude fewer parameters [Kim et al. 2015, Novikov et al. 2015]. Among these methods, [Lebedev et al. 2014] utilized CP decomposition to speed up convolution computation, which shares a design philosophy with the widely-used depthwise separable convolutions [Howard et al. 2017]. However, the instability issue [de Silva and Lim 2008] hinders the low-rank CP decomposition from solving many important computer vision tasks. [Kim et al. 2015] used Tucker decomposition to decompose both the convolution layer and the fully-connected layer, significantly reducing runtime and energy in mobile applications with a minor accuracy drop. Block-term tensor decomposition combines CP and Tucker by summing up multiple Tucker models to overcome their drawbacks, and has obtained better performance in RNNs [Ye et al. 2018]. However, the computation of the core tensor in the Tucker model is highly inefficient due to the complex tensor flattening and permutation operations. In recent years, the tensor train decomposition has also been used to substitute the redundant fully-connected layers in both CNNs and RNNs [Novikov et al. 2015, Yang, Krompass, and Tresp 2017], preserving performance while reducing the number of parameters by up to 40 times.
However, tensor train decomposition has some limitations: 1) there are constraints on the TT-ranks, i.e., the ranks of the first and last factors are restricted to 1, which limits its representation power and flexibility; 2) a strict order must be followed when multiplying the TT cores, so the alignment of the tensor dimensions is extremely important for obtaining optimized TT cores, but finding the best alignment remains a challenging issue. In this paper, we use the tensor ring (TR) decomposition [Zhao et al. 2016] to overcome the drawbacks of TTD, while achieving greater computational efficiency than BT decomposition.
Conclusion
In this paper, we applied TRD to plain RNNs to replace the over-parameterized input-to-hidden weight matrix when dealing with high-dimensional input data. The low-rank structure of TRD can capture the correlations between feature dimensions with orders of magnitude fewer parameters. Our TR-LSTM model achieved the best compression ratio with the highest classification accuracy on the UCF11 dataset among end-to-end trained RNNs based on low-rank methods. Meanwhile, when processing features extracted via Inception-V3 as the input vectors, our TR-LSTM model can still compress the LSTM while improving accuracy. We believe our models provide fundamental modules for RNNs and can be widely used to handle large input data. In future work, since our models are easy to extend, we plan to apply them to more advanced RNN structures [Gammulle et al. 2017] to get better performance.
Acknowledgments
We thank the anonymous reviewers for valuable comments that improved the quality of our paper. This work was partially supported by the National Natural Science Foundation of China (Nos. 61572111 and 61876034) and a Fundamental Research Fund for the Central Universities of China (No. ZYGX2016Z003).
References
 [Byeon et al.2015] Byeon, W.; Breuel, T. M.; Raue, F.; and Liwicki, M. 2015. Scene labeling with lstm recurrent neural networks. In CVPR, 3547–3555.
 [Cai et al.2014] Cai, Z.; Wang, L.; Peng, X.; and Qiao, Y. 2014. Multiview super vector for action recognition. In CVPR, 596–603. IEEE Computer Society.
 [Carreira and Zisserman2017] Carreira, J., and Zisserman, A. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In CVPR, 4724–4733. IEEE Computer Society.
 [Cho et al.2014] Cho, J.; Lee, M.; Chang, H. J.; and Oh, S. 2014. Robust action recognition using local motion and group sparsity. Pattern Recognition 2014 47(5):1813–1825.
 [Cichocki et al.2016] Cichocki, A.; Lee, N.; Oseledets, I. V.; Phan, A. H.; Zhao, Q.; and Mandic, D. P. 2016. Lowrank tensor networks for dimensionality reduction and largescale optimization problems: Perspectives and challenges PART 1. CoRR abs/1609.00893.
 [de Silva and Lim2008] de Silva, V., and Lim, L. 2008. Tensor rank and the illposedness of the best lowrank approximation problem. SIAM J. Matrix Analysis Applications 30(3):1084–1127.
 [Denton et al.2014] Denton, E. L.; Zaremba, W.; Bruna, J.; LeCun, Y.; and Fergus, R. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 1269–1277.
 [Donahue et al.2015] Donahue, J.; Hendricks, L. A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Darrell, T.; and Saenko, K. 2015. Longterm recurrent convolutional networks for visual recognition and description. In CVPR, 2625–2634. IEEE Computer Society.
 [Feichtenhofer, Pinz, and Zisserman2016] Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2016. Convolutional twostream network fusion for video action recognition. In CVPR, 1933–1941. IEEE Computer Society.
 [Gammulle et al.2017] Gammulle, H.; Denman, S.; Sridharan, S.; and Fookes, C. 2017. Two stream LSTM: A deep fusion framework for human action recognition. In WACV, 177–186. IEEE.
 [Hao et al.2018] Hao, L.; Liang, S.; Ye, J.; and Xu, Z. 2018. Tensord: A tensor decomposition library in tensorflow. Neurocomputing 318:196–200.
 [Hasan and RoyChowdhury2014] Hasan, M., and RoyChowdhury, A. K. 2014. Incremental activity modeling and recognition in streaming videos. In CVPR, 796–803. IEEE.
 [He et al.2018] He, L.; Liu, B.; Li, G.; Sheng, Y.; Wang, Y.; and Xu, Z. 2018. Knowledge base completion by variational bayesian neural tensor decomposition. Cognitive Computation.
 [Howard et al.2017] Howard, A. G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; and Adam, H. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861.
 [IkizlerCinbis and Sclaroff2010] IkizlerCinbis, N., and Sclaroff, S. 2010. Object, scene and actions: Combining multiple features for human action recognition. In ECCV, 494–507. Springer.
 [Jain, Jegou, and Bouthemy2013] Jain, M.; Jegou, H.; and Bouthemy, P. 2013. Better exploiting motion for better action recognition. In CVPR 2013, 2555–2562. IEEE Computer Society.
 [Kim et al.2015] Kim, Y.; Park, E.; Yoo, S.; Choi, T.; Yang, L.; and Shin, D. 2015. Compression of deep convolutional neural networks for fast and low power mobile applications. CoRR abs/1511.06530.
 [Kuehne et al.2011] Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T. A.; and Serre, T. 2011. HMDB: A large video database for human motion recognition. In ICCV, 2556–2563. IEEE Computer Society.
 [Lebedev et al.2014] Lebedev, V.; Ganin, Y.; Rakhuba, M.; Oseledets, I. V.; and Lempitsky, V. S. 2014. Speedingup convolutional neural networks using finetuned cpdecomposition. CoRR abs/1412.6553.
 [Li et al.2017] Li, G.; Ye, J.; Yang, H.; Chen, D.; Yan, S.; and Xu, Z. 2017. Btnets: Simplifying deep neural networks via block term decomposition. CoRR abs/1712.05689.
 [Liang et al.2016] Liang, X.; Shen, X.; Xiang, D.; Feng, J.; Lin, L.; and Yan, S. 2016. Semantic object parsing with localglobal long shortterm memory. In CVPR, 3185–3193.
 [Liu et al.2018a] Liu, B.; He, L.; Li, Y.; Zhe, S.; and Xu, Z. 2018a. Neuralcp: Bayesian multiway data analysis with neural tensor decomposition. Cognitive Computation.
 [Liu et al.2018b] Liu, H.; He, L.; Bai, H.; Dai, B.; Bai, K.; and Xu, Z. 2018b. Structured inference for recurrent hidden semi-markov model. In IJCAI, 2447–2453.
 [Liu, Luo, and Shah2009] Liu, J.; Luo, J.; and Shah, M. 2009. Recognizing realistic actions from videos “in the wild”. In CVPR, 1996–2003. IEEE.
 [Liu, Shyu, and Zhao2013] Liu, D.; Shyu, M.L.; and Zhao, G. 2013. Spatialtemporal motion information integration for action detection and recognition in nonstatic background. In IRI, 626–633. IEEE.
 [Ni et al.2015] Ni, B.; Moulin, P.; Yang, X.; and Yan, S. 2015. Motion part regularization: Improving action recognition via trajectory group selection. In CVPR, 3698–3706. IEEE Computer Society.
 [Novikov et al.2015] Novikov, A.; Podoprikhin, D.; Osokin, A.; and Vetrov, D. P. 2015. Tensorizing neural networks. In NIPS, 442–450.
 [Rumelhart, Hinton, and Williams1988] Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1988. Learning representations by backpropagating errors. Cognitive modeling 5(3):1.
 [Sharma, Kiros, and Salakhutdinov2015] Sharma, S.; Kiros, R.; and Salakhutdinov, R. 2015. Action recognition using visual attention. CoRR abs/1511.04119.
 [ShwartzZiv and Tishby2017] ShwartzZiv, R., and Tishby, N. 2017. Opening the black box of deep neural networks via information. CoRR abs/1703.00810.
 [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS, 3104–3112.
 [Theis and Bethge2015] Theis, L., and Bethge, M. 2015. Generative image modeling using spatial lstms. In NIPS, 1927–1935.
 [Tishby, Pereira, and Bialek2000] Tishby, N.; Pereira, F. C. N.; and Bialek, W. 2000. The information bottleneck method. CoRR physics/0004057.
 [Tran et al.2015] Tran, D.; Bourdev, L. D.; Fergus, R.; Torresani, L.; and Paluri, M. 2015. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 4489–4497. IEEE Computer Society.
 [Wang and Schmid2013] Wang, H., and Schmid, C. 2013. Action recognition with improved trajectories. In ICCV, 3551–3558. IEEE Computer Society.
 [Wang et al.2011] Wang, H.; Kläser, A.; Schmid, C.; and Liu, C. 2011. Action recognition by dense trajectories. In CVPR, 3169–3176. IEEE.
 [Wang, Qiao, and Tang2015] Wang, L.; Qiao, Y.; and Tang, X. 2015. Action recognition with trajectorypooled deepconvolutional descriptors. In CVPR, 4305–4314. IEEE Computer Society.
 [Xu, Yan, and Qi2015] Xu, Z.; Yan, F.; and Qi, Y. A. 2015. Bayesian nonparametric models for multiway data analysis. IEEE Trans. Pattern Anal. Mach. Intell. 37(2):475–487.
 [Yang, Krompass, and Tresp2017] Yang, Y.; Krompass, D.; and Tresp, V. 2017. Tensortrain recurrent neural networks for video classification. In ICML, 3891–3900.
 [Ye et al.2018] Ye, J.; Wang, L.; Li, G.; Chen, D.; Zhe, S.; Chu, X.; and Xu, Z. 2018. Learning compact recurrent neural networks with blockterm tensor decomposition. In CVPR.
 [Zhao et al.2016] Zhao, Q.; Zhou, G.; Xie, S.; Zhang, L.; and Cichocki, A. 2016. Tensor ring decomposition. CoRR abs/1606.05535.
 [Zhe et al.2016] Zhe, S.; Zhang, K.; Wang, P.; Lee, K.; Xu, Z.; Qi, Y.; and Ghahramani, Z. 2016. Distributed flexible nonlinear tensor factorization. In NIPS, 920–928.
 [Zhu et al.2013] Zhu, J.; Wang, B.; Yang, X.; Zhang, W.; and Tu, Z. 2013. Action recognition with actons. In ICCV, 3559–3566. IEEE Computer Society.
Appendix
Complexity Analysis
Complexity in Forward Process
Since the weight tensor has been decomposed into the TR form with $n+m$ core tensors, the order of multiplication among the input tensor and the core tensors in Equation (6) determines the computational cost. In our implementation, we multiply the input tensor with the input core tensors and the output core tensors sequentially. We can rewrite Equation (6) as:
$$\mathcal{Y} = \mathrm{Trace}\big(\mathcal{X} \times \mathcal{G}_1 \times \cdots \times \mathcal{G}_n \times \mathcal{G}_{n+1} \times \cdots \times \mathcal{G}_{n+m}\big). \qquad (11)$$
In Equation (11), the symbol $\times$ denotes the tensor contraction operation as described in [Cichocki et al. 2016], which is a fundamental and important operation in tensor networks. Tensor contraction can be considered a higher-dimensional analogue of matrix multiplication, the inner product, and the outer product. Note that the $k$-th input component can be formulated as $\mathcal{G}_k \in \mathbb{R}^{R_{k-1} \times I_k \times R_k}$ and the $j$-th output component as $\mathcal{G}_{n+j} \in \mathbb{R}^{R_{n+j-1} \times O_j \times R_{n+j}}$.
Our model is trained via Back-Propagation Through Time. In Equation (11), according to the left-to-right multiplication order, the forward computational complexity and the forward space complexity can be derived, where all the ranks in our model are set to $R$.
Table 4: Comparison of the time and memory complexity of the forward and backward processes for RNN, TT-RNN, BT-RNN, and TR-RNN.
Complexity in Backward Process
We implement our back-propagation following the backward complexity analysis of TT [Novikov et al. 2015]. The backward computational complexity differs across cores due to the multiplication order, so we use the core with the highest computational and space complexity among all the cores to represent the backward complexity:
(12) 
In Equation (12), the notation is the same as in Equation (11). Since the number of cores is $n+m$ in our model, the total backward computational complexity and the backward space complexity scale accordingly. The statistics in comparison with some other compression methods are shown in Table 4.