1 Introduction
Tensors have attracted increasing interest from the machine learning community over the past decades. One reason is that the tensor structure gives a natural representation of multimodal data. Such multimodal datasets are often encountered in scientific fields including image analysis [14], signal processing [3] and spatio-temporal analysis [1, 23]. Tensor methods allow statistical models to efficiently learn multilinear relationships between inputs and outputs by leveraging multilinear algebra and efficient low-rank constraints. The low-rank constraints on higher-order multivariate regression can be interpreted as a regularization technique. As shown in [19], an efficient low-rank multilinear regression model with tensor responses can improve regression performance.
Incorporating tensor methods into deep neural networks has become a prominent area of study. In particular, over the past decade, tensor decomposition and approximation algorithms have been introduced to deep neural networks, notably for 1) efficient compression of the model with low-rank constraints [17] and 2) leveraging the multimodal structure of high-dimensional datasets [9]. For illustration, Kossaifi et al. proposed the tensor regression layer (TRL), which replaces the vectorization operation and fully-connected layers of Convolutional Neural Networks (CNNs) with higher-order multivariate regression [9]. The advantage of this replacement is a high compression rate of the model while preserving the multimodal information of the dataset by enforcing efficient low-rank constraints. Given such a high-dimensional dataset, the vectorization operation leads to a loss of multimodal information: the higher-level dependencies among the various modes are lost when the data is mapped to a linear space. For instance, applying the flattening operation to a colored image (a 3rd-order tensor) removes the relationship between the red channel and the blue channel. A tensor regression layer is able to capture such multimodal information by performing multilinear regression between the output of the last convolutional layer and the softmax.
Following [9], we investigate the properties and performance of tensor regression layers from the perspectives of regularization and compression. We interpret low-rank constraints as a regularization technique for higher-order multivariate regression and enforce low-rank constraints on the weight tensor between the output tensors of a CNN and the output vectors. Furthermore, we compare tensor regression layers built on various tensor decompositions. We aim to provide comparative insight into the different low-rank constraints that can be enforced on higher-order multivariate regression. We compare the performance of TRLs using Tucker, CP and Tensor Train (TT) decompositions in a small standard CNN on MNIST and Fashion-MNIST. We also carry out this comparison in Residual Networks (ResNet) [5, 6] on CIFAR-10. To investigate the regularization effect, we employ shallow CNNs, train them with different numbers of training samples and compare their performance.
We show that, relative to the weight matrix of the fully-connected layer of a 32-layer Residual Network on the CIFAR-10 dataset, a compression rate of 54 can be achieved using the TT decomposition while sacrificing less than 0.3% accuracy. Surprisingly, we also show that an even better compression rate with a smaller loss in accuracy on CIFAR-10 can be achieved by simply using global average pooling (GAP) followed by a small fully-connected layer. However, using the same trick on the smaller CNN on MNIST led to very poor results.
The remainder of this paper is organized as follows. We start by reviewing background knowledge on multilinear algebra and tensor decomposition formats in Section 2. In Section 3, we present and investigate the tensor regression layer with different tensor decomposition formats. We show that the global average pooling (GAP) layer is a special case of a TRL with Tucker decomposition in Section 4. In Section 5 we present a simple analysis of low-rank constraints showing how particular choices of the tensor rank parameters can drastically affect the expressiveness of the network. We demonstrate the empirical performance of low-rank TRLs in Section 6, followed by a discussion and conclusion of our work in Section 7.
2 Background
2.1 Tensor Algebra
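We begin with a concise review of the notation and basics of tensor algebra. For a more comprehensive review, we refer the reader to [8]. Throughout this paper, a vector is denoted by a boldface lowercase letter, e.g. $\mathbf{v} \in \mathbb{R}^{d_1}$. Matrices and higher-order tensors are denoted by boldface uppercase and calligraphic letters respectively, e.g. $\mathbf{M} \in \mathbb{R}^{d_1 \times d_2}$ and $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$. Given an $N$th-order tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$, its $(i_1, \dots, i_N)$th entry is denoted by $\mathcal{T}_{i_1 \cdots i_N}$ or $\mathcal{T}(i_1, \dots, i_N)$, where $i_n \in [d_n]$. The notation $[k]$ denotes the range of integers from $1$ to $k$ inclusive. Given a 3rd-order tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$, its slices are the matrices obtained by fixing all but two indices; the horizontal, lateral and frontal slices of $\mathcal{T}$ are denoted by $\mathcal{T}_{i::}$, $\mathcal{T}_{:j:}$ and $\mathcal{T}_{::k}$ respectively. Similarly, the mode-$n$ fibers of $\mathcal{T}$ are the vectors obtained by fixing every index but the $n$th one. The mode-$n$ matricization or mode-$n$ unfolding of a tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$ is the matrix having its mode-$n$ fibers as columns and is denoted by $\mathbf{T}_{(n)} \in \mathbb{R}^{d_n \times d_1 \cdots d_{n-1} d_{n+1} \cdots d_N}$. Given vectors $\mathbf{v}_1 \in \mathbb{R}^{d_1}, \dots, \mathbf{v}_N \in \mathbb{R}^{d_N}$, the outer product of these vectors is denoted by $\mathbf{v}_1 \circ \cdots \circ \mathbf{v}_N \in \mathbb{R}^{d_1 \times \cdots \times d_N}$ and is defined by $(\mathbf{v}_1 \circ \cdots \circ \mathbf{v}_N)_{i_1 \cdots i_N} = (\mathbf{v}_1)_{i_1} \cdots (\mathbf{v}_N)_{i_N}$ for all $i_n \in [d_n]$ where $n \in [N]$. An $N$th-order tensor $\mathcal{T}$ is called rank-one if it can be written as the outer product of $N$ vectors (i.e. $\mathcal{T} = \mathbf{v}_1 \circ \cdots \circ \mathbf{v}_N$). The $n$-mode product of a tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$ with a matrix $\mathbf{M} \in \mathbb{R}^{m \times d_n}$ is denoted by $\mathcal{T} \times_n \mathbf{M} \in \mathbb{R}^{d_1 \times \cdots \times d_{n-1} \times m \times d_{n+1} \times \cdots \times d_N}$ and is defined by
$$(\mathcal{T} \times_n \mathbf{M})_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{d_n} \mathcal{T}_{i_1 \cdots i_N} \mathbf{M}_{j i_n}$$
for all $i_k \in [d_k]$ with $k \neq n$ and all $j \in [m]$. Similarly, we denote the $n$-mode product of a tensor $\mathcal{T}$ and a vector $\mathbf{v} \in \mathbb{R}^{d_n}$ by $\mathcal{T} \bullet_n \mathbf{v} \in \mathbb{R}^{d_1 \times \cdots \times d_{n-1} \times d_{n+1} \times \cdots \times d_N}$; it is defined by $(\mathcal{T} \bullet_n \mathbf{v})_{i_1 \cdots i_{n-1} i_{n+1} \cdots i_N} = \sum_{i_n=1}^{d_n} \mathcal{T}_{i_1 \cdots i_N} \mathbf{v}_{i_n}$ for all $i_k \in [d_k]$ with $k \neq n$.
The Kronecker product of matrices $\mathbf{A} \in \mathbb{R}^{I \times J}$ and $\mathbf{B} \in \mathbb{R}^{K \times L}$ is the block matrix of size $IK \times JL$ and is denoted by $\mathbf{A} \otimes \mathbf{B}$. Given matrices $\mathbf{A}$ and $\mathbf{B}$, both of size $I \times J$, their Hadamard product (or component-wise product) is denoted by $\mathbf{A} * \mathbf{B}$ and defined by $(\mathbf{A} * \mathbf{B})_{ij} = \mathbf{A}_{ij} \mathbf{B}_{ij}$. The Khatri-Rao product of matrices $\mathbf{A} \in \mathbb{R}^{I \times R}$ and $\mathbf{B} \in \mathbb{R}^{K \times R}$ is the $IK \times R$ matrix defined by
$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \mathbf{a}_2 \otimes \mathbf{b}_2 & \cdots & \mathbf{a}_R \otimes \mathbf{b}_R \end{bmatrix} \tag{1}$$
where $\mathbf{a}_r$ (resp. $\mathbf{b}_r$) denotes the $r$th column of $\mathbf{A}$ (resp. $\mathbf{B}$).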
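The operations reviewed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation used in the experiments: the helper names `unfold`, `mode_product` and `khatri_rao` are hypothetical, and the unfolding is assumed to follow the column ordering of [8], in which earlier modes vary fastest.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: mode-n fibers become the columns (earlier modes vary fastest)."""
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1), order="F")

def mode_product(tensor, matrix, mode):
    """n-mode product: contract axis `mode` of the tensor with the columns of the matrix."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))  # the new axis comes out first
    return np.moveaxis(out, 0, mode)                    # move it back to position `mode`

def khatri_rao(a, b):
    """Column-wise Kronecker product of two matrices with equally many columns."""
    return np.einsum("ir,jr->ijr", a, b).reshape(a.shape[0] * b.shape[0], a.shape[1])

rng = np.random.default_rng(0)
T = rng.standard_normal((3, 4, 5))
M = rng.standard_normal((6, 4))

Y = mode_product(T, M, mode=1)  # shape (3, 6, 5)
# Unfolding turns the mode product into an ordinary matrix product.
unfolding_identity_holds = np.allclose(unfold(Y, 1), M @ unfold(T, 1))

A = rng.standard_normal((3, 2))
B = rng.standard_normal((4, 2))
KR = khatri_rao(A, B)  # shape (12, 2); column r is kron(A[:, r], B[:, r])
```

The check on `unfolding_identity_holds` reflects the standard fact that the mode-n product acts as left multiplication on the mode-n unfolding, which is what makes the matricized expressions of the decompositions convenient.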
2.2 Various Tensor Decompositions
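In this section we present three commonly used tensor decomposition formats: CANDECOMP/PARAFAC (CP), Tucker and Tensor Train (TT).
CP decomposition. The CP decomposition [2, 4] approximates a tensor with a sum of rank-one tensors [8]. The rank of the decomposition is simply the number of rank-one tensors used to approximate the input tensor: given an input tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$, its approximation with a CP decomposition of rank $R$ is defined by
$$\hat{\mathcal{T}} = \sum_{r=1}^{R} \mathbf{a}^{(1)}_r \circ \mathbf{a}^{(2)}_r \circ \cdots \circ \mathbf{a}^{(N)}_r. \tag{2}$$
In Eq. (2), $\hat{\mathcal{T}}$ denotes the CP approximation of $\mathcal{T}$, where each matrix $\mathbf{A}^{(n)} \in \mathbb{R}^{d_n \times R}$ consists of the column vectors $\mathbf{a}^{(n)}_r$ for $r \in [R]$.
We have the following useful expression of Eq. (2) in terms of the matricization of $\hat{\mathcal{T}}$:
$$\hat{\mathbf{T}}_{(n)} = \mathbf{A}^{(n)} \left( \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(n+1)} \odot \mathbf{A}^{(n-1)} \odot \cdots \odot \mathbf{A}^{(1)} \right)^\top. \tag{3}$$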
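As a small numerical illustration of the CP format and of its matricized expression, the following sketch builds a rank-$R$ tensor from random factor matrices and compares its mode-$n$ unfolding with the Khatri-Rao form. The helper names are hypothetical, and the unfolding convention (earlier modes vary fastest, as in [8]) is an assumption on which the ordering of the Khatri-Rao factors depends.

```python
import numpy as np
from functools import reduce

def unfold(tensor, mode):
    """Mode-n unfolding with earlier modes varying fastest along the columns."""
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1), order="F")

def khatri_rao(a, b):
    """Column-wise Kronecker product."""
    return np.einsum("ir,jr->ijr", a, b).reshape(a.shape[0] * b.shape[0], a.shape[1])

def cp_to_tensor(factors):
    """Sum of R rank-one tensors; factors[n] has shape (d_n, R)."""
    letters = "abcdefgh"[: len(factors)]
    spec = ",".join(f"{l}r" for l in letters) + "->" + letters
    return np.einsum(spec, *factors)

rng = np.random.default_rng(0)
dims, R = (3, 4, 5), 2
factors = [rng.standard_normal((d, R)) for d in dims]
T_hat = cp_to_tensor(factors)  # shape (3, 4, 5)

n = 1
# Khatri-Rao product of all factors except mode n, taken in reverse mode order.
others = [factors[m] for m in reversed(range(len(dims))) if m != n]
kr = reduce(khatri_rao, others)
matricized_form_holds = np.allclose(unfold(T_hat, n), factors[n] @ kr.T)
```

The CP format stores $R(d_1 + \cdots + d_N)$ numbers instead of $d_1 \cdots d_N$, which is the source of the compression discussed later.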
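Tucker decomposition. The Tucker decomposition approximates a tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$ by the product of a core tensor $\mathcal{G} \in \mathbb{R}^{R_1 \times \cdots \times R_N}$ and factor matrices $\mathbf{U}^{(n)} \in \mathbb{R}^{d_n \times R_n}$ for $n \in [N]$:
$$\hat{\mathcal{T}} = \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \times_3 \cdots \times_N \mathbf{U}^{(N)}. \tag{4}$$
The matricization of $\hat{\mathcal{T}}$ from Eq. (4) can be written as
$$\hat{\mathbf{T}}_{(n)} = \mathbf{U}^{(n)} \mathbf{G}_{(n)} \left( \mathbf{U}^{(N)} \otimes \cdots \otimes \mathbf{U}^{(n+1)} \otimes \mathbf{U}^{(n-1)} \otimes \cdots \otimes \mathbf{U}^{(1)} \right)^\top. \tag{5}$$
The tuple $(R_1, \dots, R_N)$ is the rank of the Tucker decomposition and determines the size of the core tensor $\mathcal{G}$. An example of a Tucker approximation of a fourth-order tensor is given in Figure 3.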
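The Tucker format can be illustrated in the same way. The sketch below, with hypothetical helper names and the unfolding convention of [8] assumed, rebuilds a tensor from a random core and random factor matrices and compares its unfolding with the Kronecker-product expression.

```python
import numpy as np
from functools import reduce

def unfold(tensor, mode):
    """Mode-n unfolding with earlier modes varying fastest along the columns."""
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1), order="F")

def mode_product(tensor, matrix, mode):
    """n-mode product of a tensor with a matrix."""
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

def tucker_to_tensor(core, factors):
    """Apply one factor matrix per mode to the core tensor."""
    out = core
    for n, U in enumerate(factors):
        out = mode_product(out, U, n)
    return out

rng = np.random.default_rng(0)
dims, ranks = (3, 4, 5), (2, 3, 2)
core = rng.standard_normal(ranks)
factors = [rng.standard_normal((d, r)) for d, r in zip(dims, ranks)]
T_hat = tucker_to_tensor(core, factors)  # shape (3, 4, 5)

n = 0
# Kronecker product of all factors except mode n, taken in reverse mode order.
others = [factors[m] for m in reversed(range(len(dims))) if m != n]
kron = reduce(np.kron, others)
matricized_form_holds = np.allclose(
    unfold(T_hat, n), factors[n] @ unfold(core, n) @ kron.T
)

n_params = core.size + sum(U.size for U in factors)  # core plus all factor matrices
```

For these illustrative sizes the Tucker format stores 40 numbers where the full tensor has 60 entries; the saving grows quickly with the tensor dimensions when the ranks stay small.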
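Tensor train decomposition. The tensor train (TT) decomposition [18] provides a space-efficient representation for higher-order tensors. It approximates a tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$ with the product of third-order tensors $\mathcal{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times d_n \times R_n}$ called core tensors or simply cores. The rank of the TT decomposition is the tuple $(R_0, R_1, \dots, R_N)$ where $R_0 = R_N = 1$.
Given a tensor $\mathcal{T}$, the approximation by TT decomposition is defined as
$$\hat{\mathcal{T}}_{i_1 \cdots i_N} = \mathcal{G}^{(1)}_{:, i_1, :} \cdot \mathcal{G}^{(2)}_{:, i_2, :} \cdot \ldots \cdot \mathcal{G}^{(N)}_{:, i_N, :} \tag{6}$$
where $\cdot$ denotes the matrix product.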
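To make the TT format concrete, the following sketch contracts a list of random cores into a full tensor and checks one entry against the product of the corresponding core slices. It assumes cores of shape $(R_{n-1}, d_n, R_n)$ with boundary ranks equal to one; the function name `tt_to_tensor` is hypothetical.

```python
import numpy as np

def tt_to_tensor(cores):
    """Contract TT cores of shape (R_{n-1}, d_n, R_n), with R_0 = R_N = 1, into a full tensor."""
    out = cores[0]  # shape (1, d_1, R_1)
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=(-1, 0))  # contract the shared rank index
    return out.squeeze(axis=(0, -1))  # drop the two boundary ranks of size one

rng = np.random.default_rng(0)
dims, ranks = (3, 4, 5), (1, 2, 3, 1)
cores = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1])) for n in range(len(dims))]
T_hat = tt_to_tensor(cores)  # shape (3, 4, 5)

# A single entry is a product of one matrix slice per core (a 1 x 1 matrix overall).
i = (2, 1, 4)
entry = cores[0][:, i[0], :] @ cores[1][:, i[1], :] @ cores[2][:, i[2], :]
entrywise_form_holds = np.isclose(T_hat[i], entry[0, 0])

n_params = sum(core.size for core in cores)
```

The number of stored values is $\sum_n R_{n-1} d_n R_n$, so it grows linearly in the order of the tensor for fixed ranks.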
In order to express Eq. (6) in terms of matricizations of $\hat{\mathcal{T}}$, we first define the following contraction operation on core tensors.
3 Tensor Regression Layer
In this section, we introduce tensor regression layers based on various low-rank tensor approximations. As stated in Section 1, the last fully-connected layer of a traditional CNN accounts for a large proportion of the model parameters. In addition to this large consumption of computational resources, the flattening operation leads to the loss of the rich multimodal information in the last convolutional layer. The tensor regression layer [9] replaces the last flattening and fully-connected layers of a CNN by a multilinear map with low Tucker rank. In this work, we explore imposing other low-rank constraints on the weight tensor and we compare the compression and regularization effects of using either CP, Tucker or TT decompositions.
Given an input tensor $\mathcal{X} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$ and a weight tensor $\mathcal{W} \in \mathbb{R}^{d_1 \times \cdots \times d_N \times d_{N+1}}$, we investigate the function $f : \mathbb{R}^{d_1 \times \cdots \times d_N} \to \mathbb{R}^{d_{N+1}}$ where $d_{N+1}$ is the number of classes. Given these two tensors, the function is defined as
$$f(\mathcal{X}) = \mathbf{W}_{(N+1)} \mathrm{vec}(\mathcal{X}) + \mathbf{b} \tag{9}$$
where $\mathbf{b} \in \mathbb{R}^{d_{N+1}}$ is a bias vector added to the product of $\mathcal{W}$ and $\mathcal{X}$. The tensor network representation of an example of Eq. (9) is given in Figure 5. The main idea behind tensor regression layers is to enforce a low tensor rank structure on $\mathcal{W}$ in order to both reduce memory usage and leverage the multilinear structure of the input $\mathcal{X}$.
Throughout the paper, we denote a TRL with TT decomposition by TT-TRL. Similarly, we use CP-TRL and Tucker-TRL for a TRL with CP or Tucker decomposition.
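Before any low-rank structure is imposed, the regression function is simply a contraction of the input tensor with the first $N$ modes of the weight tensor. The sketch below illustrates this under the assumption that the regression function has the form $\mathbf{W}_{(N+1)}\mathrm{vec}(\mathcal{X}) + \mathbf{b}$ with a column-major vectorization matching the unfolding convention of [8]; the names `unfold` and `trl_dense` are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding with earlier modes varying fastest along the columns."""
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1), order="F")

def trl_dense(x, weight, bias):
    """Unconstrained regression layer: unfold the weights along the class mode, flatten x."""
    w_mat = unfold(weight, weight.ndim - 1)          # shape (classes, d_1 * ... * d_N)
    return w_mat @ x.reshape(-1, order="F") + bias   # column-major vec matches the unfolding

rng = np.random.default_rng(0)
dims, n_classes = (3, 4, 5), 10
x = rng.standard_normal(dims)
weight = rng.standard_normal((*dims, n_classes))
bias = rng.standard_normal(n_classes)

y = trl_dense(x, weight, bias)
# Same result without flattening: contract all N input modes directly.
y_ref = np.tensordot(x, weight, axes=x.ndim) + bias
```

The second form makes explicit that nothing is lost by keeping the input as a tensor; the low-rank variants exploit this by factorizing `weight` mode by mode.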
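CP decomposition. First we investigate applying the CP decomposition to approximate the weight tensor $\mathcal{W}$. Using Eq. (2) and Eq. (3), Eq. (9) can be rewritten as
$$f(\mathcal{X}) = \mathbf{A}^{(N+1)} \left( \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(1)} \right)^\top \mathrm{vec}(\mathcal{X}) + \mathbf{b}. \tag{10}$$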
We can use this formulation to obtain the partial derivatives needed to implement gradient based optimization methods (e.g. backpropagation), indeed
(11) 
for all of the matrices for . Furthermore, for a given mode , we can naturally arrange these partial derivatives into a third order tensor and obtain their expression using unfolding:
for , and
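For illustration, a CP-constrained regression layer can be evaluated without ever forming the full weight tensor: the input is first contracted with the factor matrices of the input modes, and the resulting rank-sized vector is then mapped to the classes by the last factor matrix. The sketch below is an assumption-laden example (hypothetical function names, random factors), not the training code of the experiments; in practice the partial derivatives would typically be obtained by automatic differentiation of such a forward pass.

```python
import numpy as np

def cp_to_tensor(factors):
    """Full tensor of a CP decomposition; factors[n] has shape (d_n, R)."""
    letters = "abcdefgh"[: len(factors)]
    spec = ",".join(f"{l}r" for l in letters) + "->" + letters
    return np.einsum(spec, *factors)

def cp_trl(x, factors, bias):
    """CP-TRL forward pass that never materializes the full weight tensor."""
    letters = "abcdefgh"[: x.ndim]
    spec = letters + "," + ",".join(f"{l}r" for l in letters) + "->r"
    z = np.einsum(spec, x, *factors[:-1])  # one scalar per rank-one component, shape (R,)
    return factors[-1] @ z + bias          # mix the R components into class scores

rng = np.random.default_rng(0)
dims, n_classes, R = (3, 4, 5), 10, 6
factors = [rng.standard_normal((d, R)) for d in (*dims, n_classes)]
bias = rng.standard_normal(n_classes)
x = rng.standard_normal(dims)

y_lowrank = cp_trl(x, factors, bias)
y_dense = np.tensordot(x, cp_to_tensor(factors), axes=x.ndim) + bias

n_params_cp = sum(f.size for f in factors)       # R * (d_1 + ... + d_N + classes)
n_params_dense = int(np.prod(dims)) * n_classes  # d_1 * ... * d_N * classes
```

With these illustrative sizes the factorized layer stores 132 weights against 600 for the dense weight tensor.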
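Tucker decomposition. As described in Section 2, the Tucker decomposition approximates an input tensor by a core tensor and a set of factor matrices. Using the Tucker approximation of the tensor $\mathcal{W}$, we can rewrite Eq. (9) as
$$f(\mathcal{X}) = \mathbf{U}^{(N+1)} \mathbf{G}_{(N+1)} \left( \mathbf{U}^{(N)} \otimes \cdots \otimes \mathbf{U}^{(1)} \right)^\top \mathrm{vec}(\mathcal{X}) + \mathbf{b} \tag{12}$$
where the tensor $\mathcal{W}$ is approximated with
$$\mathcal{W} = \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \times_3 \cdots \times_{N+1} \mathbf{U}^{(N+1)}. \tag{13}$$
The tensor network representation of Eq. (12) is shown in Figure 8. Given a tensor $\mathcal{X}$ of size $d_1 \times \cdots \times d_N$, the function $f$ maps this tensor to the space $\mathbb{R}^{d_{N+1}}$ under low-rank constraints.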
We can again obtain concise expressions for the partial derivatives using unfoldings, for example:
(14) 
(15) 
and
(16) 
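A Tucker-constrained regression layer can likewise be evaluated factor by factor: each input mode is projected onto its factor matrix, the small projected tensor is contracted with the core, and the last factor matrix maps the result to the classes. The following sketch uses hypothetical helper names and random parameters; it is meant only to show that the factorized evaluation agrees with the dense one.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """n-mode product of a tensor with a matrix."""
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

def tucker_to_tensor(core, factors):
    """Apply one factor matrix per mode to the core tensor."""
    out = core
    for n, U in enumerate(factors):
        out = mode_product(out, U, n)
    return out

def tucker_trl(x, core, factors, bias):
    """Tucker-TRL forward pass evaluated without forming the full weight tensor."""
    z = x
    for n, U in enumerate(factors[:-1]):
        z = mode_product(z, U.T, n)         # project input mode n from d_n down to R_n
    z = np.tensordot(z, core, axes=z.ndim)  # contract with the core, shape (R_{N+1},)
    return factors[-1] @ z + bias           # map the bottleneck vector to class scores

rng = np.random.default_rng(0)
dims, n_classes = (3, 4, 5), 10
ranks = (2, 3, 2, 4)  # illustrative Tucker rank; the last entry belongs to the class mode
core = rng.standard_normal(ranks)
factors = [rng.standard_normal((d, r)) for d, r in zip((*dims, n_classes), ranks)]
bias = rng.standard_normal(n_classes)
x = rng.standard_normal(dims)

y_lowrank = tucker_trl(x, core, factors, bias)
y_dense = np.tensordot(x, tucker_to_tensor(core, factors), axes=x.ndim) + bias
```

Note that the class scores depend on the input only through the vector `z` of length equal to the last Tucker rank, which is the quantity analyzed in Section 5.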
Tensor Train decomposition. The tensor network visualization is given in Figure 8, where the weight tensor is replaced with its TT representation. Using Eq. (6) and (8), in the case of TT decomposition Eq. (9) can be rewritten as
(17) 
where the second equality follows from the fact that . Similarly to the case of CP and Tucker decomposition, the partial derivatives can be summarized with
(18) 
for all and , and
(19) 
4 Tensor perspective on Global Average Pooling layer
In this section, we provide insight into the Global Average Pooling (GAP) layer from the perspective of tensor algebra. In particular, we show that the GAP layer is a special case of the Tucker-TRL.
It is traditional practice to apply a flattening operation to the output tensor (i.e. the output of the last convolutional layer) before extracting its features. The problem with this approach lies in its ability to generalize to the test dataset. Some work on deep neural networks shows that fully-connected layers are prone to overfitting, thus leading to poor performance on the test dataset [7, 13, 11].
To tackle this generalization problem and to provide regularization, the Global Average Pooling (GAP) layer was proposed by Lin et al. [13]. It replaces the combination of the vectorization operation and the fully-connected layer with an averaging operation over all slices along the output channel axis. The output of a GAP layer is thus a single vector whose size is the number of output channels. The GAP layer was empirically shown to significantly reduce the number of model parameters in CNNs [13].
The authors of [13] claim not only that the GAP layer reduces the number of trainable model parameters but also that it can prevent the model from overfitting during training. Over the last decade, the GAP layer has been adopted in some of the most successful image classification models such as Residual Networks and VGG16 [5, 20].
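A more general interpretation of the convolutional output is that it is a high-order tensor $\mathcal{X} \in \mathbb{R}^{d_1 \times \cdots \times d_N}$. Given such a tensor, the GAP layer outputs a vector $\mathrm{GAP}(\mathcal{X}) \in \mathbb{R}^{d_N}$ defined by
$$\mathrm{GAP}(\mathcal{X})_{i_N} = \frac{1}{d_1 \cdots d_{N-1}} \sum_{i_1=1}^{d_1} \cdots \sum_{i_{N-1}=1}^{d_{N-1}} \mathcal{X}_{i_1 \cdots i_N}. \tag{20}$$
We here assume that the axis for the output channel corresponds to the last mode of the tensor $\mathcal{X}$. We now show that a GAP layer mapping $\mathbb{R}^{d_1 \times \cdots \times d_N}$ to $\mathbb{R}^{d_N}$ is equivalent to a specific Tucker-TRL with rank $(1, \dots, 1, d_N, d_N)$. Indeed, let $\mathcal{W} = \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \cdots \times_{N+1} \mathbf{U}^{(N+1)}$ be the regression tensor of a Tucker-TRL, with $\mathcal{G} \in \mathbb{R}^{1 \times \cdots \times 1 \times d_N \times d_N}$ the core whose only nontrivial slice is the identity (i.e. $\mathcal{G}_{1 \cdots 1 i j} = 1$ if $i = j$ and $0$ otherwise), where $\mathbf{U}^{(n)} = \frac{1}{d_n} \mathbf{1} \in \mathbb{R}^{d_n \times 1}$ for each $n \in [N-1]$, and $\mathbf{U}^{(N)} = \mathbf{U}^{(N+1)} = \mathbf{I}$. We have
$$\left( \mathbf{W}_{(N+1)} \mathrm{vec}(\mathcal{X}) \right)_k = \sum_{i_1, \dots, i_N} \mathcal{W}_{i_1 \cdots i_N k}\, \mathcal{X}_{i_1 \cdots i_N} = \frac{1}{d_1 \cdots d_{N-1}} \sum_{i_1, \dots, i_{N-1}} \mathcal{X}_{i_1 \cdots i_{N-1} k} = \mathrm{GAP}(\mathcal{X})_k. \tag{21}$$
Observe that the composition of a GAP layer with a fully-connected layer mapping $\mathbb{R}^{d_N}$ to $\mathbb{R}^{d_{N+1}}$ can also be achieved using a single Tucker-TRL by setting $\mathbf{U}^{(N+1)}$ to be the weight matrix of the fully-connected layer instead of the identity. A graphical representation of this equivalence is shown in Figure 12.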
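The equivalence between a GAP layer and a particular Tucker-structured regression tensor can be checked numerically. The sketch below assumes a third-order input (height, width, channels) with the channel axis last, averaging vectors as factors of the spatial modes and an identity core slice; the helper names and the sizes are illustrative assumptions.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """n-mode product of a tensor with a matrix."""
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

def tucker_to_tensor(core, factors):
    """Apply one factor matrix per mode to the core tensor."""
    out = core
    for n, U in enumerate(factors):
        out = mode_product(out, U, n)
    return out

rng = np.random.default_rng(0)
height, width, channels = 4, 4, 6
x = rng.standard_normal((height, width, channels))

# Tucker rank (1, 1, C, C): averaging vectors on the spatial modes, identity elsewhere.
core = np.eye(channels).reshape(1, 1, channels, channels)
factors = [
    np.full((height, 1), 1.0 / height),
    np.full((width, 1), 1.0 / width),
    np.eye(channels),
    np.eye(channels),
]
weight = tucker_to_tensor(core, factors)  # shape (H, W, C, C)

gap = x.mean(axis=(0, 1))                       # global average pooling
gap_equals_trl = np.allclose(gap, np.tensordot(x, weight, axes=3))

# GAP followed by a fully-connected layer: swap the last identity for the FC weights.
n_classes = 3
fc = rng.standard_normal((n_classes, channels))
weight_fc = tucker_to_tensor(core, factors[:-1] + [fc])
gap_fc_equals_trl = np.allclose(fc @ gap, np.tensordot(x, weight_fc, axes=3))
```

Both checks rely only on the structure of the factors, so they hold for any input of the stated shape.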
5 Observations on Rank Constraints
In this section, we provide a simple guideline for choosing one of the components of the low-rank constraints enforced on a TRL. In particular, we observe that the CP rank parameter and the last Tucker/TT rank parameter affect the dimension of the image of the function computed by the TRL. As a consequence of this observation, if a TRL is used as the last layer of a network before a softmax activation function in a classification task, setting this rank parameter to a value greater than the number of classes leads to unnecessary redundancy, while setting it to a smaller value can detrimentally limit the expressiveness of the resulting classifier.
First, we state a simple lemma needed to upper-bound the dimension of the image of the regression function: if a matrix admits a factorization, then the dimension of the image of the linear map defined by this matrix is bounded by the inner dimension of the factorization.
Lemma 1.
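If $\mathbf{M} = \mathbf{A}\mathbf{B}$ with $\mathbf{A} \in \mathbb{R}^{m \times R}$ and $\mathbf{B} \in \mathbb{R}^{R \times n}$, then $\dim(\mathrm{Im}(f)) \le R$ where $f : \mathbf{x} \mapsto \mathbf{M}\mathbf{x}$.
Proof.
The dimension of the image of the function $f$ is $\mathrm{rank}(\mathbf{M})$, which is the dimension of the space spanned by the column vectors of $\mathbf{M} = \mathbf{A}\mathbf{B}$. That is, $\dim(\mathrm{Im}(f)) = \dim(\mathrm{span}(\mathbf{m}_1, \dots, \mathbf{m}_n))$. Each column vector of the matrix $\mathbf{M}$ is a linear combination of the column vectors of $\mathbf{A}$, as follows from the equation $\mathbf{m}_i = \mathbf{A}\mathbf{b}_i$ where $\mathbf{b}_i$ denotes the $i$th column vector of $\mathbf{B}$. Since the matrix $\mathbf{A}$ is in the space $\mathbb{R}^{m \times R}$ and thus has only $R$ columns, the dimension of the span of the column vectors of $\mathbf{M}$ is upper-bounded by $R$, namely $\dim(\mathrm{Im}(f)) \le R$. ∎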
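Using Lemma 1, we can provide upper bounds on the dimension of the image of the regression function of a TRL for different tensor rank constraints.
Proposition 2.
Let $f : \mathcal{X} \mapsto \mathbf{W}_{(N+1)} \mathrm{vec}(\mathcal{X})$ where $\mathcal{W} \in \mathbb{R}^{d_1 \times \cdots \times d_N \times d_{N+1}}$. The following hold:
• if $\mathcal{W}$ admits a TT decomposition of rank $(R_0, \dots, R_{N+1})$, then $\dim(\mathrm{Im}(f)) \le R_N$,
• if $\mathcal{W}$ admits a Tucker decomposition of rank $(R_1, \dots, R_{N+1})$, then $\dim(\mathrm{Im}(f)) \le R_{N+1}$,
• if $\mathcal{W}$ admits a CP decomposition of rank $R$, then $\dim(\mathrm{Im}(f)) \le R$.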
Proof.
We have shown that the dimension of the image of the function $f$ is upper-bounded by one of the tensor rank parameters. We refer to this specific component of the rank tuple as the bottleneck rank.
Definition 2.
Given a regression tensor $\mathcal{W} \in \mathbb{R}^{d_1 \times \cdots \times d_N \times d_{N+1}}$, if $\mathcal{W}$ admits a Tucker decomposition with rank $(R_1, \dots, R_{N+1})$, we define the rank $R_{N+1}$ as the bottleneck rank. Similarly, if $\mathcal{W}$ admits a TT decomposition with TT-rank $(R_0, \dots, R_{N+1})$, we define $R_N$ as the bottleneck rank.
This observation on the rank constraints used in a tensor regression layer provides a simple guideline for choosing the bottleneck rank. For instance, when a TRL is used as the last layer of an architecture for a classification task, setting the bottleneck rank to a value smaller than the number of classes could limit the expressiveness of the TRL (which we demonstrate empirically in Section 6.1), while setting it to a value higher than the number of classes could lead to redundancy in the model parameters.
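The effect of the bottleneck rank can be observed numerically. In the sketch below, a Tucker-structured weight tensor is built with an illustrative bottleneck rank smaller than the number of classes, and the rank of its unfolding along the class mode is computed; this rank is the dimension of the image of the linear part of the regression function. Helper names, sizes and ranks are assumptions made for the example.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding with earlier modes varying fastest along the columns."""
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1), order="F")

def mode_product(tensor, matrix, mode):
    """n-mode product of a tensor with a matrix."""
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

def tucker_to_tensor(core, factors):
    """Apply one factor matrix per mode to the core tensor."""
    out = core
    for n, U in enumerate(factors):
        out = mode_product(out, U, n)
    return out

rng = np.random.default_rng(0)
dims, n_classes = (3, 4, 5), 10
bottleneck = 4                 # last Tucker rank, deliberately below the number of classes
ranks = (2, 3, 2, bottleneck)

core = rng.standard_normal(ranks)
factors = [rng.standard_normal((d, r)) for d, r in zip((*dims, n_classes), ranks)]
weight = tucker_to_tensor(core, factors)

w_mat = unfold(weight, weight.ndim - 1)    # shape (classes, d_1 * d_2 * d_3)
image_dim = np.linalg.matrix_rank(w_mat)   # dimension of the image of x -> w_mat @ x
```

Since `image_dim` cannot exceed `bottleneck`, the ten class scores produced by such a layer are confined to a subspace of dimension at most four, which is the loss of expressiveness described above.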
6 Experiments
In this section we provide experimental evidence which 1) supports our analysis of TRLs in Section 5 and 2) investigates the compressive and regularization power of the different low-rank constraints. We present experiments with tensor regression layers using CP, Tucker and TT decompositions on the benchmark datasets MNIST [12], Fashion-MNIST [22], CIFAR-10 and CIFAR-100 [10].
6.1 MNIST and Fashion-MNIST datasets
The MNIST dataset [12] consists of 1-channel images of handwritten digits from 0 to 9. The dataset contains 60k training examples and a test set of 10k examples. The purpose of the experiment is to provide insights on the regularization power of different low-rank constraints. We set our baseline classifier to be CNN with convolutional layers followed by fully-connected layer. Rectified linear units (ReLU) [15] were introduced between each layer as nonlinear activations. We tested the model with three tensor approximations: CP, Tucker and TT. By applying various low-rank constraints, we aim to show that as the constraints are relaxed (i.e. as the ranks grow), the approximation error becomes smaller, so that the accuracy of the low-rank model approaches that of the model without regularization (i.e. without low-rank constraints).
We concisely review the choice of low-rank constraints for the Tucker, TT and CP models. The detailed experimental configuration is available online at https://github.com/xwcao/LowRankTRN. Given an output tensor from final convolutional layer where denotes the number of samples in one batch, we constrain the weight tensor with the rank of Tucker decomposition . Following Proposition 2, the bottleneck rank is set to for TRL with Tucker and TT constraints.
Following [11], we initialized the weights in each layer from a zero-mean normal distribution with standard deviation . The bias term of each layer is initialized with constant . For Tucker-TRL, we conducted a total of experiments. This is per each low-rank Tucker-TRL where were set with constraints , and respectively. A set of experiments were conducted for TT-TRL as well. We set TT-rank to be and . For CP-TRL, we simply evaluated the performance with rank from a set .
We also evaluate the empirical performance of TRLs on another MNIST-like dataset: Fashion-MNIST. The dataset consists of 60k training and 10k testing images where each sample belongs to one of ten classes of fashion items such as shoes, clothes and hats. We used the same CNN architecture and hyperparameters as for the MNIST dataset.
Experimental outcomes for both datasets are provided in Figure 13, where we can see that all low-rank approximation models exhibit similar performance on both the MNIST and Fashion-MNIST datasets. As for the regularization effect, we observe that as we relax the low-rank constraints, the accuracy of each model gradually converges to that of the baseline model. This result illustrates the regularization power that low-rank constraints provide. We also conducted experiments in which we used a GAP layer instead of the fully-connected layer on both the MNIST and Fashion-MNIST datasets. In both cases, the model performed very poorly compared to the model with a fully-connected layer; with MNIST dataset and with FashionMNIST.
We conducted a similar experiment to provide empirical support for Proposition 2. In Section 5 we showed that the dimension of the image of a TRL is upper-bounded by the bottleneck rank. We conducted experiments where we fix the bottleneck rank to be one of . The experimental results presented in Figure 14 show clear distinctions among models with different bottleneck ranks: the bottleneck rank affects the test accuracy by upper-bounding the dimension of the image of the TRL.
6.2 CIFAR-10 and CIFAR-100 datasets with Residual Networks
We evaluate the performance of tensor regression layers on two further benchmark datasets, CIFAR-10 and CIFAR-100, with deep CNNs. The CIFAR-10 dataset [10] consists of 50k training and 10k test images from 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. Similarly, CIFAR-100 consists of colored images from 100 classes [10]. We employ a Residual Network [6] and replace the GAP layer with CP-, Tucker- and TT-TRL. Following [6], we trained the model with initial learning rate of with momentum of . The learning rate is multiplied by at k and k iteration steps and the training process is terminated at k steps. The size of each batch was set to . We set the weight decay to . The images are preprocessed with whitening normalization followed by random horizontal flips and cropping with a padding size of 2 pixels on each side.
The experimental results are reported in Table 1 for a 32-layer Residual Network [6] on CIFAR-10 and a layer ResNet on CIFAR-100. In order to compare compression rates, we set the baseline model to be the Residual Network with a fully-connected layer instead of the GAP layer. The errors in Table 1 are obtained by choosing the with the best validation score. The experiment shows that CP-TRL achieves test accuracy comparable to that of the ResNet with a GAP layer; however, the GAP layer performed the best in terms of both compression and accuracy.
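Table 1: Validation error (Vali), test error (Test) and compression rate (CR) on CIFAR-10 and CIFAR-100. The rank entries of the Tucker-TRL and TT-TRL rows are missing.
| Layer Type | Rank | CIFAR-10 Vali | CIFAR-10 Test | CIFAR-10 CR | CIFAR-100 Vali | CIFAR-100 Test | CIFAR-100 CR |
|---|---|---|---|---|---|---|---|
| FC | | 8.36 | 8.28 | 1.0 | 36.68 | 36.36 | 1.0 |
| GAP | | 7.62 | 8.18 | 64.0 | 29.68 | 29.42 | 64.0 |
| CP-TRL | 5 | 8.32 | 8.43 | 91.0 | 34.64 | 36.01 | 455.1 |
| CP-TRL | 50 | 8.18 | 8.11 | 9.1 | 30.92 | 30.73 | 45.5 |
| CP-TRL | 100 | 8.42 | 8.05 | 4.6 | 31.28 | 31.72 | 22.7 |
| Tucker-TRL | | 8.30 | 8.39 | 41.0 | 33.34 | 32.26 | 24.5 |
| Tucker-TRL | | 7.78 | 8.39 | 7.0 | 30.86 | 31.53 | 6.6 |
| Tucker-TRL | | 7.92 | 8.58 | 0.9 | TF | TF | 1.0 |
| TT-TRL | | 8.18 | 8.47 | 54.2 | 31.12 | 30.95 | 25.0 |
| TT-TRL | | 7.86 | 9.13 | 7.1 | 30.28 | 31.08 | 6.6 |
| TT-TRL | | 8.36 | 8.56 | 0.9 | 31.64 | 32.64 | 1.0 |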
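The compression rates of the GAP and CP-TRL rows can be approximately recovered by counting weights. The sketch below assumes that the compression rate is the number of weights of the fully-connected baseline divided by the number of weights of the replacement layer, that biases are not counted, and that the final convolutional output of the ResNet has shape 8 x 8 x 64; all three are assumptions of this example rather than statements from the text.

```python
from math import prod

def fc_params(dims, n_classes):
    """Weights of a fully-connected layer applied to the flattened tensor."""
    return prod(dims) * n_classes

def gap_fc_params(dims, n_classes):
    """Weights of a fully-connected layer applied after global average pooling."""
    return dims[-1] * n_classes

def cp_trl_params(dims, n_classes, rank):
    """Weights of a CP-TRL: one (d_n x R) factor per input mode plus one for the classes."""
    return rank * (sum(dims) + n_classes)

dims = (8, 8, 64)  # assumed shape of the last convolutional output (channels last)

compression = {}
for n_classes in (10, 100):
    baseline = fc_params(dims, n_classes)
    compression[("GAP", n_classes)] = baseline / gap_fc_params(dims, n_classes)
    for rank in (5, 50, 100):
        compression[("CP", rank, n_classes)] = baseline / cp_trl_params(dims, n_classes, rank)

for key, value in compression.items():
    print(key, round(value, 2))
```

Under these assumptions the counts agree closely with the reported GAP and CP-TRL compression rates, for instance 64.0 for GAP and about 91.0 for CP-TRL of rank 5 on CIFAR-10. The same counting applies to Tucker-TRL and TT-TRL once their ranks are known.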



6.3 On the regularization effect of TRL
In this section, we investigate the performance of TRLs, focusing on their role as a regularizer for convolutional neural networks. We used shallow CNNs with different train/validation splits in which the number of training samples was kept small. We compare the performance of TRLs with that of a fully-connected layer and a GAP layer. To improve the regularization performance, Dropout [21] and weight decay were included in the comparison. The training datasets are obtained by randomly selecting samples from the initial training dataset, and keeping k samples for validation for each train/validation split.
We evaluate the performance of each model on three datasets: MNIST, Street View House Numbers (SVHN) [16] and CIFAR-10. The SVHN dataset consists of colored images of house numbers, with 73k and 26k samples for training and testing respectively. We employed a CNN with two (resp. three) convolutional layers for the MNIST dataset (resp. the CIFAR-10 and SVHN datasets). Dropout is inserted after the final convolutional layer.
The rank of each TRL is selected based on the dimensions of the output tensor as in Section 6.1. We run all experiments with early stopping, where the maximum steps is set to for MNIST and to for SVHN and CIFAR10. The best dropout rate is selected based on the validation accuracy, where the hyperparameter is sampled from . The decay factor for L2-regularization is similarly chosen from the set .
7 Conclusion
The tensor regression layer replaces the last flattening operation and fully-connected layers with a tensor regression with low tensor rank structure. In this work, we presented and investigated TRLs with CP, Tucker and TT decompositions. We showed that the learning procedure for each type of tensor regression layer can be derived using tensor algebra. We presented an analysis of the upper bound on the dimension of the image of the regression function, showing that the rank of the Tucker decomposition and the TT ranks affect this dimension.
We evaluated the proposed models on benchmark datasets (i.e. handwritten digits and natural images). We did not observe significant differences in accuracy among TRLs with various decompositions on the MNIST and CIFAR-10 datasets. The results using the state-of-the-art deep convolutional model show that when compared to a baseline model with fully-connected layer, TRL with CP decomposition achieved the rate of compression with the sacrifice of accuracy . When compared to the Residual Network with a GAP layer, our model empirically exhibits comparable performance in both accuracy and compression rate.
References
[1] Mohammad Taha Bahadori, Qi Rose Yu, and Yan Liu. Fast multivariate spatio-temporal analysis via low rank tensor learning. In Advances in Neural Information Processing Systems, pages 3491–3499, 2014.
[2] J Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319, 1970.
[3] Andrzej Cichocki, Danilo Mandic, Lieven De Lathauwer, Guoxu Zhou, Qibin Zhao, Cesar Caiafa, and Huy Anh Phan. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2):145–163, 2015.
[4] Richard A Harshman. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. 1970.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[7] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[8] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[9] Jean Kossaifi, Zachary C Lipton, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. arXiv preprint arXiv:1707.08308, 2017.
[10] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[12] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[13] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[14] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220, 2013.
[15] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[16] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.
[17] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442–450, 2015.
[18] Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
[19] Guillaume Rabusseau and Hachem Kadri. Low-rank regression with tensor responses. In Advances in Neural Information Processing Systems, pages 1867–1875, 2016.
[20] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[21] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[22] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[23] Rose Yu and Yan Liu. Learning from multiway data: Simple and efficient tensor regression. In International Conference on Machine Learning, pages 373–381, 2016.