Tensors have been attracting increasing interest from the machine learning community over the past decades. One of the reasons for this appreciation is that the tensor structure provides a natural representation of multi-modal data. Such multi-modal datasets are often encountered in scientific fields including image analysis, signal processing and spatio-temporal analysis [1, 23]. Tensor methods allow statistical models to efficiently learn multilinear relationships between inputs and outputs by leveraging multilinear algebra and efficient low-rank constraints. Low-rank constraints on higher-order multivariate regression can be interpreted as a regularization technique. As shown in previous work, efficient low-rank multilinear regression models with tensor responses can improve regression performance.
Incorporating tensor methods into deep neural networks has become a prominent area of study. In particular, over the past decade, tensor decomposition and approximation algorithms have been introduced to deep neural networks, notably for 1) efficient compression of the model with low-rank constraints and 2) leveraging the multi-modal structure of high-dimensional datasets. For illustration, Kossaifi et al. proposed the tensor regression layer (TRL), which replaces the vectorization operation and fully-connected layers of Convolutional Neural Networks (CNNs) with higher-order multivariate regression. The advantage of this replacement is a high compression rate of the model while preserving the multi-modal information of the data through efficient low-rank constraints. For high-dimensional data, the vectorization operation leads to the loss of multi-modal information: the higher-level dependencies among the various modes are lost when the data is mapped to a linear space. For instance, applying a flattening operation to a colored image (a 3rd-order tensor) removes the relationship between the red channel and the blue channel. The tensor regression layer is able to capture such multi-modal information by performing a multilinear regression task between the output of the last convolutional layer and the softmax.
Following this line of work, we investigate the properties and performance of tensor regression layers from the perspectives of regularization and compression. We interpret low-rank constraints as a regularization technique for higher-order multivariate regression and enforce them on the weight tensor between the output tensor of the CNN and the output vector. Furthermore, we compare tensor regression layers using various tensor decomposition approximations, aiming to provide comparative insight into the different low-rank constraints that can be enforced on higher-order multivariate regression. We compare the performance of TRLs using Tucker, CP and Tensor Train decompositions in a small standard CNN on MNIST and Fashion-MNIST, and extend this comparison to Residual Networks (ResNets) [5, 6] on CIFAR-10. To investigate the regularization effect, we employ shallow CNNs, train them with different numbers of training samples and compare the resulting performances.
We show that a compression rate of 54 can be achieved using the TT decomposition with a loss in accuracy of less than 0.3% with respect to the weight matrix of a 32-layer Residual Network with a fully-connected layer on the CIFAR-10 dataset. Surprisingly, we also show that an even better compression rate with a smaller loss in accuracy on CIFAR-10 can be achieved by simply using global average pooling (GAP) followed by a small fully-connected layer. However, using the same trick in the smaller CNN on MNIST led to very poor results.
The remainder of this paper is organized as follows. We start by reviewing background knowledge of multilinear algebra and tensor decomposition formats in Section 2. In Section 3, we present and investigate the tensor regression layer with different tensor decomposition formats. We show that the global average pooling (GAP) layer is a special case of the TRL with Tucker decomposition in Section 4. In Section 5, we present a simple analysis of low-rank constraints showing how particular choices of the tensor rank parameters can drastically affect the expressiveness of the network. We demonstrate the empirical performance of low-rank TRLs in Section 6, followed by a discussion and conclusion in Section 7.
2.1 Tensor Algebra
We begin with a concise review of notation and basic tensor algebra; for a more comprehensive review, we refer the reader to the tensor decomposition literature. Throughout this paper, vectors are denoted by boldface lowercase letters, e.g. $\mathbf{v}$; matrices and higher-order tensors are denoted by boldface uppercase and calligraphic letters respectively, e.g. $\mathbf{M}$ and $\mathcal{T}$. Given an $N$th-order tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, its $(i_1, \dots, i_N)$th entry is denoted by $\mathcal{T}_{i_1, \dots, i_N}$, where $i_n \in [I_n]$ for each $n$. The notation $[N]$ denotes the range of integers from $1$ to $N$ inclusive. Given a 3rd-order tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, its slices are the matrices obtained by fixing all but two indices; the horizontal, lateral and frontal slices of $\mathcal{T}$ are denoted by $\mathcal{T}_{i,:,:}$, $\mathcal{T}_{:,j,:}$ and $\mathcal{T}_{:,:,k}$ respectively. Similarly, the mode-$n$ fibers of $\mathcal{T}$ are the vectors obtained by fixing every index but the $n$th one. The mode-$n$ matricization (or mode-$n$ unfolding) of a tensor $\mathcal{T}$ is the matrix having the mode-$n$ fibers of $\mathcal{T}$ as columns and is denoted by $\mathcal{T}_{(n)} \in \mathbb{R}^{I_n \times \prod_{m \neq n} I_m}$. Given vectors $\mathbf{v}^{(1)} \in \mathbb{R}^{I_1}, \dots, \mathbf{v}^{(N)} \in \mathbb{R}^{I_N}$, their outer product is denoted by $\mathbf{v}^{(1)} \circ \cdots \circ \mathbf{v}^{(N)}$ and is defined by $(\mathbf{v}^{(1)} \circ \cdots \circ \mathbf{v}^{(N)})_{i_1, \dots, i_N} = \mathbf{v}^{(1)}_{i_1} \cdots \mathbf{v}^{(N)}_{i_N}$ for all $i_n \in [I_n]$. An $N$th-order tensor is called rank-one if it can be written as the outer product of $N$ vectors. The $n$-mode product of a tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with a matrix $\mathbf{U} \in \mathbb{R}^{J \times I_n}$ is denoted by $\mathcal{T} \times_n \mathbf{U}$ and is defined by
$(\mathcal{T} \times_n \mathbf{U})_{i_1, \dots, i_{n-1}, j, i_{n+1}, \dots, i_N} = \sum_{i_n=1}^{I_n} \mathcal{T}_{i_1, \dots, i_N} \mathbf{U}_{j, i_n}$ for all $j \in [J]$ and $i_m \in [I_m]$ with $m \neq n$, where $\mathcal{T} \times_n \mathbf{U} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$. Similarly, the $n$-mode product of a tensor $\mathcal{T}$ with a vector $\mathbf{v} \in \mathbb{R}^{I_n}$ is denoted by $\mathcal{T} \times_n \mathbf{v}$ and is defined by $(\mathcal{T} \times_n \mathbf{v})_{i_1, \dots, i_{n-1}, i_{n+1}, \dots, i_N} = \sum_{i_n=1}^{I_n} \mathcal{T}_{i_1, \dots, i_N} \mathbf{v}_{i_n}$.
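As an illustration (not part of the original paper), the $n$-mode product can be sketched in NumPy using `tensordot`; the function name `mode_n_product` is our own.

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    # Contract the `mode`-th axis of `tensor` with the second axis of `matrix`:
    # (T x_n U)[..., j, ...] = sum_i T[..., i, ...] * U[j, i]
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

T = np.random.rand(3, 4, 5)
U = np.random.rand(6, 4)
Y = mode_n_product(T, U, 1)   # mode 1 (size 4) is replaced by size 6
```

For each fixed value of the other indices, the mode-1 fiber of `Y` is simply `U` applied to the corresponding fiber of `T`.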
The Kronecker product of matrices $\mathbf{A} \in \mathbb{R}^{I \times J}$ and $\mathbf{B} \in \mathbb{R}^{K \times L}$ is the block matrix of size $IK \times JL$ whose $(i, j)$th block is $\mathbf{A}_{i,j} \mathbf{B}$, and is denoted by $\mathbf{A} \otimes \mathbf{B}$. Given matrices $\mathbf{A}$ and $\mathbf{B}$, both of size $I \times J$, their Hadamard product (or component-wise product) is denoted by $\mathbf{A} \ast \mathbf{B}$ and defined by $(\mathbf{A} \ast \mathbf{B})_{i,j} = \mathbf{A}_{i,j} \mathbf{B}_{i,j}$. The Khatri-Rao product of matrices $\mathbf{A} \in \mathbb{R}^{I \times R}$ and $\mathbf{B} \in \mathbb{R}^{J \times R}$ is the $IJ \times R$ matrix defined by
$$\mathbf{A} \odot \mathbf{B} = [\mathbf{a}_1 \otimes \mathbf{b}_1 \ \cdots \ \mathbf{a}_R \otimes \mathbf{b}_R],$$
where $\mathbf{a}_r$ (resp. $\mathbf{b}_r$) denotes the $r$th column of $\mathbf{A}$ (resp. $\mathbf{B}$).
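A minimal NumPy sketch (our own, not from the paper) of the Khatri-Rao product as the column-wise Kronecker product:

```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: the r-th column is kron(A[:, r], B[:, r]).
    I, R = A.shape
    J, R2 = B.shape
    assert R == R2, "factors must share the number of columns"
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

A = np.arange(6.0).reshape(3, 2)
B = np.arange(8.0).reshape(4, 2)
KR = khatri_rao(A, B)   # shape (12, 2)
```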
2.2 Various Tensor Decompositions
In this section we present three commonly used tensor decomposition formats: CANDECOMP/PARAFAC (CP), Tucker and Tensor-Train.
CP decomposition. The CP decomposition [2, 4] approximates a tensor by a sum of rank-one tensors. The rank of the decomposition is simply the number of rank-one tensors used to approximate the input tensor: given an input tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, its approximation by a CP decomposition of rank $R$ is defined by
$$\mathcal{T} \approx \sum_{r=1}^{R} \mathbf{a}^{(1)}_r \circ \mathbf{a}^{(2)}_r \circ \cdots \circ \mathbf{a}^{(N)}_r. \quad (2)$$
The right-hand side of Eq. (2) is the CP approximation of $\mathcal{T}$, where each factor matrix $\mathbf{A}^{(n)} \in \mathbb{R}^{I_n \times R}$ consists of the column vectors $\mathbf{a}^{(n)}_r$ for $r \in [R]$.
We have the following useful expression of Eq. (2) in terms of the matricizations of $\mathcal{T}$:
$$\mathcal{T}_{(n)} \approx \mathbf{A}^{(n)} \left( \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(n+1)} \odot \mathbf{A}^{(n-1)} \odot \cdots \odot \mathbf{A}^{(1)} \right)^{\top}. \quad (3)$$
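As a hedged illustration (names and shapes are our own), a third-order rank-$R$ CP reconstruction and a numerical check of the mode-1 matricization identity can be sketched in NumPy as follows:

```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product of two factor matrices.
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

def cp_to_tensor(A, B, C):
    # Rank-R CP reconstruction: X[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
    return np.einsum('ir,jr,kr->ijk', A, B, C)

R = 2
A, B, C = np.random.rand(3, R), np.random.rand(4, R), np.random.rand(5, R)
X = cp_to_tensor(A, B, C)

# Mode-1 matricization identity: X_(1) = A (C ⊙ B)^T,
# with columns of X_(1) ordered so the second index varies fastest.
X1 = X.transpose(0, 2, 1).reshape(3, -1)
```

Since `X1` factors through a matrix with `R` columns, its matrix rank is at most `R`.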
Tucker decomposition. The Tucker decomposition approximates a tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ by the product of a core tensor $\mathcal{G} \in \mathbb{R}^{R_1 \times \cdots \times R_N}$ and factor matrices $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ for $n \in [N]$:
$$\mathcal{T} \approx \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \times_3 \cdots \times_N \mathbf{U}^{(N)}. \quad (4)$$
The matricization of $\mathcal{T}$ from Eq. (4) can be written as
$$\mathcal{T}_{(n)} \approx \mathbf{U}^{(n)} \mathcal{G}_{(n)} \left( \mathbf{U}^{(N)} \otimes \cdots \otimes \mathbf{U}^{(n+1)} \otimes \mathbf{U}^{(n-1)} \otimes \cdots \otimes \mathbf{U}^{(1)} \right)^{\top}. \quad (5)$$
The tuple $(R_1, \dots, R_N)$ is the rank of the Tucker decomposition and determines the size of the core tensor $\mathcal{G}$. An example of a Tucker approximation of a fourth-order tensor is given in Figure 3.
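The Tucker reconstruction is a sequence of mode-$n$ products of the core with the factor matrices; a minimal NumPy sketch (our own naming, not from the paper):

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    # (T x_n U)[..., j, ...] = sum_i T[..., i, ...] * U[j, i]
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

def tucker_to_tensor(G, factors):
    # X = G x_1 U^(1) x_2 U^(2) ... x_N U^(N)
    X = G
    for n, U in enumerate(factors):
        X = mode_n_product(X, U, n)
    return X

G = np.random.rand(2, 2, 2)                                   # core, Tucker rank (2, 2, 2)
Us = [np.random.rand(3, 2), np.random.rand(4, 2), np.random.rand(5, 2)]
X = tucker_to_tensor(G, Us)                                   # shape (3, 4, 5)
```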
Tensor train decomposition. The tensor train (TT) decomposition provides a space-efficient representation for higher-order tensors. It approximates a tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with the product of $N$ third-order tensors $\mathcal{G}^{(1)}, \dots, \mathcal{G}^{(N)}$, called core tensors or simply cores, where $\mathcal{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$. The rank of the TT decomposition is the tuple $(R_0, R_1, \dots, R_N)$, where $R_0 = R_N = 1$.
Given a tensor $\mathcal{T} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, its approximation by the TT decomposition is defined as
$$\mathcal{T}_{i_1, \dots, i_N} \approx \mathcal{G}^{(1)}_{:, i_1, :} \mathcal{G}^{(2)}_{:, i_2, :} \cdots \mathcal{G}^{(N)}_{:, i_N, :}, \quad (6)$$
where juxtaposition denotes the matrix product (each slice $\mathcal{G}^{(n)}_{:, i_n, :}$ is an $R_{n-1} \times R_n$ matrix).
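A minimal NumPy sketch (our own, not part of the paper) of reconstructing a full tensor from TT cores by chaining the boundary contractions:

```python
import numpy as np

def tt_to_tensor(cores):
    # cores[n] has shape (R_{n-1}, I_n, R_n) with R_0 = R_N = 1;
    # T[i1, ..., iN] = G1[:, i1, :] @ G2[:, i2, :] @ ... @ GN[:, iN, :]
    X = cores[0]
    for core in cores[1:]:
        X = np.tensordot(X, core, axes=(-1, 0))
    return X.squeeze(axis=(0, -1))

cores = [np.random.rand(1, 3, 2), np.random.rand(2, 4, 2), np.random.rand(2, 5, 1)]
T = tt_to_tensor(cores)   # shape (3, 4, 5)
```

Each entry of `T` is the scalar obtained by multiplying the corresponding matrix slices of the cores, exactly as in Eq. (6).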
In order to express Eq. (6) in terms of matricizations of $\mathcal{T}$, we first define the following contraction operation on core tensors. Given the set of core tensors $\mathcal{G}^{(1)}, \dots, \mathcal{G}^{(N)}$ in Eq. (6), we define $\mathcal{G}^{(\leq n)}$ as the product of the core tensors $\mathcal{G}^{(m)}$ for $m \leq n$. Similarly, we define $\mathcal{G}^{(> n)}$ as the product of the core tensors $\mathcal{G}^{(m)}$ for $m > n$. A tensor network representation of this core separation is provided in Figure 4.
3 Tensor Regression Layer
In this section, we introduce the tensor regression layer with various low-rank tensor approximations. As stated in Section 1, the last fully-connected layer of a traditional CNN represents a large proportion of the model parameters. In addition to this large consumption of computational resources, the flattening operation leads to the loss of the rich multi-modal information present in the last convolutional layer. The tensor regression layer replaces the final flattening and fully-connected layers of a CNN by a multilinear map with low Tucker rank. In this work, we explore imposing other low-rank constraints on the weight tensor and compare the compression and regularization effects of using either the CP, Tucker or TT decomposition.
Given an input tensor $\mathcal{X} \in \mathbb{R}^{S_1 \times \cdots \times S_N}$ and a weight tensor $\mathcal{W} \in \mathbb{R}^{S_1 \times \cdots \times S_N \times C}$, we investigate the function $f : \mathbb{R}^{S_1 \times \cdots \times S_N} \to \mathbb{R}^{C}$, where $C$ is the number of classes. Given these two tensors, the function is defined as
$$f(\mathcal{X})_c = \sum_{s_1, \dots, s_N} \mathcal{X}_{s_1, \dots, s_N} \mathcal{W}_{s_1, \dots, s_N, c} + \mathbf{b}_c, \quad (9)$$
where $\mathbf{b} \in \mathbb{R}^{C}$ is a bias vector added to the contraction of $\mathcal{X}$ and $\mathcal{W}$. The tensor network representation of an example of Eq. (9) is given in Figure 5. The main idea behind tensor regression layers is to enforce a low tensor rank structure on $\mathcal{W}$ in order to both reduce memory usage and leverage the multilinear structure of the input $\mathcal{X}$.
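The forward pass of a (full, unfactorized) TRL is a complete contraction of the activation tensor with the weight tensor over the input modes. A minimal NumPy sketch (our own function name, assuming Eq. (9)):

```python
import numpy as np

def trl_forward(X, W, b):
    # Contract the activation tensor X (S1 x ... x SN) with the weight
    # tensor W (S1 x ... x SN x C) over all input modes, then add the bias.
    modes = list(range(X.ndim))
    return np.tensordot(X, W, axes=(modes, modes)) + b

X = np.random.rand(3, 4, 5)       # e.g. output of the last convolutional layer
W = np.random.rand(3, 4, 5, 10)   # full regression tensor, C = 10 classes
b = np.random.rand(10)
y = trl_forward(X, W, b)          # shape (10,)
```

This is numerically identical to flattening followed by a fully-connected layer; the benefit of the TRL comes from then constraining `W` to a low-rank format instead of storing it densely.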
Throughout the paper, we denote a TRL with TT decomposition by TT-TRL. Similarly we use CP-TRL and Tucker-TRL for a TRL with CP or Tucker decomposition.
From this formulation we can obtain the partial derivatives needed to implement gradient-based optimization methods (e.g. backpropagation) with respect to each of the factor matrices of the decomposition. Furthermore, for a given mode, these partial derivatives can be naturally arranged into a third-order tensor, and their expression can be obtained concisely using unfoldings.
Tucker decomposition. As described in Section 2, the Tucker decomposition approximates an input tensor by a core tensor and a set of factor matrices. We can rewrite Eq. (9) using the Tucker approximation of the weight tensor, where $\mathcal{W}$ is approximated with
$$\mathcal{W} \approx \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \cdots \times_{N+1} \mathbf{U}^{(N+1)}.$$
We can again obtain concise expressions for the partial derivatives using unfoldings.
Tensor Train decomposition. The tensor network visualization is given in Figure 8, where the weight tensor is replaced with its TT representation. Using Eq. (6) and (8), Eq. (9) can be rewritten in the case of the TT decomposition, where the second equality follows from the definition of the TT format. Similarly to the case of the CP and Tucker decompositions, the partial derivatives with respect to each core can be expressed in closed form.
4 Tensor perspective on Global Average Pooling layer
In this section, we provide insight into the Global Average Pooling (GAP) layer from the perspective of tensor algebra. In particular, we show that the GAP layer is a special case of the Tucker-TRL.
It is traditional practice to apply a flattening operation to the output tensor of the last convolutional layer before extracting its features. The problem with such an approach lies in its ability to generalize to the test dataset: several works on deep neural networks show that fully-connected layers are prone to overfitting, leading to poor performance on the test dataset [7, 13, 11].
In order to tackle this generalization problem and to provide regularization, the Global Average Pooling (GAP) layer was introduced by Lin et al. It replaces the combination of the vectorization operation and the fully-connected layer with an averaging operation over all slices along the output channel axis. The output of a GAP layer is thus a single vector whose size is the number of output channels, and GAP layers have been empirically shown to significantly reduce the number of model parameters in CNNs. The authors claim not only that the GAP layer reduces the number of trainable model parameters but also that it can prevent the model from overfitting during training. Over the last decade, the GAP layer has been adopted in some of the most successful image classification models, such as Residual Networks and VGG-16 [5, 20].
A more general interpretation of the convolutional output is as a higher-order tensor $\mathcal{X} \in \mathbb{R}^{S_1 \times \cdots \times S_N \times C}$. Given such a tensor, the GAP layer outputs the vector $\mathbf{g} \in \mathbb{R}^{C}$ defined by
$$\mathbf{g}_c = \frac{1}{S_1 \cdots S_N} \sum_{s_1, \dots, s_N} \mathcal{X}_{s_1, \dots, s_N, c}.$$
We here assume that the axis for the output channels corresponds to the last mode of the tensor $\mathcal{X}$. We now show that a GAP layer mapping $\mathbb{R}^{S_1 \times \cdots \times S_N \times C}$ to $\mathbb{R}^{C}$ is equivalent to a specific Tucker-TRL with rank $(1, \dots, 1, C)$. Indeed, let $\mathcal{W}$ be the regression tensor of a Tucker-TRL whose factor matrices are the averaging vectors $\frac{1}{S_n} \mathbf{1} \in \mathbb{R}^{S_n}$ for each spatial mode $n \in [N]$ and the identity matrix $\mathbf{I}_C$ for the channel mode. A direct computation then shows that the output of this TRL coincides with the GAP output.
Observe that the composition of a GAP layer with a fully-connected layer can also be achieved using a single Tucker-TRL by setting the channel factor to be the weight matrix of the fully-connected layer instead of the identity. A graphical representation of this equivalence is shown in Figure 12.
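The equivalence can be checked numerically; the sketch below (our own, under the assumption that the GAP-equivalent weight uses averaging factors on the spatial modes and the identity on the channel mode) builds the Tucker-structured regression tensor explicitly and contracts it against a random activation tensor:

```python
import numpy as np

def gap(X):
    # Average over all spatial modes; channels are on the last axis.
    return X.mean(axis=tuple(range(X.ndim - 1)))

def gap_weight_tensor(spatial_shape, C):
    # Tucker-structured regression tensor equivalent to GAP: averaging
    # factors 1/S_n on the spatial modes, identity on the channel mode.
    scale = 1.0 / np.prod(spatial_shape)
    return np.full(spatial_shape, scale)[..., None, None] * np.eye(C)

X = np.random.rand(6, 7, 16)        # 6x7 spatial grid, 16 output channels
W = gap_weight_tensor((6, 7), 16)   # shape (6, 7, 16, 16)
y = np.tensordot(X, W, axes=([0, 1, 2], [0, 1, 2]))
```

Contracting `X` against this rank-$(1, 1, C)$ weight reproduces the GAP output exactly.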
5 Observations on Rank Constraints
In this section, we provide a simple guideline for choosing one of the components of the low-rank constraints enforced on a TRL. In particular, we observe that the CP rank parameter and the last Tucker/TT rank parameter affect the dimension of the image of the function computed by the TRL. As a consequence of this observation, if a TRL is used as the last layer of a network before a softmax activation function in a classification task with $C$ classes, setting this rank parameter to values greater than $C$ leads to unnecessary redundancy, while setting it to smaller values can detrimentally limit the expressiveness of the resulting classifier.
We start with a simple lemma needed to establish an upper bound on the dimension of the image of the regression function: if a matrix admits a low-rank factorization, then the linear map it defines has a correspondingly bounded image.
Lemma 1. If $\mathbf{M} = \mathbf{A}\mathbf{B}$ with $\mathbf{A} \in \mathbb{R}^{m \times r}$ and $\mathbf{B} \in \mathbb{R}^{r \times n}$, then $\dim(\mathrm{Im}(f)) \leq r$, where $f : \mathbf{x} \in \mathbb{R}^{n} \mapsto \mathbf{M}\mathbf{x} \in \mathbb{R}^{m}$.
Proof. Given such a function $f$, the dimension of the image of $f$ is the dimension of the space spanned by the column vectors of $\mathbf{M} = \mathbf{A}\mathbf{B}$. Each column vector of $\mathbf{M}$ is a linear combination of the column vectors of $\mathbf{A}$, since $\mathbf{M}_{:,j} = \mathbf{A}\mathbf{B}_{:,j}$, where $\mathbf{B}_{:,j}$ denotes the $j$th column vector of $\mathbf{B}$. Since the column vectors of $\mathbf{A}$ span a space of dimension at most $r$, the dimension of the span of the column vectors of $\mathbf{M}$ is upper-bounded by $r$, namely $\dim(\mathrm{Im}(f)) \leq r$. ∎
Using Lemma 1, we can provide upper-bounds on the dimension of the image spanned by the regression function of a TRL for different tensor rank constraints.
Proposition 2. Let $f$ be the function computed by a TRL with regression tensor $\mathcal{W} \in \mathbb{R}^{S_1 \times \cdots \times S_N \times C}$. The following hold:
if $\mathcal{W}$ admits a TT decomposition of rank $(R_0, R_1, \dots, R_{N+1})$, then $\dim(\mathrm{Im}(f)) \leq R_N$,
if $\mathcal{W}$ admits a Tucker decomposition of rank $(R_1, \dots, R_{N+1})$, then $\dim(\mathrm{Im}(f)) \leq R_{N+1}$,
if $\mathcal{W}$ admits a CP decomposition of rank $R$, then $\dim(\mathrm{Im}(f)) \leq R$.
In each case, the corresponding factorization of the matricized regression map exhibits a factor with the stated number of columns, and the bound follows by Lemma 1.
We have shown that the dimension of the image of the function computed by a TRL is upper-bounded by one of the tensor rank parameters. We refer to this specific component of the rank tuple as the bottleneck rank. Given a regression tensor $\mathcal{W}$, if $\mathcal{W}$ admits a Tucker decomposition with rank $(R_1, \dots, R_{N+1})$, we define $R_{N+1}$ as the bottleneck rank. Similarly, if $\mathcal{W}$ admits a TT decomposition with TT-rank $(R_0, \dots, R_{N+1})$, we define $R_N$ as the bottleneck rank.
This observation on the rank constraints used in a tensor regression layer provides a simple guideline for choosing the bottleneck rank. For instance, when a TRL is used as the last layer of an architecture for a classification task, setting the bottleneck rank to a value smaller than the number of classes could limit the expressiveness of the TRL (which we demonstrate empirically in Section 6.1), while setting it to a larger value could lead to redundancy in the model parameters.
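The bound is easy to verify empirically; the sketch below (our own, with arbitrary small shapes) builds a weight tensor of CP rank $R < C$ and checks that the outputs of the regression map span at most $R$ dimensions, no matter how many inputs are fed through it:

```python
import numpy as np

# Empirical check of the bottleneck-rank bound: with a weight tensor of
# CP rank R, the outputs of the regression map span at most R dimensions.
rng = np.random.default_rng(0)
R, C = 3, 10
A = rng.standard_normal((4, R))
B = rng.standard_normal((5, R))
F = rng.standard_normal((C, R))
W = np.einsum('ir,jr,cr->ijc', A, B, F)   # CP-rank-R weight, shape (4, 5, C)

inputs = rng.standard_normal((100, 4, 5))
outputs = np.tensordot(inputs, W, axes=([1, 2], [0, 1]))   # shape (100, C)
```

The rank of the `outputs` matrix never exceeds `R`, even though the output space has dimension `C`, so a classifier built on top of such a TRL can separate at most `R` directions.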
6 Experiments

In this section we provide experimental evidence which 1) supports our analysis of TRLs in Section 5 and 2) investigates the compressive and regularization power of the different low-rank constraints. We present experiments with tensor regression layers using the CP, Tucker and TT decompositions on the benchmark datasets MNIST, Fashion-MNIST, CIFAR-10 and CIFAR-100.
6.1 MNIST and Fashion-MNIST dataset
The MNIST dataset consists of 1-channel images of handwritten digits from 0 to 9, with 60k training and 10k test examples. The purpose of this experiment is to provide insight into the regularization power of different low-rank constraints. Our baseline classifier is a CNN with convolutional layers followed by a fully-connected layer, with ReLUs [15] introduced between layers as non-linear activations. We tested the model with three tensor approximations: CP, Tucker and TT. By applying various low-rank constraints, we aim to show that as these constraints are relaxed, the approximation error becomes smaller and the accuracy of the low-rank model approaches that of the model without regularization (i.e. without low-rank constraints).
We concisely review the choice of low-rank constraints for the Tucker, TT and CP models; the detailed experimental configuration is available online (https://github.com/xwcao/LowRankTRN). Given an output tensor from the final convolutional layer (where the first mode indexes the samples in one batch), we constrain the weight tensor with the rank of the Tucker decomposition. Following Proposition 2, the bottleneck rank is set to the number of classes for the TRLs with Tucker and TT constraints.
The bias term of each layer is initialized to a constant. For the Tucker-TRL, we conducted a set of experiments in which the rank parameters were set with progressively looser constraints. A corresponding set of experiments was conducted for the TT-TRL by varying the TT-rank. For the CP-TRL, we simply evaluated the performance over a range of ranks.
We also evaluate the empirical performance of TRLs on another MNIST-like dataset: Fashion-MNIST. The dataset consists of 60k training and 10k test images, where each sample belongs to one of ten classes of fashion items such as shoes, clothes and hats. We used the same CNN architecture and hyperparameters as for the MNIST dataset.
Experimental outcomes for both datasets are provided in Figure 13, where we can see that all low-rank approximation models exhibit similar performance on both MNIST and Fashion-MNIST. As for the regularization effect, it is observable that as we relax the low-rank constraints, the accuracy of each model gradually converges to that of the baseline model. This result illustrates the regularization power that low-rank constraints provide. We also conducted experiments where we used a GAP layer instead of a fully-connected layer on both MNIST and Fashion-MNIST; in both cases, the model performed very poorly compared to its fully-connected counterpart.
We conducted a similar experiment to provide empirical support for Proposition 2. In Section 5 we showed that the dimension of the image of a TRL is upper-bounded by the bottleneck rank. We therefore ran experiments in which the bottleneck rank was fixed to one of several values. The experimental results presented in Figure 14 show clear distinctions among models with different bottleneck ranks: the bottleneck rank affects the test accuracy by providing an upper bound on the dimension of the image of the TRL.
6.2 CIFAR-10 and CIFAR-100 dataset with Residual Networks
We evaluate the performance of the tensor regression layer with deep CNNs on two further benchmark datasets: CIFAR-10 and CIFAR-100. The CIFAR-10 dataset consists of 50k training and 10k test images from 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. Similarly, CIFAR-100 consists of colored images from 100 classes. We employ the Residual network architecture and replace the GAP layer with the CP-, Tucker- and TT-TRLs. Following the original training procedure, we trained the model with SGD with momentum, decaying the learning rate at fixed iteration steps, with a fixed batch size and weight decay. The images are pre-processed with whitening normalization followed by random horizontal flips and random cropping with a padding of 2 pixels on each side.
The experimental results are reported in Table 1 for a 32-layer Residual network on CIFAR-10 and a Residual network on CIFAR-100. In order to compare compression rates, we set the baseline model to be the Residual network with a fully-connected layer instead of a GAP layer. The errors in Table 1 are obtained by choosing the model with the best validation score. The experiment shows that the CP-TRL achieves test accuracy comparable to the ResNet with a GAP layer; however, the GAP layer performed best in terms of both compression and accuracy.
6.3 On the regularization effect of TRL
In this section, we investigate the performance of the TRL focusing on its function as a regularizer for convolutional neural networks. We used shallow CNNs with different train/validation splits where the number of training samples was kept small, and we compare the performance of the TRL with a fully-connected layer and a GAP layer. To strengthen the comparison with other regularization techniques, Dropout and weight decay were also included. The training datasets are obtained by randomly selecting samples from the initial training dataset, keeping held-out samples for validation in each train/validation split.
We evaluate the performance of each model on three datasets: MNIST, Street View House Numbers (SVHN) and CIFAR-10. The SVHN dataset consists of colored images of house numbers, split into training and test samples. We employed a CNN with two (resp. three) convolutional layers for the MNIST dataset (resp. the CIFAR-10 and SVHN datasets). Dropout is inserted after the final convolutional layer.
The rank of each TRL is selected based on the dimensions of the output tensor as in Section 6.1. We run all experiments with early stopping, with the maximum number of steps set separately for MNIST and for SVHN and CIFAR-10. The dropout rate is selected based on the validation accuracy, with the hyper-parameter sampled from a fixed set of candidate values. The decay factor for the L2 regularization is chosen similarly.
7 Conclusion

The tensor regression layer replaces the final flattening operation and fully-connected layers of a CNN with a tensor regression with low tensor rank structure. In this work, we investigated tensor regression layers with various tensor decompositions: TRLs with the CP, Tucker and TT decompositions were presented and studied. We showed that the learning procedure for each type of tensor regression layer can be derived using tensor algebra. We also presented an analysis of the upper bound on the dimension of the image of the regression function, showing that the rank parameters of the Tucker and TT decompositions affect this dimension.
We evaluated the proposed models on benchmark datasets of handwritten digits and natural images. We did not observe significant differences in accuracy among TRLs with various decompositions on the MNIST and CIFAR-10 datasets. The results using a state-of-the-art deep convolutional model show that, compared to a baseline model with a fully-connected layer, the TRL with CP decomposition achieves a substantial compression rate with a small sacrifice in accuracy. When compared to the Residual network with a GAP layer, our model empirically exhibits comparable performance in both accuracy and compression rate.
-  Mohammad Taha Bahadori, Qi Rose Yu, and Yan Liu. Fast multivariate spatio-temporal analysis via low rank tensor learning. In Advances in neural information processing systems, pages 3491–3499, 2014.
-  J Douglas Carroll and Jih-Jie Chang. Analysis of individual differences in multidimensional scaling via an n-way generalization of "eckart-young" decomposition. Psychometrika, 35(3):283–319, 1970.
-  Andrzej Cichocki, Danilo Mandic, Lieven De Lathauwer, Guoxu Zhou, Qibin Zhao, Cesar Caiafa, and Huy Anh Phan. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2):145–163, 2015.
-  Richard A Harshman. Foundations of the parafac procedure: models and conditions for an" explanatory" multimodal factor analysis. 1970.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
-  Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
-  Tamara G Kolda and Brett W Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
-  Jean Kossaifi, Zachary C Lipton, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. arXiv preprint arXiv:1707.08308, 2017.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
-  Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
-  Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220, 2013.
-  Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
-  Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.
-  Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442–450, 2015.
-  Ivan V Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
-  Guillaume Rabusseau and Hachem Kadri. Low-rank regression with tensor responses. In Advances in Neural Information Processing Systems, pages 1867–1875, 2016.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
-  Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
-  Rose Yu and Yan Liu. Learning from multiway data: Simple and efficient tensor regression. In International Conference on Machine Learning, pages 373–381, 2016.