
# Tensor Regression Networks with various Low-Rank Tensor Approximations

Tensor regression networks achieve a high compression rate of model parameters in multilayer perceptrons (MLPs) with only a slight impact on performance. A tensor regression layer imposes low-rank constraints on the weight tensor of the regression that replaces the flattening operation and fully-connected layers of a traditional MLP. We investigate tensor regression networks using various low-rank tensor approximations, aiming to leverage the multi-modal structure of high-dimensional data by enforcing efficient low-rank constraints. We provide a theoretical analysis giving insights on the choice of the rank parameters. We evaluated the performance of the proposed models against state-of-the-art deep convolutional models. On the CIFAR-10 dataset, we achieved a compression rate of 0.018 with an accuracy loss of less than 1%.


## 1 Introduction

Tensors have been attracting increasing interest from the machine learning community over the past decades. One of the reasons for this appreciation is that tensors give a natural representation of multi-modal data. Such multi-modal datasets are often encountered in scientific fields including image analysis [14], signal processing [3] and spatio-temporal analysis [1, 23]. Tensor methods allow statistical models to efficiently learn multilinear relationships between inputs and outputs by leveraging multilinear algebra and efficient low-rank constraints. Low-rank constraints on higher-order multivariate regression can be interpreted as a regularization technique: as shown in [19], an efficient low-rank multilinear regression model with tensor response can improve regression performance.

Incorporating tensor methods into deep neural networks has become a prominent area of study. In particular, over the past decade, tensor decomposition and approximation algorithms have been introduced into deep neural networks, notably for 1) efficient compression of the model with low-rank constraints [17] and 2) leveraging the multi-modal structure of high-dimensional datasets [9]. For illustration, Kossaifi et al. [9] proposed the tensor regression layer (TRL), which replaces the vectorization operation and fully-connected layers of Convolutional Neural Networks (CNNs) with a higher-order multivariate regression. The advantage of this replacement is a high compression rate of the model while preserving the multi-modal information of the dataset through efficient low-rank constraints. On high-dimensional data, the vectorization operation leads to a loss of multi-modal information: the higher-level dependencies among the various modes are lost when the data is mapped to a linear space. For instance, flattening a colored image (a 3rd-order tensor) removes the relationship between the red channel and the blue channel. A tensor regression layer is able to capture this multi-modal information by performing a multilinear regression between the output of the last convolutional layer and the softmax.

Following [9], we investigate the properties and performance of tensor regression layers from the perspectives of regularization and compression. We interpret low-rank constraints as a regularization technique for higher-order multivariate regression and enforce them on the weight tensor between the output tensors of the CNN and the output vectors. Furthermore, we compare tensor regression layers under various tensor decomposition approximations, aiming to provide comparative insight on the different low-rank constraints that can be enforced on higher-order multivariate regression. We compare the performance of TRLs using the Tucker, CP and Tensor Train decompositions in a small standard CNN on MNIST and Fashion-MNIST, and in Residual Networks (ResNets) [5, 6] on CIFAR-10. To investigate the regularization effect, we employ shallow CNNs trained with different numbers of training samples and compare their performance.

We show that a compression rate of 54 can be achieved using the TT decomposition with an accuracy loss of less than 0.3% with respect to the weight matrix of a 32-layer Residual Network with a fully-connected layer on the CIFAR-10 dataset. Surprisingly, we also show that an even better compression rate with a smaller loss in accuracy on CIFAR-10 can be achieved by simply using global average pooling (GAP) followed by a small fully-connected layer. However, using the same trick in the smaller CNN on MNIST led to very poor results.

The remainder of this paper is organized as follows. We start by reviewing background knowledge on multilinear algebra and tensor decomposition formats in Section 2. In Section 3, we present and investigate the tensor regression layer with different tensor decomposition formats. We show that the global average pooling (GAP) layer is a special case of the TRL with Tucker decomposition in Section 4. In Section 5 we present a simple analysis of low-rank constraints showing how particular choices of the tensor rank parameters can drastically affect the expressiveness of the network. We demonstrate the empirical performance of low-rank TRLs in Section 6, followed by a discussion and the conclusion of our work in Section 7.

## 2 Background

### 2.1 Tensor Algebra

We begin with a concise review of the notations and basics of tensor algebra; for a more comprehensive review, we refer the reader to [8]. Throughout this paper, a vector is denoted by a boldface lowercase letter, e.g. $\mathbf{v}$. Matrices and higher-order tensors are denoted by boldface uppercase and calligraphic letters respectively, e.g. $\mathbf{M}$ and $\mathcal{X}$. Given an $N$th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, its $(i_1, \dots, i_N)$th entry is denoted by $\mathcal{X}_{i_1, \dots, i_N}$ or $\mathcal{X}_{i_1 i_2 \cdots i_N}$, where $i_n \in [1, I_n]$. The notation $[a, b]$ denotes the range of integers from $a$ to $b$ inclusive. Given a 3rd-order tensor $\mathcal{X}$, its slices are the matrices obtained by fixing all but two indices; the horizontal, lateral and frontal slices of $\mathcal{X}$ are denoted by $\mathcal{X}_{i,:,:}$, $\mathcal{X}_{:,j,:}$ and $\mathcal{X}_{:,:,k}$ respectively. Similarly, the mode-$n$ fibers of $\mathcal{X}$ are the vectors obtained by fixing every index but the $n$th one. The mode-$n$ matricization or mode-$n$ unfolding of a tensor $\mathcal{X}$ is the matrix having its mode-$n$ fibers as columns and is denoted by $X_{(n)}$. Given vectors $\mathbf{v}^{(1)} \in \mathbb{R}^{I_1}, \dots, \mathbf{v}^{(N)} \in \mathbb{R}^{I_N}$, the outer product of these vectors is denoted by $\mathbf{v}^{(1)} \circ \cdots \circ \mathbf{v}^{(N)}$ and is defined by $(\mathbf{v}^{(1)} \circ \cdots \circ \mathbf{v}^{(N)})_{i_1, \dots, i_N} = v^{(1)}_{i_1} \cdots v^{(N)}_{i_N}$ for all $i_n \in [1, I_n]$, where $n \in [1, N]$. An $N$th-order tensor is called rank-one if it can be written as the outer product of $N$ vectors. The $n$-mode product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with a matrix $\mathbf{U} \in \mathbb{R}^{J \times I_n}$ is denoted by $\mathcal{X} \times_n \mathbf{U}$ and is defined by

$$(\mathcal{X} \times_n \mathbf{U})_{i_1, \dots, i_{n-1}, j, i_{n+1}, \dots, i_N} = \sum_{i_n = 1}^{I_n} \mathcal{X}_{i_1, i_2, \dots, i_N}\, u_{j, i_n}$$

for all $i_1, \dots, i_{n-1}, i_{n+1}, \dots, i_N$, where $j \in [1, J]$. Similarly, we denote the $n$-mode product of a tensor $\mathcal{X}$ and a vector $\mathbf{v} \in \mathbb{R}^{I_n}$ by $\mathcal{X} \bullet_n \mathbf{v}$, and it is defined by $(\mathcal{X} \bullet_n \mathbf{v})_{i_1, \dots, i_{n-1}, i_{n+1}, \dots, i_N} = \sum_{i_n=1}^{I_n} \mathcal{X}_{i_1, \dots, i_N} v_{i_n}$.

The Kronecker product of matrices $\mathbf{A} \in \mathbb{R}^{I \times J}$ and $\mathbf{B} \in \mathbb{R}^{K \times L}$ is the block matrix of size $IK \times JL$ whose $(i,j)$th block is $a_{ij}\mathbf{B}$, and is denoted by $\mathbf{A} \otimes \mathbf{B}$. Given matrices $\mathbf{A}$ and $\mathbf{B}$, both of size $I \times J$, their Hadamard product (or component-wise product) is denoted by $\mathbf{A} * \mathbf{B}$ and defined by $(\mathbf{A} * \mathbf{B})_{ij} = a_{ij} b_{ij}$. The Khatri-Rao product of matrices $\mathbf{A} \in \mathbb{R}^{I \times K}$ and $\mathbf{B} \in \mathbb{R}^{J \times K}$ is the $IJ \times K$ matrix defined by

$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 & \mathbf{a}_2 \otimes \mathbf{b}_2 & \cdots & \mathbf{a}_K \otimes \mathbf{b}_K \end{bmatrix} \qquad (1)$$

where $\mathbf{a}_k$ (resp. $\mathbf{b}_k$) denotes the $k$th column of $\mathbf{A}$ (resp. $\mathbf{B}$).
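As a concrete illustration of these products, the following numpy sketch builds the Khatri-Rao product column by column from Kronecker products (the helper name `khatri_rao` is ours, not from the paper):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of matrices with the same number of columns."""
    assert A.shape[1] == B.shape[1]
    return np.stack([np.kron(A[:, k], B[:, k]) for k in range(A.shape[1])], axis=1)

A = np.arange(6.0).reshape(3, 2)   # I x K
B = np.arange(8.0).reshape(4, 2)   # J x K
P = khatri_rao(A, B)

assert P.shape == (12, 2)            # (I * J) x K, as in Eq. (1)
assert P[0, 0] == A[0, 0] * B[0, 0]  # each column is a_k ⊗ b_k
```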

### 2.2 Various Tensor Decompositions

In this section we present three of the commonly used tensor decomposition formats: Candecomp/Parafac, Tucker and Tensor-Train.

CP decomposition.    The CP decomposition [2, 4] approximates a tensor with a sum of rank-one tensors [8]. The rank of the decomposition is the number of rank-one tensors used to approximate the input tensor: given an input tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, its approximation with a CP decomposition of rank $R$ is defined by

$$\mathcal{X} \approx \sum_{j=1}^{R} \mathbf{a}^{(j)}_1 \circ \cdots \circ \mathbf{a}^{(j)}_N = [\![\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \cdots, \mathbf{A}^{(N)}]\!]. \qquad (2)$$

In Eq. (2), $[\![\mathbf{A}^{(1)}, \dots, \mathbf{A}^{(N)}]\!]$ denotes the CP approximation of $\mathcal{X}$, where each matrix $\mathbf{A}^{(n)} \in \mathbb{R}^{I_n \times R}$ consists of the column vectors $\mathbf{a}^{(j)}_n$ for $j \in [1, R]$.

We have the following useful expression of Eq. (2) in terms of the matricization of $\mathcal{X}$:

$$X_{(n)} \approx \mathbf{A}^{(n)} \left( \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(n+1)} \odot \mathbf{A}^{(n-1)} \odot \cdots \odot \mathbf{A}^{(1)} \right)^T \qquad (3)$$
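The unfolding identity in Eq. (3) can be checked numerically. The sketch below (our own helpers `unfold` and `khatri_rao`, assuming the Kolda-Bader unfolding convention with Fortran-ordered columns) builds an exact rank-2 tensor and verifies the identity for the first mode:

```python
import numpy as np

def unfold(T, n):
    """Mode-n unfolding; remaining modes are flattened in Fortran order."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order='F')

def khatri_rao(A, B):
    return np.stack([np.kron(A[:, k], B[:, k]) for k in range(A.shape[1])], axis=1)

rng = np.random.default_rng(0)
I, J, K, R = 3, 4, 5, 2
A1, A2, A3 = rng.normal(size=(I, R)), rng.normal(size=(J, R)), rng.normal(size=(K, R))

# Build X = sum_r a1_r o a2_r o a3_r (exact CP, so Eq. (3) holds with equality)
X = np.einsum('ir,jr,kr->ijk', A1, A2, A3)

# Eq. (3) for n = 1: X_(1) = A1 (A3 ⊙ A2)^T
lhs = unfold(X, 0)
rhs = A1 @ khatri_rao(A3, A2).T
assert np.allclose(lhs, rhs)
```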

Tucker decomposition.    The Tucker decomposition approximates a tensor by the product of a core tensor $\mathcal{G} \in \mathbb{R}^{R_1 \times \cdots \times R_N}$ and factor matrices $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ for $n \in [1, N]$:

$$\mathcal{X} \approx \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \cdots \times_N \mathbf{U}^{(N)} = [\![\mathcal{G}; \mathbf{U}^{(1)}, \cdots, \mathbf{U}^{(N)}]\!] \qquad (4)$$

The matricization of $\mathcal{X}$ from Eq. (4) can be written as

$$X_{(n)} \approx \mathbf{U}^{(n)} G_{(n)} \left( \mathbf{U}^{(N)} \otimes \cdots \otimes \mathbf{U}^{(n+1)} \otimes \mathbf{U}^{(n-1)} \otimes \cdots \otimes \mathbf{U}^{(1)} \right)^T \qquad (5)$$

The tuple $(R_1, \dots, R_N)$ is the rank of the Tucker decomposition and determines the size of the core tensor $\mathcal{G}$. An example of a Tucker approximation of a fourth-order tensor is given in Figure 3.
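A minimal numpy sketch of the Tucker format, under the same unfolding convention as above; `mode_dot` is our own helper for the $n$-mode product, and the final assertion checks the matricized identity of Eq. (5) for the first mode:

```python
import numpy as np

def unfold(T, n):
    """Mode-n unfolding; remaining modes are flattened in Fortran order."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order='F')

def mode_dot(T, U, n):
    """n-mode product T x_n U, where U has shape (J, I_n)."""
    return np.moveaxis(np.tensordot(U, T, axes=(1, n)), 0, n)

rng = np.random.default_rng(1)
G = rng.normal(size=(2, 3, 2))     # core tensor, Tucker rank (2, 3, 2)
U1, U2, U3 = (rng.normal(size=s) for s in [(4, 2), (5, 3), (6, 2)])

# Eq. (4): X = G x_1 U1 x_2 U2 x_3 U3 (exact here, since we build X this way)
X = mode_dot(mode_dot(mode_dot(G, U1, 0), U2, 1), U3, 2)
assert X.shape == (4, 5, 6)

# Eq. (5) for the first mode: X_(1) = U1 G_(1) (U3 ⊗ U2)^T
assert np.allclose(unfold(X, 0), U1 @ unfold(G, 0) @ np.kron(U3, U2).T)
```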

Tensor train decomposition.    The tensor train (TT) decomposition [18] provides a space-efficient representation for higher-order tensors. It approximates a tensor with a product of third-order tensors $\mathcal{G}^{(n)} \in \mathbb{R}^{R_{n-1} \times I_n \times R_n}$ called core tensors or simply cores. The rank of the TT decomposition is the tuple $(R_0, R_1, \dots, R_N)$ where $R_0 = R_N = 1$.

Given a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, its approximation by a TT decomposition is defined as

$$\mathcal{X}_{i_1, i_2, \dots, i_N} \approx \mathcal{G}^{(1)}_{i_1, :} \times \mathcal{G}^{(2)}_{:, i_2, :} \times \cdots \times \mathcal{G}^{(N)}_{:, i_N} = \left( \langle\!\langle \mathcal{G}^{(1)}, \cdots, \mathcal{G}^{(N)} \rangle\!\rangle \right)_{i_1, \dots, i_N} \qquad (6)$$

where $\times$ denotes the matrix product.

In order to express Eq. (6) in terms of matricizations of $\mathcal{X}$, we first define the following contraction operation on core tensors.

###### Definition 1.

Given the set of core tensors $\mathcal{G}^{(1)}, \dots, \mathcal{G}^{(N)}$ in Eq. (6), we define $\mathcal{G}^{\leq n} \in \mathbb{R}^{I_1 \times \cdots \times I_n \times R_n}$ as the product of the core tensors $\mathcal{G}^{(1)}, \dots, \mathcal{G}^{(n)}$ for $n \in [1, N]$:

$$\mathcal{G}^{\leq n}_{i_1, \dots, i_n, :} = \mathcal{G}^{(1)}_{i_1, :} \times \mathcal{G}^{(2)}_{:, i_2, :} \times \cdots \times \mathcal{G}^{(n)}_{:, i_n, :} \qquad (7)$$

Similarly, we define $\mathcal{G}^{> n} \in \mathbb{R}^{R_n \times I_{n+1} \times \cdots \times I_N}$ as the product of the core tensors $\mathcal{G}^{(n+1)}, \dots, \mathcal{G}^{(N)}$, where $\mathcal{G}^{< n}$ denotes $\mathcal{G}^{\leq n-1}$ and the boundary products $\mathcal{G}^{\leq 0}$ and $\mathcal{G}^{> N}$ are the scalar $1$. A tensor network representation of this core separation is provided in Figure 4.

Using Definition 1, the mode-$n$ unfolding of the tensor $\mathcal{X}$ in Eq. (6), where $n \in [1, N]$, can be written as

$$X_{(n)} \approx G^{(n)}_{(2)} \left( \mathcal{G}^{> n}_{(1)} \otimes \mathcal{G}^{< n}_{(n)} \right) \qquad (8)$$

where $G^{(n)}_{(2)} \in \mathbb{R}^{I_n \times R_{n-1} R_n}$ is the mode-2 unfolding of the $n$th core.
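To make the TT format concrete, the following numpy sketch (helper name `tt_to_full` is ours) contracts a chain of three cores into a full tensor and checks the entry-wise definition of Eq. (6):

```python
import numpy as np

rng = np.random.default_rng(2)
# TT cores for a 4 x 5 x 6 tensor with TT-rank (1, 2, 3, 1)
cores = [rng.normal(size=s) for s in [(1, 4, 2), (2, 5, 3), (3, 6, 1)]]

def tt_to_full(cores):
    """Contract the chain of third-order cores into the full tensor."""
    T = cores[0]
    for G in cores[1:]:
        T = np.tensordot(T, G, axes=(-1, 0))   # contract the shared rank index
    return T.reshape(T.shape[1:-1])            # drop boundary ranks R0 = RN = 1

X = tt_to_full(cores)
assert X.shape == (4, 5, 6)

# Entry-wise definition, Eq. (6): X[i,j,k] = G1[:,i,:] G2[:,j,:] G3[:,k,:]
i, j, k = 1, 2, 3
v = cores[0][:, i, :] @ cores[1][:, j, :] @ cores[2][:, k, :]
assert np.allclose(X[i, j, k], v.item())
```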

## 3 Tensor Regression Layer

In this section, we introduce the tensor regression layer with various low-rank tensor approximations. As stated in Section 1, the last fully-connected layer of a traditional CNN represents a large proportion of the model parameters. In addition to this large consumption of computational resources, the flattening operation discards the rich multi-modal information in the last convolutional layer. The tensor regression layer [9] replaces the last flattening and fully-connected layers of a CNN by a multilinear map with low Tucker rank. In this work, we explore imposing other low-rank constraints on the weight tensor, and we compare the compression and regularization effects of using either the CP, Tucker or TT decomposition.

Given an input tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and a weight tensor $\mathcal{W} \in \mathbb{R}^{I_1 \times \cdots \times I_N \times I_{N+1}}$, we investigate the function $f: \mathbb{R}^{I_1 \times \cdots \times I_N} \to \mathbb{R}^{I_{N+1}}$, where $I_{N+1}$ is the number of classes. Given these two tensors, the function $f$ is defined as

$$f(\mathcal{X}) = W_{(N+1)} \operatorname{vec}(\mathcal{X}) + \mathbf{b} \qquad (9)$$

where $\mathbf{b} \in \mathbb{R}^{I_{N+1}}$ is a bias vector added to the product of $W_{(N+1)}$ and $\operatorname{vec}(\mathcal{X})$. The tensor network representation of an example of Eq. (9) is given in Figure 5. The main idea behind tensor regression layers is to enforce a low tensor rank structure on $\mathcal{W}$ in order to both reduce memory usage and leverage the multilinear structure of the input $\mathcal{X}$.
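A direct implementation of the unconstrained map in Eq. (9) is a single contraction of the weight tensor with the input. The numpy sketch below (dimensions are illustrative, not from the paper) computes it both through the mode-$(N+1)$ unfolding and as an einsum:

```python
import numpy as np

rng = np.random.default_rng(3)
I1, I2, I3, C = 4, 4, 8, 10            # activation modes and number of classes
X = rng.normal(size=(I1, I2, I3))      # output of the last convolutional block
W = rng.normal(size=(I1, I2, I3, C))   # unconstrained regression weight tensor
b = np.zeros(C)

# Eq. (9): f(X) = W_(N+1) vec(X) + b, with vec(.) in Fortran order
W_unf = np.moveaxis(W, -1, 0).reshape(C, -1, order='F')   # mode-(N+1) unfolding
f = W_unf @ X.reshape(-1, order='F') + b

# The same map written as a full contraction over the input modes
f2 = np.einsum('ijk,ijkc->c', X, W) + b
assert np.allclose(f, f2)
```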

Throughout the paper, we denote a TRL with TT decomposition by TT-TRL. Similarly we use CP-TRL and Tucker-TRL for a TRL with CP or Tucker decomposition.

CP decomposition.    First, we investigate approximating the weight tensor $\mathcal{W}$ with a CP decomposition. Using Eq. (2) and Eq. (3), Eq. (9) can be rewritten as

$$f(\mathcal{X}) \approx \mathbf{A}^{(N+1)} \left( \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(1)} \right)^T \operatorname{vec}(\mathcal{X}) + \mathbf{b} \qquad (10)$$

We can use this formulation to obtain the partial derivatives needed to implement gradient-based optimization methods (e.g. backpropagation); indeed,

$$\frac{\partial f(\mathcal{X})_i}{\partial (\mathbf{A}^{(n)})_{jk}} = \frac{\partial \left( \mathbf{A}^{(N+1)} \left( \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(1)} \right)^T \operatorname{vec}(\mathcal{X}) \right)_i}{\partial (\mathbf{A}^{(n)})_{jk}} \qquad (11)$$

for all of the matrices $\mathbf{A}^{(n)}$ with $n \in [1, N+1]$. Furthermore, for a given mode $n$, we can naturally arrange these partial derivatives into a third-order tensor and obtain their expression using unfoldings:

$$\left( \frac{\partial f}{\partial \mathbf{A}^{(n)}} \right)_{(2)} = X_{(n)} \left( \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(n+1)} \odot \mathbf{A}^{(n-1)} \odot \cdots \odot \mathbf{A}^{(1)} \right) \left( \mathbf{A}^{(N+1)} \odot \mathbf{I}_R \right)^T$$

for $n \in [1, N]$, and

$$\left( \frac{\partial f}{\partial \mathbf{A}^{(N+1)}} \right)_{(1)} = \mathbf{I}_{I_{N+1}} \otimes \left( \operatorname{vec}(\mathcal{X})^T \left( \mathbf{A}^{(N)} \odot \cdots \odot \mathbf{A}^{(1)} \right) \right).$$
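The CP-TRL forward pass of Eq. (10) never needs to materialize the full weight tensor: contracting each factor matrix with the input and then mixing the $R$ components is equivalent and far cheaper. A sketch with illustrative dimensions (helper `khatri_rao` is ours):

```python
import numpy as np

def khatri_rao(A, B):
    return np.stack([np.kron(A[:, k], B[:, k]) for k in range(A.shape[1])], axis=1)

rng = np.random.default_rng(7)
(I1, I2, I3), C, R = (4, 4, 8), 10, 5
A1, A2, A3 = (rng.normal(size=(I, R)) for I in (I1, I2, I3))
A4 = rng.normal(size=(C, R))           # factor matrix for the output mode
X = rng.normal(size=(I1, I2, I3))

# Eq. (10): f(X) = A4 (A3 ⊙ A2 ⊙ A1)^T vec(X), with vec(.) in Fortran order
f = A4 @ khatri_rao(A3, khatri_rao(A2, A1)).T @ X.reshape(-1, order='F')

# Cheap route: contract each factor with the input, then mix the R components
f2 = A4 @ np.einsum('ijk,ir,jr,kr->r', X, A1, A2, A3)
assert np.allclose(f, f2)
```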

Tucker decomposition.    As described in Section 2, the Tucker decomposition approximates an input tensor by a core tensor and a set of factor matrices. We can rewrite Eq. (9) using a Tucker approximation of the weight tensor $\mathcal{W}$ as

$$f(\mathcal{X}) \approx \mathbf{U}^{(N+1)} G_{(N+1)} \left( \mathbf{U}^{(N)} \otimes \cdots \otimes \mathbf{U}^{(1)} \right)^T \operatorname{vec}(\mathcal{X}) + \mathbf{b} \qquad (12)$$

where the tensor $\mathcal{W}$ is approximated with

$$\mathcal{W} \approx \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \cdots \times_N \mathbf{U}^{(N)} \times_{N+1} \mathbf{U}^{(N+1)} = [\![\mathcal{G}; \mathbf{U}^{(1)}, \cdots, \mathbf{U}^{(N+1)}]\!] \qquad (13)$$

The tensor network representation of Eq. (12) is shown in Figure 8. Given a tensor of size $I_1 \times \cdots \times I_N$, the function $f$ maps this tensor to the space $\mathbb{R}^{I_{N+1}}$ under low-rank constraints.

We can again obtain concise expressions for the partial derivatives using unfoldings. Writing $f(\mathcal{X}) = \mathbf{U}^{(N+1)} G_{(N+1)} \mathbf{v} + \mathbf{b}$ with $\mathbf{v} = (\mathbf{U}^{(N)} \otimes \cdots \otimes \mathbf{U}^{(1)})^T \operatorname{vec}(\mathcal{X})$, the derivative with respect to the core is

$$\frac{\partial f(\mathcal{X})}{\partial \operatorname{vec}(G_{(N+1)})} = \mathbf{v}^T \otimes \mathbf{U}^{(N+1)}, \qquad (14)$$

the derivative with respect to the last factor matrix is

$$\left( \frac{\partial f(\mathcal{X})}{\partial \mathbf{U}^{(N+1)}} \right)_{(1)} = \left( \left( [\![\mathcal{G}; \mathbf{U}^{(1)}, \cdots, \mathbf{U}^{(N)}, \mathbf{I}_{R_{N+1}}]\!] \right)_{(N+1)} \operatorname{vec}(\mathcal{X}) \right)^T \otimes \mathbf{I}_{I_{N+1}} \qquad (15)$$

and, for $n \in [1, N]$,

$$\frac{\partial f(\mathcal{X})_i}{\partial \mathbf{U}^{(n)}} = X_{(n)} \left( \mathbf{U}^{(N)} \otimes \cdots \otimes \mathbf{U}^{(n+1)} \otimes \mathbf{U}^{(n-1)} \otimes \cdots \otimes \mathbf{U}^{(1)} \right) \left( \mathcal{G} \bullet_{N+1} \mathbf{u}^{(N+1)}_{i} \right)_{(n)}^{T} \qquad (16)$$

where $\mathbf{u}^{(N+1)}_{i}$ denotes the $i$th row of $\mathbf{U}^{(N+1)}$.
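The factored Tucker-TRL forward pass of Eq. (12) can be checked against the full contraction with the reconstructed weight tensor. The sketch below uses illustrative dimensions and sets the bottleneck rank to the number of classes:

```python
import numpy as np

rng = np.random.default_rng(4)
(I1, I2, I3), C = (4, 4, 8), 10
R1, R2, R3, R4 = 2, 2, 3, 10           # bottleneck rank R4 set to the number of classes

G = rng.normal(size=(R1, R2, R3, R4))
U1, U2, U3 = (rng.normal(size=s) for s in [(I1, R1), (I2, R2), (I3, R3)])
U4 = rng.normal(size=(C, R4))
X = rng.normal(size=(I1, I2, I3))

# Eq. (12): project the input onto the factor subspaces, then contract with the core
Z = np.einsum('ijk,ia,jb,kc->abc', X, U1, U2, U3)
f = U4 @ np.einsum('abcd,abc->d', G, Z)

# Reconstructing W = [[G; U1, U2, U3, U4]] (Eq. (13)) gives the same map
W = np.einsum('abcd,ia,jb,kc,ld->ijkl', G, U1, U2, U3, U4)
assert np.allclose(f, np.einsum('ijk,ijkl->l', X, W))
```

Note that the factored route touches only the small core and factor matrices, which is where the compression comes from.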

Tensor Train decomposition.    The tensor network visualization is given in Figure 8, where the weight tensor $\mathcal{W}$ is replaced with its TT representation. Using Eq. (6) and (8), in the case of the TT decomposition, Eq. (9) can be rewritten as

$$f(\mathcal{X}) \approx G^{(N+1)}_{(2)} \left( \mathcal{G}^{> N+1}_{(1)} \otimes \mathcal{G}^{< N+1}_{(N+1)} \right) \operatorname{vec}(\mathcal{X}) + \mathbf{b} = G^{(N+1)}_{(2)}\, \mathcal{G}^{< N+1}_{(N+1)} \operatorname{vec}(\mathcal{X}) + \mathbf{b} \qquad (17)$$

where the second equality follows from the fact that $\mathcal{G}^{> N+1}$ is the scalar $1$ (the boundary TT-rank $R_{N+1}$ equals $1$). Similarly to the case of the CP and Tucker decompositions, the partial derivatives can be summarized as

$$\left( \frac{\partial f(\mathcal{X})_i}{\partial G^{(n)}} \right)_{(2)} = X_{(n)} \left( \left( \mathcal{G}^{> n}_{:, \dots, :, i} \right)_{(1)} \otimes \mathcal{G}^{< n}_{(n)} \right)^T \qquad (18)$$

for all $n \in [1, N]$ and $i \in [1, I_{N+1}]$, and

$$\left( \frac{\partial f(\mathcal{X})}{\partial G^{(N+1)}} \right)^T_{(1)} = \left( \mathcal{G}^{< N+1}_{(N+1)} \operatorname{vec}(\mathcal{X}) \right)^T \otimes \mathbf{I}_{I_{N+1}} \qquad (19)$$
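A TT-TRL forward pass can be sketched the same way: contract the input with the first $N$ cores, then apply the output core, and compare against the fully reconstructed weight tensor (dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
(I1, I2, I3), C = (4, 4, 8), 10
R = (1, 2, 3, 4, 1)                    # TT-rank; the bottleneck rank R_N = 4 here
cores = [rng.normal(size=s) for s in
         [(R[0], I1, R[1]), (R[1], I2, R[2]), (R[2], I3, R[3]), (R[3], C, R[4])]]
X = rng.normal(size=(I1, I2, I3))

# Contract the input with the first N cores, then apply the output core
z = np.einsum('ijk,aib,bjc,ckd->ad', X, cores[0], cores[1], cores[2])[0]
f = np.einsum('d,dle->l', z, cores[3])

# Reconstructing the full weight tensor gives the same map
W = np.einsum('aib,bjc,ckd,dle->ijkl', cores[0], cores[1], cores[2], cores[3])
assert np.allclose(f, np.einsum('ijk,ijkl->l', X, W))
```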

## 4 Tensor perspective on Global Average Pooling layer

In this section, we provide insight on the Global Average Pooling layer from the perspective of tensor algebra. In particular, we show that the GAP layer is a special case of the Tucker-TRL.

It is traditional practice to apply a flattening operation to the output tensor of the last convolutional layer before extracting its features. The problem with this approach lies in the generalization ability to the test dataset: several works on deep neural networks show that fully-connected layers are prone to overfitting, leading to poor performance on the test dataset [7, 13, 11].

In order to tackle this generalization problem and to provide regularization, the Global Average Pooling (GAP) layer was presented by Lin et al. [13]. It replaces the combination of the vectorization operation and fully-connected layer with an averaging operation over all slices along the output channel axis. The output of a GAP layer is thus a single vector whose size is the number of output channels. The GAP layer was empirically shown to significantly reduce the number of model parameters in CNNs [13].

The authors of [13] claim not only that the GAP layer reduces the number of trainable model parameters but also that it prevents the model from overfitting during the training stage. Over the last decade, the GAP layer has been adopted in some of the most successful image classification models such as Residual Networks and VGG-16 [5, 20].

A more general interpretation of the convolutional output is that it is a higher-order tensor $\mathcal{X}$ in a space $\mathbb{R}^{I_1 \times \cdots \times I_N}$. Given such a tensor, the GAP layer outputs a vector $\mathbf{y} \in \mathbb{R}^{I_N}$ defined by

$$y_{i_N} = \prod_{n=1}^{N-1} I_n^{-1} \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_{N-1}=1}^{I_{N-1}} \mathcal{X}_{i_1 i_2 \dots i_N} \quad \text{for all } i_N \in [1, I_N]. \qquad (20)$$

We here assume that the output channel axis corresponds to the last mode of the tensor $\mathcal{X}$. We now show that a GAP layer mapping $\mathbb{R}^{I_1 \times \cdots \times I_N}$ to $\mathbb{R}^{I_N}$ is equivalent to a specific Tucker-TRL with rank $(1, \dots, 1, I_N, I_N)$. Indeed, let

$$\mathcal{W} = [\![\mathcal{G}; \mathbf{u}^{(1)}, \cdots, \mathbf{u}^{(N-1)}, \mathbf{U}^{(N)}, \mathbf{U}^{(N+1)}]\!]$$

be the regression tensor of a Tucker-TRL with zero bias, where $\mathbf{u}^{(n)} = \frac{1}{I_n}\mathbf{1} \in \mathbb{R}^{I_n}$ for each $n \in [1, N-1]$, $\mathbf{U}^{(N)} = \mathbf{U}^{(N+1)} = \mathbf{I}_{I_N}$, and the core $\mathcal{G} \in \mathbb{R}^{1 \times \cdots \times 1 \times I_N \times I_N}$ satisfies $\mathcal{G}_{1, \dots, 1, :, :} = \mathbf{I}_{I_N}$. We have

$$\begin{aligned} f(\mathcal{X})_{i_N} &= \left( \mathbf{U}^{(N+1)} G_{(N+1)} \left( \mathbf{U}^{(N)} \otimes \mathbf{u}^{(N-1)} \otimes \cdots \otimes \mathbf{u}^{(1)} \right)^T \operatorname{vec}(\mathcal{X}) \right)_{i_N} \\ &= \left( \mathcal{X} \bullet_1 \mathbf{u}^{(1)} \bullet_2 \cdots \bullet_{N-1} \mathbf{u}^{(N-1)} \right)_{i_N} \\ &= \sum_{i_{N-1}=1}^{I_{N-1}} \cdots \sum_{i_2=1}^{I_2} \sum_{i_1=1}^{I_1} \mathcal{X}_{i_1, i_2, \dots, i_N} (u^{(1)})_{i_1} (u^{(2)})_{i_2} \cdots (u^{(N-1)})_{i_{N-1}} \\ &= \prod_{n=1}^{N-1} I_n^{-1} \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_{N-1}=1}^{I_{N-1}} \mathcal{X}_{i_1, i_2, \dots, i_N} = y_{i_N}. \end{aligned} \qquad (21)$$

Observe that the composition of a GAP layer with a fully-connected layer mapping $\mathbb{R}^{I_N}$ to $\mathbb{R}^{C}$ can also be achieved using a single Tucker-TRL by setting $\mathbf{U}^{(N+1)}$ to be the weight matrix of the fully-connected layer instead of the identity. A graphical representation of this equivalence is shown in Figure 12.
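The equivalence of Eq. (21) is easy to verify numerically: averaging over the spatial modes coincides with contracting the input with uniform weighting vectors, which is exactly the Tucker-TRL special case above (a minimal sketch with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(5)
H, W_, C = 6, 6, 8                     # spatial modes and output channels
X = rng.normal(size=(H, W_, C))

gap = X.mean(axis=(0, 1))              # GAP: one average per output channel

# Tucker-TRL special case: uniform averaging vectors on the spatial modes,
# identity factors on the channel modes (Eq. (21))
u1 = np.full(H, 1.0 / H)
u2 = np.full(W_, 1.0 / W_)
f = np.einsum('ijk,i,j->k', X, u1, u2)
assert np.allclose(gap, f)
```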

## 5 Observations on Rank Constraints

In this section, we provide a simple guideline for choosing one of the components of the low-rank constraints enforced on a TRL. In particular, we observe that the CP rank parameter and the last Tucker/TT rank parameter affect the dimension of the image of the function computed by the TRL. As a consequence of this observation, if a TRL is used as the last layer of a network before a softmax activation in a classification task with $C$ classes, setting this rank parameter to values greater than $C$ leads to unnecessary redundancy, while setting it to smaller values can detrimentally limit the expressiveness of the resulting classifier.

First, we state a simple lemma needed to derive the upper bound on the dimension of the image of the regression function. It shows that if a matrix admits a factorization, then the dimension of the image of the associated linear map is bounded by the inner dimension of the factorization.

###### Lemma 1.

If $f(\mathbf{x}) = \mathbf{W}\mathbf{x}$ with $\mathbf{W} = \mathbf{U}\mathbf{V}$, $\mathbf{U} \in \mathbb{R}^{m \times R}$ and $\mathbf{V} \in \mathbb{R}^{R \times n}$, then $\dim(\operatorname{Im}(f)) \leq R$, where $\operatorname{Im}(f) = \{ f(\mathbf{x}) : \mathbf{x} \in \mathbb{R}^n \}$.

###### Proof.

Given such a function $f$, the dimension of the image of $f$ is the dimension of the space spanned by the column vectors of $\mathbf{W} = \mathbf{U}\mathbf{V}$; that is, $\dim(\operatorname{Im}(f)) = \dim(\operatorname{span}(\{\mathbf{w}_1, \dots, \mathbf{w}_n\}))$, where $\mathbf{w}_i$ denotes the $i$th column vector of $\mathbf{W}$. From the equation $\mathbf{w}_i = \mathbf{U}\mathbf{v}_i$, where $\mathbf{v}_i$ denotes the $i$th column vector of $\mathbf{V}$, each column vector of $\mathbf{W}$ is a linear combination of the column vectors of $\mathbf{U}$. Since the matrix $\mathbf{U}$ has $R$ columns, the dimension of the span of the column vectors of $\mathbf{W}$ is upper-bounded by $R$, namely $\dim(\operatorname{Im}(f)) \leq R$. ∎
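Lemma 1 can be illustrated numerically: pushing many random inputs through a map whose weight matrix factors through an inner dimension $R$ yields outputs whose span has rank at most $R$ (sketch with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, R = 10, 50, 3
W = rng.normal(size=(m, R)) @ rng.normal(size=(R, n))   # W = UV, inner dimension R

# Outputs f(x) = W x lie in the column space of U, so their span has dimension <= R
Y = W @ rng.normal(size=(n, 1000))     # 1000 random inputs pushed through f
assert np.linalg.matrix_rank(Y) <= R
```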

Using Lemma 1, we can provide upper bounds on the dimension of the image of the regression function of a TRL under different tensor rank constraints.

###### Proposition 2.

Let $f(\mathcal{X}) = W_{(N+1)} \operatorname{vec}(\mathcal{X}) + \mathbf{b}$ where $\mathcal{W} \in \mathbb{R}^{I_1 \times \cdots \times I_{N+1}}$. The following hold:

• if $\mathcal{W}$ admits a TT decomposition of rank $(R_0, R_1, \dots, R_{N+1})$, then $\dim(\operatorname{Im}(f)) \leq R_N$,

• if $\mathcal{W}$ admits a Tucker decomposition of rank $(R_1, \dots, R_{N+1})$, then $\dim(\operatorname{Im}(f)) \leq R_{N+1}$,

• if $\mathcal{W}$ admits a CP decomposition of rank $R$, then $\dim(\operatorname{Im}(f)) \leq R$.

###### Proof.

If $\mathcal{W}$ admits a TT decomposition with TT-rank $(R_0, R_1, \dots, R_{N+1})$, then by Eq. (6) we have $\mathcal{W} = \langle\!\langle \mathcal{G}^{(1)}, \cdots, \mathcal{G}^{(N+1)} \rangle\!\rangle$. Using the matricization of $\mathcal{W}$ given by Eq. (8), we can write $f$ as follows:

$$f(\mathcal{X}) = W_{(N+1)} \operatorname{vec}(\mathcal{X}) + \mathbf{b} = G^{(N+1)}_{(2)}\, \mathcal{G}^{< N+1}_{(N+1)} \operatorname{vec}(\mathcal{X}) + \mathbf{b}$$

and consequently, since $G^{(N+1)}_{(2)}$ and $\mathcal{G}^{< N+1}_{(N+1)}$ are of size $I_{N+1} \times R_N$ and $R_N \times I_1 \cdots I_N$ respectively (and the bias only translates the image), we have $\dim(\operatorname{Im}(f)) \leq R_N$ by Lemma 1.

The other two points can be proven in a similar fashion using Eq. (4) and (5) for Tucker, and Eq. (2) and (3) for CP. ∎

We have shown that the dimension of the image of the function $f$ is upper-bounded by one specific component of the tensor rank parameter. We refer to this component of the rank tuple as the bottleneck rank.

###### Definition 2.

Given a regression tensor $\mathcal{W} \in \mathbb{R}^{I_1 \times \cdots \times I_{N+1}}$, if $\mathcal{W}$ admits a Tucker decomposition with rank $(R_1, \dots, R_{N+1})$, we define the rank $R_{N+1}$ as the bottleneck rank. Similarly, if $\mathcal{W}$ admits a TT decomposition with TT-rank $(R_0, \dots, R_{N+1})$, we define $R_N$ as the bottleneck rank.

This observation on the rank constraints used in a tensor regression layer provides a simple guideline for choosing the bottleneck rank. For instance, when a TRL is used as the last layer of an architecture for a classification task with $C$ classes, setting the bottleneck rank to a value smaller than $C$ could limit the expressiveness of the TRL (which we empirically demonstrate in Section 6.1), while setting it to a value higher than $C$ could lead to redundancy in the model parameters.

## 6 Experiments

In this section we provide experimental evidence that 1) supports our analysis of TRLs in Section 5 and 2) investigates the compressive and regularization power of the different low-rank constraints. We present experiments with tensor regression layers using the CP, Tucker and TT decompositions on the benchmark datasets MNIST [12], Fashion-MNIST [22], CIFAR-10 and CIFAR-100 [10].

### 6.1 MNIST and Fashion-MNIST dataset

The MNIST dataset [12] consists of 1-channel images of handwritten digits from 0 to 9. The dataset contains 60k training and 10k test examples. The purpose of this experiment is to provide insight into the regularization power of different low-rank constraints. We set our baseline classifier to be a CNN with convolutional layers followed by a fully-connected layer, with rectified linear units (ReLU) [15] between the layers as non-linear activations. We tested the model with the three tensor approximations: CP, Tucker and TT. By varying the low-rank constraints, we aim to show that as the constraints are relaxed, the approximation error becomes smaller and the accuracy of the low-rank model approaches that of the model without regularization (i.e. without low-rank constraints).

We concisely review the choice of low-rank constraints for the Tucker, TT and CP models; the detailed experimental configuration is available online. Given an output tensor of the final convolutional layer, whose first mode indexes the samples in a batch, we constrain the weight tensor with the rank of a Tucker decomposition. Following Proposition 2, the bottleneck rank is set to the number of classes for the TRLs with Tucker and TT constraints.

Following [11], we initialized the weights in each layer from a zero-mean normal distribution and initialized the bias term of each layer to a constant. For the Tucker-TRL, we conducted a series of experiments in which the rank constraints were progressively relaxed. A corresponding set of experiments was conducted for the TT-TRL with increasing TT-ranks, and for the CP-TRL we evaluated the performance over a set of increasing CP ranks.

We also evaluate the empirical performance of the TRL on another MNIST-like dataset: Fashion-MNIST. The dataset consists of 60k training and 10k test images, where each sample belongs to one of ten classes of fashion items such as shoes, clothes and hats. We used the same CNN architecture and hyperparameters as for the MNIST dataset.

Experimental outcomes for both datasets are provided in Figure 13, where we can see that all low-rank approximation models exhibit similar performance on both MNIST and Fashion-MNIST. As for the regularization effect, we observe that as the low-rank constraints are relaxed, the accuracy of each model gradually converges to that of the baseline model; this illustrates the regularization power that low-rank constraints provide. We also conducted experiments using a GAP layer instead of the fully-connected layer on both MNIST and Fashion-MNIST. In both cases, the model performed very poorly compared to its fully-connected counterpart.

We conducted a similar experiment to provide empirical support for Proposition 2. In Section 5 we showed that the dimension of the image of the TRL is upper-bounded by the bottleneck rank, so we ran experiments fixing the bottleneck rank to each of several values. The experimental results presented in Figure 14 show clear distinctions among models with different bottleneck ranks: the bottleneck rank affects the test accuracy by upper-bounding the dimension of the image of the TRL.

### 6.2 CIFAR-10 and CIFAR-100 dataset with Residual Networks

We evaluate the performance of the tensor regression layer on two further benchmark datasets, CIFAR-10 and CIFAR-100, with deep CNNs. The CIFAR-10 dataset [10] consists of 50k training and 10k test images from 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. Similarly, CIFAR-100 consists of colored images from 100 classes [10]. We employ the Residual network structure [6] and replace the GAP layer with the CP, Tucker and TT-TRLs. Following [6], we trained the models with an initial learning rate of 0.1 and a momentum of 0.9; the learning rate is multiplied by 0.1 at the 32k and 48k iteration steps, and the training process is terminated at 64k steps. The size of each batch was set to 128 and the weight decay to 0.0001. The images are pre-processed with whitening normalization, followed by random horizontal flips and cropping with a padding of 2 pixels on each side.

The experimental results are reported in Table 1 for a 32-layer Residual network [6] on CIFAR-10 and a Residual network on CIFAR-100. In order to compare compression rates, we set the baseline model to be the Residual network with a fully-connected layer instead of the GAP layer. The errors in Table 1 are obtained by choosing the model with the best validation score. The experiments show that the CP-TRL achieves test accuracy comparable to the ResNet with a GAP layer; however, the GAP layer performed best in terms of both compression and accuracy.

### 6.3 On the regularization effect of TRL

In this section, we investigate the performance of the TRL focusing on its role as a regularizer for convolutional neural networks. We used shallow CNNs with different train/validation splits where the number of training samples was kept small, and compare the performance of the TRL with that of a fully-connected layer and a GAP layer. For comparison with standard regularization techniques, Dropout [21] and weight decay were also included. The training datasets are obtained by randomly selecting samples from the initial training dataset, keeping a held-out validation set for each train/validation split.

We evaluate the performance of each model on three datasets: MNIST, Street View House Numbers (SVHN) [16] and CIFAR-10. The SVHN dataset consists of colored images of house numbers and contains 73k and 26k samples for training and testing respectively. We employed a CNN with two (resp. three) convolutional layers for the MNIST dataset (resp. the CIFAR-10 and SVHN datasets). Dropout is inserted after the final convolutional layer.

The rank of each TRL is selected based on the dimensions of the output tensor, as in Section 6.1. We use early stopping in all experiments, with a smaller maximum number of steps for MNIST and a larger one for SVHN and CIFAR-10. The best dropout rate is selected based on validation accuracy, with the hyper-parameter sampled from a candidate set; the decay factor for the L2-regularization is chosen similarly.

The outcome of the experiment is presented in Table 5. A notable behavior of the TRL is observed there: in most of the settings, using dropout with the Tucker and TT-TRLs achieves better test accuracy than using dropout with a fully-connected layer.

## 7 Conclusion

The tensor regression layer replaces the last flattening operation and fully-connected layers of a CNN with a tensor regression under a low tensor rank structure. In this work, we investigated tensor regression layers with various tensor decompositions: TRLs with the CP, Tucker and TT decompositions were presented and analyzed. We showed that the learning procedure for each type of tensor regression layer can be derived using tensor algebra. We also presented an analysis of the upper bound on the dimension of the image of the regression function, showing that specific components of the Tucker and TT ranks control this dimension.

We evaluated the proposed models on benchmark datasets (i.e. handwritten digits and natural images). We did not observe significant differences in accuracy among the TRLs with the various decompositions on the MNIST and CIFAR-10 datasets. The results using a state-of-the-art deep convolutional model show that, compared to a baseline model with a fully-connected layer, the TRL with CP decomposition achieves a high rate of compression with a small sacrifice in accuracy. When compared to the Residual network with a GAP layer, our model empirically exhibits comparable performance in both accuracy and compression rate.