Stochastically Rank-Regularized Tensor Regression Networks

02/27/2019 ∙ by Arinbjörn Kolbeinsson, et al.

Over-parametrization of deep neural networks has recently been shown to be key to their successful training. However, it also renders them prone to overfitting and makes them expensive to store and train. Tensor regression networks significantly reduce the number of effective parameters in deep neural networks while retaining accuracy and the ease of training. They replace the flattening and fully-connected layers with a tensor regression layer, where the regression weights are expressed through the factors of a low-rank tensor decomposition. In this paper, to further improve tensor regression networks, we propose a novel stochastic rank-regularization. It consists of a novel randomized tensor sketching method to approximate the weights of tensor regression layers. We theoretically and empirically establish the link between our proposed stochastic rank-regularization and the dropout on low-rank tensor regression. Extensive experimental results with both synthetic data and real world datasets (i.e., CIFAR-100 and the UK Biobank brain MRI dataset) support that the proposed approach i) improves performance in both classification and regression tasks, ii) decreases overfitting, iii) leads to more stable training and iv) improves robustness to adversarial attacks and random noise.




1 Introduction

Deep neural networks have evolved into a general-purpose machine learning method with remarkable performance on practical applications (LeCun et al., 2015). Such models are usually over-parameterized, involving an enormous number (possibly millions) of parameters. This is much larger than the typical number of available training samples, making deep networks prone to overfitting (Caruana et al., 2001). Coupled with overfitting, the large number of unknown parameters makes deep learning models extremely hard and computationally expensive to train, requiring huge amounts of memory and computation power. Such resources are often available only in massive computer clusters, preventing deep networks from being deployed on resource-limited machines such as mobile and embedded devices.

To prevent deep neural networks from overfitting and improve their generalization ability, several explicit and implicit regularization methods have been proposed. More specifically, explicit regularization strategies, such as weight decay, involve $\ell_2$-norm regularization of the parameters (Nowlan & Hinton, 1992; Krogh & Hertz, 1992). Replacing the $\ell_2$ with the $\ell_1$-norm has also been investigated (Scardapane et al., 2017; Zhang et al., 2016). Besides the aforementioned general-purpose regularization functions, neural-network-specific methods such as early stopping of back-propagation (Caruana et al., 2001), batch normalization (Ioffe & Szegedy, 2015), dropout (Srivastava et al., 2014) and its variants, e.g., DropConnect (Wan et al., 2013), are algorithmic approaches to reducing overfitting in over-parametrized networks and have been widely adopted in practice.

Reducing the storage and computational costs of deep networks has become critical for meeting the requirements of environments with limited memory or computational resources. To this end, a surge of network compression and approximation algorithms have recently been proposed in the context of deep learning. By leveraging the redundancy in network parameters, methods such as Tai et al. (2015); Cheng et al. (2015); Yu et al. (2017); Kossaifi et al. (2018) employ low-rank approximations of deep networks’ weight matrices (or tensors) for parameter reduction. Network compression methods in the frequency domain (Chen et al., 2016) have also been investigated. An alternative approach for reducing the number of effective parameters in deep nets relies on sketching, whereby, given a matrix or tensor of input data or parameters, one first compresses it to a much smaller matrix (or tensor) by multiplying it by a (usually) random matrix with certain properties (Kasiviswanathan et al., 2017; Daniely et al., 2016).

A particularly appealing approach to network compression, especially for visual data (most modern data is inherently multi-dimensional: color images are naturally represented by 3rd-order tensors, videos by 4th-order tensors, etc.) and other types of multidimensional and multi-aspect data, is tensor regression networks (Kossaifi et al., 2018). Deep neural networks typically leverage the spatial structure of input data via a series of convolutions, point-wise non-linearities, pooling, etc. However, this structure is usually wasted by the addition, at the end of the networks’ architectures, of a flattening layer followed by one or several fully-connected layers. A recent line of study focuses on alleviating this using tensor methods. Kossaifi et al. (2017) proposed tensor contraction as a layer, to reduce the size of activation tensors, and demonstrated large space savings by replacing fully-connected layers with this layer. However, a flattening layer and fully-connected layers were still ultimately needed for producing the outputs. More recently, tensor regression networks (Kossaifi et al., 2018) replaced flattening and fully-connected layers entirely with a tensor regression layer (TRL). This preserves the structure by expressing an output tensor as the result of a tensor contraction between the input tensor and some low-rank regression weight tensors. In addition, these allow for large space savings without sacrificing accuracy. Cao et al. (2017) explore the same model with various low-rank structures on the regression weight tensor.

In this paper, we combine ideas from network regularization, low-rank approximation of networks, and randomized sketching in a principled way and introduce a novel stochastic regularization term to the tensor regression networks. It consists of a novel randomized low-rank tensor regression, which leads to the stochastic reduction of the rank, either by a fixed percentage during training or according to a series of Bernoulli random variables. This is akin to dropout, which, by randomly dropping units during training, prevents over-fitting. However, rather than dropping random elements from the activation tensor, this is done on the regression weight tensor. We explore two schemes: (i) selecting random elements to keep, following a Bernoulli distribution and (ii) keeping a random subset of the fibers of the tensor, with replacement. We theoretically and empirically establish the link between CP TRL with the proposed regularizer and the dropout on the deterministic low-rank tensor regression.

To demonstrate the practical advantages of this method, we conducted experiments in image classification and phenotypic trait prediction from MRI. To this end, the CIFAR-100 and the UK Biobank brain MRI datasets were employed. Experimental results demonstrate that the proposed method i) improves performance in both classification and regression tasks, ii) decreases over-fitting, iii) leads to more stable training and iv) largely improves robustness to adversarial attacks and random noise.

One notable application of deep neural networks is in medical imaging, particularly magnetic resonance imaging (MRI). MRI analysis performed using deep learning includes age prediction for brain-age estimation (Cole et al., 2017a). Brain-age has been associated with a range of diseases and mortality (Cole et al., 2017b), and could be an early predictor for Alzheimer’s disease (Franke et al., 2012). A more accurate and more robust brain age estimation can consequently lead to more accurate disease diagnoses. We demonstrate a large performance improvement (more than 20%) on this task using a 3D-ResNet with our proposed stochastically rank-regularized TRL, compared to a regular 3D-ResNet.

2 Closely related work

Network regularization and dropout. Several methods that improve generalization by mitigating overfitting have been developed in the context of deep learning. The interested reader is referred to the work of Kukačka et al. (2017) and the references therein for a comprehensive survey of over 50 different regularization techniques for deep networks.

The most closely related regularization method to our approach is Dropout (Srivastava et al., 2014), which is probably the most widely adopted technique for training neural networks while preventing overfitting. Concretely, during dropout training each unit (i.e., neuron) is equipped with a binary Bernoulli random variable and only the network’s weights whose corresponding Bernoulli variables are sampled with value 1 are updated at each back-propagation step. At each iteration, those Bernoulli variables are re-sampled and the weights are updated accordingly. The proposed regularization method can be interpreted as dropout on low-rank tensor regression, a fact which is proved in Section 4.

Sketching and deep networks approximation. Daniely et al. (2016) apply sketching to the input data in order to sparsify them and reduce their dimensionality. Subsequently they show any sparse polynomial function can be computed, on all sparse binary vectors, by a single layer neural network that takes a compact sketch of the vector as input. In contrast, Kasiviswanathan et al. (2017) approximate neural networks by applying random sketching to weight matrices/tensors instead of input data, and demonstrate that, given a fixed layer input, the output of this layer using sketching matrices is an unbiased estimator of the original output of this layer and has bounded variance. As opposed to the aforementioned sketching methods for deep networks approximation, the proposed method applies sketching in the low-rank factorization of weights.

Randomized tensor decompositions. Tensor decompositions exhibit high computational cost and low convergence rate when applied to massive multi-dimensional data. To accelerate computation, randomized tensor decompositions have been employed to scale tensor decompositions. A randomized least squares algorithm for CP decomposition is proposed by Battaglino et al. (2018), which is significantly faster than traditional CP decomposition. In (Erichson et al., 2017), CP is applied on a small tensor generated by tensor random projection of the high-dimensional tensor. The CP decomposition of the large-scale tensor is obtained by back projection of the CP decomposition of the small tensor. Wang et al. (2015) introduce a fast yet provable randomized CP decomposition that performs randomized tensor contraction using FFT. Methods in (Sidiropoulos et al., 2014; Vervliet et al., 2014) are highly computationally efficient algorithms for computing large-scale CP decompositions by applying randomization (random projections) into a set of small tensors, derived by subdividing a tensor into a set of blocks. Fast randomized algorithms that employ sketching for approximating Tucker decomposition have been also investigated (Tsourakakis, 2010; Zhou et al., 2014). More recently, a randomized tensor ring decomposition that employs tensor random projections has been developed in Yuan et al. (2019). The most similar method to ours is that of Battaglino et al. (2018), where elements of the tensor are sampled randomly, and each factor of the decomposition updated in an iterative manner. By contrast, our method allows for end-to-end training, and applies randomization on the fibers of the tensor, effectively randomizing the rank of the weight tensor.

(a) Tensor diagram of a TRL
(b) Tensor diagram of a SRR-TRL
Figure 1: Tensor diagrams of the TRL (left) and our proposed SRR-TRL (right), with low-rank constraints imposed on the regression weights tensor using a Tucker decomposition. Note that the CP case is readily given by this formulation by additionally having the core tensor $\mathcal{G}$ be super-diagonal, and setting all ranks to the same value $R$.

3 Tensor Regression Networks

In this section, we introduce the notations and notions necessary to introduce our stochastic rank regularization.


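We denote vectors (1st-order tensors) as $\mathbf{v}$ and matrices (2nd-order tensors) as $\mathbf{M}$. $\mathbf{I}$ is the identity matrix. We denote as $\mathcal{X} \in \mathbb{R}^{I_0 \times I_1 \times \cdots \times I_N}$ tensors of order $N+1$, and denote its element $(i_0, i_1, \cdots, i_N)$ as $\mathcal{X}_{i_0 i_1 \cdots i_N}$. A colon is used to denote all elements of a mode, e.g. the mode-0 fibers of $\mathcal{X}$ are denoted as $\mathcal{X}_{:, i_1, \cdots, i_N}$. The transpose of $\mathbf{M}$ is denoted $\mathbf{M}^\top$. Finally, for any $i, j \in \mathbb{N}$, $[i .. j]$ denotes the set of integers $\{i, i+1, \cdots, j-1, j\}$, and $i // j$ the integer division of $i$ by $j$.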

Tensor unfolding:

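Given a tensor, $\mathcal{X} \in \mathbb{R}^{I_0 \times I_1 \times \cdots \times I_N}$, its mode-$n$ unfolding is a matrix $\mathbf{X}_{[n]} \in \mathbb{R}^{I_n \times J_n}$, with $J_n = \prod_{k=0, k \neq n}^{N} I_k$, and is defined by the mapping from element $(i_0, i_1, \cdots, i_N)$ to $(i_n, j)$, with $j = \sum_{k=0, k \neq n}^{N} i_k \times \prod_{m=k+1, m \neq n}^{N} I_m$.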
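As a minimal illustration of mode-n unfolding, the NumPy sketch below moves the chosen mode to the front and flattens the remaining modes in row-major (C) order, so that later modes vary fastest. The helper name `unfold` and the toy tensor are illustrative choices rather than anything taken from the paper, and the row-major ordering is an assumption about the intended index mapping.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding: bring `mode` to the front, flatten the other modes (C order)."""
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1))

# Toy third-order tensor of shape (I0, I1, I2) = (2, 3, 4)
X = np.arange(24).reshape(2, 3, 4)
X1 = unfold(X, 1)  # matrix with I1 = 3 rows and I0 * I2 = 8 columns

# Element (i0, i1, i2) lands in row i1, column i0 * I2 + i2
i0, i1, i2 = 1, 2, 3
column = i0 * X.shape[2] + i2
```

Unfolding along mode 0 with this convention is simply `X.reshape(X.shape[0], -1)`.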

Tensor vectorization:

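Given a tensor, $\mathcal{X} \in \mathbb{R}^{I_0 \times I_1 \times \cdots \times I_N}$, we can flatten it into a vector $\text{vec}(\mathcal{X})$ of size $(I_0 \times \cdots \times I_N)$ defined by the mapping from element $(i_0, i_1, \cdots, i_N)$ of $\mathcal{X}$ to element $j$ of $\text{vec}(\mathcal{X})$, with $j = \sum_{k=0}^{N} i_k \times \prod_{m=k+1}^{N} I_m$.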

Mode-n product:

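For a tensor $\mathcal{X} \in \mathbb{R}^{I_0 \times I_1 \times \cdots \times I_N}$ and a matrix $\mathbf{M} \in \mathbb{R}^{R \times I_n}$, the n-mode product of $\mathcal{X}$ by $\mathbf{M}$ is a tensor $\mathcal{X} \times_n \mathbf{M}$ of size $(I_0 \times \cdots \times I_{n-1} \times R \times I_{n+1} \times \cdots \times I_N)$ and can be expressed using the unfolding of $\mathcal{X}$ and the classical dot product as:

$$\mathcal{Y} = \mathcal{X} \times_n \mathbf{M} \iff \mathbf{Y}_{[n]} = \mathbf{M} \mathbf{X}_{[n]}$$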
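The n-mode product can be sketched directly from its unfolding-based description: unfold along the mode, multiply by the matrix, and fold back. The helpers `unfold`, `fold` and `mode_dot` below are hypothetical names for this illustration (TensorLy offers comparable utilities, but this sketch does not rely on them), and the row-major unfolding convention is an assumption.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding with the remaining modes flattened in C order."""
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1))

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor of the given full `shape`."""
    full = [shape[mode]] + [s for k, s in enumerate(shape) if k != mode]
    return np.moveaxis(np.reshape(matrix, full), 0, mode)

def mode_dot(tensor, matrix, mode):
    """n-mode product: multiply the mode-`mode` unfolding by `matrix`, then fold back."""
    new_shape = list(tensor.shape)
    new_shape[mode] = matrix.shape[0]
    return fold(matrix @ unfold(tensor, mode), mode, new_shape)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 4))
M = rng.standard_normal((5, 3))   # maps mode 1 from size 3 to size 5
Y = mode_dot(X, M, 1)             # shape (2, 5, 4)
```

The same result can be obtained with `np.einsum('ijk,rj->irk', X, M)`, which is a convenient cross-check.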

Generalized inner product:

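For two tensors $\mathcal{X} \in \mathbb{R}^{I_x \times I_0 \times \cdots \times I_{N-1}}$ and $\mathcal{Y} \in \mathbb{R}^{I_0 \times \cdots \times I_{N-1} \times I_y}$, we denote by $\langle \mathcal{X}, \mathcal{Y} \rangle_N \in \mathbb{R}^{I_x \times I_y}$ the contraction of $\mathcal{X}$ by $\mathcal{Y}$ along their $N$ last (respectively first) modes.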

Kruskal tensor:

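Given a tensor $\mathcal{X} \in \mathbb{R}^{I_0 \times I_1 \times \cdots \times I_N}$, the Canonical-Polyadic decomposition (CP), also called PARAFAC, decomposes it into a sum of rank-1 tensors. The number of terms in the sum, $R$, is known as the rank of the decomposition. Formally, we find the vectors $\mathbf{u}_k^{(0)}, \mathbf{u}_k^{(1)}, \cdots, \mathbf{u}_k^{(N)}$, for $k \in [0 .. R-1]$ such that:

$$\mathcal{X} = \sum_{k=0}^{R-1} \mathbf{u}_k^{(0)} \circ \mathbf{u}_k^{(1)} \circ \cdots \circ \mathbf{u}_k^{(N)}$$

where $\circ$ denotes the outer product. These vectors can be collected in matrices, called factors of the decomposition. Specifically, we define, for each factor $k \in [0 .. N]$, $\mathbf{U}^{(k)} = [\mathbf{u}_0^{(k)}, \mathbf{u}_1^{(k)}, \cdots, \mathbf{u}_{R-1}^{(k)}]$. The magnitude of the factors can optionally be absorbed in a vector of weights $\boldsymbol{\lambda} \in \mathbb{R}^{R}$, such that $\mathcal{X} = \sum_{k=0}^{R-1} \lambda_k \, \mathbf{u}_k^{(0)} \circ \mathbf{u}_k^{(1)} \circ \cdots \circ \mathbf{u}_k^{(N)}$.

The decomposition can be denoted more compactly as $[\![ \mathbf{U}^{(0)}, \cdots, \mathbf{U}^{(N)} ]\!]$, or $[\![ \boldsymbol{\lambda}; \mathbf{U}^{(0)}, \cdots, \mathbf{U}^{(N)} ]\!]$ if a weights vector is used.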
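To make the Kruskal form concrete, the following sketch rebuilds a full tensor from its factor matrices and an optional weights vector using `np.einsum`. The function name `kruskal_to_tensor`, the sizes and the weights are illustrative assumptions; this is not the authors' implementation.

```python
import numpy as np

def kruskal_to_tensor(factors, weights=None):
    """Sum of weighted rank-1 tensors built from the columns of each factor matrix."""
    rank = factors[0].shape[1]
    if weights is None:
        weights = np.ones(rank)
    modes = "abcdefghijklmnopqrstuvwxy"[:len(factors)]
    # e.g. for three factors: "az,bz,cz,z->abc" (z is the shared rank index)
    spec = ",".join(m + "z" for m in modes) + ",z->" + modes
    return np.einsum(spec, *factors, weights)

rng = np.random.default_rng(0)
factors = [rng.standard_normal((dim, 2)) for dim in (3, 4, 5)]  # rank-2 example
weights = np.array([2.0, -1.0])
X = kruskal_to_tensor(factors, weights)  # shape (3, 4, 5)
```

Each of the two rank-1 terms is the outer product of one column from every factor, scaled by the matching weight.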

Tucker tensor:

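Given a tensor $\mathcal{X} \in \mathbb{R}^{I_0 \times I_1 \times \cdots \times I_N}$, we can decompose it into a low rank core $\mathcal{G} \in \mathbb{R}^{R_0 \times R_1 \times \cdots \times R_N}$ by projecting along each of its modes with projection factors $(\mathbf{U}^{(0)}, \cdots, \mathbf{U}^{(N)})$, with $\mathbf{U}^{(k)} \in \mathbb{R}^{I_k \times R_k}$, $k \in [0 .. N]$.

This allows us to write the tensor in a decomposed form as:

$$\mathcal{X} = \mathcal{G} \times_0 \mathbf{U}^{(0)} \times_1 \mathbf{U}^{(1)} \times \cdots \times_N \mathbf{U}^{(N)} = [\![ \mathcal{G}; \mathbf{U}^{(0)}, \cdots, \mathbf{U}^{(N)} ]\!]$$

Note that the Kruskal form of a tensor can be seen as a Tucker tensor with a super-diagonal core.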
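The remark that a Kruskal tensor is a Tucker tensor with a super-diagonal core can be checked numerically. The sketch below is a minimal NumPy illustration; `tucker_to_tensor` is a hypothetical helper written for this example, and all sizes are arbitrary.

```python
import numpy as np

def tucker_to_tensor(core, factors):
    """Multiply the core by one factor matrix along each mode."""
    out = core
    for mode, U in enumerate(factors):
        # Contract the rank axis of U with axis `mode` of the running tensor,
        # then move the new axis back into position `mode`.
        out = np.moveaxis(np.tensordot(U, out, axes=(1, mode)), 0, mode)
    return out

rng = np.random.default_rng(0)
R = 3
factors = [rng.standard_normal((dim, R)) for dim in (4, 5, 6)]

# Super-diagonal core: ones at positions (r, r, r), zeros elsewhere
core = np.zeros((R, R, R))
core[np.arange(R), np.arange(R), np.arange(R)] = 1.0

X_tucker = tucker_to_tensor(core, factors)
X_cp = np.einsum('ir,jr,kr->ijk', *factors)  # Kruskal form with the same factors
```

With a dense core the same function yields a general Tucker tensor, so the Kruskal case differs only in the structure of the core.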

Tensor diagrams:

To represent tensor operations easily, we adopt tensor diagrams, where tensors are represented by vertices (circles) and edges represent their modes. The degree of a vertex then represents its order. Connecting two edges symbolizes a tensor contraction between the two represented modes. Figure 1 presents a tensor diagram of the tensor regression layer and its stochastic rank-regularized counterpart.

Tensor regression layers (TRL):

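Let us denote by $\mathcal{X} \in \mathbb{R}^{I_0 \times I_1 \times \cdots \times I_{N-1}}$ the input activation tensor for a sample and $\mathbf{y} \in \mathbb{R}^{I_N}$ the label vector. We are interested in the problem of estimating the regression weight tensor $\mathcal{W} \in \mathbb{R}^{I_0 \times I_1 \times \cdots \times I_N}$ under some fixed low rank $(R_0, \cdots, R_N)$:

$$\mathbf{y} = \langle \mathcal{X}, \mathcal{W} \rangle_N + \mathbf{b}, \quad \text{with} \quad \mathcal{W} = \mathcal{G} \times_0 \mathbf{U}^{(0)} \times_1 \mathbf{U}^{(1)} \times \cdots \times_N \mathbf{U}^{(N)} \quad (3)$$

with $\mathcal{G} \in \mathbb{R}^{R_0 \times \cdots \times R_N}$, $\mathbf{U}^{(k)} \in \mathbb{R}^{I_k \times R_k}$ for each $k$ in $[0 .. N]$ and $\mathbf{b} \in \mathbb{R}^{I_N}$.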
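A forward pass through a Tucker tensor regression layer can be sketched in a few lines of NumPy. All sizes, the rank, and the variable names below are hypothetical, and the batch dimension is an addition for illustration; the point is only that the activation tensor is contracted with low-rank weights instead of being flattened and passed to a dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: S samples, activations of shape (I0, I1, I2), O outputs
S, (I0, I1, I2), O = 5, (4, 4, 8), 3
ranks = (2, 2, 3, 3)  # Tucker rank, one entry per mode of the weight tensor

core = rng.standard_normal(ranks)
U0, U1, U2, U3 = (rng.standard_normal((dim, r))
                  for dim, r in zip((I0, I1, I2, O), ranks))
bias = np.zeros(O)

# Low-rank regression weights: core multiplied by one factor along each mode
W = np.einsum('abcd,ia,jb,kc,od->ijko', core, U0, U1, U2, U3)

X = rng.standard_normal((S, I0, I1, I2))
# Contract the three activation modes of X with the first three modes of W
y = np.tensordot(X, W, axes=3) + bias  # shape (S, O)

n_lowrank = core.size + U0.size + U1.size + U2.size + U3.size
n_dense = W.size  # parameters of the equivalent flatten + fully-connected layer
```

In practice one would avoid materializing `W` and contract `X` with the factors directly; forming it here just keeps the correspondence with the dense layer explicit.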

4 Stochastic rank regularization

In this section, we introduce the stochastic rank regularization (SRR). Specifically, we propose a new stochastic rank-regularization, applied to low-rank tensors in decomposed forms. This formulation is general and can be applied to any type of decomposition. We introduce it here, without loss of generality, to the case of Tucker and CP decompositions.

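For any $k \in [0 .. N]$, let $\mathbf{M}_k \in \mathbb{R}^{K_k \times R_k}$ be a sketch matrix (e.g. a random projection or column selection matrix), $\tilde{\mathbf{U}}^{(k)} = \mathbf{U}^{(k)} \mathbf{M}_k^\top$ be a sketch of factor matrix $\mathbf{U}^{(k)}$, and $\tilde{\mathcal{G}} = \mathcal{G} \times_0 \mathbf{M}_0 \times_1 \cdots \times_N \mathbf{M}_N$ a sketch of the core tensor $\mathcal{G}$.

Given an activation tensor $\mathcal{X}$ and a target label vector $\mathbf{y}$, a stochastically rank-regularized tensor regression layer is written from equation 3 as follows:

$$\mathbf{y} = \langle \mathcal{X}, \tilde{\mathcal{W}} \rangle_N + \mathbf{b} \quad (4)$$

with $\tilde{\mathcal{W}}$ being a stochastic approximation of the Tucker decomposition, namely:

$$\tilde{\mathcal{W}} = \tilde{\mathcal{G}} \times_0 \tilde{\mathbf{U}}^{(0)} \times_1 \cdots \times_N \tilde{\mathbf{U}}^{(N)} \quad (5)$$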
Even though several sketching methods have been proposed, we focus here on SRR with two different types of binary sketching matrices, namely binary matrix sketching with replacement and binary diagonal matrix sketching with Bernoulli entries.

4.1 SRR with replacement:

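In this setting, we introduce the SRR with binary sketching matrix (with replacement). We first choose $\theta \in [0, 1]$.

Mathematically, we introduce the uniform sampling matrices $\mathbf{M}_0 \in \mathbb{R}^{K_0 \times R_0}, \cdots, \mathbf{M}_N \in \mathbb{R}^{K_N \times R_N}$. $\mathbf{M}_k$ is a uniform sampling matrix, selecting $K_k$ elements, where $K_k = R_k \times \theta$. In other words, for any $k \in [0 .. N]$, each row $i$ of $\mathbf{M}_k$ verifies:

$$\mathbf{M}_k(i, :) = \mathbf{e}_{j_i}^\top, \quad \text{with } j_i \text{ drawn uniformly, with replacement, from } [0 .. R_k - 1]$$

where $\mathbf{e}_j$ is the $j$-th canonical basis vector. Note that in practice this product is never explicitly computed; we simply select the correct elements from $\mathcal{G}$ and its corresponding factors.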
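The remark that the sketching product is never formed explicitly can be illustrated on a single mode. In the sketch below, the number of kept components `K = int(theta * R)` is an assumption about how the sampling fraction maps to a count, and all names and sizes are hypothetical. It checks that multiplying by a binary sampling matrix gives the same result as plain indexing.

```python
import numpy as np

rng = np.random.default_rng(0)
I, R, theta = 4, 6, 0.5
K = int(theta * R)                     # assumed number of sampled components

U = rng.standard_normal((I, R))        # one factor matrix
G = rng.standard_normal((R, 5, 5))     # core; only its first mode is sketched here

# Sample K rank indices uniformly, with replacement
idx = rng.integers(0, R, size=K)

# Explicit binary sampling matrix: row i is the canonical basis vector e_{idx[i]}
M = np.zeros((K, R))
M[np.arange(K), idx] = 1.0

U_explicit = U @ M.T                           # sketched factor via matrix product
G_explicit = np.tensordot(M, G, axes=(1, 0))   # sketched core via mode-0 product

# Same result by simple selection, with no matrix product at all
U_selected = U[:, idx]
G_selected = G[idx]
```

Indexing costs nothing beyond a gather, which is why selection is the natural way to implement this form of sketching.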

4.2 Tucker-SRR with Bernoulli entries

In this setting, we introduce the SRR with diagonal binary sketching matrix with Bernoulli entries.

For any $k \in [0 .. N]$, let $\boldsymbol{\lambda}_k \in \mathbb{R}^{R_k}$ be a random vector, the entries of which are i.i.d. Bernoulli($\theta$), then a diagonal Bernoulli sketching matrix is defined as $\mathbf{M}_k = \text{diag}(\boldsymbol{\lambda}_k)$.

When the low-rank structure on the weight tensor of the TRL is imposed using a Tucker decomposition, the randomized Tucker approximation is expressed as:

$$\tilde{\mathcal{W}} = [\![\, \mathcal{G} \times_0 \mathbf{M}_0 \times_1 \cdots \times_N \mathbf{M}_N ;\ \mathbf{U}^{(0)} \mathbf{M}_0^\top, \cdots, \mathbf{U}^{(N)} \mathbf{M}_N^\top \,]\!]$$

The main advantage of considering the above-mentioned sampling matrices is that the products $\mathcal{G} \times_k \mathbf{M}_k$ or $\mathbf{U}^{(k)} \mathbf{M}_k^\top$ are never explicitly computed; we simply select the elements from $\mathcal{G}$ and the corresponding factors.

Interestingly, in analogy to dropout, where each hidden unit is dropped independently with probability $1 - \theta$, in the proposed randomized tensor decomposition, the columns of the factor matrices and the corresponding fibers of the core tensor are dropped independently and consequently the rank of the tensor decomposition is stochastically dropped. Hence the name stochastic rank-regularized TRL of our method.
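A small NumPy sketch of the Bernoulli variant for a third-order Tucker weight tensor follows. It draws one keep-mask per mode and checks that masking with diagonal matrices gives the same weights as selecting the kept columns and core fibers. Sizes, the keep probability and all names are hypothetical, and the sketch omits any rescaling of the kept components.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.7                      # probability of keeping a rank component
dims, ranks = (4, 5, 3), (3, 4, 2)

core = rng.standard_normal(ranks)
factors = [rng.standard_normal((d, r)) for d, r in zip(dims, ranks)]

# One vector of i.i.d. Bernoulli(theta) entries per mode
masks = [(rng.random(r) < theta).astype(float) for r in ranks]

def tucker(core, factors):
    return np.einsum('abc,ia,jb,kc->ijk', core, *factors)

# (1) Explicit sketching with diagonal matrices diag(mask_k)
core_sketch = np.einsum('abc,a,b,c->abc', core, *masks)
factors_sketch = [U * m for U, m in zip(factors, masks)]  # scales column j by mask[j]
W_explicit = tucker(core_sketch, factors_sketch)

# (2) Equivalent selection of the kept components only
keep = [np.flatnonzero(m) for m in masks]
W_select = tucker(core[np.ix_(*keep)], [U[:, k] for U, k in zip(factors, keep)])
```

The selected version works with a smaller core and thinner factors, so the effective Tucker rank changes from one draw to the next.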

4.3 CP-SRR with Bernoulli entries

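An interesting special case of equation 5 is when the weight tensor of the TRL is expressed using a CP decomposition. In that case, we set $\mathbf{M}_0 = \cdots = \mathbf{M}_N = \mathbf{M} = \text{diag}(\boldsymbol{\lambda})$, with, for any $k \in [0 .. R-1]$, $\lambda_k \sim \text{Bernoulli}(\theta)$.

Then a randomized CP approximation is expressed as:

$$\tilde{\mathcal{W}} = [\![ \mathbf{U}^{(0)} \mathbf{M}^\top, \cdots, \mathbf{U}^{(N)} \mathbf{M}^\top ]\!]$$

The above randomized CP decomposition on the weights is equivalent to the following formulation:

$$\tilde{\mathcal{W}} = [\![ \boldsymbol{\lambda}; \mathbf{U}^{(0)}, \cdots, \mathbf{U}^{(N)} ]\!]$$

This is easy to see by looking at the individual elements of the sketched factors. Let $k \in [0 .. N]$ and $\tilde{\mathbf{U}}^{(k)} = \mathbf{U}^{(k)} \mathbf{M}^\top$. Then $\tilde{\mathbf{U}}^{(k)}(i, j) = \sum_{m=0}^{R-1} \mathbf{U}^{(k)}(i, m) \mathbf{M}(j, m)$. Since $\mathbf{M} = \text{diag}(\boldsymbol{\lambda})$, $\mathbf{M}(j, m) = \lambda_j$ if $j = m$, and $0$ otherwise, we get $\tilde{\mathbf{U}}^{(k)}(i, j) = \lambda_j \mathbf{U}^{(k)}(i, j)$. It follows that $\tilde{\mathcal{W}} = \sum_{r=0}^{R-1} \lambda_r^{N+1} \, \mathbf{u}_r^{(0)} \circ \cdots \circ \mathbf{u}_r^{(N)}$. Since $\lambda_r \in \{0, 1\}$, we have $\lambda_r^{N+1} = \lambda_r$.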
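The equivalence argued above can be verified numerically for a third-order example: masking every factor with the same binary vector, using that vector as Kruskal weights, and simply dropping the masked rank-1 components all give the same tensor. The sizes and names in this sketch are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, rank, dims = 0.5, 6, (4, 5, 3)

factors = [rng.standard_normal((d, rank)) for d in dims]
lam = (rng.random(rank) < theta).astype(float)   # i.i.d. Bernoulli(theta) entries

# Sketch every factor with the same diagonal matrix diag(lam)
W_sketched = np.einsum('ir,jr,kr->ijk', *[U * lam for U in factors])

# Use lam as the weights vector of the Kruskal tensor instead
W_weighted = np.einsum('ir,jr,kr,r->ijk', *factors, lam)

# Or keep only the rank-1 components whose Bernoulli draw is 1
kept = np.flatnonzero(lam)
W_dropped = np.einsum('ir,jr,kr->ijk', *[U[:, kept] for U in factors])
```

The first two agree because each entry of `lam` is 0 or 1, so raising it to any positive power leaves it unchanged.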

Based on the previous stochastic regularization, for an activation tensor X and a corresponding label vector , the optimization problem for our tensor regression layer with stochastic regularization is given by:


In addition, the above stochastic optimization problem can be rewritten as a deterministic regularized problem:


This is easy to see by considering the equivalent rewriting of the above optimization problem, using the mode- unfolding of the weight tensor. Equation 10 then becomes:

with The result can then be obtained following Mianjy et al. (2018, Lemma A.1).
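The step from the stochastic objective to a deterministic regularized one can be sketched with a standard bias-variance argument. The derivation below is an illustration under stated assumptions that are not taken verbatim from the paper: the prediction is rescaled by $1/\theta$, the input and label are held fixed, the bias term is omitted, and the $\lambda_r$ are i.i.d. Bernoulli($\theta$). The symbols $\mathbf{z}_r$ and $\hat{\mathbf{y}}$ are introduced here only for this purpose.

```latex
% z_r is the contribution of the r-th rank-1 component to the prediction.
\begin{align*}
\mathbf{z}_r &:= \langle \mathcal{X},\, \mathbf{u}_r^{(0)} \circ \cdots \circ \mathbf{u}_r^{(N)} \rangle_N,
\qquad
\hat{\mathbf{y}} := \frac{1}{\theta} \sum_{r=0}^{R-1} \lambda_r \mathbf{z}_r, \\
\mathbb{E}[\hat{\mathbf{y}}] &= \sum_{r=0}^{R-1} \mathbf{z}_r,
\qquad
\operatorname{Var}\!\left(\frac{\lambda_r}{\theta}\right) = \frac{\theta (1 - \theta)}{\theta^2} = \frac{1 - \theta}{\theta}, \\
\mathbb{E}\, \lVert \mathbf{y} - \hat{\mathbf{y}} \rVert_2^2
&= \lVert \mathbf{y} - \mathbb{E}[\hat{\mathbf{y}}] \rVert_2^2
 + \mathbb{E}\, \lVert \hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}] \rVert_2^2 \\
&= \Bigl\lVert \mathbf{y} - \sum_{r=0}^{R-1} \mathbf{z}_r \Bigr\rVert_2^2
 + \frac{1 - \theta}{\theta} \sum_{r=0}^{R-1} \lVert \mathbf{z}_r \rVert_2^2 .
\end{align*}
```

The cross terms between different components vanish because the $\lambda_r$ are independent, which leaves a data-fitting term plus a penalty on the norm of each rank-1 contribution.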

5 Experimental evaluation

In this section, we introduce the experimental setting, databases used, and implementation details. We experimented on several datasets, across various tasks, namely image classification and MRI-based regression. All methods were implemented using PyTorch (Paszke et al., 2017) and TensorLy (Kossaifi et al., 2016).

Figure 2: Experiment on synthetic data: loss of the TRL as a function of the number of epochs for the stochastic case (orange) and the deterministic version based on the regularized objective function (blue). As expected, both formulations are empirically the same.

5.1 Numerical experiments

In this section, we empirically demonstrate the equivalence between our stochastic rank regularization and the deterministic regularization based formulation of the dropout.

To do so, we first created a random regression weight tensor to be a third order tensor of size , formed as a low-rank Kruskal tensor with components, the factors of which were sampled from an i.i.d. Gaussian distribution. We then generated a tensor of random samples, X of size , the elements of which were sampled from a Normal distribution. Finally, we constructed the corresponding response array y of size as: . Using the same regression weight tensor and same procedure, we also generated testing samples and labels.

We use this data to train a rank- CP SRR-TRL, with both our Bernoulli stochastic formulation (equation 10) and its deterministic counterpart (equation 11). We train for epochs, with a batch-size of , and an initial learning rate of , which we decrease by a factor of every epochs. Figure 2 shows the loss as a function of the epoch number. As expected, the two formulations yield the same loss curves.
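The agreement between the stochastic and deterministic objectives can also be checked exactly on a toy problem by enumerating every Bernoulli mask, rather than by training. The sketch below assumes a $1/\theta$ rescaling of the stochastic prediction and a single fixed sample; the sizes, the seed, and all variable names are hypothetical and unrelated to the settings used in the paper's experiment.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
R, theta = 4, 0.7
I0, I1, O = 3, 2, 2

U0, U1, U2 = (rng.standard_normal((d, R)) for d in (I0, I1, O))
X = rng.standard_normal((I0, I1))   # one activation tensor
y = rng.standard_normal(O)          # its label vector

# Z[:, r] is the contribution of the r-th rank-1 component to the prediction
Z = np.einsum('ij,ir,jr,or->or', X, U0, U1, U2)

# Exact expectation of the squared error over all 2**R Bernoulli masks
stochastic = 0.0
for bits in itertools.product([0, 1], repeat=R):
    lam = np.array(bits, dtype=float)
    prob = np.prod(np.where(lam == 1, theta, 1 - theta))
    stochastic += prob * np.sum((y - Z @ lam / theta) ** 2)

# Deterministic counterpart: squared error plus a penalty on each component
deterministic = np.sum((y - Z.sum(axis=1)) ** 2) + (1 - theta) / theta * np.sum(Z ** 2)
```

Under these assumptions the two quantities coincide up to floating-point error, which is consistent with the overlapping curves reported for the synthetic experiment.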

5.2 Image classification results on CIFAR-100

In the image classification setting, we empirically compare our approach to both standard baseline and traditional tensor regression, and assess the robustness of each method in the face of adversarial noise.

CIFAR-100 consists of 60,000 RGB images in 100 classes (Krizhevsky & Hinton, 2009). We pre-processed the data by centering and scaling each image and then augmented the training images with random cropping and random horizontal flipping.

We compare the stochastic regularization tensor regression layer to full-rank tensor regression, average pooling and a fully-connected layer in an 18-layer residual network (ResNet) (He et al., 2016). For all networks, we used a batch size of and trained for epochs, and minimized the cross-entropy loss using stochastic gradient descent (SGD). The initial learning rate was set to and lowered by a factor of at epochs , and . We used a weight decay ($\ell_2$ penalty) of and a momentum of .

Results: Table 1 presents results obtained on the CIFAR-100 dataset, on which our method matches or outperforms other methods, including the same architectures without SRR. Our regularization method makes the network more robust by reducing over-fitting, thus allowing for superior performance on the testing set.

Architecture              Accuracy
ResNet without pooling    73.31 %
ResNet                    75.88 %
ResNet with TRL           76.02 %
ResNet with Tucker SRR    76.05 %
ResNet with CP SRR        76.19 %
Table 1: Classification accuracy for CIFAR-100

A natural question is whether the model is sensitive to the choice of rank and $\theta$ (or drop rate when sampling with repetition). To assess this, we show the performance as a function of both rank and $\theta$ in Figure 3. As can be observed, there is a large surface for which performance remains the same while decreasing both parameters (note the logarithmic scale for the rank). This means that, in practice, choosing good values for these is not a problem.

(a) Bernoulli SRR
(b) Repeat SRR
Figure 3: CIFAR-100 test accuracy as a function of the compression ratio (logarithmic scale) and the Bernoulli probability $\theta$ (left) or the drop rate (right). There is a large region for which dropping both the rank and $\theta$ does not hurt performance.

Robustness to adversarial attacks: We test for robustness to adversarial examples produced using the Fast Gradient Sign Method (Kurakin et al., 2016) in Foolbox (Rauber et al., 2017). In this method, the sign of the optimization gradient multiplied by the perturbation magnitude is added to the image in a single iteration. The perturbations we used are of magnitudes .
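The single-step attack described above can be illustrated on a toy linear softmax classifier, for which the gradient of the cross-entropy loss with respect to the input has a closed form. This sketch does not use Foolbox or the paper's networks; the model, the function names, and the perturbation magnitude are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(x, label, W, b):
    return -np.log(softmax(W @ x + b)[label])

def fgsm(x, label, W, b, eps):
    """One-step attack: add eps times the sign of the loss gradient w.r.t. the input."""
    p = softmax(W @ x + b)
    p[label] -= 1.0           # gradient of the loss w.r.t. the logits
    grad_x = W.T @ p          # chain rule through the linear layer
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
n_features, n_classes = 20, 5
W = rng.standard_normal((n_classes, n_features))
b = rng.standard_normal(n_classes)
x = rng.standard_normal(n_features)
label = 2
eps = 0.05                    # hypothetical perturbation magnitude

x_adv = fgsm(x, label, W, b, eps)
loss_clean = cross_entropy(x, label, W, b)
loss_adv = cross_entropy(x_adv, label, W, b)
```

For a deep network the input gradient would come from automatic differentiation instead of the closed form, but the update rule is the same.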

In addition to improving performance by reducing over-fitting, our proposed stochastic regularization makes the model more robust to perturbations in the input, for both random noise and adversarial attacks.

We tested the robustness of our models to adversarial attacks, when trained in the same configuration. In Figure 4, we report the classification accuracy on the test set, as a function of the added adversarial noise. The models were trained without any adversarial training, on the training set, and adversarial noise was added to the test samples using the Fast Gradient Sign method. Our model is much more robust to adversarial attacks. Finally, we perform a thorough comparison of the various regularization strategies, the results of which can be seen in Figure 5.

Figure 4: Robustness to adversarial attacks using Fast Gradient Sign attacks of various models, trained on CIFAR-100. Our stochastically rank-regularized architecture is much more robust to adversarial attacks, even though adversarial training was not used.

(b) FGS attack on Tucker TRL with different dropout rates on the tensor regression weights.
(c) FGS attack on Bernoulli Tucker SRR-TRL with different drop rates.
(d) FGS attack on CP SRR-TRL with different drop rates.
Figure 5: Robustness to adversarial attacks (classification accuracy, in %), measured by adding adversarial noise to the test images, using the Fast Gradient Sign method, on CIFAR-100 with Bernoulli drop. We compare a Tucker tensor regression layer with dropout applied to the regression weight tensor (Subfig. 5(b)) to our stochastic rank-regularized TRL, both in the Tucker (Subfig. 5(c)) and CP (Subfig. 5(d)) case.

5.3 Phenotypic trait prediction from MRI data

In the regression setting, we investigate the performance of our SRR-TRL in a challenging, real-life application, on a very large-scale dataset. This case is particularly interesting since the MRI volumes are large 3D tensors, all modes of which carry important information. The spatial information is traditionally discarded during the flattening process, which we avoid by using a tensor regression layer.

The UK Biobank brain MRI dataset is the world’s largest MRI imaging database of its kind (Sudlow et al., 2015). The aim of the UK Biobank Imaging Study is to capture MRI scans of vital organs for primarily healthy individuals by 2022. Associations between these images and lifestyle factors and health outcomes, both of which are already available in the UK Biobank, will enable researchers to improve diagnoses and treatments for numerous diseases. The data we use here consists of T1-weighted MR images of the brain for individuals captured on a 3 T Siemens Skyra system. are used for training and rest are used to test and validate. The target label is the age for each individual at the time of MRI capture. We use skull-stripped images that have been aligned to the MNI152 template (Jenkinson et al., 2002) for head-size normalization. We then center and scale each image to zero mean and unit variance for intensity normalization.

Architecture                 MAE
3D-ResNet without pooling    N/A
3D-ResNet                    3.23 years
3D-ResNet with TRL           2.99 years
3D-ResNet with Tucker SRR    2.96 years
3D-ResNet with CP SRR        2.58 years
Table 2: Mean absolute error (MAE) of age prediction on the UK Biobank MRI dataset. The ResNet with TRL and our stochastic rank-regularization performs better, while the baseline ResNet without average pooling did not train at all. The version with average pooling did train but converged to a much worse performance.

Results: For MRI-based experiments we implement an 18-layer ResNet with three-dimensional convolutions. We minimize the mean squared error using Adam (Kingma & Ba, 2014), starting with an initial learning rate of , reduced by a factor of 10 at epochs 25, 50, and 75. We train for 100 epochs with a mini-batch size of 8 and a weight decay ($\ell_2$ penalty) of . As previously observed, our Stochastic Rank Regularized tensor regression network outperforms the ResNet baseline by a large margin (Table 2). To put this into context, the current state of the art for convolutional neural networks on age prediction from brain MRI on most datasets is an MAE of around 3.6 years (Cole et al., 2017a; Herent et al., 2018).

Robustness to noise: We tested the robustness of our model to white Gaussian noise added to the MRI data. Noise in MRI data typically follows a Rician distribution but can be approximated by a Gaussian for signal-to-noise ratios (SNR) greater than (Gudbjartsson & Patz, 1995). As both the signal (MRI voxel intensities) and noise are zero-mean, we define , where is the variance. We incrementally increase the added noise in the test set and compare the error rate of the models.
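A noise-injection step of this kind could be sketched as follows. The definition of SNR as the ratio of signal variance to noise variance is an assumption made for this illustration, as are the function name, the synthetic volume, and the target SNR value; the paper's exact definition and noise levels are not reproduced here.

```python
import numpy as np

def add_gaussian_noise(volume, snr, rng):
    """Add zero-mean white Gaussian noise with variance var(volume) / snr."""
    noise_std = np.sqrt(volume.var() / snr)
    return volume + rng.normal(0.0, noise_std, size=volume.shape)

rng = np.random.default_rng(0)

# Stand-in for a normalized MRI volume (zero mean, unit variance)
volume = rng.standard_normal((32, 32, 32))
volume = (volume - volume.mean()) / volume.std()

noisy = add_gaussian_noise(volume, snr=4.0, rng=rng)

# Empirical check of the achieved signal-to-noise ratio
empirical_snr = volume.var() / (noisy - volume).var()
```

Sweeping `snr` from large to small values and re-evaluating the trained model on each noisy test set would produce an error-versus-noise curve of the kind described in the text.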

Figure 6: Age prediction error on the MRI test set as a function of increased added noise.

The ResNet with SRR is significantly more robust to added white Gaussian noise compared to the same architectures without SRR (Figure 6). At signal-to-noise ratios below 10, the accuracy of a standard ResNet with average pooling is worse than that of a model that predicts the mean of the training set (MAE = 7.9 years). Brain morphology is an important attribute that has been associated with various biological traits including cognitive function and overall health (Pfefferbaum et al., 1994; Swan et al., 1998). By keeping the structure of the brain represented in MRI in every layer of the architecture, the model has more information to learn a more accurate representation of the entire input. Additionally, the stochastic dropping of ranks forces the representation to be robust to confounds. This is a particularly important property for MRI analysis since intensities and noise artifacts can vary significantly between MRI scanners (Wang et al., 1998). SRR enables both more accurate and more robust trait predictions from MRI that can consequently lead to more accurate disease diagnoses.

6 Conclusion

We introduced stochastic rank-regularized tensor regression networks. Adding rank-randomization during training renders the network more robust and leads to better performance. This also translates to more stable training and to networks that are less prone to over-fitting. The low-rank, robust representation also makes the network more resilient to noise, both adversarial and random. Our results demonstrate superior performance and convergence on a variety of challenging tasks, including MRI data and images.


This research has been conducted using the UK Biobank Resource under Application Number 18545.