1 Introduction
Deep neural networks have evolved into a general-purpose machine learning method with remarkable performance on practical applications (LeCun et al., 2015). Such models are usually overparameterized, involving an enormous number (possibly millions) of parameters. This is much larger than the typical number of available training samples, making deep networks prone to overfitting (Caruana et al., 2001). Coupled with overfitting, the large number of unknown parameters makes deep learning models extremely hard and computationally expensive to train, requiring huge amounts of memory and computation power. Such resources are often available only in massive computer clusters, preventing deep networks from being deployed on resource-limited machines such as mobile and embedded devices.
To prevent deep neural networks from overfitting and improve their generalization ability, several explicit and implicit regularization methods have been proposed. More specifically, explicit regularization strategies, such as weight decay, involve $\ell_2$-norm regularization of the parameters (Nowlan & Hinton, 1992; Krogh & Hertz, 1992). Replacing the $\ell_2$ norm with the $\ell_1$ norm has also been investigated (Scardapane et al., 2017; Zhang et al., 2016). Besides the aforementioned general-purpose regularization functions, neural-network-specific methods such as early stopping of backpropagation (Caruana et al., 2001), batch normalization (Ioffe & Szegedy, 2015), dropout (Srivastava et al., 2014) and its variants, e.g., DropConnect (Wan et al., 2013), are algorithmic approaches to reducing overfitting in overparametrized networks and have been widely adopted in practice.
Reducing the storage and computational costs of deep networks has become critical for meeting the requirements of environments with limited memory or computational resources. To this end, a surge of network compression and approximation algorithms have recently been proposed in the context of deep learning. By leveraging the redundancy in network parameters, methods such as Tai et al. (2015); Cheng et al. (2015); Yu et al. (2017); Kossaifi et al. (2018) employ low-rank approximations of deep networks' weight matrices (or tensors) for parameter reduction. Network compression methods in the frequency domain (Chen et al., 2016)
have also been investigated. An alternative approach for reducing the number of effective parameters in deep nets relies on sketching, whereby, given a matrix or tensor of input data or parameters, one first compresses it to a much smaller matrix (or tensor) by multiplying it by a (usually) random matrix with certain properties (Kasiviswanathan et al., 2017; Daniely et al., 2016).
A particularly appealing approach to network compression, especially for visual data (and other types of multidimensional and multi-aspect data), is tensor regression networks (Kossaifi et al., 2018). Most modern data is inherently multidimensional: color images are naturally represented by third-order tensors, videos by fourth-order tensors, etc. Deep neural networks typically leverage the spatial structure of input data via a series of convolutions, pointwise nonlinearities, pooling, etc. However, this structure is usually wasted by the addition, at the end of the networks' architectures, of a flattening layer followed by one or several fully-connected layers. A recent line of study focuses on alleviating this using tensor methods. Kossaifi et al. (2017) proposed tensor contraction as a layer, to reduce the size of activation tensors, and demonstrated large space savings by replacing fully-connected layers with this layer. However, a flattening layer and fully-connected layers were still ultimately needed for producing the outputs. Recently, tensor regression networks (Kossaifi et al., 2018) propose to replace flattening and fully-connected layers entirely with a tensor regression layer (TRL). This preserves the structure by expressing an output tensor as the result of a tensor contraction between the input tensor and some low-rank regression weight tensors. In addition, these layers allow for large space savings without sacrificing accuracy. Cao et al. (2017) explore the same model with various low-rank structures on the regression weight tensor.
In this paper, we combine ideas from network regularization, low-rank approximation of networks, and randomized sketching in a principled way and introduce a novel stochastic regularization term to tensor regression networks. It consists of a novel randomized low-rank tensor regression, which leads to the stochastic reduction of the rank, either by a fixed percentage during training or according to a series of Bernoulli random variables. This is akin to dropout, which, by randomly dropping units during training, prevents overfitting. However, rather than dropping random elements from the activation tensor, this is done on the regression weight tensor. We explore two schemes: (i) selecting random elements to keep, following a Bernoulli distribution and (ii) keeping a random subset of the fibers of the tensor, with replacement. We theoretically and empirically establish the link between CP TRL with the proposed regularizer and dropout on the deterministic low-rank tensor regression.
To demonstrate the practical advantages of this method, we conducted experiments in image classification and phenotypic trait prediction from MRI. To this end, the CIFAR-100 and the UK Biobank brain MRI datasets were employed. Experimental results demonstrate that the proposed method (i) improves performance in both classification and regression tasks, (ii) decreases overfitting, (iii) leads to more stable training and (iv) largely improves robustness to adversarial attacks and random noise.
One notable application of deep neural networks is in medical imaging, particularly magnetic resonance imaging (MRI). MRI analysis performed using deep learning includes age prediction for brain-age estimation (Cole et al., 2017a). Brain age has been associated with a range of diseases and mortality (Cole et al., 2017b), and could be an early predictor for Alzheimer's disease (Franke et al., 2012). A more accurate and more robust brain age estimation can consequently lead to more accurate disease diagnoses. We demonstrate a large performance improvement (more than 20%) on this task using a 3D-ResNet with our proposed stochastically rank-regularized TRL, compared to a regular 3D-ResNet.
2 Closely related work
Network regularization and dropout. Several methods that improve generalization by mitigating overfitting have been developed in the context of deep learning. The interested reader is referred to the work of Kukačka et al. (2017) and the references therein for a comprehensive survey of over 50 different regularization techniques for deep networks.
The most closely related regularization method to our approach is Dropout (Srivastava et al., 2014), which is probably the most widely adopted technique for training neural networks while preventing overfitting. Concretely, during dropout training each unit (i.e., neuron) is equipped with a binary Bernoulli random variable and only the network's weights whose corresponding Bernoulli variables are sampled with value 1 are updated at each backpropagation step. At each iteration, those Bernoulli variables are resampled and the weights are updated accordingly. The proposed regularization method can be interpreted as dropout on low-rank tensor regression, a fact which is proved in Section 4.3.
Sketching and deep networks approximation. Daniely et al. (2016) apply sketching to the input data in order to sparsify them and reduce their dimensionality. Subsequently, they show that any sparse polynomial function can be computed, on all sparse binary vectors, by a single-layer neural network that takes a compact sketch of the vector as input. In contrast, Kasiviswanathan et al. (2017) approximate neural networks by applying random sketching to weight matrices/tensors instead of input data, and demonstrate that, given a fixed layer input, the output of this layer using sketching matrices is an unbiased estimator of the original output of this layer and has bounded variance. As opposed to the aforementioned sketching methods for deep network approximation, the proposed method applies sketching in the low-rank factorization of weights.
Randomized tensor decompositions. Tensor decompositions exhibit high computational cost and low convergence rate when applied to massive multidimensional data. To accelerate computation, randomized tensor decompositions have been employed to scale tensor decompositions. A randomized least squares algorithm for CP decomposition is proposed by Battaglino et al. (2018), which is significantly faster than traditional CP decomposition. In Erichson et al. (2017), CP is applied on a small tensor generated by tensor random projection of the high-dimensional tensor. The CP decomposition of the large-scale tensor is obtained by back projection of the CP decomposition of the small tensor. Wang et al. (2015) introduce a fast yet provable randomized CP decomposition that performs randomized tensor contraction using FFT. The methods in (Sidiropoulos et al., 2014; Vervliet et al., 2014) are highly computationally efficient algorithms for computing large-scale CP decompositions by applying randomization (random projections) to a set of small tensors, derived by subdividing a tensor into a set of blocks. Fast randomized algorithms that employ sketching for approximating Tucker decomposition have also been investigated (Tsourakakis, 2010; Zhou et al., 2014). More recently, a randomized tensor ring decomposition that employs tensor random projections has been developed in Yuan et al. (2019). The most similar method to ours is that of Battaglino et al. (2018), where elements of the tensor are sampled randomly, and each factor of the decomposition is updated in an iterative manner. By contrast, our method allows for end-to-end training, and applies randomization on the fibers of the tensor, effectively randomizing the rank of the weight tensor.
3 Tensor Regression Networks
In this section, we introduce the notation and notions needed to present our stochastic rank regularization.
Notation:
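We denote vectors (first-order tensors) as $\mathbf{v}$ and matrices (second-order tensors) as $\mathbf{M}$. $\mathbf{I}$ is the identity matrix. We denote tensors of order $N \geq 3$ as $\mathcal{X} \in \mathbb{R}^{I_0 \times I_1 \times \cdots \times I_{N-1}}$, and denote its element $(i_0, i_1, \ldots, i_{N-1})$ as $\mathcal{X}_{i_0 i_1 \cdots i_{N-1}}$. A colon is used to denote all elements of a mode, e.g. the mode-0 fibers of $\mathcal{X}$ are denoted as $\mathcal{X}_{:, i_1, \ldots, i_{N-1}}$. The transpose of $\mathbf{M}$ is denoted $\mathbf{M}^\top$. Finally, for any $i, j \in \mathbb{N}$, $[i\,..\,j]$ denotes the set of integers $\{i, i+1, \ldots, j\}$, and $\lfloor i / j \rfloor$ the integer division of $i$ by $j$.
Tensor unfolding:
Given a tensor, $\mathcal{X} \in \mathbb{R}^{I_0 \times \cdots \times I_{N-1}}$, its mode-$n$ unfolding is a matrix $\mathbf{X}_{[n]} \in \mathbb{R}^{I_n \times M}$, with $M = \prod_{k \neq n} I_k$, and is defined by the mapping from element $(i_0, \ldots, i_{N-1})$ to $(i_n, j)$, with
$$ j = \sum_{k=0,\, k \neq n}^{N-1} i_k \prod_{m=k+1,\, m \neq n}^{N-1} I_m . $$
Tensor vectorization:
Given a tensor, $\mathcal{X} \in \mathbb{R}^{I_0 \times \cdots \times I_{N-1}}$, we can flatten it into a vector $\mathrm{vec}(\mathcal{X})$ of size $I_0 \times \cdots \times I_{N-1}$ defined by the mapping from element $(i_0, \ldots, i_{N-1})$ of $\mathcal{X}$ to element $j$ of $\mathrm{vec}(\mathcal{X})$, with
$$ j = \sum_{k=0}^{N-1} i_k \prod_{m=k+1}^{N-1} I_m . $$
Mode-n product:
For a tensor $\mathcal{X} \in \mathbb{R}^{I_0 \times \cdots \times I_{N-1}}$ and a matrix $\mathbf{M} \in \mathbb{R}^{R \times I_n}$, the $n$-mode product of a tensor is a tensor of size $I_0 \times \cdots \times I_{n-1} \times R \times I_{n+1} \times \cdots \times I_{N-1}$ and can be expressed using unfolding of $\mathcal{X}$ and the classical dot product as:
$$ \mathcal{P} = \mathcal{X} \times_n \mathbf{M} \iff \mathbf{P}_{[n]} = \mathbf{M} \mathbf{X}_{[n]} . $$
Generalized inner product:
For two tensors $\mathcal{X} \in \mathbb{R}^{K \times I_0 \times \cdots \times I_{N-1}}$ and $\mathcal{W} \in \mathbb{R}^{I_0 \times \cdots \times I_{N-1} \times L}$, we denote by $\langle \mathcal{X}, \mathcal{W} \rangle_N \in \mathbb{R}^{K \times L}$ the contraction of $\mathcal{X}$ by $\mathcal{W}$ along their $N$ last (respectively first) modes.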
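The operations defined above can be illustrated with a minimal NumPy sketch. The helper names (`unfold`, `fold`, `mode_dot`, `inner`) and the toy sizes are chosen here for illustration only; the unfolding assumes a row-major ordering of the remaining modes, which is one common convention and may differ from other libraries.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding: move `mode` to the front, flatten the rest (row-major)."""
    return np.reshape(np.moveaxis(tensor, mode, 0), (tensor.shape[mode], -1))

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor whose full shape is `shape`."""
    full = [shape[mode]] + [s for k, s in enumerate(shape) if k != mode]
    return np.moveaxis(np.reshape(matrix, full), 0, mode)

def mode_dot(tensor, matrix, mode):
    """n-mode product, computed as matrix @ unfold(tensor, mode) and folded back."""
    new_shape = list(tensor.shape)
    new_shape[mode] = matrix.shape[0]
    return fold(matrix @ unfold(tensor, mode), mode, new_shape)

def inner(x, w, n_modes):
    """Contract the last `n_modes` modes of x with the first `n_modes` modes of w."""
    return np.tensordot(x, w, axes=n_modes)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
M = rng.standard_normal((2, 4))
P = mode_dot(X, M, mode=1)        # mode 1 goes from size 4 to size 2
W = rng.standard_normal((4, 5, 6))
Y = inner(X, W, n_modes=2)        # contracts modes of size 4 and 5
vec_X = X.reshape(-1)             # vectorization with the same row-major ordering
```

With these toy sizes, `P` has shape (3, 2, 5) and `Y` has shape (3, 6).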
Kruskal tensor:
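Given a tensor $\mathcal{X} \in \mathbb{R}^{I_0 \times \cdots \times I_{N-1}}$, the Canonical-Polyadic decomposition (CP), also called PARAFAC, decomposes it into a sum of $R$ rank-1 tensors. The number of terms in the sum, $R$, is known as the rank of the decomposition. Formally, we find the vectors $\mathbf{u}_r^{(0)}, \mathbf{u}_r^{(1)}, \ldots, \mathbf{u}_r^{(N-1)}$, for $r \in [0\,..\,R-1]$, such that:
$$ \mathcal{X} = \sum_{r=0}^{R-1} \mathbf{u}_r^{(0)} \circ \mathbf{u}_r^{(1)} \circ \cdots \circ \mathbf{u}_r^{(N-1)} \qquad (1) $$
These vectors can be collected in matrices, called factors of the decomposition. Specifically, we define, for each factor $k \in [0\,..\,N-1]$, $\mathbf{U}^{(k)} = [\mathbf{u}_0^{(k)}, \mathbf{u}_1^{(k)}, \ldots, \mathbf{u}_{R-1}^{(k)}]$. The magnitude of the factors can optionally be absorbed in a vector of weights $\boldsymbol{\lambda} \in \mathbb{R}^{R}$, such that $\mathcal{X} = \sum_{r=0}^{R-1} \lambda_r \, \mathbf{u}_r^{(0)} \circ \mathbf{u}_r^{(1)} \circ \cdots \circ \mathbf{u}_r^{(N-1)}$.
The decomposition can be denoted more compactly as $\mathcal{X} = [\![ \mathbf{U}^{(0)}, \ldots, \mathbf{U}^{(N-1)} ]\!]$, or $\mathcal{X} = [\![ \boldsymbol{\lambda}; \mathbf{U}^{(0)}, \ldots, \mathbf{U}^{(N-1)} ]\!]$ if a weights vector is used.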
Tucker tensor:
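Given a tensor $\mathcal{X} \in \mathbb{R}^{I_0 \times \cdots \times I_{N-1}}$, we can decompose it into a low-rank core $\mathcal{G} \in \mathbb{R}^{R_0 \times \cdots \times R_{N-1}}$ by projecting along each of its modes with projection factors $\mathbf{U}^{(0)}, \ldots, \mathbf{U}^{(N-1)}$, with $\mathbf{U}^{(k)} \in \mathbb{R}^{I_k \times R_k}$, $k \in [0\,..\,N-1]$.
This allows us to write the tensor in a decomposed form as:
$$ \mathcal{X} = \mathcal{G} \times_0 \mathbf{U}^{(0)} \times_1 \mathbf{U}^{(1)} \cdots \times_{N-1} \mathbf{U}^{(N-1)} = [\![ \mathcal{G}; \mathbf{U}^{(0)}, \ldots, \mathbf{U}^{(N-1)} ]\!] \qquad (2) $$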
Note that the Kruskal form of a tensor can be seen as a Tucker tensor with a superdiagonal core.
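The remark above can be checked numerically. The following is a small NumPy sketch for third-order tensors only; the function names `kruskal_to_tensor` and `tucker_to_tensor` and the toy sizes are illustrative choices, not part of the method itself.

```python
import numpy as np

def kruskal_to_tensor(weights, factors):
    """Sum of weighted rank-1 terms for a third-order Kruskal (CP) tensor."""
    return np.einsum('r,ir,jr,kr->ijk', weights, *factors)

def tucker_to_tensor(core, factors):
    """Core multiplied along each mode by its factor, for a third-order tensor."""
    return np.einsum('abc,ia,jb,kc->ijk', core, *factors)

rng = np.random.default_rng(0)
rank = 3
factors = [rng.standard_normal((dim, rank)) for dim in (4, 5, 6)]
weights = rng.standard_normal(rank)

# Super-diagonal core: the CP weights sit on the diagonal, zeros elsewhere.
core = np.zeros((rank, rank, rank))
diag = np.arange(rank)
core[diag, diag, diag] = weights

cp_tensor = kruskal_to_tensor(weights, factors)
tucker_tensor = tucker_to_tensor(core, factors)
same = np.allclose(cp_tensor, tucker_tensor)
```

Both reconstructions agree up to floating-point error, which is the sense in which a Kruskal tensor is a Tucker tensor with a super-diagonal core.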
Tensor diagrams:
To represent tensor operations easily, we adopt tensor diagrams, where tensors are represented by vertices (circles) and edges represent their modes. The degree of a vertex then represents its order. Connecting two edges symbolizes a tensor contraction between the two represented modes. Figure 1 presents a tensor diagram of the tensor regression layer and its stochastic rank-regularized counterpart.
Tensor regression layers (TRL):
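Let us denote by $\mathcal{X} \in \mathbb{R}^{I_0 \times \cdots \times I_{N-1}}$ the input activation tensor for a sample and $\mathbf{y} \in \mathbb{R}^{O}$ the label vector. We are interested in the problem of estimating the regression weight tensor $\mathcal{W} \in \mathbb{R}^{I_0 \times \cdots \times I_{N-1} \times O}$ under some fixed low rank $(R_0, \ldots, R_N)$:
$$ \mathbf{y} = \langle \mathcal{X}, \mathcal{W} \rangle_N + \mathbf{b} \quad \text{with} \quad \mathcal{W} = \mathcal{G} \times_0 \mathbf{U}^{(0)} \times_1 \mathbf{U}^{(1)} \cdots \times_N \mathbf{U}^{(N)} \qquad (3) $$
with $\mathcal{G} \in \mathbb{R}^{R_0 \times \cdots \times R_N}$, $\mathbf{U}^{(k)} \in \mathbb{R}^{I_k \times R_k}$ for each $k$ in $[0\,..\,N-1]$ and $\mathbf{U}^{(N)} \in \mathbb{R}^{O \times R_N}$.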
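To make the layer concrete, here is a minimal NumPy sketch of the forward pass of a Tucker-structured TRL for a batch of second-order activations. The function name `trl_forward`, the batch dimension, and all sizes are assumptions made for illustration; a trainable implementation would use an automatic differentiation framework.

```python
import numpy as np

def trl_forward(x, core, factors, bias):
    """Forward pass of a Tucker-structured tensor regression layer.

    x: (batch, I0, I1); core: (R0, R1, R2);
    factors: [U0 (I0, R0), U1 (I1, R1), U2 (O, R2)]; bias: (O,).
    """
    u0, u1, u2 = factors
    # Project the activations onto the input factors first, so the full
    # (I0, I1, O) weight tensor is never materialised.
    z = np.einsum('nij,ia,jb->nab', x, u0, u1)
    return np.einsum('nab,abc,oc->no', z, core, u2) + bias

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 7))
core = rng.standard_normal((3, 4, 2))
factors = [rng.standard_normal(s) for s in ((6, 3), (7, 4), (10, 2))]
bias = rng.standard_normal(10)
y = trl_forward(x, core, factors, bias)

# Reference: build the full weight tensor and contract it with the input.
weight = np.einsum('abc,ia,jb,oc->ijo', core, *factors)
y_ref = np.tensordot(x, weight, axes=2) + bias
```

The two computations give the same output; the factored form only changes the order of the contractions.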
4 Stochastic rank regularization
In this section, we introduce the stochastic rank regularization (SRR). Specifically, we propose a new stochastic rank-regularization, applied to low-rank tensors in decomposed forms. This formulation is general and can be applied to any type of decomposition. We introduce it here, without loss of generality, for the case of Tucker and CP decompositions.
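For any $k \in [0\,..\,N]$, let $\mathbf{M}_k$ be a sketch matrix (e.g. a random projection or column selection matrix), let $\tilde{\mathbf{U}}^{(k)} = \mathbf{U}^{(k)} \mathbf{M}_k^\top$ be a sketch of the factor matrix $\mathbf{U}^{(k)}$, and let $\tilde{\mathcal{G}} = \mathcal{G} \times_0 \mathbf{M}_0 \times_1 \cdots \times_N \mathbf{M}_N$ be a sketch of the core tensor $\mathcal{G}$.
Given an activation tensor $\mathcal{X}$ and a target label vector $\mathbf{y}$, a stochastically rank-regularized tensor regression layer is written from equation 3 as follows:
$$ \mathbf{y} = \langle \mathcal{X}, \tilde{\mathcal{W}} \rangle_N + \mathbf{b} \qquad (4) $$
with $\tilde{\mathcal{W}}$ being a stochastic approximation of the Tucker decomposition, namely:
$$ \tilde{\mathcal{W}} = \tilde{\mathcal{G}} \times_0 \tilde{\mathbf{U}}^{(0)} \times_1 \tilde{\mathbf{U}}^{(1)} \cdots \times_N \tilde{\mathbf{U}}^{(N)} \qquad (5) $$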
Even though several sketching methods have been proposed, we focus here on SRR with two different types of binary sketching matrices, namely binary matrix sketching with replacement and binary diagonal matrix sketching with Bernoulli entries.
4.1 SRR with replacement:
In this setting, we introduce the SRR with binary sketching matrix (with replacement). We first choose .
Mathematically, we introduce the uniform sampling matrices . is a uniform sampling matrix, selecting elements, where . In other words, for any , verifies:
(6) 
Note that in practice this product is never explicitly computed; we simply select the correct elements from the core tensor and its corresponding factors.
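The selection described above can be sketched in NumPy as follows. The function name `srr_with_replacement`, the parameter name `theta`, and in particular the rule used for the number of kept components (`int(rank * theta)`, at least one) are assumptions for illustration; no rescaling of the kept components is applied here.

```python
import numpy as np

def srr_with_replacement(core, factors, theta, rng):
    """Keep a random subset of rank components per mode, sampled with replacement."""
    indices = []
    new_core, new_factors = core, []
    for mode, factor in enumerate(factors):
        rank = factor.shape[1]
        n_keep = max(1, int(rank * theta))         # assumed rule for the subset size
        idx = rng.integers(0, rank, size=n_keep)   # uniform, with replacement
        new_core = np.take(new_core, idx, axis=mode)  # matching fibers of the core
        new_factors.append(factor[:, idx])            # matching columns of the factor
        indices.append(idx)
    return new_core, new_factors, indices

rng = np.random.default_rng(0)
core = rng.standard_normal((4, 5, 3))
factors = [rng.standard_normal(s) for s in ((6, 4), (7, 5), (10, 3))]
s_core, s_factors, indices = srr_with_replacement(core, factors, theta=0.5, rng=rng)

# The sketched weight tensor keeps its full size; only its rank is reduced.
weight = np.einsum('abc,ia,jb,oc->ijo', s_core, *s_factors)
```

Indexing with `idx` gives the same result as multiplying by an explicit sampling matrix whose rows are one-hot vectors, without ever forming that matrix.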
4.2 Tucker-SRR with Bernoulli entries
In this setting, we introduce the SRR with diagonal binary sketching matrix with Bernoulli entries.
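For any $k \in [0\,..\,N]$, let $\boldsymbol{\lambda}_k$ be a random vector, the entries of which are i.i.d. Bernoulli($\theta$); a diagonal Bernoulli sketching matrix is then defined as $\mathbf{M}_k = \mathrm{diag}(\boldsymbol{\lambda}_k)$.
When the low-rank structure on the weight tensor of the TRL is imposed using a Tucker decomposition with core tensor $\mathcal{G}$ and factor matrices $\mathbf{U}^{(k)}$, the randomized Tucker approximation is expressed as:
$$ \tilde{\mathcal{W}} = \left( \mathcal{G} \times_0 \mathbf{M}_0 \times_1 \cdots \times_N \mathbf{M}_N \right) \times_0 \left( \mathbf{U}^{(0)} \mathbf{M}_0^\top \right) \times_1 \cdots \times_N \left( \mathbf{U}^{(N)} \mathbf{M}_N^\top \right) \qquad (7) $$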
The main advantage of considering the above-mentioned sampling matrices is that the products with the core tensor or with the factor matrices are never explicitly computed; we simply select the elements from the core tensor and the corresponding factors.
Interestingly, in analogy to dropout, where each hidden unit is dropped independently with probability , in the proposed randomized tensor decomposition, the columns of the factor matrices and the corresponding fibers of the core tensor are dropped independently and consequently the rank of the tensor decomposition is stochastically dropped. Hence the name of our method, stochastic rank-regularized TRL.
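A minimal NumPy sketch of this Bernoulli variant is given below. The function name `srr_bernoulli` and the parameter `theta` (taken here as the probability of keeping a component) are assumptions for illustration, and no rescaling is applied.

```python
import numpy as np

def srr_bernoulli(core, factors, theta, rng):
    """Drop rank components independently; each is kept with probability theta."""
    masks = [rng.random(f.shape[1]) < theta for f in factors]
    new_core = core
    for mode, mask in enumerate(masks):
        new_core = np.compress(mask, new_core, axis=mode)  # drop fibers of the core
    new_factors = [f[:, mask] for f, mask in zip(factors, masks)]  # drop columns
    return new_core, new_factors, masks

rng = np.random.default_rng(0)
core = rng.standard_normal((4, 5, 3))
factors = [rng.standard_normal(s) for s in ((6, 4), (7, 5), (10, 3))]
s_core, s_factors, masks = srr_bernoulli(core, factors, theta=0.8, rng=rng)
w_selected = np.einsum('abc,ia,jb,oc->ijo', s_core, *s_factors)

# Same tensor, written with explicit diagonal sketching matrices diag(lambda_k).
diags = [np.diag(mask.astype(float)) for mask in masks]
w_explicit = np.einsum('abc,ia,jb,oc->ijo', core,
                       *[f @ d for f, d in zip(factors, diags)])
```

Because a binary diagonal matrix is idempotent, masking the factor columns alone already yields the same weight tensor as masking both the core and the factors, which is why `w_explicit` matches `w_selected`.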
4.3 CP-SRR with Bernoulli entries
An interesting special case of equation 5 is when the weight tensor of the TRL is expressed using a CP decomposition. In that case, we set , with, for any ,
Then a randomized CP approximation is expressed as:
(8) 
The above randomized CP decomposition on the weights is equivalent to the following formulation:
(9) 
This is easy to see by looking at the individual elements of the sketched factors. Let and . Then Since , if , and otherwise, we get It follows that Since we have
Based on the previous stochastic regularization, for an activation tensor X and a corresponding label vector , the optimization problem for our tensor regression layer with stochastic regularization is given by:
(10) 
In addition, the above stochastic optimization problem can be rewritten as a deterministic regularized problem:
(11) 
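The link between a stochastic objective of this kind and a deterministic regularized one can be checked numerically on a toy problem. The sketch below is not a reproduction of equations 10 and 11: it assumes a scalar response, a squared loss, Bernoulli masks with keep probability `theta` on the rank-1 components, and an inverted-dropout rescaling by `1 / theta`. Under those assumptions, writing $a_{ir}$ for the contribution of component $r$ to the prediction for sample $i$, the expected loss equals $\sum_i (y_i - \sum_r a_{ir})^2 + \frac{1-\theta}{\theta} \sum_{i,r} a_{ir}^2$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, rank, theta = 8, 4, 0.7             # hypothetical sizes and keep probability
X = rng.standard_normal((n, 3, 5))
y = rng.standard_normal(n)
U0 = rng.standard_normal((3, rank))
U1 = rng.standard_normal((5, rank))
u_out = rng.standard_normal(rank)      # output factor for a scalar response

# a[i, r]: contribution of rank-1 component r to the prediction for sample i.
a = np.einsum('nij,ir,jr,r->nr', X, U0, U1, u_out)

def expected_stochastic_loss():
    """Exact expectation over all 2**rank Bernoulli masks (no Monte Carlo error)."""
    total = 0.0
    for bits in itertools.product([0.0, 1.0], repeat=rank):
        lam = np.array(bits)
        prob = np.prod(np.where(lam == 1.0, theta, 1.0 - theta))
        pred = (a * lam).sum(axis=1) / theta   # inverted-dropout rescaling (assumption)
        total += prob * np.sum((y - pred) ** 2)
    return total

stochastic = expected_stochastic_loss()
deterministic = (np.sum((y - a.sum(axis=1)) ** 2)
                 + (1.0 - theta) / theta * np.sum(a ** 2))
```

The two values coincide up to floating-point error: the data-fit term is the loss of the undropped model, and the extra term penalizes the squared contribution of each rank-1 component.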
5 Experimental evaluation
In this section, we introduce the experimental setting, databases used, and implementation details. We experimented on several datasets, across various tasks, namely image classification and MRI-based regression. All methods were implemented using PyTorch (Paszke et al., 2017) and TensorLy (Kossaifi et al., 2016).
Figure: Loss of the TRL as a function of the number of epochs for the stochastic case (orange) and the deterministic version based on the regularized objective function (blue). As expected, both formulations are empirically the same.
5.1 Numerical experiments
In this section, we empirically demonstrate the equivalence between our stochastic rank regularization and the deterministic, regularization-based formulation of dropout.
To do so, we first created a random regression weight tensor to be a third-order tensor of size , formed as a low-rank Kruskal tensor with components, the factors of which were sampled from an i.i.d. Gaussian distribution. We then generated a tensor of random samples, X of size, the elements of which were sampled from a Normal distribution. Finally, we constructed the corresponding response array y of size as: . Using the same regression weight tensor and same procedure, we also generated testing samples and labels.
We use this data to train a rank CP SRR-TRL, with both our Bernoulli stochastic formulation (equation 10) and its deterministic counterpart (equation 11). We train for epochs, with a batch size of , and an initial learning rate of , which we decrease by a factor of every epochs. Figure 1(a) shows the loss function as a function of the epoch number. As expected, both formulations are identical.
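The data-generation procedure described above can be sketched as follows. All sizes, the rank, and the sample counts are hypothetical placeholders, and the response is assumed to be the noiseless contraction of each sample with the weight tensor; none of these values are taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
shape, rank = (5, 6, 7), 3          # hypothetical tensor size and CP rank
n_train, n_test = 100, 20           # hypothetical sample counts

# Low-rank Kruskal weight tensor with i.i.d. Gaussian factors.
factors = [rng.standard_normal((dim, rank)) for dim in shape]
W = np.einsum('ir,jr,kr->ijk', *factors)

def make_split(n_samples):
    """Draw Gaussian inputs and compute responses by contracting with W."""
    X = rng.standard_normal((n_samples, *shape))
    y = np.tensordot(X, W, axes=3)  # one scalar response per sample (assumption)
    return X, y

X_train, y_train = make_split(n_train)
X_test, y_test = make_split(n_test)
```

Training and testing splits share the same weight tensor, so a model that recovers `W` should generalize to the held-out samples.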
5.2 Image classification results on CIFAR-100
In the image classification setting, we empirically compare our approach to both a standard baseline and traditional tensor regression, and assess the robustness of each method in the face of adversarial noise.
CIFAR-100 consists of 60,000 RGB images in 100 classes (Krizhevsky & Hinton, 2009). We preprocessed the data by centering and scaling each image and then augmented the training images with random cropping and random horizontal flipping.
We compare the stochastic regularization tensor regression layer to full-rank tensor regression, average pooling and a fully-connected layer in an 18-layer residual network (ResNet) (He et al., 2016). For all networks, we used a batch size of and trained for epochs, and minimized the cross-entropy loss using stochastic gradient descent (SGD). The initial learning rate was set to and lowered by a factor of at epochs , and . We used a weight decay ($\ell_2$ penalty) of and a momentum of .
Results: Table 1 presents results obtained on the CIFAR-100 dataset, on which our method matches or outperforms other methods, including the same architectures without SRR. Our regularization method makes the network more robust by reducing overfitting, thus allowing for superior performance on the testing set.
Table 1: Classification accuracy on CIFAR-100.
Architecture              Accuracy
------------------------  --------
ResNet without pooling    73.31 %
ResNet                    75.88 %
ResNet with TRL           76.02 %
ResNet with Tucker SRR    76.05 %
ResNet with CP SRR        76.19 %
A natural question is whether the model is sensitive to the choice of rank and (or drop rate when sampling with repetition). To assess this, we show the performance as a function of both rank and in figure 3. As can be observed, there is a large surface for which performance remains the same while decreasing both parameters (note the logarithmic scale for the rank). This means that, in practice, choosing good values for these is not a problem.
Robustness to adversarial attacks: We test for robustness to adversarial examples produced using the Fast Gradient Sign Method (Kurakin et al., 2016) in Foolbox (Rauber et al., 2017). In this method, the sign of the optimization gradient multiplied by the perturbation magnitude is added to the image in a single iteration. The perturbations we used are of magnitudes .
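For illustration, the single-step attack described above can be written in a few lines of NumPy for a linear softmax classifier. This is a sketch under assumptions, not the Foolbox implementation used in the experiments: the model, the names `fgsm` and `cross_entropy`, and the value of `eps` are all hypothetical, and no clipping to a valid image range is performed.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(x, y_onehot, W, b):
    """Per-sample cross-entropy of a linear softmax classifier."""
    probs = softmax(x @ W + b)
    return -np.sum(y_onehot * np.log(probs), axis=1)

def fgsm(x, y_onehot, W, b, eps):
    """One step of size eps along the sign of the loss gradient w.r.t. the input."""
    probs = softmax(x @ W + b)
    grad_x = (probs - y_onehot) @ W.T      # gradient of the cross-entropy w.r.t. x
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 12))
W = rng.standard_normal((12, 3))
b = np.zeros(3)
y = np.eye(3)[rng.integers(0, 3, size=5)]
x_adv = fgsm(x, y, W, b, eps=0.1)
loss_clean = cross_entropy(x, y, W, b)
loss_adv = cross_entropy(x_adv, y, W, b)
```

Every input coordinate moves by exactly `eps` in absolute value (or stays put where the gradient is zero), and for this convex toy model the loss cannot decrease under the perturbation.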
In addition to improving performance by reducing overfitting, our proposed stochastic regularization makes the model more robust to perturbations in the input, for both random noise and adversarial attacks.
We tested the robustness of our models to adversarial attacks, when trained in the same configuration. In figure 3(a), we report the classification accuracy on the test set, as a function of the added adversarial noise. The models were trained without any adversarial training, on the training set, and adversarial noise was added to the test samples using the Fast Gradient Sign method. Our model is much more robust to adversarial attacks. Finally, we perform a thorough comparison of the various regularization strategies, the results of which can be seen in figure 4(a).
5.3 Phenotypic trait prediction from MRI data
In the regression setting, we investigate the performance of our SRR-TRL in a challenging, real-life application, on a very large-scale dataset. This case is particularly interesting since the MRI volumes are large 3D tensors, all modes of which carry important information. The spatial information is traditionally discarded during the flattening process, which we avoid by using a tensor regression layer.
The UK Biobank brain MRI dataset is the world's largest MRI imaging database of its kind (Sudlow et al., 2015). The aim of the UK Biobank Imaging Study is to capture MRI scans of vital organs for primarily healthy individuals by 2022. Associations between these images and lifestyle factors and health outcomes, both of which are already available in the UK Biobank, will enable researchers to improve diagnoses and treatments for numerous diseases. The data we use here consists of T1-weighted MR images of the brain for individuals captured on a 3 T Siemens Skyra system. are used for training and the rest are used to test and validate. The target label is the age of each individual at the time of MRI capture. We use skull-stripped images that have been aligned to the MNI152 template (Jenkinson et al., 2002) for head-size normalization. We then center and scale each image to zero mean and unit variance for intensity normalization.
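Table 2: Mean absolute error (MAE) of age prediction on the UK Biobank brain MRI dataset.
Architecture                 MAE
---------------------------  ----------
3D-ResNet without pooling    N/A
3D-ResNet                    3.23 years
3D-ResNet with TRL           2.99 years
3D-ResNet with Tucker SRR    2.96 years
3D-ResNet with CP SRR        2.58 years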
Results: For MRI-based experiments we implement an 18-layer ResNet with three-dimensional convolutions. We minimize the mean squared error using Adam (Kingma & Ba, 2014), starting with an initial learning rate of , reduced by a factor of 10 at epochs 25, 50, and 75. We train for 100 epochs with a mini-batch size of 8 and a weight decay ($\ell_2$ penalty) of . As previously observed, our stochastic rank-regularized tensor regression network outperforms the ResNet baseline by a large margin (Table 2). To put this into context, the current state of the art for convolutional neural networks on age prediction from brain MRI on most datasets is an MAE of around 3.6 years (Cole et al., 2017a; Herent et al., 2018).
Robustness to noise: We tested the robustness of our model to white Gaussian noise added to the MRI data. Noise in MRI data typically follows a Rician distribution but can be approximated by a Gaussian for signal-to-noise ratios (SNR) greater than (Gudbjartsson & Patz, 1995). As both the signal (MRI voxel intensities) and noise are zero-mean, we define , where is the variance. We incrementally increase the added noise in the test set and compare the error rate of the models.
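A minimal NumPy sketch of this noise-injection procedure is given below. It assumes the SNR is the ratio of the signal variance to the noise variance, which is one common definition for zero-mean quantities and may differ from the exact definition used in the experiments; the function name `add_gaussian_noise` and the synthetic volume are illustrative.

```python
import numpy as np

def add_gaussian_noise(volume, snr, rng):
    """Add zero-mean white Gaussian noise so that var(signal) / var(noise) ~= snr."""
    noise_std = np.sqrt(volume.var() / snr)
    return volume + rng.normal(0.0, noise_std, size=volume.shape)

rng = np.random.default_rng(0)
# Stand-in for a normalised MRI volume (zero mean, unit variance).
volume = rng.standard_normal((32, 32, 32))
volume = (volume - volume.mean()) / volume.std()

noisy = add_gaussian_noise(volume, snr=10.0, rng=rng)
empirical_snr = volume.var() / (noisy - volume).var()
```

Sweeping `snr` from large to small values and re-evaluating the model on the noisy test volumes gives a robustness curve of the kind discussed here.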
The ResNet with SRR is significantly more robust to added white Gaussian noise compared to the same architectures without SRR (figure 6). At signal-to-noise ratios below 10, the accuracy of a standard ResNet with average pooling is worse than that of a model that predicts the mean of the training set (MAE = 7.9 years). Brain morphology is an important attribute that has been associated with various biological traits including cognitive function and overall health (Pfefferbaum et al., 1994; Swan et al., 1998). By keeping the structure of the brain represented in MRI in every layer of the architecture, the model has more information to learn a more accurate representation of the entire input. Additionally, the stochastic dropping of ranks forces the representation to be robust to confounds. This is a particularly important property for MRI analysis since intensities and noise artifacts can vary significantly between MRI scanners (Wang et al., 1998). SRR enables both more accurate and more robust trait predictions from MRI that can consequently lead to more accurate disease diagnoses.
6 Conclusion
We introduced stochastic rank-regularized tensor regression networks. Adding rank randomization during training renders the network more robust and leads to better performance. It also translates to more stable training and to networks that are less prone to overfitting. The low-rank, robust representation also makes the network more resilient to noise, both adversarial and random. Our results demonstrate superior performance and convergence on a variety of challenging tasks, including MRI data and images.
Acknowledgements
This research has been conducted using the UK Biobank Resource under Application Number 18545.
References
 Battaglino et al. (2018) Battaglino, C., Ballard, G., and Kolda, T. G. A practical randomized cp tensor decomposition. SIAM Journal on Matrix Analysis and Applications, 39(2):876–901, 2018.
 Cao et al. (2017) Cao, X., Rabusseau, G., and Pineau, J. Tensor regression networks with various lowrank tensor approximations. CoRR, abs/1712.09520, 2017.

 Caruana et al. (2001) Caruana, R., Lawrence, S., and Giles, C. L. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in neural information processing systems, pp. 402–408, 2001.
 Chen et al. (2016) Chen, W., Wilson, J., Tyree, S., Weinberger, K. Q., and Chen, Y. Compressing convolutional neural networks in the frequency domain. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1475–1484. ACM, 2016.

 Cheng et al. (2015) Cheng, Y., Yu, F. X., Feris, R. S., Kumar, S., Choudhary, A., and Chang, S.F. An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2857–2865, 2015.
 Cole et al. (2017a) Cole, J. H., Poudel, R. P., Tsagkrasoulis, D., Caan, M. W., Steves, C., Spector, T. D., and Montana, G. Predicting brain age with deep learning from raw imaging data results in a reliable and heritable biomarker. NeuroImage, 163:115–124, 2017a.
 Cole et al. (2017b) Cole, J. H., Ritchie, S. J., Bastin, M. E., Hernández, M. V., Maniega, S. M., Royle, N., Corley, J., Pattie, A., Harris, S. E., Zhang, Q., et al. Brain age predicts mortality. Molecular psychiatry, 2017b.
 Daniely et al. (2016) Daniely, A., Lazic, N., Singer, Y., and Talwar, K. Sketching and neural networks. arXiv preprint arXiv:1604.05753, 2016.
 Erichson et al. (2017) Erichson, N. B., Manohar, K., Brunton, S. L., and Kutz, J. N. Randomized cp tensor decomposition. arXiv preprint arXiv:1703.09074, 2017.
 Franke et al. (2012) Franke, K., Luders, E., May, A., Wilke, M., and Gaser, C. Brain maturation: predicting individual brainage in children and adolescents using structural mri. Neuroimage, 63(3):1305–1312, 2012.
 Gudbjartsson & Patz (1995) Gudbjartsson, H. and Patz, S. The rician distribution of noisy mri data. Magnetic resonance in medicine, 34(6):910–914, 1995.

 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Herent et al. (2018) Herent, P., Jegou, S., Wainrib, G., and Clozel, T. Brain age prediction of healthy subjects on anatomic mri with deep learning: going beyond with an "explainable ai" mindset. bioRxiv, pp. 413302, 2018.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 Jenkinson et al. (2002) Jenkinson, M., Bannister, P., Brady, M., and Smith, S. Improved optimization for the robust and accurate linear registration and motion correction of brain images. Neuroimage, 17(2):825–841, 2002.
 Kasiviswanathan et al. (2017) Kasiviswanathan, S. P., Narodytska, N., and Jin, H. Deep neural network approximation using tensor sketching. arXiv preprint arXiv:1710.07850, 2017.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kossaifi et al. (2016) Kossaifi, J., Panagakis, Y., and Pantic, M. Tensorly: Tensor learning in python. arXiv preprint arXiv:1610.09555, 2016.
 Kossaifi et al. (2017) Kossaifi, J., Khanna, A., Lipton, Z., Furlanello, T., and Anandkumar, A. Tensor contraction layers for parsimonious deep nets. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pp. 1940–1946. IEEE, 2017.
 Kossaifi et al. (2018) Kossaifi, J., Lipton, Z. C., Khanna, A., Furlanello, T., and Anandkumar, A. Tensor regression networks. CoRR, abs/1707.08308, 2018.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Krogh & Hertz (1992) Krogh, A. and Hertz, J. A. A simple weight decay can improve generalization. In Advances in neural information processing systems, pp. 950–957, 1992.
 Kukačka et al. (2017) Kukačka, J., Golkov, V., and Cremers, D. Regularization for deep learning: A taxonomy. arXiv preprint arXiv:1710.10686, 2017.
 Kurakin et al. (2016) Kurakin, A., Goodfellow, I., and Bengio, S. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436, 2015.
 Mianjy et al. (2018) Mianjy, P., Arora, R., and Vidal, R. On the implicit bias of dropout. In Dy, J. and Krause, A. (eds.), International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp. 3540–3548, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
 Nowlan & Hinton (1992) Nowlan, S. J. and Hinton, G. E. Simplifying neural networks by soft weightsharing. Neural computation, 4(4):473–493, 1992.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
 Pfefferbaum et al. (1994) Pfefferbaum, A., Mathalon, D. H., Sullivan, E. V., Rawles, J. M., Zipursky, R. B., and Lim, K. O. A quantitative magnetic resonance imaging study of changes in brain morphology from infancy to late adulthood. Archives of neurology, 51(9):874–887, 1994.
 Rauber et al. (2017) Rauber, J., Brendel, W., and Bethge, M. Foolbox: a python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131, 2017.
 Scardapane et al. (2017) Scardapane, S., Comminiello, D., Hussain, A., and Uncini, A. Group sparse regularization for deep neural networks. Neurocomputing, 241:81–89, 2017.
 Sidiropoulos et al. (2014) Sidiropoulos, N. D., Papalexakis, E. E., and Faloutsos, C. Parallel randomly compressed cubes: A scalable distributed architecture for big tensor decomposition. IEEE Signal Processing Magazine, 31(5):57–70, 2014.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Sudlow et al. (2015) Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., et al. Uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine, 12(3):e1001779, 2015.
 Swan et al. (1998) Swan, G. E., DeCarli, C., Miller, B., Reed, T., Wolf, P., Jack, L., and Carmelli, D. Association of midlife blood pressure to latelife cognitive decline and brain morphology. Neurology, 51(4):986–993, 1998.
 Tai et al. (2015) Tai, C., Xiao, T., Zhang, Y., Wang, X., et al. Convolutional neural networks with lowrank regularization. arXiv preprint arXiv:1511.06067, 2015.
 Tsourakakis (2010) Tsourakakis, C. E. Mach: Fast randomized tensor decompositions. In Proceedings of the 2010 SIAM International Conference on Data Mining, pp. 689–700. SIAM, 2010.

 Vervliet et al. (2014) Vervliet, N., Debals, O., Sorber, L., and De Lathauwer, L. Breaking the curse of dimensionality using decompositions of incomplete tensors: Tensorbased scientific computing in big data analysis. IEEE Signal Processing Magazine, 31(5):71–79, 2014.
 Wan et al. (2013) Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus, R. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pp. 1058–1066, 2013.
 Wang et al. (1998) Wang, L., Lai, H.M., Barker, G. J., Miller, D. H., and Tofts, P. S. Correction for variations in mri scanner sensitivity in brain studies with histogram matching. Magnetic resonance in medicine, 39(2):322–327, 1998.
 Wang et al. (2015) Wang, Y., Tung, H.Y., Smola, A. J., and Anandkumar, A. Fast and guaranteed tensor decomposition via sketching. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems (NIPS), pp. 991–999. 2015.
 Yu et al. (2017) Yu, X., Liu, T., Wang, X., and Tao, D. On compressing deep models by low rank and sparse decomposition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 67–76. IEEE, 2017.
 Yuan et al. (2019) Yuan, L., Li, C., Cao, J., and Zhao, Q. Randomized tensor ring decomposition and its application to largescale data reconstruction. arXiv preprint arXiv:1901.01652, 2019.
 Zhang et al. (2016) Zhang, Y., Lee, J. D., and Jordan, M. I. l1regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, pp. 993–1001, 2016.
 Zhou et al. (2014) Zhou, G., Cichocki, A., and Xie, S. Decomposition of big tensors with low multilinear rank. arXiv preprint arXiv:1412.1885, 2014.