1 Introduction
Convolutional neural networks (CNNs) and their recent extensions have significantly advanced the ability to solve complex computer vision tasks, such as image classification, object detection, instance segmentation, and image generation. Together with big data and the fast development of the internet of things, CNNs bring new tools for solving computer science problems which are intractable with classical approaches.
Despite the great success and rapid development of CNNs, most modern neural network architectures contain a huge number of parameters in their convolutional and fully connected layers and therefore demand extremely high computational costs [47], which makes them difficult to deploy on devices with limited computing resources, such as PCs or mobile devices. Common approaches to reduce the redundancy of the neural network parameters are structural pruning [22, 59, 14, 21], sparsification [16, 13, 37], quantization [45, 2] and low-rank approximation [11, 33, 27, 4, 15, 29].
The weights of convolutional and fully connected layers are usually overparameterized and known to lie on a low-rank subspace [10]. Hence, it is possible to represent them in low-rank tensor or tensor-network formats using, e.g., the Canonical Polyadic decomposition (CPD) [11, 33, 1], the Tucker decomposition [27, 15], or the Tensor Train decomposition [38, 55]. The decomposed layers are represented by a sequence of new layers with much smaller kernel sizes, which reduces the number of parameters and the computational cost of the original model.
Various low-rank tensor/matrix decompositions can be straightforwardly applied to compress the kernels. This article promotes the simplest tensor decomposition model, the Canonical Polyadic decomposition (CPD).
1.1 Why CPD
In neural network models working with images, the convolutional kernels are usually tensors of order 4 with severely unbalanced dimensions, e.g., of size $D \times D \times S \times T$, where $D$ is the spatial filter size and $S$ and $T$ denote the numbers of input and output channels, respectively. The spatial filters are typically of relatively small size, e.g., $3\times 3$ or $5\times 5$, compared to the input ($S$) and output ($T$) dimensions, which in total may amount to hundreds of thousands of filters. This leads to excessive redundancy among the kernel filters, which makes them particularly well suited to tensor decomposition methods. Among low-rank tensor decompositions and tensor networks, the Canonical Polyadic tensor decomposition [18, 23] is the simplest and most elegant model: it represents a tensor as a sum of rank-1 tensors^1 or, equivalently, by factor matrices interconnected through a diagonal core tensor (Fig. 0(a)). The number of parameters of a CP model of rank $R$ is $R(2D + S + T)$ or $R(D^2 + S + T)$ when the kernel is treated as an order-4 tensor or as its reshaped order-3 version, respectively. CPD usually attains a relatively high compression ratio, since the decomposition rank is not very large [33, 15].
^1 A rank-1 tensor of size $I_1 \times I_2 \times \cdots \times I_N$ is the outer product of $N$ vectors with dimensions $I_1, I_2, \ldots, I_N$.
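For a rough sense of scale, consider an illustrative layer with $D = 3$ and $S = T = 256$ (these particular numbers are ours, chosen only for illustration) compressed with CP rank $R = 100$:
$$D^2\, S\, T \;=\; 9 \cdot 256 \cdot 256 \;=\; 589{,}824 \qquad\text{vs.}\qquad R\,(D^2 + S + T) \;=\; 100\,(9 + 256 + 256) \;=\; 52{,}100,$$
i.e., roughly an elevenfold reduction in the number of parameters of this layer.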
Representing high-order convolutional kernels in the CP format is equivalent to using separable convolutions. In [29], the authors modeled the high-order kernels in a generalized multiway convolution by the CP model.
The Tucker tensor decomposition (TKD) [52] is an alternative tensor decomposition method for convolutional kernel compression [27]. The TKD provides more flexible interaction between the factor matrices through a core tensor, which is often dense in practice (Fig. 0(b)). Kim et al. [27] investigated low-rank models at the most suitable noise level for different unfoldings^2 of the kernel tensor. This heuristic method does not consider a common noise level across modes and is not optimal for attaining the approximation error bound.
^2 The mode-$n$ unfolding of an order-$N$ tensor of size $I_1 \times \cdots \times I_N$ reorders the elements of the tensor into a matrix with $I_n$ rows and $\prod_{k \neq n} I_k$ columns.
The block tensor decomposition (BTD) [7] is an extension of the TKD which models the data as a sum of several Tucker or Kruskal terms, i.e., a TKD with a block-diagonal core tensor. For the same multilinear rank as in TKD, BTD uses a smaller number of parameters; however, there are no proper criteria available for selecting the block sizes (the rank of the BTD).
In addition, other tensor networks, e.g., the Tensor Train [39] or the Tensor Chain (TC) [12, 26], are not applicable unless the kernel filters are tensorized to higher orders. Besides, the Tensor Chain contains a loop, is not closed, and leads to severe numerical instability when finding the best approximation; see Theorem 14.1.2.2 in [32] and [17].
We later show that CPD can achieve much better performance with an even higher compression ratio by further compressing the Tucker core tensors through a suitably formulated optimization problem.
1.2 Why not Standard CPD
In one of the first works applying CPD to convolutional kernels, Denton et al. [11] computed the CPD by sequentially extracting the best rank-1 approximation in a greedy way. This type of deflation procedure is not a proper way to compute a CPD, unless the tensor is orthogonally decomposable [57] or strong assumptions hold, e.g., at least two factor matrices are linearly independent and the tensor rank does not exceed any dimension of the tensor [43]. The reason is that subtracting the best rank-1 tensor is not guaranteed to decrease the rank of the tensor [49].
In [33], the authors approximated the convolutional kernel tensors using the Nonlinear Least Squares (NLS) algorithm [54], one of the best existing algorithms for CPD. However, as mentioned in the Ph.D. thesis [34], it is not trivial to optimize a neural network even when the weights of a single layer are factorized, and the authors "failed to find a good SGD learning rate" when fine-tuning a classification model on the ILSVRC12 dataset.
Diverging Components (Degeneracy). A common phenomenon when using numerical optimization algorithms to approximate a tensor of relatively high rank by a low-rank model, or a tensor which has a non-unique CPD, is that there exist at least two rank-one tensors whose Frobenius norms (intensities) are relatively high but which cancel each other out [8].
The degeneracy of CPD has been reported in the literature, e.g., in [36, 40, 19, 30, 5, 46]. Imposing additional constraints on the factor matrices can improve stability and accelerate convergence, e.g., column-wise orthogonality [46, 30] or positivity/nonnegativity [35]. However, such constraints are not always applicable to the data at hand, and they prevent the estimator from attaining a lower approximation error, yielding a trade-off between estimation stability and approximation error.
We have applied CPD approximations to various CNNs and confirm that diverging components occur in most cases when we use either the Alternating Least Squares (ALS) or the NLS [54] algorithm. As an example, we approximated one of the last convolutional layers of ResNet18 with a rank-500 CPD and plotted in Fig. 2 (left) the intensities of the CPD components, i.e., the Frobenius norms of the rank-1 tensors. The ratio between the largest and smallest intensities of the rank-1 tensors was greater than 30. Fig. 2 (right) shows that the sum of squared intensities of the CPD components grows (exponentially) with the number of components. Another criterion, sensitivity (Definition 1), shows that the standard CPD algorithms are not robust to small perturbations of the factor matrices, and sensitivity increases with higher CP rank. (As shown in [53], the RMS error is not the only minimization criterion for a particular computer vision task.)
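For concreteness, the intensities used in this diagnostic can be computed directly from the factor matrices. The following NumPy sketch (ours, for illustration; it is not tied to any released implementation) evaluates the Frobenius norm of each rank-1 term:

```python
import numpy as np

def component_intensities(A, B, C):
    """Frobenius norms of the rank-1 terms a_r o b_r o c_r of a CPD.

    A, B, C are factor matrices of shapes (I, R), (J, R), (K, R).
    ||a o b o c||_F = ||a|| * ||b|| * ||c|| for each column r.
    """
    return (np.linalg.norm(A, axis=0)
            * np.linalg.norm(B, axis=0)
            * np.linalg.norm(C, axis=0))

# Example with random factors; in practice A, B, C come from ALS/NLS
# applied to a reshaped convolutional kernel.
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((n, 50)) for n in (9, 256, 256))
lam = component_intensities(A, B, C)
print(lam.max() / lam.min())   # ratio of largest to smallest intensity
```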
Such degeneracy causes instability when training a CNN with decomposed layers in the CP (or Kruskal) format. More specifically, it makes it difficult for the neural network to be fine-tuned, to select a good set of parameters, and to maintain stability of the entire network. This problem has not been investigated thoroughly; to the best of our knowledge, there is no method for handling it.
1.3 Contributions
In this paper, we address the stability of CNNs compressed by CPD. The key advantages and major contributions of our paper are the following:
- We propose a new stable and efficient method for neural network compression based on low-rank tensor decompositions.
- We demonstrate how to deal with degeneracy, the most severe problem when approximating convolutional kernels with CPD. Our approach allows finding a reliable CPD representation with minimal sensitivity and intensity.
- We show that the combination of Tucker-2 (TKD) and the proposed stable CPD (Fig. 0(c)) outperforms CPD in terms of the accuracy/compression trade-off.
- We provide results of extensive experiments to confirm the efficiency of the proposed algorithms. In particular, we empirically show that a neural network with weights in factorized CP format obtained using our algorithms is more stable during fine-tuning and recovers (close to) the initial accuracy faster.
2 Stable Tensor Decomposition Method
2.1 CP Decomposition of Convolutional Kernel
In CNNs, the convolutional layer performs a mapping of an input (source) tensor $\mathcal{X}$ of size $W \times H \times S$ into an output (target) tensor $\mathcal{Y}$ of size $W' \times H' \times T$ following the relation

$$\mathcal{Y}_{w', h', t} \;=\; \sum_{i=1}^{D}\sum_{j=1}^{D}\sum_{s=1}^{S} \mathcal{K}_{i, j, s, t}\; \mathcal{X}_{w_i, h_j, s}, \qquad (1)$$

where $w_i = (w'-1)\Delta + i - P$ and $h_j = (h'-1)\Delta + j - P$, $\mathcal{K}$ is an order-4 kernel tensor of size $D \times D \times S \times T$, $\Delta$ is the stride, and $P$ is the zero-padding size.
Our aim is to decompose the kernel tensor by the CPD or the TKD.
As mentioned earlier, we treat the kernel as an order-3 tensor $\mathcal{K}$ of size $D^2 \times S \times T$ and represent the kernel by a sum of $R$ rank-1 tensors

$$\mathcal{K} \;\simeq\; \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r, \qquad (2)$$

where $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_R]$, $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_R]$ and $\mathbf{C} = [\mathbf{c}_1, \ldots, \mathbf{c}_R]$ are factor matrices of size $D^2 \times R$, $S \times R$ and $T \times R$, respectively. See an illustration of the model in Fig. 0(a). The tensor in the Kruskal format uses $R(D^2 + S + T)$ parameters.
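To make the reshaping and the Kruskal format concrete, the sketch below runs a plain (unregularized) alternating least squares (ALS) for the rank-$R$ CPD of an order-3 tensor; it is a minimal reference implementation for illustration, not the NLS or EPC algorithm discussed in this paper, and the reshaping convention for a PyTorch-style kernel is our assumption.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (J x R) and V (K x R) -> (J*K x R)."""
    J, R = U.shape
    K, _ = V.shape
    return (U[:, None, :] * V[None, :, :]).reshape(J * K, R)

def cp_als(T, R, n_iter=200, seed=0):
    """Rank-R CPD of an order-3 tensor T (I x J x K) by alternating least squares."""
    I, J, K = T.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((I, R))
    B = rng.standard_normal((J, R))
    C = rng.standard_normal((K, R))
    T1 = T.reshape(I, J * K)                       # mode-1 unfolding (C order)
    T2 = np.moveaxis(T, 1, 0).reshape(J, I * K)    # mode-2 unfolding
    T3 = np.moveaxis(T, 2, 0).reshape(K, I * J)    # mode-3 unfolding
    for _ in range(n_iter):
        A = T1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = T2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = T3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Example with a PyTorch-style kernel W of shape (T, S, D, D):
#   W3 = np.transpose(W, (2, 3, 1, 0)).reshape(D * D, S, T)
#   A, B, C = cp_als(W3, R)
```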
2.2 Degeneracy and its effect on CNN stability
Degeneracy occurs in most CPDs of convolutional kernels. The Error Preserving Correction (EPC) method [44] suggests a correction to the decomposition results in order to obtain a more stable decomposition with lower sensitivity.
There are two possible measures for assessing the degree of degeneracy of the CPD: the sum of Frobenius norms of the rank-1 tensors [44],

$$SN \;=\; \sum_{r=1}^{R} \|\mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r\|_F \;=\; \sum_{r=1}^{R} \|\mathbf{a}_r\|_2\, \|\mathbf{b}_r\|_2\, \|\mathbf{c}_r\|_2, \qquad (3)$$

and the sensitivity, defined as follows.
Definition 1 (Sensitivity [51])
Given a tensor $\mathcal{K} = [\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!] = \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$, define the sensitivity as

$$ss(\mathcal{K}) \;=\; \lim_{\sigma^2 \to 0} \frac{1}{\sigma^2}\, \mathrm{E}\!\left\{ \big\| \mathcal{K} - [\![\mathbf{A} + \delta\mathbf{A},\, \mathbf{B} + \delta\mathbf{B},\, \mathbf{C} + \delta\mathbf{C}]\!] \big\|_F^2 \right\}, \qquad (4)$$

where $\delta\mathbf{A}$, $\delta\mathbf{B}$, $\delta\mathbf{C}$ have random i.i.d. elements from $\mathcal{N}(0, \sigma^2)$.
The sensitivity of the decomposition is thus measured by the expectation ($\mathrm{E}\{\cdot\}$) of the normalized squared Frobenius norm of the difference. In other words, the sensitivity of the tensor $\mathcal{K}$ measures how strongly the approximation reacts to perturbations of the individual factor matrices. CPDs with high sensitivity are usually useless.
Lemma 1

$$ss([\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]) \;=\; D^2\, \mathrm{tr}\{(\mathbf{B}^T\mathbf{B}) \circledast (\mathbf{C}^T\mathbf{C})\} \;+\; S\, \mathrm{tr}\{(\mathbf{A}^T\mathbf{A}) \circledast (\mathbf{C}^T\mathbf{C})\} \;+\; T\, \mathrm{tr}\{(\mathbf{A}^T\mathbf{A}) \circledast (\mathbf{B}^T\mathbf{B})\}, \qquad (5)$$

where $\circledast$ denotes the Hadamard element-wise product.
Proof
First, the perturbed tensor in (4) can be expressed as a sum of 8 Kruskal terms,
$$[\![\mathbf{A}+\delta\mathbf{A},\, \mathbf{B}+\delta\mathbf{B},\, \mathbf{C}+\delta\mathbf{C}]\!] \;=\; [\![\mathbf{A},\mathbf{B},\mathbf{C}]\!] + [\![\delta\mathbf{A},\mathbf{B},\mathbf{C}]\!] + [\![\mathbf{A},\delta\mathbf{B},\mathbf{C}]\!] + [\![\mathbf{A},\mathbf{B},\delta\mathbf{C}]\!] + \cdots$$
Since these Kruskal terms are uncorrelated and the expectations of the terms composed of two or three perturbation matrices $\delta\mathbf{A}$, $\delta\mathbf{B}$, $\delta\mathbf{C}$ are negligible, the expectation in (4) can be expressed in the form

$$ss(\mathcal{K}) \;=\; \lim_{\sigma^2 \to 0} \frac{1}{\sigma^2}\Big( \mathrm{E}\{\|[\![\delta\mathbf{A},\mathbf{B},\mathbf{C}]\!]\|_F^2\} + \mathrm{E}\{\|[\![\mathbf{A},\delta\mathbf{B},\mathbf{C}]\!]\|_F^2\} + \mathrm{E}\{\|[\![\mathbf{A},\mathbf{B},\delta\mathbf{C}]\!]\|_F^2\} \Big). \qquad (6)$$

Next we expand the Frobenius norms of the three Kruskal tensors,

$$\mathrm{E}\{\|[\![\delta\mathbf{A},\mathbf{B},\mathbf{C}]\!]\|_F^2\} \;=\; \mathrm{E}\{\|\delta\mathbf{A}\,(\mathbf{C}\odot\mathbf{B})^T\|_F^2\} \;=\; \sigma^2 D^2\, \mathrm{tr}\{(\mathbf{B}^T\mathbf{B}) \circledast (\mathbf{C}^T\mathbf{C})\}, \qquad (7)$$
$$\mathrm{E}\{\|[\![\mathbf{A},\delta\mathbf{B},\mathbf{C}]\!]\|_F^2\} \;=\; \mathrm{E}\{\|\delta\mathbf{B}\,(\mathbf{C}\odot\mathbf{A})^T\|_F^2\} \;=\; \sigma^2 S\, \mathrm{tr}\{(\mathbf{A}^T\mathbf{A}) \circledast (\mathbf{C}^T\mathbf{C})\}, \qquad (8)$$
$$\mathrm{E}\{\|[\![\mathbf{A},\mathbf{B},\delta\mathbf{C}]\!]\|_F^2\} \;=\; \mathrm{E}\{\|\delta\mathbf{C}\,(\mathbf{B}\odot\mathbf{A})^T\|_F^2\} \;=\; \sigma^2 T\, \mathrm{tr}\{(\mathbf{A}^T\mathbf{A}) \circledast (\mathbf{B}^T\mathbf{B})\}, \qquad (9)$$

where $\odot$ denotes the Khatri-Rao (column-wise Kronecker) product.
Finally, substituting the above expressions into (6) yields the compact expression of the sensitivity.
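Under the reconstruction of Lemma 1 given above, the closed-form sensitivity is only a few lines of NumPy. The snippet below (ours, for illustration) can also be checked against a Monte Carlo estimate of (4):

```python
import numpy as np

def sensitivity(A, B, C):
    """Closed-form sensitivity of the Kruskal tensor [[A, B, C]] (cf. Lemma 1).

    For a kernel reshaped to D^2 x S x T, the leading dimensions are
    I = D^2, J = S, K = T.
    """
    I, J, K = A.shape[0], B.shape[0], C.shape[0]
    AA, BB, CC = A.T @ A, B.T @ B, C.T @ C
    return (I * np.trace(BB * CC)      # '*' is the Hadamard product here
            + J * np.trace(AA * CC)
            + K * np.trace(AA * BB))
```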
2.3 Stabilization Method
2.3.1 Sensitivity minimization
The first method to correct a CPD with diverging components, proposed in [44], minimizes the sum of Frobenius norms of the rank-1 tensors while the approximation error is bounded. In [51], a Krylov Levenberg-Marquardt algorithm was proposed for CPD with a bounded sensitivity constraint.
In this paper, we propose a variant of the EPC method which minimizes the sensitivity of the decomposition while preserving the approximation error, i.e.,

$$\min_{\{\mathbf{A},\, \mathbf{B},\, \mathbf{C}\}} \;\; ss([\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]) \qquad \text{s.t.} \qquad \|\mathcal{K} - [\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]\|_F^2 \;\le\; \delta^2. \qquad (10)$$

The bound $\delta^2$ can represent the approximation error of the decomposition with diverging components. Continuing the CPD from the corrected decomposition with a lower sensitivity can then improve its convergence.
2.3.2 Update rules
We derive alternating update formulas for the above optimization problem. While $\mathbf{B}$ and $\mathbf{C}$ are kept fixed, the objective function is rewritten so as to update $\mathbf{A}$ as

$$\min_{\mathbf{A}} \;\; \sum_{r=1}^{R} w_r\, \|\mathbf{a}_r\|_2^2 \qquad \text{s.t.} \qquad \|\mathbf{K}_{(1)} - \mathbf{A}\,(\mathbf{C} \odot \mathbf{B})^T\|_F^2 \;\le\; \delta^2, \qquad (11)$$

where $\mathbf{K}_{(1)}$ is the mode-1 unfolding of the kernel tensor $\mathcal{K}$, $\mathbf{W} = S\,(\mathbf{C}^T\mathbf{C}) + T\,(\mathbf{B}^T\mathbf{B})$ is a symmetric matrix of size $R \times R$, and $\mathbf{w} = [w_1, \ldots, w_R]^T$ is a vector of length $R$ taken from the diagonal of $\mathbf{W}$.
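Solving (11) exactly requires a quadratic program over an ellipsoid (see [41]). Purely as an illustration of the structure of the update, the sketch below solves a Lagrangian (penalized) surrogate, $\min_{\mathbf{A}} \|\mathbf{K}_{(1)} - \mathbf{A}\mathbf{Z}^T\|_F^2 + \mu \sum_r w_r \|\mathbf{a}_r\|_2^2$, in closed form; the penalty weight $\mu$ stands in for the multiplier implied by the bound $\delta^2$ and is not part of the original formulation.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (J x R) and V (K x R) -> (J*K x R)."""
    J, R = U.shape
    K, _ = V.shape
    return (U[:, None, :] * V[None, :, :]).reshape(J * K, R)

def update_A_surrogate(K1, B, C, dims, mu):
    """Sensitivity-penalized least-squares update of the factor A (surrogate).

    Minimizes ||K1 - A Z^T||_F^2 + mu * sum_r w_r ||a_r||^2, where
    Z = khatri_rao(B, C) matches the C-order mode-1 unfolding K1 (shape D^2 x S*T),
    and w = diag(S * C^T C + T * B^T B) with dims = (S, T).
    This is a relaxation for illustration, not the algorithm of [41].
    """
    S, T = dims
    Z = khatri_rao(B, C)                        # (S*T, R)
    w = np.diag(S * (C.T @ C) + T * (B.T @ B))  # sensitivity weights, length R
    G = Z.T @ Z + mu * np.diag(w)               # (R, R) regularized Gram matrix
    return K1 @ Z @ np.linalg.inv(G)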
Remark 1
Remark 2
2.4 Tucker Decomposition with Bound Constraint
Another well-known representation of multiway data is the Tucker decomposition [52], which decomposes a given tensor into a core tensor and a set of factor matrices (see Fig. 0(b) for an illustration). The Tucker decomposition is particularly suited as a prior compression step for CPD: in this case, we compute the CPD of the core tensor of the TKD, which has smaller dimensions than the original kernel.
For our problem, we are interested in the Tucker-2 model (see Fig. 0(b)),

$$\mathcal{K} \;\simeq\; \mathcal{G} \times_2 \mathbf{U} \times_3 \mathbf{V}, \qquad (14)$$

where $\mathcal{G}$ is the core tensor of size $D^2 \times R_2 \times R_3$, and $\mathbf{U}$ and $\mathbf{V}$ are matrices of size $S \times R_2$ and $T \times R_3$, respectively. Because of rotational ambiguity, without loss of generality, the matrices $\mathbf{U}$ and $\mathbf{V}$ can be assumed to have orthonormal columns.
Different from the ordinary TK-2, we seek the smallest TK-2 model, i.e., the one with minimal multilinear ranks $(R_2, R_3)$, which satisfies the approximation error bound [42], i.e.,

$$\min_{\mathbf{U},\, \mathbf{V},\, \mathcal{G}} \;\; (R_2, R_3) \qquad \text{s.t.} \qquad \|\mathcal{K} - \mathcal{G} \times_2 \mathbf{U} \times_3 \mathbf{V}\|_F^2 \;\le\; \delta^2. \qquad (15)$$
We will show that the core tensor has a closed-form expression, as in the HOOI algorithm for the orthogonal Tucker decomposition [6], and that the two factor matrices $\mathbf{U}$ and $\mathbf{V}$ can be sequentially estimated through eigenvalue decompositions (EVD).
Lemma 2
The core tensor has the closed-form expression $\mathcal{G} = \mathcal{K} \times_2 \mathbf{U}^T \times_3 \mathbf{V}^T$.
Proof
From the error bound condition, we can derive
$$\|\mathcal{K} \times_2 \mathbf{U}^T \times_3 \mathbf{V}^T - \mathcal{G}\|_F^2 \;\le\; \|\mathcal{K} - \mathcal{G} \times_2 \mathbf{U} \times_3 \mathbf{V}\|_F^2 \;\le\; \delta^2,$$
which indicates that the core tensor can be expressed as $\mathcal{G} = \mathcal{K} \times_2 \mathbf{U}^T \times_3 \mathbf{V}^T + \mathcal{E}$, where $\mathcal{E}$ is an error tensor whose norm obeys $\|\mathcal{E}\|_F \le \delta$.
Next, define a matrix $\mathbf{Q}_u$ of size $S \times S$,

(16)

Assume that $\mathbf{U}$ is the optimal factor matrix with the minimal rank $R_2$. The optimization in (15) then becomes a rank minimization problem for $\mathbf{U}$,

(17)

The optimal factor matrix $\mathbf{U}$ comprises the $R_2$ principal eigenvectors of $\mathbf{Q}_u$, where $R_2$ is the smallest number of leading eigenvalues whose sum exceeds the bound, that is, $\sum_{r=1}^{R_2} \lambda_r \ge \|\mathcal{K}\|_F^2 - \delta^2$. The minimal number of columns is obviously achieved when the bound is smallest, implying that the optimal core tensor is $\mathcal{G} = \mathcal{K} \times_2 \mathbf{U}^T \times_3 \mathbf{V}^T$. This completes the proof.

Similarly to the update of $\mathbf{U}$, the matrix $\mathbf{V}$ comprises the $R_3$ principal eigenvectors of a matrix $\mathbf{Q}_v$ of size $T \times T$,

(18)

where $R_2$ is either given or determined from the bound $\delta$. The algorithm for the TKD sequentially updates $\mathbf{U}$ and $\mathbf{V}$.
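A minimal HOSVD-style sketch of this rank selection is given below: each factor is formed from the leading eigenvectors of the Gram matrix of the corresponding unfolding, keeping the smallest number of eigenvectors whose discarded eigenvalue mass fits the error budget. The exact matrices in (16) and (18), and the sequential update used in the paper, may differ; the even split of the error budget between the two modes is our simplification.

```python
import numpy as np

def tucker2_factor(K, mode, delta2):
    """Leading eigenvectors of the Gram matrix of a mode unfolding.

    Keeps the smallest number of eigenvectors such that the discarded
    eigenvalue mass does not exceed the error budget delta2 (= delta**2).
    """
    Kn = np.moveaxis(K, mode, 0).reshape(K.shape[mode], -1)  # mode unfolding
    lam, U = np.linalg.eigh(Kn @ Kn.T)                       # ascending order
    lam, U = lam[::-1], U[:, ::-1]                           # descending order
    discarded = np.cumsum(lam[::-1])[::-1]                   # tail sums of eigenvalues
    tail = np.concatenate([discarded[1:], [0.0]])            # mass discarded if rank = i+1
    rank = int(np.argmax(tail <= delta2)) + 1                # smallest admissible rank
    return U[:, :rank]

def tucker2(K, delta2):
    """Tucker-2 of an order-3 kernel tensor K (D^2 x S x T): K ~ G x_2 U x_3 V."""
    U = tucker2_factor(K, 1, delta2 / 2)
    V = tucker2_factor(K, 2, delta2 / 2)
    G = np.einsum('dst,sp,tq->dpq', K, U, V)   # core = K x_2 U^T x_3 V^T
    return G, U, V
```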
3 Implementation
Our method for neural network compression includes the following main steps (see Fig. 3):
- Each convolutional kernel is approximated by a tensor decomposition (CPD/TKD-CPD in the case of ordinary convolutions, and SVD in the case of $1\times 1$ convolutions) with a given rank $R$.
- The CP decomposition with diverging components is corrected using the error-preserving method. The result is a new CP model with minimal sensitivity.
- The initial convolutional kernel is replaced by the tensor in CPD/TKD-CPD or SVD format, which is equivalent to replacing one convolutional layer with a sequence of convolutional layers with a smaller total number of parameters.
- The entire network is then fine-tuned using backpropagation.
CPD Block results in three convolutional layers: a $1\times 1$ convolution ($S \to R$ channels), a depthwise $D\times D$ convolution, and a $1\times 1$ convolution ($R \to T$ channels), respectively (see Fig. 2(a)). In the obtained structure, all spatial convolutions are performed by the central group convolution with $R$ channels, while the $1\times 1$ convolutions transfer the input data to a more compact channel space (with $R$ channels) and then return the data to the initial channel space (a PyTorch sketch is given below).
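In PyTorch, such a block can be assembled roughly as follows (a sketch under the shape conventions used above; biases, stride/padding bookkeeping, and batch-norm placement are omitted):

```python
import torch
import torch.nn as nn

def cpd_block(A, B, C, D, stride=1, padding=0):
    """Replace a D x D convolution (S -> T channels) by a rank-R CPD block.

    A: (D*D, R) spatial factor, B: (S, R) input factor, C: (T, R) output factor.
    Layers: 1x1 conv S->R, depthwise D x D conv on R channels, 1x1 conv R->T.
    """
    S, R = B.shape
    T = C.shape[0]
    first = nn.Conv2d(S, R, kernel_size=1, bias=False)
    depth = nn.Conv2d(R, R, kernel_size=D, stride=stride,
                      padding=padding, groups=R, bias=False)
    last = nn.Conv2d(R, T, kernel_size=1, bias=False)
    with torch.no_grad():
        first.weight.copy_(torch.as_tensor(B.T).reshape(R, S, 1, 1))
        depth.weight.copy_(torch.as_tensor(A.T).reshape(R, 1, D, D))
        last.weight.copy_(torch.as_tensor(C).reshape(T, R, 1, 1))
    return nn.Sequential(first, depth, last)
```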
TKD-CPD Block is similar to the CPD block but has 4 convolutional layers, with the condition that the CP rank $R$ must exceed the multilinear ranks $R_2$ and $R_3$ (see Fig. 2(c)). This structure allows a further reduction of the number of parameters and floating-point operations in a factorized layer. Otherwise, when $R \le R_2$ and $R \le R_3$, sequential $1\times 1$ convolutions can be merged into one convolution, converting the TKD-CPD layer format into the CPD block.
SVD Block is a variant of the CPD block but comprises only two factor layers, computed using the SVD. Degeneracy is not an issue in this block, and no correction is applied (see Fig. 2(b)).
Rank Search Procedure. Determining the CP rank is an NP-hard problem [23]. We observe that the accuracy drop caused by a single factorized layer is indicative of the accuracy after fine-tuning of the whole network. In our experiments, we apply a heuristic binary search to find the smallest rank such that the drop after single-layer fine-tuning does not exceed a predefined accuracy drop threshold EPS (a sketch is given below).
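The rank search can be sketched as a standard binary search over $R$, assuming the accuracy drop is monotone in the rank; `accuracy_after_local_finetune` is a placeholder for factorizing the layer at a given rank, fine-tuning only that layer, and evaluating the model:

```python
def find_rank(layer, r_min, r_max, eps, baseline_acc,
              accuracy_after_local_finetune):
    """Smallest rank whose accuracy drop after single-layer fine-tuning <= eps."""
    best = r_max
    while r_min <= r_max:
        r = (r_min + r_max) // 2
        drop = baseline_acc - accuracy_after_local_finetune(layer, r)
        if drop <= eps:
            best, r_max = r, r - 1   # feasible: try a smaller rank
        else:
            r_min = r + 1            # too lossy: increase the rank
    return best
```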
4 Experiments
We test our algorithms on three representative convolutional neural network architectures for image classification: VGG16 [48], ResNet18, and ResNet50 [20]. We compressed the convolutional kernels with CPD, CPD with sensitivity correction (CPD-EPC), and Tucker-CPD with the correction (TKD-CPD-EPC). The networks after fine-tuning are evaluated through top-1 and top-5 accuracy on ILSVRC12 [9] and CIFAR100 [31].
We conducted a series of layerwise compression experiments and measured accuracy recovery and whole model compression of the decomposed architectures. Most of our experiments were devoted to the approximation of single layers when other layers remained intact. In addition, we performed compression of entire networks.
The experiments were conducted with the popular neural network framework PyTorch on a GPU server with NVIDIA V100 GPUs. As a baseline for ILSVRC12 we used the pretrained model shipped with Torchvision. The baseline CIFAR100 model was trained using the Cutout method. The fine-tuning process consists of two parts: local (single-layer) fine-tuning and entire-network fine-tuning. The model was trained with an SGD optimizer, decaying the learning rate at each loss saturation stage and applying weight decay.
4.1 Layer-Wise Study
4.1.1 CPD-EPC vs. CPD
For this study, we decomposed the kernel filters in 17 convolutional layers of ResNet18 with different CP ranks, ranging from small (10) to relatively high (500). The CPDs were run with a sufficiently large number of iterations so that all models converged or there was no significant improvement in the approximation error.
Experimental results show that, for all decomposition ranks, CPD-EPC regularly yields considerably higher top-1 and top-5 model accuracy than the standard CPD algorithm. Fig. 4 (left) demonstrates an illustrative example for layer 4.1.conv1. An important observation is that the network compressed using standard CPD does not reach the original network's accuracy even with a rank of 500 (and fine-tuning). With EPC, however, the performance is much better and attains the original accuracy with a rank of 450. Even a much smaller model with a rank of 250 yields a relatively good result, with less than 1% loss of accuracy.
Next, each convolutional layer in ResNet18 was approximated with different CP ranks and fine-tuned; the best model in terms of top-1 accuracy was then selected. Fig. 4 (right) shows the relation between the sensitivity and the accuracy of the best models. It is easy to see that the models estimated using standard CPD exhibit high sensitivity and are hard to train. CPD-EPC suppressed the sensitivity of the estimated models and improved the performance of the compressed networks. CPD-EPC gained the most remarkable accuracy recovery on the deeper layers of the CNN.
The effect is particularly significant for some deep convolutional layers of the network, where the difference in top-1 accuracy is largest.
4.1.2 CPD-EPC vs. TKD-EPC
Next, we investigated the proposed compression approach based on the hybrid TKD-CPD model with sensitivity control. Similar experiments were conducted on the CIFAR100 dataset. The TKD multilinear ranks were kept fixed, while the CP rank varied over a wide range.
In Fig. 5, we compare the accuracy of the two considered compression approaches applied to layer 4.0.conv1 in ResNet18. For this case, CPD-EPC still demonstrated good performance. The obtained accuracy is very consistent, implying that the layer exhibits a low-rank structure. The hybrid TKD-CPD yielded a rather low accuracy for small models, i.e., with small ranks, much worse than the CPD-based model with fewer or approximately the same number of parameters. However, the method quickly attained the original top-1 accuracy and even exceeded the top-5 accuracy once the CP rank was sufficiently large.
A comparison of accuracy vs. the number of FLOPs and parameters for the other layers is provided in Fig. 6. Each dot in the figure represents (accuracy, number of FLOPs) for one model, and the dots for the same layer are connected by dashed lines. Once again, TKD-EPC achieved higher top-1 and top-5 accuracy with a smaller number of parameters and FLOPs, compared to CPD-EPC.
4.2 Full Model Compression
In this section, we demonstrate the efficiency of our proposed method in full-model compression of three well-known CNNs, VGG16 [48], ResNet18, and ResNet50 [20], on ILSVRC12. We compressed all convolutional layers and kept the fully connected layers intact. Table 1 reports the reduction in the number of weights and FLOPs for VGG16, ResNet18, and ResNet50, and shows that our approach yields a high compression ratio while having a moderate accuracy drop.
VGG [48]. We compared our method with other low-rank compression approaches on VGG16. The Asym method [58] is one of the first successful methods for whole-VGG16 compression. It exploits a matrix decomposition based on the SVD and is able to reduce the number of FLOPs by a factor of 5. Kim et al. [27] applied TKD with ranks selected by VBMF and achieved a comparable compression ratio but with a smaller accuracy drop. As can be seen from Table 1, our approach outperformed both Asym and TKD in terms of compression ratio and accuracy drop.
ResNet18 [20]. This architecture is one of the lightest in the ResNet family that still gives relatively high accuracy. Most convolutional layers in ResNet18 have kernel size $3 \times 3$, making it a perfect candidate for low-rank compression methods. We compared our results with channel pruning methods [25, 59, 14] and an iterative low-rank approximation method [15]. Among all the considered results, our approach showed the best performance in terms of the compression / accuracy-drop trade-off.
ResNet50 [20]. Compared to ResNet18, ResNet50 is a deeper and heavier neural network, which is used as a backbone in various modern applications, such as object detection and segmentation. Its large number of convolutions deteriorates the performance of low-rank decomposition-based methods, and there is not much relevant literature available on compressing this type of ResNet. To the best of our knowledge, our results can be considered the first attempt to compress the entire ResNet50.
Inference time for ResNet50. We briefly compare the inference time of ResNet50 for the image classification task in Table 2. The measurements were taken on three platforms: a CPU server with an Intel® Xeon® Silver 4114 CPU @ 2.20 GHz, a GPU server with an NVIDIA® Tesla® V100, and a Qualcomm® Snapdragon™ 845 mobile CPU. The batch size was chosen to yield a small variance in the inference measurements: 16 on the CPU server, 128 on the GPU server, and 1 on the mobile CPU.
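The timing protocol can be reproduced with a loop of the following kind (a generic sketch, not the exact script used for Table 2): warm-up iterations and, on GPU, explicit synchronization are needed to obtain meaningful mean ± std figures.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, batch, n_warmup=10, n_runs=50):
    """Mean and std of per-batch inference time in seconds."""
    model.eval()
    for _ in range(n_warmup):
        model(batch)
    times = []
    for _ in range(n_runs):
        if batch.is_cuda:
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()
        times.append(time.perf_counter() - t0)
    t = torch.tensor(times)
    return t.mean().item(), t.std().item()
```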
Table 1:

Model | Method | FLOPs reduction | top-1 drop | top-5 drop
VGG16 | Asym. [58] | – | – | 1.00
VGG16 | TKD+VBMF [27] | 4.93 | – | 0.50
VGG16 | Our (EPS¹ = 0.005) | 5.26 | 0.92 | 0.34
ResNet18 | Channel Gating NN [25] | 1.61 | 1.62 | 1.03
ResNet18 | Discrimination-aware Channel Pruning [59] | 1.89 | 2.29 | 1.38
ResNet18 | FBS [14] | 1.98 | 2.54 | 1.46
ResNet18 | MUSCO [15] | 2.42 | 0.47 | 0.30
ResNet18 | Our (EPS¹ = 0.00325) | 3.09 | 0.69 | 0.15
ResNet50 | Our (EPS¹ = 0.0028) | 2.64 | 1.47 | 0.71

¹ EPS: accuracy drop threshold. The rank of the decomposition is chosen to keep the drop in accuracy below EPS.
Table 2: Model inference time.

Platform | Original | Compressed
Intel® Xeon® Silver 4114 CPU @ 2.20 GHz | 3.92 ± 0.02 s | 2.84 ± 0.02 s
NVIDIA® Tesla® V100 | 102.3 ± 0.5 ms | 89.5 ± 0.2 ms
Qualcomm® Snapdragon™ 845 | 221 ± 4 ms | 171 ± 4 ms
5 Discussion and Conclusions
Replacing a large dense kernel of a convolutional or fully connected layer by its low-rank approximation is equivalent to substituting the initial layer with multiple layers which in total have fewer parameters. However, to the best of our knowledge, the sensitivity of such tensor-based models has not been considered before. The closest approach adds a regularizer on the Frobenius norm of each weight to prevent overfitting.
In this paper, we have shown a more direct way to control the sensitivity of a tensor-based network. Through all the experiments on both the ILSVRC12 and CIFAR100 datasets, we have demonstrated the validity and reliability of our proposed method for compressing CNNs, which includes a stable decomposition method with minimal sensitivity for both the CPD and the hybrid TKD-CPD.
As we can see from the recent deep learning literature [24, 50, 29], modern state-of-the-art architectures exploit the CP format when constructing blocks of consecutive layers consisting of a $1\times 1$ convolution followed by a depthwise separable convolution. The intuition behind the effectiveness of such a representation is that the first convolution maps the data to a higher-dimensional subspace, where the features are more separable, so that separate convolutional kernels can be applied to process them. Thus, representing weights in the CP format using stable and efficient algorithms is the simplest and most efficient way of constructing reduced convolutional kernels. To the best of our knowledge, our paper is the first work to solve the problem of building weights in the CP format that are stable and consistent with the fine-tuning procedure.
The ability to control the sensitivity and stability of factorized weights might be crucial when approaching incremental learning tasks [3] or multimodal tasks, where information fusion across different modalities is performed through shared weight factors.
Our proposed CPD-EPC method can allow more stable fine-tuning of architectures containing higher-order CP convolutional layers [29, 28], which are potentially very promising due to their ability to propagate the input structure through the whole network. We leave these directions for further research.
Acknowledgements
The work of A.H. Phan, A. Cichocki, I. Oseledets, J. Gusak, K. Sobolev, K. Sozykin and D. Ermilov was supported by the Ministry of Education and Science of the Russian Federation under Grant 14.756.31.0001. The results of this work were achieved during the cooperation project with Noah’s Ark Lab, Huawei Technologies. The authors sincerely thank the Referees for very constructive comments which helped to improve the quality and presentation of the paper. The computing for this project was performed on the Zhores CDISE HPC cluster at Skoltech[56].
References
[1] (2017) CP-decomposition with tensor power method for convolutional neural networks compression. In 2017 IEEE International Conference on Big Data and Smart Computing (BigComp 2017), Jeju Island, South Korea, pp. 115–118.
[2] (2019) Matrix and tensor decompositions for training binary neural networks. arXiv preprint arXiv:1904.07852.
[3] (2020) Incremental multi-domain learning with network latent tensor factorization. In AAAI.
[4] (2018) Adaptive mixture of low-rank factorizations for compact neural modeling. In CDNNRIA Workshop, NIPS.
[5] (2016) Tensor networks for dimensionality reduction and large-scale optimization: Part 1. Low-rank tensor decompositions. Foundations and Trends® in Machine Learning 9 (4–5), pp. 249–429.
[6] (2000) On the best rank-1 and rank-(R1, R2, ..., RN) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications 21, pp. 1324–1342.
[7] (2008) Decompositions of a higher-order tensor in block terms — Parts I and II. SIAM Journal on Matrix Analysis and Applications 30 (3), pp. 1022–1066. Special Issue on Tensor Decompositions and Applications.
[8] (2008) Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM Journal on Matrix Analysis and Applications 30, pp. 1084–1127.
[9] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
[10] (2013) Predicting parameters in deep learning. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), Volume 2, pp. 2148–2156.
[11] (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems 27, pp. 1269–1277.
[12] (2011) Optimization problems in contracted tensor networks. Computing and Visualization in Science 14 (6), pp. 271–285.
[13] (2016) PerforatedCNNs: acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pp. 947–955.
[14] (2019) Dynamic channel pruning: feature boosting and suppression. In International Conference on Learning Representations.
[15] (2019) Automated multi-stage compression of neural networks. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 2501–2508.
[16] (2015) Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems 28, pp. 1135–1143.
[17] (2015) Numerical methods in tensor networks. Ph.D. thesis, Faculty of Mathematics and Informatics, University of Leipzig, Germany.
[18] (1970) Foundations of the PARAFAC procedure: models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics 16, pp. 1–84.
[19] (2004) The problem and nature of degenerate solutions or decompositions of 3-way arrays. In Tensor Decomposition Workshop, Palo Alto, CA.
[20] (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
[21] (2018) Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pp. 2234–2240.
[22] (2018) AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800.
[23] (2013) Most tensor problems are NP-hard. Journal of the ACM 60 (6), pp. 45.
[24] (2019) Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324.
[25] (2019) Channel gating neural networks. In Advances in Neural Information Processing Systems 32, pp. 1886–1896.
[26] (2011) O(d log N)-quantics approximation of N-d tensors in high-dimensional numerical modeling. Constructive Approximation 34 (2), pp. 257–280.
[27] (2016) Compression of deep convolutional neural networks for fast and low power mobile applications. In 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico, Conference Track Proceedings.
[28] (2019) T-Net: parametrizing fully convolutional nets with a single high-order tensor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7822–7831.
[29] (2020) Factorized higher-order CNNs with an application to spatio-temporal emotion estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6060–6069.
[30] (2008) On the non-existence of optimal solutions and the occurrence of "degeneracy" in the CANDECOMP/PARAFAC model. Psychometrika 73, pp. 431–439.
[31] (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto.
[32] (2012) Tensors: Geometry and Applications. Vol. 128, American Mathematical Society, Providence, RI, USA.
[33] (2015) Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations.
[34] (2018) Algorithms for speeding up convolutional neural networks. Ph.D. thesis, Skoltech, Russia.
[35] (2009) Nonnegative approximations of nonnegative tensors. Journal of Chemometrics 23 (7–8), pp. 432–441.
[36] (1994) Slowly converging PARAFAC sequences: swamps and two-factor degeneracies. Journal of Chemometrics 8, pp. 155–168.
[37] (2017) Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2498–2507.
[38] (2015) Tensorizing neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS'15), Volume 1, pp. 442–450.
[39] (2009) Breaking the curse of dimensionality, or how to use SVD in many dimensions. SIAM Journal on Scientific Computing 31 (5), pp. 3744–3759.
[40] (2000) Construction and analysis of degenerate PARAFAC models. Journal of Chemometrics 14 (3), pp. 285–299.
[41] (2020) Quadratic programming over ellipsoids with applications to constrained linear regression and tensor decomposition. Neural Computing and Applications.
[42] (2020) Tensor networks for latent variable analysis: novel algorithms for tensor train approximation. IEEE Transactions on Neural Networks and Learning Systems.
[43] (2015) Tensor deflation for CANDECOMP/PARAFAC. Part 1: alternating subspace update algorithm. IEEE Transactions on Signal Processing 63 (12), pp. 5924–5938.
[44] (2019) Error preserving correction: a method for CP decomposition at a target error bound. IEEE Transactions on Signal Processing 67 (5), pp. 1175–1190.
[45] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
[46] (1997) Two-factor degeneracies and a stabilization of PARAFAC. Chemometrics and Intelligent Laboratory Systems 38 (2), pp. 173–181.
[47] (2013) Learning separable filters. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '13), pp. 2754–2761.
[48] (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (ICLR).
[49] (2010) Subtracting a best rank-1 approximation may increase tensor rank. Linear Algebra and its Applications 433 (7), pp. 1276–1300.
[50] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML.
[51] (2019) Sensitivity in tensor decomposition. IEEE Signal Processing Letters 26 (11), pp. 1653–1657.
[52] (1963) Implications of factor analysis of three-way matrices for measurement of change. Problems in Measuring Change 15, pp. 122–137.
[53] (2003) Multilinear subspace analysis of image ensembles. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), Madison, WI, USA, pp. 93–99.
[54] (2016) Tensorlab 3.0. Available online.
[55] (2019) Lossless compression for 3DCNNs based on tensor train decomposition. CoRR abs/1912.03647.
[56] (2019) Zhores — petaflops supercomputer for data-driven modeling, machine learning and artificial intelligence installed in Skolkovo Institute of Science and Technology. Open Engineering 9 (1).
[57] (2001) Rank-one approximation to high order tensors. SIAM Journal on Matrix Analysis and Applications 23 (2), pp. 534–550.
[58] (2016) Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 1943–1955.
[59] (2018) Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 883–894.