
Stable Low-rank Tensor Decomposition for Compression of Convolutional Neural Network

08/12/2020
by   Anh-Huy Phan, et al.
HUAWEI Technologies Co., Ltd.
Akademie věd ČR
Skoltech

Most state-of-the-art deep neural networks are overparameterized and exhibit a high computational cost. A straightforward approach to this problem is to replace convolutional kernels with their low-rank tensor approximations, where the Canonical Polyadic tensor decomposition is one of the most suitable models. However, fitting the convolutional tensors by numerical optimization algorithms often encounters diverging components, i.e., extremely large rank-one tensors that cancel each other. Such degeneracy often causes non-interpretable results and numerical instability during neural network fine-tuning. This paper presents the first study of degeneracy in the tensor decomposition of convolutional kernels. We present a novel method which stabilizes the low-rank approximation of convolutional kernels and ensures efficient compression while preserving the high-quality performance of the neural networks. We evaluate our approach on popular CNN architectures for image classification and show that our method results in much lower accuracy degradation and provides consistent performance.


1 Introduction

Convolutional neural networks (CNNs) and their recent extensions have significantly improved our ability to solve complex computer vision tasks, such as image classification, object detection, instance segmentation, and image generation. Together with big data and the fast development of the internet of things, CNNs bring new tools for solving computer science problems that are intractable with classical approaches.

Despite the great successes and rapid development of CNNs, most modern neural network architectures contain a huge number of parameters in the convolutional and fully connected layers and therefore demand extremely high computational costs [47], which makes them difficult to deploy on devices with limited computing resources, such as PCs or mobile devices. Common approaches to reducing the redundancy of the neural network parameters are structural pruning [22, 59, 14, 21], sparsification [16, 13, 37], quantization [45, 2], and low-rank approximation [11, 33, 27, 4, 15, 29].

The weights of convolutional and fully connected layers are usually overparameterized and known to lie in a low-rank subspace [10]. Hence, it is possible to represent them in low-rank tensor/tensor-network formats using, e.g., the Canonical Polyadic decomposition (CPD) [11, 33, 1], the Tucker decomposition [27, 15], or the Tensor Train decomposition [38, 55]. The decomposed layers are represented by a sequence of new layers with much smaller kernel sizes, thereby reducing the number of parameters and the computational cost of the original model.

Various low-rank tensor/matrix decompositions can be straightforwardly applied to compress the kernels. This article intends to promote the simplest tensor decomposition model, the Canonical Polyadic decomposition (CPD).

1.1 Why CPD

In neural network models working with images, the convolutional kernels are usually tensors of order 4 with severely unbalanced dimensions, e.g., $D \times D \times S \times T$, where $D \times D$ represents the filter size, and $S$ and $T$ denote the number of input and output channels, respectively. The typical convolutional filters are often of relatively small spatial size, e.g., $3 \times 3$ or $7 \times 7$, compared to the input ($S$) and output ($T$) dimensions, which in total may amount to hundreds of thousands of filters. This leads to excessive redundancy among the kernel filters, which makes them particularly suited for tensor decomposition methods. Among low-rank tensor decompositions and tensor networks, the Canonical Polyadic tensor decomposition [18, 23] is the simplest and most elegant model, which represents a tensor by a sum of rank-1 tensors (a rank-1 tensor of size $n_1 \times n_2 \times \cdots \times n_d$ is an outer product of $d$ vectors of dimensions $n_1, n_2, \ldots, n_d$) or, equivalently, by factor matrices interconnected through a diagonal tensor (Fig. 1(a)). The number of parameters for a CP model of rank $R$ is $R(2D + S + T)$ or $R(D^2 + S + T)$ when we consider kernels as order-4 tensors or their reshaped order-3 versions, respectively. Usually, CPD gains a relatively high compression ratio since the decomposition rank is not very large [33, 15].
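
The parameter counts above translate directly into the layer-wise compression ratio. The following minimal sketch (our own illustration, not code from the paper) computes the number of parameters of a dense $D \times D \times S \times T$ kernel and of its rank-$R$ CP models in the order-4 and order-3 views, for hypothetical layer sizes.

```python
# Sketch: parameter counts for a dense D x D x S x T kernel and its rank-R CP models.
def cp_params_order4(D, S, T, R):
    # one D x R factor per spatial mode, plus S x R and T x R channel factors
    return R * (2 * D + S + T)

def cp_params_order3(D, S, T, R):
    # the two spatial modes are merged into a single D^2 x R factor
    return R * (D * D + S + T)

def compression_ratio(D, S, T, R, order=3):
    dense = D * D * S * T
    cp = cp_params_order3(D, S, T, R) if order == 3 else cp_params_order4(D, S, T, R)
    return dense / cp

# hypothetical layer: 3x3 kernel, 256 input and 256 output channels, CP rank 250
print(cp_params_order3(3, 256, 256, 250))              # 130250 parameters vs 589824 dense
print(round(compression_ratio(3, 256, 256, 250), 2))   # roughly 4.5x fewer parameters
```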

Representation of the high order convolutional kernels in the form of the CP model is equivalent to the use of separable convolutions. In [29], the authors modeled the high order kernels in the generalized multiway convolution by the CP model.

The Tucker tensor decomposition (TKD) [52] is an alternative tensor decomposition method for convolutional kernel compression [27]. The TKD provides more flexible interaction between the factor matrices through a core tensor, which is often dense in practice (Fig. 1(b)). Kim et al. [27] investigated low-rank models at the most suited noise level for different unfoldings of the kernel tensor (the mode-$n$ unfolding of an order-$N$ tensor of size $I_1 \times I_2 \times \cdots \times I_N$ reorders the elements of the tensor into a matrix with $I_n$ rows and $I_1 \cdots I_{n-1} I_{n+1} \cdots I_N$ columns). This heuristic method does not consider a common noise level across modes and is not optimal for attaining the approximation error bound.

Block tensor decomposition (BTD) [7] is an extension of the TKD, which models data as a sum of several Tucker or Kruskal terms, i.e., a TKD with a block-diagonal core tensor. For the same multilinear rank as in TKD, BTD exhibits a smaller number of parameters; however, there are no proper criteria available for block size selection (the rank of the BTD).

In addition, other tensor networks, e.g., the Tensor Train [39] or the Tensor Chain (TC) [12, 26], are not applicable unless the kernel filters are tensorized to higher orders. Besides, the Tensor Chain contains a loop, is not closed, and leads to severe numerical instability when finding the best approximation; see Theorem 14.1.2.2 in [32] and [17].

We later show that CPD can achieve much better performance with an even higher compression ratio by further compressing the Tucker core tensors via a suitably formulated optimization problem.

(a) CPD
(b) TKD
(c) TKD-CPD
Figure 1: Approximation of a third-order tensor using Canonical Polyadic tensor decomposition (CPD), Tucker-2 tensor decomposition (TKD), and their combination (TKD-CPD). CPD and TKD are common methods applied for CNN compression.

1.2 Why not Standard CPD

In one of the first works applying CPD to convolutional kernels, Denton et al. [11] computed the CPD by sequentially extracting the best rank-1 approximation in a greedy way. This type of deflation procedure is not a proper way to compute the CPD unless the tensor is orthogonally decomposable [57] or strong assumptions hold, e.g., at least two factor matrices are linearly independent and the tensor rank does not exceed any dimension of the tensor [43]. The reason is that subtracting the best rank-1 tensor is not guaranteed to decrease the rank of the tensor [49].

In [33], the authors approximated the convolution kernel tensors using the Nonlinear Least Squares (NLS) algorithm [54], one of the best existing algorithms for CPD. However, as mentioned in the Ph.D. thesis [34], it is not trivial to optimize a neural network even when weights from a single layer are factorized, and the authors "failed to find a good SGD learning rate" when fine-tuning a classification model on the ILSVRC-12 dataset.

Diverging Component - Degeneracy. A common phenomenon when numerical optimization algorithms are used to approximate a tensor of relatively high rank by a low-rank model, or a tensor which has a non-unique CPD, is that there exist at least two rank-one tensors whose Frobenius norms, or intensities, are relatively high but which cancel each other [8].

The degeneracy of CPD has been reported in the literature, e.g., in [36, 40, 19, 30, 5, 46]. Imposing additional constraints on the factor matrices, such as column-wise orthogonality [46, 30] or positivity/nonnegativity [35], can improve stability and accelerate convergence. However, such constraints are not always applicable to the data, and they can prevent the estimator from reaching a lower approximation error, yielding a trade-off between estimation stability and approximation error.

(Footnote: as shown in [53], RMS error is not the only minimization criterion for a particular computer vision task.)

We have applied CPD approximations to various CNNs and confirmed that diverging components occur in most cases when either the Alternating Least Squares (ALS) or the NLS [54] algorithm is used. As an example, we approximated one of the last convolutional layers of ResNet-18 with a rank-500 CPD and plotted in Fig. 2 (left) the intensities of the CPD components, i.e., the Frobenius norms of the rank-1 tensors. The ratio between the largest and smallest intensities of the rank-1 tensors was greater than 30. Fig. 2 (right) shows that the sum of squares of intensities of the CPD components grows (exponentially) with the number of components. Another criterion, sensitivity (Definition 1), shows that the standard CPD algorithms are not robust to small perturbations of the factor matrices, and sensitivity increases with higher CP rank.
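
The intensity of a rank-1 term is simply the product of the norms of its factor columns, so the degeneracy diagnostic used above can be computed directly from the factor matrices. A minimal NumPy sketch of this computation (our own illustration, with random placeholder factors):

```python
import numpy as np

# Intensity of the r-th CP component, i.e., the Frobenius norm of a_r o b_r o c_r,
# equals the product of the corresponding column norms of the factor matrices.
def component_intensities(A, B, C):
    return (np.linalg.norm(A, axis=0)
            * np.linalg.norm(B, axis=0)
            * np.linalg.norm(C, axis=0))

# example with random factor matrices of a rank-50 CPD of a 9 x 256 x 256 kernel
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((n, 50)) for n in (9, 256, 256))
lam = component_intensities(A, B, C)
print(lam.max() / lam.min())   # large ratios indicate diverging components
```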

Such degeneracy causes instability when training a CNN with decomposed layers in the CP (or Kruskal) format. More specifically, it makes it difficult for the neural network to perform fine-tuning, to select a good set of parameters, and to maintain stability in the entire network. This problem has not been investigated thoroughly; to the best of our knowledge, there is no existing method for handling it.

Figure 2: (Left) Intensity (Frobenius norm) of rank-1 tensors in CPDs of the kernel of a convolutional layer of ResNet-18. (Right) Sum of squares of the intensities and sensitivity vs. the rank of the CPD. EPC-CPD demonstrates much lower intensity and sensitivity compared to CPD.

1.3 Contributions

In this paper, we address the problem of the stability of CNNs compressed by CPD. The key advantages and major contributions of our paper are the following:

  • We propose a new stable and efficient method to perform neural network compression based on low-rank tensor decompositions.

  • We demonstrate how to deal with degeneracy, the most severe problem when approximating convolutional kernels with CPD. Our approach finds a reliable CPD representation with minimal sensitivity and intensity.

  • We show that the combination of Tucker-2 (TKD) and the proposed stable CPD (Fig. 1(c)) outperforms CPD in terms of the accuracy/compression trade-off.

We provide results of extensive experiments to confirm the efficiency of the proposed algorithms. In particular, we empirically show that a neural network with weights in the factorized CP format obtained using our algorithms is more stable during fine-tuning and recovers its initial accuracy faster.

2 Stable Tensor Decomposition Method

2.1 CP Decomposition of Convolutional Kernel

In CNNs, the convolutional layer performs mapping of an input (source) tensor $\mathcal{X}$ of size $H \times W \times S$ into an output (target) tensor $\mathcal{Y}$ of size $H' \times W' \times T$ following the relation

$$\mathcal{Y}_{h',w',t} = \sum_{i=1}^{D} \sum_{j=1}^{D} \sum_{s=1}^{S} \mathcal{K}_{i,j,s,t}\, \mathcal{X}_{h_i, w_j, s}, \qquad (1)$$

where $h_i = (h'-1)\delta + i - p$ and $w_j = (w'-1)\delta + j - p$, $\mathcal{K}$ is an order-4 kernel tensor of size $D \times D \times S \times T$, $\delta$ is the stride, and $p$ is the zero-padding size.

Our aim is to decompose the kernel tensor $\mathcal{K}$ by the CPD or the TKD. As it was mentioned earlier, we treat the kernel as an order-3 tensor of size $D^2 \times S \times T$, and represent the kernel by a sum of $R$ rank-1 tensors

$$\mathcal{K} \simeq \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r, \qquad (2)$$

where $\mathbf{a}_r$, $\mathbf{b}_r$ and $\mathbf{c}_r$ are the $r$-th columns of the factor matrices $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$ of size $D^2 \times R$, $S \times R$ and $T \times R$, respectively. See an illustration of the model in Fig. 1(a). The tensor in the Kruskal format uses $R(D^2 + S + T)$ parameters.
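
For illustration, the sketch below reshapes a (hypothetical, random) kernel in the PyTorch layout $(T, S, D, D)$ into the order-3 tensor of size $D^2 \times S \times T$ and fits the model (2) with a plain ALS procedure in NumPy. This is a minimal reference implementation, not the NLS solver used later in the paper.

```python
import numpy as np

def khatri_rao(B, C):
    # column-wise Kronecker product; rows indexed by (j, k) with k varying fastest
    J, R = B.shape
    K, _ = C.shape
    return np.einsum('jr,kr->jkr', B, C).reshape(J * K, R)

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def cp_als(K3, rank, n_iter=200, seed=0):
    """Plain ALS for an order-3 tensor (e.g., a reshaped D^2 x S x T conv kernel)."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((K3.shape[i], rank)) for i in range(3))
    for _ in range(n_iter):
        A = unfold(K3, 0) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = unfold(K3, 1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = unfold(K3, 2) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# hypothetical kernel with T=64 output channels, S=32 input channels, D=3
W = np.random.default_rng(1).standard_normal((64, 32, 3, 3))   # PyTorch layout (T, S, D, D)
K3 = W.transpose(2, 3, 1, 0).reshape(9, 32, 64)                # order-3 kernel, D^2 x S x T
A, B, C = cp_als(K3, rank=16)
approx = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(K3 - approx) / np.linalg.norm(K3))        # relative approximation error
```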

2.2 Degeneracy and its effect on CNN stability

Degeneracy occurs in most CPDs of convolutional kernels. The Error Preserving Correction (EPC) method [44] suggests a correction to the decomposition results in order to get a more stable decomposition with lower sensitivity. There are two possible measures for assessment of the degeneracy degree of the CPD: the sum of Frobenius norms of the rank-1 tensors [44],

$$\mathrm{sn}([\![\mathbf{A},\mathbf{B},\mathbf{C}]\!]) = \sum_{r=1}^{R} \|\mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r\|_F = \sum_{r=1}^{R} \|\mathbf{a}_r\|_2\, \|\mathbf{b}_r\|_2\, \|\mathbf{c}_r\|_2, \qquad (3)$$

and sensitivity, defined as follows.

Definition 1 (Sensitivity [51])

Given a tensor $\mathcal{K} = [\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$, define the sensitivity as

$$\mathrm{ss}([\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]) = \lim_{\sigma^2 \to 0} \frac{1}{\sigma^2}\, \mathrm{E}\left\{ \|\mathcal{K} - [\![\mathbf{A}+\delta\mathbf{A}, \mathbf{B}+\delta\mathbf{B}, \mathbf{C}+\delta\mathbf{C}]\!]\|_F^2 \right\}, \qquad (4)$$

where $\delta\mathbf{A}$, $\delta\mathbf{B}$, $\delta\mathbf{C}$ have random i.i.d. elements from $\mathcal{N}(0, \sigma^2)$.

The sensitivity of the decomposition can be measured by the expectation ($\mathrm{E}\{\cdot\}$) of the normalized squared Frobenius norm of the difference. In other words, the sensitivity of the tensor is a measure with respect to perturbations in the individual factor matrices. CPDs with high sensitivity are usually useless.

Lemma 1
$$\mathrm{ss}([\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]) = D^2\, \mathrm{tr}\{(\mathbf{B}^T\mathbf{B}) \circledast (\mathbf{C}^T\mathbf{C})\} + S\, \mathrm{tr}\{(\mathbf{A}^T\mathbf{A}) \circledast (\mathbf{C}^T\mathbf{C})\} + T\, \mathrm{tr}\{(\mathbf{A}^T\mathbf{A}) \circledast (\mathbf{B}^T\mathbf{B})\}, \qquad (5)$$

where $\circledast$ denotes the Hadamard element-wise product.

Proof

First, the perturbed tensor in (4) can be expressed as a sum of 8 Kruskal terms,

$$[\![\mathbf{A}+\delta\mathbf{A}, \mathbf{B}+\delta\mathbf{B}, \mathbf{C}+\delta\mathbf{C}]\!] = [\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!] + [\![\delta\mathbf{A}, \mathbf{B}, \mathbf{C}]\!] + [\![\mathbf{A}, \delta\mathbf{B}, \mathbf{C}]\!] + [\![\mathbf{A}, \mathbf{B}, \delta\mathbf{C}]\!] + \cdots .$$

Since these Kruskal terms are uncorrelated and the expectation of the terms composed of two or three perturbation matrices $\delta\mathbf{A}$, $\delta\mathbf{B}$, $\delta\mathbf{C}$ is negligible (they are of order $\sigma^4$ or higher), the expectation in (4) can be expressed in the form

$$\mathrm{E}\left\{ \|\mathcal{K} - [\![\mathbf{A}+\delta\mathbf{A}, \mathbf{B}+\delta\mathbf{B}, \mathbf{C}+\delta\mathbf{C}]\!]\|_F^2 \right\} \simeq \mathrm{E}\|[\![\delta\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]\|_F^2 + \mathrm{E}\|[\![\mathbf{A}, \delta\mathbf{B}, \mathbf{C}]\!]\|_F^2 + \mathrm{E}\|[\![\mathbf{A}, \mathbf{B}, \delta\mathbf{C}]\!]\|_F^2. \qquad (6)$$

Next we expand the Frobenius norms of the three Kruskal tensors,

$$\mathrm{E}\|[\![\delta\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]\|_F^2 = \mathrm{E}\|\delta\mathbf{A}\,(\mathbf{C} \odot \mathbf{B})^T\|_F^2 = \sigma^2 D^2\, \mathrm{tr}\{(\mathbf{B}^T\mathbf{B}) \circledast (\mathbf{C}^T\mathbf{C})\}, \qquad (7)$$
$$\mathrm{E}\|[\![\mathbf{A}, \delta\mathbf{B}, \mathbf{C}]\!]\|_F^2 = \mathrm{E}\|\delta\mathbf{B}\,(\mathbf{C} \odot \mathbf{A})^T\|_F^2 = \sigma^2 S\, \mathrm{tr}\{(\mathbf{A}^T\mathbf{A}) \circledast (\mathbf{C}^T\mathbf{C})\}, \qquad (8)$$
$$\mathrm{E}\|[\![\mathbf{A}, \mathbf{B}, \delta\mathbf{C}]\!]\|_F^2 = \mathrm{E}\|\delta\mathbf{C}\,(\mathbf{B} \odot \mathbf{A})^T\|_F^2 = \sigma^2 T\, \mathrm{tr}\{(\mathbf{A}^T\mathbf{A}) \circledast (\mathbf{B}^T\mathbf{B})\}, \qquad (9)$$

where $\odot$ and $\circledast$ denote the Khatri-Rao and Hadamard products, respectively.

Finally, we substitute the above expressions into (6) and use the normalization by $\sigma^2$ in (4) to obtain the compact expression of the sensitivity.
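
The closed-form expression in Lemma 1 can be checked numerically against Definition 1. The sketch below (our own illustration with random placeholder factors) evaluates both the closed form and a Monte Carlo estimate of the perturbation expectation; the two values should agree for small $\sigma$.

```python
import numpy as np

def sensitivity_closed_form(A, B, C):
    # ss = I1*tr((B'B) * (C'C)) + I2*tr((A'A) * (C'C)) + I3*tr((A'A) * (B'B)),
    # where * is the Hadamard product and I1, I2, I3 are the row counts of A, B, C
    I1, I2, I3 = A.shape[0], B.shape[0], C.shape[0]
    return (I1 * np.trace((B.T @ B) * (C.T @ C))
            + I2 * np.trace((A.T @ A) * (C.T @ C))
            + I3 * np.trace((A.T @ A) * (B.T @ B)))

def sensitivity_monte_carlo(A, B, C, sigma=1e-4, n_trials=200, seed=0):
    # direct estimate of E||K - [[A+dA, B+dB, C+dC]]||_F^2 / sigma^2 from Definition 1
    rng = np.random.default_rng(seed)
    K = np.einsum('ir,jr,kr->ijk', A, B, C)
    acc = 0.0
    for _ in range(n_trials):
        dA, dB, dC = (sigma * rng.standard_normal(M.shape) for M in (A, B, C))
        Kp = np.einsum('ir,jr,kr->ijk', A + dA, B + dB, C + dC)
        acc += np.sum((K - Kp) ** 2)
    return acc / (n_trials * sigma ** 2)

rng = np.random.default_rng(3)
A, B, C = (rng.standard_normal((n, 8)) for n in (9, 32, 64))
print(sensitivity_closed_form(A, B, C))
print(sensitivity_monte_carlo(A, B, C))   # should be close to the closed form
```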

2.3 Stabilization Method

2.3.1 Sensitivity minimization

The first method to correct CPD with diverging components, proposed in [44], minimizes the sum of Frobenius norms of the rank-1 tensors while the approximation error is bounded. In [51], the Krylov Levenberg-Marquardt algorithm was proposed for CPD with a bounded sensitivity constraint.

In this paper, we propose a variant of the EPC method which minimizes the sensitivity of the decomposition while preserving the approximation error, i.e.,

$$\min_{\mathbf{A}, \mathbf{B}, \mathbf{C}} \; \mathrm{ss}([\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]) \quad \text{s.t.} \quad \|\mathcal{K} - [\![\mathbf{A}, \mathbf{B}, \mathbf{C}]\!]\|_F \le \delta. \qquad (10)$$

The bound $\delta$ can represent the approximation error of the decomposition with diverging components. Continuing the CPD using a new tensor with lower sensitivity can improve its convergence.

2.3.2 Update rules

We derive alternating update formulas for the above optimization problem. While $\mathbf{B}$ and $\mathbf{C}$ are kept fixed, the objective function is rewritten to update $\mathbf{A}$ as

$$\min_{\mathbf{A}} \; \sum_{r=1}^{R} w_r \|\mathbf{a}_r\|_2^2 \quad \text{s.t.} \quad \|\mathbf{K}_{(1)} - \mathbf{A}\,(\mathbf{C} \odot \mathbf{B})^T\|_F \le \delta, \qquad (11)$$

where $\mathbf{K}_{(1)}$ is the mode-1 unfolding of the kernel tensor $\mathcal{K}$, $\mathbf{W} = S\,(\mathbf{C}^T\mathbf{C}) + T\,(\mathbf{B}^T\mathbf{B})$ is a symmetric matrix of size $R \times R$, and $\mathbf{w} = [w_1, \ldots, w_R]$ is a vector of length $R$ taken from the diagonal of $\mathbf{W}$.

Remark 1

The problem (11) can be reformulated as a regression problem with a bound constraint,

$$\min_{\tilde{\mathbf{A}}} \; \|\tilde{\mathbf{A}}\|_F^2 \quad \text{s.t.} \quad \|\mathbf{K}_{(1)} - \tilde{\mathbf{A}}\,\tilde{\mathbf{Z}}^T\|_F \le \delta, \qquad (12)$$

where $\tilde{\mathbf{A}} = \mathbf{A}\,\mathrm{diag}(\mathbf{w})^{1/2}$ and $\tilde{\mathbf{Z}} = (\mathbf{C} \odot \mathbf{B})\,\mathrm{diag}(\mathbf{w})^{-1/2}$. This problem can be solved in closed form through quadratic programming over a sphere [41]. We skip the algorithm details and refer to the solver in [41].
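
For intuition, the generic bound-constrained minimum-norm regression $\min_{\mathbf{x}} \|\mathbf{x}\|^2$ s.t. $\|\mathbf{y} - \mathbf{Z}\mathbf{x}\| \le \delta$, to which (12) reduces after vectorization, can also be solved by a simple bisection on the multiplier of the error constraint, since the residual of the corresponding ridge solution grows monotonically with the multiplier. The sketch below illustrates this simpler route; it is not the closed-form quadratic-programming-over-a-sphere solver of [41].

```python
import numpy as np

def min_norm_bounded_error(Z, y, delta, tol=1e-9, max_iter=200):
    """Sketch: solve  min ||x||^2  s.t.  ||y - Z x|| <= delta  by bisection on mu."""
    if np.linalg.norm(y) <= delta:          # x = 0 already satisfies the bound
        return np.zeros(Z.shape[1])
    G, b = Z.T @ Z, Z.T @ y

    def ridge(mu):                          # stationary point of the Lagrangian for given mu
        x = np.linalg.solve(G + mu * np.eye(G.shape[0]), b)
        return np.linalg.norm(y - Z @ x), x

    lo, hi = 0.0, 1.0
    while ridge(hi)[0] < delta:             # the residual grows monotonically with mu
        hi *= 10.0
    for _ in range(max_iter):               # bisection until the residual hits delta
        mid = 0.5 * (lo + hi)
        r, x = ridge(mid)
        if abs(r - delta) < tol:
            break
        lo, hi = (mid, hi) if r < delta else (lo, mid)
    return x

# example: a feasible random problem where the bound is 5% of the target norm
rng = np.random.default_rng(0)
Z = rng.standard_normal((100, 20))
y = Z @ rng.standard_normal(20) + 0.01 * rng.standard_normal(100)
x = min_norm_bounded_error(Z, y, delta=0.05 * np.linalg.norm(y))
print(np.linalg.norm(x), np.linalg.norm(y - Z @ x) / np.linalg.norm(y))   # residual ~ 0.05
```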

Remark 2

If the factor matrices $\mathbf{B}$ and $\mathbf{C}$ are normalized to unit-length columns, i.e., $\|\mathbf{b}_r\|_2 = 1$, $\|\mathbf{c}_r\|_2 = 1$, then all entries of the diagonal of $\mathbf{W}$ are identical. The optimization problem in (11) then becomes that of seeking a weight matrix $\mathbf{A}$ with minimal norm,

$$\min_{\mathbf{A}} \; \|\mathbf{A}\|_F^2 \quad \text{s.t.} \quad \|\mathbf{K}_{(1)} - \mathbf{A}\,(\mathbf{C} \odot \mathbf{B})^T\|_F \le \delta. \qquad (13)$$

This sub-optimization problem is similar to that in the EPC approach [44].

2.4 Tucker Decomposition with Bound Constraint

Another well-known representation of multi-way data is the Tucker decomposition [52], which decomposes a given tensor into a core tensor and a set of factor matrices (see Fig. 1(b) for an illustration). The Tucker decomposition is particularly suited as a prior compression step for CPD: in this case, we compute the CPD of the core tensor in the TKD, which has smaller dimensions than the original kernel.

For our problem, we are interested in the Tucker-2 model (see Fig. 1(b)),

$$\mathcal{K} \simeq \mathcal{G} \times_2 \mathbf{U} \times_3 \mathbf{V}, \qquad (14)$$

where $\mathcal{G}$ is the core tensor of size $D^2 \times R_2 \times R_3$, and $\mathbf{U}$ and $\mathbf{V}$ are matrices of size $S \times R_2$ and $T \times R_3$, respectively. Because of rotational ambiguity, without loss of generality, the matrices $\mathbf{U}$ and $\mathbf{V}$ can be assumed to have orthonormal columns.

Different from the ordinary TK-2, we seek the smallest TK-2 model, i.e., the one with minimal multilinear rank $(R_2, R_3)$, which satisfies the approximation error bound [42]:

$$\min_{\mathbf{U}, \mathbf{V}, \mathcal{G}} \; (R_2, R_3) \quad \text{s.t.} \quad \|\mathcal{K} - \mathcal{G} \times_2 \mathbf{U} \times_3 \mathbf{V}\|_F \le \delta. \qquad (15)$$

We will show that the core tensor has a closed-form expression as in the HOOI algorithm for the orthogonal Tucker decomposition [6], and that the two factor matrices, $\mathbf{U}$ and $\mathbf{V}$, can be sequentially estimated through eigenvalue decomposition (EVD).

Lemma 2

The core tensor has the closed-form expression $\mathcal{G} = \mathcal{K} \times_2 \mathbf{U}^T \times_3 \mathbf{V}^T$.

Proof

From the error bound condition and the orthonormality of $\mathbf{U}$ and $\mathbf{V}$, we can derive

$$\|\mathcal{G} - \mathcal{K} \times_2 \mathbf{U}^T \times_3 \mathbf{V}^T\|_F^2 \le \delta^2 - \|\mathcal{K} - \mathcal{K} \times_2 \mathbf{U}\mathbf{U}^T \times_3 \mathbf{V}\mathbf{V}^T\|_F^2,$$

which indicates that the core tensor can be expressed as $\mathcal{G} = \mathcal{K} \times_2 \mathbf{U}^T \times_3 \mathbf{V}^T + \mathcal{E}$, where $\mathcal{E}$ is an error tensor such that its norm obeys the above bound.

Next define a matrix of size $S \times S$,

$$\mathbf{Q} = \mathbf{K}_{(2)}\,(\mathbf{V}\mathbf{V}^T \otimes \mathbf{I}_{D^2})\,\mathbf{K}_{(2)}^T, \qquad (16)$$

where $\mathbf{K}_{(2)}$ is the mode-2 unfolding of $\mathcal{K}$. Assume that $\mathbf{V}$ is the optimal factor matrix with the minimal rank $R_3$. The optimization in (15) becomes the rank minimization problem for $\mathbf{U}$,

$$\min_{\mathbf{U}^T\mathbf{U} = \mathbf{I}_{R_2}} \; R_2 \quad \text{s.t.} \quad \mathrm{tr}(\mathbf{U}^T \mathbf{Q}\,\mathbf{U}) \ge \|\mathcal{K}\|_F^2 - \delta^2. \qquad (17)$$

The optimal factor matrix $\mathbf{U}$ comprises the $R_2$ principal eigenvectors of $\mathbf{Q}$, where $R_2$ is the smallest number of eigenvalues such that their sum exceeds the bound $\|\mathcal{K}\|_F^2 - \delta^2$, that is, $\lambda_1 + \cdots + \lambda_{R_2} \ge \|\mathcal{K}\|_F^2 - \delta^2$. It is obvious that the minimal number of columns is achieved when the bound is smallest, i.e., when $\mathcal{E} = 0$, implying that the optimal core tensor is $\mathcal{G} = \mathcal{K} \times_2 \mathbf{U}^T \times_3 \mathbf{V}^T$. This completes the proof.

Similar to the update of $\mathbf{U}$, the matrix $\mathbf{V}$ comprises the $R_3$ principal eigenvectors of the matrix of size $T \times T$,

$$\mathbf{Q}_3 = \mathbf{K}_{(3)}\,(\mathbf{U}\mathbf{U}^T \otimes \mathbf{I}_{D^2})\,\mathbf{K}_{(3)}^T, \qquad (18)$$

where $R_3$ is either given or determined based on the bound $\delta$. The algorithm for TKD sequentially updates $\mathbf{U}$ and $\mathbf{V}$.
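
As a complement to the sequential EVD updates above, the sketch below computes a Tucker-2 model of the order-3 kernel in a truncated-HOSVD style: for each channel mode it keeps the leading eigenvectors of the corresponding unfolding and drops as many trailing eigenvalues as fit into half of the squared error budget. This equal-split heuristic is our own simplification for illustration, not the paper's algorithm, but it guarantees that the final error stays within the bound $\delta$.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker2_with_bound(K3, delta):
    """Truncated-HOSVD-style Tucker-2 of an order-3 kernel K3 (D^2 x S x T): factor the
    two channel modes, choosing each rank so that the discarded eigenvalue mass stays
    within half of the squared error budget delta**2."""
    factors = []
    for mode in (1, 2):                                    # input- and output-channel modes
        Km = unfold(K3, mode)
        eigval, eigvec = np.linalg.eigh(Km @ Km.T)         # eigenvalues in ascending order
        eigval = np.clip(eigval, 0.0, None)
        drop = np.searchsorted(np.cumsum(eigval), 0.5 * delta ** 2, side='right')
        factors.append(eigvec[:, drop:][:, ::-1])          # keep the leading eigenvectors
    U, V = factors
    core = np.einsum('ijk,jp,kq->ipq', K3, U, V)           # G = K x_2 U^T x_3 V^T
    return core, U, V

# hypothetical 3x3 kernel with 32 input and 64 output channels
rng = np.random.default_rng(0)
K3 = rng.standard_normal((9, 32, 64))
G, U, V = tucker2_with_bound(K3, delta=0.3 * np.linalg.norm(K3))
K_hat = np.einsum('ipq,jp,kq->ijk', G, U, V)               # reconstruction G x_2 U x_3 V
print(G.shape, np.linalg.norm(K3 - K_hat) / np.linalg.norm(K3))   # relative error <= 0.3
```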

3 Implementation

Our method for neural network compression includes the following main steps (see Fig. 3):

  1. Each convolutional kernel is approximated by a tensor decomposition (CPD/TKD-CPD in the case of ordinary $D \times D$ convolutions and SVD in the case of $1 \times 1$ convolutions) with a given rank R.

  2. The CP decomposition with diverging components is corrected using the error preserving method. The result is a new CP model with minimal sensitivity.

  3. An initial convolutional kernel is replaced with a tensor in CPD/TKD-CPD or SVD format, which is equivalent to replacing one convolutional layer with a sequence of convolutional layers with a smaller total number of parameters.

  4. The entire network is then fine-tuned using backpropagation.

CPD Block results in three convolutional layers with shapes $1 \times 1 \times S \times R$, depthwise $D \times D \times R$ (a group convolution with $R$ groups), and $1 \times 1 \times R \times T$, respectively (see Fig. 3(a)). In the obtained structure, all spatial convolutions are performed by the central group convolution with $R$ channels, while the $1 \times 1$ convolutions transfer the input data to a more compact channel space (with $R$ channels) and then return the data to the initial channel space.
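
A minimal PyTorch sketch of this replacement is shown below. It assumes CP factors A (of size $D^2 \times R$, spatial), B ($S \times R$, input channels) and C ($T \times R$, output channels) stored as torch tensors, and a standard Conv2d with weight layout $(T, S, D, D)$; dilation and grouped original convolutions are not handled. This is an illustration of the block structure, not the authors' implementation.

```python
import torch
import torch.nn as nn

def cpd_block_from_factors(conv, A, B, C):
    """Replace a Conv2d `conv` by the CPD block 1x1 -> depthwise DxD -> 1x1,
    given CP factors A (D*D x R), B (S x R), C (T x R) of the reshaped kernel."""
    T, S, D, _ = conv.weight.shape
    R = A.shape[1]
    first = nn.Conv2d(S, R, kernel_size=1, bias=False)
    depth = nn.Conv2d(R, R, kernel_size=D, stride=conv.stride,
                      padding=conv.padding, groups=R, bias=False)
    last = nn.Conv2d(R, T, kernel_size=1, bias=conv.bias is not None)
    with torch.no_grad():
        first.weight.copy_(B.t().reshape(R, S, 1, 1))    # 1x1: S -> R channels
        depth.weight.copy_(A.t().reshape(R, 1, D, D))    # depthwise spatial convolution
        last.weight.copy_(C.reshape(T, R, 1, 1))         # 1x1: R -> T channels
        if conv.bias is not None:
            last.bias.copy_(conv.bias)
    return nn.Sequential(first, depth, last)
```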

TKD-CPD Block is similar to the CPD block, but has 4 convolutional layers, with the condition that the CP rank $R$ must exceed the multilinear ranks $R_2$ and $R_3$ (see Fig. 3(c)). This structure allows an additional reduction of the number of parameters and floating-point operations in a factorized layer. Otherwise, when $R \le R_2$ and $R \le R_3$, sequential $1 \times 1$ convolutions can be merged into one convolution, converting the TKD-CPD layer format into the CPD block.

SVD Block is a variant of the CPD block but comprises only two factor layers, computed using SVD. Degeneracy is not considered in this block, and no correction is applied (see Fig. 3(b)).

(a)
(b)
(c)
Figure 3: Graphical illustration of the proposed layer formats, showing how decomposed factors are used as the new weights of the compressed layer. $S$ and $T$ are the numbers of input and output channels, and $D$ is the kernel size. (a) CPD layer format, $R$ is the CPD rank. (b) SVD layer format, $R$ is the SVD rank. (c) TKD-CPD layer format, $R$ is the CPD rank, and $R_2$ and $R_3$ are the TKD ranks.

Rank Search Procedure. Determining the CP rank is an NP-hard problem [23]. We observe that the accuracy drop caused by a factorized layer affects the accuracy of the whole network after fine-tuning. In our experiments, we apply a heuristic binary search to find the smallest rank such that the drop after single-layer fine-tuning does not exceed a predefined accuracy drop threshold (EPS); a sketch is given below.
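
The rank search can be written as follows; `compress_and_finetune_layer` and `evaluate` are hypothetical helpers standing in for the single-layer compression/fine-tuning step and the validation-accuracy measurement.

```python
def find_smallest_rank(layer_name, baseline_acc, eps, r_min=8, r_max=512):
    """Binary search for the smallest CP rank whose accuracy drop after single-layer
    fine-tuning does not exceed eps (assumes the drop decreases with the rank)."""
    best = None
    while r_min <= r_max:
        rank = (r_min + r_max) // 2
        model = compress_and_finetune_layer(layer_name, rank)   # hypothetical helper
        drop = baseline_acc - evaluate(model)                   # hypothetical helper
        if drop <= eps:
            best, r_max = rank, rank - 1    # feasible: try a smaller rank
        else:
            r_min = rank + 1                # too lossy: increase the rank
    return best
```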

4 Experiments

We test our algorithms on three representative convolutional neural network architectures for image classification: VGG-16 [48], ResNet-18, and ResNet-50 [20]. We compressed the convolutional kernels with CPD, CPD with sensitivity correction (CPD-EPC), and Tucker-CPD with the correction (TKD-CPD-EPC). The networks after fine-tuning are evaluated through top-1 and top-5 accuracy on ILSVRC-12 [9] and CIFAR-100 [31].

We conducted a series of layer-wise compression experiments and measured accuracy recovery and whole model compression of the decomposed architectures. Most of our experiments were devoted to the approximation of single layers when other layers remained intact. In addition, we performed compression of entire networks.

The experiments were conducted with the popular neural network framework PyTorch on a GPU server with NVIDIA V100 GPUs. As a baseline for ILSVRC-12, we used a pre-trained model shipped with Torchvision. The baseline CIFAR-100 model was trained using the Cutout method. The fine-tuning process consists of two parts: local (single-layer) fine-tuning and entire-network fine-tuning. The model was trained with an SGD optimizer, with the learning rate decayed at each loss-saturation stage and with weight decay regularization.

4.1 Layer-Wise Study

Figure 4: (Left) Performance evaluation of ResNet-18 on ILSVRC-12 dataset after replacing layer4.1.conv1 by its approximation using CPD and CPD-EPC with various ranks. The networks are fine-tuned after compression. (Right) Top-1 accuracy and sensitivity of the models estimated using CPD (blue) and CPD-EPC (red). Each model has a single decomposed layer with the best CP rank and was fine-tuned after compression. CPD-EPC outperforms CPD in terms of accuracy/sensitivity trade-off.
(Footnote: layer4.1.conv1 – layer 4, residual block 2 (indexing starts with 0), convolutional layer 1.)

4.1.1 CPD-EPC vs CPD

For this study, we decomposed the kernel filters in 17 convolutional layers of ResNet-18 with different CP ranks $R$, ranging from small (10) to relatively high (500). The CPDs were run with a sufficiently large number of iterations so that all models converged or there was no significant improvement in the approximation error.

The experimental results show that for all decomposition ranks, CPD-EPC regularly results in considerably higher top-1 and top-5 model accuracy than the standard CPD algorithm. Fig. 4 (left) demonstrates an illustrative example for layer4.1.conv1. An important observation is that the network compressed using standard CPD, even with rank 500 (and fine-tuning), does not reach the original network's accuracy. However, with EPC, the performance is much better and attains the original accuracy at rank 450. Even a much smaller model with rank 250 yields a relatively good result, with less than 1% loss of accuracy.

Next, each convolutional layer in ResNet-18 was approximated with different CP ranks and fine-tuned. The best model in terms of top-1 accuracy was then selected. Fig. 4 (right) shows the relation between the sensitivity and accuracy of the best models. It is straightforward to see that the models estimated using CPD exhibit high sensitivity and are hard to train. CPD-EPC suppressed the sensitivity of the estimated models and improved the performance of the compressed networks. CPD-EPC gained the most remarkable accuracy recovery on the deeper layers of the CNN.

The effect is particularly significant for some deep convolutional layers of the network, which show a notable top-1 accuracy difference between the two methods.

4.1.2 CPD-EPC vs TKD-EPC

Next, we investigated the proposed compression approach based on the hybrid TKD-CPD model with sensitivity control. Similar experiments were conducted for the CIFAR-100 dataset. The TK multi-linear ranks were kept fixed, while the CP rank varied in a wide range.

In Fig. 5, we compare the accuracy of the two considered compression approaches applied to the layer 4.0.conv1 in ResNet-18. For this case, CPD-EPC still demonstrated good performance. The obtained accuracy is very consistent, implying that the layer exhibits a low-rank structure. The hybrid TKD-CPD yielded rather low accuracy for small models, i.e., with small ranks, which is much worse than the CPD-based model with less or approximately the same number of parameters. However, the method quickly attained the original top-1 accuracy and even exceeded the top-5 accuracy as the CP rank increased.

A comparison of accuracy vs. the number of FLOPs and parameters for the other layers is provided in Fig. 6. Each dot in the figure represents (accuracy, number of FLOPs) for one model. Dots corresponding to the same layer are connected by dashed lines. Once again, TKD-EPC achieved higher top-1 and top-5 accuracy with a smaller number of parameters and FLOPs, compared to CPD-EPC.

Figure 5: Performance comparison (top-1 accuracy, left; top-5 accuracy, right) of CPD-EPC and TKD-CPD-EPC in compression of the layer 4.0.conv1 in the ResNet-18 pre-trained on the ILSVRC-12 dataset. TKD-CPD-EPC shows better accuracy recovery with a relatively low number of FLOPs.
Figure 6: Accuracy vs FLOPs for models obtained from ResNet-18 (CIFAR-100) via compression of one layer using standard CPD (cross), CPD-EPC (square), or TKD-CPD-EPC (circle) decomposition. Each color corresponds to one layer, which has been compressed using three different methods. For each layer, TKD-CPD-EPC outperforms other decompositions in terms of FLOPs, or accuracy, or both.

4.2 Full Model Compression

In this section, we demonstrate the efficiency of our proposed method for full-model compression of three well-known CNNs, VGG-16 [48], ResNet-18, and ResNet-50 [20], on ILSVRC-12. We compressed all convolutional layers, leaving the fully-connected layers intact. The proposed scheme reduces the number of weights and FLOPs of VGG-16, ResNet-18, and ResNet-50, and Table 1 shows that our approach yields a high compression ratio while having a moderate accuracy drop.

VGG [48]. We compared our method with other low-rank compression approaches on VGG-16. The Asym method [58] is one of the first successful methods for whole-network VGG-16 compression. This method exploits a matrix decomposition based on SVD and is able to reduce the number of FLOPs by a factor of 5. Kim et al. [27] applied TKD with ranks selected by VBMF and achieved a comparable compression ratio but with a smaller accuracy drop. As can be seen from Table 1, our approach outperformed both Asym and TKD in terms of compression ratio and accuracy drop.

ResNet-18 [20]. This architecture is one of the lightest in the ResNet family that still gives relatively high accuracy. Most convolutional layers in ResNet-18 have kernel size $3 \times 3$, making it a perfect candidate for low-rank-based compression methods. We have compared our results with channel pruning methods [25, 59, 14] and the iterative low-rank approximation method [15]. Among all the considered results, our approach shows the best performance in terms of the compression/accuracy-drop trade-off.

ResNet-50 [20]. Compared to ResNet-18, ResNet-50 is a deeper and heavier neural network, which is used as a backbone in various modern applications, such as object detection and segmentation. A large number of convolutions deteriorates the performance of low-rank decomposition-based methods. There is not much relevant literature available on compression of this type of ResNet. To the best of our knowledge, the results we obtained can be considered the first attempt to compress the entire ResNet-50.

Inference time for ResNet-50. We briefly compare the inference time of ResNet-50 for the image classification task in Table 2. The measurements were taken on 3 platforms: a CPU server with an Intel® Xeon® Silver 4114 CPU @ 2.20 GHz, an NVIDIA GPU server with a Tesla® V100, and a Qualcomm® Snapdragon™ 845 mobile CPU. The batch size was chosen to yield small variance in the inference measurements, e.g., 16 for the measurements on the CPU server, 128 for the GPU server, and 1 for the mobile CPU.

Model       Method                                      FLOPs reduction   top-1    top-5
VGG-16      Asym. [58]                                  -                 -        -1.00
            TKD+VBMF [27]                               4.93              -        -0.50
            Our (EPS* = 0.005)                          5.26              -0.92    -0.34
ResNet-18   Channel Gating NN [25]                      1.61              -1.62    -1.03
            Discrimination-aware Channel Pruning [59]   1.89              -2.29    -1.38
            FBS [14]                                    1.98              -2.54    -1.46
            MUSCO [15]                                  2.42              -0.47    -0.30
            Our (EPS* = 0.00325)                        3.09              -0.69    -0.15
ResNet-50   Our (EPS* = 0.0028)                         2.64              -1.47    -0.71

* EPS: accuracy drop threshold. The rank of the decomposition is chosen to keep the drop in accuracy below EPS.
Table 1: Comparison of different model compression methods on the ILSVRC-12 validation dataset. The baseline models are taken from Torchvision.
Platform                                     Inference time (original)   Inference time (compressed)
Intel® Xeon® Silver 4114 CPU @ 2.20 GHz      3.92 ± 0.02 s                2.84 ± 0.02 s
NVIDIA® Tesla® V100                          102.3 ± 0.5 ms               89.5 ± 0.2 ms
Qualcomm® Snapdragon™ 845                    221 ± 4 ms                   171 ± 4 ms
Table 2: Inference time and acceleration for ResNet-50 on different platforms.

5 Discussion and Conclusions

Replacing a large dense kernel in a convolutional or fully-connected layer by its low-rank approximation is equivalent to substituting the initial layer with multiple layers which in total have fewer parameters. However, to the best of our knowledge, the sensitivity of such tensor-based models has never been considered before. The closest related method proposes to add a regularizer on the Frobenius norm of each weight to prevent over-fitting.

In this paper, we have shown a more direct way to control the sensitivity of a tensor-based network. Through all the experiments on both the ILSVRC-12 and CIFAR-100 datasets, we have demonstrated the validity and reliability of our proposed method for the compression of CNNs, which includes a stable decomposition method with minimal sensitivity for both CPD and the hybrid TKD-CPD.

As we can see from the recent deep learning literature [24, 50, 29], modern state-of-the-art architectures exploit the CP format when constructing blocks of consecutive layers, which consist of a $1 \times 1$ convolution followed by a depth-wise separable convolution. The intuition behind the effectiveness of such a representation is that the first convolution maps the data to a higher-dimensional subspace, where the features are more separable, so separate convolutional kernels can be applied to process them. Thus, representing weights in the CP format using stable and efficient algorithms is a simple and efficient way of constructing reduced convolutional kernels.

To the best of our knowledge, our paper is the first work to solve the problem of building weights in the CP format that are stable and consistent with the fine-tuning procedure.

The ability to control the sensitivity and stability of factorized weights might be crucial when approaching incremental learning tasks [3] or multi-modal tasks, where information fusion across different modalities is performed through shared weight factors.

Our proposed CPD-EPC method can allow more stable fine-tuning of architectures containing higher-order CP convolutional layers [29, 28] that are potentially very promising due to the ability to propagate the input structure through the whole network. We leave the mentioned directions for further research.

Acknowledgements

The work of A.-H. Phan, A. Cichocki, I. Oseledets, J. Gusak, K. Sobolev, K. Sozykin and D. Ermilov was supported by the Ministry of Education and Science of the Russian Federation under Grant 14.756.31.0001. The results of this work were achieved during the cooperation project with Noah's Ark Lab, Huawei Technologies. The authors sincerely thank the Referees for very constructive comments which helped to improve the quality and presentation of the paper. The computing for this project was performed on the Zhores CDISE HPC cluster at Skoltech [56].

References

  • [1] M. Astrid and S. Lee (2017) CP-decomposition with tensor power method for convolutional neural networks compression. In 2017 IEEE International Conference on Big Data and Smart Computing, BigComp 2017, Jeju Island, South Korea, February 13-16, 2017, pp. 115–118. External Links: Document Cited by: §1.
  • [2] A. Bulat, J. Kossaifi, G. Tzimiropoulos, and M. Pantic (2019) Matrix and tensor decompositions for training binary neural networks. arXiv preprint arXiv:1904.07852. Cited by: §1.
  • [3] A. Bulat, J. Kossaifi, G. Tzimiropoulos, and M. Pantic (2020) Incremental multi-domain learning with network latent tensor factorization. In AAAI, Cited by: §5.
  • [4] T. Chen, J. Lin, T. Lin, S. Han, C. Wang, and D. Zhou (2018) Adaptive mixture of low-rank factorizations for compact neural modeling. In CDNNRIA Workshop, NIPS, Cited by: §1.
  • [5] A. Cichocki, N. Lee, I. Oseledets, A. Phan, Q. Zhao, and D. P. Mandic (2016) Tensor networks for dimensionality reduction and large-scale optimization: part 1 low-rank tensor decompositions. Foundations and Trends® in Machine Learning 9 (4-5), pp. 249–429. External Links: ISSN 1935-8237 Cited by: §1.2.
  • [6] L. De Lathauwer, B. De Moor, and J. Vandewalle (2000-03) On the best rank-1 and rank-(R1,R2,. . .,RN) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications 21, pp. 1324–1342. External Links: ISSN 0895-4798 Cited by: §2.4.
  • [7] L. De Lathauwer (2008) Decompositions of a higher-order tensor in block terms — Part I and II. SIAM Journal on Matrix Analysis and Applications 30 (3), pp. 1022–1066. Note: Special Issue on Tensor Decompositions and Applications External Links: Link Cited by: §1.1.
  • [8] V. de Silva and L-H. Lim (2008-09) Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM J. Matrix Anal. Appl. 30, pp. 1084–1127. External Links: ISSN 0895-4798 Cited by: §1.2.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §4.
  • [10] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas (2013) Predicting parameters in deep learning. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, USA, pp. 2148–2156. Cited by: §1.
  • [11] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems 27, pp. 1269–1277. Cited by: §1.2, §1, §1.
  • [12] M. Espig, W. Hackbusch, S. Handschuh, and R. Schneider (2011) Optimization problems in contracted tensor networks. Computing and Visualization in Science 14 (6), pp. 271–285. Cited by: §1.1.
  • [13] M. Figurnov, A. Ibraimova, D. P. Vetrov, and P. Kohli (2016) PerforatedCNNs: acceleration through elimination of redundant convolutions. In Advances in Neural Information Processing Systems, pp. 947–955. Cited by: §1.
  • [14] X. Gao, Y. Zhao, Ł. Dudziak, R. Mullins, and C. Xu (2019) Dynamic channel pruning: feature boosting and suppression. In International Conference on Learning Representations, Cited by: §1, §4.2, Table 1.
  • [15] J. Gusak, M. Kholyavchenko, E. Ponomarev, L. Markeeva, P. Blagoveschensky, A. Cichocki, and I. Oseledets (2019) Automated multi-stage compression of neural networks. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 2501–2508. Cited by: §1.1, §1, §1, §4.2, Table 1.
  • [16] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 1135–1143. Cited by: §1.
  • [17] S. Handschuh (2015) Numerical methods in tensor networks. Ph.D. Thesis, Faculty of Mathematics and Informatics, University Leipzig, Germany, Leipzig, Germany. Cited by: §1.1.
  • [18] R. A. Harshman (1970) Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multimodal factor analysis. UCLA Working Papers in Phonetics, 16, pp. 1–84. Cited by: §1.1.
  • [19] R. A. Harshman (2004) The problem and nature of degenerate solutions or decompositions of 3-way arrays. In Tensor Decomposition Workshop, Palo Alto, CA. Cited by: §1.2.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 770–778. Cited by: §4.2, §4.2, §4.2, §4.
  • [21] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang (2018-07) Soft filter pruning for accelerating deep convolutional neural networks. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 2234–2240. Cited by: §1.
  • [22] Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: §1.
  • [23] C. J. Hillar and L. H. Lim (2013) Most tensor problems are NP-hard. Journal of the ACM (JACM) 60 (6), pp. 45. Cited by: §1.1, §3.
  • [24] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for MobileNetv3. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324. Cited by: §5.
  • [25] W. Hua, Y. Zhou, C. M. De Sa, Z. Zhang, and G. E. Suh (2019) Channel gating neural networks. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 1886–1896. Cited by: §4.2, Table 1.
  • [26] B.N. Khoromskij (2011) O(d log N)-quantics approximation of N-d tensors in high-dimensional numerical modeling. Constructive Approximation 34 (2), pp. 257–280. Cited by: §1.1.
  • [27] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin (2016) Compression of deep convolutional neural networks for fast and low power mobile applications. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, External Links: Link Cited by: §1.1, §1, §1, §4.2, Table 1.
  • [28] J. Kossaifi, A. Bulat, G. Tzimiropoulos, and M. Pantic (2019) T-net: parametrizing fully convolutional nets with a single high-order tensor. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7822–7831. Cited by: §5.
  • [29] J. Kossaifi, A. Toisoul, A. Bulat, Y. Panagakis, T. M. Hospedales, and M. Pantic (2020) Factorized higher-order CNNs with an application to spatio-temporal emotion estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6060–6069. Cited by: §1.1, §1, §5, §5.
  • [30] W.P. Krijnen, T.K. Dijkstra, and A. Stegeman (2008) On the non-existence of optimal solutions and the occurrence of “degeneracy” in the CANDECOMP/PARAFAC model. Psychometrika 73, pp. 431–439. Cited by: §1.2.
  • [31] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical Report Technical Report TR-2009, University of Toronto, Toronto. Cited by: §4.
  • [32] J. M. Landsberg (2012) Tensors: Geometry and Applications. Vol. 128, American Mathematical Society, Providence, RI, USA. Cited by: §1.1.
  • [33] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky (2015) Speeding-up convolutional neural networks using fine-tuned CP-decomposition. International Conference on Learning Representations. Cited by: §1.1, §1.2, §1, §1.
  • [34] V. Lebedev (2018) Algorithms for speeding up convolutional neural networks. Ph.D. Thesis, Skoltech, Russia. External Links: Link Cited by: §1.2.
  • [35] L.-H. Lim and P. Comon (2009) Nonnegative approximations of nonnegative tensors. Journal of Chemometrics 23 (7-8), pp. 432–441. Cited by: §1.2.
  • [36] B. C. Mitchell and D. S. Burdick (1994) Slowly converging PARAFAC sequences: Swamps and two-factor degeneracies. Journal of Chemometrics 8, pp. 155–168. Cited by: §1.2.
  • [37] D. Molchanov, A. Ashukha, and D. Vetrov (2017) Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2498–2507. Cited by: §1.
  • [38] A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov (2015) Tensorizing neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, Cambridge, MA, USA, pp. 442–450. Cited by: §1.
  • [39] I.V. Oseledets and E.E. Tyrtyshnikov (2009) Breaking the curse of dimensionality, or how to use SVD in many dimensions. SIAM Journal on Scientific Computing 31 (5), pp. 3744–3759. Cited by: §1.1.
  • [40] P. Paatero (2000) Construction and analysis of degenerate PARAFAC models. J. Chemometrics 14 (3), pp. 285–299. Cited by: §1.2.
  • [41] A. Phan, M. Yamagishi, D. Mandic, and A. Cichocki (2020) Quadratic programming over ellipsoids with applications to constrained linear regression and tensor decomposition. Neural Computing and Applications. External Links: Document Cited by: Remark 1.
  • [42] A.-H. Phan, A. Cichocki, A. Uschmajew, P. Tichavský, G. Luta, and D. Mandic (2020) Tensor networks for latent variable analysis: novel algorithms for tensor train approximation. IEEE Transaction on Neural Network and Learning System. External Links: Document Cited by: §2.4.
  • [43] A.-H. Phan, P. Tichavský, and A. Cichocki (2015) Tensor deflation for CANDECOMP/PARAFAC. Part 1: Alternating Subspace Update Algorithm. IEEE Transaction on Signal Processing 63 (12), pp. 5924–5938. Cited by: §1.2.
  • [44] A.-H. Phan, P. Tichavský, and A. Cichocki (2019-03) Error preserving correction: a method for CP decomposition at a target error bound. IEEE Transactions on Signal Processing 67 (5), pp. 1175–1190. External Links: ISSN 1053-587X Cited by: §2.2, §2.3.1, Remark 2.
  • [45] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European conference on computer vision, pp. 525–542. Cited by: §1.
  • [46] W.S. Rayens and B.C. Mitchell (1997) Two-factor degeneracies and a stabilization of PARAFAC. Chemometrics Intelliggence Laboratory Systems 38 (2), pp. 173–181. Cited by: §1.2.
  • [47] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua (2013) Learning separable filters. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’13, Washington, DC, USA, pp. 2754–2761. External Links: ISBN 978-0-7695-4989-7 Cited by: §1.
  • [48] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR, Cited by: §4.2, §4.2, §4.
  • [49] A. Stegeman and P. Comon (2010-12-01) Subtracting a best rank-1 approximation may increase tensor rank. Linear Algebra and its Applications 433 (7), pp. 1276–1300. Cited by: §1.2.
  • [50] M. Tan and Q. V. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In ICML, Cited by: §5.
  • [51] P. Tichavský, A.-H. Phan, and A. Cichocki (2019) Sensitivity in tensor decomposition. IEEE Signal Processing Letters 26 (11), pp. 1653–1657. Cited by: §2.3.1, Definition 1.
  • [52] L. R. Tucker (1963) Implications of factor analysis of three-way matrices for measurement of change. Problems in measuring change 15, pp. 122–137. Cited by: §1.1, §2.4.
  • [53] M. A. O. Vasilescu and D. Terzopoulos (2003) Multilinear subspace analysis of image ensembles. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), 16-22 June 2003, Madison, WI, USA, pp. 93–99. External Links: Document Cited by: §1.2.
  • [54] N. Vervliet, O. Debals, L. Sorber, M. V. Barel, and L. D. Lathauwer (2016-Mar.) Tensorlab 3.0. Note: Available online External Links: Link Cited by: §1.2, §1.2.
  • [55] D. Wang, G. Zhao, G. Li, L. Deng, and Y. Wu (2019) Lossless compression for 3DCNNs based on tensor train decomposition. CoRR abs/1912.03647. External Links: Link, 1912.03647 Cited by: §1.
  • [56] I. Zacharov, R. Arslanov, M. Gunin, D. Stefonishin, A. Bykov, S. Pavlov, O. Panarin, A. Maliutin, S. Rykovanov, and M. Fedorov (2019) Zhores — petaflops supercomputer for data-driven modeling, machine learning and artificial intelligence installed in Skolkovo Institute of Science and Technology. Open Engineering 9 (1). Cited by: Acknowledgements.
  • [57] T. Zhang and G. H. Golub (2001) Rank-one approximation to high order tensors. SIAM Journal on Matrix Analysis and Applications 23 (2), pp. 534–550. External Links: Document Cited by: §1.2.
  • [58] X. Zhang, J. Zou, K. He, and J. Sun (2016) Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 1943–1955. Cited by: §4.2, Table 1.
  • [59] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu (2018) Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 883–894. Cited by: §1, §4.2, Table 1.