MICIK: MIning Cross-Layer Inherent Similarity Knowledge for Deep Model Compression

by Jie Zhang, et al.

State-of-the-art deep model compression methods exploit low-rank approximation and sparsity pruning to remove redundant parameters from a learned hidden layer. However, they process each hidden layer individually while neglecting the common components across layers, and thus are not able to fully exploit the potential redundancy space for compression. To solve this problem and enable further compression of a model, it is necessary to remove the cross-layer redundancy and mine the layer-wise inheritance knowledge. In this paper, we introduce a holistic model compression framework, namely MIning Cross-layer Inherent similarity Knowledge (MICIK), to fully excavate the potential redundancy space. The proposed MICIK framework simultaneously (1) learns the common and unique weight components across deep neural network layers to increase the compression rate; (2) preserves the inherent similarity knowledge of nearby and distant layers to minimize the accuracy loss; and (3) remains complementary to other existing compression techniques such as knowledge distillation. Extensive experiments on large-scale convolutional neural networks demonstrate that MICIK is superior to state-of-the-art model compression approaches, with 16X parameter reduction on VGG-16 and 6X on GoogLeNet, all without accuracy loss.




1 Introduction

Deep learning is the primary driving force behind recent breakthroughs in various computer vision applications, such as object classification [20], semantic segmentation [33], image captioning [35], etc. However, deploying deep learning models on resource-constrained mobile devices is challenging due to high memory and computation cost, although most mobile vision applications can benefit from the advantages of on-device deployment such as low latency, better privacy and offline operation [21, 18]. Motivated by this demand, researchers have proposed deep model compression methods to remove redundancy in the learned deep models, i.e., reducing both computing and storage cost with minimum impact on model accuracy.

Figure 1: An illustration of the model compression architecture of MICIK. MICIK exploits the low-rankness and sparsity of the original network structure. Several layers share the common components (dashed lines) for a higher compression rate, while solid lines (in a different color per layer) represent the exclusive weight components of each layer. Besides, nearby layers have higher inherent similarity than distant layers (a thicker arrow indicates higher similarity).

One of the most widely used model compression approaches is pruning. The general idea of pruning is to directly remove redundant parameters based on certain pre-defined strategies. In an early work, Han et al. [14] directly applied a hard threshold on the weights to cut off the network's unimportant connections. He et al. [16] pruned each layer using a LASSO-regression-based channel selection followed by least-squares reconstruction. Another prevalent approach for model compression is low-rank approximation, which decomposes a large weight matrix into several small matrices and has been successfully applied to both convolutional and fully-connected layers [32, 19]. Low-rank approximation generally provides better initialization for fine-tuning than sparsity pruning and thus has the advantage of greatly reduced retraining time. However, low-rank decomposition cannot fully represent the knowledge in the original layers, since some important information is distributed outside the low-rank subspace. Recently, as observed by Yu et al. [37], combining low-rank approximation with sparse matrix decomposition gives better compression rates than using either method individually.

In this paper, we argue that existing model compression approaches operate only within each layer separately and neglect the correlation between different layers. However, a deep learning model is usually composed of multiple layers whose parameters are learned in a sequential structure. Therefore, there must be some common components shared among the weight matrices of different layers, especially between adjacent layers (e.g., the first layers extract low-level features such as edges). Mining these shared common representations enables further compression of the deep model. Motivated by the previous work [37] and this information-sharing assumption, we propose a holistic model compression method, namely MICIK, which decomposes layers' weight matrices into low-rank and sparse components with an integrated cross-layer mining scheme. Specifically, MICIK mines the shared information across layers while preserving inherent similarity, and formulates this as a multi-task learning (MTL) [9] problem where both intra-layer and inter-layer feature representations are embodied together. In addition, learning the correlation among layers while treating all layers equally can distort the original deep sequential structure, especially for layers with a large gap between them. Therefore, we incorporate an inherent knowledge mining scheme into the proposed pipeline to maximize the similarity of nearby layers and the dissimilarity of distant layers during compression.

Our main contributions can be summarized as follows. First, unlike previous layer-wise model compression approaches which perform optimization based only on intra-layer correlations, we propose a holistic deep model compression framework that mines both common and individual weight components simultaneously. To the best of our knowledge, this is the first compression work that considers sharing common weight components across multiple layers, which significantly increases the compression capacity of the deep learning model matrices. Second, to maintain the consistency of deep feature learning, our framework also employs inherent similarity knowledge to better learn the network structure, making the correlations of adjacent layers closer and the correlations of distant layers farther apart. Adding this term further improves the compression rate without accuracy loss of the compressed model. Furthermore, an efficient optimization method is introduced. Third, the proposed approach is also complementary to other existing compression techniques such as knowledge distillation (KD) [17], which can further improve the model compression performance. We have evaluated the proposed approach on two popular CNN architectures: GoogLeNet and VGG-16. Extensive evaluations demonstrate the superiority of the proposed model compression algorithm over the state of the art, significantly improving the compression rate.

2 Related Work

Recently, deep model compression has drawn great interest from AI researchers. We summarize previous compression strategies into four categories.

2.1 Sparsity based model compression

Sparsifying the parameters of learned neural network layers is an intuitive approach to compression. One general way to achieve sparsification is pruning, which iteratively removes the network parameters that have the least impact on the final prediction accuracy. For example, Han et al. [14] proposed using a hard threshold formed by the product of a scalar and the standard deviation of the weights. Following this work, filter-level pruning methods [25] have been introduced, which enable a more structured pruning strategy. Most recently, He et al. [15] proposed using AutoML for model compression, which achieves a higher acceleration rate than previous methods. The problem with pruning-based compression is that it cannot guarantee a good initialization for retraining and thus requires tedious iterative processing.

2.2 Low rankness based model compression

Low-rank factorization trades exact representation of the learned weights for storage and computational throughput. Xue et al. [36] used low-rank approximation to reduce parameter dimensions, which saves storage and improves training and testing times. Denton et al. [8] explored matrix factorization methods for speeding up CNN inference, showing that CNN testing time can be sped up by as much as 200% while keeping the accuracy drop within 1% of the original model. To avoid the accuracy loss and compensate for the information lost to the low-rank constraint, Yu et al. [37] integrated both sparse and low-rank decomposition so that both smooth and spiky components can be preserved. Zhang et al. [38] used low-rank approximation to compress recurrent neural networks. All these methods are limited to compressing each layer individually, while our method exploits the cross-layer relation for further compression.

Figure 2: The closest 100 corresponding filters from (a) two close layers (inception_3a and inception_3b) and (b) two distant layers (inception_3a and inception_5b) of GoogLeNet.

2.3 Knowledge transfer methods

Knowledge distillation (KD) transfers knowledge from one or several large pre-trained teacher networks to a small student network, which can then achieve capability comparable to the teacher networks during inference. Hinton et al. [17] trained a small student network to match the soft targets of a cumbersome teacher network by setting a proper temperature parameter. Chen et al. [5] introduced cross-sample similarities and brought the learning-to-rank technique to model compression and acceleration. Our approach is compatible with KD to achieve a higher compression rate.

2.4 Quantization based model compression

Quantization methods group weights with similar values or reduce the number of required bits in order to reduce the number of parameters. Gong et al. [11] applied vector quantization to the parameter values. Han et al. [13] pruned the unimportant connections and retrained the sparsely connected networks, then quantized the link weights using weight sharing and Huffman coding to further reduce the model size. Binary-weight neural networks such as BinaryNet [7] and XNOR-Net [28] use one binary bit to represent each weight parameter while maintaining the model accuracy. Our approach operates in a more principled way for compressing neural networks, which guarantees a faster convergence speed during the re-training stage.

3 Method

In this section, we introduce the proposed MICIK framework in detail. We first review the basic low-rank and sparse decomposition for a single layer. Next, we illustrate the motivation for cross-layer compression and present the algorithm that simultaneously mines both the individual and common weight components. Specifically, MICIK takes into consideration the inherent similarity knowledge across successive layers when formulating the objective function. Finally, we introduce the optimization scheme.

3.1 Single-layer Compression

For single-layer compression, the output of a convolutional or a fully-connected layer can be obtained by

y = W x,   (1)

where x ∈ R^n is the input feature vector, W ∈ R^{m×n} is the weight matrix of a convolutional or fully-connected layer, and y ∈ R^m is the output response. Assume W~ is the approximation of W in a low-rank subspace. W~ can be represented as the product of two smaller matrices, W~ = U V, where U ∈ R^{m×r} and V ∈ R^{r×n} (r ≤ m and r ≤ n). Therefore, we have the following model:

min_{U,V} ||W − U V||_F^2,   (2)

where r represents the rank of W~. However, if the weight matrix is represented only in the low-rank subspace, some important sparse information could be lost.

To recover the information loss, Yu et al. [37] proposed to incorporate an additional sparse matrix S. Thus, to compress the weight matrix W, Eq. (2) is reformulated as follows:

min_{U,V,S} ||W − U V − S||_F^2,  s.t. rank(U V) ≤ r, card(S) ≤ c,   (3)

where card(S) denotes the cardinality (number of non-zero entries) of the matrix S. The compression rate achieved using this method is mn / (mr + rn + c).
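As a concrete (and deliberately simplified) illustration of the low-rank-plus-sparse idea behind Eq. (3), the following NumPy sketch splits a weight matrix into a truncated-SVD factor pair plus a sparse residual and computes the resulting compression rate. The one-shot solver and the function names are our own illustrative choices, not the paper's actual alternating optimizer.

```python
import numpy as np

def lowrank_sparse_decompose(W, rank, card):
    """One-shot low-rank + sparse split, W ~= U @ V + S: truncated SVD
    for the low-rank part, then keep the `card` largest-magnitude
    residual entries as the sparse part. (The paper's solver instead
    alternates these updates with an asymmetric reconstruction term.)"""
    u, s, vt = np.linalg.svd(W, full_matrices=False)
    U = u[:, :rank] * s[:rank]                 # m x r
    V = vt[:rank, :]                           # r x n
    R = W - U @ V                              # residual outside the subspace
    S = np.zeros_like(R)
    idx = np.unravel_index(np.argsort(np.abs(R), axis=None)[-card:], R.shape)
    S[idx] = R[idx]                            # spiky entries the subspace missed
    return U, V, S

def compression_rate(m, n, rank, card):
    # Dense layer: m*n parameters; decomposition: m*r + r*n + card.
    return (m * n) / (m * rank + rank * n + card)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
U, V, S = lowrank_sparse_decompose(W, rank=8, card=200)
err = np.linalg.norm(W - U @ V - S) / np.linalg.norm(W)
rate = compression_rate(64, 64, 8, 200)
```

The sparse part strictly reduces the approximation error relative to the low-rank part alone, which mirrors the paper's point that important information lies outside the low-rank subspace.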

If we sequentially apply Eq. (3) to several layers prior to retraining, the approximation error of each layer will accumulate and it will be hard for the compressed model to converge. Therefore, we add an asymmetric data reconstruction term to reduce the accumulated error from Eq. (3):

min_{U,V,S} ||W − U V − S||_F^2 + λ ||Y − (U V + S) X||_F^2,  s.t. rank(U V) ≤ r, card(S) ≤ c,   (4)

where X and Y are the input and output feature maps of the layer collected from the original network. This asymmetric term ensures the optimal approximation of the weight matrices in a given layer and avoids abrupt accuracy loss during the compression process, which helps speed up convergence.
3.2 Mining Cross-layer Model Compression

However, Eq. (4) does not consider the relations among different layers, which may share common components that enable further model compression. We observe that different layers of a deep neural network share common components (similarity), which is our initial motivation for exploring cross-layer model compression. Fig. 2 shows corresponding filters in different convolutional layers of GoogLeNet; red boxes in Fig. 2 (a) show the common components of nearby layers. We say two filters, each from a different layer, are correspondences if each is the other's nearest filter, measured by similarity. It can be seen that for both close layers (Fig. 2 (a)) and distant layers (Fig. 2 (b)), the correspondences formed between layers appear very similar. It is worth noting that our algorithm does not rely on this correspondence definition, and we demonstrate in the experiments that different neural architectures can be handled well by the proposed solution.
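The mutual-nearest-filter criterion just described can be sketched as follows. Cosine similarity and all function names are our assumptions; the text only specifies that correspondences are filters that are each other's nearest by "similarity".

```python
import numpy as np

def mutual_correspondences(A, B):
    """Filter correspondences between two layers: filter i of A and
    filter j of B correspond iff each is the other's nearest filter.
    A, B: (num_filters, filter_dim) flattened filter banks. Cosine
    similarity is an assumption made for this sketch."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    sim = An @ Bn.T                  # pairwise cosine similarities
    a2b = sim.argmax(axis=1)         # nearest B-filter for each A-filter
    b2a = sim.argmax(axis=0)         # nearest A-filter for each B-filter
    return [(i, int(j)) for i, j in enumerate(a2b) if b2a[j] == i]

# Toy check: B is a permutation of A's filters, so every filter matches.
A = np.eye(4)
B = A[[2, 0, 3, 1]]
pairs = mutual_correspondences(A, B)   # -> [(0, 1), (1, 3), (2, 0), (3, 2)]
```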

To fully exploit such potential redundancy, the idea of Multi-Task Learning (MTL) [9] can be incorporated. MTL was proposed to improve learning performance by learning multiple related tasks simultaneously and training tasks in parallel by using a shared representation. Based on MTL, we redefine Eq. (4) and propose an algorithm to compress weight matrices from different layers simultaneously.

Given the weight matrices {W_1, ..., W_K} from K different layers, the proposed cross-layer compression can be formulated as follows:

min_{ {U_k, V_k, S_k} } Σ_{k=1}^{K} ( ||W_k − U_k V_k − S_k||_F^2 + λ ||Y_k − (U_k V_k + S_k) X_k||_F^2 ),  s.t. rank(U_k V_k) ≤ r_k, card(S_k) ≤ c_k,   (5)

where λ is a non-negative parameter that provides a trade-off between the reconstruction loss and the asymmetric term. In Eq. (5), MTL is used to learn each layer individually in parallel, but it does not make clear how the redundancy among layers can be exploited. Therefore, to further increase the compression rate, MICIK is proposed to use common and individual components to compress multiple layers simultaneously.

For the weight matrix W_k of a particular layer, our goal is to learn weight components U_k composed of two parts, U_k = [U_c, U_k^I], where U_c ∈ R^{m×r_c} and U_k^I ∈ R^{m×r_k^I}. U_c is the common component shared among the different layers, while U_k^I differs across layers and is learned only from the corresponding matrix W_k. The objective function can be reformulated as follows:

min_{U_c, {U_k^I, V_k, S_k}} Σ_{k=1}^{K} ( ||W_k − [U_c, U_k^I] V_k − S_k||_F^2 + λ ||Y_k − ([U_c, U_k^I] V_k + S_k) X_k||_F^2 ),  s.t. rank([U_c, U_k^I] V_k) ≤ r_k, card(S_k) ≤ c_k.   (6)

In Eq. (6), we compress several convolutional layers' weight matrices together. The common components matrix U_c is shared across multiple layers to further compress the model, so the parameters stored for each layer become m r_k^I + r_k n + c_k (plus a single shared copy of U_c) instead of m r_k + r_k n + c_k.
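The saving from sharing U_c can be made concrete with a small parameter-accounting sketch. The function names are ours and the accounting is illustrative, using the symbols of Eq. (6):

```python
def micik_params(m, n, r_common, r_indiv, card, num_layers):
    """Parameters stored when num_layers m-by-n layers share a common
    component U_c (m x r_common) and each layer keeps an individual part
    U_k^I (m x r_indiv), V_k ((r_common + r_indiv) x n) and a sparse S_k
    with `card` non-zeros."""
    shared = m * r_common                                  # stored once
    per_layer = m * r_indiv + (r_common + r_indiv) * n + card
    return shared + num_layers * per_layer

def per_layer_params(m, n, rank, card, num_layers):
    """Same total rank per layer, but no sharing (Eq. (3) layer-wise)."""
    return num_layers * (m * rank + rank * n + card)

# e.g. four 256x256 layers, rank budget 16 split as 8 common + 8 individual
with_sharing = micik_params(256, 256, 8, 8, 200, 4)        # 27424
without_sharing = per_layer_params(256, 256, 16, 200, 4)   # 33568
```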

3.3 Mining Inherent Similarity Knowledge

The layers in a deep neural network have an ordinal relationship: layers close to the input extract low-level image features (e.g., edges), while layers close to the output extract high-level features with semantic meanings. Thus there are different degrees of similarity between nearby layers and distant layers.

Figure 3: The similarity distribution of kernels visualized using the first three dominant principal components of GoogLeNet. (a) Layers inception_3b, inception_4a and inception_5b. (b) Layers inception_3a, inception_3b and inception_4e.

Fig. 3 shows the distributions of two example groups of three layers from GoogLeNet. The three layers in each group are chosen such that two of them are consecutive layers with the same filter size and the third one is the farthest layer with filters of the same size. We use principal component analysis (PCA) [34] to project them onto a 3D subspace for visualizing the distributions. The three selected layers are sorted by their depth in the network, and we perform PCA to project the filters of the first layer onto its three most dominant principal components A. For the second and third layers, we apply PCA on the filter kernels, find the principal components B that are most similar (by cosine distance) to A, and plot them in the same 3D subspace, using a different color for each layer. We can see that the distributions of inception_3b and inception_4a (Fig. 3 (a)) share more similarity with each other than with inception_5b, and that inception_3a and inception_3b (Fig. 3 (b)) share more similarity with each other than with inception_4e. We observe that nearby layers are more likely to have similar distributions than layers far apart in depth, which motivates us to add constraints between the components learned for successive layers: nearby layers should have more similar individual features than layers far away from each other. Therefore, we use a weighted function to emphasize the inherent similarity knowledge between two different layers:

The weight function ω(i, j) penalizes the distance between two layers i and j so that it emphasizes their inherent similarity: nearby layers are encouraged to learn layer-specific components with high similarity, while distant layers are allowed high disparity. For example, since distant layers differ more than nearby layers, we use a smaller weight ω(i, j) for distant pairs, so that the individual components of distant layers are allowed to differ more than those of nearby layers.
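A plausible form of such a distance-weighted similarity penalty is sketched below. Both the exponential decay and the pairwise Frobenius-distance form are our assumptions for illustration; the paper does not commit to a specific weighting function in this excerpt.

```python
import numpy as np

def layer_weight(i, j, sigma=2.0):
    """Weight for the layer pair (i, j): large for nearby layers, small
    for distant ones. The exponential form and sigma are illustrative
    assumptions."""
    return np.exp(-abs(i - j) / sigma)

def similarity_penalty(U_indiv, sigma=2.0):
    """Weighted squared Frobenius distances between the individual
    components of all layer pairs: nearby pairs (large weight) are
    pulled together harder, distant pairs are left freer."""
    total = 0.0
    k = len(U_indiv)
    for i in range(k):
        for j in range(i + 1, k):
            d = np.linalg.norm(U_indiv[i] - U_indiv[j]) ** 2
            total += layer_weight(i, j, sigma) * d
    return total
```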

The final objective function of the MICIK compression algorithm can be formalized as follows:

min_{U_c, {U_k^I, V_k, S_k}} Σ_{k=1}^{K} ( ||W_k − [U_c, U_k^I] V_k − S_k||_F^2 + λ ||Y_k − ([U_c, U_k^I] V_k + S_k) X_k||_F^2 ) + γ Σ_{i<j} ω(i, j) ||U_i^I − U_j^I||_F^2,  s.t. rank([U_c, U_k^I] V_k) ≤ r_k, card(S_k) ≤ c_k,   (7)

where ω(i, j) is the layer-distance weight and γ is a non-negative trade-off parameter.
To optimize Eq. (7), we alternately optimize U, V and S. When updating U and V, to avoid excessive matrix inversions, we use the greedy selection method of [39], which reduces the time complexity of the update. When updating U, to avoid calculating the inverse of the Hessian matrix H, we use the diagonal elements of H, because H is close to a diagonal matrix when the columns of U have low correlation; this further reduces the time complexity of the update.

3.4 Optimization Analysis

In this section, we explain the update procedures for the low-rank structure and the sparse structure that minimize Eq. (7). The whole process of updating U, V and S is summarized in Algorithm 1: we use an alternating updating method and run it for a fixed number of iterations. More specifically, the objective function in Eq. (7) is a non-convex problem due to the rank and cardinality constraints. We use the convex relaxation technique [4] to solve it.

First, we update V_k by fixing U_k and S_k, using the same updating rule (QR decomposition) as [37] for each input layer k:

V_k = U_k^+ (W_k − S_k),   (8)

where U_k = Q R is the QR decomposition of U_k and U_k^+ is the Moore-Penrose pseudoinverse [2] of U_k.

Input : weight matrices {W_k} and layer inputs/outputs {X_k, Y_k} from the DNN, number of epochs T, number of layers K, ranks r_c and {r_k}, cardinalities {c_k}, and initial U, V, S
Output : U_c, {U_k^I}, {V_k} and {S_k}
1 begin
2       for t = 1, ..., T do
3             for k = 1, ..., K do
4                   Take the input weight matrix W_k;
5                   Update V_k by QR decomposition [37] as Eq. (8);
6                   Grow the rank by greedy selection [39];
7                   Update U_c and U_k^I by Eq. (11);
8                   Calculate the weight function ω and add the similarity term;
9                   Update S_k by Eq. (14);
10      end
Algorithm 1 MICIK

Then, we update U_k by fixing V_k and S_k; here we need to consider the two parts of U_k: U_c and U_k^I. Let L_k = U_k V_k, where rank(L_k) ≤ r_k. The convex envelope of a function f on a domain C is defined as the largest convex function g such that g(x) ≤ f(x) for all x in C. The trace norm (nuclear norm) is known as the convex envelope of the rank function [10]:

||L||_* = Σ_i σ_i(L),  with rank(L) ≥ ||L||_* whenever ||L||_2 ≤ 1,   (9)

where σ_i(L) are the singular values of L. Therefore, the equivalent objective function of Eq. (7) for updating U becomes:

min_{U_c, {U_k^I, V_k, S_k}} Σ_{k=1}^{K} ( ||W_k − [U_c, U_k^I] V_k − S_k||_F^2 + λ ||Y_k − ([U_c, U_k^I] V_k + S_k) X_k||_F^2 + μ ||[U_c, U_k^I] V_k||_* ),  s.t. card(S_k) ≤ c_k,   (10)

where μ is a non-negative parameter controlling the trace-norm relaxation of the rank constraint.
In Alg. 1, to construct U_k V_k with rank r_k, instead of updating all m columns/rows at every iteration, we use a greedy selection [39] to update U_k and V_k. Specifically, we initialize with a small rank, then select extra columns/rows and concatenate them into U_k at each iteration. Such a greedy method provides a warm start for the higher-rank optimization and ensures faster computation compared to updating all m columns/rows at every iteration. Thus, we still need to constrain the rank of U_k V_k in Eq. (10).
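The greedy rank warm-start can be illustrated with the following sketch, which grows the factorization a few columns/rows at a time by approximating the current residual. The SVD of the residual stands in for the paper's alternating updates, and the function name is ours.

```python
import numpy as np

def greedy_lowrank(W, target_rank, step=2):
    """Start from a small rank and grow it by `step` columns/rows per
    iteration, concatenating the new directions onto the existing
    factors instead of re-optimizing all target_rank columns at once."""
    m, n = W.shape
    U = np.zeros((m, 0))
    V = np.zeros((0, n))
    while U.shape[1] < target_rank:
        R = W - U @ V                          # residual not yet captured
        u, s, vt = np.linalg.svd(R, full_matrices=False)
        k = min(step, target_rank - U.shape[1])
        U = np.hstack([U, u[:, :k] * s[:k]])   # concatenate new columns
        V = np.vstack([V, vt[:k, :]])
    return U, V

rng = np.random.default_rng(1)
W = rng.standard_normal((30, 20))
U, V = greedy_lowrank(W, target_rank=8, step=2)
```

For this SVD-based stand-in, the greedy concatenation reaches the same approximation error as a direct rank-8 truncated SVD, while each growth step only has to explain the remaining residual.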

We use the V_k obtained from Eq. (8) as the initialization and perform one step of stochastic gradient descent (SGD) [3] to update the weight components U_c and U_k^I. The optimization procedure takes the form

U_k ← U_k − η ∇f(U_k) diag(H)^{-1},   (11)

where f denotes the objective of Eq. (10) restricted to U_k, H is the Hessian matrix of f, diag(H) denotes its diagonal approximation (as described above), and η is the step size.

After we update U_k using Eq. (11), we calculate the singular values of [U_c, U_k^I] V_k and then easily add the contribution of the third (trace-norm) term of Eq. (10) at the end of the current updating iteration.

Last, we update S_k by fixing U_k and V_k. The l0-norm (cardinality, i.e., the number of non-zero entries) is normally used to control the sparsity of a matrix, and the problem of updating S_k is equivalent to the following:

min_{S_k} ||W_k − U_k V_k − S_k||_F^2 + τ ||S_k||_0,   (12)

where τ is a non-negative parameter that controls the sparse regularization. However, l0-norm minimization is NP-hard [27], and l0 regularization is generally intractable. Therefore, we again use convex relaxation [4]: the l1-norm is known as the convex envelope of the l0-norm, and thus the l1-norm is used instead:

min_{S_k} ||W_k − U_k V_k − S_k||_F^2 + τ ||S_k||_1.   (13)

Given Eq. (13), the corresponding l1-regularized convex optimization problem can be solved efficiently in closed form. We calculate S_k based on Eq. (14) and then update the model:

S_k = soft_τ(W_k − U_k V_k),   (14)

where soft_τ is the soft-thresholding shrinkage function [6].
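The soft-thresholding shrinkage function has a simple elementwise form, sketched below (the operator itself is standard; only the variable names are ours):

```python
import numpy as np

def soft_threshold(X, tau):
    """Soft-thresholding shrinkage operator: entries with magnitude at
    most tau are set to zero, the rest shrink toward zero by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

# With U and V fixed, the l1 problem
#   min_S 0.5 * ||R - S||_F^2 + tau * ||S||_1,   R = W - U @ V,
# is solved exactly, entry by entry, by S = soft_threshold(R, tau).
```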

Layers W (O) W (C) R GreBdec
conv1_1 9k 4.5k 2X 1X
conv2 115k 57K 2X 1.2X
inception_3a 164k 33k 5X 4.8X
inception_3b 389k 77k 5X 4.5X
inception_4a 376k 75k 5X 4.5X
inception_4b 449k 90k 5X 4.5X
inception_4c 510k 102k 5X 4.3X
inception_4d 605k 121k 5X 4.5X
inception_4e 868k 174k 5X 4.3X
inception_5a 1M 200k 5X 4.5X
inception_5b 1M 200k 5X 4.5X
fc8 1M 200k 5X 5X
Total 7M 1.3M 5.4X 4.5X
Table 1: Compression statistics of MICIK on GoogLeNet. W (O): total parameters of the original model. W (C): total parameters of the compressed model. R: compression rate.
Network W MCR
GoogLeNet 7M 1X
KD 5.1M 1.4X
Low-Rank [32] 2.4M 2.8X
Tucker [19] 4.7M 1.3X
GreBdec [37] 1.5M 4.5X
Sparse [26] 2.3M 3X
DeepRebirth [22] 2.8M 2.5X
MICIK 1.3M 5.4X
MICIK+KD 0.95M 7.4X
Table 2: Comparisons of different approaches on GoogLeNet. W: the number of parameters. MCR: maximum compression rate.

4 Experiment

In this section, we present comprehensive evaluation results and analysis of the proposed MICIK deep model compression method. Following previous model compression work [19, 37], our evaluation is performed on two popular image classification CNN models pre-trained on ImageNet [29], namely GoogLeNet [31] and VGG-16 [30]. We also compare against other state-of-the-art model compression approaches. We implement our method and conduct the experiments using TensorFlow [1]. Our pre-trained reference models are obtained from the model repository of the official TensorFlow release.

4.1 Experimental Settings

In our experiments, we compress up to four neural network layers at a time. The X and Y of a neural network layer are collected by feeding training examples and extracting the layer's input feature maps (as X) and output feature maps (as Y). To obtain a decomposition whose output feature maps have minimal reconstruction error, which is critical for the initialization of fine-tuning, we use 3000 input-output pairs per layer (i.e., 12,000 examples for four layers). Unlike pruning-based model compression methods [14, 19, 32], which rely heavily on layer-wise fine-tuning, our approach allows fine-tuning of the complete model, which is much more time efficient; it converges in merely two epochs. Following [37], the sparsity threshold is set based on the largest singular value, and the same hyper-parameter settings are used in all experiments.

To compare with state-of-the-art model compression methods, we use the maximum compression rate (MCR) as our evaluation metric. Specifically, we compress a model to the minimum number of parameters that preserves its accuracy (no accuracy loss compared with the original model). For ImageNet, we use the Top-1 and Top-5 testing accuracy on the ILSVRC2012 validation dataset as guidance to measure the MCR, i.e., we make sure there is no loss in either Top-1 or Top-5 validation accuracy after model compression. We report the number of parameters and the MCR to evaluate the efficiency of different compression approaches.

Layers W (O) W (C) R GreBdec
conv1_1 2k 1k 2X 1X
conv1_2 37k 7k 5X 5X
conv2_1 74k 15k 5X 4.3X
conv2_2 148k 30k 5X 4.3X
conv3_1 295k 59k 5X 4.2X
conv3_2 590k 118k 5X 4.5X
conv3_3 590k 118k 5X 4.5X
conv4_1 1M 200k 5X 4.2X
conv4_2 2M 400k 5X 4.5X
conv4_3 2M 400k 5X 4.5X
conv5_1 2M 400k 5X 4.5X
conv5_2 2M 400k 5X 4.5X
conv5_3 2M 400k 5X 4.5X
fc6 103M 4.8M 21.6X 25X
fc7 17M 0.8M 21.3X 25X
fc8 4M 1M 5X 5X
Total 138M 9M 15.4X 14.2X
Table 3: Compression statistics of MICIK on VGG-16. W (O): total parameters of the original model. W (C): total parameters of the compressed model. R: compression rate.
Network W MCR
VGG-16 138M 1X
KD 107M 1.3X
Low-Rank [32] 50.2M 2.8X
Tucker [19] 127M 1.1X
Pruned [14] 10.3M 13.4X
GreBdec [37] 9.7M 14.2X
Bayesian [24] 9.9M 14X
Sparse [26] 37M 3.7X
MICIK 9M 15.4X
MICIK + KD 7.3M 18.9X
Table 4: Comparisons of different approaches on VGG-16. W: the number of parameters. MCR: maximum compression rate.

Implementations: GoogLeNet is composed of convolutional layers and one final fully-connected layer. The convolutional layers use 4 different filter sizes: 1x1, 3x3, 5x5 and 7x7, and we compress each filter size separately. Note that for two filters with the same receptive field (e.g., 3x3), their depths are likely to differ depending on the number of feature map channels generated by the previous layer. Since MICIK learns a common components matrix across multiple layers, separating the layers by filter size is necessary so that the dimensions of the common component matrix can be unified by the fixed filter size. Therefore, to compress weight matrices among different layers, we set the depth of the common components to the greatest common divisor (GCD) of the depths of the filters with the same receptive field size (among 4 nearby filters in our experiment), so that each filter can be represented by one or more components. For example, a 3x3x48 filter may be represented by three 3x3x16 components. The final fully-connected layer is processed separately.
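The GCD rule for sizing the common components can be sketched in a few lines; the function names are ours, the 48/64-depth example follows the text.

```python
from functools import reduce
from math import gcd

def common_component_depth(filter_depths):
    """Depth of the shared components: the GCD of the depths of the
    same-receptive-field filters being compressed together."""
    return reduce(gcd, filter_depths)

def num_components(filter_depth, filter_depths):
    """How many shared-depth components represent one filter, e.g. a
    3x3x48 filter among depths {48, 64} (GCD 16) needs three 3x3x16
    components."""
    return filter_depth // common_component_depth(filter_depths)
```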

VGG-16 is composed of 13 convolutional layers and 3 fully-connected layers. All convolutional layers have the same 3x3 filter size. We follow the same GCD strategy to set the size of the common components. The convolutional layer 1 is compressed separately since its filter depth is only 3. The final 3 fully-connected layers are individually compressed since those weight matrices have different dimensions.

4.2 Experimental Results

Table 1 and Table 3 give the details of the compression rate within each layer for GoogLeNet and VGG-16, respectively. In addition, we set three baseline methods: (1) the original pre-trained model, (2) Knowledge Distillation (KD) [17], which trains a student network with the same architecture as the final compressed model, and (3) GreBdec [37], the single-layer model compression of Eq. (4). Comparisons with other works on GoogLeNet and VGG-16 are listed in Table 2 and Table 4, respectively. The obtained results demonstrate that the proposed approach achieves the largest compression rate without accuracy loss.

Table 1 and Table 2 give detailed compression statistics and the comparisons on GoogLeNet. We can compress GoogLeNet by 5.4 times without accuracy loss, compared to the 4.5 times achieved by GreBdec [37]. Furthermore, compared to [37], we achieve a better compression rate on each convolutional layer, as indicated in Table 1. The advantage is even more significant for the first few layers (conv1_1, conv2), for which the number of parameters can be reduced by more than 50% compared to [37]. These results demonstrate that mining cross-layer inherent knowledge can improve the power of compression on deep neural networks. The compressed GoogLeNet model can be further compressed using knowledge distillation. Specifically, we apply KD by using the compressed model obtained from MICIK as the teacher model, and we gradually remove inception layers from the teacher model to form the student model. We still use the Top-1 and Top-5 accuracy to guide KD until accuracy loss is observed. In the experiment, we remove inception_5a and inception_5b to get the student model, which further reduces the number of weight parameters to less than 1M (7.4x MCR). This result demonstrates that by combining MICIK with KD, the model can be further compressed to better satisfy the needs of mobile deployment.

Table 3 and Table 4 give detailed compression statistics and the result comparison on VGG-16. The MCR obtained on VGG-16 using MICIK is 15.4x, compared to the 14.2x achieved by the state-of-the-art method GreBdec [37]. Louizos et al. [24] used variational Bayesian approximation to compress VGG-16 by 14X, while we achieve 15.4X. In contrast to GoogLeNet, the fully-connected layers constitute the majority of the redundancy in VGG-16. Since we can only compress the fully-connected layers individually, the proposed MICIK method shows less advantage over existing methods than it does on GoogLeNet (Table 1). Furthermore, we can reach 18.9x MCR by fusing with KD (the student model learns smaller fc6 and fc7 layers).

Figure 4: Illustrations of performance influence caused by varying the percentage of common weight components on VGG-16.

4.3 Impact Factor Analysis in MICIK

Network FCR Top-1 Top-5
MIC GoogLeNet 5.4X 0.21% 0.33%
MICIK GoogLeNet 5.4X 0% 0%
MIC VGG-16 15.4X 0.37% 0.23%
MICIK VGG-16 15.4X 0% 0%
Table 5: The results of considering inherent similarity knowledge. FCR: fixed compression rate. Top-1/Top-5: accuracy drop percentage.

Mining Inherent Similarity Knowledge. We study the influence of enforcing the inherent similarity knowledge in the compression pipeline and present the results in Table 5. In this experiment, we compress the models to the same number of parameters reached by the MCR of MICIK but without the inherent similarity term (we name this method MIC), which corresponds to the loss function defined in Eq. (10). As shown in Table 5, both the GoogLeNet and VGG-16 models then suffer some accuracy loss. Therefore, adding the inherent similarity constraint helps maintain the model accuracy. Without this constraint, the training process cannot distinguish distant layers from nearby layers, which results in ambiguities.

Ratio of Common and Individual Components. As discussed above, one major contribution of this work is the proposed component sharing across layers. It is therefore necessary to evaluate an appropriate initial ratio of shared weight components among layers. We use the smallest weight matrix as our base matrix and its size as the upper bound for the common components. Then, we set different ratios of common components with regard to all weight components: starting from an initial ratio of 1:10, we gradually increase the share of common components through 1:7, 1:5, 1:3, 1:2, 1:1, 2:1, 3:1, 5:1, 7:1 and 10:1. We use convolutional layers 2 to 13 of VGG-16 as the input of MICIK, and the results are evaluated by the value of the objective function in Eq. (7). A smaller value indicates a smaller difference between the original weight matrices and the compressed matrices, and thus a better initialization for retraining. In our experiments, similar to [37], we observe that a good initialization is essential for recovering the model accuracy. Fig. 4 shows the results of the different partitions of the weight components. The best result is obtained when splitting the weight components equally between common and individual. Based on this observation, we use 1:1 as the ratio between common and individual weight components in all experimental settings.

Inference Speed. In this paper, we focus on saving the storage of deep learning models, which has significant impact: e.g., models can be distributed and deployed on mobile devices efficiently, saving both network bandwidth and mobile storage. Though inference speed is not our focus, MICIK has great potential for substantial speed-up. In Eq. (4), W x ≈ (U V + S) x = U (V x) + S x, i.e., the inference on the layer is broken down into two parallel computations U (V x) and S x, both of which can be computed efficiently: (1) the low-rank matrix is decomposed into two small matrices U and V, and the inference time is greatly reduced by calculating V x before multiplying by U; (2) for the sparse matrix S, a higher sparsity rate enables more efficient use of sparse matrix-vector multiplication operators [23], and LASSO-regression-based pruning [16] has been shown to help accelerate inference; in addition, novel hardware such as specialized neural computing chips has been proposed to further speed up sparse matrix computation [12]. Further investigation of the inference speed will be our future work.
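The speed-up argument can be checked numerically: the factored form computes exactly the same layer output, at a fraction of the multiply-add count. The FLOP-counting helpers and the example sizes below are our own illustrative choices.

```python
import numpy as np

def dense_flops(m, n):
    """Multiply-adds for one dense mat-vec W @ x with W of size m x n."""
    return 2 * m * n

def factored_flops(m, n, r, nnz):
    """U(Vx) costs r*n + m*r multiply-adds; Sx costs one multiply-add
    per non-zero entry of the sparse matrix S."""
    return 2 * (r * n + m * r) + 2 * nnz

# Sanity check that the factored form computes the same layer output.
rng = np.random.default_rng(0)
m = n = 128
r = 8
U = rng.standard_normal((m, r))
V = rng.standard_normal((r, n))
S = np.zeros((m, n))
S[rng.integers(0, m, 50), rng.integers(0, n, 50)] = 1.0
x = rng.standard_normal(n)
y_factored = U @ (V @ x) + S @ x
y_dense = (U @ V + S) @ x
```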

5 Conclusions and Future Work

In this work, we propose a new deep model compression algorithm termed MICIK, which compresses deep neural networks by integrating intra-layer and inter-layer inherent similarity knowledge. Extensive experiments on ImageNet demonstrate the evident superiority of MICIK over the state of the art. In the future, we will investigate adaptive adjustment of the shared components among different layers.