Model-Agnostic Structured Sparsification with Learnable Channel Shuffle

02/19/2020 ∙ by Xin-Yu Zhang, et al.

Recent advances in convolutional neural networks (CNNs) usually come at the expense of considerable computational overhead and memory footprint. Network compression aims to alleviate this issue by training compact models with comparable performance. However, existing compression techniques either entail dedicated expert design or incur a moderate performance drop. To this end, we propose a model-agnostic structured sparsification method for efficient network compression. The proposed method automatically induces structurally sparse representations of the convolutional weights, thereby facilitating the implementation of the compressed model with the highly-optimized group convolution. We further address the problem of inter-group communication with a learnable channel shuffle mechanism. The proposed approach is model-agnostic and achieves high compression rates with a negligible performance drop. Extensive experimental results and analysis demonstrate that our approach performs favorably against the state-of-the-art network pruning methods. The code will be publicly available after the review process.




1 Introduction


Figure 1: Trade-off between accuracy and complexity on the ImageNet (ILSVRC15) dataset. Our method (MASS) is highlighted with the solid lines (upper left is better).

Convolutional Neural Networks (CNNs) have made significant advances in a wide range of vision and learning tasks  (krizhevsky2012imagenet; gehring2017convolutional; long2015fully; girshick2015fast). However, the performance gains usually entail heavy computational costs, which make the deployment of CNNs on portable devices difficult. To meet the memory and computational constraints in real-world applications, numerous model compression techniques have been developed.

Existing network compression techniques are mainly based on weight quantization (chen2015compressing; courbariaux2016binarized; rastegari2016xnor; wu2016quantized), knowledge distillation (hinton2015distilling; chen2017learning; yim2017gift), and network pruning (li2017pruning; he2017channel; liu2017learning; molchanov2019importance). Weight quantization methods use low bit-width numbers to represent weights and activations, which usually brings a moderate performance degradation. Knowledge distillation schemes transfer knowledge from a large teacher network to a compact student network, and are typically sensitive to the choice of teacher/student network architecture (mirzadeh2019improved; liu2019search). Closely related to our work, network pruning approaches reduce the model size by removing a proportion of model parameters that are considered unimportant. Notably, filter pruning algorithms (liu2017learning; he2017channel; li2017pruning) remove entire filters and result in structured architectures that can be readily incorporated into modern BLAS libraries.

Identifying unimportant filters is critical to pruning methods. It is well-known that the weight norm can serve as a good indicator of the corresponding filter importance (li2017pruning; liu2017learning). Filters corresponding to smaller weight norms are considered to contribute less to the outputs. Furthermore, $\ell_1$ regularization can be used to increase sparsity (liu2017learning). Despite the advances, several issues in the existing pruning methods can be improved: 1) pruning a large proportion of convolutional filters will result in severe performance degradation; 2) pruning alters the input/output feature dimensions, and thus meticulous adaptation is required to handle special network architectures (e.g., residual connections (he2016deep) and dense connections (huang2017densely)).

Before presenting the proposed method, we briefly introduce the group convolution (GroupConv) (krizhevsky2012imagenet), which plays an important role in this work. For the typical convolution operation, the output features are densely connected with the entire input features, while for the GroupConv, the input features are equally split into several groups and transformed within each group independently. Essentially, each output channel is connected with only a proportion of the input channels, which leads to sparse neuron connections. Therefore, deep CNNs with GroupConvs can be trained on less powerful GPUs with a smaller amount of memory. In this work, we propose a novel approach for network compression where unimportant neuron connections are pruned to facilitate the usage of GroupConvs. Nevertheless, converting vanilla convolutions into GroupConvs is a challenging task. First, not all sparse neuron connectivities correspond to valid GroupConvs; certain requirements must be satisfied, e.g., mutual exclusiveness of different groups. To guarantee the desired structured sparsity, we impose structured regularization upon the convolutional weights and zero out the sparsified weights. Another challenge is that stacking multiple GroupConvs sequentially will hinder the inter-group information flow. The ShuffleNet (zhang2018shufflenet) method proposes a channel shuffle mechanism, i.e., gathering channels from distinct groups, to ensure the inter-group communication, though the order of permutation is hand-crafted. In contrast, we solve the problem of channel shuffle in a learning-based scheme. Concretely, we formulate the learning of channel shuffle as a linear programming problem, which can be solved by efficient algorithms like the network simplex method (bonneel2011displacement). Since the structured sparsity is induced among the convolutional weights, our method is named Model-Agnostic Structured Sparsification, abbreviated as MASS.

The proposed structured sparsification method is designed for three goals. First, our approach is model-agnostic. A wide range of backbone architectures are amenable to our method without the need for any special adaptation. Second, our method is capable of achieving high compression rates. In modern efficient network architectures, the complexity of $3 \times 3$ convolutions is highly compressed, while the computation bottleneck becomes the point-wise convolutions (i.e., $1 \times 1$ convolutions) (zhang2018shufflenet). For example, the point-wise convolutions occupy 81.5% of the total FLOPs in the MobileNet-V2 (sandler2018mobilenetv2) backbone and 93.4% in ResNeXt (xie2017aggregated). Our method is applicable to all convolution operators so that a high compression rate is reachable. Third, our approach brings a negligible performance drop. As all of the filters are preserved under our methodology, we retain stronger representational capacity of the compressed model and achieve a better accuracy-complexity trade-off than the pruning-based counterparts (see Fig. 1).

The main contributions of this work are three-fold:

  • We propose a learnable channel shuffle mechanism (Sec. 3.2) in which the permutation of the convolutional weight norm matrix is learned via linear programming.

  • Upon the permuted weight norm matrix, we impose structured regularization (Sec. 3.3) to obtain valid GroupConvs by zeroing out the sparsified weights.

  • With the structurally sparse convolutional weights, we design the criteria of learning cardinality (Sec. 3.4) in which unimportant neuron connections are pruned with minimal impact on the entire network.

Incorporating the learnable channel shuffle mechanism, the structured regularization and the grouping criteria, the proposed structured sparsification method performs favorably against the state-of-the-art network pruning techniques on both CIFAR (krizhevsky2009learning) and ImageNet (ILSVRC15) datasets.

2 Related Work

Network Compression.

Compression methods for deep models can be broadly categorized based on weight quantization, knowledge distillation, and network pruning. Closely related to our work are the network pruning approaches based on filter pruning. It is widely acknowledged that filters with smaller weight norms make negligible contributions to the outputs and can be pruned. li2017pruning prune filters with smaller $\ell_1$ norm, while liu2017learning remove those corresponding to smaller batch-norm scaling factors, on which an $\ell_1$ regularization term is imposed to increase sparsity.

However, techniques that remove the entire filters based on the weight norm may negatively affect the representational capacity significantly. Instead of removing the entire filters, the proposed structured sparsification method enforces structured sparsity among neuron connections and merely removes certain unimportant connections while the entire filters are preserved. As such, the network capacity is less affected than pruning-based approaches (li2017pruning; liu2017learning; he2019filter; molchanov2019importance). Furthermore, our method does not alter the input/output dimensions, and can be easily incorporated into numerous backbone models.

Group Convolution.

Group convolution (GroupConv) is introduced in the AlexNet (krizhevsky2012imagenet) to overcome the GPU memory constraints. GroupConv partitions the input features into mutually exclusive groups and transforms the features within each group in parallel. Compared with the vanilla (i.e., densely connected) convolution, a GroupConv with $G$ groups can reduce the computational cost and number of parameters by a factor of $G$. The ResNeXt (xie2017aggregated) designs a multi-branch architecture by employing GroupConvs and defines cardinality as the number of parallel transformations, which is simply the group number in each GroupConv. If the cardinality equals the number of channels, GroupConv becomes the depthwise separable convolution, which is widely used in recent lightweight neural architectures (howard2017mobilenets; sandler2018mobilenetv2; zhang2018shufflenet; ma2018shufflenet; chollet2017xception).

However, the aforementioned methods all treat the cardinality as a hyper-parameter, and the connectivity patterns between consecutive features are hand-crafted as well. On the other hand, there is also a line of research focusing on learnable GroupConvs (huang2018condensenet; wang2019fully; zhang2019differentiable). Both CondenseNet (huang2018condensenet) and FLGC (wang2019fully) pre-define the cardinality of each GroupConv and learn the connectivity patterns. We note that the work by zhang2019differentiable learns both the cardinality and neuron connectivity simultaneously. Essentially, this dynamic grouping convolution is modeled by a binary relationship matrix $U$, where $U_{ji}$ indicates the connectivity between the $i$-th input channel and the $j$-th output channel. To guarantee that the resulting operator is a valid GroupConv, the relationship matrix is constructed using a Kronecker product of several binary symmetric matrices. Nevertheless, the Kronecker product gives a sufficient but not necessary condition, and the space of all valid relationship matrices is not fully exploited.

Our method decouples the learning of cardinality and connectivity. Motivated by the norm-based criterion in the network pruning methods (li2017pruning; liu2017learning), we quantify the importance of each neuron connection by the corresponding weight norm and learn the connectivity pattern by permuting the weight norm matrix. Besides, the structured regularization is imposed on the permuted weight norm matrix and the cardinality is learned accordingly. The essential difference between our approach and prior art (zhang2019differentiable) is that all possible neuron connectivity patterns, i.e., relationship matrices, can be reached by our method.

Channel Shuffle Mechanism.

The ShuffleNet (zhang2018shufflenet) combines the channel shuffle mechanism with GroupConv for efficient network design, in which channels from different groups are gathered so as to facilitate inter-group communication. Without channel shuffle, stacking multiple GroupConvs will eliminate the information flow among different groups and weaken the representational capacity. Different from the hand-crafted counterpart (zhang2018shufflenet), the proposed channel shuffle operation is learnable over the space of all possible channel permutations. Furthermore, without bells and whistles, our channel shuffle only involves a simple permutation along the channel dimension, which can be conveniently implemented by an index operation.
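To make the "index operation" remark concrete, the following is a minimal NumPy sketch of a channel shuffle implemented as plain indexing. The function names and the convention that `perm[k]` gives the source channel placed at position `k` are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def channel_shuffle(features, perm):
    """Permute the channel axis (axis 0) of a C x H x W feature map.

    Assumed convention: perm[k] is the index of the source channel
    that ends up at position k.
    """
    perm = np.asarray(perm)
    assert sorted(perm.tolist()) == list(range(features.shape[0]))
    return features[perm]  # a single fancy-indexing operation

def inverse_permutation(perm):
    """Return the permutation that undoes `perm`."""
    perm = np.asarray(perm)
    inv = np.empty(len(perm), dtype=int)
    inv[perm] = np.arange(len(perm))
    return inv

features = np.arange(4 * 2 * 2).reshape(4, 2, 2)  # toy map with 4 channels
perm = [2, 0, 3, 1]                               # hypothetical shuffle
shuffled = channel_shuffle(features, perm)
restored = channel_shuffle(shuffled, inverse_permutation(perm))
```

Because the shuffle is only a re-indexing, it adds no parameters and can be undone exactly with the inverse permutation.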

Neural Architecture Search.

Neural Architecture Search (NAS) (zoph2016neural; baker2016designing; zoph2018learning; real2019regularized; wu2019fbnet) aims to automate the process of learning neural architectures within certain budgets of computational resources. Existing NAS algorithms are developed based on reinforcement learning (zoph2016neural; baker2016designing; zoph2018learning), evolutionary search (real2017large; real2019regularized), and differentiable approaches (liu2018darts; wu2019fbnet). Our method can be viewed as a special case of hyper-parameter (i.e., cardinality) optimization and neuron connectivity search. However, different from existing approaches that evaluate numerous architectures, the proposed method can determine the compressed architecture in one single training pass and is more scalable than most NAS methods.

3 Model-Agnostic Structured Sparsification

3.1 Overview

The structured sparsification method is designed to zero out a proportion of the convolutional weights so that the vanilla convolutions can be converted into group convolutions (GroupConvs), and meanwhile the optimal neuron connectivity can be learned. We adopt the “train, compress, finetune” pipeline, in a way similar to the recent pruning approaches (liu2017learning). Concretely, we first train a large model with structured regularization, then convert vanilla convolutions into GroupConvs under certain criteria, and finally finetune the compressed model to recover accuracy. The connectivity patterns are learned during the training stage, as the structured regularization heavily depends on them. As such, three issues need to be addressed: 1) how to learn the connectivity patterns (Sec. 3.2); 2) how to design the structured regularization (Sec. 3.3); and 3) how to decide the grouping criteria (Sec. 3.4). Additional details of our pipeline are presented in Sec. 3.5.

[Figure 2 panels: (a) channel connectivity; (b) weight norm matrix. Stage labels: channel shuffle, structured regularization.]

Figure 2: Illustration of the learnable channel shuffle mechanism. The original convolutional weights (first column) are shuffled along the input/output channel dimensions in order to solve Eq. 4. The structured regularization is imposed upon the permuted weight norm matrix (second column) to increase structured sparsity, and connections with small weight norms are discarded (third column). In the original ordering of channels, a structurally sparse connectivity pattern is learned (fourth column), and notably every valid connectivity pattern can be reached in this manner.

3.2 Learning Connectivity with Linear Programming

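Let $F \in \mathbb{R}^{C^{in} \times H \times W}$ be the input feature map, where $C^{in}$ denotes the number of input channels. We apply a vanilla convolution with weights $W \in \mathbb{R}^{C^{out} \times C^{in} \times K \times K}$ to $F$, i.e., $\tilde{F} = W * F$, where $\tilde{F} \in \mathbb{R}^{C^{out} \times H \times W}$ with $C^{out}$ denoting the number of output channels. (For simplicity, we omit the bias term from Eq. 1, and assume the convolution operator is of stride 1 with proper paddings.) Each entry of $\tilde{F}$ is a weighted sum of a local patch of $F$, namely,

$$\tilde{F}_{j,x,y} = \sum_{i=1}^{C^{in}} \sum_{u=1}^{K} \sum_{v=1}^{K} W_{j,i,u,v}\, F_{i,\,x+u-1,\,y+v-1}. \tag{1}$$

In Eq. 1, the $i$-th channel of $F$ relates to the $j$-th channel of $\tilde{F}$ via the weights $W_{j,i} \in \mathbb{R}^{K \times K}$. Motivated by the norm-based importance estimation in filter pruning (li2017pruning; liu2017learning), we quantify the importance of the connection between the $i$-th channel of $F$ and the $j$-th channel of $\tilde{F}$ by $\|W_{j,i}\|$. Thus, the importance matrix $S \in \mathbb{R}^{C^{out} \times C^{in}}$ can be defined as the norm along the “kernel size” dimensions of $W$, i.e., $S_{ji} = \|W_{j,i}\|$.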
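As an illustration of the importance matrix, the sketch below reduces a 4-D weight tensor over its two kernel dimensions. The `(C_out, C_in, K, K)` layout follows the common PyTorch convention, and the default choice of the $\ell_1$ norm is an assumption made here for concreteness; the helper name is hypothetical.

```python
import numpy as np

def weight_norm_matrix(weight, order=1):
    """Return S with S[j, i] = norm of the K x K kernel linking
    input channel i to output channel j.

    `weight` is assumed to have shape (C_out, C_in, K, K); `order=1`
    (the l1 norm) is an illustrative default, not a confirmed choice.
    """
    c_out, c_in = weight.shape[:2]
    flat = np.abs(weight.reshape(c_out, c_in, -1))
    if order == 1:
        return flat.sum(axis=-1)
    return (flat ** order).sum(axis=-1) ** (1.0 / order)

rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 4, 3, 3))  # toy layer: 4 inputs, 8 outputs
weight[0, 1] = 0.0                      # sever the connection input 1 -> output 0
S = weight_norm_matrix(weight)
```

A zero entry in `S` corresponds exactly to a connection whose whole kernel is zero, which is what makes `S` a usable proxy for connectivity.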

Next, we extend our formulation to GroupConvs with cardinality $G$. A GroupConv can be considered as a convolution with sparse neuron connectivity, in which only a proportion of input channels is visible to each output channel. Without loss of generality, we assume both $C^{in}$ and $C^{out}$ are divisible by $G$, and Eq. 1 can be adapted as

$$\tilde{F}_{j,x,y} = \sum_{i=(g_j-1)n+1}^{g_j n} \sum_{u=1}^{K} \sum_{v=1}^{K} W_{j,i,u,v}\, F_{i,\,x+u-1,\,y+v-1}, \tag{2}$$

where $g_j = \lceil jG / C^{out} \rceil$ indicates that the $j$-th output channel belongs to the $g_j$-th group, and $n = C^{in}/G$ denotes the number of input channels within each group. Clearly, the valid entries of $W$ form a block diagonal matrix with $G$ equally-split blocks at the input/output channel dimensions. Thus, the GroupConv module requires $C^{out} C^{in} K^2 / G$ parameters and $C^{out} C^{in} K^2 H W / G$ FLOPs for processing the feature $F$, and the computational complexity is reduced by a factor of $G$ compared with the vanilla counterpart.
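The factor-of-$G$ saving can be checked with a few lines of arithmetic. The sketch below counts parameters and multiply-accumulate operations for a stride-1, padded convolution without bias; the function name and the example layer sizes are illustrative assumptions.

```python
def group_conv_cost(c_in, c_out, k, h, w, groups=1):
    """Parameter and multiply-accumulate counts of a (group) convolution.

    Assumes stride 1 with padding (output is h x w) and no bias term.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by `groups`")
    params = c_out * (c_in // groups) * k * k
    flops = params * h * w  # one weight use per output position
    return params, flops

# Hypothetical layer: 64 -> 128 channels, 3x3 kernel, 32x32 feature map.
vanilla = group_conv_cost(64, 128, 3, 32, 32, groups=1)
grouped = group_conv_cost(64, 128, 3, 32, 32, groups=4)
reduction = vanilla[0] / grouped[0]
```

With `groups=4`, both the parameter count and the FLOPs drop to one quarter of the vanilla values.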

We note that if a vanilla convolution operator can be converted into a GroupConv without affecting its functional property (we call such convolution operators groupable), the convolutional weights $W$ must be block diagonal after certain permutations along the input/output channel dimensions. Due to the positive definiteness of the norm and the fact that permuting $W$ along the channel dimensions corresponds to permuting the rows and columns of $S$, a necessary and sufficient condition for a convolution operator to be groupable is that

$$\exists\, P \in \mathcal{P}^{C^{out}},\ Q \in \mathcal{P}^{C^{in}} \ \text{such that}\ S' = P S Q \ \text{is block diagonal with equally-split blocks}, \tag{3}$$

where $\mathcal{P}^{N}$ denotes the set of $N \times N$ permutation matrices. Here, the permutation matrices $P$ and $Q$ shuffle the channels of the output and input features, respectively, and thus determine the connectivity pattern between $F$ and $\tilde{F}$ (see Fig. 2).
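The groupability condition can be illustrated on a toy weight norm matrix: a matrix that looks unstructured becomes block diagonal once the right row and column permutations are applied. In this sketch permutations are stored as index arrays rather than matrices, and all names are hypothetical.

```python
import numpy as np

def block_mask(c_out, c_in, groups):
    """Boolean mask that is True on the diagonal blocks of a GroupConv."""
    row_group = np.arange(c_out) // (c_out // groups)
    col_group = np.arange(c_in) // (c_in // groups)
    return row_group[:, None] == col_group[None, :]

def is_block_diagonal(S_perm, groups, tol=0.0):
    """True if every entry outside the diagonal blocks is (numerically) zero."""
    mask = block_mask(S_perm.shape[0], S_perm.shape[1], groups)
    return bool(np.all(S_perm[~mask] <= tol))

rng = np.random.default_rng(0)
S_block = rng.uniform(0.5, 1.0, size=(4, 4)) * block_mask(4, 4, 2)

# Scramble the channel order; the structure is hidden but still present.
row_perm = np.array([2, 0, 3, 1])
col_perm = np.array([1, 3, 0, 2])
S = S_block[row_perm][:, col_perm]

# Applying the inverse permutations (argsort of a permutation is its
# inverse) exposes the block diagonal structure again.
S_recovered = S[np.argsort(row_perm)][:, np.argsort(col_perm)]
```

Here `S` fails the block-diagonal test in its given channel order, while `S_recovered` passes it, so the underlying operator is groupable with two groups.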

However, a randomly initialized and trained convolution operator is by no means groupable unless special sparsity constraints are imposed. To this end, we resort to permuting $S$ so as to make $S' = PSQ$ “as block diagonal as possible”. The next question is how to rigorously define the term “as block diagonal as possible”. Here, we assume both $C^{in}$ and $C^{out}$ are powers of 2, an assumption that the most widely-used backbone architectures (e.g., VGG (vgg) and ResNet (he2016deep)) satisfy. (Similar reasoning can be applied if both $C^{in}$ and $C^{out}$ have many factors of 2. See the supplementary materials for details.) Then, the potential cardinality is also a power of 2. As the cardinality grows, more and more non-diagonal blocks are zeroed out (see Fig. 3(c)). As illustrated in Fig. 3(b), we define the cost matrix $R$ to progressively penalize the non-zero entries of the non-diagonal blocks. Finally, we utilize $\langle S', R \rangle$ as a metric of the “block-diagonality” of the matrix $S'$, where $\langle \cdot, \cdot \rangle$ indicates element-wise multiplication and summation over all entries, i.e., $\langle A, B \rangle = \sum_{j,i} A_{ji} B_{ji}$. Formally, the optimization problem is formulated as follows:

$$\min_{P, Q}\ \langle P S Q, R \rangle \quad \text{s.t.}\quad P \in \mathcal{P}^{C^{out}},\ Q \in \mathcal{P}^{C^{in}}. \tag{4}$$

Solving Eq. 4 gives the optimal connectivity pattern between the adjacent layers.
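One plausible way to build such a progressive cost matrix is sketched below. The exact construction is not recoverable from the text, so the accumulation rule used here (add a penalty of `decay ** level` to every entry lying outside the diagonal blocks at cardinality 2, 4, 8, ...) is an assumption, as are the function names and the default decay of 0.5.

```python
import numpy as np

def block_mask(c_out, c_in, groups):
    """Boolean mask that is True on the diagonal blocks of a GroupConv."""
    row_group = np.arange(c_out) // (c_out // groups)
    col_group = np.arange(c_in) // (c_in // groups)
    return row_group[:, None] == col_group[None, :]

def cost_matrix(c_out, c_in, decay=0.5):
    """Assumed cost matrix: entries far from the diagonal accumulate the
    penalties of every cardinality at which they would be zeroed out."""
    R = np.zeros((c_out, c_in))
    level, groups = 0, 2
    while groups <= min(c_out, c_in):
        R += (decay ** level) * ~block_mask(c_out, c_in, groups)
        level += 1
        groups *= 2
    return R

def block_diagonality(S_perm, R):
    """The inner product <S', R>; lower means closer to block diagonal."""
    return float((S_perm * R).sum())

R = cost_matrix(4, 4)
S_block = np.kron(np.eye(2), np.ones((2, 2)))  # two clean groups
S_scrambled = S_block[:, [0, 2, 1, 3]]         # same weights, mixed columns
cost_block = block_diagonality(S_block, R)
cost_scrambled = block_diagonality(S_scrambled, R)
```

Under this assumed construction, the scrambled matrix receives a strictly higher cost than the block diagonal one, which is the behaviour the metric is meant to capture.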

[Figure 3 panels, shown at successive group levels: (a) permuted weight norm matrix; (b) structured regularization; (c) relationship matrix.]

Figure 3: Illustration of the structured regularization matrix and the relationship matrix corresponding to the group level $g$. (a) Heat map of the permuted weight norm matrix $S'$. Non-diagonal blocks of the weight norm are sparsified. (b) Structured regularization matrix $R_g$. The regularization coefficient decays exponentially as the group level grows. A special case of the decay rate of 0.5 is demonstrated. Besides, the matrix $R_g$ depends on the current group level $g$, and when the maximal possible group level is achieved, the matrix $R_g$ becomes the cost matrix $R$ in Eq. 4. (c) Relationship matrix $U_g$. The entries of the permuted weight norm matrix corresponding to the zero entries of the relationship matrix will be zeroed out during grouping.

However, minimization over the set of permutation matrices is a non-convex and NP-hard problem that requires combinatorial search. To this end, we relax the feasible space to its convex hull. The Birkhoff-von Neumann theorem (birkhoff1946three) states that the convex hull of the set of permutation matrices is the set of doubly-stochastic matrices (non-negative square matrices whose rows and columns sum to one), known as the Birkhoff polytope:

$$\mathcal{B}^{N} = \left\{ X \in \mathbb{R}_{\ge 0}^{N \times N} : X \mathbf{1}_N = \mathbf{1}_N,\ X^\top \mathbf{1}_N = \mathbf{1}_N \right\}, \tag{5}$$

where $\mathbf{1}_N$ denotes the column vector composed of $N$ ones.
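A small numerical check makes the relaxation tangible: permutation matrices are doubly stochastic, and so is any convex combination of them, whereas an arbitrary non-negative matrix generally is not. The helper names below are illustrative.

```python
import numpy as np

def is_doubly_stochastic(X, tol=1e-9):
    """Check membership in the Birkhoff polytope: square, non-negative,
    with every row and every column summing to one."""
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[0] != X.shape[1]:
        return False
    return bool(
        np.all(X >= -tol)
        and np.allclose(X.sum(axis=0), 1.0, atol=tol)
        and np.allclose(X.sum(axis=1), 1.0, atol=tol)
    )

def permutation_matrix(perm):
    """Matrix whose row r is the unit vector e_{perm[r]}."""
    return np.eye(len(perm))[list(perm)]

P1 = permutation_matrix([1, 0, 2])
P2 = permutation_matrix([2, 1, 0])
mix = 0.3 * P1 + 0.7 * P2  # a point strictly inside the polytope
```

`P1` and `P2` are vertices of the polytope, while `mix` is a doubly-stochastic matrix that is not itself a permutation.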


We solve Eq. 4 with coordinate descent. That is, we iteratively update $P$ and $Q$ until convergence. When updating one variable, we consider the other as fixed. For example, when optimizing $P$, the objective function can be transformed as follows:

$$\min_{P \in \mathcal{B}^{C^{out}}} \langle P S Q, R \rangle = \min_{P \in \mathcal{B}^{C^{out}}} \langle P, R Q^\top S^\top \rangle. \tag{6}$$

As the objective is a linear function of $P$ and the Birkhoff polytope is a convex polytope, Eq. 6 is a linear programming problem, which can be solved by efficient algorithms such as the network simplex method (bonneel2011displacement). In addition, the theory of linear programming guarantees that at least one of the optimal solutions is attained at a vertex of the polytope, and the vertices of the Birkhoff polytope are precisely the permutation matrices (birkhoff1946three). Thus, in Eq. 6, minimization over the Birkhoff polytope is equivalent to minimization over the set of permutation matrices, and the solution is naturally a permutation matrix without the need for post-processing. Furthermore, Eq. 6 has the same formulation as the optimal transport problem, and a sophisticated computation library is available for efficient linear programming.
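The alternating scheme can be sketched on a toy problem as follows. For self-containment, each linear sub-problem is solved here by brute force over all permutations, which is only feasible for very small matrices; this is a stand-in for the network simplex or any other assignment solver, not the authors' implementation. The cost matrix `R`, the stopping rule, and all names are illustrative assumptions.

```python
import itertools
import numpy as np

def perm_matrix(perm):
    """Matrix X with X[r, perm[r]] = 1."""
    return np.eye(len(perm))[list(perm)]

def best_permutation(M):
    """argmin over permutation matrices X of <X, M> (brute force, small N)."""
    n = M.shape[0]
    best = min(itertools.permutations(range(n)),
               key=lambda p: sum(M[r, p[r]] for r in range(n)))
    return perm_matrix(best)

def objective(P, S, Q, R):
    return float(((P @ S @ Q) * R).sum())

def coordinate_descent(S, R, n_iters=10):
    P, Q = np.eye(S.shape[0]), np.eye(S.shape[1])
    history = [objective(P, S, Q, R)]
    for _ in range(n_iters):
        P = best_permutation(R @ Q.T @ S.T)  # <PSQ, R> = <P, R Q^T S^T>
        Q = best_permutation(S.T @ P.T @ R)  # <PSQ, R> = <Q, S^T P^T R>
        history.append(objective(P, S, Q, R))
        if history[-1] >= history[-2]:       # no further improvement
            break
    return P, Q, history

# Toy problem: a two-group structure hidden by a column scramble.
S = np.kron(np.eye(2), np.ones((2, 2)))[:, [0, 2, 1, 3]]
R = 1.0 - np.kron(np.eye(2), np.ones((2, 2)))  # penalise off-block entries
P, Q, history = coordinate_descent(S, R)
```

Each update minimises the objective over one variable with the other held fixed, so the recorded objective values never increase. Like any coordinate descent on a non-convex problem, the procedure may stop at a local optimum in general.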

3.3 Structured Regularization

Permutation alone does not suffice to induce structurally sparse convolutional weights, and we still need to impose special sparsity regularization to achieve the desired sparsity structure. Inspired by the sparsity-inducing penalty in liu2017learning, we impose the structured regularization on the permuted weight norm matrix $S' = PSQ$. We first define the group level $g$ as illustrated in Fig. 3, which indicates the current cardinality achieved, i.e., $G = 2^{g-1}$, and is determined in Sec. 3.4. Then, given the current group level $g$, the structured regularization is formulated as $\mathcal{R}_g(S') = \langle S', R_g \rangle$, where $R_g$ denotes the structured regularization matrix as illustrated in Fig. 3(b). (Here, we simply compute the regularization of a single convolutional layer. In the experiments, the regularization is the summation of those of all the convolution layers.) Intuitively, at group level $g$, additional regularization is imposed upon the non-diagonal blocks to be zeroed out if the group level of $g+1$ is achieved. Furthermore, the regularization coefficient decays exponentially as the group level grows, as we desire a balanced cardinality distribution across the network. In the end, the overall loss becomes

$$\mathcal{L} = \mathcal{L}_{data} + \lambda\, \mathcal{R}_g(S'), \tag{7}$$

where $\mathcal{L}_{data}$ denotes the regular data loss (standard classification loss in the following experiments) and $\lambda$ is the balancing scalar.
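The sketch below shows how the regularized loss could be evaluated for a single layer. The construction of the regularization matrix is an assumption pieced together from the description: it penalizes the off-diagonal blocks of group levels 2 up to $g+1$, with a coefficient that decays per level, and it assumes that group level $g$ corresponds to cardinality $2^{g-1}$. All names and the toy numbers are hypothetical.

```python
import numpy as np

def off_block_mask(c_out, c_in, groups):
    """1.0 outside the diagonal blocks of a GroupConv, 0.0 inside."""
    rows = np.arange(c_out) // (c_out // groups)
    cols = np.arange(c_in) // (c_in // groups)
    return (rows[:, None] != cols[None, :]).astype(float)

def regularization_matrix(c_out, c_in, g, decay=0.5):
    """Assumed R_g: penalise the off-diagonal blocks of levels 2..g+1."""
    R = np.zeros((c_out, c_in))
    for level in range(2, g + 2):
        groups = 2 ** (level - 1)  # assumed level-to-cardinality mapping
        if groups > min(c_out, c_in):
            break                  # maximal group level reached
        R += decay ** (level - 2) * off_block_mask(c_out, c_in, groups)
    return R

def total_loss(data_loss, S_perm, g, lam, decay=0.5):
    """Data loss plus lam * <S', R_g> for one convolutional layer."""
    R_g = regularization_matrix(S_perm.shape[0], S_perm.shape[1], g, decay)
    return data_loss + lam * float((S_perm * R_g).sum())

S_perm = np.ones((4, 4))  # toy permuted weight norm matrix
loss_level_1 = total_loss(1.0, S_perm, g=1, lam=0.1)
loss_level_2 = total_loss(1.0, S_perm, g=2, lam=0.1)
```

In a real training loop the penalty would be computed from the live weights so that its gradient pushes the off-block norms toward zero; this sketch only evaluates the scalar.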

3.4 Criteria of Learning Cardinality

With the structurally sparsified convolutional weights, the next step is to determine the cardinality. The core idea of our criteria is that the weight norms corresponding to the valid connections constitute at least a certain proportion of the total weight norms. At group level $g$, the following requirement should be satisfied:

$$\langle S', U_g \rangle \ge p \sum_{j,i} S'_{ji}, \tag{8}$$

where $p$ is a threshold set to 0.9 in all of our experiments, and $U_g$ is the relationship matrix (zhang2019differentiable) as illustrated in Fig. 3(c). The matrix $U_g$ specifies the valid neuron connections at group level $g$. Therefore, the current group level can be determined by

$$g = \max \left\{ g' : \langle S', U_{g'} \rangle \ge p \sum_{j,i} S'_{ji} \right\}. \tag{9}$$
Initially update the permutation matrices $P$ and $Q$.
for $t := 1$ to #epochs do
    Train for 1 epoch with the structured regularization;
    Solve Eq. 4 to update the matrices $P$ and $Q$;
    Determine the current group levels by Eq. 9;
    Update the structured regularization matrices;
    Adjust the coefficient $\lambda$.
end for
Algorithm 1 Training Pipeline.
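The group-level selection step of the pipeline (the threshold criterion of Sec. 3.4) can be sketched as follows. The mapping from group level $g$ to cardinality $2^{g-1}$, the power-of-2 channel counts, and the function names are assumptions made for illustration.

```python
import numpy as np

def relationship_matrix(c_out, c_in, g):
    """Binary matrix marking the valid connections at group level g
    (assumed cardinality 2 ** (g - 1))."""
    groups = 2 ** (g - 1)
    rows = np.arange(c_out) // (c_out // groups)
    cols = np.arange(c_in) // (c_in // groups)
    return (rows[:, None] == cols[None, :]).astype(float)

def current_group_level(S_perm, p=0.9):
    """Largest level whose valid connections keep at least a fraction p
    of the total weight norm."""
    c_out, c_in = S_perm.shape
    total = S_perm.sum()
    max_level = min(c_out, c_in).bit_length()  # log2(n) + 1 for n a power of 2
    level = 1
    for g in range(1, max_level + 1):
        kept = (S_perm * relationship_matrix(c_out, c_in, g)).sum()
        if kept >= p * total:
            level = g
    return level

# Toy matrix: strong weights inside two diagonal blocks, weak ones outside.
S_sparse = np.full((4, 4), 0.01)
S_sparse[:2, :2] = 1.0
S_sparse[2:, 2:] = 1.0
level_sparse = current_group_level(S_sparse)
level_dense = current_group_level(np.ones((4, 4)))
```

For the sparsified toy matrix the criterion selects the level that corresponds to two groups, while the dense matrix stays at the lowest level, i.e., a vanilla convolution.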

3.5 Pipeline Details

Methods         #Params. ()    FLOPs ()        Acc. (%)
--------------------------------------------------------------
Baseline        2.20           3.53            91.70 (0.12)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Slimming-40%    1.91 (0.00)    3.10 (0.02)     91.74 (0.35)
MASS-20%        1.76 (0.00)    3.18 (0.07)     91.79 (0.23)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Slimming-60%    1.36 (0.02)    2.24 (0.01)     89.68 (0.38)
MASS-40%        1.31 (0.01)    2.58 (0.00)     91.42 (0.04)
--------------------------------------------------------------
Baseline        5.90           9.16            93.50 (0.19)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Slimming-60%    4.15 (0.03)    5.75 (0.10)     93.10 (0.25)
MASS-30%        4.08 (0.05)    7.17 (0.20)     94.19 (0.16)
MASS-50%        2.96 (0.03)    4.81 (0.03)     93.70 (0.06)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Slimming-80%    2.33 (0.04)    3.50 (0.02)     91.01 (0.02)
MASS-60%        2.34 (0.08)    4.20 (0.08)     93.48 (0.13)
MASS-70%        1.80 (0.00)    3.52 (0.16)     93.25 (0.02)
--------------------------------------------------------------
Baseline        11.47          17.59           94.62 (0.22)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Slimming-40%    9.24 (0.03)    12.55 (0.00)    94.49 (0.12)
MASS-20%        9.12 (0.06)    14.76 (0.02)    94.78 (0.11)
MASS-40%        6.69 (0.24)    11.60 (0.01)    94.55 (0.18)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Slimming-60%    8.15 (0.03)    10.66 (0.00)    94.29 (0.11)
MASS-30%        7.89 (0.03)    12.47 (0.01)    94.69 (0.08)
MASS-60%        5.41 (0.02)    10.66 (0.01)    94.42 (0.04)
--------------------------------------------------------------
Table 1: Network compression results on the CIFAR-10 (krizhevsky2009learning) dataset. “Baseline” means the network without compression. The percentages in our method indicate the compression rate (measured by the reduction of “#Params.”), while those in other methods indicate the pruning ratio.

Implementation Details.

Our implementation is based on the PyTorch (steiner2019pytorch) library. The proposed method is applied to the ResNet (he2016deep) and DenseNet (huang2017densely) families, and evaluated on the CIFAR (krizhevsky2009learning) and ImageNet (ILSVRC15) datasets. For the CIFAR dataset, we follow the common practice of data augmentation (he2016deep; liu2017learning; xie2017aggregated): zero-padding of 4 pixels on each side of the image, random crop of a $32 \times 32$ patch, and random horizontal flip. For fair comparisons, we utilize the same network architecture as liu2017learning, and the model is trained on a single GPU with a batch size of 64. For the ImageNet dataset, we adopt the standard data augmentation strategy (vgg; he2016deep; xie2017aggregated): image resize such that the shortest edge is of 256 pixels, random crop of a $224 \times 224$ patch, and random horizontal flip. The overall batch size is 256, which is distributed to 4 GPUs. For both datasets, we employ the SGD optimizer with momentum 0.9. The source code and trained models will be made available to the public upon acceptance.

Training Protocol.

For the first stage, we train a large model from scratch with the structured regularization described in Sec. 3.3. At the end of each epoch, we update the permutation matrices as in Sec. 3.2, determine the current group levels as in Sec. 3.4, and adjust the structured regularization matrices accordingly. We train with a fixed learning rate of 0.1 for 100 epochs on the CIFAR dataset and 60 epochs on the ImageNet dataset, and exclude the weight decay because the structured regularization is already in place. The coefficient $\lambda$ is dynamically adjusted to meet the desired compression rate (see the supplementary materials). The training pipeline is summarized in Alg. 1.

Finetune Protocol.

The remaining parameters are restored from the training stage and the compressed model is finetuned with an initial learning rate of 0.1. We finetune for 160 epochs on the CIFAR dataset and the learning rate decays by a factor of 10 at 50% and 75% of the total epochs. On the ImageNet dataset, the learning rate is decayed according to the cosine annealing strategy (loshchilov-ICLR17SGDR) within 120 epochs. For both datasets, a standard weight decay is adopted to prevent overfitting.
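The two finetuning schedules can be written as simple functions of the epoch index. This is a sketch under stated assumptions: epochs are counted from zero, the cosine schedule decays to a minimum learning rate of 0 without warm restarts, and the function names are hypothetical.

```python
import math

def step_lr(epoch, total_epochs=160, base_lr=0.1, gamma=0.1):
    """CIFAR-style schedule: divide the rate by 10 at 50% and 75% of training."""
    lr = base_lr
    if epoch >= total_epochs * 0.5:
        lr *= gamma
    if epoch >= total_epochs * 0.75:
        lr *= gamma
    return lr

def cosine_lr(epoch, total_epochs=120, base_lr=0.1, min_lr=0.0):
    """ImageNet-style schedule: cosine annealing from base_lr to min_lr."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

cifar_schedule = [step_lr(e) for e in range(160)]
imagenet_schedule = [cosine_lr(e) for e in range(120)]
```

The step schedule holds three constant plateaus, while the cosine schedule decreases smoothly and reaches half of the initial rate at the midpoint.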

Methods         #Params. ()    GFLOPs    Acc. (%)
---------------------------------------------------
Baseline        25.6           4.14      77.10
- - - - - - - - - - - - - - - - - - - - - - - - - -
NISP-A          18.6           2.97      72.75
Slimming-20%    17.8           2.81      75.12
Taylor-19%      17.9           2.66      75.48
FPGM-30%        N/A            2.39      75.59
MASS-35%        17.2           3.12      76.82
- - - - - - - - - - - - - - - - - - - - - - - - - -
ThiNet-30%      16.9           2.62      72.04
NISP-B          14.3           2.29      72.07
Taylor-28%      14.2           2.25      74.50
FPGM-40%        N/A            1.93      74.83
MASS-65%        10.3           1.67      75.10
- - - - - - - - - - - - - - - - - - - - - - - - - -
ThiNet-50%      12.4           1.83      71.01
Taylor-44%      7.9            1.34      71.69
Slimming-50%    11.1           1.87      71.99
MASS-85%        5.6            0.90      72.47
---------------------------------------------------
Baseline        44.5           7.87      78.64
- - - - - - - - - - - - - - - - - - - - - - - - - -
FPGM-30%        N/A            4.55      77.32
Taylor-25%      31.2           4.70      77.35
MASS-40%        26.7           5.05      78.16
- - - - - - - - - - - - - - - - - - - - - - - - - -
BN-ISTA-v1      17.3           3.69      74.56
BN-ISTA-v2      23.6           4.47      75.27
Taylor-45%      20.7           2.85      75.95
Slimming-50%    20.9           3.16      75.97
MASS-65%        16.5           2.98      77.62
- - - - - - - - - - - - - - - - - - - - - - - - - -
Taylor-60%      13.6           1.76      74.16
MASS-80%        10.6           1.70      75.73
---------------------------------------------------
Baseline        20.0           4.39      77.88
- - - - - - - - - - - - - - - - - - - - - - - - - -
Taylor-40%      12.5           3.02      76.51
MASS-38%        13.1           3.53      77.43
- - - - - - - - - - - - - - - - - - - - - - - - - -
Taylor-64%      9.0            2.21      75.28
MASS-60%        9.2            2.10      75.86
---------------------------------------------------
Table 2: Network compression results on the ImageNet (ILSVRC15) dataset. The center-crop validation accuracy is reported. “Baseline” means the network without compression. The percentages in the table have the same meaning as those in Tab. 1.

4 Experiments and Analysis

In this section, we present the experimental results on the CIFAR and ImageNet datasets. In addition, we carry out ablation studies to demonstrate the effectiveness of components of the proposed method.

4.1 Results on CIFAR

We first compare our proposed method with the Network Slimming (liu2017learning) approach on the CIFAR-10 dataset. The Network Slimming approach is a representative pruning method that compresses CNN models by pruning less important filters. As the experimental results on the CIFAR-10 dataset are subject to random variation, we repeat the train-compress-finetune pipeline 10 times and record the mean and standard deviation (std). As shown in Tab. 1, the proposed MASS method performs favorably under various compression rates. For ResNet-110, with 60% of the parameters compressed, MASS can still achieve 94.42% top-1 accuracy, which is nearly equal to the performance of the baseline method without compression. Compared with the Network Slimming, MASS consistently performs better, especially under high compression rates. Experiments on the CIFAR-10 dataset demonstrate that MASS is able to compress CNNs with a negligible performance drop and favorable accuracy against pruning methods such as Network Slimming.

4.2 Results on ImageNet

Tab. 2 shows the evaluation results of the proposed method against the state-of-the-art network pruning approaches, including ThiNet (luo2017thinet), Slimming (liu2017learning), NISP (yu2018nisp), BN-ISTA (ye2018rethinking), FPGM (he2019filter), and Taylor (molchanov2019importance). Overall, the MASS method performs favorably against the state-of-the-art network compression methods under different settings. These performance gains achieved by the MASS method can be attributed to the fact that discarding the entire filters will negatively affect the representational strength of the network model, especially when the pruning ratio is high, e.g., 50%. In contrast, the MASS method removes only a proportion of neuron connections and preserves all of the filters, thereby making a mild impact on the model capacity. In addition, it is known that pruning neuron connections would eliminate the information flow and affect performance. To alleviate this issue, the learnable channel shuffle mechanism assists the information exchange among different groups, thereby minimizing the potential negative impact.

4.3 Ablation Studies

Accuracy vs. Complexity.

As shown in Fig. 1, the proposed MASS method is designed to make a sound accuracy-complexity trade-off. On the ImageNet (ILSVRC15) dataset, we trade a slight top-1 accuracy drop of 0.28% for about 25% complexity reduction on the ResNet-50 backbone, and an accuracy loss of 1.02% for about 60% reduction on ResNet-101. Furthermore, high compression rates can be achieved in our methodology while maintaining competitive performance. It is worth noting that our method achieves an accuracy of 72.47% with only about 20% of the complexity of the ResNet-50 backbone, which compares favorably against the pruning methods with twice the complexity.

Config.        ResNet-50-65%       ResNet-101-65%
Acc.           Top-1    Top-5      Top-1    Top-5
---------------------------------------------------
Finetune       75.10    92.52      77.62    93.72
FromScratch    75.02    92.46      77.14    93.53
ShuffleNet     74.97    92.41      76.91    93.38
Random         69.45    89.45      73.16    91.44
NoShuffle      73.30    91.39      75.31    92.64
---------------------------------------------------
Table 3: Ablation study of different channel shuffle operations on the ImageNet dataset (ILSVRC15).

Learned Channel Shuffle Mechanism.

We evaluate the effectiveness of our learned channel shuffle mechanism on the ResNet backbone with a compression rate of 65%. We use the following five settings for performance evaluation:

  • Finetune: The preserved parameters after compression are restored and the compressed model is finetuned. For the other four settings, the parameters of the compressed model are re-initialized for the finetune stage.

  • FromScratch: We keep the learned channel connectivity, i.e., $P$ and $Q$, from the training stage, and train the model from randomly re-initialized weights.

  • ShuffleNet: The same channel shuffle operation in the ShuffleNet (zhang2018shufflenet) is adopted. Specifically, if a convolution is of cardinality $G$ and has $C^{out}$ output channels, then the channel shuffle operation is equivalent to reshaping the output channel dimension into $(G, C^{out}/G)$, transposing and flattening it back. Compared with ShuffleNet, the way of channel shuffle is learned rather than pre-defined in our method, i.e., Finetune and FromScratch.

  • Random: The permutation matrices $P$ and $Q$ are randomly given, independent of the learned ones.

  • NoShuffle: The channel shuffle operations are removed, i.e., $P$ and $Q$ are identity matrices.

The results are shown in Tab. 3. First, the finetuned models perform slightly better than those trained from scratch, which implies that the preserved parameters play an essential role in the final performance. Furthermore, the models with the learned channel shuffle mechanism, i.e., the learned neuron connectivity, perform the best among all settings. The channel shuffle mechanism of ShuffleNet (zhang2018shufflenet) is effective, as it outperforms the no-shuffle counterpart; however, it can be further improved by a learning-based strategy. Interestingly, the random channel shuffle scheme performs the worst, even worse than the no-shuffle scheme. This implies that learning the channel shuffle operation is a challenging task, and that randomly gathering channels from different groups gives no benefit.
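The ShuffleNet, Random, and NoShuffle settings above differ only in the channel permutation that is applied. The following is a minimal NumPy sketch of these three permutations, not the authors' implementation. The function names, the toy channel count of 8, and the `(N, C, H, W)` layout are assumptions made for illustration.

```python
import numpy as np


def shufflenet_permutation(num_channels, groups):
    """Index array of the pre-defined ShuffleNet-style channel shuffle."""
    if num_channels % groups != 0:
        raise ValueError("num_channels must be divisible by groups")
    idx = np.arange(num_channels)
    # Reshape to (groups, channels_per_group), transpose, and flatten back.
    return idx.reshape(groups, num_channels // groups).T.reshape(-1)


def shuffle_channels(x, perm):
    """Permute the channel axis of a feature map of shape (N, C, H, W)."""
    return x[:, perm]


# Toy example with 8 channels and 2 groups (illustrative sizes only).
perm_shufflenet = shufflenet_permutation(8, 2)   # ShuffleNet setting
perm_noshuffle = np.arange(8)                    # NoShuffle setting (identity)
perm_random = np.random.default_rng(0).permutation(8)  # Random setting

x = np.arange(8).reshape(1, 8, 1, 1)
shuffled = shuffle_channels(x, perm_shufflenet)
```

In this sketch, a learned shuffle would simply be a different index array passed to `shuffle_channels`; how that array is learned is described in the main text and is not reproduced here.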

4.4 Discussion

To the best of our knowledge, our work is the first to introduce structured sparsification for network compression. As there is still room for improvement, we discuss three potential directions for future work.

  1. Data-Driven Structured Sparsification. Currently, the gradients of the data loss and those of the sparsity regularization are computed independently (Eq. 7) in each training iteration. Thus, the structured regularization is imposed uniformly on the convolutional layers, and the learned cardinality distribution is task-agnostic and prone to uniformity. However, a better cardinality distribution may be achieved if the structured sparsification is guided by the back-propagated signals of the data loss. Optimization-based meta-learning techniques (pmlr-v70-finn17a) can be exploited for this purpose.

  2. Progressive Sparsification Solution. Typically, finetune-free compression techniques are desired in practical applications (cheng2018recent). To this end, the sparsified weights can be removed progressively during training, so that the architecture search and the model training are completed in a single training pass.

  3. Combination with Filter Pruning Techniques. As the entire feature maps are preserved in our approach, the reduction of the memory footprint is limited. This issue can be addressed by combining our approach with filter pruning techniques, which is non-trivial as uniform filter pruning is required within each group. It is of great interest to exploit group sparsity constraints (yoon2017combined) to achieve such uniform sparsity.

5 Conclusion

In this work, we propose a method for efficient network compression. Our approach induces structurally sparse representations of the convolutional weights and the compressed model can be readily incorporated in the modern deep learning libraries thanks to their support for the group convolution. The problem of inter-group communication is further solved via the learnable channel shuffle mechanism. Our approach is model-agnostic and highly compressible with negligible performance degradation, which is validated by extensive experiments on the CIFAR and ImageNet datasets. In addition, experimental evaluation against the state-of-the-art compression approaches shows techniques of structured sparsification can be a fruitful future research direction.


Appendix A Structured Regularization in General Form

Generally, we can relax the constraint that both $C^{\text{in}}$ and $C^{\text{out}}$ are powers of 2, and only assume that both $C^{\text{in}}$ and $C^{\text{out}}$ have many factors of 2. Under this assumption, the potential candidates of cardinality are still restricted to powers of 2. Specifically, if the greatest common divisor of $C^{\text{in}}$ and $C^{\text{out}}$ can be factored as

$$\gcd(C^{\text{in}}, C^{\text{out}}) = 2^{u} \cdot z, \tag{10}$$

where $z$ is an odd integer, then the potential candidates of the group level $g$ are $\{1, 2, \dots, u+1\}$. For example, if the minimal $u$ is 4 among all convolutional layers (the standard DenseNet (huang2017densely) family satisfies this condition), the potential candidates of cardinality are $\{1, 2, 4, 8, 16\}$, giving adequate flexibility to the compressed model. The structured regularization and the relationship matrix corresponding to each group level are designed in a similar way. For clarity, we provide an exemplar implementation based on the NumPy library.

import numpy as np


def struc_reg(dim1, dim2, level=None, power=0.5):
    r"""
    Compute the structured regularization matrix.

    Args:
        dim1 (int): number of output channels.
        dim2 (int): number of input channels.
        level (int or None): current group level.
                             Specify 'None' if the cost matrix is desired.
        power (float): decay rate of the regularization coefficients.

    Return:
        Structured regularization matrix.
    """
    reg = np.zeros((dim1, dim2))
    assign_val(reg, 1., level, power)
    return reg


def assign_val(arr, val, level, power):
    # Fill `arr` in place; the recursive calls operate on views of `arr`.
    dim1, dim2 = arr.shape
    # Stop when a dimension can no longer be halved or no levels remain.
    if dim1 % 2 != 0 or dim2 % 2 != 0 or level == 0:
        return
    else:
        next_level = None if level is None else level - 1
        # The two off-diagonal blocks receive the current value.
        arr[dim1//2:, :dim2//2] = val
        arr[:dim1//2, dim2//2:] = val
        # The two diagonal blocks are refined recursively with a decayed value.
        assign_val(arr[dim1//2:, dim2//2:], val*power, next_level, power)
        assign_val(arr[:dim1//2, :dim2//2], val*power, next_level, power)
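
The rule relating the greatest common divisor of the channel numbers to the admissible cardinalities can also be checked numerically. The sketch below is an illustration rather than part of the original implementation; the function names are hypothetical. It counts how many times 2 divides the greatest common divisor and lists the corresponding powers of 2, which is also the number of times the recursion in `assign_val` can halve both dimensions.

```python
import math


def power_of_two_exponent(n):
    """Largest u such that 2**u divides the positive integer n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    u = 0
    while n % 2 == 0:
        n //= 2
        u += 1
    return u


def candidate_cardinalities(c_out, c_in):
    """Powers of 2 that divide both the output and input channel numbers."""
    u = power_of_two_exponent(math.gcd(c_out, c_in))
    return [2 ** e for e in range(u + 1)]


# Illustrative channel numbers (not taken from a specific layer in the paper).
print(candidate_cardinalities(192, 48))   # gcd = 48 = 2**4 * 3
print(candidate_cardinalities(128, 96))   # gcd = 32 = 2**5
```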

Appendix B Dynamic Penalty Adjustment

As the desired compression rate is customized according to user preference, manually choosing an appropriate regularization coefficient in Eq. (7) of Sec. 3.3 for each experimental setting is extremely inefficient. To alleviate this issue, we dynamically adjust this coefficient based on the sparsification progress. The algorithm is summarized in Alg. 2.

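Concretely, after the $t$-th training epoch, we first determine the current group level of each convolutional layer according to Eq. (9) of Sec. 3.4. Then, we define the model sparsity based on the reduction of model parameters. For the $l$-th convolutional layer, the number of parameters is reduced by a factor of $G_l$, where $G_l$ is the cardinality. Thus, the original number of parameters and the reduced one are given by

$$p = \sum_{l} C^{\text{out}}_{l} C^{\text{in}}_{l} k_{l}^{2} \quad \text{and} \quad p' = \sum_{l} \frac{C^{\text{out}}_{l} C^{\text{in}}_{l} k_{l}^{2}}{G_{l}}, \tag{11}$$

respectively. Here, $C^{\text{out}}_{l}$, $C^{\text{in}}_{l}$, and $k_{l}$ denote the output channel number, the input channel number, and the kernel size of the $l$-th convolutional layer, respectively. Therefore, the current model sparsity is calculated as

$$s_t = 1 - \frac{p'}{p}. \tag{12}$$
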
Afterwards, we assume that the model sparsity grows linearly, and calculate the expected sparsity gain from the total number of training epochs and the target sparsity. If the expected sparsity gain is not met, we increase the regularization coefficient by a fixed step. If the model sparsity exceeds the target sparsity, we decrease the coefficient by the same step.

In all experiments, the coefficient is initialized from and is set to .

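Initialize the regularization coefficient and the step size;
for each training epoch do
    Train for 1 epoch;
    Determine the current group levels;
    Compute the current sparsity by Eq. (11) and (12);
    if the expected sparsity gain is not met then
        Increase the regularization coefficient by the step size;
    else if the current sparsity exceeds the target sparsity then
        Decrease the regularization coefficient by the step size;
    end if
end for
Algorithm 2: Dynamically adjust the regularization coefficient.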
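
The adjustment procedure described in this appendix can be sketched in Python as follows. This is a hedged illustration, not the authors' code. The function names and the tuple layout of `layers` are hypothetical, square kernels are assumed, and the form of the expected sparsity gain (the remaining gap to the target divided evenly over the remaining epochs) is an assumption, since the exact inequality is not reproduced here. Clamping the coefficient at zero is likewise an added safeguard.

```python
def model_sparsity(layers):
    """Fraction of convolutional parameters removed by grouping.

    `layers` is a list of (c_out, c_in, kernel_size, cardinality) tuples
    (hypothetical layout); each cardinality must divide both channel numbers.
    """
    original = sum(co * ci * k * k for co, ci, k, _ in layers)
    reduced = sum(co * ci * k * k // g for co, ci, k, g in layers)
    return 1.0 - reduced / original


def adjust_coefficient(coef, step, sparsity, prev_sparsity,
                       target, epoch, total_epochs):
    """Return the updated regularization coefficient after `epoch` epochs.

    ASSUMPTION: linear growth means the remaining gap to the target is
    spread evenly over the remaining epochs (including the current one).
    """
    remaining = total_epochs - epoch + 1
    expected_gain = (target - prev_sparsity) / remaining
    if sparsity - prev_sparsity < expected_gain:
        # Sparsification is too slow: strengthen the regularization.
        return coef + step
    elif sparsity > target:
        # Sparsification overshoots: weaken the regularization.
        return max(coef - step, 0.0)  # clamp at zero (added safeguard)
    return coef


# Illustrative values only; they are not the settings used in the paper.
layers = [(64, 64, 3, 1), (128, 64, 1, 4)]
current = model_sparsity(layers)
new_coef = adjust_coefficient(coef=1.0, step=0.5, sparsity=current,
                              prev_sparsity=0.1, target=0.5,
                              epoch=3, total_epochs=10)
```

The branch order follows Alg. 2: the check for an unmet sparsity gain comes first, and the overshoot check is only reached otherwise.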

Appendix C Experimental Details

In this section, we provide more results and details of our experiments. We provide the loss and accuracy curves along with the performance after each stage in Sec. C.1, and analyze the compressed model architectures in Sec. C.2.

C.1 Training Dynamics

Backbone ResNet-50 ResNet-101 DenseNet-201
Compression Rate 35% 65% 85% 40% 65% 80% 38% 60%
Pre-compression Acc. 69.07 66.36 64.30 69.56 67.13 64.20 69.10 66.26
Post-compression Acc. 60.92 42.78 8.82 65.78 58.63 18.57 66.15 17.35
Finetune Acc. 76.82 75.10 72.47 78.16 77.62 75.73 77.43 75.86
threshold 0.127 0.115 0.125 0.095 0.090 0.103 0.098 0.115
Table 4: Performance along the timeline of our approach. The evaluation is performed on the ImageNet dataset.

We first provide the pre- and post-compression accuracy along with the finetune accuracy of our pipeline in Tab. 4. During compression, we use a binary search to decide the threshold of the grouping criteria (Eq. (9)) so that the network is compressed at the desired compression rate. The searched thresholds are also listed. Apart from this, we further provide the training and finetune curves in Fig. 4. In the training stage, the accuracy gradually increases until saturation, and then the compression leads to a slight performance drop. Finally, the performance is recovered in the finetune stage.
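
The binary search over the threshold can be sketched as a plain bisection. The snippet below is only an illustration under stated assumptions: the compression rate is assumed to be a non-decreasing function of the threshold, and `toy_rate` is a purely hypothetical stand-in for compressing the network with the grouping criteria at a given threshold and measuring the resulting compression rate.

```python
def search_threshold(rate_fn, target_rate, lo=0.0, hi=1.0,
                     tol=1e-4, max_iter=50):
    """Bisection for a threshold whose compression rate matches the target.

    ASSUMPTION: rate_fn(threshold) is non-decreasing on [lo, hi].
    """
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        rate = rate_fn(mid)
        if abs(rate - target_rate) <= tol:
            return mid
        if rate < target_rate:
            lo = mid   # compress more aggressively
        else:
            hi = mid   # compress less aggressively
    return 0.5 * (lo + hi)


def toy_rate(threshold):
    """Hypothetical monotone mapping from threshold to compression rate."""
    return threshold ** 2


threshold = search_threshold(toy_rate, target_rate=0.65)
```

In practice the compression rate changes in discrete steps as layers switch cardinality, so the tolerance may never be met exactly; in that case this sketch returns the midpoint of the final interval.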


Figure 4: Training dynamics of the full MASS pipeline. We plot the training and finetune curves of the DenseNet-201 backbone with a compression rate of 38%. At the end of the 60th epoch of the training stage, we compress the network following our criteria. Then, we finetune for 120 epochs to recover performance.

C.2 Compressed Architectures

We illustrate the compressed architectures by showing the cardinality of each convolutional layer in Figs. 5 and 6. Note that our method is applied to all convolution operators, i.e., both $1 \times 1$ and $3 \times 3$ convolutions, so a high compression rate, e.g., 80%, can be achieved. Besides, as discussed in Sec. 4.4, the learned cardinality distribution is prone to uniformity, but certain patterns still emerge. For example, shallow layers are relatively more difficult to compress. A possible explanation is that shallow layers have fewer filters, so a large cardinality will inevitably eliminate the communication between certain groups. Moreover, we observe that $3 \times 3$ convolutions are generally more compressible than $1 \times 1$ convolutions. This is intuitive, as $3 \times 3$ convolutions have more parameters and thus heavier redundancy.


Figure 5: Learned cardinalities of the ResNet-50 backbone with the compression rates of 35% and 65%.


Figure 6: Learned cardinalities of the ResNet-101 backbone with the compression rates of 40% and 80%.

Besides, we illustrate the learned neuron connectivity and compare it with the ShuffleNet (zhang2018shufflenet) counterpart. Here, we consider the channel permutation between two group convolutions (GroupConvs) and demonstrate the connectivity via the confusion matrix. Specifically, if the first GroupConv is of cardinality $G_1$ and the second of $G_2$, then the confusion matrix $M$ is a $G_1 \times G_2$ matrix with $M_{ij}$ denoting the number of channels that come from the $i$-th group of the first GroupConv and belong to the $j$-th group of the second.

In Tab. 5, we can see that the inter-group communication is guaranteed, as there are connections between every two groups. Furthermore, the learnable channel shuffle scheme is more flexible. The ShuffleNet (zhang2018shufflenet) scheme uniformly partitions and distributes channels within each group, while our approach allows small variations in the number of connections for each group. In this way, the network itself can control the information flow from each group by customizing its neuron connectivity. More examples can be found in Fig. 7. All models illustrated in this section are trained on the ImageNet dataset.

Learned neuron connectivity:
     G1  G2  G3  G4  G5  G6  G7  G8
G1    6   6  10   8   9   6  13   6
G2    9   8   7   9  11   8   4   8
G3   11   8  11   6   4   8   7   9
G4   16   9   5   9  10   4   6   5
G5    7   9   7   7   8  10   9   7
G6    5   7  10   6   7  11   7  11
G7    4   8   7  14   6   8   7  10
G8    6   9   7   5   9   9  11   8

ShuffleNet neuron connectivity:
     G1  G2  G3  G4  G5  G6  G7  G8
G1    8   8   8   8   8   8   8   8
G2    8   8   8   8   8   8   8   8
G3    8   8   8   8   8   8   8   8
G4    8   8   8   8   8   8   8   8
G5    8   8   8   8   8   8   8   8
G6    8   8   8   8   8   8   8   8
G7    8   8   8   8   8   8   8   8
G8    8   8   8   8   8   8   8   8
Table 5: Confusion matrices of the adjacent GroupConvs. Here, the neuron connectivity between “Layer4-Bottleneck1-conv1” and “Layer4-Bottleneck1-conv2” of the ResNet-50-85% model is shown. Top: the learned neuron connectivity; Bottom: the neuron connectivity of the ShuffleNet (zhang2018shufflenet).
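
A confusion matrix of this kind can be computed directly from a channel permutation. The sketch below is illustrative and uses hypothetical names. It assumes that `perm[k]` is the output channel of the first GroupConv that feeds input channel `k` of the second, and that both layers split their channels into contiguous, equally sized groups. With 512 channels and 8 groups (the entries of Tab. 5 sum to 512), the ShuffleNet-style shuffle places 8 channels in every cell.

```python
import numpy as np


def confusion_matrix(perm, groups_first, groups_second):
    """Count channels going from each group of the first GroupConv
    to each group of the second GroupConv."""
    num_channels = len(perm)
    size_first = num_channels // groups_first
    size_second = num_channels // groups_second
    mat = np.zeros((groups_first, groups_second), dtype=int)
    for dst, src in enumerate(perm):
        # Row: source group in the first layer; column: target group in the second.
        mat[src // size_first, dst // size_second] += 1
    return mat


# ShuffleNet-style shuffle: reshape to (8, 64), transpose, flatten.
shufflenet_perm = np.arange(512).reshape(8, 64).T.reshape(-1)
shufflenet_confusion = confusion_matrix(shufflenet_perm, 8, 8)

# Without any shuffle, all channels stay within their own group.
identity_confusion = confusion_matrix(np.arange(512), 8, 8)
```

Under these assumptions, every row and every column of the resulting matrix sums to the group size, which also holds for the learned connectivity reported in Tab. 5.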

Panels: (a) DenseNet-201-60%, Block4-Layer24-conv1-2; (b) ResNet-50-85%, Layer1-Bottleneck1-conv2-3; (c) ResNet-101-80%, Layer4-Bottleneck2-conv2-3; (d) ResNet-50-85%, Layer3-Bottleneck4-conv1-2; (e) ResNet-101-80%, Layer3-Bottleneck1-conv1-2.

Figure 7: More examples of the confusion matrices.