1 Introduction
Convolutional Neural Networks (CNNs) have made significant advances in a wide range of vision and learning tasks (krizhevsky2012imagenet; gehring2017convolutional; long2015fully; girshick2015fast). However, the performance gains usually entail heavy computational costs, which make the deployment of CNNs on portable devices difficult. To meet the memory and computational constraints in real-world applications, numerous model compression techniques have been developed.
Existing network compression techniques are mainly based on weight quantization (chen2015compressing; courbariaux2016binarized; rastegari2016xnor; wu2016quantized), knowledge distillation (hinton2015distilling; chen2017learning; yim2017gift), and network pruning (li2017pruning; he2017channel; liu2017learning; molchanov2019importance). Weight quantization methods use low bit-width numbers to represent weights and activations, which usually brings a moderate performance degradation. Knowledge distillation schemes transfer knowledge from a large teacher network to a compact student network, and are typically sensitive to the teacher/student network architectures (mirzadeh2019improved; liu2019search). Closely related to our work, network pruning approaches reduce the model size by removing a proportion of model parameters that are considered unimportant. Notably, filter pruning algorithms (liu2017learning; he2017channel; li2017pruning) remove entire filters and result in structured architectures that can be readily incorporated into modern BLAS libraries.
Identifying unimportant filters is critical to pruning methods. It is well known that the weight norm can serve as a good indicator of the corresponding filter importance (li2017pruning; liu2017learning). Filters corresponding to smaller weight norms are considered to contribute less to the outputs. Furthermore, $\ell_1$ regularization can be used to increase sparsity (liu2017learning). Despite these advances, several issues in the existing pruning methods remain: 1) pruning a large proportion of convolutional filters results in severe performance degradation; 2) pruning alters the input/output feature dimensions, and thus meticulous adaptation is required to handle special network architectures (e.g., residual connections (he2016deep) and dense connections (huang2017densely)).
Before presenting the proposed method, we briefly introduce the group convolution (GroupConv) (krizhevsky2012imagenet), which plays an important role in this work. For the typical convolution operation, the output features are densely connected with the entire input features, while for the GroupConv, the input features are equally split into several groups and transformed within each group independently. Essentially, each output channel is connected with only a proportion of the input channels, which leads to sparse neuron connections. Therefore, deep CNNs with GroupConvs can be trained on less powerful GPUs with a smaller amount of memory.
In this work, we propose a novel approach for network compression where unimportant neuron connections are pruned to facilitate the usage of GroupConvs. Nevertheless, converting vanilla convolutions into GroupConvs is a challenging task. First, not all sparse neuron connectivities correspond to valid GroupConvs; certain requirements must be satisfied, e.g., mutual exclusiveness of different groups. To guarantee the desired structured sparsity, we impose structured regularization upon the convolutional weights and zero out the sparsified weights. Another challenge is that stacking multiple GroupConvs sequentially hinders the inter-group information flow. The ShuffleNet (zhang2018shufflenet) method proposes a channel shuffle mechanism, i.e., gathering channels from distinct groups, to ensure inter-group communication, though the order of permutation is handcrafted. In contrast, we solve the problem of channel shuffle with a learning-based scheme. Concretely, we formulate the learning of channel shuffle as a linear programming problem, which can be solved by efficient algorithms like the network simplex method (bonneel2011displacement). Since the structured sparsity is induced among the convolutional weights, our method is named Model-Agnostic Structured Sparsification, abbreviated as MASS.
The proposed structured sparsification method is designed for three goals. First, our approach is model-agnostic. A wide range of backbone architectures are amenable to our method without the need for any special adaptation. Second, our method is capable of achieving high compression rates. In modern efficient network architectures, the complexity of $3\times 3$ convolutions is highly compressed, while the computation bottleneck becomes the pointwise convolutions (i.e., $1\times 1$ convolutions) (zhang2018shufflenet). For example, the pointwise convolutions occupy 81.5% of the total FLOPs in the MobileNetV2 (sandler2018mobilenetv2) backbone and 93.4% in ResNeXt (xie2017aggregated). Our method is applicable to all convolution operators, so a high compression rate is reachable. Third, our approach brings a negligible performance drop. As all of the filters are preserved under our methodology, we retain stronger representational capacity of the compressed model and achieve a better accuracy-complexity trade-off than the pruning-based counterparts (see Fig. 1).
The main contributions of this work are threefold:

We propose a learnable channel shuffle mechanism (Sec. 3.2) in which the permutation of the convolutional weight norm matrix is learned via linear programming.

Upon the permuted weight norm matrix, we impose structured regularization (Sec. 3.3) to obtain valid GroupConvs by zeroing out the sparsified weights.

With the structurally sparse convolutional weights, we design the criteria of learning cardinality (Sec. 3.4) in which unimportant neuron connections are pruned with minimal impact on the entire network.
Incorporating the learnable channel shuffle mechanism, the structured regularization, and the grouping criteria, the proposed structured sparsification method performs favorably against the state-of-the-art network pruning techniques on both CIFAR (krizhevsky2009learning) and ImageNet (ILSVRC15) datasets.
2 Related Work
Network Compression
Compression methods for deep models can be broadly categorized based on weight quantization, knowledge distillation, and network pruning. Closely related to our work are the network pruning approaches based on filter pruning. It is well acknowledged that filters with smaller weight norms make negligible contributions to the outputs and can be pruned. li2017pruning prune filters with smaller norm, while liu2017learning remove those corresponding to smaller batch-norm scaling factors, on which an $\ell_1$ regularization term is imposed to increase sparsity.
However, techniques that remove entire filters based on the weight norm may significantly reduce the representational capacity. Instead of removing entire filters, the proposed structured sparsification method enforces structured sparsity among neuron connections and merely removes certain unimportant connections while all filters are preserved. As such, the network capacity is less affected than in pruning-based approaches (li2017pruning; liu2017learning; he2019filter; molchanov2019importance). Furthermore, our method does not alter the input/output dimensions, and can be easily incorporated into numerous backbone models.
Group Convolution.
Group convolution (GroupConv) is introduced in AlexNet (krizhevsky2012imagenet) to overcome the GPU memory constraints. GroupConv partitions the input features into mutually exclusive groups and transforms the features within each group in parallel. Compared with the vanilla (i.e., densely connected) convolution, a GroupConv with $G$ groups reduces the computational cost and the number of parameters by a factor of $G$. ResNeXt (xie2017aggregated) designs a multi-branch architecture by employing GroupConvs and defines cardinality as the number of parallel transformations, which is simply the group number in each GroupConv. If the cardinality equals the number of channels, GroupConv becomes the depthwise convolution, which is widely used in recent lightweight neural architectures (howard2017mobilenets; sandler2018mobilenetv2; zhang2018shufflenet; ma2018shufflenet; chollet2017xception).
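The parameter saving described above can be checked with a few lines of arithmetic. The following is a minimal, dependency-free sketch; the helper name `conv_params` and the layer sizes are illustrative assumptions and do not come from the paper.

```python
def conv_params(c_in, c_out, kernel_size, groups=1):
    """Number of weights in a 2-D convolution (bias omitted)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in // groups input channels.
    return c_out * (c_in // groups) * kernel_size * kernel_size


dense = conv_params(64, 128, 3)               # vanilla convolution
grouped = conv_params(64, 128, 3, groups=4)   # GroupConv with cardinality 4
depthwise = conv_params(64, 64, 3, groups=64) # cardinality == channel count

print(dense, grouped, dense // grouped, depthwise)
```

With these hypothetical sizes, the grouped layer holds one quarter of the dense layer's weights, matching the stated reduction by a factor equal to the number of groups.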
However, the aforementioned methods all treat the cardinality as a hyperparameter, and the connectivity patterns between consecutive features are handcrafted as well. On the other hand, there is also a line of research focusing on learnable GroupConvs (huang2018condensenet; wang2019fully; zhang2019differentiable). Both CondenseNet (huang2018condensenet) and FLGC (wang2019fully) predefine the cardinality of each GroupConv and learn the connectivity patterns. We note that the work by zhang2019differentiable learns both the cardinality and neuron connectivity simultaneously. Essentially, this dynamic grouping convolution is modeled by a binary relationship matrix whose entries indicate the connectivity between each input channel and each output channel. To guarantee that the resulting operator is a valid GroupConv, the relationship matrix is constructed using a Kronecker product of several binary symmetric matrices. Nevertheless, the Kronecker product gives a sufficient but not necessary condition, and the space of all valid relationship matrices is not fully exploited.
Our method decouples the learning of cardinality and connectivity. Motivated by the norm-based criterion in the network pruning methods (li2017pruning; liu2017learning), we quantify the importance of each neuron connection by the corresponding weight norm and learn the connectivity pattern by permuting the weight norm matrix. In addition, the structured regularization is imposed on the permuted weight norm matrix and the cardinality is learned accordingly. The essential difference between our approach and prior art (zhang2019differentiable) is that all possible neuron connectivity patterns, i.e., relationship matrices, can be reached by our method.
Channel Shuffle Mechanism.
ShuffleNet (zhang2018shufflenet) combines the channel shuffle mechanism with GroupConv for efficient network design, in which channels from different groups are gathered so as to facilitate inter-group communication. Without channel shuffle, stacking multiple GroupConvs eliminates the information flow among different groups and weakens the representational capacity. Unlike the handcrafted counterpart (zhang2018shufflenet), the proposed channel shuffle operation is learnable over the space of all possible channel permutations. Furthermore, our channel shuffle only involves a simple permutation along the channel dimension, which can be conveniently implemented by an index operation.
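To illustrate why a channel shuffle reduces to an index operation, the sketch below permutes a list of channels with an index list and undoes the shuffle with the inverse permutation. It is a toy example on Python lists; the function names and the particular permutation are assumptions for illustration, not the paper's implementation.

```python
def shuffle_channels(features, perm):
    """Reorder channels: output channel k is input channel perm[k]."""
    if sorted(perm) != list(range(len(features))):
        raise ValueError("perm must be a permutation of the channel indices")
    return [features[i] for i in perm]


def invert(perm):
    """Inverse permutation, which undoes shuffle_channels(., perm)."""
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return inverse


channels = ["c0", "c1", "c2", "c3"]  # stand-ins for per-channel feature maps
perm = [2, 0, 3, 1]                   # hypothetical learned permutation
shuffled = shuffle_channels(channels, perm)
restored = shuffle_channels(shuffled, invert(perm))
print(shuffled, restored)
```

In a tensor library the same operation would typically be a single indexing call along the channel axis, so the shuffle adds no learnable parameters.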
Neural Architecture Search.
Neural Architecture Search (NAS) (zoph2016neural; baker2016designing; zoph2018learning; real2019regularized; wu2019fbnet) aims to automate the process of learning neural architectures within certain budgets of computational resources. Existing NAS algorithms are developed based on reinforcement learning (zoph2016neural; baker2016designing; zoph2018learning), evolutionary search (real2017large; real2019regularized), and differentiable approaches (liu2018darts; wu2019fbnet). Our method can be viewed as a special case of hyperparameter (i.e., cardinality) optimization and neuron connectivity search. However, different from existing approaches that evaluate numerous architectures, the proposed method determines the compressed architecture in a single training pass and is more scalable than most NAS methods.
3 Model-Agnostic Structured Sparsification
3.1 Overview
The structured sparsification method is designed to zero out a proportion of the convolutional weights so that the vanilla convolutions can be converted into group convolutions (GroupConvs), while the optimal neuron connectivity is learned at the same time. We adopt the “train, compress, fine-tune” pipeline, similar to recent pruning approaches (liu2017learning). Concretely, we first train a large model with structured regularization, then convert vanilla convolutions into GroupConvs under certain criteria, and finally fine-tune the compressed model to recover accuracy. The connectivity patterns are learned during training, as the structured regularization heavily depends on them. As such, three issues need to be addressed: 1) how to learn the connectivity patterns (Sec. 3.2); 2) how to design the structured regularization (Sec. 3.3); and 3) how to decide the grouping criteria (Sec. 3.4). Additional details of our pipeline are presented in Sec. 3.5.
3.2 Learning Connectivity with Linear Programming
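Let $X \in \mathbb{R}^{C_{in} \times H \times W}$ be the input feature map, where $C_{in}$ denotes the number of input channels. We apply a vanilla convolution with weights $W \in \mathbb{R}^{C_{out} \times C_{in} \times K \times K}$ to $X$, i.e., $Y = W * X$, where $Y \in \mathbb{R}^{C_{out} \times H \times W}$ with $C_{out}$ denoting the number of output channels. (For simplicity, we omit the bias term from Eq. 1 and assume the convolution operator is of stride 1 with proper padding.) Each entry of $Y$ is a weighted sum of a local patch of $X$, namely,
$$Y_{j,h,w} = \sum_{i=1}^{C_{in}} \sum_{m,n} W_{j,i,m,n}\, X_{i,\,h+m,\,w+n}, \tag{1}$$
where $(m, n)$ ranges over the $K \times K$ kernel offsets.
In Eq. 1, the $j$-th channel of $Y$ relates to the $i$-th channel of $X$ via the weights $W_{j,i,:,:}$. Motivated by the norm-based importance estimation in filter pruning (li2017pruning; liu2017learning), we quantify the importance of the connection between the $i$-th channel of $X$ and the $j$-th channel of $Y$ by $\|W_{j,i,:,:}\|$. Thus, the importance matrix $S \in \mathbb{R}^{C_{out} \times C_{in}}$ can be defined as the norm along the “kernel size” dimensions of $W$, i.e., $S_{j,i} = \|W_{j,i,:,:}\|$.
Next, we extend our formulation to GroupConvs with cardinality $G$. A GroupConv can be considered as a convolution with sparse neuron connectivity, in which only a proportion of input channels is visible to each output channel. Without loss of generality, we assume both $C_{in}$ and $C_{out}$ are divisible by $G$, and Eq. 1 can be adapted as
$$Y_{j,h,w} = \sum_{i=(k_j-1)\,n+1}^{k_j\, n} \sum_{m,n'} W_{j,i,m,n'}\, X_{i,\,h+m,\,w+n'}, \tag{2}$$
where $k_j = \lceil jG / C_{out} \rceil$ indicates that the $j$-th output channel belongs to the $k_j$-th group, and $n = C_{in}/G$ denotes the number of input channels within each group. Clearly, the valid entries of $W$ form a block diagonal matrix with equally split blocks at the input/output channel dimensions. Thus, the GroupConv module requires $C_{out} C_{in} K^2 / G$ parameters and $C_{out} C_{in} K^2 H W / G$ FLOPs for processing the feature $X$, and the computational complexity is reduced by a factor of $G$ compared with the vanilla counterpart.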
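The block diagonal structure noted above can be made concrete with a small sketch: it builds the dense weight tensor that is equivalent to a GroupConv and reduces it to an importance matrix by taking a norm over the kernel dimensions. The text does not say which norm is used, so the $\ell_1$ norm here is an assumption, as are the function names and the nested-list layout `weight[j][i][m][n]`.

```python
def importance_matrix(weight):
    """S[j][i] = l1 norm (assumed) of the kernel linking input i to output j."""
    return [[sum(abs(v) for row in kernel for v in row) for kernel in out_filter]
            for out_filter in weight]


def group_conv_as_dense(c_out, c_in, k, groups, fill=1.0):
    """Dense weights equivalent to a GroupConv: kernels across groups are zero."""
    out_per_group, in_per_group = c_out // groups, c_in // groups
    weight = []
    for j in range(c_out):
        filters = []
        for i in range(c_in):
            same_group = (j // out_per_group) == (i // in_per_group)
            value = fill if same_group else 0.0
            filters.append([[value] * k for _ in range(k)])
        weight.append(filters)
    return weight


S = importance_matrix(group_conv_as_dense(4, 4, 3, groups=2))
for row in S:
    print(row)
```

The printed matrix has two nonzero 2-by-2 blocks on the diagonal and zeros elsewhere, which is the pattern that the permutation search in this section tries to approach for a trained dense layer.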
We note that if a vanilla convolution operator can be converted into a GroupConv without affecting its functional property (we call such convolution operators groupable), the convolutional weights $W$ must be block diagonal after certain permutations along the input/output channel dimensions. Due to the positive definiteness of the norm and the fact that permuting $W$ corresponds to permuting the importance matrix $S$, a necessary and sufficient condition for a convolution operator to be groupable is that
$$\exists\, P \in \mathcal{P}_{C_{out}},\ Q \in \mathcal{P}_{C_{in}} \quad \text{such that} \quad P S Q \ \text{is block diagonal with equally split blocks}, \tag{3}$$
where $\mathcal{P}_{N}$ denotes the set of $N \times N$ permutation matrices. Here, the permutation matrices $Q$ and $P$ shuffle the channels of the input and output features, and thus determine the connectivity pattern between the input features $X$ and the output features $Y$ (see Fig. 2).
However, a randomly initialized and trained convolution operator is by no means groupable unless special sparsity constraints are imposed. To this end, we resort to permuting the importance matrix $S$ so as to make $PSQ$ “as block diagonal as possible”. The next question is how to rigorously define the term “as block diagonal as possible”. Here, we assume that the numbers of input and output channels, $C_{in}$ and $C_{out}$, are both powers of 2, an assumption satisfied by the most widely used backbone architectures (e.g., VGG (vgg) and ResNet (he2016deep)). (Similar reasoning can be applied if both $C_{in}$ and $C_{out}$ have many factors of 2. See the supplementary materials for details.) Then, the potential cardinality is also a power of 2. As the cardinality grows, more and more non-diagonal blocks are zeroed out (see Fig. 3(c)). As illustrated in Fig. 3(b), we define the cost matrix $R$ to progressively penalize the nonzero entries of the non-diagonal blocks. Finally, we utilize $\langle PSQ, R \rangle$ as a metric of the “block-diagonality” of the matrix $PSQ$, where $\langle \cdot, \cdot \rangle$ indicates elementwise multiplication and summation over all entries, i.e., $\langle A, B \rangle = \sum_{j,i} A_{j,i} B_{j,i}$. Formally, the optimization problem is formulated as follows:
$$\min_{P \in \mathcal{P}_{C_{out}},\ Q \in \mathcal{P}_{C_{in}}} \ \langle P S Q,\, R \rangle. \tag{4}$$
Solving Eq. 4 gives the optimal connectivity pattern between the adjacent layers.
However, minimization over the set of permutation matrices is a non-convex and NP-hard problem that requires combinatorial search. To this end, we relax the feasible space to its convex hull. The Birkhoff-von Neumann theorem (birkhoff1946three) states that the convex hull of the set of permutation matrices is the set of doubly stochastic matrices (non-negative square matrices whose rows and columns sum to one), known as the Birkhoff polytope:
$$\mathcal{B}_{N} = \left\{ M \in \mathbb{R}_{+}^{N \times N} : M \mathbf{1}_{N} = \mathbf{1}_{N},\ M^{\top} \mathbf{1}_{N} = \mathbf{1}_{N} \right\}, \tag{5}$$
where $\mathbf{1}_{N}$ denotes the column vector composed of $N$ ones.
We solve Eq. 4 with coordinate descent. That is, we iteratively update the permutation matrices $P$ and $Q$ until convergence. When updating one variable, we consider the other as fixed. For example, when optimizing $P$, the objective function can be transformed as follows:
$$\min_{P \in \mathcal{B}_{C_{out}}} \langle P S Q,\, R \rangle = \min_{P \in \mathcal{B}_{C_{out}}} \langle P,\, R\,(S Q)^{\top} \rangle, \tag{6}$$
where $S$ is the importance matrix, $R$ is the cost matrix, and $\langle \cdot, \cdot \rangle$ denotes the elementwise product summed over all entries. As the objective is a linear function of $P$ and the Birkhoff polytope is a convex polytope, Eq. 6 is a linear programming problem, which can be solved by efficient algorithms such as the network simplex method (bonneel2011displacement). In addition, the theory of linear programming guarantees that at least one of the solutions is attained at a vertex of the polytope, and the vertices of the Birkhoff polytope are precisely the permutation matrices (birkhoff1946three). Thus, in Eq. 6, minimization over the Birkhoff polytope is equivalent to minimization over the set of permutation matrices, and the solution is naturally a permutation matrix without the need for post-processing. Furthermore, Eq. 6 has the same formulation as the optimal transport problem, and a mature computation library (https://github.com/rflamary/POT/) is available for efficient linear programming.
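The alternating update can be sketched on a toy importance matrix as follows. Two things in this sketch are assumptions rather than the paper's method: the cost matrix simply counts the group levels at which an entry falls outside the diagonal blocks (one plausible reading of "progressively penalize", since Fig. 3(b) is not reproduced here), and a brute-force search over permutations stands in for the linear programming solver, which is only viable for a handful of channels. All function names are hypothetical.

```python
from itertools import permutations


def cost_matrix(n_out, n_in):
    """Assumed cost: number of group levels (2, 4, ... groups) at which
    entry (j, i) lies outside the diagonal blocks."""
    cost = [[0] * n_in for _ in range(n_out)]
    groups = 2
    while groups <= min(n_out, n_in):
        for j in range(n_out):
            for i in range(n_in):
                if j * groups // n_out != i * groups // n_in:
                    cost[j][i] += 1
        groups *= 2
    return cost


def objective(S, R, rows, cols):
    """Elementwise product of the permuted S with R, summed over all entries."""
    return sum(S[rows[a]][cols[b]] * R[a][b]
               for a in range(len(rows)) for b in range(len(cols)))


def best_assignment(cost):
    """Brute-force linear assignment: position a receives item perm[a]."""
    n = len(cost)
    return list(min(permutations(range(n)),
                    key=lambda p: sum(cost[a][p[a]] for a in range(n))))


def learn_connectivity(S, max_iters=10):
    """Coordinate descent over the row and column permutations of S."""
    n_out, n_in = len(S), len(S[0])
    R = cost_matrix(n_out, n_in)
    rows, cols = list(range(n_out)), list(range(n_in))
    best = objective(S, R, rows, cols)
    for _ in range(max_iters):
        # Update the output-channel order with the input-channel order fixed.
        row_cost = [[sum(S[r][cols[b]] * R[a][b] for b in range(n_in))
                     for r in range(n_out)] for a in range(n_out)]
        rows = best_assignment(row_cost)
        # Update the input-channel order with the output-channel order fixed.
        col_cost = [[sum(S[rows[a]][c] * R[a][b] for a in range(n_out))
                     for c in range(n_in)] for b in range(n_in)]
        cols = best_assignment(col_cost)
        current = objective(S, R, rows, cols)
        if current >= best:  # no further improvement
            break
        best = current
    return rows, cols, best


# Block diagonal matrix whose rows 1 and 2 have been swapped.
S = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
rows, cols, value = learn_connectivity(S)
print(rows, cols, value)
print([[S[r][c] for c in cols] for r in rows])
```

Each half-step solves its subproblem exactly, so the objective never increases, but coordinate descent as a whole can stop at a permutation pair that is not globally optimal.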
3.3 Structured Regularization
Permutation alone does not suffice to induce structurally sparse convolutional weights, and we still need to impose special sparsity regularization to achieve the desired sparsity structure. Inspired by the sparsity-inducing penalty in liu2017learning, we impose the structured regularization on the permuted weight norm matrix $PSQ$. We first define the group level $g$ as illustrated in Fig. 3, which indicates the current cardinality achieved (a power of 2 under the assumption of Sec. 3.2) and is determined in Sec. 3.4. Then, given the current group level $g$, the structured regularization is formulated as $\mathcal{L}_{reg} = \langle PSQ, R_g \rangle$, i.e., the elementwise product summed over all entries, where $R_g$ denotes the structured regularization matrix as illustrated in Fig. 3(b). (Here, we simply compute the regularization of a single convolutional layer. In the experiments, the regularization is the sum over all the convolutional layers.) Intuitively, at group level $g$, additional regularization is imposed upon the non-diagonal blocks that would be zeroed out if the next group level were achieved. Furthermore, the regularization coefficient decays exponentially as the group level grows, as we desire a balanced cardinality distribution across the network. In the end, the overall loss becomes
$$\mathcal{L} = \mathcal{L}_{data} + \lambda\, \mathcal{L}_{reg}, \tag{7}$$
where $\mathcal{L}_{data}$ denotes the regular data loss (the standard classification loss in the following experiments) and $\lambda$ is the balancing scalar.
3.4 Criteria of Learning Cardinality
With the structurally sparsified convolutional weights, the next step is to determine the cardinality. The core idea of our criteria is that the weight norms corresponding to the valid connections constitute at least a certain proportion of the total weight norms. At group level $g$, the following requirement should be satisfied:
$$\sum_{j,i} (PSQ)_{j,i}\, (U_g)_{j,i} \ \ge\ p \sum_{j,i} S_{j,i}, \tag{8}$$
where $PSQ$ is the permuted weight norm matrix, $p$ is a threshold set to 0.9 in all of our experiments, and $U_g$ is the relationship matrix (zhang2019differentiable) as illustrated in Fig. 3(c). The matrix $U_g$ specifies the valid neuron connections at group level $g$. Therefore, the current group level can be determined by
$$g^{*} = \max \Big\{ g : \sum_{j,i} (PSQ)_{j,i}\, (U_g)_{j,i} \ge p \sum_{j,i} S_{j,i} \Big\}. \tag{9}$$
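The criterion above can be sketched as a short loop that raises the group level while the connections kept by the corresponding GroupConv still carry at least a fraction `p` of the total weight norm. The sketch assumes that group level `g` corresponds to `2**g` groups, with level 0 meaning a dense convolution; this indexing and the function names are illustrative assumptions, since the exact convention of Fig. 3 is not recoverable from the text.

```python
def relationship_matrix(n_out, n_in, level):
    """Binary mask of the connections kept by a GroupConv with 2**level groups."""
    groups = 2 ** level
    return [[int(j * groups // n_out == i * groups // n_in) for i in range(n_in)]
            for j in range(n_out)]


def group_level(S, max_level, p=0.9):
    """Largest level whose valid connections hold at least p of the total norm.

    S is the already permuted weight norm matrix (nested lists).
    """
    total = sum(sum(row) for row in S)
    level = 0
    for g in range(1, max_level + 1):
        U = relationship_matrix(len(S), len(S[0]), g)
        kept = sum(s * u for s_row, u_row in zip(S, U)
                   for s, u in zip(s_row, u_row))
        if kept < p * total:
            break  # masks are nested, so higher levels keep even less
        level = g
    return level


nearly_two_groups = [[50, 50, 1, 1], [50, 50, 1, 1],
                     [1, 1, 50, 50], [1, 1, 50, 50]]
print(group_level(nearly_two_groups, max_level=2))  # level 1, i.e. 2 groups
```

Because the masks of successive levels are nested, the kept fraction can only shrink as the level grows, so stopping at the first failure yields the maximum in the criterion.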
3.5 Pipeline Details
Table 1: Results on the CIFAR-10 dataset (standard deviations in parentheses).
| Methods | #Params. ($10^5$) | FLOPs ($10^7$) | Acc. (%) |
|---|---|---|---|
| ResNet-20 | | | |
| Baseline | 2.20 | 3.53 | 91.70 (0.12) |
| Slimming-40% | 1.91 (0.00) | 3.10 (0.02) | 91.74 (0.35) |
| MASS-20% | 1.76 (0.00) | 3.18 (0.07) | 91.79 (0.23) |
| Slimming-60% | 1.36 (0.02) | 2.24 (0.01) | 89.68 (0.38) |
| MASS-40% | 1.31 (0.01) | 2.58 (0.00) | 91.42 (0.04) |
| ResNet-56 | | | |
| Baseline | 5.90 | 9.16 | 93.50 (0.19) |
| Slimming-60% | 4.15 (0.03) | 5.75 (0.10) | 93.10 (0.25) |
| MASS-30% | 4.08 (0.05) | 7.17 (0.20) | 94.19 (0.16) |
| MASS-50% | 2.96 (0.03) | 4.81 (0.03) | 93.70 (0.06) |
| Slimming-80% | 2.33 (0.04) | 3.50 (0.02) | 91.01 (0.02) |
| MASS-60% | 2.34 (0.08) | 4.20 (0.08) | 93.48 (0.13) |
| MASS-70% | 1.80 (0.00) | 3.52 (0.16) | 93.25 (0.02) |
| ResNet-110 | | | |
| Baseline | 11.47 | 17.59 | 94.62 (0.22) |
| Slimming-40% | 9.24 (0.03) | 12.55 (0.00) | 94.49 (0.12) |
| MASS-20% | 9.12 (0.06) | 14.76 (0.02) | 94.78 (0.11) |
| MASS-40% | 6.69 (0.24) | 11.60 (0.01) | 94.55 (0.18) |
| Slimming-60% | 8.15 (0.03) | 10.66 (0.00) | 94.29 (0.11) |
| MASS-30% | 7.89 (0.03) | 12.47 (0.01) | 94.69 (0.08) |
| MASS-60% | 5.41 (0.02) | 10.66 (0.01) | 94.42 (0.04) |
Implementation Details.
Our implementation is based on the PyTorch (steiner2019pytorch) library. The proposed method is applied to the ResNet (he2016deep) and DenseNet (huang2017densely) families, and evaluated on the CIFAR (krizhevsky2009learning) and ImageNet (ILSVRC15) datasets. For the CIFAR dataset, we follow the common practice of data augmentation (he2016deep; liu2017learning; xie2017aggregated): zero-padding of 4 pixels on each side of the image, random crop of a $32\times 32$ patch, and random horizontal flip. For fair comparisons, we utilize the same network architecture as liu2017learning, and the model is trained on a single GPU with a batch size of 64. For the ImageNet dataset, we adopt the standard data augmentation strategy (vgg; he2016deep; xie2017aggregated): image resize such that the shortest edge is of 256 pixels, random crop of a $224\times 224$ patch, and random horizontal flip. The overall batch size is 256, which is distributed to 4 GPUs. For both datasets, we employ the SGD optimizer with momentum 0.9. The source code and trained models will be made available to the public upon acceptance.
Training Protocol.
For the first stage, we train a large model from scratch with the structured regularization described in Sec. 3.3. At the end of each epoch, we update the permutation matrices as in Sec. 3.2, determine the current group levels as in Sec. 3.4, and adjust the structured regularization matrices accordingly. We train with a fixed learning rate of 0.1 for 100 epochs on the CIFAR dataset and 60 epochs on the ImageNet dataset, and exclude the weight decay because the structured regularization is already present. The regularization coefficient $\lambda$ is dynamically adjusted to meet the desired compression rate (see the supplementary materials). The training pipeline is summarized in Alg. 1.
Fine-tune Protocol.
The remaining parameters are restored from the training stage and the compressed model is fine-tuned with an initial learning rate of 0.1. We fine-tune for 160 epochs on the CIFAR dataset and the learning rate decays by a factor of 10 at 50% and 75% of the total epochs. On the ImageNet dataset, the learning rate is decayed according to the cosine annealing strategy (loshchilovICLR17SGDR) within 120 epochs. For both datasets, a standard weight decay is adopted to prevent overfitting.
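The two fine-tuning schedules can be written as plain functions of the epoch index. This is a sketch under assumptions: epochs are counted from 0, the step decay takes effect exactly at the 50% and 75% marks, and the cosine schedule runs without restarts down to a minimum of 0; none of these details are stated in the text, and the function names are illustrative.

```python
import math


def step_lr(epoch, total_epochs=160, base_lr=0.1):
    """CIFAR schedule: divide the rate by 10 at 50% and 75% of training."""
    lr = base_lr
    if epoch >= 0.5 * total_epochs:
        lr /= 10
    if epoch >= 0.75 * total_epochs:
        lr /= 10
    return lr


def cosine_lr(epoch, total_epochs=120, base_lr=0.1, min_lr=0.0):
    """ImageNet schedule: cosine annealing from base_lr to min_lr (assumed 0)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


for epoch in (0, 79, 80, 120, 159):
    print(epoch, step_lr(epoch))
for epoch in (0, 60, 120):
    print(epoch, round(cosine_lr(epoch), 6))
```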
Table 2: Results on the ImageNet dataset.
| Methods | #Params. ($10^6$) | GFLOPs | Acc. (%) |
|---|---|---|---|
| ResNet-50 | | | |
| Baseline | 25.6 | 4.14 | 77.10 |
| NISP-A | 18.6 | 2.97 | 72.75 |
| Slimming-20% | 17.8 | 2.81 | 75.12 |
| Taylor-19% | 17.9 | 2.66 | 75.48 |
| FPGM-30% | N/A | 2.39 | 75.59 |
| MASS-35% | 17.2 | 3.12 | 76.82 |
| ThiNet-30% | 16.9 | 2.62 | 72.04 |
| NISP-B | 14.3 | 2.29 | 72.07 |
| Taylor-28% | 14.2 | 2.25 | 74.50 |
| FPGM-40% | N/A | 1.93 | 74.83 |
| MASS-65% | 10.3 | 1.67 | 75.10 |
| ThiNet-50% | 12.4 | 1.83 | 71.01 |
| Taylor-44% | 7.9 | 1.34 | 71.69 |
| Slimming-50% | 11.1 | 1.87 | 71.99 |
| MASS-85% | 5.6 | 0.90 | 72.47 |
| ResNet-101 | | | |
| Baseline | 44.5 | 7.87 | 78.64 |
| FPGM-30% | N/A | 4.55 | 77.32 |
| Taylor-25% | 31.2 | 4.70 | 77.35 |
| MASS-40% | 26.7 | 5.05 | 78.16 |
| BN-ISTA-v1 | 17.3 | 3.69 | 74.56 |
| BN-ISTA-v2 | 23.6 | 4.47 | 75.27 |
| Taylor-45% | 20.7 | 2.85 | 75.95 |
| Slimming-50% | 20.9 | 3.16 | 75.97 |
| MASS-65% | 16.5 | 2.98 | 77.62 |
| Taylor-60% | 13.6 | 1.76 | 74.16 |
| MASS-80% | 10.6 | 1.70 | 75.73 |
| DenseNet-201 | | | |
| Baseline | 20.0 | 4.39 | 77.88 |
| Taylor-40% | 12.5 | 3.02 | 76.51 |
| MASS-38% | 13.1 | 3.53 | 77.43 |
| Taylor-64% | 9.0 | 2.21 | 75.28 |
| MASS-60% | 9.2 | 2.10 | 75.86 |
4 Experiments and Analysis
In this section, we present the experimental results on the CIFAR and ImageNet datasets. In addition, we carry out ablation studies to demonstrate the effectiveness of components of the proposed method.
4.1 Results on CIFAR
We first compare our proposed method with the Network Slimming (liu2017learning) approach on the CIFAR-10 dataset. The Network Slimming approach is a representative pruning method that compresses CNN models by pruning less important filters. As the experimental results on the CIFAR-10 dataset are subject to random variation, we repeat the train-compress-fine-tune pipeline 10 times and report the mean and standard deviation (std). As shown in Tab. 1, the proposed MASS method performs favorably under various compression rates. For ResNet-110, with 60% of the parameters compressed, MASS still achieves 94.42% top-1 accuracy, which is nearly equal to the performance of the uncompressed baseline. Compared with Network Slimming, MASS consistently performs better, especially under high compression rates. Experiments on the CIFAR-10 dataset demonstrate that MASS is able to compress CNNs with a negligible performance drop and compares favorably against pruning methods such as Network Slimming.
4.2 Results on ImageNet
Tab. 2 shows the evaluation results of the proposed method against the state-of-the-art network pruning approaches, including ThiNet (luo2017thinet), Slimming (liu2017learning), NISP (yu2018nisp), BN-ISTA (ye2018rethinking), FPGM (he2019filter), and Taylor (molchanov2019importance). Overall, the MASS method performs favorably against the state-of-the-art network compression methods under different settings. These performance gains can be attributed to the fact that discarding entire filters negatively affects the representational strength of the network model, especially when the pruning ratio is high, e.g., 50%. In contrast, the MASS method removes only a proportion of neuron connections and preserves all of the filters, thereby making a mild impact on the model capacity. In addition, it is known that pruning neuron connections can eliminate the information flow and affect performance. To alleviate this issue, the learnable channel shuffle mechanism assists the information exchange among different groups, thereby minimizing the potential negative impact.
4.3 Ablation Studies
Accuracy v.s. Complexity.
As shown in Fig. 1, the proposed MASS method is designed to make a sound accuracy-complexity trade-off. On the ImageNet (ILSVRC15) dataset, a slight top-1 accuracy drop of 0.28% is traded for about 25% complexity reduction on the ResNet-50 backbone, and an accuracy loss of 1.02% for about 60% reduction on ResNet-101. Furthermore, high compression rates can be achieved with our methodology while maintaining competitive performance. It is worth noting that our method achieves an accuracy of 72.47% with only about 20% of the complexity of the ResNet-50 backbone, which compares favorably against pruning methods with twice the complexity.
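Table 3: Ablation on the channel shuffle mechanism (accuracy in %).
| Config. | ResNet-50-65% Top-1 | ResNet-50-65% Top-5 | ResNet-101-65% Top-1 | ResNet-101-65% Top-5 |
|---|---|---|---|---|
| Finetune | 75.10 | 92.52 | 77.62 | 93.72 |
| FromScratch | 75.02 | 92.46 | 77.14 | 93.53 |
| ShuffleNet | 74.97 | 92.41 | 76.91 | 93.38 |
| Random | 69.45 | 89.45 | 73.16 | 91.44 |
| NoShuffle | 73.30 | 91.39 | 75.31 | 92.64 |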
Learned Channel Shuffle Mechanism.
We evaluate the effectiveness of our learned channel shuffle mechanism on the ResNet backbone with a compression rate of 65%. We use the following five settings for performance evaluation:

Finetune: The preserved parameters after compression are restored and the compressed model is fine-tuned. For the other four settings, the parameters of the compressed model are reinitialized for the fine-tune stage.

FromScratch: We keep the learned channel connectivity, i.e., the permutation matrices $P$ and $Q$, from the training stage, and train the model from randomly reinitialized weights.

ShuffleNet: The same channel shuffle operation as in ShuffleNet (zhang2018shufflenet) is adopted. Specifically, if a convolution is of cardinality $G$ and has $C_{out}$ output channels, then the channel shuffle operation is equivalent to reshaping the output channel dimension into $(G, C_{out}/G)$, transposing, and flattening it back. Compared with ShuffleNet, the channel shuffle is learned rather than predefined in our method, i.e., in Finetune and FromScratch.

Random: The permutation matrices $P$ and $Q$ are randomly generated, independent of the learned ones.

NoShuffle: The channel shuffle operations are removed, i.e., $P$ and $Q$ are identity matrices.
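The handcrafted and random baselines in the settings above can be expressed as index lists over the output channels. The sketch below builds the ShuffleNet order by the reshape, transpose, flatten recipe, along with a seeded random order; the function names and the seed handling are illustrative assumptions.

```python
import random


def shufflenet_permutation(num_channels, groups):
    """Index order produced by reshaping to (groups, num_channels // groups),
    transposing, and flattening."""
    if num_channels % groups:
        raise ValueError("num_channels must be divisible by groups")
    per_group = num_channels // groups
    grid = [[g * per_group + k for k in range(per_group)] for g in range(groups)]
    return [grid[g][k] for k in range(per_group) for g in range(groups)]


def random_permutation(num_channels, seed=0):
    """Random channel order, independent of any learned connectivity."""
    perm = list(range(num_channels))
    random.Random(seed).shuffle(perm)
    return perm


def identity_permutation(num_channels):
    """No shuffle: every channel stays in place."""
    return list(range(num_channels))


print(shufflenet_permutation(6, 2))
print(random_permutation(6))
print(identity_permutation(6))
```

For six channels in two groups, the ShuffleNet order interleaves the groups so that consecutive channels come from different groups.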
The results are shown in Tab. 3. First, the fine-tuned models perform slightly better than those trained from scratch, which implies that the preserved parameters play a role in the final performance. Furthermore, the model with the learned channel shuffle mechanism, i.e., neuron connectivity, performs the best among all settings. The channel shuffle mechanism in ShuffleNet (zhang2018shufflenet) is effective, as it outperforms the no-shuffle counterpart. However, it can be further improved by a learning-based strategy. Interestingly, the random channel shuffle scheme performs the worst, even worse than the no-shuffle scheme. This implies that learning the channel shuffle operation is a challenging task, and randomly gathering channels from different groups gives no benefit.
4.4 Discussion
To the best of our knowledge, our work is the first to introduce structured sparsification for network compression. As there is still room for improvement, we discuss three potential directions for future work.

Data-Driven Structured Sparsification. Currently, the gradients of the data loss and those of the sparsity regularization are computed independently (Eq. 7) in each training iteration. Thus, the structured regularization is imposed uniformly on the convolutional layers, and the learned cardinality distribution is task-agnostic and prone to uniformity. However, a better cardinality distribution may be achieved if the structured sparsification is guided by the back-propagated signals of the data loss. Optimization-based meta-learning techniques (pmlrv70finn17a) can be exploited for this purpose.

Progressive Sparsification Solution. Typically, fine-tune-free compression techniques are desired in practical applications (cheng2018recent). To this end, the sparsified weights could be removed progressively during training, so that the architecture search and the model training are completed in a single training pass.

Combination with Filter Pruning Techniques. As the entire feature maps are preserved in our approach, the reduction of memory footprint is limited. This issue can be addressed by combining our approach with filter pruning techniques, which is non-trivial as uniform filter pruning is required within each group. It is of great interest to exploit group sparsity constraints (yoon2017combined) to achieve such uniform sparsity.
5 Conclusion
In this work, we propose a method for efficient network compression. Our approach induces structurally sparse representations of the convolutional weights, and the compressed model can be readily incorporated into modern deep learning libraries thanks to their support for the group convolution. The problem of inter-group communication is further solved via the learnable channel shuffle mechanism. Our approach is model-agnostic and achieves high compression rates with negligible performance degradation, which is validated by extensive experiments on the CIFAR and ImageNet datasets. In addition, experimental evaluation against the state-of-the-art compression approaches shows that structured sparsification can be a fruitful direction for future research.
References
Appendix A Structured Regularization in General Form
Generally, we can relax the constraint that the numbers of input and output channels, $C_{in}$ and $C_{out}$, are both powers of 2, and only assume that both have many factors of 2. Under this assumption, the potential candidates of cardinality are still restricted to powers of 2. Specifically, if the greatest common divisor of $C_{in}$ and $C_{out}$ can be factored as
$$\gcd(C_{in}, C_{out}) = 2^{u} \cdot z, \tag{10}$$
where $z$ is an odd integer, then the potential candidates of cardinality are $1, 2, \dots, 2^{u}$, one per group level. For example, if the minimal $u$ is 4 among all convolutional layers (the standard DenseNet (huang2017densely) family satisfies this condition), the potential candidates of cardinality are $\{1, 2, 4, 8, 16\}$, giving adequate flexibility of the compressed model. The structured regularization matrix and the relationship matrix corresponding to each group level are designed in a similar way. For clarity, we provide an exemplar implementation based on the NumPy library.
Appendix B Dynamic Penalty Adjustment
As the desired compression rate is customized according to user preference, manually choosing an appropriate regularization coefficient in Eq. (7) of Sec. 3.3 for each experimental setting is extremely inefficient. To alleviate this issue, we dynamically adjust the coefficient based on the sparsification progress. The algorithm is summarized in Alg. 2.
Concretely, after the $t$-th training epoch, we first determine the current group level of each convolutional layer according to Eq. (9) of Sec. 3.4. Then, we define the model sparsity based on the reduction of model parameters. For the $l$-th convolutional layer, the number of parameters is reduced by a factor of $G_l$, where $G_l$ is the cardinality. Thus, the original number of parameters and the reduced one are given by
$$p = \sum_{l} C_l \, C_{l+1} \, k_l^{2} \quad \text{and} \quad p' = \sum_{l} \frac{C_l \, C_{l+1} \, k_l^{2}}{G_l}, \tag{11}$$
respectively. Here, $C_l$ and $k_l$ denote the input channel number and the kernel size of the $l$-th convolutional layer, respectively (so that $C_{l+1}$ is its output channel number). Therefore, the current model sparsity is calculated as
$$s_t = 1 - \frac{p'}{p}. \tag{12}$$
Afterwards, we assume the model sparsity grows linearly, and calculate the expected sparsity gain. If the expected sparsity gain is not met, i.e.,
$$s_t - s_{t-1} < \frac{r - s_{t-1}}{N - t + 1}, \tag{13}$$
where $N$ is the total training epoch number and $r$ is the target sparsity, we increase the regularization coefficient $\lambda$ by a step $\Delta\lambda$. If the model sparsity exceeds the target, i.e., $s_t > r$, we decrease $\lambda$ by $\Delta\lambda$.
In all experiments, the coefficient is initialized from and is set to .
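Algorithm 2: Dynamic penalty adjustment
Initialize the regularization coefficient and its adjustment step
for each training epoch do
    train for 1 epoch
    determine the current group level of each convolutional layer (Eq. (9))
    compute the current model sparsity (Eq. (12))
    if the expected sparsity gain is not met (Eq. (13)) then increase the coefficient
    if the model sparsity exceeds the target then decrease the coefficient
end for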
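The adjustment rule described in this appendix can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `model_sparsity` and `adjust_penalty`, the `(c_in, c_out, kernel_size)` layer description, and the exact form of the expected gain (the remaining gap to the target divided by the remaining epochs) are assumptions made here for the sake of the example. The sketch also checks the "exceeds the target" case first so that only one adjustment is made per epoch, which is likewise an assumption.

```python
def model_sparsity(layers, cardinalities):
    """Fraction of convolutional weights removed by grouping.

    layers: list of (c_in, c_out, kernel_size) tuples, one per conv layer
            (a hypothetical layer description used only in this sketch).
    cardinalities: list of group counts, one per conv layer.
    """
    original = sum(c_in * c_out * k * k for c_in, c_out, k in layers)
    # A group convolution with g groups keeps 1/g of the dense weights.
    reduced = sum(
        c_in * c_out * k * k / g
        for (c_in, c_out, k), g in zip(layers, cardinalities)
    )
    return 1.0 - reduced / original


def adjust_penalty(lam, step, sparsity, prev_sparsity, target, epoch, total_epochs):
    """Return the updated coefficient after training epoch `epoch` (1-indexed)."""
    # Assumed linear schedule: spread the remaining gap over the remaining
    # epochs, counting the epoch that has just finished.
    remaining = total_epochs - epoch + 1
    expected_gain = (target - prev_sparsity) / remaining
    if sparsity > target:
        return lam - step  # too sparse: relax the regularization
    if sparsity - prev_sparsity < expected_gain:
        return lam + step  # behind schedule: strengthen the regularization
    return lam


# Toy usage with two layers: a 3x3 conv split into 4 groups and a dense 1x1 conv.
toy_layers = [(64, 64, 3), (64, 256, 1)]
current = model_sparsity(toy_layers, [4, 1])
new_lam = adjust_penalty(lam=0.0, step=1e-5, sparsity=current,
                         prev_sparsity=0.0, target=0.8, epoch=1, total_epochs=10)
```

In a training loop, `adjust_penalty` would be called once per epoch with the sparsity measured after that epoch and the value measured one epoch earlier.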
Appendix C Experimental Details
In this section, we provide more results and details of our experiments. We provide the loss and accuracy curves along with the performance after each stage in Sec. C.1, and analyze the compressed model architectures in Sec. C.2.
C.1 Training Dynamics
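| Backbone | ResNet50 | ResNet50 | ResNet50 | ResNet101 | ResNet101 | ResNet101 | DenseNet201 | DenseNet201 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Compression Rate | 35% | 65% | 85% | 40% | 65% | 80% | 38% | 60% |
| Pre-compression Acc. | 69.07 | 66.36 | 64.30 | 69.56 | 67.13 | 64.20 | 69.10 | 66.26 |
| Post-compression Acc. | 60.92 | 42.78 | 8.82 | 65.78 | 58.63 | 18.57 | 66.15 | 17.35 |
| Fine-tune Acc. | 76.82 | 75.10 | 72.47 | 78.16 | 77.62 | 75.73 | 77.43 | 75.86 |
| Threshold | 0.127 | 0.115 | 0.125 | 0.095 | 0.090 | 0.103 | 0.098 | 0.115 |
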
We first provide the pre- and post-compression accuracy along with the fine-tune accuracy of our pipeline in Tab. 4. During compression, we use a binary search to decide the threshold of the grouping criteria (Eq. (9)) so that the network is compressed at the desired compression rate. The searched thresholds are also listed. In addition, we provide the training and fine-tune curves in Fig. 4. In the training stage, the accuracy gradually increases until saturation, and then the compression leads to a performance drop that grows with the compression rate (Tab. 4). Finally, the performance is recovered in the fine-tune stage.
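The threshold search mentioned above can be illustrated with a generic bisection. The sketch below is not taken from the paper: `search_threshold` and `toy_rate` are hypothetical names, and the code assumes that the compression rate is a non-decreasing function of the threshold and that the upper bound `hi` already reaches the target rate. In practice, `rate_fn` would apply the grouping criteria at a given threshold and measure the resulting compression rate.

```python
def search_threshold(rate_fn, target_rate, lo=0.0, hi=1.0, tol=1e-4, max_iter=50):
    """Bisection for the smallest threshold whose compression rate reaches the target.

    rate_fn: callable mapping a threshold to a compression rate in [0, 1];
             assumed to be non-decreasing in the threshold.
    """
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if rate_fn(mid) < target_rate:
            lo = mid  # not compressed enough: raise the threshold
        else:
            hi = mid  # target reached: try a smaller threshold
        if hi - lo < tol:
            break
    return hi


def toy_rate(threshold):
    """Hypothetical monotone stand-in for the real compression-rate measurement."""
    return min(1.0, 4.0 * threshold)


threshold = search_threshold(toy_rate, target_rate=0.35)
```

Returning `hi` rather than the midpoint keeps the guarantee that the returned threshold meets the target rate, at the cost of overshooting by at most `tol`.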
[Figure 4: training and fine-tune curves]
C.2 Compressed Architectures
We illustrate the compressed architectures by showing the cardinality of each convolution layer in Figs. 5 and 6. Note that our method is applied to all convolution operators, i.e., both $1\times1$ and $3\times3$ convolutions, so a high compression rate, e.g., 80%, can be achieved. Besides, as discussed in Sec. 4.4, the learned cardinality distribution is prone to uniformity, but there are still certain patterns. For example, shallow layers are relatively more difficult to compress. A possible explanation is that shallow layers have fewer filters, so a large cardinality will inevitably eliminate the communication between certain groups. Moreover, we observe that $3\times3$ convolutions are generally more compressible than $1\times1$ convolutions. This is intuitive as $3\times3$ convolutions have more parameters, thus leading to heavier redundancy.
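The parameter argument above can be made concrete with a small calculation. The sketch assumes the two kernel sizes being compared are 1×1 and 3×3, and uses a hypothetical helper `conv_params` with an arbitrary channel width of 256; the numbers are illustrative and not taken from the experiments.

```python
def conv_params(c_in, c_out, kernel_size, cardinality=1):
    """Weight count of a (group) convolution, ignoring biases."""
    if c_in % cardinality or c_out % cardinality:
        raise ValueError("channel numbers must be divisible by the cardinality")
    # Each output channel only sees c_in / cardinality input channels.
    return (c_in // cardinality) * c_out * kernel_size ** 2


dense_1x1 = conv_params(256, 256, 1)                   # 65,536 weights
dense_3x3 = conv_params(256, 256, 3)                   # 589,824 weights, 9x the 1x1 layer
grouped_3x3 = conv_params(256, 256, 3, cardinality=8)  # 73,728 weights
```

At equal width, the 3×3 layer holds nine times as many weights as the 1×1 layer, so assigning it a larger cardinality removes far more parameters in absolute terms.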
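[Figure: cardinality of each convolution layer in the compressed ResNet50]
[Figure: cardinality of each convolution layer in the compressed ResNet101]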
Besides, we illustrate the learned neuron connectivity and compare it with the ShuffleNet (zhang2018shufflenet) counterpart. Here, we consider the channel permutation between two group convolutions (GroupConvs) and demonstrate the connectivity via the confusion matrix. Specifically, we assume the first GroupConv is of cardinality $G_1$ and the second of $G_2$; then the confusion matrix $M$ is a $G_1 \times G_2$ matrix with $M_{ij}$ denoting the number of channels that come from the $i$-th group of the first GroupConv and belong to the $j$-th group of the second.

In Tab. 5, we can see that the inter-group communication is guaranteed as there are connections between every two groups. Furthermore, the learnable channel shuffle scheme is more flexible. The ShuffleNet (zhang2018shufflenet) scheme uniformly partitions and distributes channels within each group, while our approach allows small variations of the number of connections for each group. In this way, the network can itself control the information flow from each group by customizing its neuron connectivity. More examples can be found in Fig. 7. All models illustrated in this section are trained on the ImageNet dataset.
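The confusion matrix described above can be computed directly from a channel permutation. The following NumPy sketch is illustrative only: the function names are hypothetical, the permutation convention (position `p` of the second GroupConv's input takes channel `perm[p]` of the first GroupConv's output) is an assumption, and the channel number is assumed to be divisible by both cardinalities. The ShuffleNet-style permutation is the usual reshape, transpose, and flatten operation.

```python
import numpy as np


def confusion_matrix(perm, g1, g2):
    """Count channels flowing from each group of the first GroupConv
    to each group of the second, given a channel permutation."""
    perm = np.asarray(perm)
    c = perm.size
    src_group = perm // (c // g1)           # group of each source channel
    dst_group = np.arange(c) // (c // g2)   # group of each destination slot
    m = np.zeros((g1, g2), dtype=int)
    np.add.at(m, (src_group, dst_group), 1)  # accumulate repeated index pairs
    return m


def shufflenet_permutation(c, groups):
    """Fixed channel shuffle: reshape to (groups, c // groups), transpose, flatten."""
    return np.arange(c).reshape(groups, c // groups).T.reshape(-1)


uniform = confusion_matrix(shufflenet_permutation(8, 2), 2, 2)
no_shuffle = confusion_matrix(np.arange(8), 2, 2)
```

Under these assumptions, the fixed shuffle yields a matrix with identical entries, while the identity permutation yields a diagonal matrix, i.e., no inter-group communication. A learned permutation would typically fall between the two, with all entries non-zero but not necessarily equal.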
[Figure 7: further examples of confusion matrices]