1 Introduction
Deep convolutional neural networks (CNNs) have shown remarkable performance in many computer vision tasks in recent years. To achieve higher accuracy on major tasks such as image classification, the primary trend has been to build deeper and wider CNNs [2, 3, 4, 11]. However, deeper and wider CNNs usually have hundreds of layers and thousands of channels, which come with an increasing number of parameters and a higher computational cost. For example, one of the classic networks, VGG16 [2], has 130 million parameters and needs more than 30 billion floating-point operations (FLOPs) to classify a single image; it fails to achieve real-time classification even with a powerful GPU. Moreover, many real-world applications must run in real time on resource-limited platforms, e.g., mobile devices. The model should therefore be compact, to reduce computational cost and achieve a better trade-off between efficiency and accuracy.
Recently, much research has focused on model compression [5, 6, 7, 8, 10]. These works fall into two main kinds of approaches: compression of pretrained networks and efficient architecture design. The compression approach is usually based on traditional compression techniques such as pruning and quantization, which remove connections to eliminate redundancy or reduce the number of bits used to represent the parameters. These approaches are simple and intuitive, but they always need multiple steps, i.e., pretraining and compressing, and thus cannot be trained end-to-end in a single pass. The second approach trains the model from scratch in a fully end-to-end manner. It usually uses a sequence of sparsely-connected convolutions rather than the standard fully-connected convolution to design new efficient architectures. For instance, in ShuffleNet [8], the original $3 \times 3$ convolution is replaced with a depthwise convolution, while the $1 \times 1$ convolution is replaced with a pointwise group convolution. Group convolution significantly reduces the number of parameters and the computational cost. However, it blocks the information flow between channel groups: as shown in Figure 1(a), each group of outputs is computed independently from a completely separate group of input feature maps, so there is no interaction between groups, which leads to severe performance degradation. Although ShuffleNet introduces a channel shuffle operation to facilitate inter-group information exchange, it still suffers from the loss of inter-group information. As shown in Figure 1(b), even with a shuffle operation, a large portion of the inter-group information cannot be leveraged. This problem is aggravated as the number of channel groups increases.
To solve the above issue, we propose a novel operation named Hierarchical Group Convolution (HGC) that effectively facilitates the interaction of information between different groups. In contrast to common group convolution, HGC hierarchically fuses the feature maps from each group and leverages the inter-group information effectively. Specifically, we split the input feature maps of a layer into multiple groups. The features of the first group are extracted by a group of filters; the output feature maps of the previous group are then concatenated with the next group of input feature maps and fed to the next group of filters. This process repeats until all input feature maps are included. By combining the HGC operation with depthwise separable convolution, we introduce the HGC module, a powerful and effective unit for building a highly efficient architecture called HGCNets. A series of controlled experiments shows the effectiveness of our design. Compared to other structures, HGCNets are better at alleviating the loss of inter-group information and thus achieve substantial improvements as the group number increases.
Our work makes the following contributions: First, a new hierarchical group convolution operation is proposed to facilitate the interaction of information between different groups of feature maps and to leverage the inter-group information effectively. Second, our proposed HGCNets achieve higher classification accuracy than prior compact CNNs at the same or even lower complexity.
The rest of this paper is organized as follows: Section 2 provides an overview of the related work on model compression. The details of the proposed Hierarchical Group Convolution operation are introduced in Section 3. In Section 4, we describe the structure of the HGC module and the HGCNets architecture. The performance evaluation of the proposed method is described in Section 5. Finally, we conclude this paper in Section 6.
2 Related work
We first review the two main approaches to model compression that inspire our work: compression of pretrained networks and the design of efficient architectures. Next, we review group convolution, which forms the basis for HGCNets.
2.1 Compression for pretrained networks
Most works using this approach improve the efficiency of CNNs via weight pruning [12, 13, 19] and quantization [14]. These approaches are effective because deep neural networks often have a substantial number of redundant weights that can be pruned or quantized without sacrificing much accuracy. For convolutional neural networks, different pruning techniques lead to different levels of granularity. Fine-grained pruning, e.g., independent weight pruning [12], generally achieves a high degree of sparsity. However, it requires storing a large number of indices and relies on special hardware/software accelerators. In contrast, coarse-grained pruning methods such as filter-level pruning [19] achieve a lower degree of sparsity, but the resulting networks are much more regular, which facilitates efficient implementations. These approaches are simple and intuitive; however, they commonly rely on an iterative optimization strategy, which slows down the training procedure.
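The difference in granularity can be illustrated with a small sketch. It assumes a magnitude criterion (small absolute value for single weights, small L1 norm for whole filters), which is one common choice and not necessarily the criterion used by the cited works; the function names `prune_weights` and `prune_filters` are hypothetical.

```python
def prune_weights(filters, threshold):
    """Fine-grained pruning: zero every individual weight whose magnitude
    is below `threshold`. The layer keeps its shape but becomes sparse, so
    an efficient implementation has to track where the non-zeros are."""
    return [[w if abs(w) >= threshold else 0.0 for w in f] for f in filters]


def prune_filters(filters, keep):
    """Filter-level pruning: keep the `keep` filters with the largest L1
    norm (assumed criterion). The result is a smaller but still dense layer."""
    ranked = sorted(
        range(len(filters)),
        key=lambda i: sum(abs(w) for w in filters[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original filter order
    return [filters[i] for i in kept]


# Toy layer with three filters of three weights each (made-up values).
layer = [[0.9, -0.05, 0.4], [0.01, 0.02, -0.03], [-0.7, 0.6, 0.02]]
sparse_layer = prune_weights(layer, threshold=0.1)
small_layer = prune_filters(layer, keep=2)
```

In the toy layer, fine-grained pruning leaves three filters with scattered zeros, while filter-level pruning returns two intact filters, which is the regularity the text refers to.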
2.2 Designing efficient architectures
Considering the above-mentioned limitations, some researchers take a different route and directly design efficient network architectures [7, 8, 20] that can be trained end-to-end using smaller filters, such as depthwise separable convolution and group convolution. Two well-known examples of this approach that are efficient enough to be deployed on mobile devices are MobileNet [7] and ShuffleNet [8]. MobileNet uses depthwise separable convolution as its building unit, which decomposes a standard convolution into a depthwise convolution followed by a pointwise convolution. ShuffleNet uses depthwise convolution and pointwise group convolution in its bottleneck unit and proposes the channel shuffle operation to enable inter-group information exchange. Compact networks can be trained from scratch, so the training procedure is very fast. Moreover, the model can be compressed further by combining it with the aforementioned model compression methods, which are orthogonal to this approach. For example, Huang et al. [18] combined channel pruning and group convolution to sparsify networks. However, such channel pruning methods obtain a sparse network through a complex training procedure that requires significant offline training cost, and directly removing input feature maps typically yields limited compression and speedup with a significant accuracy drop. In addition to the methods described above, other approaches such as low-rank factorization [15] and knowledge distillation [16] can also efficiently accelerate deep neural networks.
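To make the saving from depthwise separable convolution concrete, the following sketch counts weights (biases ignored) for a standard convolution and for its depthwise-plus-pointwise decomposition. The function names and the example sizes (64 channels, $3 \times 3$ kernel) are illustrative assumptions, not values taken from this paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Every output channel has one k x k filter per input channel."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise step: one k x k filter per input channel.
    Pointwise step: a 1 x 1 convolution mixing c_in channels into c_out."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = standard_conv_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
print(standard, separable, round(standard / separable, 2))
```

For these sizes the pointwise $1 \times 1$ convolution holds most of the remaining weights, which is why later sections concentrate on making that step cheaper.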
2.3 Group Convolution
Group convolution is a special case of sparsely connected convolution. It was first used in the AlexNet [1] architecture and has more recently been popularized by its successful application in ResNeXt [17]. A standard convolutional layer generates $N$ output feature maps by applying convolutional filters over all $C$ input feature maps, leading to a computational cost of $C \times N$. In comparison, group convolution reduces this cost by partitioning the input features into $G$ mutually exclusive groups, each of which produces its own outputs, reducing the computational cost by a factor of $G$ to $C \times N / G$. However, the grouping operation usually compromises performance because there is no interaction among groups. As a result, information from feature maps in different groups is not combined, in contrast to the original convolution, which combines information from all input channels; this restricts the representation capability of group convolution. To solve this problem, ShuffleNet [8] proposes a channel shuffle operation that permutes the output channels of a group convolution so that the output is better related to the input. But any output group still only accesses $C/G$ input feature maps and thus collects only partial information. For this reason, ShuffleNet has to employ a deeper architecture than MobileNet to achieve competitive results.
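The cost reduction and the channel shuffle permutation can be sketched as follows. Cost is counted in input-output channel connections, ignoring kernel size and spatial resolution; the shuffle follows the usual reshape, transpose, flatten formulation. The names `group_conv_cost` and `channel_shuffle` are hypothetical helpers.

```python
def group_conv_cost(c_in, c_out, groups):
    """Channel connections of a group convolution: each of the `groups`
    groups maps c_in/groups inputs to c_out/groups outputs."""
    return groups * (c_in // groups) * (c_out // groups)


def channel_shuffle(channels, groups):
    """View the channel list as a (groups, n) grid, transpose it, and
    flatten, so each new group holds one channel from every old group
    position."""
    n = len(channels) // groups
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


print(group_conv_cost(64, 64, 1))   # standard convolution
print(group_conv_cost(64, 64, 4))   # 4 groups
print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))
```

After the shuffle, each group of the next layer sees channels that originated in different groups, yet it still reads only a fraction of all channels, which is the limitation discussed above.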
3 Hierarchical Group Convolution
3.1 Motivation
In modern deep neural networks, the size of convolutional filters is mostly $3 \times 3$ or $1 \times 1$, and the main computational cost comes from the convolutional layers, since the fully connected layer can be considered a special case of the convolutional layer. To reduce the parameters of the convolution operation, an extremely efficient scheme is to replace standard convolution with a depthwise separable convolution [27] followed by an interleaved group convolution [20, 8]. This scheme significantly reduces the model size and has therefore attracted increasing attention.
Since the $1 \times 1$ filters are non-separable, group convolution becomes a promising and feasible solution that works well with many deep neural network architectures. However, preliminary experiments show that a naive adaptation of group convolution in the $1 \times 1$ convolutional layer leads to drastic reductions in accuracy, especially in dense architectures. As analyzed in CondenseNet [18], this is because the inputs to the convolutional layer are concatenations of feature maps generated by preceding layers, which have an intrinsic order and are far more diverse. The hard assignment of these features to disjoint groups hinders effective feature reuse in the network. More specifically, as investigated in work on network explanation, individual feature maps across different layers play different roles in the network: features from shallow layers usually encode low-level spatial visual information such as edges, corners, and circles, whereas features from deep layers encode high-level semantic information. Group convolution severely blocks the inter-group information exchange and induces severe performance degradation. To facilitate the fusion of feature maps from each group and leverage the inter-group information effectively, we develop a novel approach, named hierarchical group convolution, that efficiently overcomes the side effects of group convolution.
3.2 Details of Hierarchical Group Convolution
Details of the proposed Hierarchical Group Convolution are shown in Figure 2. Generally, a standard convolutional layer transforms the input feature maps $X$ into the output feature maps $Y$ by using the filters $W$. Here, $C$ and $N$ are the numbers of input and output feature maps, respectively. In the HGC operation, the input channels and filters are each divided into $G$ groups, i.e., $C/G$ input channels and $N/G$ filters in each group, denoted as $X = \{x_1, x_2, \dots, x_G\}$, where each $x_i$ contains $C/G$ feature maps, and $W = \{w_1, w_2, \dots, w_G\}$, where $w_i$ operates on $C/G$ input channels when $i = 1$ and on $C/G + N/G$ input channels when $1 < i \le G$. The first group of feature maps $x_1$ goes directly through $w_1$; every later feature group $x_i$ is concatenated with the previous output $y_{i-1}$ along the channel dimension and then fed into $w_i$. Thus, the output $y_i$ can be formulated as follows:
$$y_i = \begin{cases} w_i * x_i, & i = 1, \\ w_i * \mathrm{concat}(x_i, y_{i-1}), & 1 < i \le G, \end{cases} \qquad (1)$$
where $*$ represents the convolution operation. The biases are omitted for simplicity. After all input feature maps are processed, we finally concatenate all $y_i$ as the output of HGC.
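The hierarchical data flow of Eq. 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it treats a single spatial position, so a $1 \times 1$ convolution reduces to a matrix-vector product, and the names `conv1x1` and `hgc_forward` are hypothetical.

```python
def conv1x1(weights, inputs):
    """1 x 1 convolution at one spatial position: each row of `weights`
    is one filter and produces one output channel."""
    for row in weights:
        if len(row) != len(inputs):
            raise ValueError("filter width must match the number of inputs")
    return [sum(w * v for w, v in zip(row, inputs)) for row in weights]


def hgc_forward(x, filters):
    """x: list of C channel values; filters: one weight matrix per group.
    The first group is convolved directly; each later group is first
    concatenated with the previous group's output."""
    groups = len(filters)
    if len(x) % groups != 0:
        raise ValueError("channels must divide evenly into groups")
    step = len(x) // groups
    outputs, previous = [], []
    for i, w_i in enumerate(filters):
        x_i = x[i * step:(i + 1) * step]
        y_i = conv1x1(w_i, x_i + previous)  # concat(x_i, y_{i-1})
        outputs.extend(y_i)                 # final output: concat of all y_i
        previous = y_i
    return outputs


# C = 4 inputs, N = 4 outputs, G = 2 groups, all-ones filters for readability.
w1 = [[1.0, 1.0], [1.0, 1.0]]   # 2 filters over C/G = 2 inputs
w2 = [[1.0] * 4, [1.0] * 4]     # 2 filters over C/G + N/G = 4 inputs
print(hgc_forward([1.0, 2.0, 3.0, 4.0], [w1, w2]))
```

With all-ones filters the second group's outputs depend on every input channel, whereas a standard group convolution would confine them to the last two channels.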
Notice that each convolutional operator can potentially receive information from all feature subsets of the previous layer. Each time a feature group goes through a convolutional operator, the output carries more information from the input feature maps. The split-and-concatenate strategy can effectively process feature maps with fewer parameters. For $C$ input feature maps, $N$ output feature maps, $G$ groups, and filters of spatial size $k \times k$, the number of parameters of HGC is calculated as below:
$$P_{HGC} = k^2 \left[ \frac{C}{G} \cdot \frac{N}{G} + (G - 1) \left( \frac{C}{G} + \frac{N}{G} \right) \frac{N}{G} \right] \qquad (2)$$
Compared with the $k^2 C N$ parameters of a standard convolution, the compression ratio of each layer is:
$$r = \frac{k^2 C N}{P_{HGC}} = \frac{G^2 C}{G C + (G - 1) N} \qquad (3)$$
As can be observed in Eq. 3, HGC contains considerably fewer parameters than standard convolution; for example, when $C = N$ the ratio is $G^2 / (2G - 1)$. Although it has somewhat more parameters than standard group convolution, HGC has a stronger feature representation ability. As will be shown in Section 5.1, HGC brings a substantial improvement in accuracy, especially with a large number of groups.
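As a quick sanity check on the parameter count, the sketch below tallies weights per layer for standard, standard group, and hierarchical group convolution. The HGC count is derived from the filter shapes described in Section 3.2 (the first group sees $C/G$ inputs, each later group sees $C/G + N/G$), which is an assumption about the exact form of Eq. 2; the function names and the 64-channel example are illustrative.

```python
def standard_params(c_in, c_out, k=1):
    return c_in * c_out * k * k


def sgc_params(c_in, c_out, groups, k=1):
    """Standard group convolution: `groups` independent small convolutions."""
    return groups * (c_in // groups) * (c_out // groups) * k * k


def hgc_params(c_in, c_out, groups, k=1):
    """Hierarchical group convolution, counted from the filter shapes."""
    cg, ng = c_in // groups, c_out // groups
    first = cg * ng                        # group 1: inputs x_1 only
    rest = (groups - 1) * (cg + ng) * ng   # groups 2..G: concat(x_i, y_{i-1})
    return (first + rest) * k * k


c, n, g = 64, 64, 4
ratio = standard_params(c, n) / hgc_params(c, n, g)
print(standard_params(c, n), sgc_params(c, n, g), hgc_params(c, n, g), ratio)
```

For $C = N = 64$ and $G = 4$ the counts are 4096, 1024, and 1792 weights, and the ratio equals $G^2 / (2G - 1) = 16/7$, so in this setting HGC sits between standard convolution and standard group convolution.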
4 HGCNet
4.1 HGC module
Taking advantage of the proposed HGC operation, we propose a novel HGC module specially designed for efficient neural networks. The HGC module is shown in Figure 3(b). The typical bottleneck structure shown in Figure 3(a) is a basic building block in many modern backbone CNN architectures, e.g., DenseNet [11]. Instead of directly extracting features with a group of convolutional filters as in the bottleneck, we use the HGC operation, which has a stronger inter-group information exchange ability while maintaining a similar computational load. A channel shuffle operation before the HGC allows for more inter-group information exchange. Finally, feature maps from all groups are concatenated and sent to a computationally economical depthwise separable convolution to capture spatial information. The usage of batch normalization [9] and nonlinearity [21] is similar to Xception [27]: we do not use ReLU before the depthwise convolution.
As discussed in Section 3.1, the information contained in each output group gradually increases, so the channels contribute differently to later layers. Thus, we can integrate the SE block [28] into the HGC module to adaptively recalibrate channel-wise feature responses by explicitly modeling the importance of each channel. Our HGC module benefits from the integration of the SE block, as we demonstrate experimentally in Section 5.3.
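Channel-wise recalibration can be sketched as a squeeze step (global average pooling per channel), an excitation step (two small fully connected layers with ReLU and sigmoid), and a rescaling of each channel. This is a generic illustration of a squeeze-and-excitation block, not the exact configuration used in HGCNets; `se_recalibrate` and the weight arguments are hypothetical names.

```python
import math


def se_recalibrate(feature_maps, w_reduce, w_expand):
    """feature_maps: list of channels, each a flat list of spatial values.
    w_reduce: (hidden x channels) weights, w_expand: (channels x hidden)."""
    # Squeeze: one descriptor per channel via global average pooling.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation, part 1: fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed)))
              for row in w_reduce]
    # Excitation, part 2: fully connected layer followed by sigmoid.
    scales = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
              for row in w_expand]
    # Rescale every channel by its learned importance in (0, 1).
    return [[s * v for v in ch] for s, ch in zip(scales, feature_maps)]


# Two channels with two spatial positions each; zero weights give scale 0.5.
maps = [[1.0, 3.0], [2.0, 2.0]]
recalibrated = se_recalibrate(maps, w_reduce=[[0.0, 0.0]], w_expand=[[0.0], [0.0]])
```

With trained weights the scales differ per channel, which is how the block can emphasize the later, information-richer output groups.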
4.2 HGCNet Architecture
Combining the efficient HGC module with dense connectivity, we propose HGCNets, a new family of compact neural networks. Similar to CondenseNet [18], we exponentially increase the growth rate as the depth grows, to increase the proportion of features coming from later layers relative to those from earlier layers, because deeper layers in DenseNet tend to rely more on high-level features than on low-level features. For simplicity, we multiply the growth rate by a power of 2. The overall architecture of HGCNets for CIFAR classification is grouped into three stages. The number of output channels of each HGC module is kept equal to the growth rate within a stage and is doubled in the next stage.
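The growth-rate schedule can be written down directly. The sketch assumes a hypothetical base growth rate of 8 for the first stage (the paper's actual value is not given here); only the doubling per stage is taken from the text.

```python
def stage_growth_rates(base_growth, num_stages=3):
    """Growth rate per stage: constant within a stage, doubled from one
    stage to the next (base * 2**stage_index)."""
    return [base_growth * 2 ** stage for stage in range(num_stages)]


# base_growth=8 is an illustrative value, not taken from the paper.
print(stage_growth_rates(8))
```

Because each HGC module outputs as many channels as the current growth rate, later stages add proportionally more feature maps per module than earlier ones.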
5 Experiments
In this section, we evaluate the effectiveness of our proposed HGCNets on the CIFAR-10 and CIFAR-100 [22] image classification datasets. We implement all the proposed models using the PyTorch framework [26].
Datasets. The CIFAR-10 and CIFAR-100 datasets consist of RGB images of size $32 \times 32$ pixels, corresponding to 10 and 100 classes, respectively. Both datasets contain 50,000 training images and 10,000 test images. We use a standard data-augmentation scheme [23, 24, 25], in which the images are zero-padded with 4 pixels on each side, randomly cropped to produce $32 \times 32$ images, and horizontally mirrored with probability 0.5.
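The augmentation steps can be sketched on a single-channel image stored as a list of rows. This is a dependency-free illustration; in practice the same scheme would normally be expressed with the framework's own transforms. The function name `augment` and the `rng` parameter are hypothetical.

```python
import random


def augment(image, pad=4, rng=random):
    """Zero-pad a square image by `pad` pixels per side, take a random crop
    of the original size, and mirror it horizontally with probability 0.5."""
    size = len(image)
    width = size + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * width for _ in range(pad)]
    top = rng.randint(0, 2 * pad)    # inclusive bounds: 2*pad + 1 offsets
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + size] for row in padded[top:top + size]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]
    return crop


# A 32 x 32 image of ones; a seeded generator keeps the example repeatable.
image = [[1] * 32 for _ in range(32)]
augmented = augment(image, rng=random.Random(0))
print(len(augmented), len(augmented[0]))
```

The output always has the original $32 \times 32$ size; any zeros in it come from the padding border that the random crop happened to include.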
| Model | FLOPs | Params | Top-1 err. (%) |
|---|---|---|---|
| SGCNet-42 (G = 1) | 91M | 0.55M | 5.94 |
| HGCNet-42 (G = 2) | 74M | 0.41M | 6.21 |
| SGCNet-42 (G = 2) | 62M | 0.33M | 6.46 |
| HGCNet-42 (G = 4) | 56M | 0.28M | 6.36 |
| SGCNet-42 (G = 4) | 47M | 0.22M | 6.64 |
| HGCNet-42 (G = 6) | 49M | 0.23M | 6.52 |
| SGCNet-42 (G = 6) | 42M | 0.19M | 6.94 |
5.1 Ablation study on CIFAR
We first perform a set of experiments on CIFAR-10 to validate the effectiveness of the efficient HGC operation and the proposed HGCNets.
Training details. We train all models with stochastic gradient descent (SGD) using optimization hyperparameters similar to those in [4, 11]. Specifically, we adopt Nesterov momentum and weight decay. All models are trained with a mini-batch size of 128 for 300 epochs, unless otherwise specified. We use a cosine-shaped learning rate schedule that starts from 0.1 and gradually decreases to 0.
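A cosine-shaped schedule from 0.1 to 0 over 300 epochs can be sketched as below. The half-cosine form and the per-epoch (rather than per-iteration) update are assumptions; the text only fixes the start value, the end value, and the overall shape.

```python
import math


def cosine_lr(epoch, total_epochs=300, base_lr=0.1):
    """Half-cosine decay: base_lr at epoch 0, zero at epoch total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


for epoch in (0, 75, 150, 225, 300):
    print(epoch, round(cosine_lr(epoch), 4))
```

The rate falls slowly at first, fastest around the midpoint where it equals half the initial value, and flattens again as it approaches zero.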
Ablation Study. For a direct comparison with standard group convolution (SGC), we replace the hierarchical group convolution in the HGC module with SGC, which yields SGCNets. We first explore the accuracy of both with respect to the number of groups; the results are shown in Table 1 and Figure 4(a). When the group number is kept the same, HGCNets surpass SGCNets, by 0.25 to 0.42 percentage points of top-1 error in Table 1. As can be seen, the accuracy drops dramatically when standard group convolution is applied to the $1 \times 1$ convolution, mainly due to the loss of representation capability caused by hard assignment. In contrast, our HGC generates more discriminative features and maintains the accuracy even with a large number of groups. More importantly, the advantage of HGCNets over SGCNets grows as the group number increases. Figure 4(b) shows the computational efficiency gains brought by HGC. Compared to SGCNets, HGCNets require 30% fewer parameters to achieve comparable performance.
As discussed above, increasing $G$ removes more inter-group connections, which aggravates the loss of inter-group information and harms the representation capability. However, the hierarchical group convolution fuses the features from all channels hierarchically and generates more discriminative features than ShuffleNet. As shown in Figure 5, HGCNet overcomes the performance degradation and converges better than the network that uses standard group convolution. These improvements are consistent with our initial motivation for designing the HGC module.
5.2 Comparison to state-of-the-art compact CNNs
In Table 2, we show the results of experiments comparing HGCNets with alternative state-of-the-art compact CNN architectures. Following [11], our models were trained for 300 epochs, with $G$ set to 4 for a better trade-off between compression and accuracy. The results show that HGCNets require fewer parameters and FLOPs to achieve better accuracy than MobileNets and ShuffleNets.
5.3 Comparison to state-of-the-art large CNNs
In this subsection, we experimentally demonstrate that the proposed HGCNets, as a lightweight architecture, can still outperform state-of-the-art large models, e.g., ResNet [4]. We can also integrate the SE block [28] into the HGC module to adaptively recalibrate channel-wise feature responses by explicitly modeling the importance of each channel. As shown in Table 3, the original HGCNets already outperform the 110-layer ResNet while using 6x fewer parameters. When we insert the SE block into the HGC module, the top-1 error of HGCNet on CIFAR-10 further decreases to 5.81%, with a negligible increase in the number of parameters.
6 Conclusion
In this paper, we propose a novel hierarchical group convolution operation that performs model compression by replacing standard group convolution in deep neural networks. Unlike standard group convolution, which blocks the inter-group information exchange and induces severe performance degradation, HGC effectively leverages the inter-group information and generates more discriminative features even with a large number of groups. Based on the proposed HGC, we propose HGCNets, a new family of compact neural networks. Extensive experiments show that HGCNets achieve higher classification accuracy than prior CNNs designed for mobile devices at the same or even lower complexity.
References

[1] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
[2] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A matlab-like environment for machine learning,” Tech. Rep., 2011.
[3] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, 2015, pp. 1135–1143.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[5] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1389–1397.
[6] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
[7] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[8] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[9] G. Huang, S. Liu, L. Van der Maaten, and K. Q. Weinberger, “Condensenet: An efficient densenet using learned group convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2752–2761.
[10] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
[11] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic depth,” in European conference on computer vision. Springer, 2016, pp. 646–661.
[12] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
[13] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
[14] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” arXiv preprint arXiv:1405.3866, 2014.
[15] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[17] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[18] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2736–2744.
[19] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
[20] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
[21] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[22] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[23] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Advances in neural information processing systems, 2015, pp. 2377–2385.
[24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[25] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, “Quantized convolutional neural networks for mobile devices,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4820–4828.
[26] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.
[27] T. Zhang, G.-J. Qi, B. Xiao, and J. Wang, “Interleaved group convolutions,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4373–4382.
[28] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6848–6856.