# Differentiable Learning-to-Group Channels via Groupable Convolutional Neural Networks

Group convolution, which divides the channels of ConvNets into groups, has achieved impressive improvement over the regular convolution operation. However, existing models, e.g., ResNeXt, still suffer from sub-optimal performance because the number of groups is manually defined as a constant over all layers. To address this issue, we present Groupable ConvNet (GroupNet), built with a novel dynamic grouping convolution (DGConv) operation that learns the number of groups in an end-to-end manner. The proposed approach has several appealing benefits. (1) DGConv provides a unified convolution representation and covers many existing convolution operations such as regular dense convolution, group convolution, and depthwise convolution. (2) DGConv is a differentiable and flexible operation that learns to perform various convolutions from training data. (3) GroupNet trained with DGConv learns a different number of groups for different convolution layers. Extensive experiments demonstrate that GroupNet outperforms its counterparts such as ResNet and ResNeXt in terms of accuracy and computational complexity. We also present an introspection and reproducibility study, showing for the first time the learning dynamics of training group numbers.

## Authors

Zhaoyang Zhang, et al. ∙ 08/16/2019

## 1 Introduction

Convolutional Neural Networks (ConvNets) have achieved remarkable successes in computer vision. For example, ResNet [7] was a pioneering work on building very deep networks with shortcut connections. This strategy exposes network depth as an essential dimension of ConvNets for achieving good performance. Beyond tailoring network architectures on depth and width [28, 11, 27, 29], ResNeXt [33] proposed a new dimension, “cardinality”, utilizing group convolution to design effective and efficient ConvNets. Group convolution is compact and parameter-saving, which allows ResNeXt to improve accuracy while reducing network parameters, outperforming its counterpart ResNet.

Although group convolution is easy to implement, applying group convolution in previous networks such as ResNeXt still has drawbacks.

First, when designing network architectures with group convolutions, the number of groups for each hidden layer is typically treated as a hyper-parameter. The group number is often defined by human experts and kept the same for all hidden layers of a ConvNet. Second, previous work employed homogeneous group convolutions, leading to sub-optimal solutions. For instance, one of the most practical settings of ResNeXt is “32x4d”, which applies group convolution with 32 groups and was found by trial and error. However, convolution layers at different depths of a ConvNet typically learn different visual features, which represent different abstractions and semantic meanings. Thus, uniformly reducing model parameters via group convolutions may degrade performance.

To address the above issues, this work introduces an autonomous formulation of group convolution, named Dynamic Grouping Convolution (DGConv), which generalizes many convolution operations and has the following appealing properties.

(1) Dynamic grouping. The core of DGConv is to train the convolution kernels and the grouping strategy simultaneously. As shown in Fig. 1, DGConv is able to learn the grouping strategy (the group number and the connections between channels in a group) during training. In this way, each DGConv layer can have an individual grouping strategy. Moreover, by imposing a regularization term on computational complexity, we can control the overall model size and computational overhead.

(2) Differentiability. The learning of DGConv is fully differentiable and can be trained in an end-to-end manner by using stochastic gradient descent (SGD). Thus, DGConv is compatible with existing ConvNets.

(3) Parameter-saving. The extra parameters needed to learn the grouping strategy in DGConv scale as $\mathcal{O}(\log_2 C)$, where $C$ is the number of channels of a convolution layer. This is far fewer than the parameters of the convolution kernels, which scale as $\mathcal{O}(C^2)$ (the kernel parameters number $C^{in} \times C^{out}$, the product of the input and output channel sizes). Furthermore, the extra parameters can be discarded after training. In the testing stage, only the parameters of the convolution kernels are stored and loaded. Fig. 2 shows an example of the group numbers learned by DGConv, which achieves performance comparable to its counterpart while significantly reducing parameters and computations.

This work makes three key contributions. (1) We propose a novel convolution operation, Dynamic Grouping Convolution (DGConv), which is able to differentiably learn the number of groups for group convolution, unlike existing work that treats the group number as a hyper-parameter. To our knowledge, this is the first work to learn the group number in a differentiable and data-driven way. (2) DGConv can be used to replace previous convolutions and build state-of-the-art deep networks such as the proposed Groupable ResNeXt in Section 3.3, where the group number for each convolution layer is automatically determined during end-to-end training. (3) Extensive experiments demonstrate that Groupable ResNeXt is able to outperform both ResNet and ResNeXt while using a comparable or even smaller number of parameters. For example, it surpasses ResNeXt101 by 0.8% top-1 accuracy on ImageNet with slightly fewer parameters and computations. Moreover, we study the learning dynamics of group numbers and report several findings.

## 2 Related Work

Group Convolution. Group convolution (GConv) is a special case of sparsely connected convolution. In regular convolution, we produce $C^{out}$ output channels by applying convolution filters over all $C^{in}$ input channels, resulting in a computational cost proportional to $C^{in} \times C^{out}$. In contrast, GConv reduces this cost by dividing the input channels into $G$ non-overlapping groups. After applying filters over each group, GConv generates the output channels by concatenating the outputs of each group. GConv therefore has a complexity proportional to $C^{in} \times C^{out} / G$.
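As a rough illustration of the cost reduction described above, the sketch below counts kernel weights for a regular, a grouped, and a depthwise convolution. The helper name `conv_weight_count` and the 256-channel, 3x3 layer are assumptions chosen for illustration, not settings taken from the experiments.

```python
def conv_weight_count(c_in, c_out, k, groups=1):
    """Number of kernel weights of a k x k convolution split into `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("groups must divide both channel counts")
    # Each group maps c_in / groups input channels to c_out / groups output channels.
    return groups * (c_in // groups) * (c_out // groups) * k * k


# Hypothetical layer: 256 input channels, 256 output channels, 3x3 kernel.
dense = conv_weight_count(256, 256, 3)                  # regular convolution (G = 1)
grouped = conv_weight_count(256, 256, 3, groups=32)     # GConv with G = 32
depthwise = conv_weight_count(256, 256, 3, groups=256)  # every channel is its own group

print(dense, grouped, depthwise)
```

With $G$ groups the weight count drops by exactly a factor of $G$, which is the $C^{in} \times C^{out} / G$ scaling stated in the text.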

GConv was first discussed in AlexNet [12] as a model-distribution approach to handle memory limitations. ResNeXt [33] presented an additional dimension for network architecture, “cardinality”, by using GConv, leading to a series of further studies on applying group convolution in portable neural architecture design [35, 19, 36, 10]. In the extreme case, each channel forms its own group, which is known as depthwise convolution. It has been widely used in efficient neural architecture design [9, 19, 36, 25].

Moreover, CondenseNet [10] and FLGC [31] learned the connections of group convolution, but the number of groups is still a predefined hyper-parameter. CondenseNet and FLGC treated connection learning as a pruning problem, where unimportant filters are pruned away. In contrast, DGConv learns both the group number and the channel connections of each group.

Neural Architecture Search. Recently, there has been growing interest in automating the design of neural architectures, usually referred to as Neural Architecture Search (NAS) and AutoML. For example, NASNet [38, 37] and MetaQNN [1] led the trend of architecture search by using reinforcement learning (RL). In NASNet, the network architecture is decomposed into repeatable and transferable blocks, such that the control parameters of the architectures are limited to a finite search space. The sequence of these architecture parameters is generated by a controller RNN, which is trained by maximizing rewards (validation accuracy). These methods were extended in many ways, such as progressive searching [13], parameter sharing [21], network transformation [3], resource-constrained searching [30], and differentiable searching like DARTS [15] and SNAS [34]. Evolutionary algorithms are an alternative to RL, where architectures are searched by mutating the best architectures found so far [23, 24, 32, 20, 14]. However, all the above methods either treated the group number as a hyper-parameter or searched its value by using sampling methods such as RL. In contrast, DGConv is the first model that can optimize the group number in a data-driven and differentiable end-to-end manner together with the network parameters.

## 3 Our Approach

### 3.1 Dynamic Grouping Convolution (DGConv)

We first present conventional convolution and group convolution, and then introduce DGConv.

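Regular Convolution. Let a feature map of a ConvNet be $F \in \mathbb{R}^{N \times C^{in} \times H \times W}$, where $N$, $C^{in}$, $H$, and $W$ represent the number of samples in a minibatch, the number of channels, and the height and width of a channel, respectively. If a regular convolution with kernel size $k \times k$ and stride 1 (with padding) is applied on $F$, the output feature map is denoted as $O \in \mathbb{R}^{N \times C^{out} \times H \times W}$, where every output unit $o_{ij}$ is

$$o_{ij} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} f_{(i+m)(j+n)}\, \omega_{mn}, \tag{1}$$

where $o_{ij} \in \mathbb{R}^{C^{out}}$, and $f_{(i+m)(j+n)} \in \mathbb{R}^{C^{in}}$ represents the hidden units of the input feature map $F$. $\omega_{mn} \in \mathbb{R}^{C^{in} \times C^{out}}$ represents the convolution weights (kernels).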

Group Convolution. Group convolution (GConv) can be defined as a regular convolution with sparse kernels. GConv is often implemented as a concatenation of separate convolutions over grouped channels,

$$o_{ij} = o^{1}_{ij} \cup \cdots \cup o^{\gamma}_{ij} \cup \cdots \cup o^{G}_{ij} \quad \text{and} \quad o^{\gamma}_{ij} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} f^{\gamma}_{(i+m)(j+n)}\, \omega^{\gamma}_{mn}, \tag{2}$$

where $G$ is the group number, $\gamma \in \{1, \dots, G\}$, and $\cup$ means the concatenation operation. In the context of GConv, we have $f^{\gamma}_{(i+m)(j+n)} \in \mathbb{R}^{C^{in}/G}$ and $\omega^{\gamma}_{mn} \in \mathbb{R}^{(C^{in}/G) \times (C^{out}/G)}$. In the extreme case where every channel is its own group ($G = C^{in}$), Eqn.(2) expresses the depthwise convolution [9, 25, 19, 36]. Both GConv and depthwise convolution reduce computational resources and can be efficiently implemented in existing deep learning libraries. However, the intrinsic hyper-parameter $G$ is manually designed, which can leave performance sub-optimal.

Dynamic Grouping Convolution. Dynamic grouping convolution (DGConv) extends group convolution by learning the grouping strategy, that is, the group number and the channel connections of each group. The strategy can be modeled by a binary relationship matrix $U \in \{0,1\}^{C^{in} \times C^{out}}$. DGConv can be defined as

$$o_{ij} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} f_{(i+m)(j+n)}\, (U \odot \omega_{mn}), \tag{3}$$

where $\odot$ denotes the elementwise product. It is noteworthy that Eqn.(3) has rich representation capacity. Many convolution operations can be treated as special cases of DGConv. To build some intuition on the flexibility of DGConv, several illustrative examples are presented in the following:

(1) Let $U = \mathbf{1}$, where $\mathbf{1}$ is a matrix of ones. Since we have $U \odot \omega_{mn} = \omega_{mn}$, DGConv represents a regular convolution, as shown in Fig. 3 (a). (2) Let $U = I$, where $I$ is an identity matrix. Then $U \odot \omega_{mn}$ becomes a matrix that keeps its diagonal elements while the off-diagonal elements are zeros, as depicted in Fig. 3 (b), implying that every channel is independent. Thus, DGConv becomes a depthwise convolution [9]. (3) If $U$ is a binary block-diagonal matrix as shown in Fig. 3 (d), then $U$ divides channels into groups. Since all diagonal blocks of $U$ are constant matrices of ones, DGConv expresses a conventional group convolution (GConv), which groups adjacent channels together. (4) If $U$ is an arbitrary binary matrix such as Fig. 3 (f), this leads to unstructured convolution.

Therefore, by appropriately constructing the binary relationship matrix $U$, the proposed DGConv is expected to represent a large variety of convolution operations.
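The special cases above can be checked numerically. The sketch below is a minimal illustration restricted to a 1x1 convolution at a single spatial position, so Eqn.(3) reduces to one masked matrix-vector product. The helper names (`masked_pointwise_conv`, `block_diagonal_mask`) and the toy values are assumptions for illustration only.

```python
def masked_pointwise_conv(f, w, u):
    """Compute o = f (U * w) for one position of a 1x1 convolution.

    f: list of C_in input values; w and u: C_in x C_out nested lists.
    """
    c_in, c_out = len(w), len(w[0])
    return [sum(f[i] * u[i][j] * w[i][j] for i in range(c_in)) for j in range(c_out)]


def block_diagonal_mask(c, groups):
    """Binary mask that groups adjacent channels, as in conventional GConv."""
    size = c // groups
    return [[1 if i // size == j // size else 0 for j in range(c)] for i in range(c)]


C = 4
f = [1.0, 2.0, 3.0, 4.0]
w = [[float(i * C + j + 1) for j in range(C)] for i in range(C)]

ones = [[1] * C for _ in range(C)]      # case (1): regular convolution
eye = block_diagonal_mask(C, C)         # case (2): depthwise convolution
two_groups = block_diagonal_mask(C, 2)  # case (3): GConv with G = 2

dense_out = masked_pointwise_conv(f, w, ones)
depthwise_out = masked_pointwise_conv(f, w, eye)
grouped_out = masked_pointwise_conv(f, w, two_groups)

# Reference GConv as in Eqn.(2): convolve each group separately, then concatenate.
half = C // 2
concat_out = []
for g in range(2):
    lo, hi = g * half, (g + 1) * half
    sub_w = [row[lo:hi] for row in w[lo:hi]]
    sub_ones = [[1] * half for _ in range(half)]
    concat_out += masked_pointwise_conv(f[lo:hi], sub_w, sub_ones)

print(grouped_out == concat_out)  # the masked dense form matches the concatenated form
```

Masking a dense kernel is convenient for learning $U$, but it does not by itself save computation. The savings appear once the learned mask is turned into a real grouped kernel.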

Discussions. We have defined DGConv as above. Although its flexibility in convolution representation gives it considerable potential to boost the learning capacity of CNNs, it also introduces some foreseeable difficulties.

First, since Stochastic Gradient Descent (SGD) can only optimize continuous variables, training a binary matrix $U$ by directly using SGD can be challenging. Second, the matrix $U$ introduces a large number of extra parameters into the convolution operation, making the deep networks difficult to train. Third, updating the entire matrix $U$ without any constraint in the training stage could learn an unstructured relationship matrix, as illustrated in Fig. 3 (f). In this case, DGConv is not a valid GConv, making the learned convolution operation hard to interpret.

Therefore, for DGConv, a special construction of $U$ is required to maintain the group structure and reduce the number of extra parameters.

### 3.2 Construction of the Relationship Matrix

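Instead of directly learning the entire matrix $U$, we decompose it into a set of $K$ small matrices,

$$\{U_k \mid U_k \in \{0,1\}^{C^{in}_k \times C^{out}_k},\ k = 1, \dots, K\}.$$

Each small matrix $U_k$ is of shape $C^{in}_k \times C^{out}_k$. Then we define $U$ as

$$U = U_1 \otimes U_2 \otimes \cdots \otimes U_K, \tag{4}$$

where $\otimes$ denotes a Kronecker product. Therefore, we have $C^{in} = \prod_{k=1}^{K} C^{in}_k$ and $C^{out} = \prod_{k=1}^{K} C^{out}_k$, implying that the $C^{in}$-by-$C^{out}$ large matrix $U$ is decomposed into a set of small submatrices by using a sequence of Kronecker products [2].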

Construction of Submatrix. Here we introduce how to construct each submatrix $U_k$. As an illustrative example, we suppose $C^{in} = C^{out} = C = 2^K$ with every $U_k$ of size 2-by-2, which is a common setting in ResNet and ResNeXt. To pursue the most parameter-saving convolution operation, we further represent $U_k$ by a single binary variable $g_k$ as follows:

$$U_k = g_k \mathbf{1} + (1 - g_k) I, \quad \forall g_k \in g, \quad g = \mathrm{sign}(\tilde{g}), \tag{5}$$

where $\mathbf{1}$ denotes a 2-by-2 constant matrix of ones, $I$ denotes a 2-by-2 identity matrix, and $g_k$ indicates the $k$-th component of $g$. $\tilde{g}$ is a learnable gate vector taking continuous values, and $g$ is a binary gate vector derived from $\tilde{g}$. $\mathrm{sign}(\cdot)$ represents a sign function,

$$\mathrm{sign}(x) = \begin{cases} 0, & x < 0, \\ 1, & x \geqslant 0. \end{cases} \tag{6}$$

By combining Eqn.(5) with Eqn.(4), $U$ can be written as

$$U = (g_1 \mathbf{1} + (1 - g_1) I) \otimes \cdots \otimes (g_K \mathbf{1} + (1 - g_K) I). \tag{7}$$

Constructing the relationship matrix $U$ by Eqn.(7) not only remarkably reduces the number of parameters but also gives $U$ a group structure. First, since the parameters to be optimized are $\tilde{g}$, the above construction reduces the number of parameters of $U$ from $C^{in} \times C^{out}$ to $K = \log_2 C$. For example, if a convolution layer has $C = 1024$ channels, we can learn the block diagonal matrix in Eqn.(7) by using merely 10 parameters, whereas learning $U$ directly would require more than $10^6$ parameters. Second, $U$ constructed by Eqn.(7) is a symmetric matrix whose diagonal elements are ones. Moreover, every row and every column of $U$ contains the same multiset of elements. Hence, $U$ has a group structure. For example, when $K = 3$ and exactly one gate is zero, Eqn.(7) yields an 8-by-8 matrix of 2 groups, as shown in Fig. 3 (e); when two gates are zero, Eqn.(7) yields an 8-by-8 matrix of 4 groups, as shown in Fig. 3 (c). These examples show that DGConv can group non-adjacent channels. Fig. 3 (g) shows the dynamic process of an actual DGConv. It can be observed that the positions of the zeros in $g$ control the group structure of $U$. Note that we use only 3 continuous parameters $\tilde{g}$ to produce $g$, enabling us to learn the large 8-by-8 matrix that originally needs 64 parameters to train. A more general case is discussed in Appendix A.
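To make the Kronecker construction of Eqn.(4) to Eqn.(7) concrete, the following sketch builds $U$ from a vector of continuous gates and counts the resulting groups. It is a minimal pure-Python illustration. The helper names (`kron`, `relationship_matrix`, `count_groups`) and the particular gate values are assumptions chosen for illustration, and need not match the configurations drawn in Fig. 3.

```python
def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def sign(x):
    """Binarization of Eqn.(6): 0 for negative inputs, 1 otherwise."""
    return 1 if x >= 0 else 0


def relationship_matrix(gate_logits):
    """Build U = U_1 (x) ... (x) U_K with U_k = g_k * ones + (1 - g_k) * identity."""
    ones = [[1, 1], [1, 1]]
    eye = [[1, 0], [0, 1]]
    u = [[1]]
    for g_tilde in gate_logits:
        g = sign(g_tilde)
        u_k = [[g * ones[r][c] + (1 - g) * eye[r][c] for c in range(2)] for r in range(2)]
        u = kron(u, u_k)
    return u


def count_groups(u):
    """Channels in the same group share the same row of U."""
    return len({tuple(row) for row in u})


# Three continuous gates produce an 8-by-8 relationship matrix.
u_two = relationship_matrix([0.3, 0.7, -0.2])    # g = [1, 1, 0]
u_four = relationship_matrix([-0.5, 0.4, -0.1])  # g = [0, 1, 0]

print(count_groups(u_two), count_groups(u_four))
print(u_two[0])  # channel 0 is connected to channels 0, 2, 4, 6 (non-adjacent)
```

In this construction each zero gate doubles the number of groups, so a layer with $K$ gates can express group numbers $1, 2, 4, \dots, 2^K$.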

Training Algorithm of DGConv. Here we introduce the training algorithm of DGConv. Note that every DGConv layer is trained in the same way, implying that it can be easily plugged into a deep ConvNet by replacing the traditional convolution operations.

The training of DGConv can be implemented in existing software platforms such as PyTorch and TensorFlow. To see this, note that DGConv is computed by combining Eqn.(3), (4), (5), and (6). All of these equations define differentiable transformations except the sign function in Eqn.(6). Therefore, the gradients from the loss function can be propagated down to the binary gates $g$ in Eqn.(5) by simply using automatic differentiation (AD) in the above platforms. The only remaining issue is the sign function in Eqn.(6). The optimization of binary variables is well established in the literature [22, 18, 17, 26], and these techniques can also be used to train DGConv. The gate parameters are optimized with the Straight-Through Estimator, similar to recent network quantization approaches, which is guaranteed to converge [5]. Furthermore, Appendix B provides the explicit gradient computations of DGConv, facilitating its implementation on platforms without automatic differentiation.
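The straight-through idea can be sketched on a toy problem. The example below is only an illustration under stated assumptions: it uses a single gate, a made-up quadratic loss $(g - t)^2$, and the plain identity form of the estimator (the gradient with respect to $g$ is copied unchanged to $\tilde{g}$). The exact estimator and gradients used for DGConv are the ones given in Appendix B, which may differ from this sketch.

```python
def sign(x):
    """Binarization of Eqn.(6): 0 for negative inputs, 1 otherwise."""
    return 1 if x >= 0 else 0


def ste_backward(grad_wrt_gate):
    """Straight-through estimator: treat d sign(x) / dx as 1 in the backward pass."""
    return grad_wrt_gate


def train_gate(g_tilde, target, lr=0.1, steps=10):
    """Run SGD on the continuous gate of a toy loss (g - target)^2."""
    history = []
    for _ in range(steps):
        g = sign(g_tilde)                     # forward pass uses the binary gate
        grad_g = 2.0 * (g - target)           # gradient of the toy loss w.r.t. g
        g_tilde -= lr * ste_backward(grad_g)  # update the continuous gate
        history.append(sign(g_tilde))
    return g_tilde, history


# Hypothetical setup: the gate starts open (g = 1) but the toy loss prefers g = 0.
final_g_tilde, history = train_gate(0.5, target=0)
print(history)
```

The continuous gate drifts until its sign flips, after which the toy gradient vanishes and the binary gate stays fixed. This is the sense in which a binary decision is learned through a continuous parameter.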

### 3.3 Groupable Residual Networks

DGConv is closely related to ResNet and ResNeXt, where ResNeXt extends ResNet by dividing channels into groups. DGConv can be also used with residual learning by simply replacing the traditional group convolutions of ResNeXt with the proposed dynamic grouping convolutions, as shown in Fig. 4. We name this new network architecture Groupable ResNeXt. Table 1 compares the architecture of Groupable-ResNeXt50 (G-ResNeXt50) to that of the original ResNeXt50.

Resource-constrained Groupable Networks. Besides simply replacing convolution layers by using DGConv layers in a deep network, we also provide a resource-constrained training scheme. Different DGConv layers can have different group numbers, such that how and where to reduce computations are totally dependent on training data and tasks.

Towards this end, we propose a regularization term denoted by $\zeta$ to constrain the computational complexity of Groupable-ResNeXt, where $\zeta$ is computed by

$$\zeta = \sum_{\ell=1}^{L} \zeta_\ell \quad \text{and} \quad \zeta_\ell = \sum_{i=1}^{C^{in}} \sum_{j=1}^{C^{out}} u_{ij}, \quad \forall u_{ij} \in U, \tag{8}$$

where $L$ denotes the number of DGConv layers and $u_{ij}$ denotes an element of $U$. It is seen that $\zeta_\ell$ represents the number of non-zero elements in $U$, measuring the number of activated convolution weights (kernels) of the $\ell$-th DGConv layer. Thus, $\zeta$ can be treated as a measurement of the model’s computational complexity.

In fact, it can be deduced from Eqn.(7) that the sum of each row or each column of $U$ is $\prod_{k=1}^{K} (1 + g_k)$. Substituting this into Eqn.(8) gives us

$$\zeta = \sum_{\ell=1}^{L} \zeta_\ell = \sum_{\ell=1}^{L} C_\ell \cdot \prod_{k=1}^{K_\ell} (1 + g^{\ell}_k), \tag{9}$$

where the index $\ell$ indicates the corresponding quantity ($C$, $K$, $g_k$) in the $\ell$-th layer. Here we assume $C^{in}_\ell = C^{out}_\ell = C_\ell$. Let $o$ represent the desired computational complexity of the entire network. Our objective is to search for a deep model that

$$\text{minimize} \quad \mathcal{L}\big(\{\omega_\ell\}_{\ell=1}^{L}, \{\tilde{g}_\ell\}_{\ell=1}^{L}\big) \cdot \left[\frac{o}{\zeta}\right]^{a}, \quad \text{subject to} \quad \zeta \leq o,$$

where $[\cdot]^{a}$ is a weighted product to approximate the Pareto optimal problem [30] and $a$ is a constant value. The weighting factor has no effect if $\zeta \leq o$, implying that the complexity constraint is satisfied. Otherwise, $a$ is used to penalize the model complexity when $\zeta > o$. For the value of $a$, [30] empirically chose settings that work well in reinforcement learning by using rewards. However, these empirical values make the regularizer too sensitive in our problem. In our experiments, we keep $a$ constant.

The above loss function can be optimized by using SGD. By setting the value of $o$, we can learn deep neural networks under different complexity constraints, allowing us to carefully study the trade-off between model accuracy and computational complexity.
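The step from Eqn.(8) to Eqn.(9) can be verified on small examples. The sketch below counts the non-zero entries of $U$ directly and compares the result with the closed form $C \cdot \prod_k (1 + g_k)$, assuming $C = 2^K$ with 2-by-2 submatrices as in Section 3.2. The helper names, the three-layer gate configuration, and the budget value are hypothetical; the weighted product in the objective is deliberately left out.

```python
def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def relationship_matrix_from_gates(gates):
    """Eqn.(7) for binary gates: U_k is all-ones if g_k = 1, identity if g_k = 0."""
    u = [[1]]
    for g in gates:
        u_k = [[1, g], [g, 1]]  # equals g * ones + (1 - g) * identity
        u = kron(u, u_k)
    return u


def zeta_by_counting(gates):
    """Eqn.(8): number of non-zero entries of U for one layer."""
    return sum(sum(row) for row in relationship_matrix_from_gates(gates))


def zeta_closed_form(gates):
    """Eqn.(9): C * prod_k (1 + g_k), with C = 2 ** K."""
    zeta = 2 ** len(gates)
    for g in gates:
        zeta *= 1 + g
    return zeta


# Hypothetical network with three DGConv layers of 8 channels each.
layer_gates = [[1, 1, 0], [0, 1, 0], [1, 1, 1]]
total = sum(zeta_closed_form(g) for g in layer_gates)
budget = 120  # hypothetical value of the desired complexity o

print([zeta_by_counting(g) for g in layer_gates], total, total <= budget)
```

The closed form avoids materializing $U$ and depends on the gates only through simple products, which is why it is convenient as a regularizer.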

## 4 Experiments

Implementation. We conduct experiments on the challenging ImageNet [4] benchmark, which has 1.2 million images for training and 50k images for validation. Following Section 3.3 and [33], we construct 50-layer and 101-layer Groupable ResNeXts. In the training stage, each input image is a random crop with random horizontal flipping. The overall batch size is 512, partitioned across 16 GPUs (32 samples per GPU). We train the networks by using SGD with momentum and weight decay. We adopt the cosine learning rate schedule [16] and the weight initialization of [6]. In the evaluation stage, the error is evaluated on a single center crop. For Groupable ConvNets, the continuous gates $\tilde{g}$ are the only extra parameters required to train. We initialize them with small values or randomly.

Resource Constraint. In experiments, we derive the resource constraint $o$ from a scale factor that expresses the complexity of the group convolution layers in the entire network relative to a ResNeXt baseline. For example, one setting of the scale makes $o$ equivalent to the number of parameters of all GConv layers in a given ResNeXt configuration, and a smaller scale corresponds to a fraction of that complexity. By setting the scale, we are able to control the overall complexity of Groupable ConvNets.

Comparisons. We first evaluate the performance of Groupable-ResNeXt and its counterparts ResNet/ResNeXt. For fair comparison, we re-implement ResNet and ResNeXt under the settings of Section 4, achieving results comparable to the original papers (top-1 accuracy of ResNeXt101: 79.1% (ours) vs. 78.8% [33]). Table 2 shows the results, and Fig. 5 shows the learned group numbers. Although it maintains a module topology similar to ResNeXt, Groupable-ResNeXt learns its own grouping strategy for each group convolution. Compared to ResNet50 and ResNeXt50, G-ResNeXt50 obtains 1.5% / 0.5% higher top-1 accuracy. This trend is also observed for the deeper architectures ResNet101 and ResNeXt101, where the gains in top-1 accuracy are enlarged to 1.7% and 0.8%.

Fig. 2 and Fig. 5 show the learned group numbers. Table 2 reports the performance of G-ResNeXt50 and G-ResNeXt101, which correspond to Fig. 5 (d) and Fig. 5 (a). Unlike ResNeXt, which shares a uniform group number, G-ResNeXt exhibits diverse group numbers. An interesting phenomenon is that different networks manifest some homology. That is, when preserving the overall model complexity, DGConv tends to allocate more computation to lower layers. This is evidence that the representation ability of a ConvNet is highly related to the design of its lower layers.

He et al. [33] found that, when the network complexity is similar, networks with larger cardinality perform better than those that are deeper or wider. The performance gain comes from stronger representations. We suggest that the representations could be even stronger by adjusting the grouping strategy at each layer using DGConv.

Learning dynamics of DGConv. For every DGConv layer in G-ResNeXt50, we plot the learning procedure of the group numbers and the values of the gates in Fig. 6. We observe several characteristics. First, different DGConv layers show different learning dynamics. Second, similar to Fig. 5, lower layers prefer fewer groups than higher layers. Fewer groups correspond to more parameters, implying that lower layers are essential for extracting texture-related features.

Complexity vs. Accuracy. The resource constraint allows us to learn grouping strategies subject to a given model complexity threshold. We then explore the trade-off between the complexity of group convolution and model accuracy. Table 3 shows our results, where “FLOPs” denotes the computational complexity of all group convolution layers in a network. We set the FLOPs of ResNeXt as the baseline and report the complexity of Groupable-ResNeXt as a proportion of it. By modifying the complexity scale, we alter the constraint $o$ and learn Groupable-ResNeXts of various capacities. For example, a given scale makes $o$ equivalent to the size of group convolutions with a uniform group number, and Groupable-ResNeXt is then regularized to choose a grouping strategy whose complexity is below that of ResNeXt.

From Table 3, we see that G-ResNeXt50 and G-ResNeXt101 achieve top-1 accuracy comparable to ResNeXt50 and ResNeXt101, respectively, under their corresponding constraint settings. These results indicate that DGConv is able to learn a more efficient grouping strategy than regular GConv while preserving accuracy. He et al. [33] suggest that increasing cardinality yields stronger representations than going deeper or wider, and we learn dynamic grouping to further improve representation learning with cardinality.

Furthermore, we also see the strong robustness of dynamic grouping convolution, even when the computational complexity of group convolution is significantly reduced. For example, G-ResNeXt101 is able to preserve its top-1 accuracy as the FLOPs decrease (Table 3).

Deeper or Wider Networks. Next we extend our experiments to more complex networks. We expand ResNet101 to a higher complexity by increasing its width, depth, and cardinality, respectively. When expanding cardinality, we implement both the regular GConv and DGConv. Table 4 reports our results. The larger ResNet and ResNeXt are implemented by following [33, 8]. G-ResNeXt101 is constrained to the size of the corresponding ResNeXt101. In Table 4, we see that increasing the model complexity consistently improves network performance over the original ResNet101. Besides, increasing cardinality brings a larger improvement than increasing the network depth or width. Among the last three networks with larger cardinality, G-ResNeXt101 outperforms the corresponding ResNeXt101 in top-1 accuracy. G-ResNeXt101 increases cardinality by using DGConv. These results show that DGConv is superior to regular GConv even in more complex networks.

Reproducibility. We verify the reproducibility of DGConv. We retrain G-ResNeXt101 with the same training strategy and hyper-parameters, but initialize the gates either with a fixed value or randomly with different random seeds. We name the retrained models “G-ResNeXt101R2” and “G-ResNeXt101R3”. Table 5 reports their performance. All models are trained under the same constraint and show comparable top-1 accuracy. These results indicate that DGConv is able to consistently express strong representation ability. We also see that the learned models have similar performance with slightly different grouping strategies, showing the flexibility of DGConv. The detailed group number distribution can be seen in Appendix D.

Evaluation of Learned Architecture. We extend our experiments to the architecture learned by DGConv. We replace the group number of each GConv layer in ResNeXt with the group number learned by G-ResNeXt. The resulting models are then trained on ImageNet from scratch. Table 6 reports their performance. As we can see, the ResNeXt models with group numbers learned by DGConv achieve top-1 and top-5 accuracy comparable to G-ResNeXt and superior to the ResNeXt baseline. The results indicate that the learned structure itself provides strong representations.

## 5 Conclusion

In this work, we propose a novel architecture, Groupable ConvNet (GroupNet), for computational efficiency and improved performance. GroupNet is able to differentiably learn the grouping strategy of the convolution operation on a layer-by-layer basis. We have demonstrated that GroupNet outperforms ResNet and ResNeXt in terms of both accuracy and computational complexity. To build GroupNet, we develop dynamic grouping convolution (DGConv), which provides a unified representation for convolution operations. DGConv can be easily plugged into any deep network model and is expected to learn a better feature representation for each convolution layer.

## 6 Acknowledgement

This work is supported in part by SenseTime Group Limited, and in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14213616.