Differentiable Learning-to-Group Channels via Groupable Convolutional Neural Networks

08/16/2019 · by Zhaoyang Zhang, et al. · SenseTime Corporation · The University of Hong Kong

Group convolution, which divides the channels of a ConvNet into groups, has achieved impressive improvements over the regular convolution operation. However, existing models, e.g., ResNeXt, still suffer from sub-optimal performance because the number of groups is manually defined as a constant over all of the layers. To address this issue, we present Groupable ConvNet (GroupNet), built with a novel dynamic grouping convolution (DGConv) operation, which is able to learn the number of groups in an end-to-end manner. The proposed approach has several appealing benefits. (1) DGConv provides a unified convolution representation and covers many existing convolution operations such as regular dense convolution, group convolution, and depthwise convolution. (2) DGConv is a differentiable and flexible operation which learns to perform various convolutions from training data. (3) GroupNet trained with DGConv learns a different number of groups for different convolution layers. Extensive experiments demonstrate that GroupNet outperforms its counterparts such as ResNet and ResNeXt in terms of accuracy and computational complexity. We also present an introspection and reproducibility study, showing for the first time the learning dynamics of training group numbers.


1 Introduction

Convolutional Neural Networks (ConvNets) have achieved remarkable successes in computer vision. For example, ResNet [7] was a pioneering work on building very deep networks with shortcut connections; this strategy established network depth as an essential dimension of ConvNets for achieving good performance. Beyond tailoring network architectures in depth and width [28, 11, 27, 29], ResNeXt [33] proposed a new dimension, "cardinality", utilizing group convolution to design effective and efficient ConvNets. Group convolution is compact and parameter-saving, which allows ResNeXt to improve accuracy while reducing network parameters, outperforming its counterpart ResNet.

Figure 1: Illustration of different convolution strategies, where the blue circles represent input and output channels, and the lines are the connections between them. (a) Regular convolution: every input channel is connected to every output channel. (b) Group convolution with a fixed cardinality (number of groups) and group width. (c) Depthwise convolution: each input channel is connected to only one output channel, so this convolution can be understood as a linear transformation of each channel. (d) Our proposed dynamic grouping convolution (DGConv). The grouping strategy of DGConv is learned end-to-end together with the network parameters, so the group number and connection locations change dynamically; the example shown is one candidate strategy with 2 groups and non-adjacent channel connections. At test time, DGConv can be implemented simply as a group convolution with the group number learned during training, which reduces computation and parameters.

Although group convolution is easy to implement, applying it in previous networks such as ResNeXt still has drawbacks.

First, when designing network architectures with group convolutions, the number of groups of each hidden layer has typically been treated as a hyper-parameter: it is defined by human experts and kept the same for all hidden layers of a ConvNet. Second, previous work employed such homogeneous group convolutions, leading to sub-optimal solutions. For instance, one of the most practical settings of ResNeXt is "32×4d", which applies group convolution with 32 groups and was found by trial and error. However, convolution layers at different depths of a ConvNet typically learn visual features representing different abstractions and semantic meanings. Thus, uniformly reducing model parameters via group convolution may degrade performance.

To address the above issues, this work introduces an autonomous formulation of group convolution, named Dynamic Grouping Convolution (DGConv), which generalizes many convolution operations and has the following appealing properties. (1) Dynamic grouping. The core of DGConv is to train the convolution kernels and the grouping strategy simultaneously. As shown in Fig. 1, DGConv is able to learn the grouping strategy (the group number and the connections between channels within a group) during training. In this way, each DGConv layer can have an individual grouping strategy. Moreover, by imposing a regularization term on the computational complexity, we can control the overall model size and computational overhead. (2) Differentiability.

The learning of DGConv is fully differentiable and can be trained in an end-to-end manner by using stochastic gradient descent (SGD); thus, DGConv is compatible with existing ConvNets. (3) Parameter-saving. The extra parameters needed to learn the grouping strategy in DGConv scale only as $\log_2 C$, where $C$ is the number of channels of a convolution layer. This extra number of parameters is far less than the number of convolution kernel parameters, which is proportional to $C \times C'$ (the kernel parameters have size $C \times C' \times k \times k$, where $C$ and $C'$ denote the input and output channel sizes).

Furthermore, these extra parameters can be discarded after training: in the testing stage, only the parameters of the convolution kernels are stored and loaded. Fig. 2 shows an example of the group numbers learned by DGConv, which achieves comparable performance to its counterpart while significantly reducing parameters and computations.

This work makes three key contributions. (1) We propose a novel convolution operation, Dynamic Grouping Convolution (DGConv), which is able to differentiably learn the number of groups of a group convolution, unlike existing work that treats the group number as a hyper-parameter. To our knowledge, this is the first work to learn the group number in a differentiable and data-driven way. (2) DGConv can replace previous convolutions to build state-of-the-art deep networks such as the proposed Groupable ResNeXt in Section 3.3, where the group number of each convolution layer is automatically determined during end-to-end training. (3) Extensive experiments demonstrate that Groupable ResNeXt outperforms both ResNet and ResNeXt with a comparable or even smaller number of parameters. For example, it surpasses ResNeXt101 by 0.8% top-1 accuracy on ImageNet with slightly fewer parameters and computations. Moreover, we study the learning dynamics of group numbers and report interesting findings.

Figure 2: Comparison of group numbers in ResNeXt and GroupNet. We use ResNeXt50 (32×4d) as an example here, which has 32 groups with width 4, and G-ResNeXt50 (b=96) denotes ResNeXt50 trained with DGConv, where b represents the constraint setting (discussed later). The y-axis indicates the number of groups, and the x-axis is the number of channels of the different convolution layers.

2 Related Work

Group Convolution. Group convolution (GConv) is a special case of sparsely connected convolution. In regular convolution, we produce $C'$ output channels by applying $C'$ convolution filters over all $C$ input channels, resulting in a computational cost on the order of $C \times C' \times k \times k \times H \times W$. In contrast, GConv reduces this cost by dividing the input channels into $G$ non-overlapping groups. After applying filters over each group, GConv generates the output channels by concatenating the outputs of all groups, giving a complexity on the order of $C \times C' \times k \times k \times H \times W / G$.
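
As a concrete illustration of these costs, the short PyTorch sketch below (our example, with illustrative sizes $C=256$, $k=3$, $G=32$ that are not taken from the paper) compares the parameter counts of a regular, a grouped, and a depthwise convolution layer.

```python
# Parameter counts for regular vs. group vs. depthwise convolution (PyTorch).
# The layer sizes below are illustrative only.
import torch.nn as nn

C, k, G = 256, 3, 32

def n_params(module):
    return sum(p.numel() for p in module.parameters())

regular   = nn.Conv2d(C, C, k, padding=1, bias=False)            # ~ C * C * k * k
grouped   = nn.Conv2d(C, C, k, padding=1, groups=G, bias=False)  # ~ C * C * k * k / G
depthwise = nn.Conv2d(C, C, k, padding=1, groups=C, bias=False)  # ~ C * k * k

print(n_params(regular), n_params(grouped), n_params(depthwise))
# 589824 18432 2304 -- grouping divides the kernel parameters by G
```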

GConv was first used in AlexNet [12] as a model-distribution approach to handle memory limitations. ResNeXt [33] introduced "cardinality" as an additional dimension of network architecture by using GConv, leading to a series of further studies applying group convolution to portable neural architecture design [35, 19, 36, 10]. In the extreme case, group convolution assigns each channel to its own group, which is known as depthwise convolution; it has been widely used in efficient neural architecture design [9, 19, 36, 25].

Moreover, CondenseNet [10] and FLGC [31] learn the connections of a group convolution, but the number of groups is still a predefined hyper-parameter. CondenseNet and FLGC treat connection learning as a pruning problem, where unimportant filters are removed. In contrast, DGConv learns both the group number and the channel connections of each group.

Neural Architecture Search. Recently, there has been growing interest in automating the design of neural architectures, usually referred to as Neural Architecture Search (NAS) or AutoML. For example, NASNet [38, 37] and MetaQNN [1] led the trend of architecture search by using reinforcement learning (RL). In NASNet, the network architecture is decomposed into repeatable and transferable blocks, such that the control parameters of the architecture can be limited to a finite search space. The sequence of architecture parameters is generated by a controller RNN, which is trained by maximizing the reward (validation accuracy). These methods have been extended in many ways, such as progressive search [13], parameter sharing [21], network transformation [3], resource-constrained search [30], and differentiable search like DARTS [15] and SNAS [34]. Evolutionary algorithms are an alternative to RL, where architectures are searched by mutating the best architectures found so far [23, 24, 32, 20, 14]. However, all of the above methods either treat the group number as a hyper-parameter or search its value by using sampling methods such as RL. In contrast, DGConv is the first model that can optimize the group number in a data-driven way and in a differentiable, end-to-end manner together with the network parameters.

Figure 3: Illustration of grouping structures with the relationship matrix $U$. The hollow circles and solid black circles indicate '0' and '1', respectively. A matrix of ones (a), the identity matrix (b), and a block-diagonal matrix (d) imply regular convolution, depthwise convolution, and group convolution (GConv), respectively. (c) and (e) show Dynamic Grouping Convolution (DGConv) under two non-adjacent grouping strategies, one with a group number of 4 and the other with 2. (f) is a random grouping strategy, which cannot be produced under our construction. (g) illustrates the construction process of DGConv: the binary relationship matrix $U$ disables part of the weights $W$ via the elementwise product.

3 Our Approach

3.1 Dynamic Grouping Convolution (DGConv)

We first present conventional convolution and group convolution, and then introduce DGConv.

Regular Convolution. Let a feature map of a ConvNet be $F \in \mathbb{R}^{N \times C \times H \times W}$, where $N$, $C$, $H$, and $W$ represent the number of samples in a minibatch, the number of channels, and the height and width of a channel, respectively. If a regular convolution with kernel size $k \times k$ and stride 1 (with padding) is applied on $F$, the output feature map is denoted as $F^{o} \in \mathbb{R}^{N \times C' \times H \times W}$, where every output unit $F^{o}_{n,j,h,w}$ is

$F^{o}_{n,j,h,w} = \sum_{i=1}^{C} \sum_{u=1}^{k} \sum_{v=1}^{k} W_{j,i,u,v}\, F_{n,i,h+u,w+v},$   (1)

where $n \in [1, N]$, $j \in [1, C']$, $h \in [1, H]$, $w \in [1, W]$, and $F_{n,i,h+u,w+v}$ represents a hidden unit of the input feature map $F$ (indices taken over the padded input). $W \in \mathbb{R}^{C' \times C \times k \times k}$ represents the convolution weights (kernels).
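
To make the index notation in Eqn. (1) concrete, the short sketch below (our example; stride 1 and no padding for brevity) evaluates the summation with explicit loops and checks it against the library implementation.

```python
# Naive evaluation of Eqn. (1) versus torch.nn.functional.conv2d (PyTorch).
import torch
import torch.nn.functional as nnF

N, C, C_out, H, W, k = 2, 4, 6, 8, 8, 3
x = torch.randn(N, C, H, W)        # input feature map F
w = torch.randn(C_out, C, k, k)    # convolution weights W

out = torch.zeros(N, C_out, H - k + 1, W - k + 1)
for n in range(N):
    for j in range(C_out):
        for h in range(H - k + 1):
            for q in range(W - k + 1):
                # sum over input channels i and kernel offsets (u, v)
                out[n, j, h, q] = (w[j] * x[n, :, h:h + k, q:q + k]).sum()

assert torch.allclose(out, nnF.conv2d(x, w), atol=1e-4)
```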

Group Convolution. Group convolution (GConv) can be defined as a regular convolution with sparse kernels. GConv is often implemented as a concatenation of separate convolutions over grouped channels,

$F^{o} = \mathrm{concat}\big(F^{(1)} * W^{(1)},\; F^{(2)} * W^{(2)},\; \ldots,\; F^{(G)} * W^{(G)}\big),$   (2)

where $G$ is the group number, $*$ denotes convolution, and $\mathrm{concat}$ denotes concatenation along the channel dimension. In the context of GConv, we have $F^{(g)} \in \mathbb{R}^{N \times \frac{C}{G} \times H \times W}$ and $W^{(g)} \in \mathbb{R}^{\frac{C'}{G} \times \frac{C}{G} \times k \times k}$. In the extreme case where every channel forms its own group ($G = C = C'$), Eqn. (2) expresses the depthwise convolution [9, 25, 19, 36]. Both GConv and depthwise convolution reduce computational cost and can be efficiently implemented in existing deep learning libraries. However, the intrinsic hyper-parameter $G$ is designed manually, keeping the performance away from the optimum.

Dynamic Grouping Convolution. Dynamic grouping convolution (DGConv) extends group convolution by learning the grouping strategy, that is, the group number and the channel connections of each group. The strategy is modeled by a binary relationship matrix $U \in \{0,1\}^{C' \times C}$, and DGConv is defined as

$F^{o} = (U \odot W) * F,$   (3)

where $\odot$ denotes the elementwise product (with $U$ broadcast over the spatial kernel dimensions of $W$) and $*$ denotes convolution. It is noteworthy that Eqn. (3) has rich representational capacity: many convolution operations can be treated as special cases of DGConv. To build some intuition about the flexibility of DGConv, several illustrative examples are presented in the following.

(1) Let $U = \mathbf{1}$, where $\mathbf{1}$ is a matrix of ones. Since $U \odot W = W$, DGConv represents a regular convolution, as shown in Fig. 3 (a). (2) Let $U = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix. Then $U \odot W$ keeps only the diagonal kernels while the off-diagonal entries become zeros, as depicted in Fig. 3 (b), implying that every channel is independent; thus, DGConv becomes a depthwise convolution [9]. (3) If $U$ is a binary block-diagonal matrix as shown in Fig. 3 (d), then $U$ divides the channels into groups. Since all diagonal blocks of $U$ are constant matrices of ones, DGConv expresses a conventional group convolution (GConv), which groups adjacent channels together. (4) If $U$ is an arbitrary binary matrix such as in Fig. 3 (f), the result is an unstructured convolution.

Therefore, by appropriately constructing the binary relationship matrix $U$, the proposed DGConv can represent a large variety of convolution operations.
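
A minimal sketch of the masked convolution in Eqn. (3) is given below (our naming; $U$ is assumed to be an already-constructed binary $C' \times C$ matrix).

```python
# DGConv forward pass as in Eqn. (3): mask the kernels with U, then convolve.
import torch
import torch.nn.functional as nnF

def dgconv(x, weight, U, stride=1, padding=1):
    """x: (N, C, H, W); weight: (C_out, C, k, k); U: binary (C_out, C)."""
    masked = weight * U[:, :, None, None]   # elementwise product U ⊙ W,
                                            # broadcast over the k x k kernel
    return nnF.conv2d(x, masked, stride=stride, padding=padding)

# Special cases: U = torch.ones(C, C) recovers a regular convolution,
# U = torch.eye(C) a depthwise convolution (when C_out == C).
```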

Discussions. We have defined DGConv above. Although it has great potential to boost the learning capacity of a ConvNet owing to its flexibility in representing convolutions, some foreseeable difficulties are also introduced. First, since stochastic gradient descent (SGD) can only optimize continuous variables, training a binary matrix directly with SGD is challenging. Second, the matrix $U$ introduces a large number of extra parameters into the convolution operation, making deep networks difficult to train. Third, updating the entire matrix $U$ without any constraint during training could yield an unstructured relationship matrix, as illustrated in Fig. 3 (f); in this case, DGConv is not a valid GConv, making the learned convolution operation hard to interpret. Therefore, a special construction of $U$ is required to maintain the group structure and to reduce the number of extra parameters.

3.2 Construction of the Relationship Matrix

Instead of directly learning the entire matrix $U$, we decompose it into a set of small matrices $\{U_1, U_2, \ldots, U_K\}$, where each small matrix $U_k$ is of shape $C'_k \times C_k$, with $\prod_{k=1}^{K} C_k = C$ and $\prod_{k=1}^{K} C'_k = C'$. We then define $U$ as

$U = U_1 \otimes U_2 \otimes \cdots \otimes U_K,$   (4)

where $\otimes$ denotes the Kronecker product. Therefore, we have $U \in \{0,1\}^{C' \times C}$ as required, implying that the large $C'$-by-$C$ matrix is decomposed into a set of small submatrices by a sequence of Kronecker products [2].

Construction of Submatrix. Here we introduce how to construct each submatrix $U_k$. As an illustrative example, we suppose $C = C'$, which is a common setting in ResNet and ResNeXt. To pursue the most parameter-saving convolution operation, we further represent each $U_k$ by a single binary variable $\tilde{g}_k$ as follows:

$U_k = \tilde{g}_k \mathbf{1}_{2\times 2} + (1 - \tilde{g}_k)\, \mathbf{I}_{2\times 2},$   (5)

where $\mathbf{1}_{2\times 2}$ denotes a 2-by-2 constant matrix of ones, $\mathbf{I}_{2\times 2}$ denotes a 2-by-2 identity matrix, and the subscript $k$ indicates the $k$-th component. $g \in \mathbb{R}^{K}$ is a learnable gate vector taking continuous values, and $\tilde{g} = \mathrm{sign}(g)$ is the binary gate vector derived from $g$. Here $\mathrm{sign}(\cdot)$ represents the sign function

$\mathrm{sign}(g_k) = \begin{cases} 1, & g_k \ge 0, \\ 0, & g_k < 0. \end{cases}$   (6)

By combining Eqn. (5) with Eqn. (4), $U$ can be written as

$U = \bigotimes_{k=1}^{K} \big( \tilde{g}_k \mathbf{1}_{2\times 2} + (1 - \tilde{g}_k)\, \mathbf{I}_{2\times 2} \big).$   (7)

Constructing the relationship matrix $U$ by Eqn. (7) not only remarkably reduces the number of parameters but also makes $U$ have a group structure. First, note that the parameters to be optimized are the gates $g \in \mathbb{R}^{K}$; the above construction therefore reduces the number of parameters of $U$ from $C \times C$ to $K = \log_2 C$. For example, if a convolution layer has $C = 1024$ channels, we can learn the block-diagonal matrix $U$ in Eqn. (7) with merely 10 parameters, remarkably reducing the number of training parameters, which previously was more than $10^6$. Second, we see that $U$ constructed by Eqn. (7) is a symmetric matrix with diagonal elements of ones; moreover, each row and each column of $U$ has the same number of non-zero elements. Hence, $U$ has a group structure. For example, when $K = 3$ and $\tilde{g} = (1, 1, 0)$, Eqn. (7) gives $U = \mathbf{1}_{2\times 2} \otimes \mathbf{1}_{2\times 2} \otimes \mathbf{I}_{2\times 2}$, an 8-by-8 matrix with 2 groups as shown in Fig. 3 (e); when $\tilde{g} = (1, 0, 0)$, Eqn. (7) gives $U = \mathbf{1}_{2\times 2} \otimes \mathbf{I}_{2\times 2} \otimes \mathbf{I}_{2\times 2}$, an 8-by-8 matrix with 4 groups as shown in Fig. 3 (c). These examples show that the proposed DGConv can group non-adjacent channels. Fig. 3 (g) shows the dynamic construction of the actual $U$ of DGConv; the positions of the '1's in $\tilde{g}$ control the group structure of $U$ and of $U \odot W$. Note that we use only 3 continuous parameters to produce $\tilde{g}$, enabling us to learn the large 8-by-8 matrix that would originally need 64 parameters to train. A more general case is discussed in Appendix A.
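
The construction in Eqn. (7) can be written in a few lines; the sketch below (our naming) builds $U$ from a binary gate vector with torch.kron and reproduces the 8-by-8 examples above.

```python
# Build the relationship matrix U of Eqn. (7) from K binary gates.
import torch

def build_U(g_tilde):
    """g_tilde: binary tensor of shape (K,); returns U of shape (2^K, 2^K)."""
    U = torch.ones(1, 1)
    for g in g_tilde:
        # Each factor is either a 2x2 all-ones block (g=1) or the 2x2 identity (g=0).
        U = torch.kron(U, g * torch.ones(2, 2) + (1 - g) * torch.eye(2))
    return U

print(build_U(torch.tensor([1., 1., 0.])))  # 8x8, 2 non-adjacent groups
print(build_U(torch.tensor([1., 0., 0.])))  # 8x8, 4 non-adjacent groups
print(build_U(torch.tensor([1., 1., 1.])))  # all ones: regular convolution
```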

Training Algorithm of DGConv. Here we introduce the training algorithm of DGConv. Note that every DGConv layer is trained in the same way, implying that it can easily be plugged into a deep ConvNet by replacing the traditional convolution operations.

The training of DGConv can be implemented simply in existing software platforms such as PyTorch and TensorFlow. To see this, note that DGConv is computed by combining Eqns. (3), (4), (5), and (6). All of these equations define differentiable transformations except the sign function in Eqn. (6). Therefore, the gradients of the loss function can be propagated down to the binary gates $\tilde{g}$ in Eqn. (5) simply by using automatic differentiation (AD) in the above platforms. The only remaining issue is the sign function in Eqn. (6). The optimization of binary variables has been well established in the literature [22, 18, 17, 26] and can also be used to train DGConv: the gate parameters $g$ are optimized with a straight-through estimator, similar to recent network quantization approaches, which is guaranteed to converge [5]. Furthermore, Appendix B provides the explicit gradient computations of DGConv, facilitating its implementation on platforms without automatic differentiation.
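
As a rough illustration of how the non-differentiable sign in Eqn. (6) can be handled, the sketch below implements a straight-through estimator in PyTorch; this is an assumption on our part, and the authors' exact estimator may differ in its details.

```python
# Straight-through estimator for the binary gates: hard sign in the forward
# pass, identity gradient in the backward pass.
import torch

class BinarySign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, g):
        return (g >= 0).float()      # Eqn. (6): 1 if g >= 0, else 0

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output           # pass gradients straight through the sign

g = torch.zeros(3, requires_grad=True)   # continuous gate vector g
g_tilde = BinarySign.apply(g)            # binary gates used to build U
g_tilde.sum().backward()                 # gradients flow back to g
print(g.grad)                            # tensor([1., 1., 1.])
```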

3.3 Groupable Residual Networks

DGConv is closely related to ResNet and ResNeXt, where ResNeXt extends ResNet by dividing channels into groups. DGConv can also be used with residual learning by simply replacing the traditional group convolutions of ResNeXt with the proposed dynamic grouping convolutions, as shown in Fig. 4. We name this new network architecture Groupable ResNeXt. Table 1 compares the architecture of Groupable-ResNeXt50 (G-ResNeXt50) to that of the original ResNeXt50.

Resource-constrained Groupable Networks. Besides simply replacing the convolution layers of a deep network with DGConv layers, we also provide a resource-constrained training scheme. Different DGConv layers can have different group numbers, so how and where to reduce computation is determined entirely by the training data and the task.

Towards this end, we propose a regularization term, denoted by $\Omega$, to constrain the computational complexity of Groupable-ResNeXt, where $\Omega$ is computed by

$\Omega = \sum_{l=1}^{L} \sum_{i,j} U^{l}_{i,j},$   (8)

where $L$ denotes the number of DGConv layers and $U^{l}_{i,j}$ denotes an element of the relationship matrix $U^{l}$. Since $U^{l}$ is binary, $\sum_{i,j} U^{l}_{i,j}$ is the number of non-zero elements in $U^{l}$, measuring the number of activated convolution weights (kernels) of the $l$-th DGConv layer. Thus, $\Omega$ can be treated as a measurement of the model's computational complexity.

Figure 4: Comparison of the residual building blocks of ResNeXt50 (32×4d) (left) and Groupable-ResNeXt50 (right). We simply replace all group convolution layers with dynamic grouping convolution layers.

In fact, it can be deduced from Eqn. (7) that the sum of each row or each column of $U$ equals $2^{\sum_{k=1}^{K} \tilde{g}_k}$, i.e., the number of channels per group $C/G$. Substituting this into Eqn. (8) gives us

$\Omega = \sum_{l=1}^{L} C_{l} \cdot 2^{\sum_{k=1}^{K_l} \tilde{g}^{\,l}_{k}},$   (9)

where $C_l$ and $\tilde{g}^{\,l}$ indicate $C$ and $\tilde{g}$ in the $l$-th layer, respectively. Here we assume $C_l = C'_l$. Let $\Omega_{tgt}$ represent the desired computational complexity of the entire network; our objective is then to search for a deep model that solves

$\min_{\mathbf{W},\, g} \; \mathcal{L}(\mathbf{W}, g) \cdot \big(\Omega / \Omega_{tgt}\big)^{\lambda},$

where the factor $(\Omega/\Omega_{tgt})^{\lambda}$ is a weighted product approximating the Pareto-optimal problem [30] and $\lambda$ is a constant. We set $\lambda = 0$ if $\Omega \le \Omega_{tgt}$, implying that the complexity constraint is satisfied; otherwise, a positive $\lambda$ is used to penalize the model complexity when $\Omega > \Omega_{tgt}$. For the value of $\lambda$, [30] empirically set small fixed values, and this setting works well in reinforcement learning with rewards. However, these empirical values make the regularizer too sensitive in our problem, so in our experiments we keep $\lambda$ fixed to a constant.

The above loss function can be optimized by using SGD. By setting the value of $\Omega_{tgt}$, we can learn deep neural networks under different complexity constraints, allowing us to carefully study the trade-off between model accuracy and computational complexity.
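
A compact sketch of this resource-constrained objective is shown below (our naming; the value of lam is illustrative, not the constant used by the authors).

```python
# Complexity regularizer of Eqns. (8)-(9) and the weighted-product objective.
import torch

def complexity(U_list):
    """Omega: total number of non-zero entries of U over all DGConv layers."""
    return sum(U.sum() for U in U_list)     # U is binary, so sum == count

def constrained_loss(task_loss, omega, omega_target, lam=0.1):
    """No penalty while omega <= omega_target; otherwise scale the loss."""
    ratio = torch.clamp(omega / omega_target, min=1.0)
    return task_loss * ratio ** lam
```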

stage output ResNeXt50 (32×4d) G-ResNeXt50
conv1 112×112 [7×7, 64], stride 2 [7×7, 64], stride 2
maxpool 56×56 3×3 max pool, stride 2 3×3 max pool, stride 2
conv2 56×56 [1×1, 128; 3×3, 128, G=32; 1×1, 256] ×3 [1×1, 128; 3×3 DGConv, 128; 1×1, 256] ×3
conv3 28×28 [1×1, 256; 3×3, 256, G=32; 1×1, 512] ×4 [1×1, 256; 3×3 DGConv, 256; 1×1, 512] ×4
conv4 14×14 [1×1, 512; 3×3, 512, G=32; 1×1, 1024] ×6 [1×1, 512; 3×3 DGConv, 512; 1×1, 1024] ×6
conv5 7×7 [1×1, 1024; 3×3, 1024, G=32; 1×1, 2048] ×3 [1×1, 1024; 3×3 DGConv, 1024; 1×1, 2048] ×3
Table 1: Comparison of network structures between ResNeXt50 (32×4d) and Groupable-ResNeXt50. In ResNeXt50 (32×4d), the group number G=32 of the 3×3 convolutions is a hyper-parameter in the channel domain. Groupable-ResNeXt50 replaces all group convolution layers in ResNeXt50 (32×4d) with DGConv layers, keeping everything else unchanged.
Figure 5: Learned number of groups for each DGConv layer in Groupable-ResNeXt: (a) and (b) G-ResNeXt101 under two constraint settings, and (c) and (d) G-ResNeXt50 under two constraint settings. The x-axis denotes the number of channels of the DGConv layers, ordered from the network's input to its output, and the y-axis is the group number of the corresponding layer.

4 Experiments

Implementation. We conduct experiments on the challenging ImageNet [4] benchmark, which has 1.2 million images for training and 50k images for validation. Following Section 3.3 and [33], we construct 50-layer and 101-layer Groupable ResNeXts. In the training stage, each input image is a 224×224 crop randomly sampled from a randomly horizontally flipped image. The overall batch size is 512, partitioned over 16 GPUs (32 samples per GPU). We train the networks using SGD with momentum and weight decay, adopting the cosine learning rate schedule [16] and the weight initialization of [6]. In the evaluation stage, the error is evaluated on a single center crop. For Groupable ConvNets, the continuous gates $g$ are the only extra parameters that need to be trained; we initialize them with small values or randomly.
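
For reference, a hedged sketch of the optimizer and schedule described above is given below; the momentum, weight decay, and base learning rate values are common ImageNet defaults that we assume, since the exact numbers are not stated in this text.

```python
# Optimizer and cosine schedule for training Groupable ConvNets (sketch).
import torch

def make_optimizer(model, total_steps, base_lr=0.2):
    # Assumed values: momentum 0.9, weight decay 1e-4, base LR 0.2 for batch 512.
    opt = torch.optim.SGD(model.parameters(), lr=base_lr,
                          momentum=0.9, weight_decay=1e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps)
    return opt, sched
```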

Resource Constraint. In our experiments, we derive the resource constraint $\Omega_{tgt}$ from a scale constant $b$, which denotes the scale of complexity of the group convolution layers in the entire network. For example, when $b = 32$, $\Omega_{tgt}$ is equivalent to the number of parameters of all GConv layers in ResNeXt (32×4d), i.e., it represents the complexity of the GConv layers in ResNeXt (32×4d). When $b$ is doubled, $\Omega_{tgt}$ is halved relative to ResNeXt (32×4d), and so on. By setting $b$, we are able to control the overall complexity of Groupable ConvNets.

Architecture Params# Top-1 Accuracy
ResNet50 25 M 76.4
InceptionV3 23 M 77.5
IBN-Net50-a 25 M 77.5
SE-ResNet50 28 M 77.7
ResNeXt50 25 M 77.8
DenseNet161(k=48) 29 M 77.8
DenseNet264(k=32) 33 M 77.9
G-ResNeXt50(b=32, ours) 25M 78.4
ResNet101 44 M 78.0
SE-ResNet101 48 M 78.4
ResNeXt101 44 M 78.8
DenseNet-232 (k=48) 55 M 78.8
G-ResNeXt101(b=32, ours) 43M 79.9
Table 2: Comparison of top-1 accuracy on ImageNet when the numbers of parameters of the different networks are almost the same. Our approach shows superior performance to its counterparts. Groupable-ResNeXt is abbreviated as G-ResNeXt. Accuracy is evaluated on a single center crop. We set the scale constant b of the model complexity in Groupable-ResNeXt to 32, so as to keep the parameter size close to that of the counterpart ResNet and ResNeXt. We choose the ResNeXt setting 32×4d, which outperforms the other settings in [33].

Comparisons. We first evaluate the performance of Groupable-ResNeXt and its counterparts ResNet and ResNeXt. For a fair comparison, we re-implement ResNet and ResNeXt under the settings of Section 4, achieving results comparable to the original papers (top-1 accuracy of ResNeXt101 (32×4d): 79.1% for our re-implementation vs. 78.8% in [33]). Table 2 shows the results, and Fig. 5 shows the learned group numbers. Although maintaining the same module topology as ResNeXt, Groupable-ResNeXt learns optimal grouping strategies for group convolution. Compared to ResNet50 and ResNeXt50, G-ResNeXt50 obtains 1.5% / 0.5% higher top-1 accuracy. This trend is also observed in the deeper architectures ResNet101 and ResNeXt101, where the top-1 accuracy gains enlarge to 1.7% and 0.8%.

Fig. 2 and Fig. 5 show the learned group numbers. Table 2 reports the performance of G-ResNeXt50 (b=32) and G-ResNeXt101 (b=32), which correspond to Fig. 5 (d) and Fig. 5 (a). Unlike ResNeXt, which shares a uniform group number across layers, diverse group numbers can be observed in G-ResNeXt. An interesting phenomenon is that different networks manifest similar tendencies: when preserving the overall model complexity, DGConv tends to allocate more computation to the lower layers. This is evidence that the representation ability of a ConvNet is highly related to the design of its lower layers.

Xie et al. [33] found that, when the network complexity is similar, networks with larger cardinality perform better than deeper or wider ones; the performance gain comes from stronger representations. We suggest that the representations can be made even stronger by adjusting the grouping strategy of each layer using DGConv.

Learning dynamics of DGConv. For every DGConv layer in G-ResNeXt50, we plot the learning process of the group numbers and the gate values in Fig. 6. From these curves, DGConv exhibits several notable behaviors. First, different DGConv layers show different learning dynamics. Second, similar to Fig. 5, lower layers prefer fewer groups than higher layers; since fewer groups correspond to more parameters, this implies that the lower layers are essential for extracting texture-related features.

Complexity vs. Accuracy. The resource constraint allows us to learn optimal grouping strategies subject to a given model complexity threshold. We then explore the trade-off between the complexity of the group convolutions and model accuracy. Table 3 shows our results, where "GConv FLOPs" denotes the computational complexity of all group convolution layers in a network. We set the FLOPs of ResNeXt as the baseline and report the complexity of Groupable-ResNeXt as a proportion of it. By modifying $b$, we alter the constraint and learn Groupable-ResNeXt models of various capacities. For example, when $b > 32$, $\Omega_{tgt}$ is equivalent to the size of group convolutions with group number $b$ applied uniformly, and Groupable-ResNeXt is regularized to choose a grouping strategy with less than ResNeXt's GConv complexity.

From Table 3, we see that G-ResNeXt50 achieves top-1 accuracy comparable to ResNeXt50 even in its most constrained setting, and likewise G-ResNeXt101 remains comparable to ResNeXt101 in its most constrained setting. These results indicate that DGConv is able to learn a more efficient grouping strategy than regular GConv while preserving accuracy. Xie et al. [33] suggest that widening cardinality yields stronger representations than increasing depth or width, and we learn dynamic grouping to further improve the representation learning of wide cardinality.

Furthermore, we also observe the strong robustness of dynamic grouping convolution even when the computational complexity of the group convolutions is significantly reduced. For example, as the GConv FLOPs decrease across the settings in Table 3, G-ResNeXt101 largely preserves its top-1 accuracy.


Figure 6: Learning dynamics of the group number and the learnable gate vector $g$ during training of Groupable-ResNeXt50 on ImageNet. (a) visualizes how the number of groups at different depths evolves with training. (b) shows the corresponding learning process of the gate values $g$. The number of channels of each layer is given at the bottom.
Architecture Settings GConv FLOPs top-1 top-5
ResNeXt50 32×4d 77.9 93.9
G-ResNeXt50 78.4 94.0
G-ResNeXt50 78.2 93.9
G-ResNeXt50 78.0 93.9
G-ResNeXt50 78.0 93.9
G-ResNeXt50 77.8 93.8
ResNeXt101 32×4d 79.1 94.2
G-ResNeXt101 79.9 94.7
G-ResNeXt101 79.7 94.6
G-ResNeXt101 79.8 94.7
G-ResNeXt101 79.5 94.5
G-ResNeXt101 79.4 94.5
G-ResNeXt101 79.0 94.3
Table 3: Trade-off between complexity and accuracy. Here "GConv FLOPs" represents the computational complexity of all group convolution layers in the corresponding network architecture. The FLOPs of ResNeXt50/101 are regarded as the baselines, and we report the complexity of the other models as proportions of them. All G-ResNeXt models outperform the baselines in top-1 accuracy with much less computation. Even given only a fraction of the GConv FLOPs, both G-ResNeXt50 and G-ResNeXt101 achieve top-1/top-5 accuracy comparable to the baselines.

Deeper or Wider Networks. Next we extend our experiments to more complex networks. We expand ResNet101 to roughly 2× complexity by increasing its width, depth, and cardinality, respectively. When expanding cardinality, we implement both regular GConv and DGConv. Table 4 reports the results. The larger ResNet and ResNeXt models are implemented following [33, 8], and G-ResNeXt101 is constrained to the size of the enlarged ResNeXt101. From Table 4, we see that increasing the model complexity consistently improves network performance (the original ResNet101 achieves 78.0%). Moreover, increasing cardinality brings a larger improvement than increasing the network depth or width (79.8%/79.6% vs. 78.6%/78.8%). Among the three networks with larger cardinality, G-ResNeXt101 (80.1%) outperforms the corresponding ResNeXt101 models by 0.3%–0.5% top-1 accuracy. G-ResNeXt101 increases cardinality by using DGConv, showing that DGConv is superior to regular GConv even in more complex networks.

Architecture Settings Complexity top-1 top-5
ResNet200 (depth) d ResNet101 78.6 94.1
ResNet101 (wider [8]) d ResNet101 78.8 94.4
ResNeXt101 (card.) d ResNet101 79.8 94.7
ResNeXt101 (card.) d ResNet101 79.6 94.6
G-ResNeXt101 (card.) ResNet101 80.1 94.7
Table 4: Network performance on ImageNet when the number of parameters is increased to roughly 2× that of ResNet101. All of the above networks are re-implemented under the same settings for a fair comparison. G-ResNeXt represents Groupable-ResNeXt. To keep the parameter size close to the enlarged ResNeXt101, the scale constant b of the model complexity in G-ResNeXt is set accordingly. G-ResNeXt achieves the highest top-1/top-5 accuracy among all architectures.

Reproducibility. We verify the reproducibility of DGConv. We retrain G-ResNeXt101 while keeping the training strategy and hyper-parameters fixed, but initialize the gates either with small values or randomly, using different random seeds. We name the retrained models "G-ResNeXt101R2" and "G-ResNeXt101R3". Table 5 reports their performance. All models are trained under the same constraint and show comparable top-1 accuracy. These results indicate that DGConv consistently yields strong representation ability. We also see that the learned models reach similar performance with slightly different grouping strategies, showing the flexibility of DGConv. The detailed group-number distributions are given in Appendix D.

Architecture Settings Params top-1 top-5
G-ResNeXt101 79.9 94.7
G-ResNeXt101R2 79.8 94.5
G-ResNeXt101R3 79.6 94.5
Table 5: Reproducibility experiment results. G-ResNeXt101R2 and G-ResNeXt101R3 are re-trained under the same settings as G-ResNeXt101. After training, the three models reach very similar top-1/top-5 accuracy even though they use different random seeds for initialization, which shows that DGConv is robust to randomness.

Evaluation of the Learned Architecture. We further evaluate the architectures learned by DGConv. We replace the group number of each GConv layer in ResNeXt with the group numbers learned by G-ResNeXt, and the resulting models are trained on ImageNet from scratch (a minimal code sketch of this procedure is given after Table 6). Table 6 reports their performance. As we can see, the ResNeXt models with group numbers learned by DGConv achieve top-1 and top-5 accuracy comparable to G-ResNeXt and superior to the 32×4d baseline. The results demonstrate the strong representational power of the learned structures.

Architecture Settings top-1 top-5
ResNeXt50 32×4d 77.9 93.9
G-ResNeXt50 78.4 94.0
ResNeXt50* (groups learned by DGConv) 78.3 94.0
G-ResNeXt50 78.0 93.9
ResNeXt50* (groups learned by DGConv) 78.0 93.9
ResNeXt101 32×4d 79.1 94.2
G-ResNeXt101 79.9 94.7
ResNeXt101* (groups learned by DGConv) 79.8 94.7
G-ResNeXt101 79.5 94.5
ResNeXt101* (groups learned by DGConv) 79.5 94.5
Table 6: Performance of ResNeXt using the group numbers learned by DGConv, denoted by ResNeXt*. To demonstrate the effectiveness of the structures learned by DGConv, we simply replace the group numbers in ResNeXt50/101 with the numbers learned by the corresponding G-ResNeXt models.
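
A minimal sketch of this architecture transfer is shown below (our naming and attribute layout are hypothetical): the per-layer group number is read off the learned relationship matrix, using the fact that the row sum of $U$ equals $C/G$, and is then used to build an ordinary grouped convolution.

```python
# Replace a DGConv-style layer by a plain grouped Conv2d using its learned U.
import torch.nn as nn

def freeze_to_gconv(in_channels, out_channels, kernel_size, U, stride=1, padding=1):
    """U: learned binary (out_channels, in_channels) relationship matrix."""
    channels_per_group = int(U[0].sum().item())   # row sum of U equals C / G
    groups = in_channels // channels_per_group
    return nn.Conv2d(in_channels, out_channels, kernel_size,
                     stride=stride, padding=padding, groups=groups, bias=False)
```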

5 Conclusion

In this work, we propose a novel architecture, Groupable ConvNet (GroupNet), for computational efficiency and improved performance. GroupNet is able to differentiably learn the grouping strategy of its convolution operations on a layer-by-layer basis. We have demonstrated that GroupNet outperforms ResNet and ResNeXt in terms of both accuracy and computational complexity. To build GroupNet, we develop dynamic grouping convolution (DGConv), which provides a unified representation for convolution operations. DGConv can be easily plugged into any deep network model and is expected to learn better feature representations for convolution layers.

6 Acknowledgement

This work is supported in part by SenseTime Group Limited, and in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14213616.

References