1 Introduction
Convolutional Neural Networks (ConvNets) have achieved remarkable successes in computer vision. For example, ResNet [7] was a pioneering work on building very deep networks with shortcut connections. This strategy exposes network depth as an essential dimension for ConvNets to achieve good performance. Beyond tailoring network architectures in depth and width [28, 11, 27, 29], ResNeXt [33] proposed a new dimension, “cardinality”, utilizing group convolution to design effective and efficient ConvNets. The main hallmark of group convolution is that it is compact and parameter-saving, which allows ResNeXt to improve accuracy while reducing network parameters, outperforming its counterpart ResNet.
Although group convolution is easy to implement, applying it in previous networks such as ResNeXt still has drawbacks.
First, when designing network architectures with group convolutions, the number of groups for each hidden layer has typically been treated as a hyperparameter. The group number is often defined by human experts and kept the same for all hidden layers of a ConvNet. Second, previous work employed homogeneous group convolutions, leading to suboptimal solutions. For instance, one of the most practical settings of ResNeXt is “32x4d”, which applies group convolution with 32 groups and was found by trial and error. However, convolution layers at different depths of a ConvNet typically learn different visual features, which represent different abstractions and semantic meanings. Thus, uniformly reducing model parameters via group convolutions may decrease performance.
To address the above issues, this work introduces an autonomous formulation of group convolution, named Dynamic Grouping Convolution (DGConv), which generalizes many convolution operations and has the following appealing properties. (1) Dynamic grouping. The core of DGConv is to train the convolution kernels and the grouping strategy simultaneously. As shown in Fig. 1, DGConv is able to learn the grouping strategy (the group number and the connections between channels in a group) during training. In this way, each DGConv layer can have an individual grouping strategy. Moreover, by imposing a regularization term on computational complexity, we can control the overall model size and computational overhead. (2) Differentiability. The learning of DGConv is fully differentiable, so it can be trained in an end-to-end manner by using stochastic gradient descent (SGD). Thus, DGConv is compatible with existing ConvNets. (3) Parameter-saving. The number of extra parameters needed to learn the grouping strategy in DGConv scales only as $\log_2 C$, where $C$ is the number of channels of a convolution layer. This is far fewer than the parameters of the convolution kernels, which scale as $C^2$ (the kernel connects $C^{in}$ input channels to $C^{out}$ output channels). Furthermore, the extra parameters can be discarded after training: in the testing stage, only the parameters of the convolution kernels are stored and loaded. Fig. 2 shows an example of the group numbers learned by DGConv, which achieves performance comparable to its counterpart while significantly reducing parameters and computations.
This work makes three key contributions. (1) We propose a novel convolution operation, Dynamic Grouping Convolution (DGConv), which is able to differentiably learn the number of groups for group convolution, unlike existing work that treated the group number as a hyperparameter. To our knowledge, this is the first work to learn the group number in a differentiable and data-driven way. (2) DGConv can replace previous convolutions to build state-of-the-art deep networks such as the proposed Groupable ResNeXt in Section 3.3, where the group number for each convolution layer is automatically determined during end-to-end training. (3) Extensive experiments demonstrate that Groupable ResNeXt is able to outperform both ResNet and ResNeXt while using a comparable or even smaller number of parameters. For example, it surpasses ResNeXt101 by 0.8% top-1 accuracy on ImageNet with slightly fewer parameters and computations. Moreover, we study the learning dynamics of group numbers and report several findings.
2 Related Work
Group Convolution. Group convolution (GConv) is a special case of sparsely connected convolution. In regular convolution, we produce $C^{out}$ output channels by applying convolution filters over all $C^{in}$ input channels, resulting in a computational cost of $C^{in} \times C^{out}$. In contrast, GConv reduces this cost by dividing the $C^{in}$ input channels into $G$ non-overlapping groups. After applying filters over each group, GConv generates $C^{out}$ output channels by concatenating the outputs of each group. GConv has a complexity of $\frac{C^{in} \times C^{out}}{G}$.
GConv was first discussed in AlexNet [12] as a model-distribution approach to handle memory limitations. ResNeXt [33] presented an additional dimension for network architecture, “cardinality”, by using GConv, leading to a series of further studies on applying group convolution in portable neural architecture design [35, 19, 36, 10]. In the extreme case, group convolution places each channel in its own group, which is known as depthwise convolution. It has been widely used in efficient neural architecture design [9, 19, 36, 25].
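The saving from grouping can be made concrete by counting weights. The sketch below is illustrative only: the helper `conv_weight_count` and the layer sizes (128 channels, 3x3 kernels) are assumptions chosen for the example, not values taken from this paper.

```python
def conv_weight_count(c_in, c_out, k, groups=1):
    """Number of weights in a k x k convolution with `groups` groups (no bias)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the group number")
    # Each group connects c_in/groups input channels to c_out/groups output channels.
    return groups * (c_in // groups) * (c_out // groups) * k * k

regular = conv_weight_count(128, 128, 3)                 # G = 1
grouped = conv_weight_count(128, 128, 3, groups=32)      # G = 32
depthwise = conv_weight_count(128, 128, 3, groups=128)   # G = C (depthwise)

print(regular, grouped, depthwise)  # 147456 4608 1152
```

The grouped count is exactly the regular count divided by the group number, and the depthwise case is the limit in which the group number equals the channel count.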
Moreover, CondenseNet [10] and FLGC [31] learned the connections of group convolution, but the number of groups is still a predefined hyperparameter. CondenseNet and FLGC treated connection learning as a pruning problem, where unimportant filters are removed. In contrast, DGConv learns both the group number and the channel connections of each group.
Neural Architecture Search. Recently, there has been growing interest in automating the design process of neural architectures, usually referred to as Neural Architecture Search (NAS) and AutoML. For example, NASNet [38, 37] and MetaQNN [1] led the trend of architecture search by using reinforcement learning (RL). In NASNet, the network architecture is decomposed into repeatable and transferable blocks, such that the control parameters of the architectures can be limited to a finite search space. The sequence of these architecture parameters is generated by a controller RNN, which is trained by maximizing rewards (validation accuracy). These methods were extended in many ways, such as progressive searching [13], parameter sharing [21], network transformation [3], resource-constrained searching [30], and differentiable searching like DARTS [15] and SNAS [34]. Evolutionary algorithms are an alternative to RL, where architectures are searched by mutating the best architectures found so far [23, 24, 32, 20, 14]. However, all the above methods either treated the group number as a hyperparameter or searched its value by using sampling methods such as RL. In contrast, DGConv is the first model that can optimize the group number in a data-driven, differentiable, end-to-end manner together with the network parameters.
3 Our Approach
3.1 Dynamic Grouping Convolution (DGConv)
We first present conventional convolution and group convolution, and then introduce DGConv.
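Regular Convolution. Let a feature map of a ConvNet be $F \in \mathbb{R}^{N \times C^{in} \times H \times W}$, where $N$, $C^{in}$, $H$, and $W$ represent the number of samples in a minibatch, the number of channels, and the height and width of a channel, respectively. If a regular convolution is applied on $F$ with kernel size $k \times k$ and stride 1 with padding, the output feature map is denoted as $O \in \mathbb{R}^{N \times C^{out} \times H \times W}$, where every output unit is
$$o_{ij} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} f_{i+m,\,j+n}\, w_{m,n}, \tag{1}$$
where $o_{ij} \in \mathbb{R}^{C^{out}}$, $f_{i+m,j+n} \in \mathbb{R}^{C^{in}}$, and $f$ represents the hidden units of the input feature map $F$. And $w_{m,n} \in \mathbb{R}^{C^{in} \times C^{out}}$ represents the convolution weights (kernels).
Group Convolution. Group convolution (GConv) can be defined as a regular convolution with sparse kernels. GConv is often implemented as a concatenation of separate convolutions over grouped channels,
$$o_{ij} = \left[\, o^{1}_{ij},\ o^{2}_{ij},\ \dots,\ o^{G}_{ij} \,\right], \qquad o^{t}_{ij} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} f^{t}_{i+m,\,j+n}\, w^{t}_{m,n}, \tag{2}$$
where $G$ is the group number, $t \in \{1, \dots, G\}$, and $[\cdot]$ means the concatenation operation. In the context of GConv, we have $f^{t}_{i+m,j+n} \in \mathbb{R}^{C^{in}/G}$ and $w^{t}_{m,n} \in \mathbb{R}^{(C^{in}/G) \times (C^{out}/G)}$. In the extreme case, when every channel is a group ($G = C^{in} = C^{out}$), Eqn.(2) expresses the depthwise convolution [9, 25, 19, 36]. Both GConv and depthwise convolution reduce computational resources and can be efficiently implemented in existing deep learning libraries. However, the intrinsic hyperparameter $G$ is manually designed, which may leave performance far from optimal.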
Dynamic Grouping Convolution. Dynamic grouping convolution (DGConv) extends group convolution by making it possible to learn grouping strategies, that is, the group number and the channel connections of each group. The strategies can be modeled by a binary relationship matrix $U \in \{0,1\}^{C^{in} \times C^{out}}$. DGConv can be defined as
$$o_{ij} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} f_{i+m,\,j+n}\, (U \odot w_{m,n}), \tag{3}$$
where $\odot$ denotes the element-wise product. It is noteworthy that Eqn.(3) has rich representation capacity: many convolution operations can be treated as special cases of DGConv. To build some intuition for the flexibility of DGConv, several illustrative examples are presented in the following:
(1) Let $U = \mathbf{1}$, where $\mathbf{1}$ is a matrix of ones. Since we have $U \odot w_{m,n} = w_{m,n}$, DGConv represents a regular convolution, as shown in Fig. 3 (a).
(2) Let $U = I$, where $I$ is an identity matrix. Then $U \odot w_{m,n}$ keeps the diagonal elements of $w_{m,n}$ while the off-diagonal elements are zeros, as depicted in Fig. 3 (b), implying that every channel is independent. Thus, DGConv becomes a depthwise convolution [9].
(3) If $U$ is a binary block-diagonal matrix as shown in Fig. 3 (d), then $U$ divides channels into groups. Since all diagonal blocks of $U$ are constant matrices of ones, DGConv expresses a conventional group convolution (GConv), which groups adjacent channels together.
(4) If $U$ is an arbitrary binary matrix such as Fig. 3 (f), this leads to unstructured convolution.
Therefore, by appropriately constructing the binary relationship matrix $U$, the proposed DGConv can represent a large variety of convolution operations.
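These special cases can be checked numerically on a toy layer. The sketch below assumes a 1x1 kernel evaluated at a single spatial position, so the convolution reduces to a vector-matrix product with a masked weight matrix; the helper names (`masked_matvec`, `block_diagonal`) and the weights are hypothetical and chosen only for illustration.

```python
def masked_matvec(f, w, u):
    """Output of a 1x1 DGConv at one position: o = f (U * w), with * element-wise."""
    c_in, c_out = len(w), len(w[0])
    return [sum(f[i] * u[i][j] * w[i][j] for i in range(c_in)) for j in range(c_out)]

def ones(c):
    return [[1] * c for _ in range(c)]

def identity(c):
    return [[int(i == j) for j in range(c)] for i in range(c)]

def block_diagonal(c, groups):
    size = c // groups
    return [[int(i // size == j // size) for j in range(c)] for i in range(c)]

C = 4
f = [1.0, 2.0, 3.0, 4.0]                                          # input channels
w = [[float(i * C + j + 1) for j in range(C)] for i in range(C)]  # arbitrary weights

dense = masked_matvec(f, w, ones(C))                 # U = ones: regular convolution
depthwise = masked_matvec(f, w, identity(C))         # U = identity: depthwise
grouped = masked_matvec(f, w, block_diagonal(C, 2))  # U block-diagonal: 2 groups

# Group convolution computed the usual way: convolve each group, then concatenate.
concat = []
for g in range(2):
    rows = range(2 * g, 2 * g + 2)
    concat += [sum(f[i] * w[i][j] for i in rows) for j in rows]

print(grouped == concat)  # True
```

Masking a dense kernel with a block-diagonal matrix gives the same output as the concatenation form of group convolution, which is why a single mask can stand in for the grouping strategy.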
Discussions. Although DGConv as defined above has great potential to boost the learning capacity of CNNs due to its flexibility in representing convolutions, it also introduces some foreseeable difficulties.
First, since Stochastic Gradient Descent (SGD) can only optimize continuous variables, training a binary matrix directly with SGD is challenging. Second, the matrix $U$ introduces a large number of extra parameters into the convolution operation, making deep networks difficult to train. Third, updating the entire matrix $U$ without any constraint during training could learn an unstructured relationship matrix as illustrated in Fig. 3 (f). In this case, DGConv is not a valid GConv, making the learned convolution operation hard to interpret.
Therefore, for DGConv, a special construction of $U$ is required to maintain the group structure and reduce the number of extra parameters.
3.2 Construction of the Relationship Matrix
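Instead of directly learning the entire matrix $U$, we decompose it into a set of $K$ small matrices, $U_1, U_2, \dots, U_K$.
We see that each small matrix $U_k$ is of shape $C^{in}_k \times C^{out}_k$, where $C^{in}_k$ and $C^{out}_k$ are its numbers of rows and columns. Then we define $U$ as
$$U = U_1 \otimes U_2 \otimes \cdots \otimes U_K, \tag{4}$$
where $\otimes$ denotes the Kronecker product. Therefore, we have $C^{in} = \prod_{k=1}^{K} C^{in}_k$ and $C^{out} = \prod_{k=1}^{K} C^{out}_k$, implying that the $C^{in}$-by-$C^{out}$ large matrix $U$ is decomposed into a set of small submatrices by using a sequence of Kronecker products [2].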
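As a reminder of how a Kronecker product builds a large matrix from small ones, the block below writes out the standard definition for two 2-by-2 matrices and the resulting size of a chain of $K$ such factors. The symbols $A$, $B$, and $A_k$ are generic placeholders introduced here for illustration.

```latex
A \otimes B =
\begin{pmatrix}
  a_{11} B & a_{12} B \\
  a_{21} B & a_{22} B
\end{pmatrix}
\in \mathbb{R}^{4 \times 4},
\qquad
\underbrace{A_1 \otimes A_2 \otimes \cdots \otimes A_K}_{K \text{ factors, each } 2 \times 2}
\in \mathbb{R}^{2^K \times 2^K}.
```

Each additional 2-by-2 factor doubles both dimensions, so a matrix with $2^K$ rows and columns is described by only $K$ small factors.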
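Construction of Submatrix. Here we introduce how to construct each submatrix $U_k$. As an illustrative example, we suppose $C^{in} = C^{out} = C = 2^K$, which is a common setting in ResNet and ResNeXt, so that every $U_k$ is a 2-by-2 matrix. To pursue the most parameter-saving convolution operation, we further represent $U_k$ by a single binary variable $g_k$ as follows:
$$U_k = g_k \mathbf{1} + (1 - g_k) I, \qquad g = \mathrm{sign}(\tilde{g}), \tag{5}$$
where $\mathbf{1}$ denotes a 2-by-2 constant matrix of ones, $I$ denotes a 2-by-2 identity matrix, and $g_k$ indicates the $k$-th component of $g$. $\tilde{g} \in \mathbb{R}^{K}$ is a learnable gate vector taking continuous values, and $g \in \{0,1\}^{K}$ is a binary gate vector derived from $\tilde{g}$. Here $\mathrm{sign}(\cdot)$ represents a sign function,
$$\mathrm{sign}(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0. \end{cases} \tag{6}$$
By combining Eqn.(4) and Eqn.(5), $U$ can be written as
$$U = \left(g_1 \mathbf{1} + (1 - g_1) I\right) \otimes \cdots \otimes \left(g_K \mathbf{1} + (1 - g_K) I\right). \tag{7}$$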
Constructing the relationship matrix $U$ by Eqn.(7) not only remarkably reduces the number of parameters but also gives $U$ a group structure. First, note that the parameters to be optimized are the $K$ gates $\tilde{g}$; the above construction therefore reduces the number of parameters of $U$ from $C^2$ to $\log_2 C$. For example, if a convolution layer has 1024 channels, we can learn the block-diagonal matrix in Eqn.(7) by using merely 10 parameters, remarkably reducing the number of training parameters, which previously was more than $10^6$. Second, we see that $U$ constructed by Eqn.(7) is a symmetric matrix with diagonal elements of ones. Moreover, every row and every column of $U$ contains the same elements up to their order. Hence, $U$ has a group structure. For example, when $K = 3$, a gate vector with exactly one zero entry turns Eqn.(7) into an 8-by-8 matrix of 2 groups, as shown in Fig. 3 (e), and a gate vector with exactly two zero entries turns it into an 8-by-8 matrix of 4 groups, as shown in Fig. 3 (c). These examples show that the proposed DGConv can group nonadjacent channels. Fig. 3 (g) shows the dynamic process of the actual $U$ of a DGConv layer. It can be observed that the binary pattern of $g$ controls the group structure of $U$. Note that we use only 3 continuous parameters to produce $U$, which makes it possible to learn the large 8-by-8 matrix that originally needs 64 parameters to train. A more general case is discussed in Appendix A.
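The construction can be traced by hand for small $K$. The sketch below is a minimal pure-Python illustration, assuming that a gate value of 1 selects the all-ones factor and a gate value of 0 selects the identity factor; the function names and the example gate vectors are hypothetical choices, not taken from the figures.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

ONES = [[1, 1], [1, 1]]
EYE = [[1, 0], [0, 1]]

def relationship_matrix(gates):
    """Build U from binary gates: gate 1 -> all-ones factor, gate 0 -> identity factor."""
    u = [[1]]
    for g in gates:
        u = kron(u, ONES if g else EYE)
    return u

def groups(u):
    """Distinct sets of channels that are connected to each other (distinct rows of U)."""
    return sorted({tuple(j for j, v in enumerate(row) if v) for row in u})

print(groups(relationship_matrix([1, 1, 0])))  # [(0, 2, 4, 6), (1, 3, 5, 7)]
print(groups(relationship_matrix([0, 1, 0])))  # [(0, 2), (1, 3), (4, 6), (5, 7)]
print(groups(relationship_matrix([0, 1, 1])))  # [(0, 1, 2, 3), (4, 5, 6, 7)]
```

Under this assumption the number of groups is 2 raised to the number of zero gates, and the position of a zero gate decides whether the grouped channels are adjacent (last example) or interleaved (first two examples).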
Training Algorithm of DGConv. Here we introduce the training algorithm of DGConv. Note that every DGConv layer is trained in the same way, implying that it can be easily plugged into a deep ConvNet by replacing the traditional convolution operations.
The training of DGConv can be simply implemented in existing software platforms such as PyTorch and TensorFlow. To see this, note that DGConv is computed by combining Eqn.(3), (4), (5), and (6). All these equations define differentiable transformations except the sign function in Eqn.(6). Therefore, the gradients from the loss function can be propagated down to the binary gates in Eqn.(5) by simply using automatic differentiation (AD) in the above platforms. The only remaining issue is the sign function in Eqn.(6). The optimization of binary variables is well established in the literature [22, 18, 17, 26], and these techniques can also be used to train DGConv. The gate parameters are optimized with the Straight-Through Estimator, similar to recent network quantization approaches, whose convergence is analyzed in [5]. Furthermore, Appendix B provides the explicit gradient computations of DGConv, facilitating its implementation on platforms without automatic differentiation.
3.3 Groupable Residual Networks
DGConv is closely related to ResNet and ResNeXt, where ResNeXt extends ResNet by dividing channels into groups. DGConv can be also used with residual learning by simply replacing the traditional group convolutions of ResNeXt with the proposed dynamic grouping convolutions, as shown in Fig. 4. We name this new network architecture Groupable ResNeXt. Table 1 compares the architecture of GroupableResNeXt50 (GResNeXt50) to that of the original ResNeXt50.
Resourceconstrained Groupable Networks. Besides simply replacing convolution layers by using DGConv layers in a deep network, we also provide a resourceconstrained training scheme. Different DGConv layers can have different group numbers, such that how and where to reduce computations are totally dependent on training data and tasks.
Towards this end, we propose a regularization term denoted by $\zeta$ to constrain the computational complexity of GroupableResNeXt, where $\zeta$ is computed by
$$\zeta = \sum_{l=1}^{L} \sum_{i} \sum_{j} u^{l}_{ij}, \tag{8}$$
where $L$ denotes the number of DGConv layers and $u^{l}_{ij}$ denotes an element of $U^{l}$. It is seen that $\sum_{i} \sum_{j} u^{l}_{ij}$ represents the number of nonzero elements in $U^{l}$, measuring the number of activated convolution weights (kernels) of the $l$-th DGConv layer. Thus, $\zeta$ can be treated as a measurement of the model’s computational complexity.
In fact, it can be deduced from Eqn.(7) that the sum of each row or each column of $U$ can be calculated as $\prod_{k=1}^{K} (1 + g_k)$. Substituting it into Eqn.(8) gives us
$$\zeta = \sum_{l=1}^{L} C^{l} \prod_{k=1}^{K^{l}} \left(1 + g^{l}_{k}\right), \tag{9}$$
where $C^{l}$ and $K^{l}$ indicate $C$ and $K$ in the $l$-th layer, respectively. Here we assume $C^{in} = C^{out} = C$. Given a desired computational complexity for the entire network, our objective is to search for a deep model that
where is a weighted product to approximate the Pareto optimal problem [30] and is a constant value. We have if , implying that the complexity constraint is satisfied. Otherwise, is used to penalize the model complexity when . For the value of , [30] empirically set or and this setting works well in reinforcement learning by using rewards. However, these empirical values make the regularizer too sensitive in our problem. In our experiments, we have as a constant.
The above loss function can be optimized by using SGD. By setting the value of , we can learn deep neural networks under different complexity constraints, allowing us to carry on careful studies on the tradeoff between model accuracy and computational complexity.
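The complexity term can be sanity-checked by comparing a direct count of nonzero entries in each relationship matrix with a closed form of the kind described above (channel count times the product of one plus each gate). The sketch below assumes 2-by-2 Kronecker factors with gate 1 selecting the all-ones factor and gate 0 the identity; the three gate vectors are made-up examples, and the weighting of the penalty inside the loss is not reproduced here.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

ONES = [[1, 1], [1, 1]]
EYE = [[1, 0], [0, 1]]

def relationship_matrix(gates):
    """Gate 1 -> all-ones factor, gate 0 -> identity factor (assumed convention)."""
    u = [[1]]
    for g in gates:
        u = kron(u, ONES if g else EYE)
    return u

def nonzeros(u):
    """Direct count of active channel connections in U."""
    return sum(sum(row) for row in u)

def layer_complexity(gates):
    """Closed form: C * prod_k (1 + g_k), with C = 2**K channels."""
    row_sum = 1
    for g in gates:
        row_sum *= 1 + g
    return 2 ** len(gates) * row_sum

layers = [[1, 1, 0], [0, 1, 0], [1, 1, 1]]   # hypothetical gates of three layers
zeta = sum(layer_complexity(g) for g in layers)

print([nonzeros(relationship_matrix(g)) for g in layers])  # [32, 16, 64]
print(zeta)  # 112
```

Because the closed form needs only the gates, the complexity of a layer can be evaluated without ever materializing its relationship matrix.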
stage  output  ResNeXt5032x4d  GResNeXt50 

conv1  , 64, stride 2  , 64, stride 2  
maxpool  , stride 2  , stride 2  
conv2  
conv3  
conv4  
conv5 
4 Experiments
Implementation. We conduct experiments on the challenging ImageNet [4] benchmark, which has 1.2 million images for training and 50k images for validation. Following Section 3.3 and [33], we construct 50-layer and 101-layer Groupable ResNeXts. In the training stage, each input image is randomly cropped and randomly flipped horizontally. The overall batch size is 512, partitioned across 16 GPUs (32 samples per GPU). We train the networks by using SGD with momentum and weight decay. We adopt the cosine learning rate schedule [16] and the weight initialization of [6]. In the evaluation stage, the error is evaluated on a single center crop. For Groupable ConvNets, the continuous gates are the only extra parameters that need to be trained. We initialize them as small values or randomly.
Resource Constraint. In experiments, we derive the resource constraint by , where denotes a scale of complexity of the group convolution layers in the entire network. For an example, when , is equivalent to the number of parameters of all GConv layers in ResNeXt d, and represents the complexity of GConv layers in ResNeXt d. When , is complexity compared to the ResNeXt d, and so on. By setting , we are able to control the overall complexity of Groupable ConvNets.
Architecture  #Params  Top-1 Accuracy (%)
ResNet50  25 M  76.4 
InceptionV3  23 M  77.5 
IBNNet50a  25 M  77.5 
SEResNet50  28 M  77.7 
ResNeXt50  25 M  77.8 
DenseNet161(k=48)  29 M  77.8 
DenseNet264(k=32)  33 M  77.9 
GResNeXt50(b=32, ours)  25M  78.4 
ResNet101  44 M  78.0 
SEResNet101  48 M  78.4 
ResNeXt101  44 M  78.8 
DenseNet232 (k=48)  55 M  78.8 
GResNeXt101(b=32, ours)  43M  79.9 
Comparisons. We first evaluate the performance of GroupableResNeXt and its counterparts ResNet/ResNeXt. For fair comparison, we reimplement ResNet and ResNeXt under the settings of Section 4, achieving results comparable to the original papers (top-1 accuracy of ResNeXt101, 32x4d: 79.1% (ours) vs. 78.8% [33]). Table 2 shows the results, and Fig. 5 shows the learned group numbers. Although it maintains a module topology similar to ResNeXt, GroupableResNeXt learns layer-specific grouping strategies for group convolution. Compared to ResNet50 and ResNeXt50, GResNeXt50 obtains 1.5% / 0.5% higher top-1 accuracy. This trend is also observed in the deeper architectures ResNet101 and ResNeXt101, where the gains in top-1 accuracy grow to 1.7% and 0.8%.
Fig. 2 and Fig. 5 show the learned group numbers. Table 2 reports the performance of GResNeXt50 (b=32) and GResNeXt101 (b=32), which correspond to Fig. 5 (d) and Fig. 5 (a). Unlike ResNeXt, which uses a uniform group number, GResNeXt exhibits diverse group numbers. An interesting phenomenon is that different networks show a common pattern: when preserving the overall model complexity, DGConv tends to allocate more computation to lower layers. This is evidence that the representation ability of a ConvNet is highly related to the design of its lower layers.
Xie et al. [33] found that, when the network complexity is similar, networks with larger cardinality perform better than deeper or wider ones. The performance gain comes from stronger representations. We suggest that the representations could be made even stronger by adjusting the grouping strategy at each layer using DGConv.
Learning dynamics of DGConv. For every DGConv layer in GResNeXt50, we plot the learning procedure of the group numbers and the values of the gates in Fig. 6. We observe two characteristics of DGConv. First, different DGConv layers show different learning dynamics. Second, similar to Fig. 5, lower layers prefer fewer groups than higher layers. Fewer groups correspond to more parameters, which suggests that the lower layers are essential for extracting texture-related features.
Complexity Accuracy. The resource constraint allows us to learn optimal grouping strategies subject to a given model complexity threshold. We then explore the tradeoff between complexity of group convolution and model accuracy. Table 3 shows our results, where “FLOPs” denotes computational complexity of all group convolution layers in a network. We set the FLOPs of ResNeXt as baseline and show complexity of GroupableResNeXt by proportion. By modifying , we alter the constraint and learn GroupableResNeXt of various capacity. For example, when , is equivalent to the size of group convolutions with group number uniformly, and GroupableResNeXt will be regularized to choose group strategy less than ResNeXt’s complexity.
From Table 3, we see that GResNeXt50 achieves comparable top-1 accuracy with ResNeXt50 in the setting, and GResNeXt101 achieves comparable top-1 accuracy with ResNeXt101 in the setting. These results indicate that DGConv is able to learn a more efficient grouping strategy than regular GConv while preserving accuracy. Xie et al. [33] suggest that increasing cardinality yields stronger representations than increasing depth or width, and we learn dynamic grouping to further improve representation learning with large cardinality.
Furthermore, we also see the strong robustness of dynamic grouping convolution, even when the computational complexity of group convolution is significantly reduced. For example, when FLOPs decrease from to , GResNeXt101 is able to preserve its accuracy (about % top1 accuracy).
Architecture  Settings  GConv FLOPs  top-1 (%)  top-5 (%)
ResNeXt50  d  77.9  93.9  
GResNeXt50  78.4  94.0  
GResNeXt50  78.2  93.9  
GResNeXt50  78.0  93.9  
GResNeXt50  78.0  93.9  
GResNeXt50  77.8  93.8  
ResNeXt101  d  79.1  94.2  
GResNeXt101  79.9  94.7  
GResNeXt101  79.7  94.6  
GResNeXt101  79.8  94.7  
GResNeXt101  79.5  94.5  
GResNeXt101  79.4  94.5  
GResNeXt101  79.0  94.3 
Deeper or Wider Networks. Next we extend our experiments to more complex networks. We expand ResNet101 to complexity by increasing its width, depth, and cardinality respectively. When expanding on cardinality, we implement both the regular GConv and DGConv. Table 4 reports our results. The larger ResNet and ResNeXt are implemented by following [33, 8]. GResNeXt101 is constrained to the size of ResNeXt101 d. In Table 4, we see that increasing the model complexity consistently improves network performance (the original ResNet101 is %). Besides, increasing cardinality brings larger improvement than increasing the network depth and width (%/%/% %/%). Among the last three networks with larger cardinality, GResNeXt101 () outperforms corresponding ResNext101 () by % top1 accuracy. GResNeXt101 increases cardinality by using DGConv. We show that DGConv is superior to regular GConv even in more complex networks.
Architecture  Settings  Complexity  top-1 (%)  top-5 (%)
ResNet200 (depth)  d  ResNet101  78.6  94.1 
ResNet101 (wider [8])  d  ResNet101  78.8  94.4 
ResNeXt101 (card.)  d  ResNet101  79.8  94.7 
ResNeXt101 (card.)  d  ResNet101  79.6  94.6 
GResNeXt101 (card.)  ResNet101  80.1  94.7 
Reproducibility. We verify the reproducibility of DGConv. We retrain GResNeXt101 with the same training strategy and hyperparameters, but with different gate initializations, including random initialization with different random seeds. We name the retrained models “GResNeXt101R2” and “GResNeXt101R3”. Table 5 reports their performance. All models are trained with the same constraint and show comparable top-1 accuracy. These results indicate that DGConv consistently delivers strong representation ability. We also see that the learned models have similar performance with slightly different grouping strategies, showing the flexibility of DGConv. The detailed group number distributions can be seen in Appendix D.
Architecture  Settings  Params  top-1 (%)  top-5 (%)
GResNeXt101  79.9  94.7  
GResNeXt101R2  79.8  94.5  
GResNeXt101R3  79.6  94.5 
Evaluation of Learned Architecture. We extend our experiments to the architectures learned by DGConv. We replace the group number of each GConv layer in ResNeXt with the group number learned by GResNeXt. The resulting models are then trained on ImageNet from scratch. Table 6 reports their performance. As we can see, the ResNeXt models with group numbers learned by DGConv achieve top-1 and top-5 accuracy comparable to GResNeXt and superior to the 32x4d baseline. The results indicate that the learned structures themselves carry strong representation ability.
Architecture  Settings  top-1 (%)  top-5 (%)
ResNeXt50  d  77.9  93.9 
GResNeXt50  78.4  94.0  
ResNeXt50  learned by  78.3  94.0 
GResNeXt50  78.0  93.9  
ResNeXt50  learned by  78.0  93.9 
ResNeXt101  d  79.1  94.2 
GResNeXt101  79.9  94.7  
ResNeXt101  learned by  79.8  94.7 
GResNeXt101  79.5  94.5  
ResNeXt101  learned by  79.5  94.5 
5 Conclusion
In this work, we propose a novel architecture, Groupable ConvNet (GroupNet), for computational efficiency and improved performance. GroupNet is able to differentiably learn the grouping strategy of convolution operations on a layer-by-layer basis. Our experiments demonstrate that GroupNet outperforms ResNet and ResNeXt in terms of both accuracy and computational complexity. To build GroupNet, we develop dynamic grouping convolution (DGConv), which provides a unified representation for convolution operations. DGConv can be easily plugged into any deep network model and is expected to learn better feature representations for convolution layers.
6 Acknowledgement
This work is supported in part by SenseTime Group Limited, and in part by the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14213616.
References
 [1] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
 [2] Kim Batselier and Ngai Wong. A constructive arbitrary-degree Kronecker product decomposition of tensors. Numerical Linear Algebra with Applications, 24(5):e2097, 2017.
 [3] Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Reinforcement learning for architecture search by network transformation. arXiv preprint arXiv:1707.04873, 2017.
 [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
 [5] Yin Penghang et al. Understanding straight-through estimator in training activation quantized neural nets. 2019.
 [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
 [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
 [9] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [10] Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Condensenet: An efficient densenet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2752–2761, 2018.
 [11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [13] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, LiJia Li, Li FeiFei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
 [14] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436, 2017.
 [15] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 [16] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
 [17] Ping Luo, Ruimao Zhang, Jiamin Ren, Zhanglin Peng, and Jingyu Li. Switchable normalization for learningtonormalize deep representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

 [18] Ping Luo, Peng Zhanglin, Shao Wenqi, Zhang Ruimao, Ren Jiamin, and Wu Lingyun. Differentiable dynamic normalization for learning deep representation. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4203–4211, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
 [19] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
 [20] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Daniel Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, et al. Evolving deep neural networks. In Artificial Intelligence in the Age of Neural Networks and Brain Computing, pages 293–312. Elsevier, 2019.
 [21] Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
 [22] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnornet: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
 [23] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.

 [24] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Kurakin. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2902–2911. JMLR.org, 2017.
 [25] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
 [26] Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, and Ping Luo. Ssn: Learning sparse switchable normalization via sparsestmax. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 443–451, 2019.

 [27] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 [28] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [29] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
 [30] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. Mnasnet: Platformaware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.
 [31] Xijun Wang, Meina Kan, Shiguang Shan, and Xilin Chen. Fully learnable group convolution for acceleration of deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [32] Lingxi Xie and Alan Yuille. Genetic cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1379–1388, 2017.
 [33] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
 [34] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. Snas: stochastic neural architecture search. arXiv preprint arXiv:1812.09926, 2018.
 [35] Ting Zhang, GuoJun Qi, Bin Xiao, and Jingdong Wang. Interleaved group convolutions. In Proceedings of the IEEE International Conference on Computer Vision, pages 4373–4382, 2017.
 [36] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
 [37] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
 [38] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.