Greedy Network Enlarging

by   Chuanjian Liu, et al.
HUAWEI Technologies Co., Ltd.

Recent studies on deep convolutional neural networks present a simple paradigm of architecture design: models with more MACs, such as EfficientNet and RegNet, typically achieve better accuracy. These works enlarge all the stages in the model with one unified rule obtained by sampling and statistical methods. However, we observe that some network architectures have similar MACs and accuracies, but allocate their computations across stages quite differently. In this paper, we propose to enlarge the capacity of CNN models by improving their width, depth and resolution at the stage level. Under the assumption that the top-performing smaller CNNs are a proper subcomponent of the top-performing larger CNNs, we propose a greedy network enlarging method based on the reallocation of computations. By modifying the computations of different stages step by step, the enlarged network obtains an optimal allocation and utilization of MACs. On EfficientNet, our method consistently outperforms the original scaling method. In particular, applying our method to GhostNet, we achieve state-of-the-art 80.9% and 84.3% ImageNet top-1 accuracies under the setting of 600M and 4.4B MACs, respectively.



1 Introduction

Convolutional neural networks (CNNs) deliver state-of-the-art accuracy in many computer vision tasks such as image classification [18, 10], object detection [26] and image super-resolution [16]. Most deep CNNs are carefully designed with a predefined number of parameters and a predefined computational complexity. For example, ResNet [10] mainly comes in a few fixed-depth versions, such as ResNet18, ResNet50 and ResNet101. These CNNs have provided strong baselines for visual applications.

To improve the accuracy further, the most common way is to scale up the base CNN model. Three factors, namely depth, width and input resolution, heavily affect the model size. A number of works propose to scale models by depth [10, 29], width [37] or input image resolution [14]. These works consider only one of the three dimensions, which leads to imbalanced utilization of the computations, measured in multiply-accumulate operations (MACs). Simultaneously enlarging the width, depth and resolution provides a more flexible design space in which to find high-performance models. Recently, several works have focused on how to scale the three factors efficiently. EfficientNet [31] constructed a compound scaling formula to constrain the network width, depth and resolution. RegNet [23] studied the relationship between width and depth by exploring network design spaces. These methods use a unified principle to scale the whole model, but ignore the stage-wise differences.

Figure 1: ImageNet classification results of our method. The black dashed line is the original EfficientNet series. The blue dashed line is the searched S-EfficientNet and the blue solid line is S-EfficientNet with the relabel trick. The red circle and triangle show the performance of the GhostNet-based architectures.
Figure 2: MACs of different stages of CNN models. The left panel presents the MACs of the ResNet series. The right panel presents the MACs of the EfficientNet series.

Here we rethink the procedure of enlarging CNN models from the viewpoint of stage-wise allocation of computation resources. Modern CNNs usually consist of several stages, where one stage contains all layers with the same spatial size of feature maps. In Figure 2, we present the computations of different stages for ResNet [10] and EfficientNet [31]. Figure 2 (left) shows the discrepancy within the ResNet series: ResNet18 has balanced MACs across stages, while ResNet50 and ResNet101 put more MACs in the intermediate stages and few in the head and tail stages. Figure 2 (right) presents the allocation of MACs for EfficientNet-B0, EfficientNet-B2 and EfficientNet-B4. EfficientNet uses one unified scaling principle for network width, depth and resolution, so its different configurations show a similar distribution of MACs over stages: the later stages have far more MACs than the earlier and intermediate stages. A universal rule of computation allocation for different models is impractical. Neither a manual design nor a unified rule yields the optimal allocation of computations.

In this paper, we propose a network enlarging method based on a greedy search over the computations of each stage. In contrast to the conventional unified principle, the method performs a fine-grained search on the reallocation of computations. Given a baseline network, our goal is to enlarge it to the target MACs with the best configuration of depth, width and resolution in each stage. Under the assumption that the top-performing smaller CNNs are a proper subcomponent of the top-performing larger CNNs, we are able to enlarge CNNs step by step using the greedy network enlarging algorithm. In each iteration of the proposed algorithm, 1) a series of candidate networks is constructed by searching the width, depth and resolution of each stage under constrained MACs; 2) with a fast performance evaluation method, the architecture with the best performance in this iteration is appended to the baseline model pool for the next iteration. By gradually adding MACs at each iteration, we search for the optimal architecture until the target MACs are reached. Experimental results on the ImageNet classification task demonstrate the superiority of the proposed method. The searched network configurations largely boost the performance of existing base models. For example, the EfficientNet models searched by the proposed method outperform the original EfficientNets by a large margin.

2 Related Work

Manual Network Design.

In the early days after AlexNet [18], a large number of manually designed network architectures emerged. VGG [29] is the typical CNN architecture without any special connections, and deeper VGG nets achieve higher accuracy. However, convergence problems emerged for very deep networks. ResNet [10] introduced shortcuts and achieved higher accuracy with more layers. Besides deeper networks, wider networks are another direction. WideResNet [37] obtains higher accuracy by adding channels to each layer of ResNet. In addition, a number of light-weight networks have been proposed to meet the demands of mobile devices, including GoogLeNet [30], MobileNets [12, 27], ShuffleNets [38, 22] and GhostNet [9]. By setting one width scaling factor, the accuracy and MACs of MobileNets and GhostNet can be increased. The design pattern behind these networks was largely manual and focused on discovering new design choices that improve accuracy, e.g., the use of deeper or wider models or shortcuts.

Automatic Network Design.

Designing architectures by experts is time-consuming. Because of this, there is growing interest in automated neural architecture search (NAS) methods [6]. By now, NAS methods have outperformed manually designed architectures on tasks such as image classification [40, 24, 20, 21, 11], object detection [8, 35, 15, 33] and semantic segmentation [19, 28]. Generally, more MACs mean higher accuracy. Researchers have long known how to change the depth, width or resolution of models, but usually only one dimension is considered. EfficientNet [31] showed that it is critical to balance all dimensions of network width, depth and resolution, and proposed a simple yet effective compound scaling method based on results obtained by random sampling. RegNet [23] derived several patterns from a huge number of experiments: good networks have widths that increase with the stages, and the stage depths likewise tend to increase for the best models, although not necessarily in the last stage. These methods derive principles from small networks and use the resulting rule to obtain models of various sizes, even very large ones. In this paper, we use a greedy allocation of MACs to enlarge a model and obtain a specific architecture under constrained MACs. During the expansion, the width and depth of each stage and the input resolution are considered. Our intention is to maximize the utilization of MACs in the network.

3 Approach

In this section, we describe the proposed network enlarging method based on greedy allocation of MACs. First, we define the goal of our method: finding the optimal depth and width of each stage together with the input resolution. Second, we introduce the main algorithm of greedy network enlarging. Finally, we introduce how to efficiently evaluate the performance of candidate models.

Figure 3: The framework for adjusting input resolution, width and depth for each stage. The surrounding box out of the input image means candidate input resolution.

3.1 Problem Definition

Modern CNN backbone architectures usually consist of a stem layer, a network body and a head [23, 10, 31]. The main burden of MACs and parameters lies in the network body, as the stem is typically a convolutional layer and the head is a fully-connected layer. Thus, in this paper we focus on the scaling strategy of the network body. The network body consists of several stages [23], each defined as a sequence of layers or blocks with the same spatial size. For example, the ResNet50 [10] body is composed of 4 stages with 56×56, 28×28, 14×14 and 7×7 output sizes, respectively (for a 224×224 input).

Scaling up convolutional neural networks is widely used to achieve better accuracy. Network depth, width and input resolution are three key factors for model scaling. Deeper convolutional neural networks capture richer and more complex features, and usually perform better than shallow networks. With the help of shortcuts, very deep networks can be trained to convergence. However, the accuracy improvements become smaller as depth increases. Another direction is scaling the width of the network. More kernels mean that more fine-grained features can be learned. However, the MACs grow quadratically with the width. As a result, the network depth will be constrained and high-level features may be lost. EfficientNet [31] showed that the accuracy quickly saturates when networks become wider. Higher resolution provides richer fine-grained information, but a more powerful network is needed to exploit it: deeper and wider networks acquire larger receptive fields and capture fine-grained features.
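The cost of each scaling direction can be checked with a short calculation. The following is a minimal sketch, assuming a stage made of plain k×k convolutions whose input and output channel counts are equal; `conv_macs` and `stage_macs` are hypothetical helper names, not taken from the paper.

```python
def conv_macs(h, w, c_in, c_out, k=3):
    """MACs of one k x k convolution producing an h x w x c_out output."""
    return h * w * c_in * c_out * k * k


def stage_macs(resolution, width, depth, k=3):
    """MACs of `depth` identical conv layers with `width` channels (toy model)."""
    return depth * conv_macs(resolution, resolution, width, width, k)


base = stage_macs(resolution=56, width=64, depth=2)
deeper = stage_macs(resolution=56, width=64, depth=4)    # depth x2
wider = stage_macs(resolution=56, width=128, depth=2)    # width x2
larger = stage_macs(resolution=112, width=64, depth=2)   # resolution x2

print(deeper / base)  # 2.0
print(wider / base)   # 4.0
print(larger / base)  # 4.0
```

Under this toy model, doubling depth doubles the MACs, while doubling width or resolution quadruples them, which is why the three dimensions compete for the same budget.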

As a result, the network depth, width and resolution are not independent. These three dimensions can be combined in many ways, and one unified principle cannot yield the best configuration for all tasks. In this paper, we decompose the network depth, width and resolution into per-stage depth, per-stage width and input resolution. This aims to maximize the utilization of computations for each stage and for the whole network.

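Given a base network with $L$ stages, the widths and depths are $w = (w_1, \dots, w_L)$ and $d = (d_1, \dots, d_L)$, and the input resolution is $r$. The objective is to acquire the network architecture with the best performance by optimizing the allocation of the target MACs $T$:

$$d^*, w^*, r^* = \arg\max_{d, w, r} \mathrm{ACC}_{val}\big(\mathcal{N}(d, w, r; \theta)\big) \quad \text{s.t.} \quad \big|\mathrm{MACs}(\mathcal{N}(d, w, r)) - T\big| \le \delta \cdot T \qquad (1)$$

where $\theta$ denotes the trainable parameters of the network $\mathcal{N}$, $\mathrm{ACC}_{val}$ denotes the validation accuracy, $T$ is the target threshold of MACs and $\delta$ is used to control the difference between the MACs of the searched model and the target MACs.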
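The objective above amounts to picking the most accurate configuration among those whose MACs stay close to the target. A minimal sketch follows, assuming the tolerance is relative to the target MACs; the function names and the candidate values are hypothetical and only illustrate the selection rule.

```python
def feasible(macs, target, delta):
    """True if the MACs deviate from the target by at most delta * target (assumed form)."""
    return abs(macs - target) <= delta * target


def select_best(candidates, target, delta):
    """candidates: list of (name, macs, validation_accuracy) tuples."""
    allowed = [c for c in candidates if feasible(c[1], target, delta)]
    return max(allowed, key=lambda c: c[2]) if allowed else None


# Hypothetical candidates, not results from the paper.
candidates = [
    ("a", 580_000_000, 79.0),
    ("b", 610_000_000, 79.6),
    ("c", 750_000_000, 80.3),  # most accurate, but outside the MACs tolerance
]
best = select_best(candidates, target=600_000_000, delta=0.05)
print(best[0])  # b
```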

Search space. We consider the combinations of input resolution, width and depth of each stage. Suppose a base network has $L$ stages with width and depth configuration $c = (w_1, w_2, \dots, w_L, d_1, d_2, \dots, d_L)$ and input resolution $r$. In each iteration, the growth rates for width, depth and resolution are $s_w$, $s_d$ and $s_r$, respectively. Under the constrained target MACs, we enlarge the width and depth of each stage and the network input resolution step by step. The number of combinations grows exponentially with the number of stages, since every stage independently chooses both its width and its depth; ResNet18, for example, contains 4 stages, so even a modest upper bound and growth rate for depth and width yield a very large number of combinations before the variation of input size is considered.
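To give a feel for the size of this search space, the sketch below counts configurations when every stage independently picks one width option and one depth option. The number of options per dimension is a hypothetical value, since it depends on the chosen upper bound and growth rate.

```python
from itertools import product


def count_configurations(num_stages, options_per_dimension):
    """Each stage independently picks one width option and one depth option."""
    return options_per_dimension ** (2 * num_stages)


# Brute-force check on a tiny space: 2 stages, 3 options for each width and depth.
options = [1.0, 1.5, 2.0]  # hypothetical multipliers of the base width/depth
enumerated = list(product(options, repeat=2 * 2))
assert len(enumerated) == count_configurations(2, 3)

# A 4-stage body with 10 hypothetical options per dimension.
print(count_configurations(4, 10))  # 100000000
```

Exhaustively training this many networks is out of reach, which motivates the greedy strategy.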

3.2 Algorithm of Greedy Network Enlarging

Figure 3 presents our framework. Our intention is to find the optimal allocation of computations by enlarging the depth and width of each stage and the input resolution under constrained computations, so as to maximize the utilization of MACs, as shown in Eq. 1. For each stage $i$ in the network, we try to find its optimal depth $d_i$ and width $w_i$. The optimal input resolution $r$ is searched to match the specialized width and depth. In problem (1), $d$, $w$ and $r$ take discrete values with a massive number of combinations. Deep learning is both time and resource consuming. Due to this extreme complexity, traditional mathematical optimization methods are impracticable, so we turn to an efficient neural architecture search method.

To reduce the search complexity, we first introduce an assumption. Finding the globally optimal model is difficult in such a massive search space, so we relax the target to finding a top-performing configuration at the target MACs. We assume that the top-performing smaller CNNs are a proper subcomponent of the top-performing larger CNNs, as stated in Assumption 1. ResNet [10], VGG [29], EfficientNet [31], etc., fit this assumption well. This assumption enables an efficient search algorithm via greedy network enlarging.

Assumption 1

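Given an optimal network with MACs $T$, depth $d = (d_1, \dots, d_L)$, width $w = (w_1, \dots, w_L)$ and resolution $r$, there exists at least one top-performing network with MACs $T' < T$, depth $d'$, width $w'$ and resolution $r'$ that satisfies

$$d'_i \le d_i, \quad w'_i \le w_i \quad (i = 1, \dots, L), \qquad r' \le r.$$
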
With the above assumption, we transform the optimal network architecture search problem into a series of interrelated single-stage optimal sub-network architecture search problems, and then solve them one by one. Decisions need to be made at each stage to optimize the process. The selection of decisions at each stage depends only on the current state (here, the current state refers to the resolutions, widths and depths of the current stage). When the decision of each stage is determined, a decision sequence is formed, which determines the final solution. The overall algorithm is illustrated in Algorithm 1.

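Result: Configurations with target MACs $T$.
Initialization: base network with MACs $T_0$ and $L$ stages: width $w = (w_1, w_2, \dots, w_L)$, depth $d = (d_1, d_2, \dots, d_L)$ and input resolution $r$; target MACs $T$ of the output network and error rate $\delta$; number of search iterations $N$; set of optimal sub-configurations $S$, initialized with the base configuration; growth rate of resolution $s_r$;
while the target MACs $T$ have not been reached do
       set the current target MACs (exponential increment); empty the current candidate set;
       for each configuration in $S$ do
             enlarge the input resolution step by step with growth rate $s_r$ under the current target MACs;
       end for
       for each resolution setting do
             for each stage do
                   collect candidate widths and depths as in Algorithm 2;
             end for
       end for
       for each candidate do
             estimate its performance as stated in Sec. 3.3;
             append the result;
       end for
       append the best candidate of this iteration to $S$;
end while
Algorithm 1 Greedy network enlarging.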

In the algorithm, we use an exponential increment of MACs during the search. This makes the changes to the network more gentle than a uniform increment would. In each iteration of Algorithm 1, in order to find the locally optimal architecture configuration, we have to search and evaluate the candidate architectures. This step has two targets: the first is to find the candidate architectures under a limited increase of MACs; the second is to find the locally optimal architecture with maximum validation accuracy. When acquiring candidates, we consider the increase of resolution separately, which reduces the number of candidates. The width and depth of each stage are then increased on the basis of the corresponding resolution.
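The overall loop can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: the MACs model is a toy formula, candidates are generated by single growth steps rather than by Algorithm 2, and `toy_score` merely stands in for the proxy-task accuracy of Sec. 3.3. All function names and numeric values are assumptions made for illustration.

```python
import math


def macs(cfg):
    """Toy MACs model (an assumption): stage i costs depth * width^2 * (res / 2^i)^2."""
    widths, depths, res = cfg
    return sum(d * w * w * (res // 2 ** i) ** 2
               for i, (w, d) in enumerate(zip(widths, depths)))


def mac_targets(base_macs, target_macs, steps):
    """Exponentially spaced intermediate MACs targets, ending exactly at target_macs."""
    ratio = (target_macs / base_macs) ** (1.0 / steps)
    targets = [base_macs * ratio ** k for k in range(1, steps)]
    return targets + [target_macs]


def one_step_candidates(cfg, budget, s_w=4, s_d=1, s_r=8):
    """Enlarge the resolution, or one stage's width or depth, by one growth step."""
    widths, depths, res = cfg
    out = [(widths, depths, res + s_r)]
    for i in range(len(widths)):
        out.append((widths[:i] + (widths[i] + s_w,) + widths[i + 1:], depths, res))
        out.append((widths, depths[:i] + (depths[i] + s_d,) + depths[i + 1:], res))
    return [c for c in out if macs(c) <= budget]


def greedy_enlarge(base, target_macs, steps, evaluate):
    """Keep the best-scoring enlargement at every step, under a growing MACs budget."""
    best = base
    for budget in mac_targets(macs(base), target_macs, steps):
        candidates = one_step_candidates(best, budget)
        while candidates:
            best = max(candidates, key=evaluate)
            candidates = one_step_candidates(best, budget)
    return best


def toy_score(cfg):
    """Placeholder for the proxy-task accuracy (purely illustrative)."""
    widths, depths, res = cfg
    return sum(math.log(w) + math.log(d) for w, d in zip(widths, depths)) + math.log(res)


base = ((8, 16, 32), (1, 1, 1), 32)  # (widths, depths, input resolution)
enlarged = greedy_enlarge(base, 4 * macs(base), steps=4, evaluate=toy_score)
print(enlarged, macs(enlarged))
```

Because every accepted step only grows a width, a depth or the resolution, the base network remains a subcomponent of the enlarged one, in the spirit of Assumption 1.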

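Result: Candidates with collected width and depth.
Initialization: target MACs $T$, configuration $c$, ratio set $P$; growth rates of depth and width $s_d$ and $s_w$, respectively; current stage $i$;
for each ratio in $P$ do
       increase the depth of stage $i$ step by step with growth rate $s_d$ while its share of the added MACs allows;
       increase the width of stage $i$ step by step with growth rate $s_w$ while the remaining MACs allow;
       if the resulting configuration meets the target MACs, add it to the candidates;
end for
Algorithm 2 Proportional collection of width and depth.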

In order to reduce the number of searched candidates, we use a proportional control factor to divide the MACs between depth and width in each stage. Specifically, the ratios of MACs between depth and width are taken from one predefined set. Under this setting, we search the depth first and then the width for each stage. The procedure is given in Algorithm 2.
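One possible reading of this proportional collection is sketched below: for each ratio, depth receives that share of the added MACs, width takes what remains, and the result is kept only if it lands near the target. The toy MACs model, the interpretation of the ratio, the tolerance check and all numeric values are assumptions, not details confirmed by the paper.

```python
def macs(widths, depths, res):
    """Toy MACs model (an assumption): stage i costs depth * width^2 * (res / 2^i)^2."""
    return sum(d * w * w * (res // 2 ** i) ** 2
               for i, (w, d) in enumerate(zip(widths, depths)))


def collect(widths, depths, res, stage, target, ratios, s_d=1, s_w=4, delta=0.2):
    """Collect (widths, depths) candidates for one stage: depth first, then width."""
    start = macs(widths, depths, res)
    found = []
    for p in ratios:
        w, d = list(widths), list(depths)
        depth_budget = start + p * (target - start)  # share of added MACs for depth
        # Grow depth while the next step stays within the depth share.
        while True:
            d[stage] += s_d
            if macs(w, d, res) > depth_budget:
                d[stage] -= s_d
                break
        # Spend the remaining MACs on width.
        while True:
            w[stage] += s_w
            if macs(w, d, res) > target:
                w[stage] -= s_w
                break
        # Keep the candidate only if it lands close enough to the target.
        if macs(w, d, res) >= (1 - delta) * target:
            found.append((tuple(w), tuple(d)))
    return found


found = collect(widths=(8, 16, 32), depths=(1, 1, 1), res=32,
                stage=1, target=400_000, ratios=(0.25, 0.5, 0.75))
for w, d in found:
    print(w, d, macs(w, d, 32))
```

A small ratio leaves most of the added MACs to width, while a large ratio favours depth, so one stage yields only a handful of candidates instead of every width and depth pair.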

3.3 Performance Estimation

To guide the search process, we have to estimate the performance of a given architecture. The most accurate method is to train the candidates on the whole training data and evaluate their performance on the validation data. However, this requires computation on the order of thousands of GPU days. Developing methods that speed up performance estimation is therefore crucial.

We turn to proxy tasks to estimate performance, including shorter training times [23], training on a subset of the data [17], training on proxy data [40], or using fewer filters per layer and fewer cells [25]. While these low-fidelity approximations reduce the cost, they also introduce bias into the performance estimation. Proxy data and simplified architectures deviate strongly from the original task, which leads to poor rank preservation.

In this section, we determine the optimal proxy task for performance estimation with empirical experiments. Firstly, we get the proxy sub-dataset by evaluating the performance of different sub-datasets. Secondly, the hyper-parameters of training are acquired with parameter search. Spearman’s rank correlation coefficient is a non-parametric measure of rank correlation, which is used as the measure of proxy task.
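For reference, Spearman's coefficient is the Pearson correlation computed on ranks. The sketch below is a minimal pure-Python version with average ranks for ties; the accuracy values are hypothetical and serve only to show how a proxy task would be scored against full training.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical accuracies of four candidate networks.
proxy_acc = [40.1, 42.3, 41.0, 45.2]  # short proxy-task training
full_acc = [76.0, 77.5, 77.9, 79.0]   # full ImageNet training
rho = spearman(proxy_acc, full_acc)
print(round(rho, 3))  # 0.8
```

A value close to 1 means the proxy task ranks candidates in nearly the same order as full training, which is what the greedy search relies on.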

For the proxy sub-dataset, we create two sub-datasets, ImageNet1000-100 and ImageNet100-500, by randomly selecting images from ImageNet. To evaluate these datasets, network architectures with different widths, depths and input sizes are generated on the basis of EfficientNet-B0. We train all the networks and EfficientNet-B0 on the whole ImageNet train set, and the Top-1 accuracies on the validation set are used as the reference. On each sub-dataset, we finetune the networks for different numbers of epochs and also train them from scratch for a few epochs. The average Spearman value on ImageNet1000-100 is higher than that on ImageNet100-500, so we choose ImageNet1000-100 as the proxy sub-dataset. More details are presented in the supplementary materials.

After determining the proxy dataset, we try to improve the correlation between the proxy task and the original task by searching the training hyper-parameters. Network architectures with different widths, depths and input sizes are again generated on the basis of EfficientNet-B0. We train all of the networks on the whole ImageNet train set, and the Top-1 accuracies on the validation set are used as the reference. Two pretrained EfficientNet-B0 models are provided, one trained on ImageNet and one on ImageNet1000-100. The learning rate, the mode of learning rate decay and the number of training epochs are considered. Among these hyper-parameter combinations, the two best Spearman values both indicate a moderate positive correlation. Both settings use cosine decay with the same initial learning rate and number of training epochs. The difference is that the first uses the ImageNet1000-100 pretrained model and the second uses the ImageNet pretrained model. More details are presented in the supplementary materials. Figure 4 presents the consistency of different networks. In the next section, we adopt the first setting, i.e., finetuning with cosine decay from the ImageNet1000-100 pretrained model.

Figure 4: Correlation between the proxy task and the original task. The blue line is the target. The red line is trained on the basis of the ImageNet1000-100 pretrained model and the cyan line on the basis of the ImageNet pretrained model. For comparison purposes, the accuracies of the red and cyan lines are manually shifted upward.

4 Experiments

In this section, we evaluate the greedy network enlarging method on the general image classification dataset ImageNet [5]. We demonstrate that the method achieves state-of-the-art accuracy at similar MACs.

4.1 Datasets, Networks and Experimental Settings

We evaluate our method on the popular classification dataset ImageNet (ILSVRC2012) [5], which contains about 1.28M training images in 1000 categories and 50K validation images. To speed up the search on ImageNet, we create the proxy dataset ImageNet1000-100, whose train and validation images are randomly sampled from the ImageNet train set. Two baseline networks are considered: EfficientNet [31] and improved GhostNet [9].

To accelerate the search process, we fix the growth rates of resolution and depth, and use one growth rate of width for small models and another for large models. The ratios of MACs between depth and width are taken from one fixed set, and an error rate bounds the deviation from the target MACs. We use exponential growth of MACs and set different numbers of search iterations for small and large models. The finetuning method comes from the function-preserving algorithm [3].

After the search is completed, we retrain the acquired network architecture on the whole ImageNet from scratch. The training setting follows timm [34] (under its license) and EfficientNet [31]: RMSProp optimizer with momentum 0.9; weight decay 1e-5; multi-step learning rate with warmup, with an initial learning rate of 0.064 that decays by 0.97 every 2.4 epochs. Moving average of weights, dropblock [7], random erasing [39] and random augment [4] are used.
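The learning rate schedule described above can be written as a small function. This is a sketch under assumptions: the warmup length is not given in the text, so `warmup_epochs=3` is a hypothetical value, and counting decay steps from epoch 0 is one of several reasonable conventions.

```python
def learning_rate(epoch, base_lr=0.064, decay=0.97, decay_epochs=2.4, warmup_epochs=3):
    """Linear warmup followed by a staircase decay of `decay` every `decay_epochs`."""
    if epoch < warmup_epochs:
        # Warmup length is an assumed value, not stated in the text.
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * decay ** int(epoch / decay_epochs)


for epoch in (0, 1, 2, 5, 50, 100):
    print(epoch, learning_rate(epoch))
```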

ImageNet contains noisy labels, and crop augmentation introduces further noisy inputs and labels. To mitigate this, we use the relabel method [36] to obtain higher accuracy.

4.2 ImageNet Results and Analysis

For EfficientNet, we take EfficientNet-B0 as the baseline and search for models with MACs similar to EfficientNet-B1 to B4. Besides, we enlarge GhostNet both with the scaling principle of EfficientNet and with the greedy search method. For GhostNet, we add a Squeeze-and-Excitation [13] module to each block. Table 1 shows the main results and the comparison with other networks. The searched models are marked with 'S-'.

GhostNet-B1 and GhostNet-B4 in Table 1 are obtained by the compound scaling rule of EfficientNet. Their performance is lower than that of the greedily searched models. This suggests that the rule derived for EfficientNet does not fit GhostNet: new networks would need to be resampled and optimized to obtain suitable rules. Besides, the compound scaling principle ignores the differences between stages, which prevents fine-grained adjustment.

Model Top-1 Acc. #Params #MACs Ratio-to-EfficientNet
EfficientNet-B0 [31] 77.1% 5.3M 0.39B 1x
GhostNet [9] 73.9% 5.2M 0.14B 0.36x
EfficientNet-B1 [31] 79.1% 7.8M 0.69B 1x
ResNet-RS-50 [1] 78.8% 36M 4.6B 6.7x
REGNETY-800MF [23] 76.3% 6.3M 0.8B 1.16x
S-EfficientNet-B1 79.91% 8.8M 0.68B 1x
S-EfficientNet-B1-re 80.71% 8.8M 0.68B 1x
GhostNet-B1 79.13% 13.3M 0.59B 0.85x
S-GhostNet-B1 80.08% 16.2M 0.67B 1x
S-GhostNet-B1-re 80.87% 16.2M 0.67B 1x
EfficientNet-B2 [31] 80.1% 9.1M 0.99B 1x
REGNETY-1.6GF [23] 78.0% 11.2M 1.6B 1.6x
S-EfficientNet-B2 80.92% 9.3M 1.0B 1x
S-EfficientNet-B2-re 81.58% 9.3M 1.0B 1x
EfficientNet-B3 [31] 81.6% 12.2M 1.83B 1x
ResNet-RS-101 [1] 81.2% 64M 12B 6.6x
REGNETY-4.0GF [23] 79.4% 20.6M 4.0B 2.18x
S-EfficientNet-B3 81.98% 12.3M 1.88B 1x
S-EfficientNet-B3-re 82.87% 12.3M 1.88B 1x
EfficientNet-B4 [31] 82.9% 19.3M 4.39B 1x
REGNETY-8.0GF [23] 79.9% 39.2M 8.0B 1.82x
NFNet-F0 [2] 83.6% 71.5M 12.4B 2.8x
ResNet-RS-152 [1] 83.0% 87M 31B 7.1x
EfficientNetV2-S [32] 83.9% 24M 8.8B 2.0x
S-EfficientNet-B4 83.0% 17.0M 4.34B 1x
S-EfficientNet-B4-re 84.0% 17.0M 4.34B 1x
GhostNet-B4 82.78% 36.1M 4.39B 1x
S-GhostNet-B4 83.2% 32.9M 4.37B 1x
S-GhostNet-B4-re 84.3% 32.9M 4.37B 1x
Table 1: Performance of searched architectures on ImageNet. '-re' means the model is trained with the relabel trick [36]. Our results are the rows marked with 'S-'.

In Table 1, the Top-1 accuracies of all searched architectures exceed those obtained with the compound scaling of EfficientNet [31] and with RegNet [23]. At the 600M MACs level, our searched S-EfficientNet-B1 and S-GhostNet-B1 reach 79.91% and 80.08%, which is 0.81 and 0.95 points above EfficientNet-B1 (79.1%) and GhostNet-B1 (79.13%), respectively. At the EfficientNet-B2 and B3 levels, our searched EfficientNet architectures achieve 80.92% and 81.98%. At the 4.4B MACs level, S-EfficientNet-B4 reaches 83.0% and S-GhostNet-B4 reaches 83.2%.

The relabel training trick improves the accuracy further. According to Table 1, the Top-1 accuracy improves by roughly 0.7 to 1.1 points on all searched architectures. We achieve new state-of-the-art ImageNet top-1 accuracies of 80.87% and 84.3% under the setting of 600M and 4.4B MACs, respectively. All searched network architectures are presented in the supplementary materials.

4.3 Process of Greedy Search

Figure 5 shows how the accuracy and the input resolution change during the search. As the MACs increase, the resolution rises in a wave-like manner, which confirms the role of the dynamic search. The accuracy increases slowly and steadily.

Figure 5: The search process toward the target MACs of EfficientNet-B1. The blue and black lines show the changes of accuracy and input size, respectively, as the MACs increase.

Furthermore, the schematic diagram of the greedy search for EfficientNet-B1 is shown in Figure 6. We show the candidate network architectures under constrained MACs. A green box marks the best architecture in the current iteration, and the gray boxes are discarded. The best architecture of each iteration is passed on to the later iterations.

Figure 6: The schematic diagram of greedy search for EfficientNet-B1.

5 Conclusion

Network enlarging is an effective scheme for generating deep neural networks with excellent performance from a smaller baseline. Unlike the conventional approach, which directly enlarges the given network using a unified strategy, we present a novel greedy network enlarging algorithm. The entire network enlarging task is divided into several iterations that search for the best allocation of computations in a step-by-step fashion. During the enlarging of the base model, the added MACs are assigned to the most appropriate location. Experimental results on several benchmark models show that the proposed method surpasses the original unified enlarging scheme and achieves state-of-the-art performance in terms of both accuracy and computational cost. Beyond the allocation of MACs at the stage level, a more fine-grained allocation of MACs is expected in future work.