Convolutional neural networks (CNNs) deliver state-of-the-art accuracy in many computer vision tasks such as image classification[18, 10], object detection 
, image super-resolution. Most of deep CNNs are well designed with a predefined number of parameters and computational complexities. For example, ResNet  mainly consists of versions with , , , and layers. These CNNs have provided strong baselines for visual applications.
To improve the accuracy further, the most common way is to scale up the base CNN model. Three factors including depth, width and input resolution heavily affect the model size. A number of works propose to scale models by the depth [10, 29], width  or input image resolution . These works consider only one dimension from depth, width or resolution, which leads to the imbalance in utilization of the computations or multiply-accumulate operations (MACs) . Simultaneously enlarging the width, depth and resolution can provide more flexible design space to find the high-performance models. Recently, several works focus on how to efficiently scale the three factors. EfficientNet  constructed one compound scaling formula to constrain the network width, depth and dimension. RegNet  studied the relationship between width and depth by exploring the network design spaces. These methods utilize a unified principle to scale the whole model, but ignore the stage-wise differences.
Here we rethink the procedure of enlarging CNN models from the viewpoint of stage-wise computation resource allocation. Modern CNNs usually consists of several stages, where one stage contains all layers with the same spatial size of feature maps. In Figure 2, we present the computations of different stages for ResNet  and EfficientNet . Figure 2 left demonstrate the discrepancy between ResNet series. ResNet18 has balanced MACs for each stage, while ResNet50 and ResNet101 get more MACs in the intermediate stages but few MACs in the head and tail stages. Figure 2 right presents the allocation of FLOPS for EfficientNet-B0, EfficientNet-B2 and EfficientNet-B4. EfficientNet utilizes one unified model scaling principle for network width, depth and resolution, so different configurations of EfficientNet have the similar tendency of MACs on different stages. The later stages have far more MACs (> times) than the former and intermediate stages. A universal rule of the computation allocation for different models is impractical. Neither the manual designed or unified rule is the solution of optimal computations allocation.
In this paper, we propose a network enlarging method based on greedy search of computations for each stage. In contrast to conventional unified principle, the method performs fine-grained search on the reallocation of computations. Given a baseline network, our goal is to enlarge it to the target MACs with the best configuration of depth, width and resolution in each stage. Under the assumption that the top-performing smaller CNNs are a proper subcomponent of the top-performing larger CNNs, we are able to enlarge CNNs step-by-step using greedy network enlarging algorithm. For each iteration in proposed algorithm, 1) a series of candidate networks are constructed by searching width, depth and resolution of each stage under constrained MACs; 2) with fast performance evaluation method, the architecture with the best performance in this iteration is appended to the baseline model pool for next iteration. By gradually adding MACs at each iteration, we find the optimal architecture until achieving the target MACs. Experiment results on ImageNet classification task demonstrate the superiority of our proposed method. The searched network configurations can largely boost the performances of existing base models. For example, searched EfficientNet models by proposed method outperform the original EfficientNets by a large margin.
2 Related Work
Manual Network Design.
In the early days after AlexNet , a large number of manually designed network architecture emerged. VGG  is the typical CNN architecture without any special connections, and deeper VGG-nets get high accuracies. However, the convergence problem emerged for very deep network. ResNet  with shortcut was proposed with higher accuracy and more layers. Except deeper network, wider network is another direction. WideResnet  has higher accuracy by adding channels for each layer in Resnet. Besides, a number of light-weight network are proposed in order to meet the demands of mobile devices. GoogLeNet , MobileNets [12, 27], ShuffleNets [38, 22] and GhostNet are these type networks. By setting one width scaling factor, the accuracy and MACs of Mobilenets and GhostNet are improved. The design pattern behind these networks was largely man-powered and focused on discovering new design choices that improve accuracy, e.g., the use of deeper or wider models or shortcuts.
Automatic Network Design.
Currently expert designed architectures are time-consuming. Because of this, there is a growing interest in automated neural architecture search (NAS) methods . By now, NAS methods have outperformed manually designed architectures on some tasks such as image classification [40, 24, 20, 21, 11], object detection [8, 35, 15, 33] or semantic segmentation [19, 28]. Generally, more MACs means higher accuracy. Traditionally, researchers have already learned to change the depth, width or resolution of models. But only one dimension is considered usually. EfficientNet  showed that it was critical to balance all dimensions of network width/depth/resolution and proposed a simple yet effective compound scaling method in accordance with the results by random sampling. RegNet  got several patterns by a huge number of experiments: good network have increasing widths with stages; the stage depths are likewise tend to increase for the best models, although not necessarily in the last stage. These methods construct principles from small networks, and use the rule to get various sizes of model, even very large models. In this paper, we take use of greedy allocation of MACs to enlarge model and get the specific model architecture under constrained MACs. During the expansion, the width, depth and input resolution are considered for each stage. Our intention is to maximize the utilization of MACs for the network.
In this section, we describe the proposed network enlarging method based on greedy allocation of MACs. Firstly, we define the goal of our method to find the optimal depth and width of each stage and the input resolution. Secondly, we introduce the main algorithm of greedy network enlarging. Further, we introduce how to efficiently evaluate the performance of candidate models.
3.1 Problem Definition
The modern CNN backbone architectures usually consists of a stem layer, network body and a head [23, 10, 31]. The main MACs and parameters burdens lie in the network body, as typically the stem layer is a convolutional layer and the head is a fully-connected layer. Thus, in this paper we focus on the scaling strategy of network body. The network body consists of several stages , which are defined as a sequence of layers or blocks with the same spatial size. For example, ResNet50  body is composed of stages with , , , and output sizes, respectively.
Scaling up convolutional neural networks is widely used to achieve better accuracy. Network depth, width and input resolution are three key factors for model scaling. Deeper convolutional neural networks capture richer and more complex features, and usually have high performance in contrast to shallow network. With the help of shortcuts, very deep network can be trained to convergence. However, the improvements on accuracy become smaller with the increase of depth. Another direction is scaling the width of network. More kernels mean more fine grained features can be learned. However, the MACs is squared with the width. As a result, the network depth will be constrained and high level features maybe loss. EfficientNet  showed that the accuracy quickly saturates when networks become wider. Higher resolution provide rich fine-grained information. In order to match high resolution, more powerful network is wanted. Deeper and wider network can acquire large receptive field and capture fine grained features.
As a result, the network depth, width and resolution are not independent. These three dimension have various combinations. And one unified principle can not acquire the best configuration for all tasks. In this paper, we decompose the network depth, width and resolution into stage depth, width and input resolution. This will maximize the utilization of computations for each stage and the whole network.
Given a base network with stages, width and depth are , and input resolution is . The objective is to acquire the network architecture with best performance by optimizing the allocation of target MACs :
where is the trainable parameters of the network. denote the validation accuracy, is the target threshold of MACs and is used to control the difference between the MACs of the searched model and the target MACs.
Search space. We consider the combinations of input resolution, width and depth of each stage. Suppose a base network with stages and configurations of width and depth = (, , …, , , , …, ), and input resolution . For each tieration, the growth rate for width, depth and resolution is , and , respectively. Under constrained target MACs, we enlarge the width, depth for each stage and the network input resolution step-by-step. For example, ResNet18 contains stages, if we constrain the search upper bound is times in both depth and width for each stage, and the growth rate is for depth and width. The total number of combinations is without considering the variation of input size.
3.2 Algorithm of Greedy Network Enlarging
Figure 3 presents our framework. Our intention is to find the optimal allocation of computations by enlarging depth, width and input resolution for each stage under constrained computations. So as to maximize the utilization of MACs, as shown in Eq. 1. For each stage in the network, we try to find its optimal depth and width . The optimal input size of resolution is searched to match the specialized width and depth. In the problem1,
have discrete values and massive combinations. Deep learning is both time and resource consuming. Due to the extreme complexity, traditional mathematical optimization method is impracticable. So we turn to efficient neural architecture search method.
To simply the search complexity, we first introduce an assumption. Finding the global optimal model is difficult with the massive search space, so we can smooth the target to find a top-performing configuration of target MACs. We introduce an assumption that the top-performing smaller CNNs are a proper subcomponent of the top-performing larger CNNs, as shown in Assumption 1. Resnet , VGG , EfficientNet  etc., fit this assumption perfectly. This assumption enables the idea of efficient search algorithm via greedy network enlarging.
Given an optimal network with MACs , depth , width and resolution , there exists at least one top-performing network with MACs , depth , width and resolution that satisfies
With the above assumption, we transform the optimal network architecture search problem into a series of interrelated single-stage optimal sub-network architecture search problems, and then solve them one by one. Decisions need to be made at each stage to optimize the process. The selection of decisions at each stage depends only on the current state (here, the current state refers to the resolutions, widths and depths of the current stage). When the decision of each stage is determined, a decision sequence is formed, which determines the final solution. The overall algorithm is illustrated in Algorithm 1.
In the algorithm, we use exponential increment of MACs in the process of search. This way make the changes of network more gentaly in contrast to uniform increment. For each iteration in Algorithm 1, in order to find the local optimal architecture configuration, we have to search and evaluate the candidate architectures. This step contains two targets: the first is to find the candidate architectures under limited increase of MACs; the second is to find the local optimal architectures with maximum validation accuracy. In the step of acquiring candidates, we consider the increase of resolution separately, which reduces the candidates. The increase of width and depth of each stage is on the basis of corresponding resolution.
In order to reduce the searched candidates, we take use of proportional control factor to assign the MACs between depth and width for each stage. Specifically, the ratios of MACs between depth and width are in one set . Under this setting, we search depth first and then width for each stage. The algorithm is illustrated in Algorithm 2.
3.3 Performance Estimation
To guide the search process, we have to estimate the performance of a given architecture. The most accurate method is to train the candidates on the whole training data and evaluate their performance on validation data. However, this way requires great computational demands in the order of thousands of GPU days. Developing methods for speeding up the process of performance estimation is crucial.
We turn to proxy tasks to estimate performance. Including shorter training times , training on a subset of the data , on proxy data  or using less filters per layer and less cells . These low-fidelity approximations reduce the cost, they also introduce bias in the performance estimation. Proxy data and simplified architecture have large deviation which leads to poor rank preservation.
In this section, we determine the optimal proxy task for performance estimation with empirical experiments. Firstly, we get the proxy sub-dataset by evaluating the performance of different sub-datasets. Secondly, the hyper-parameters of training are acquired with parameter search. Spearman’s rank correlation coefficient is a non-parametric measure of rank correlation, which is used as the measure of proxy task.
For the proxy sub-dataset, we create two sub-datasets ImageNet1000-100 and ImageNet100-500 by random selecting images from ImageNet. To evaluate these datasets, network architectures with different width, depth and input sizes are generated on the basis of EfficientNet-B0. We train all the networks and EfficientNet-B0 on the whole train set of ImageNet for epochs, the Top-1 accuracies on the validation dataset are used as the comparison object. We finetune the networks for different epochs. Besides, we train the networks from scratch for few epochs. On ImageNet100-500, the average Spearman value is . On ImageNet1000-100, the average Spearman value is . So we choose ImageNet1000-100 as the proxy sub-dataset. More details are presented in the supplementary materials.
After determining the proxy dataset, we try to improve the correlation between the proxy task and original task by searching the hyper-parameters. network architectures with different width, depth and input sizes are generated on the basis of EfficientNet-B0. We train all of the networks on the whole train set of ImageNet for epochs, the Top-1 accuracies on the validation dataset are used as the comparison object. Two pretrained EfficientNet-B0 models on the ImageNet and ImageNet1000-100 are provided, respectively. The learning rate, mode of learning rate decay and training epochs are considered. Among these hyper-parameter combinations, the top-2 Spearman value is and , these values indicate moderate positive correlation. They both use cosine decay method and the initial learning rate is for training epochs. The difference is that the first use the ImageNet1000-100 pretrained model and the second use the ImageNet pretrained model. More details are presented in the supplementary materials. Figure. 4 presents the consistency of different networks. In the next section, we take use of initial learning rate is and cosine decay for finetuning epochs on the ImageNet1000-100 pretrained model.
In this section, we evaluate greedy network enlarging method on general image classification dataset ImageNet . We demonstrate the method gets state-of-the-art accuracy with similar MACs.
4.1 Datasets, Networks and Experimental Settings
We extensively evaluate our methods on popular classification datasets ImageNet(ILSVRC2012) , which contains M images and categories, the validation set contains K images. On ImageNet, in order to speed up the search process, we create proxy ImageNet1000-100 dataset, which contains K train images and K validation images randomly sampled from ImageNet train set. Two baseline networks are considered: EfficientNet  and improved GhostNet .
To accelerate the search process, we set the growth rate of resolution and depth as and , respectively. For the growth rate of width, we use for small model and for large model. The ratios of MACs between depth and width are in one set The error rate of MACs is . We take use of exponential growth of MACs. We set different number of search iterations for small and large models. The finetune method comes from function preserving algorithm .
. RMSProp optimizer with momentum 0.9; weight decay 1e-5; multi-step learning rate with warmup, initial learning rate 0.064 that decays by 0.97 every 2.4 epochs. Moving average of weight, dropblock, random erasing  and random augment  are used.
ImageNet has noise labels and the method of crop augmentation introduces more noisy input and labels. To prevent this, we use the relabel method  to get higher accuracy.
4.2 ImageNet Results and Analysis
For EfficientNet, we take EfficientNet-B0 as the baseline, and we search the models with MACs similar to EfficientNet-B to B. Besides, we enlarge GhostNet with the principle of EfficientNet and search GhostNet architectures with greedy search method. For GhostNet, we add Squeeze-and-Excitation  module for each block. Table.1 shows the main results and comparison with other networks. The searched models are marked with ’S-’.
GhostNet-B1 and GhostNet-B4 in 1 are obtained by the compounding scale rule of EfficientNet. Their performance is lower in contrast to greedy search methods. This suggests that the rule on EfficientNet is not fit for GhostNet. We need to resample and optimize for new networks to get suitable rules. Besides, the compounding scale principle ignores the difference of stages, which leads to the loss of elaborate adjustment.
In Table.1, Top-1 accuracies of all searched architectures outperform the compound scaling tricks of EfficientNet  and RegNet . On M MACs, our searched architectures get and , improve performance and , respectively. On EfficientNet-B2 and B3, our searched EfficientNet architectures achieve and . We search networks on B MACs level, S-EfficientNet-B4 gets and S-GhostNet-B4 gets , respectively.
The relabel training trick improve the accuracy further. The Top-1 accuracy improves to on all searched architectures. We achieve a new SOTA 80.87% and 84.3% ImageNet top-1 accuracy under the setting of M and B MACs, respectively. All searched network architectures are presented in the supplementary materials.
4.3 Process of Greedy Search
Figure 5 is used specifically to show the changes of accuracy and input resolution of the search process. With increase of MACs, the resolution rises wavily, which verifies the role of dynamic search. The accuracy increases slowly and steadily.
Furtherly, the schematic diagram of greedy search for EfficientNet-B1 is shown in Figure 6. Under constrained MACs, we show the candidate network architectures. The green box means the best architecture in current iteration, and the gray box are discarded. Besides, the best architecture of each iteration are delivered to the later iterations.
Network enlarging is an effective scheme for generating deep neural networks with excellent performance from a smaller baseline. Different from the conventional approach that directly enlarge the given network using a unified strategy, we present a novel greedy network enlarging algorithm. The entire network enlarging task is therefore divided into several iterations for searching the best computational allocation in a step-by-step fashion. In the enlarging process of the base model, the added MACs will be assigned to the most appropriate location. Experimental results on several benchmark models and datasets show that the proposed method is able to surpass the original unified enlarging scheme and achieves state-of-the-art network performance in terms of both network accuracy and computational costs. Beyond allocation of MACs in the stage level, more fine grained allocation of MACs are expected.
-  Irwan Bello, William Fedus, Xianzhi Du, Ekin D Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. Revisiting resnets: Improved training and scaling strategies. arXiv preprint arXiv:2103.07579, 2021.
-  Andrew Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. arXiv preprint arXiv:2102.06171, 2021.
-  Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.
Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le.
Randaugment: Practical automated data augmentation with a reduced
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
-  Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey, 2019.
-  Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regularization method for convolutional networks. arXiv preprint arXiv:1810.12890, 2018.
-  Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7036–7045, 2019.
-  Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1580–1589, 2020.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE International Conference on Computer Vision, pages 1314–1324, 2019.
-  Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
-  Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
-  Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in neural information processing systems, pages 103–112, 2019.
-  Chenhan Jiang, Hang Xu, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Sp-nas: Serial-to-parallel backbone search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11863–11872, 2020.
-  Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter.
Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 528–536, 2017.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 82–92, 2019.
-  Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
-  Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
-  Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018.
-  Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10428–10436, 2020.
Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le.
Regularized evolution for image classifier architecture search.In Proceedings of the aaai conference on artificial intelligence, volume 33, pages 4780–4789, 2019.
-  Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):4780–4789, Jul. 2019.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
-  Albert Shaw, Daniel Hunter, Forrest Landola, and Sammy Sidhu. Squeezenas: Fast neural architecture search for faster semantic segmentation. In Proceedings of the IEEE international conference on computer vision workshops, pages 0–0, 2019.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
-  Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
-  Mingxing Tan and Quoc V Le. Efficientnetv2: Smaller models and faster training. arXiv preprint arXiv:2104.00298, 2021.
-  Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10781–10790, 2020.
-  Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
-  Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Auto-fpn: Automatic network architecture adaptation for object detection beyond classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 6649–6658, 2019.
-  Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Junsuk Choe, and Sanghyuk Chun. Re-labeling imagenet: from single to multi-labels, from global to localized labels. arXiv preprint arXiv:2101.05022, 2021.
-  Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
-  Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
-  Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13001–13008, 2020.
-  Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.