1 Introduction
Convolutional neural networks (CNNs) deliver state-of-the-art accuracy in many computer vision tasks such as image classification [18, 10], object detection [26] and image super-resolution [16]. Most deep CNNs are carefully designed with a predefined number of parameters and a predefined computational complexity. For example, ResNet [10] mainly consists of versions with , , , and layers. These CNNs have provided strong baselines for visual applications.
To improve accuracy further, the most common approach is to scale up the base CNN model. Three factors, namely depth, width and input resolution, heavily affect the model size. A number of works propose to scale models by depth [10, 29], width [37] or input image resolution [14]. These works consider only one of the three dimensions, which leads to an imbalanced utilization of computations, measured in multiply-accumulate operations (MACs). Simultaneously enlarging the width, depth and resolution provides a more flexible design space in which to find high-performance models. Recently, several works have focused on how to scale the three factors efficiently. EfficientNet [31] constructed a compound scaling formula to constrain the network width, depth and resolution. RegNet [23] studied the relationship between width and depth by exploring network design spaces. These methods use a unified principle to scale the whole model, but ignore stage-wise differences.
Here we rethink the procedure of enlarging CNN models from the viewpoint of stage-wise computation resource allocation. Modern CNNs usually consist of several stages, where one stage contains all layers with the same spatial size of feature maps. In Figure 2, we present the computations of different stages for ResNet [10] and EfficientNet [31]. Figure 2 (left) shows the discrepancy within the ResNet series: ResNet-18 has balanced MACs across stages, while ResNet-50 and ResNet-101 spend more MACs in the intermediate stages and few MACs in the head and tail stages. Figure 2 (right) presents the allocation of MACs for EfficientNet-B0, EfficientNet-B2 and EfficientNet-B4. EfficientNet uses one unified scaling principle for network width, depth and resolution, so its different configurations show a similar distribution of MACs over the stages. The later stages have far more MACs (> times) than the early and intermediate stages. A universal rule of computation allocation for different models is therefore impractical, and neither a manually designed rule nor a unified one yields the optimal allocation of computations.
In this paper, we propose a network enlarging method based on a greedy search over the computations of each stage. In contrast to the conventional unified principle, the method performs a fine-grained search on the reallocation of computations. Given a baseline network, our goal is to enlarge it to the target MACs with the best configuration of depth, width and resolution in each stage. Under the assumption that the top-performing smaller CNNs are a proper subcomponent of the top-performing larger CNNs, we are able to enlarge CNNs step by step using a greedy network enlarging algorithm. In each iteration of the proposed algorithm, 1) a series of candidate networks is constructed by searching the width, depth and resolution of each stage under constrained MACs; 2) with a fast performance evaluation method, the architecture with the best performance in this iteration is appended to the baseline model pool for the next iteration. By gradually adding MACs at each iteration, we search for the best architecture until the target MACs are reached. Experimental results on the ImageNet classification task demonstrate the superiority of the proposed method. The searched network configurations substantially boost the performance of existing base models. For example, the EfficientNet models searched by the proposed method outperform the original EfficientNets by a large margin.
2 Related Work
Manual Network Design.
In the early days after AlexNet [18], a large number of manually designed network architectures emerged. VGG [29] is a typical CNN architecture without any special connections, and deeper VGG nets reach higher accuracy. However, very deep networks suffer from convergence problems. ResNet [10] introduced shortcut connections, which allow more layers and higher accuracy. Besides deeper networks, wider networks are another direction: Wide ResNet [37] achieves higher accuracy by adding channels to each layer of ResNet. In addition, a number of lightweight networks have been proposed to meet the demands of mobile devices, including GoogLeNet [30], MobileNets [12, 27], ShuffleNets [38, 22] and GhostNet [9]. By setting a single width scaling factor, the accuracy and MACs of MobileNets and GhostNet can be scaled up. The design process behind these networks was largely manual and focused on discovering new design choices that improve accuracy, e.g., the use of deeper or wider models or shortcuts.
Automatic Network Design.
Designing architectures by hand is time-consuming. Because of this, there is a growing interest in automated neural architecture search (NAS) methods [6]. By now, NAS methods have outperformed manually designed architectures on some tasks such as image classification [40, 24, 20, 21, 11], object detection [8, 35, 15, 33] and semantic segmentation [19, 28]. Generally, more MACs mean higher accuracy. Researchers have long known how to change the depth, width or resolution of models, but usually only one dimension is considered. EfficientNet [31] showed that it is critical to balance all dimensions of network width, depth and resolution, and proposed a simple yet effective compound scaling method based on results obtained by random sampling. RegNet [23] derived several patterns from a huge number of experiments: good networks have widths that increase with the stages, and the stage depths likewise tend to increase for the best models, although not necessarily in the last stage. These methods derive principles from small networks and use the resulting rule to obtain models of various sizes, even very large ones. In this paper, we use greedy allocation of MACs to enlarge a model and obtain a specific model architecture under constrained MACs. During the expansion, the width and depth of each stage and the input resolution are all considered. Our intention is to maximize the utilization of MACs for the network.
3 Approach
In this section, we describe the proposed network enlarging method based on greedy allocation of MACs. First, we define the goal of our method: finding the optimal depth and width of each stage and the input resolution. Second, we introduce the main algorithm of greedy network enlarging. Finally, we describe how to efficiently evaluate the performance of candidate models.
3.1 Problem Definition
Modern CNN backbone architectures usually consist of a stem layer, a network body and a head [23, 10, 31]. Most of the MACs and parameters lie in the network body, as the stem is typically a convolutional layer and the head is a fully-connected layer. Thus, in this paper we focus on the scaling strategy of the network body. The network body consists of several stages [23], each defined as a sequence of layers or blocks with the same spatial size. For example, the ResNet-50 [10] body is composed of stages with , , , and output sizes, respectively.
Scaling up convolutional neural networks is widely used to achieve better accuracy. Network depth, width and input resolution are three key factors for model scaling. Deeper convolutional neural networks capture richer and more complex features, and usually perform better than shallow networks. With the help of shortcuts, very deep networks can be trained to convergence. However, the accuracy gains become smaller as depth increases. Another direction is scaling the width of the network: more kernels mean that more fine-grained features can be learned. However, MACs grow quadratically with width, so for a fixed budget the network depth is constrained and high-level features may be lost. EfficientNet [31] showed that accuracy quickly saturates when networks become wider. Higher resolution provides rich fine-grained information, but a more powerful network is needed to exploit it: deeper and wider networks acquire larger receptive fields and capture fine-grained features.
As a result, the network depth, width and resolution are not independent. These three dimensions can be combined in many ways, and no single unified principle can yield the best configuration for all tasks. In this paper, we decompose the network depth, width and resolution into per-stage depth, per-stage width and input resolution, with the aim of maximizing the utilization of computations for each stage and for the whole network.
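The different costs of the three scaling directions can be illustrated with a small calculation. The sketch below assumes a single standard (dense) convolution with the usual MACs count of kernel area times input channels times output channels times output positions; the function name and the layer sizes are hypothetical and chosen only for illustration.

```python
def conv_macs(kernel: int, c_in: int, c_out: int, h_out: int, w_out: int) -> int:
    """MACs of one standard (dense) convolution producing an h_out x w_out map."""
    return kernel * kernel * c_in * c_out * h_out * w_out


base = conv_macs(3, 64, 64, 56, 56)
wider = conv_macs(3, 128, 128, 56, 56)        # input and output width doubled
higher_res = conv_macs(3, 64, 64, 112, 112)   # spatial resolution doubled
deeper = 2 * base                             # two identical layers instead of one

print(wider // base, higher_res // base, deeper // base)  # 4 4 2
```

Doubling the width or the resolution quadruples the cost of the layer, while doubling the depth only doubles it, which is why the three dimensions compete for the same MACs budget.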
Given a base network with stages, width and depth are , and input resolution is . The objective is to acquire the network architecture with best performance by optimizing the allocation of target MACs :
(1)  
(2)  
(3) 
where is the trainable parameters of the network. denote the validation accuracy, is the target threshold of MACs and is used to control the difference between the MACs of the searched model and the target MACs.
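For readers who prefer the objective in symbols, one way to write the problem described above is sketched below. The notation is an assumption made here for illustration and is not taken from the original equations: $L$ stages with depths $d = (d_1, \dots, d_L)$ and widths $w = (w_1, \dots, w_L)$, input resolution $r$, network $\mathcal{N}$ with trainable parameters $\theta$, validation accuracy $\mathrm{ACC}_{val}$, target MACs $T$ and tolerance $\delta$. The relative form of the MACs constraint is likewise an assumption.

```latex
\begin{align}
d^{*}, w^{*}, r^{*} &= \arg\max_{d,\, w,\, r}\ \mathrm{ACC}_{val}\big(\mathcal{N}(d, w, r;\ \theta^{*})\big) \\
\text{s.t.}\quad & \big|\,\mathrm{MACs}\big(\mathcal{N}(d, w, r)\big) - T\,\big| \le \delta \cdot T \\
& \theta^{*} = \arg\min_{\theta}\ \mathcal{L}_{train}\big(\mathcal{N}(d, w, r;\ \theta)\big)
\end{align}
```

Read this way, the outer problem chooses the architecture, the constraint keeps its MACs close to the target, and the inner problem trains the weights of each candidate.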
Search space. We consider the combinations of input resolution, width and depth of each stage. Suppose a base network with stages and configurations of width and depth = (, , …, , , , …, ), and input resolution . For each iteration, the growth rate for width, depth and resolution is , and , respectively. Under the constrained target MACs, we enlarge the width and depth of each stage and the network input resolution step by step. For example, ResNet-18 contains stages; if we constrain the search upper bound to times in both depth and width for each stage, and the growth rate is for depth and width, the total number of combinations is without considering the variation of input size.
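To get a feeling for how quickly this search space grows, the following sketch counts the width and depth combinations under a simple model. It assumes that each stage's depth and width multiplier independently takes values from 1x up to an upper bound in equal additive steps; the function name and the example values (4 stages, upper bound 2x, step 0.5) are hypothetical and are not the settings used in the paper.

```python
def count_configurations(num_stages: int, upper_bound: float, growth: float) -> int:
    """Number of (depth, width) settings when every stage picks both
    multipliers from {1, 1 + growth, 1 + 2 * growth, ..., upper_bound}."""
    steps = round((upper_bound - 1.0) / growth)   # growth steps above the base size
    options_per_dimension = steps + 1             # include the unchanged base size
    return options_per_dimension ** (2 * num_stages)  # depth and width per stage


# Hypothetical example: 3 options per dimension, 8 dimensions in total.
print(count_configurations(num_stages=4, upper_bound=2.0, growth=0.5))  # 6561
```

Even this coarse grid yields thousands of networks before the input resolution is varied, and a finer step makes exhaustive evaluation impractical.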
3.2 Algorithm of Greedy Network Enlarging
Figure 3 presents our framework. Our intention is to find the optimal allocation of computations by enlarging the depth, width and input resolution for each stage under constrained computations, so as to maximize the utilization of MACs, as shown in Eq. 1. For each stage in the network, we try to find its optimal depth and width, and the optimal input resolution is searched to match the chosen widths and depths. In problem (1), the search variables take discrete values and have a massive number of combinations. Training deep networks is both time and resource consuming. Because of this extreme complexity, traditional mathematical optimization methods are impractical, so we turn to an efficient neural architecture search method.
To reduce the search complexity, we first introduce an assumption. Finding the globally optimal model is difficult in such a massive search space, so we relax the target to finding a top-performing configuration at the target MACs. We assume that the top-performing smaller CNNs are a proper subcomponent of the top-performing larger CNNs, as stated in Assumption 1. ResNet [10], VGG [29], EfficientNet [31] and others fit this assumption. This assumption enables an efficient search algorithm via greedy network enlarging.
Assumption 1
Given an optimal network with MACs , depth , width and resolution , there exists at least one top-performing network with MACs , depth , width and resolution that satisfies
(4)  
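One plausible reading of Assumption 1 is a component-wise ordering between the smaller and the larger network. The symbols below are assumptions introduced for illustration: the larger optimal network has MACs $M$, per-stage depths $d_i$, per-stage widths $w_i$ and resolution $r$, and primed symbols denote the smaller network over the same $L$ stages.

```latex
\exists\, (d', w', r') \ \text{top-performing at MACs } M' \le M
\quad \text{such that} \quad
d'_i \le d_i,\qquad w'_i \le w_i \quad (i = 1, \dots, L),\qquad r' \le r .
```

Under this reading, a top-performing large network can always be reached from some top-performing smaller one by only growing depths, widths and resolution, which is what makes a greedy enlarging procedure reasonable.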
With the above assumption, we transform the optimal network architecture search problem into a series of interrelated single-stage optimal sub-network architecture search problems, and then solve them one by one. A decision has to be made at each stage of this process, and it depends only on the current state (here, the current state refers to the resolution, widths and depths at the current stage). Once the decision of every stage is determined, the decisions form a sequence, which determines the final solution. The overall algorithm is given in Algorithm 1.
In the algorithm, we increase the MACs exponentially during the search. Compared with a uniform increment, this makes the changes to the network more gradual. In each iteration of Algorithm 1, in order to find the locally optimal architecture configuration, we have to search and evaluate candidate architectures. This step has two goals: the first is to find candidate architectures under a limited increase of MACs; the second is to find the locally optimal architecture, i.e., the one with maximum validation accuracy. When generating candidates, we consider the increase of resolution separately, which reduces the number of candidates. The width and depth of each stage are then increased on the basis of the corresponding resolution.
To reduce the number of candidates, we use a proportional control factor to split the MACs between depth and width for each stage. Specifically, the ratios of MACs between depth and width are taken from one set . Under this setting, we search depth first and then width for each stage. The procedure is given in Algorithm 2.
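The step-by-step procedure described above can be sketched as a short loop. This is a simplified illustration under stated assumptions and not the authors' Algorithm 1: `Config`, `toy_macs`, `propose` and `toy_score` are hypothetical names, the cost model is a toy, candidates are single-step enlargements only, and the score is a stand-in for the proxy validation accuracy.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Config:
    depths: Tuple[int, ...]   # blocks per stage
    widths: Tuple[int, ...]   # channels per stage
    resolution: int           # input side length


def toy_macs(cfg: Config) -> int:
    # Hypothetical cost model: depth * width^2 per stage, times resolution^2.
    per_stage = sum(d * w * w for d, w in zip(cfg.depths, cfg.widths))
    return per_stage * cfg.resolution ** 2


def bump(values: Tuple[int, ...], i: int, step: int) -> Tuple[int, ...]:
    return values[:i] + (values[i] + step,) + values[i + 1:]


def propose(cfg: Config, budget: float) -> List[Config]:
    """Single-step enlargements of cfg whose MACs stay within the budget."""
    options = [Config(cfg.depths, cfg.widths, cfg.resolution + 16)]
    for i in range(len(cfg.depths)):
        options.append(Config(bump(cfg.depths, i, 1), cfg.widths, cfg.resolution))
        options.append(Config(cfg.depths, bump(cfg.widths, i, 8), cfg.resolution))
    return [c for c in options if toy_macs(c) <= budget]


def toy_score(cfg: Config) -> float:
    # Stand-in for proxy validation accuracy; not a real accuracy model.
    stage_terms = sum(math.log(d) + math.log(w) for d, w in zip(cfg.depths, cfg.widths))
    return stage_terms + math.log(cfg.resolution)


def greedy_enlarge(base: Config, budgets: Sequence[float], evaluate=toy_score) -> List[Config]:
    pool = [base]  # best architecture found so far at each step
    for budget in budgets:
        candidates = propose(pool[-1], budget)
        if candidates:  # keep the current model if nothing fits the budget
            pool.append(max(candidates, key=evaluate))
    return pool


base = Config(depths=(1, 1), widths=(8, 16), resolution=32)
pool = greedy_enlarge(base, budgets=[400_000, 500_000, 650_000])
for cfg in pool:
    print(cfg, toy_macs(cfg))
```

Each iteration only looks at enlargements of the previous winner, so the number of evaluated networks grows linearly with the number of iterations rather than with the size of the full search space.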
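The difference between an exponential and a uniform MACs schedule can be made concrete with a few lines of code. The sketch below assumes a geometric progression from the base MACs to the target MACs; the function names and the number of iterations are hypothetical, and the start and end values are simply the EfficientNet-B0 and EfficientNet-B4 MACs quoted in Table 1.

```python
from typing import List


def exponential_macs_schedule(base_macs: float, target_macs: float, num_iters: int) -> List[float]:
    """Per-iteration MACs targets that grow by a constant factor."""
    ratio = (target_macs / base_macs) ** (1.0 / num_iters)
    return [base_macs * ratio ** k for k in range(1, num_iters + 1)]


def uniform_macs_schedule(base_macs: float, target_macs: float, num_iters: int) -> List[float]:
    """Per-iteration MACs targets that grow by a constant amount."""
    step = (target_macs - base_macs) / num_iters
    return [base_macs + step * k for k in range(1, num_iters + 1)]


exp_targets = exponential_macs_schedule(0.39e9, 4.39e9, num_iters=5)
uni_targets = uniform_macs_schedule(0.39e9, 4.39e9, num_iters=5)
print([round(t / 1e9, 2) for t in exp_targets])
print([round(t / 1e9, 2) for t in uni_targets])
```

With the exponential schedule every step enlarges the current model by the same relative amount, whereas the uniform schedule more than triples the small base model in its first step.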
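The depth-first, then-width allocation for a single stage can be sketched as follows. This is an illustrative simplification and not the authors' Algorithm 2: the cost model (depth times width squared), the function names, the width step of 8 channels and the ratio set are all assumptions made here.

```python
from typing import List, Tuple


def stage_macs(depth: int, width: int) -> int:
    # Hypothetical per-stage cost: every block costs width^2 MACs.
    return depth * width * width


def enlarge_stage(depth: int, width: int, extra_macs: int, depth_ratio: float,
                  width_step: int = 8) -> Tuple[int, int]:
    """Spend `extra_macs` on one stage: depth first, then width."""
    limit = stage_macs(depth, width) + extra_macs
    # 1) Depth: add as many blocks as the depth share of the budget allows.
    new_depth = depth + int((depth_ratio * extra_macs) // stage_macs(1, width))
    # 2) Width: widen while the whole stage still fits the total budget.
    new_width = width
    while stage_macs(new_depth, new_width + width_step) <= limit:
        new_width += width_step
    return new_depth, new_width


def stage_candidates(depth: int, width: int, extra_macs: int,
                     ratios=(0.0, 0.25, 0.5)) -> List[Tuple[int, int]]:
    # One candidate per ratio in the (assumed) ratio set.
    return [enlarge_stage(depth, width, extra_macs, r) for r in ratios]


print(stage_candidates(depth=2, width=16, extra_macs=2048))
# [(2, 32), (4, 24), (6, 16)]
```

Restricting the split to a small set of ratios produces only a handful of candidates per stage, ranging here from a purely wider stage to a purely deeper one.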
3.3 Performance Estimation
To guide the search process, we have to estimate the performance of a given architecture. The most accurate method is to train the candidates on the whole training set and evaluate them on validation data. However, this requires computation on the order of thousands of GPU days, so speeding up performance estimation is crucial.
We turn to proxy tasks to estimate performance. Options include shorter training times [23], training on a subset of the data [17], training on proxy data [40], or using fewer filters per layer and fewer cells [25]. While these low-fidelity approximations reduce the cost, they also introduce bias into the performance estimate. Proxy data and simplified architectures show large deviations, which leads to poor rank preservation.
In this section, we determine the best proxy task for performance estimation through empirical experiments. First, we select the proxy sub-dataset by evaluating the performance of different sub-datasets. Second, the training hyperparameters are obtained by parameter search. We use Spearman’s rank correlation coefficient, a nonparametric measure of rank correlation, to measure the quality of a proxy task.
For the proxy sub-dataset, we create two sub-datasets, ImageNet-1000-100 and ImageNet-100-500, by randomly selecting images from ImageNet. To evaluate these datasets, network architectures with different widths, depths and input sizes are generated on the basis of EfficientNet-B0. We train all these networks and EfficientNet-B0 on the whole ImageNet training set for epochs, and use their Top-1 accuracies on the validation set as the reference. On the sub-datasets, we fine-tune the networks for different numbers of epochs and also train them from scratch for a few epochs. On ImageNet-100-500, the average Spearman value is . On ImageNet-1000-100, the average Spearman value is . We therefore choose ImageNet-1000-100 as the proxy sub-dataset. More details are presented in the supplementary materials.
After determining the proxy dataset, we try to improve the correlation between the proxy task and the original task by searching the hyperparameters. network architectures with different widths, depths and input sizes are generated on the basis of EfficientNet-B0. We train all of the networks on the whole ImageNet training set for epochs, and use their Top-1 accuracies on the validation set as the reference. Two pretrained EfficientNet-B0 models are provided, trained on ImageNet and on ImageNet-1000-100, respectively. The learning rate, the mode of learning rate decay and the number of training epochs are considered. Among these hyperparameter combinations, the top-2 Spearman values are and , which indicate a moderate positive correlation. Both use cosine decay, and the initial learning rate is for training epochs. The difference is that the first uses the ImageNet-1000-100 pretrained model and the second uses the ImageNet pretrained model. More details are presented in the supplementary materials. Figure 4 presents the consistency of different networks. In the next section, we use an initial learning rate of and cosine decay for fine-tuning epochs on the ImageNet-1000-100 pretrained model.
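As a reminder of how the rank correlation is computed, the sketch below implements Spearman's coefficient as the Pearson correlation of the ranks, with tied values receiving their average rank. The function names are our own and the accuracy values are made up for illustration; they are not results from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the ranks (undefined if one input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Made-up accuracies of four candidate networks on a proxy task and on the full task.
proxy_acc = [60.1, 62.3, 61.0, 65.2]
full_acc = [76.0, 77.0, 77.5, 79.1]   # the middle two networks swap order
print(round(spearman(proxy_acc, full_acc), 3))  # 0.8
```

A value close to 1 means that the proxy task ranks candidate networks in nearly the same order as full training, which is the property the greedy search relies on.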
4 Experiments
In this section, we evaluate the greedy network enlarging method on the general image classification dataset ImageNet [5]. We demonstrate that the method achieves state-of-the-art accuracy at similar MACs.
4.1 Datasets, Networks and Experimental Settings
We evaluate our method on the popular classification dataset ImageNet (ILSVRC2012) [5], which contains M images and categories; the validation set contains K images. To speed up the search process, we create the proxy dataset ImageNet-1000-100, which contains K training images and K validation images randomly sampled from the ImageNet training set. Two baseline networks are considered: EfficientNet [31] and an improved GhostNet [9].
To accelerate the search process, we set the growth rates of resolution and depth to and , respectively. For the growth rate of width, we use for small models and for large models. The ratios of MACs between depth and width are taken from one set . The error rate of MACs is . We use exponential growth of MACs, and set different numbers of search iterations for small and large models. Fine-tuning uses the function-preserving algorithm of [3].
After the search is completed, we retrain the resulting network architecture on the whole ImageNet from scratch. The training settings follow timm [34] (used under its license) and EfficientNet [31]: RMSProp optimizer with momentum 0.9; weight decay 1e-5; a multi-step learning rate with warmup, with an initial learning rate of 0.064 that decays by 0.97 every 2.4 epochs. A moving average of the weights, DropBlock [7], random erasing [39] and RandAugment [4] are used. ImageNet has noisy labels, and crop augmentation introduces more noisy inputs and labels. To mitigate this, we use the relabel method [36] to obtain higher accuracy.
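The learning rate schedule described above (initial value 0.064, multiplied by 0.97 every 2.4 epochs, with warmup) can be written as a small function. This is a sketch under assumptions: the function name is hypothetical, the warmup is taken to be linear over 3 epochs, and the decay steps are counted from the end of the warmup; none of these three details is stated in the text.

```python
import math


def learning_rate(epoch: float, base_lr: float = 0.064, decay_rate: float = 0.97,
                  decay_epochs: float = 2.4, warmup_epochs: float = 3.0) -> float:
    """Step-decay learning rate with an (assumed) linear warmup."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs            # assumed linear warmup
    steps = math.floor((epoch - warmup_epochs) / decay_epochs)
    return base_lr * decay_rate ** steps                  # x0.97 every 2.4 epochs


for epoch in (0.0, 1.5, 3.0, 6.0, 8.0, 100.0):
    print(epoch, round(learning_rate(epoch), 5))
```

Because the factor 0.97 is applied so frequently, the schedule behaves like a smooth exponential decay rather than the few large drops of a classic multi-step schedule.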
4.2 ImageNet Results and Analysis
For EfficientNet, we take EfficientNet-B0 as the baseline and search for models with MACs similar to EfficientNet-B1 to B4. In addition, we enlarge GhostNet with the compound scaling principle of EfficientNet and also search GhostNet architectures with the greedy search method. For GhostNet, we add a Squeeze-and-Excitation [13] module to each block. Table 1 shows the main results and a comparison with other networks. The searched models are marked with ’S’.
GhostNet-B1 and GhostNet-B4 in Table 1 are obtained with the compound scaling rule of EfficientNet. Their accuracy is lower than that of the models found by greedy search, which suggests that the rule derived for EfficientNet does not fit GhostNet: one would need to resample and re-optimize for each new network to obtain suitable rules. Moreover, the compound scaling principle ignores the differences between stages and therefore cannot make fine adjustments.
Model                   Top-1 Acc.  #Params  #MACs   Ratio-to-EfficientNet

EfficientNet-B0 [31]    77.1%       5.3M     0.39B   1x
GhostNet [9]            73.9%       5.2M     0.14B   0.36x

EfficientNet-B1 [31]    79.1%       7.8M     0.69B   1x
ResNet-RS-50 [1]        78.8%       36M      4.6B    6.7x
RegNetY-800MF [23]      76.3%       6.3M     0.8B    1.16x
S-EfficientNet-B1       79.91%      8.8M     0.68B   1x
S-EfficientNet-B1-re    80.71%      8.8M     0.68B   1x
GhostNet-B1             79.13%      13.3M    0.59B   0.85x
S-GhostNet-B1           80.08%      16.2M    0.67B   1x
S-GhostNet-B1-re        80.87%      16.2M    0.67B   1x

EfficientNet-B2 [31]    80.1%       9.1M     0.99B   1x
RegNetY-1.6GF [23]      78.0%       11.2M    1.6B    1.6x
S-EfficientNet-B2       80.92%      9.3M     1.0B    1x
S-EfficientNet-B2-re    81.58%      9.3M     1.0B    1x

EfficientNet-B3 [31]    81.6%       12.2M    1.83B   1x
ResNet-RS-101 [1]       81.2%       64M      12B     6.6x
RegNetY-4.0GF [23]      79.4%       20.6M    4.0B    2.18x
S-EfficientNet-B3       81.98%      12.3M    1.88B   1x
S-EfficientNet-B3-re    82.87%      12.3M    1.88B   1x

EfficientNet-B4 [31]    82.9%       19.3M    4.39B   1x
RegNetY-8.0GF [23]      79.9%       39.2M    8.0B    1.82x
NFNet-F0 [2]            83.6%       71.5M    12.4B   2.8x
ResNet-RS-152 [1]       83.0%       87M      31B     7.1x
EfficientNetV2-S [32]   83.9%       24M      8.8B    2.0x
S-EfficientNet-B4       83.0%       17.0M    4.34B   1x
S-EfficientNet-B4-re    84.0%       17.0M    4.34B   1x
GhostNet-B4             82.78%      36.1M    4.39B   1x
S-GhostNet-B4           83.2%       32.9M    4.37B   1x
S-GhostNet-B4-re        84.3%       32.9M    4.37B   1x
In Table 1, the Top-1 accuracies of all searched architectures exceed those obtained with the compound scaling of EfficientNet [31] and with RegNet [23]. At the EfficientNet-B1 level (about 0.68B MACs), S-EfficientNet-B1 and S-GhostNet-B1 reach 79.91% and 80.08%, improving on EfficientNet-B1 (79.1%) and GhostNet-B1 (79.13%), respectively. At the EfficientNet-B2 and B3 levels, our searched EfficientNet architectures achieve 80.92% and 81.98%. At the EfficientNet-B4 level (about 4.3B MACs), S-EfficientNet-B4 reaches 83.0% and S-GhostNet-B4 reaches 83.2%.
The relabel training trick improves the accuracy further (rows marked ’re’ in Table 1): the Top-1 accuracy of the searched architectures rises by 0.66 to 1.1 percentage points. We achieve new state-of-the-art ImageNet Top-1 accuracies of 80.87% and 84.3% at about 0.67B and 4.37B MACs, respectively. All searched network architectures are presented in the supplementary materials.
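The size of the relabel gain can be read directly off Table 1. The short sketch below recomputes it from the Top-1 accuracies of each searched model without and with relabel training (the variants marked 're'); the dictionary layout and variable names are our own.

```python
# Top-1 accuracy (%) from Table 1: (searched model, same model with relabel training).
top1 = {
    "S-EfficientNet-B1": (79.91, 80.71),
    "S-GhostNet-B1": (80.08, 80.87),
    "S-EfficientNet-B2": (80.92, 81.58),
    "S-EfficientNet-B3": (81.98, 82.87),
    "S-EfficientNet-B4": (83.0, 84.0),
    "S-GhostNet-B4": (83.2, 84.3),
}

# Gain in percentage points, rounded to the precision of the table.
gains = {name: round(after - before, 2) for name, (before, after) in top1.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}")
print(min(gains.values()), max(gains.values()))  # 0.66 1.1
```

The gain is smallest for S-EfficientNet-B2 and largest for S-GhostNet-B4.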
4.3 Process of Greedy Search
Figure 5 shows how the accuracy and input resolution change during the search process. As MACs increase, the resolution rises in a fluctuating manner, which confirms the value of searching it dynamically, while the accuracy increases slowly and steadily.
Furthermore, a schematic diagram of the greedy search for EfficientNet-B1 is shown in Figure 6, which shows the candidate network architectures under constrained MACs. A green box marks the best architecture in the current iteration, and gray boxes mark discarded candidates. The best architecture of each iteration is passed on to the later iterations.
5 Conclusion
Network enlarging is an effective scheme for generating deep neural networks with excellent performance from a smaller baseline. Unlike the conventional approach, which directly enlarges the given network using a unified strategy, we present a novel greedy network enlarging algorithm. The network enlarging task is divided into several iterations, each of which searches for the best allocation of computations, in a step-by-step fashion. During the enlarging of the base model, the added MACs are assigned to the most appropriate location. Experimental results on several benchmark models show that the proposed method surpasses the original unified enlarging scheme and achieves state-of-the-art performance in terms of both accuracy and computational cost. Beyond the allocation of MACs at the stage level, a more fine-grained allocation of MACs is left for future work.
References
 [1] Irwan Bello, William Fedus, Xianzhi Du, Ekin D Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. Revisiting resnets: Improved training and scaling strategies. arXiv preprint arXiv:2103.07579, 2021.
 [2] Andrew Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. arXiv preprint arXiv:2102.06171, 2021.
 [3] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.

 [4] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
 [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
 [6] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey, 2019.
 [7] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regularization method for convolutional networks. arXiv preprint arXiv:1810.12890, 2018.
 [8] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7036–7045, 2019.
 [9] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1580–1589, 2020.
 [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [11] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE International Conference on Computer Vision, pages 1314–1324, 2019.
 [12] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [13] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
 [14] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in neural information processing systems, pages 103–112, 2019.
 [15] Chenhan Jiang, Hang Xu, Wei Zhang, Xiaodan Liang, and Zhenguo Li. SP-NAS: Serial-to-parallel backbone search for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11863–11872, 2020.
 [16] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.

 [17] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 528–536, 2017.
 [18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [19] Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 82–92, 2019.
 [20] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
 [21] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
 [22] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018.
 [23] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10428–10436, 2020.

 [24] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4780–4789, 2019.
 [25] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):4780–4789, Jul. 2019.
 [26] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 [27] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and LiangChieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
 [28] Albert Shaw, Daniel Hunter, Forrest Iandola, and Sammy Sidhu. Squeezenas: Fast neural architecture search for faster semantic segmentation. In Proceedings of the IEEE international conference on computer vision workshops, pages 0–0, 2019.
 [29] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 [30] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [31] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
 [32] Mingxing Tan and Quoc V Le. Efficientnetv2: Smaller models and faster training. arXiv preprint arXiv:2104.00298, 2021.
 [33] Mingxing Tan, Ruoming Pang, and Quoc V Le. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10781–10790, 2020.
 [34] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
 [35] Hang Xu, Lewei Yao, Wei Zhang, Xiaodan Liang, and Zhenguo Li. Auto-FPN: Automatic network architecture adaptation for object detection beyond classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 6649–6658, 2019.
 [36] Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Junsuk Choe, and Sanghyuk Chun. Re-labeling imagenet: from single to multi-labels, from global to localized labels. arXiv preprint arXiv:2101.05022, 2021.
 [37] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 [38] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
 [39] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13001–13008, 2020.
 [40] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.