Designing low-latency neural networks is critical for applications on mobile devices, e.g., mobile phones, security cameras, augmented-reality glasses, self-driving cars, and many others. Neural architecture search (NAS) is expected to automatically find a neural network whose performance surpasses human-designed ones under the most common constraints, such as latency, FLOPs and parameter count. One of the key factors in the success of NAS is the manually designed search space. Many previous NAS methods search on the DARTS space [19, 24, 20] or a MobileNet-like space [12, 23, 25]. The DARTS space contains a multi-branch topology that combines various paths with different operations and complexities. The multi-branch design can enrich the feature space and improve performance, and it also appears in earlier human-designed networks, e.g., the Inception and DenseNet models. However, these multi-branch architectures result in longer inference time, which is unfriendly for real tasks. Therefore, many practical NAS methods adopt a MobileNet-like space, keeping only two branches (skip connection and convolutional operation) in the searched networks to realize a tradeoff between inference time and performance.
Recently, structural re-parameterization (Rep) techniques have been introduced. Rep techniques retain a multi-branch topology at training time while keeping a single path at inference time, and thus realize a balance of accuracy and efficiency. The branches can be fused into one path because the operations are linear; for example, a 3x3 Conv and a parallel 1x1 Conv can be replaced by a single 3x3 Conv by zero-padding the 1x1 kernel to 3x3 and element-wise adding it to the 3x3 kernel. It is worth noting that the fused network has lower inference time while keeping almost the same accuracy as the multi-branch network.
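To make this concrete, the 3x3 + 1x1 kernel fusion just described can be sketched in a few lines of NumPy (a minimal sketch of the general idea; the function names are ours and kernels use the PyTorch-style (out, in, h, w) layout):

```python
import numpy as np

def pad_1x1_to_3x3(k1):
    """Zero-pad a (O, I, 1, 1) kernel to (O, I, 3, 3), placing the weight at the centre."""
    k3 = np.zeros((k1.shape[0], k1.shape[1], 3, 3), dtype=k1.dtype)
    k3[:, :, 1, 1] = k1[:, :, 0, 0]
    return k3

def fuse_3x3_1x1(k3, k1):
    """Fuse parallel 3x3 and 1x1 Conv branches into one 3x3 kernel."""
    return k3 + pad_1x1_to_3x3(k1)
```

Because convolution is linear in the kernel, applying the fused kernel to any input gives the same result as summing the two branch outputs.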
However, prior works that utilize Rep techniques to train models have some limitations. 1) The branch number and branch types of each block are fixed. Multi-branch training requires memory that increases linearly with the number of branches to save the intermediate feature representations. Therefore, the branch number and branch types of each block are manually fixed because of memory constraints. For instance, the RepVGG block only contains a 1x1 Conv, a 3x3 Conv and a skip connection, while the ACB has a 1x3 Conv and a 3x1 Conv. 2) The role of each branch is unclear, which means some branches may be counterproductive. Ding et al. experimented with various micro Diverse Branch Block (DBB) structures to explore which structure works better. However, on one hand, a manual micro DBB design (all blocks are the same) is always suboptimal; much NAS work [10, 25, 12] has revealed that the optimal structures of blocks differ from each other. On the other hand, it is infeasible to manually design a macro DBB, whose search space approximately reaches $2^{4 \times 30} \approx 10^{36}$ architectures (assuming there are 30 blocks and each block has 4 optional branches).
To address limitation 1), we devise a multi-branch search space, called the Rep search space. Unlike previous works that use Rep techniques, each block in this paper can preserve an arbitrary number of branches, and all block architectures are independent. When memory is limited, the total branch number of the model can be flexibly adjusted. To address limitation 2), a gradient-based NAS method, RepNAS, is proposed to automatically search the optimal branches for each block without parameter retraining. Compared with previous gradient-based NAS methods [19, 24] that only learn the importance within one edge, RepNAS learns the net-wise importance of each branch. Moreover, RepNAS can be used under low GPU memory conditions by setting a branch number constraint. In each training iteration, the branches of low importance are sequentially pruned until the memory constraint is met. The importance of branches is updated in the same round of back-propagation as the gradients of the network parameters. As training progresses, the importance of each branch is estimated more accurately; meanwhile, the redundant branches hardly participate in training. Once the training process is finished, the optimized DBB structure is obtained together with optimized network parameters.
To summarize, our main contributions are as follows:
A Rep search space is proposed in this paper, which allows the searched model to preserve arbitrary branches during training and to be fused into one path for inference. To the best of our knowledge, this is the first time Rep techniques have been applied to NAS.
To fit the new search space, RepNAS is presented to automatically search the optimal DBB in one stage. The searched model can be converted to a single-path model and directly deployed without time-consuming retraining.
Extensive experiments on models of various sizes demonstrate that the searched ODBB models outperform both human-designed DBBs and NAS models under similar inference time.
2 Related Work
2.1 Structural Re-parameterization
Structural re-parameterization techniques have been widely used to improve model performance by injecting several branches at training time and fusing these branches at inference time. RepVGG simply inserted a 1x1 Conv and a residual connection alongside each 3x3 Conv, which substantially improves the performance of VGG-like networks. A concurrent work, DBB, summarized more structural re-parameterization techniques and proposed a Diverse Branch Block which can be inserted into any convolutional network. Here, we list the existing techniques as follows.
A Conv for Conv-BN. A BN layer can be fused into its preceding Conv for inference. The $i$-th fused channel parameters $W'_i$ and $b'_i$ can be formulated as

$$W'_i = \frac{\gamma_i}{\sigma_i} W_i, \qquad b'_i = \beta_i - \frac{\gamma_i \mu_i}{\sigma_i}, \tag{1}$$

where $\gamma_i$ and $\beta_i$ denote the learned scaling factor and bias term of BN, respectively, and $\mu_i$ and $\sigma_i$ denote its accumulated channel-wise mean and standard deviation.
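The Conv-BN folding above can be sketched as follows (a minimal NumPy illustration with our own function name; $W$ uses the (out, in, h, w) layout and `eps` is the usual BN stabilizer):

```python
import numpy as np

def fuse_conv_bn(W, gamma, beta, mu, var, eps=1e-5):
    """Fold a BN layer (scale gamma, shift beta, running mean mu, running var)
    into the preceding Conv kernel W, returning a fused kernel and bias."""
    std = np.sqrt(var + eps)
    W_fused = W * (gamma / std)[:, None, None, None]   # scale each output channel
    b_fused = beta - gamma * mu / std                  # absorb mean shift into a bias
    return W_fused, b_fused
```

Applying the fused Conv plus bias reproduces Conv followed by BN exactly, since both are affine in the Conv output.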
A Conv for branch addition. Convs with different kernel sizes in parallel branches can be fused into one Conv (provided there is no nonlinear activation between them). The kernel of each Conv is zero-padded to the maximum kernel size among the branches, and then the branches can be merged into a single Conv by

$$W' = \sum_{b=1}^{B} W_b, \qquad b' = \sum_{b=1}^{B} b_b, \tag{2}$$

where $B$ denotes the branch number.
A Conv for sequential Convs. A sequence of a 1x1 Conv followed by a KxK Conv can be merged into one single KxK Conv:

$$W' = W^{(2)} * \mathrm{TRANS}\left(W^{(1)}\right), \tag{3}$$

where $\mathrm{TRANS}$ represents transposing the output and input channel dimensions, and $W^{(1)}$ and $W^{(2)}$ denote the 1x1 Conv and the KxK Conv, respectively.
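For the sequential case, the merge reduces to mixing the KxK kernel's input channels with the 1x1 kernel. A minimal NumPy sketch (our own function name; valid on a single KxK patch with no padding or nonlinearity between the two Convs):

```python
import numpy as np

def fuse_1x1_then_kxk(k1, kK):
    """Merge a 1x1 Conv (D, C, 1, 1) followed by a KxK Conv (E, D, K, K)
    into a single KxK Conv (E, C, K, K): the 1x1 Conv is just a linear
    mix of input channels, which can be pushed into the KxK kernel."""
    return np.einsum('edhw,dc->echw', kK, k1[:, :, 0, 0])
```

By linearity, applying the fused kernel equals applying the 1x1 Conv and then the KxK Conv.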
A Conv for average pooling. A KxK average pooling can be viewed as a special KxK Conv. Its kernel parameter is

$$W' = \frac{1}{K^2} I, \tag{4}$$

where $I$ is the identity matrix over channels, i.e., each output channel averages only its own input channel.
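The average-pooling-as-Conv construction can be sketched as (our own function name, per-channel kernel as in Eq. 4):

```python
import numpy as np

def avg_pool_as_conv(channels, K):
    """Build the (C, C, K, K) kernel of a KxK average pooling:
    each output channel averages its own input channel only."""
    W = np.zeros((channels, channels, K, K))
    idx = np.arange(channels)
    W[idx, idx] = 1.0 / (K * K)   # 1/K^2 on the channel-diagonal, 0 elsewhere
    return W
```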
However, the optimal architecture of each block is task-specific. For simple tasks or small-scale datasets, too many branches lead to overfitting. Besides, the output of each branch needs to be saved in GPU memory for the backward pass, and limited GPU memory prevents structural re-parameterization techniques from being widely applied.
2.2 Neural Architecture Search
The purpose of neural architecture search is to automatically find the optimal network structure with reinforcement learning (RL) [28, 29], evolutionary algorithms (EA) [20, 10, 25], or gradient-based [19, 24] methods. RL-based and EA-based methods need to evaluate each sampled network by retraining it on the dataset, which is time-consuming. Gradient-based methods can simultaneously train and search for the optimal subnet by assigning a learnable weight to each candidate operation. However, gradient-based approaches can produce incorrect ranking results: a subnet may have top proxy accuracy while not performing as expected. Moreover, since gradient-based approaches need more memory for training, they are hard to apply to large-scale datasets. To address the above problems, some one-stage NAS methods [24, 12, 9] have been proposed to simultaneously optimize the architecture and the parameters. Once the supernet training is complete, the top-performing subnet is obtained without retraining.
3 Re-parameterization Neural Architecture Search
In this section, we first present an overview of our proposed approach for searching optimal diverse branch blocks and discuss its differences from other NAS work. We then propose a new search space based on the Rep techniques mentioned in Eq. 1-4. Afterward, the RepNAS approach is presented to fit the proposed search space.
3.1 Overview

The goal of Rep techniques is to improve the training of a CNN by inserting various branches with different kernel sizes. The inserted branches can be linearly fused into the original convolutional branch after training, such that no extra computational complexity (FLOPs, latency, memory footprint or model size) is added. However, training with various branches incurs large GPU memory consumption, and it is hard to optimize a network with too many branches. The core idea of the proposed method is to prune out some branches across different blocks in a differentiable way, as shown in Figure 2. It has two essential steps:
(1) Given a CNN architecture (e.g., MobileNets, VGGs), we net-wisely insert several linear operations into the original convolutional operations as their branches. For each branch, a learnable architecture parameter that represents its importance is set. During training, we optimize both architecture parameters and network parameters simultaneously by discretely sampling branches. Once training finishes, we obtain a pruned architecture with optimized network parameters.
(2) At inference time, the remaining branches can be directly fused into the original convolutional operations, such that the multi-branch architecture is converted to a single-path architecture without a performance drop. No extra finetuning is required in this step.
Compared with much NAS work, cumbersome architectures with various branches and skip connections are no longer an obstacle to deployment in RepNAS. In contrast to prior structural re-parameterization work, the block architectures in each layer can be automatically designed without any extra time consumption. The whole optimization is one-stage.
3.2 Rep Search Space Design
NAS methods are usually designed to search for the optimal subnetwork on the DARTS space or a MobileNet-like space. The former contains multi-branch architectures, which makes the searched network difficult to deploy because of its large inference time. The latter, which draws on expert experience in designing mobile networks, includes efficient networks but takes no account of multi-branch architectures. Many human-designed networks [22, 13] demonstrate that multi-branch architectures can improve model performance by enhancing the feature representation at various scales. To combine the advantages of multi-branch and single-path architectures, a more flexible search space is proposed based on current structural re-parameterization work. As shown in Figure 3, each block contains 7 branches (3x3 Conv, 1x1 Conv, 1x1-3x3 Conv, 1x1-AVG, 1x3 Conv, 3x1 Conv, and a skip connection). Different from previous search spaces, we release the heuristic constraints to offer higher flexibility to the final architecture, which means each block can preserve arbitrary branches and all block architectures are independent. It is worth noting that the multiple branches are fused into a single path after searching and thus have no impact on inference.
The new search space contains approximately $2^{7L}$ architectures for a supernet with $L$ blocks, since each of the 7 branches in every block can be independently kept or pruned. Compared with either a micro search space ($2^7$ options when all blocks share one architecture) or the macro search spaces of previous NAS work, the proposed search space has greater potential to offer better architectures and to evaluate the effectiveness of NAS algorithms. However, such an enlarged search space brings challenges for preceding NAS approaches: 1) incorrect architecture ratings; 2) large memory cost because of the multiple branches.
3.3 Weight Sharing for Network Parameters
Many previous NAS methods share weights across architectures during supernet training while decoupling the weights of different branches in the same block. However, this strategy does not work in the proposed Rep search space because of the surging number of subnets. In each training iteration, only a few branches can be sampled, so the weights are updated a limited number of times. Therefore, the performance of sampled subnets grows slowly, which limits the learning of the architecture parameters $\alpha$.
Inspired by BigNAS and slimmable networks, we also share network parameters within the same block. For any branch that needs a convolutional operation, the weights of this branch can be inherited from the convolutional operation in the main branch (see the right part of Figure 3). We represent its weights as

$$W_b = W\left[:, :, \left\lfloor \tfrac{K-k}{2} \right\rfloor : \left\lfloor \tfrac{K-k}{2} \right\rfloor + k, \ \left\lfloor \tfrac{K-k}{2} \right\rfloor : \left\lfloor \tfrac{K-k}{2} \right\rfloor + k\right], \tag{5}$$

where $\lfloor \cdot \rfloor$ is the floor operation, and $k$ and $K$ denote the kernel sizes of the inherited branch and the main branch, respectively. Equipped with weight sharing, the supernet converges faster; meanwhile, the ranking of branches can be evaluated precisely. We discuss the details in the Ablation Study.
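Our reading of this centre-crop sharing rule can be sketched as follows (a hypothetical helper of our own; it slices the k x k centre of the K x K main kernel so the branch and main Conv share storage):

```python
import numpy as np

def shared_branch_weight(W_main, k):
    """Return the k x k branch kernel as a centre crop (view) of the
    (O, I, K, K) main kernel, so both branches share parameters."""
    K = W_main.shape[-1]
    s = (K - k) // 2
    return W_main[:, :, s:s + k, s:s + k]
```

Because NumPy slicing returns a view, updating the main kernel also updates every inherited branch, which is the point of the sharing scheme.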
3.4 Searching for Rep Blocks
To overcome incorrect architecture ratings, an elegant solution is one-stage differentiable NAS, which simultaneously optimizes the architecture parameters $\alpha$ and the network parameters $W$ in a differentiable way. This can be formulated as the following optimization:

$$\min_{\alpha, W} \; \mathbb{E}_{A \sim p_\alpha} \left[ \mathcal{L}_{\mathrm{train}}(W, A) \right], \tag{6}$$

where $\mathcal{L}_{\mathrm{train}}$ denotes the loss function computed on a specified training dataset and $A$ is an architecture sampled from the distribution $p_\alpha$.
Using softmax, the sampling probability of the $i$-th branch in the $j$-th block (with $B$ branches) can be written as

$$p_{ij} = \frac{\exp(\alpha_{ij})}{\sum_{k=1}^{B} \exp(\alpha_{kj})}. \tag{7}$$
Though Eq. (6) can be optimized with gradient descent as in most neural network training, it would suffer from the huge performance gap between the supernet and its child networks. Instead, a continuous and differentiable reparameterization trick, Gumbel-Softmax, is used in NAS approaches [9, 24]. With Gumbel random variables, Eq. (7) can be rewritten as

$$p_{ij} = \frac{\exp\left((\alpha_{ij} + g_{ij})/\tau\right)}{\sum_{k=1}^{B} \exp\left((\alpha_{kj} + g_{kj})/\tau\right)}, \tag{8}$$

where $g_{ij} = -\log(-\log(u_{ij}))$ and $u_{ij} \sim U(0, 1)$ is a uniform random variable.
$p_{ij}$ can be approximated as a one-hot vector as the temperature $\tau \to 0$. This relaxation realizes the discretization of the probability distribution. In the multi-branch search space proposed in Rep Search Space Design, each block can preserve an arbitrary number of branches, so we cannot directly obtain the probability as in Eq. (7) or Eq. (8). To fit the new space, we instead regard whether each branch is retained as a binary classification. Hence, the discretization probability of the $i$-th branch in the $j$-th block can be given as

$$p(X_{ij} = 1) = \frac{\exp\left((\alpha_{ij} + g^{1}_{ij})/\tau_{ij}\right)}{\exp\left((\alpha_{ij} + g^{1}_{ij})/\tau_{ij}\right) + \exp\left(g^{0}_{ij}/\tau_{ij}\right)}, \tag{9}$$

$$p(X_{ij} = 0) = 1 - p(X_{ij} = 1), \tag{10}$$
where $X_{ij} = 1$ and $X_{ij} = 0$ represent preserving and pruning this branch, respectively, and $g^{1}_{ij}$, $g^{0}_{ij}$ are Gumbel variables. Different from Eq. (8), each branch in the new space is independent; thus, the temperature $\tau_{ij}$ can be set per branch according to requirements. Furthermore, we can combine Eq. (9) and Eq. (10) into a sigmoid form

$$X_{ij} = \sigma\left(\frac{\alpha_{ij} + g_{ij}}{\tau_{ij}}\right), \tag{11}$$
where $g_{ij} = g^{1}_{ij} - g^{0}_{ij} = \log\left(\frac{u_{ij}}{1 - u_{ij}}\right)$ follows a logistic distribution and $\sigma(\cdot)$ is the sigmoid function. We only need to optimize $\alpha_{ij}$, instead of $g^{1}_{ij}$ and $g^{0}_{ij}$, through gradient descent. Its gradient can be given as

$$\frac{\partial \mathcal{L}}{\partial \alpha_{ij}} = \left(\frac{\partial \mathcal{L}}{\partial O_j}\right)^{\top} O_{ij} \cdot \frac{X_{ij}\left(1 - X_{ij}\right)}{\tau_{ij}}, \tag{12}$$
where $O_j$ and $O_{ij}$ denote the output of the $j$-th block and of its $i$-th branch, respectively. In each training iteration, we first compute an importance threshold according to the global ranking of all branches. Subsequently, the branches ranked outside the top-$k$ are pruned out and do not participate in the current forward or backward pass. Thanks to the independence of each branch, we can easily control the activation of each branch through its temperature $\tau_{ij}$:

$$\tau_{ij} = \begin{cases} 0^{+}, & \operatorname{Rank}\left(\alpha_{ij} + g_{ij}\right) \le k \\ 0^{-}, & \text{otherwise}, \end{cases} \tag{13}$$

where $g_{ij}$ is the logistic random variable defined above and $\operatorname{Rank}(\cdot)$ denotes the global ranking of the branch; the gate $X_{ij}$ of a retained branch then saturates toward 1, while that of a pruned branch saturates toward 0.
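The gate sampling of Eq. (11) can be sketched as follows (a minimal NumPy version with our own function name; logistic noise is the difference of two Gumbel variables):

```python
import numpy as np

def gumbel_sigmoid_gate(alpha, tau, rng):
    """Sample the soft gate X = sigmoid((alpha + g) / tau) of Eq. (11),
    where g is logistic-distributed noise."""
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(alpha))
    g = np.log(u) - np.log1p(-u)            # logistic noise g = log(u / (1 - u))
    return 1.0 / (1.0 + np.exp(-(np.asarray(alpha) + g) / tau))
```

With a small temperature the gate concentrates near 0 or 1, realizing the discrete keep/prune decision while remaining differentiable in `alpha`.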
| Dataset | Arch. | Epochs | Batch size | Init LR | Weight decay | Data augmentation |
| --- | --- | --- | --- | --- | --- | --- |
| CIFAR-10 | VGG-16 | 600 | 128 | 0.1 | | same as DBB |
| ImageNet | ODBB-A0/A1/A2/B1/B2/ResNet-18 | 150 | 256 | 0.1 | | same as RepVGG |
| ImageNet | ODBB-B3/ResNet-101 | 240 | 256 | 0.1 | | same as RepVGG |
In implementation, we sort the importance of all branches by Eq. (11) and only keep the top-$k$ branches in the forward pass. The importance of an unpruned branch converges to 1 by Eq. (13). Afterward, we simply multiply $X_{ij}$ with the output of each unpruned branch, such that $\partial \mathcal{L} / \partial \alpha_{ij}$ can be obtained by the chain rule. The whole algorithm is shown in Alg. 1.
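The global top-$k$ selection step can be sketched as follows (our own minimal version; importance scores over all branches in the network are flattened into one array, and ties at the threshold keep every equal-valued branch):

```python
import numpy as np

def topk_branch_mask(alpha, k):
    """Keep the k branches with the largest importance scores across the whole
    network, pruning the rest (the global-ranking step of the algorithm)."""
    alpha = np.asarray(alpha, dtype=float)
    threshold = np.sort(alpha)[-k]   # importance of the k-th largest branch
    return alpha >= threshold        # boolean keep/prune mask
```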
| Model | Parameters (M) | FLOPs (B) | Inference (s) | Search Space | Search+Retrain Cost (GPU Days) | Top-1 (%) |
| --- | --- | --- | --- | --- | --- | --- |
4 Experiments

We first compare ODBB with the baseline, random search, DBB and ACB on CIFAR-10 for a quick sanity check. To further demonstrate the effectiveness of ODBB across model sizes, experiments on a large-scale dataset, ImageNet-1k, are conducted. Finally, the impact of weight sharing and the effect of the branch number constraint on model performance are given in the ablation study.
4.1 Quick Sanity Check on CIFAR-10
We use VGG-16 as the benchmark architecture. The convolutional operations in the benchmark architecture are replaced by ODBB, DBB and ACB, respectively. For a fair comparison, the data augmentation techniques and training hyper-parameter settings follow DBB and ACB, as given in Table 1. To optimize the architecture parameters $\alpha$ simultaneously, we use the Adam optimizer. One indicator of the effectiveness of a NAS approach is whether the search result exceeds that of random search. To produce the random search result, 25 architectures are randomly sampled from the supernet; we train each of them for 100 epochs and pick the best one for an entire 600-epoch re-training. The comparison results are shown in Table 3. ODBB improves VGG-16 on CIFAR-10 by 0.85% and surpasses DBB and ACB. The architecture generated by random search can also slightly improve the performance of the benchmark model but falls behind ODBB by 0.77%, which demonstrates the effectiveness of our proposed NAS algorithm.
4.2 Performance Improvements on ImageNet
To reveal the generalization ability of our method, we then search for a series of models on ImageNet-1k, which comprises 1.28M training images and 50K validation images from 1000 classes. The RepVGG-series are used as benchmark architectures, and we replace each original RepVGG block with the block designed in Rep Search Space Design. For a fair comparison, the total branch number of every ODBB model is limited to equal that of the corresponding RepVGG. The training strategies of each network are listed in Table 1. We again use the Adam optimizer to optimize $\alpha$. We discuss the impact of the branch number on performance in the Ablation Study.
Results are summarized in Table 4. RepVGG is stacked with several multi-branch blocks, each containing a 3x3 Conv, a 1x1 Conv and a skip connection, and it can also be fused into a single path for inference. We search for more powerful block architectures for RepVGG. For RepVGG-A0 to A2, which have 22 layers, ODBB achieves 0.55%, 0.38% and 0.24% higher accuracy than the original baselines, respectively. To the best of our knowledge, RepVGG-B3 with ODBB refreshes the record for plain models, from 80.52% to 80.97% top-1 accuracy. Noting that the ACB, DBB and RepVGG blocks are special cases of our proposed search space, our NAS can indeed search for optimal architectures beyond human-defined ones.
4.3 Comparison with Other NAS
We compare ODBB-series models with other models searched from DARTS and mobile search space on ImageNet-1K. We verify the real inference time on two different hardware platforms, including GPU(NVIDIA Tesla V100) and embedded device(NVIDIA Xavier). Table 2 presents the results and shows that: 1) ODDB-series models searched from Rep search space consistently outperform other networks searched from other search space with lower inference time since ODDB-series networks are only stacked by 3x3 convolutions and have no branches. 2) RepNAS can directly search on ImageNet-1k and combine the train and search, which brings high performance and low GPU days cost. We show the architectures of searched ODBB-series in the Appendix.
4.4 Object Detection on MS COCO

To verify the transferability of the searched models to other computer vision tasks, object detection experiments are conducted on the MS COCO dataset with the CenterNet framework. The detailed implementation follows CenterNet: ImageNet-1k pretrained backbones with three up-convolutional networks are finetuned on COCO, and we use Adam to optimize the overall parameters. Specifically, the baseline backbones of CenterNet are the ResNet-series, and we replace them with our searched ODBB-series. All backbone networks are pretrained on ImageNet-1k with the training schedule in Table 1. After finetuning on COCO, the ODBB-series are fused into one path for fast inference. The results are shown in Table 5. In both light and heavy models, ODBB-A0 and ODBB-B3 surpass ResNet-18 and ResNet-101 by 3.3% and 2.7% AP with comparable inference speed, respectively, which demonstrates that our searched models perform strongly on other computer vision tasks.
4.5 Ablation Study
Training Under a Branch Number Constraint. Training multi-branch models requires linearly increasing GPU memory to save the intermediate feature maps for the forward and backward passes. Searching and training multi-branch models in a memory-friendly way is significant for the application of structural re-parameterization techniques. For simplicity, we reduce the branch number of the network to meet various memory constraints by pushing the temperature of low-ranked branches to $0^{-}$. To prove the efficiency and effectiveness over other human-designed blocks, we also train DBB and ACB on the benchmark RepVGG-A0 by replacing the original RepVGG block, respectively. All experiments in this subsection are conducted on RepVGG-A0 with the same training settings.
The results shown in Figure 4 reveal that ODBB can surpass other blocks with fewer branches. For instance, an ODBB containing only 50 branches achieves higher top-1 accuracy than the original RepVGG block (66 branches), DBB (88 branches) and ACB (66 branches). Besides, we found that an ODBB with 75 branches is enough to obtain the same results as an ODBB with more branches.
Efficacy of Weight Sharing. We train and search ODBB (A0) with the following two methods: 1) the weights of different branches in the same block are shared; 2) the weights of different branches are updated independently. Figure 5 presents the comparison on ImageNet-1k with ODBB (A0). It is apparent that the ODBB (A0) updated with weight sharing converges faster than the one updated independently. Besides, we sample the highest-performing subnet every five training epochs according to the ranking of the architecture parameters $\alpha$; each sampled model is trained from scratch with the scheme in Table 1. As shown in Figure 5, the subnets sampled from the weight-sharing supernet reach more stable and higher accuracy in each period. This phenomenon illustrates that the weight-sharing supernet serves as a good indicator of ranking.
5 Conclusion

The searched networks have the same channels and depth as the RepVGG networks, which contain a huge number of parameters. This large parameter count hinders the deployment of our searched models on CPU devices. However, VGG-like models can be easily pruned by many existing fast channel pruning methods [5, 2, 11] for parameter reduction.
In this work, we first propose a new search space, the Rep space, where each block's architecture is independent and can preserve arbitrary branches. Any subnet searched from the Rep space has fast inference speed, since it can be converted to a single-path model for inference. To efficiently train the supernet, block-wise weight sharing is used. To fit the new search space, a new one-stage NAS method is presented, with which the optimal diverse branch blocks can be obtained without retraining. Extensive experiments demonstrate that the proposed RepNAS can search multi-branch networks of various sizes, named the ODBB-series, which strongly outperform previous NAS networks under similar inference time. Moreover, compared with other networks utilizing Rep techniques, the ODBB-series also achieve state-of-the-art top-1 accuracy on ImageNet at various model sizes. In future work, we will consider adding the channel number of each block to the search space, which could further boost the performance of RepNAS.
References

- Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1294–1303.
- Towards efficient model compression via learned global ranking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1518–1528.
- (2019) FairNAS: rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845.
- (2019) ACNet: strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1911–1920.
- (2020) Lossless CNN channel pruning via decoupling remembering and forgetting. arXiv preprint arXiv:2007.03260.
- (2021) Diverse branch block: building a convolution as an inception-like unit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10886–10895.
- (2021) RepMLP: re-parameterizing convolutions into fully-connected layers for image recognition. arXiv preprint arXiv:2105.01883.
- (2021) RepVGG: making VGG-style ConvNets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13733–13742.
- (2019) Searching for a robust neural architecture in four GPU hours. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1761–1770.
- (2020) Single path one-shot neural architecture search with uniform sampling. In European Conference on Computer Vision, pp. 544–560.
- (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
- (2020) DSNAS: direct neural architecture search without parameter retraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12084–12092.
- (2014) DenseNet: implementing efficient convnet descriptor pyramids. arXiv preprint arXiv:1404.1869.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
- (2016) Categorical reparameterization with Gumbel-Softmax.
- (2015) Adam: a method for stochastic optimization. In ICLR (Poster).
- ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105.
- (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- (2018) DARTS: differentiable architecture search. In International Conference on Learning Representations.
- Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2017) Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
- (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114.
- (2018) SNAS: stochastic neural architecture search. In International Conference on Learning Representations.
- (2020) BigNAS: scaling up neural architecture search with big single-stage models. In European Conference on Computer Vision, pp. 702–717.
- (2018) Slimmable neural networks. In International Conference on Learning Representations.
- (2019) Objects as points. arXiv preprint arXiv:1904.07850.
- (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
- (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.
Appendix: Model Architecture
| Name | Layers of each stage | Channels of each stage |
| --- | --- | --- |
| A0 | 1, 2, 4, 14, 1 | 64, 48, 96, 192, 1280 |
| A1 | 1, 2, 4, 14, 1 | 64, 64, 128, 256, 1280 |
| A2 | 1, 2, 4, 14, 1 | 64, 96, 192, 384, 1408 |
| B1 | 1, 4, 6, 16, 1 | 64, 128, 256, 512, 2048 |
| B2 | 1, 4, 6, 16, 1 | 64, 160, 320, 640, 2560 |
| B3 | 1, 4, 6, 16, 1 | 64, 192, 384, 768, 2560 |