
Tiered Pruning for Efficient Differentiable Inference-Aware Neural Architecture Search

by   Slawomir Kierat, et al.

We propose three novel pruning techniques to improve the cost and results of inference-aware Differentiable Neural Architecture Search (DNAS). First, we introduce Prunode, a stochastic bi-path building block for DNAS, which can search over inner hidden dimensions with $O(1)$ memory and compute complexity. Second, we present an algorithm for pruning blocks within a stochastic layer of the SuperNet during the search. Third, we describe a novel technique for pruning unnecessary stochastic layers during the search. The optimized models resulting from the search are called PruNet and establish a new state-of-the-art Pareto frontier for NVIDIA V100 in terms of inference latency for ImageNet Top-1 image classification accuracy. PruNet as a backbone also outperforms GPUNet and EfficientNet on the COCO object detection task on inference latency relative to mean Average Precision (mAP).



1 Introduction


Figure 1: PruNet establishes a new state-of-the-art Pareto frontier in terms of inference latency for ImageNet Top-1 image classification accuracy.

Neural Architecture Search (NAS) is a well-established technique in Deep Learning (DL). Conceptually, it comprises a search space of permissible neural architectures, a search strategy to sample architectures from this space, and an evaluation method to assess the performance of the selected architectures. For practical reasons, Inference-Aware Neural Architecture Search is the cornerstone of the modern Deep Learning application deployment process.

wang2022gpunet, wu2019fbnet, yang2018netadapt use NAS to directly optimize inference-specific metrics (e.g., latency) on targeted devices instead of limiting the model's FLOPs or other proxies. Inference-Aware NAS streamlines the development-to-deployment process. New concepts in NAS succeed with the ever-growing search space, increasing the dimensionality and complexity of the problem. Advances in data collection and robust training methods also prolong the end-to-end training and evaluation time. Balancing the search cost and the quality of the search is hence essential for employing NAS in practice.

Traditional NAS methods require evaluating many candidate networks to find optimized ones with respect to the desired metric. This approach can be successfully applied to simple problems like CIFAR-10 krizhevsky2010cifar, but for more demanding problems, these methods may turn out to be computationally prohibitive. To minimize this computational cost, recent research has focused on partial training falkner2018bohb, li2020system, luo2018neural, performing network morphism cai2018path, jin2019auto, molchanov2021hant instead of training from scratch, or training many candidates at the same time by sharing the weights pham2018efficient. All of these approaches can save computational time, but their reliability is questionable bender2018understanding, xiang2021zero, yu2021analysis, liang2019darts+, chen2019progressive, zela2019understanding, i.e., the final result can still be improved. We aim to revise the weight-sharing approach both in terms of saving resources and improving the reliability of the method by introducing novel pruning techniques. This method is general and can be applied to any SuperNet approach. In our experiments we focus on a search space based on a SOTA network to showcase the value of our methodology.

Our solution is to optimize weight-sharing approaches based on SuperNets with architecture weights. The SuperNet approach allows searching over $\prod_i k_i$ candidates with $O(\sum_i k_i)$ computation cost, where $k_i$ is the number of candidates in layer $i$. Similarly, GPU memory consumption is of $O(\sum_i k_i)$ complexity. For large networks and many candidates, this may lead to a situation where the SuperNet cannot fit into GPU memory, or one would need to train less efficiently because of a lowered batch size. To save computation time and GPU memory consumption, as well as to improve the quality of the final architectures, we introduce three kinds of pruning techniques described below.
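As a rough illustration of this scaling, the snippet below counts the architectures covered by a layer-wise SuperNet and the number of blocks evaluated per forward pass. The layer and candidate counts are made-up numbers chosen only for the arithmetic; they are not taken from the experiments.

```python
import math

# Hypothetical SuperNet: 22 stochastic layers with 9 candidate blocks each.
candidates_per_layer = [9] * 22

# Every combination of one block per layer is a distinct architecture.
num_architectures = math.prod(candidates_per_layer)

# One SuperNet forward pass evaluates each candidate block once, so compute
# and activation memory grow with the sum of the counts, not their product.
blocks_per_forward_pass = sum(candidates_per_layer)

print(f"{num_architectures:.3e} architectures, "
      f"{blocks_per_forward_pass} blocks per forward pass")
```

The gap between the product and the sum is what makes weight sharing attractive, and the sum is also what eventually exhausts GPU memory as candidates are added.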

Prunode: pruning internal structure of the block

In the classical SuperNet approach, the search space is defined by the initial SuperNet architecture. That means GPU memory capacity significantly limits the search space size. In many practical use cases, one would limit oneself to just a few candidates per block. For example, FBNet wu2019fbnet defined 9 candidates: a skip connection and 8 Inverted Residual Blocks (IRB) with expansion in {1, 3, 6}, kernel size in {3, 5}, and group in {1, 2}. In particular, one can see that the expansion parameter was limited to only three options: 1, 3, and 6, while more options could be considered: not only larger values but also denser sampling using non-integer values. Each additional parameter increases both memory and compute cost, while only promising ones can improve the search. Selecting the right parameters for a search space requires domain knowledge. To solve this problem, we introduce a special multi-block called Prunode, which optimizes the value of parameters, such as the expansion parameter in the IRB block. The computation and memory cost of Prunode is equal to the cost of calculating two candidates. Essentially, in each iteration the Prunode emulates just two candidates, each with a different number of channels in the internal structure. These candidates are modified based on the current architecture weights. The procedure is designed in such a way that it encourages convergence towards an optimal number of channels.

Pruning blocks within a stochastic layer

In the classical SuperNet approach, all candidates are trained together throughout the search procedure, but ultimately, one or a few candidates are sampled as a result of the search. Hence, large amounts of resources are devoted to training blocks that are ultimately unused. Moreover, since they are trained together, results can be biased due to co-adaptation among operations bender2018understanding. We introduce progressive SuperNet pruning based on trained architecture weights to address this problem. Using this methodology, blocks are removed from the search space when the likelihood of the block being sampled is below a linearly changing threshold. Reducing the size of the search space saves unnecessary computation cost and reduces the co-adaptation among operations.

Pruning unnecessary stochastic layers

By default, layer-wise SuperNet approaches force all networks that can be sampled from the search space to have the same number of layers, which is very limiting. That is why it is common to use a skip connection as an alternative to residual blocks in order to mimic shallower networks. Unfortunately, the output of skip connection blocks provides biased information when it is averaged with the outputs of other blocks. Because of this, the SuperNet may tend to sample networks that are shallower than optimal. To solve this problem, we provide a novel method for skipping whole layers in a SuperNet. It introduces the skip connection to the SuperNet during the search procedure, so the skip connection is not present in the search space at the beginning of the search. Once the skip connection is added to the search space, the outputs of the remaining blocks are multiplied by coefficients.

2 Related Works

In NAS literature there is a widely known SuperNet approach liu2018darts, wu2019fbnet, which constructs a stochastic network. At the end of the architecture search the final non-stochastic network is sampled from the SuperNet using differentiable architecture parameters. The PruNet algorithm utilizes this scheme – it is based on the weight-sharing design cai2019once, wan2020fbnetv2 and it relies on the Gumbel-Softmax distribution jang2016categorical.

The PruNet algorithm is agnostic to the choice between one-shot NAS liu2018darts, pham2018efficient, wu2019fbnet, where only one SuperNet is trained, and few-shot NAS zhao2021few, where multiple SuperNets are trained to improve the accuracy. In this work we evaluate our method on a search space based on the state-of-the-art GPUNet model wang2022gpunet. We follow its structure, including the number of channels, the number of layers, and the basic block types.

There are other methods that incorporate NAS dai2019chamnet, dong2018dpp, tan2019mnasnet, but they remain computationally expensive. Differentiable NAS with different gradient estimators and loss formulation based on latency cai2018proxylessnas, vahdat2020unas, wu2019fbnet significantly reduces the training cost. MobileNets howard2017mobilenets, sandler2018mobilenetv2 started to discuss the importance of model size and latency on embedded systems, while latency-aware pruning cai2018proxylessnas, howard2017mobilenets, sandler2018mobilenetv2, wang2022gpunet remains a challenging task as it is usually focused on compressing fixed architectures chen2018constraint, shen2021halp, yang2018netadapt. molchanov2021hant further combines layer-wise knowledge distillation (KD) with NAS to speed up the validation.

DNAS and weight-sharing approaches have two major opposing problems which we address in this article: memory costs that bound the search space, and the problem of co-adaptation among operations in large search spaces. We also compare our methodology to other similar SuperNet pruning techniques.

Memory cost

The best-known work on the memory cost problem is FBNetV2 wan2020fbnetv2. The authors use different kinds of masking to mimic many candidates with $O(1)$ memory and compute complexity. Unfortunately, masking the outputs means that some feature maps are not informative, which, when averaged with informative feature maps, provides biased information (see last paragraph in Section 1). Therefore, it may lead to suboptimal sampling, such as feature maps that are smaller than the optimal ones. Through this approach, FBNetV2 is able to extend the search space by up to $10^{14}$ times. Unfortunately, such a large expansion may increase the impact of the co-adaptation among operations problem. Moreover, it can also make block benchmarking too expensive.

Co-adaptation among operations

The weight-sharing approach forces all the operations to be used in multiple contexts, losing the ability to evaluate them individually. This problem, named co-adaptation among operations, posterior fading, or curse of skip connect was noticed by many researchers bender2018understanding, li2020improving, zhao2021few, ding2022nap. Progressive search space shrinking li2020improving, liu2018progressive in different forms was applied to speed up and improve Neural Architecture Search.

SuperNet pruning

Ideas similar to our "Pruning blocks within a stochastic layer" (see Section 3.2.2) have already been considered in the literature. xia2022progressive removes blocks from the search space which are rarely chosen. In ci2021evolving the authors efficiently search over a large search space with multiple SuperNet trainings. In contrast to both methods, our approach requires a single training of a SuperNet, which makes the procedure much faster. ding2022nap uses DARTS liu2018darts to prune operations in a cell. It assesses the importance of each operation, and all operations but the two strongest are removed at the end of the training. In contrast, our methodology removes blocks from the SuperNet during the training, which also reduces the training time.

3 Methodology

Some of the previous NAS works bender2018understanding, liu2018darts, mei2019atomnas, xie2018snas focus on cell structure search, which aims to find a cell (building block) that is then repeated to build the whole network. However, as stated in lin2020neural, different building blocks might be optimal in different parts of a network. That is why we focus on layer-wise search, i.e., each layer can use a different building block.

The SuperNet is assumed to have a layered structure. Each layer is built from high-level blocks, e.g., IRB in computer vision or self-attention blocks in natural language processing. Only one block in each layer is selected in the final architecture. Residual layers can be replaced by skip connections during the search, effectively making the network shallower.

3.1 Search Space

Since layer-wise SuperNets must have a predetermined number of output channels for each layer, selecting the right numbers is critical. Thus, it is worthwhile to start from a very good baseline model and define the search space such that models similar to the baseline can be sampled from the SuperNet. With a search space defined this way, we can expect to find models uniformly better than the baseline or models with different properties, e.g., optimized accuracy for a target latency or optimized latency for a target accuracy. In our main experiments, we chose GPUNet-1 wang2022gpunet as the baseline network. Table 1 presents the structure of our SuperNet defining the search space. We sample all possible expansions with channel granularity set to 32 (i.e., the number of channels is forced to be a multiple of 32), and skip connections are considered for all residual layers. As a result, our search space covers many times more candidates than a similar FBNet SuperNet that would consume the same amount of GPU memory.

Stage  Type              Stride  Kernel  # Layers  Activation  Expansion  Filters  SE
0      Conv              2       {3,5}   1         Swish                  24
1      Conv              1       {3,5}   2         RELU                   24
2      Fused-IRB         2       {3,5}   3         Swish                  64       {0,1}
3      Fused-IRB         2       {3,5}   3         Swish                  96       {0,1}
4      IRB               2       {3,5}   2         Swish                  160      {0,1}
5      IRB               1       {3,5}   5         RELU                   288      {0,1}
6      IRB               2       {3,5}   6         RELU                   448      {0,1}
7      Conv + Pool + FC  1       1       1         RELU                   1280
Table 1: PruNet search space inspired by GPUNet-1

3.2 Search Method

Our search algorithm is similar to wu2019fbnet. We focus on a multi-objective optimization over the weights of the network and the architecture weights. The goal is to minimize a latency-aware loss function that combines the standard cross entropy loss with the latency of the network. One coefficient defines the trade-off between accuracy and latency: a higher value results in finding networks with lower latency, and a lower value results in finding networks with higher accuracy. A second coefficient scales the magnitude of the latency. The loss function is inference-aware, meaning we optimize the latency of the networks for specific hardware.
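The effect of the trade-off coefficient can be illustrated with a toy loss. The additive form `CE + alpha * LAT ** beta`, the names `alpha` and `beta`, and all numbers below are assumptions made for this sketch; they are chosen only because they reproduce the behaviour described above (a higher trade-off coefficient favours lower latency) and may differ from the loss actually used in the search.

```python
def latency_aware_loss(cross_entropy: float, latency_ms: float,
                       alpha: float, beta: float = 1.0) -> float:
    """Assumed additive form: CE + alpha * LAT ** beta."""
    return cross_entropy + alpha * latency_ms ** beta


# Two hypothetical candidates as (cross entropy, latency in ms).
accurate_but_slow = (1.00, 1.2)
fast_but_less_accurate = (1.10, 0.6)


def preferred(alpha: float) -> str:
    """Name of the candidate with the lower loss for a given alpha."""
    slow = latency_aware_loss(*accurate_but_slow, alpha=alpha)
    fast = latency_aware_loss(*fast_but_less_accurate, alpha=alpha)
    return "accurate_but_slow" if slow < fast else "fast_but_less_accurate"


for alpha in (0.05, 0.5):
    print(alpha, preferred(alpha))
```

Sweeping the coefficient in this way is what lets a single search procedure trace out different points of the accuracy/latency Pareto frontier.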

We train the SuperNet using continuous relaxation. The output of each layer is equal to the sum of the outputs of its blocks, each applied to the output of the previous layer and weighted by a coefficient. Every block in a layer is assumed to have the same input and output tensor shapes. The coefficients come from the Gumbel-Softmax distribution jang2016categorical: each is computed from the architecture weight of the block plus noise sampled from the Gumbel(0, 1) distribution, with a parameter that controls the temperature of the Gumbel-Softmax function. In contrast to wu2019fbnet, we keep the temperature constant throughout the training. Architecture weights are differentiable and trained using gradient descent alongside the regular weights.

The total latency is the sum of the latencies of all layers in the SuperNet. Similarly, the latency of a layer is the sum of the latencies of its blocks weighted by the Gumbel-Softmax coefficients.


Our method allows choosing the target device for inference. For the chosen hardware, we pre-compute the latency of every permissible block and store it in a lookup table, the same way as in wu2019fbnet. The search then does not need to measure any further latencies and can be run on arbitrary hardware (e.g., for embedded target hardware used in autonomous vehicles, one could still train the network on a supercomputer).
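The relaxation and the latency estimate described above can be sketched in a few lines of plain Python. The Gumbel-Softmax formula is the standard one from jang2016categorical; the block names, the latency values in the lookup table, and the helper names are hypothetical and serve only as an illustration.

```python
import math
import random


def gumbel_softmax(arch_weights, tau, rng):
    """One sample of softmax((a_i + g_i) / tau) with g_i ~ Gumbel(0, 1)."""
    noisy = []
    for a in arch_weights:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        g = -math.log(-math.log(u))                      # Gumbel(0, 1) noise
        noisy.append((a + g) / tau)
    peak = max(noisy)                                    # for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def layer_latency(coeffs, block_latencies):
    """Latency of a stochastic layer: coefficient-weighted sum over its blocks."""
    return sum(m * lat for m, lat in zip(coeffs, block_latencies))


# Hypothetical lookup table (ms), measured once on the target device.
LATENCY_TABLE = {
    "layer0": {"irb_k3": 0.020, "irb_k5": 0.031, "irb_k3_se": 0.027},
    "layer1": {"irb_k3": 0.015, "irb_k5": 0.024},
}


def supernet_latency(arch_weights, tau, rng):
    """Total latency: sum of the relaxed latencies of all layers."""
    total = 0.0
    for layer, table in LATENCY_TABLE.items():
        coeffs = gumbel_softmax(arch_weights[layer], tau, rng)
        total += layer_latency(coeffs, list(table.values()))
    return total


rng = random.Random(0)
arch_weights = {"layer0": [0.0, 0.0, 0.0], "layer1": [0.0, 0.0]}
print(supernet_latency(arch_weights, tau=1.0, rng=rng))
```

Because the estimate only reads the table, the search itself never has to run on the target device. In a real search the same coefficients would also weight the block outputs, and the computation would be done with a framework that provides gradients with respect to the architecture weights.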

The training is conducted on a smaller proxy dataset to make the SuperNet training faster. We empirically confirmed that the performance of the found architecture on the proxy problem correlates with the performance on the original problem. Once the final architecture is obtained from the SuperNet, it is evaluated and re-trained from scratch on the full dataset.

3.2.1 Pruning internal structure of a block




Figure 2: (a) SuperNet trimming procedure: During the warmup epochs, train only regular weights. After the warmup phase, start training architecture weights and increase the threshold linearly. Remove blocks whose architecture weights are below the threshold. After removing the penultimate block of the layer, put a skip connection block in its place. Multiply the output of this skip connection block and the output of the other block by their respective coefficients, as stated in Section 3.2.3. At the end of the training, every layer contains exactly one block, thus the result of the training is a non-stochastic network.
(b) Procedure for channel pruning in case of an IRB: The shape of the inner tensor in the IRB is determined by the input number of channels of the block, the maximal expansion ratio, and the width and the height of the feature map. The small candidate and the large candidate each use a subset of these channels, with the small candidate using fewer channels than the large one. The optimal candidate that we try to find (marked with the blue dashed line) is assumed to lie between the two. Both candidates mask out unused channels. The weights are shared between the candidates. The number of channels used by both candidates dynamically changes throughout the training.
Algorithm 1 Prunode masking
  // Initialize architecture weight used by Gumbel-Softmax with 0
  // Small mask initially masks out half of the channels
  // Large mask initially contains all of the channels
  Procedure update_masks (progress) // Called after each training iteration
    if  then // Update masks until consecutive choices are reached
      // Momentum speeds up convergence
      // Calculate distance between masks; progress
      // Update small mask
      if  then
        // Reset architecture weight if not a corner case
        // Prevent premature convergence
        // Make small mask non-negative
      end if
      // Update large mask
      // Ensure divisibility by granularity
      // Ensure small_mask is in bounds
      // Ensure divisibility by granularity
      // Ensure large_mask is in bounds
    end if
Constants , max_distance, granularity, and momentum were set to , , , and respectively.

We use a particular procedure for finding final values of discrete inner hidden dimensions of a block. It is required that the impact of discrete parameters on the objective function (in our case, a function of latency and accuracy) is predictable, e.g., small parameter values mean a negative impact on accuracy but a positive impact on latency, and large parameter values mean a positive impact on accuracy but negative impact on latency. We also require it to be regular – the impacts described above are monotonic with regard to the parameter value. A good example of such a parameter is the expansion ratio in the IRB.

Prunode consists of two copies of the same block. Unlike in wu2019fbnet, both blocks share the weights pham2018efficient, and the only difference between them is masking. In contrast to wan2020fbnetv2, masks are not applied to the outputs of blocks but only to a hidden dimension. The larger mask is initialized with all 1s, thus using all the channels. The smaller mask is initialized so that it masks out half of the channels.
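A toy scalar example may help to see why masking the hidden dimension leaves the block output shape untouched. The `bottleneck` function below is a hypothetical stand-in for an IRB (expand, activate, project) with made-up weights; it is not the block used in the experiments.

```python
def channel_mask(active: int, total: int) -> list:
    """1.0 for the first `active` hidden channels, 0.0 for the rest."""
    return [1.0] * active + [0.0] * (total - active)


def bottleneck(x: float, expand_w: list, project_w: list, mask: list) -> float:
    """Expand a scalar to hidden channels, mask them, project back to a scalar."""
    hidden = [max(0.0, w * x) * m for w, m in zip(expand_w, mask)]  # ReLU + mask
    return sum(h * p for h, p in zip(hidden, project_w))


# Both candidates use the SAME weights; only the hidden-channel mask differs.
expand_w = [1.0, 2.0, 3.0, 4.0]
project_w = [1.0, 1.0, 1.0, 1.0]

large = bottleneck(1.0, expand_w, project_w, channel_mask(4, 4))  # all channels
small = bottleneck(1.0, expand_w, project_w, channel_mask(2, 4))  # half masked out
print(large, small)
```

Both candidates return an output of the same shape, so their outputs can be mixed without any of the output channels being forced to zero.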

We try to ensure that the optimal mask value is in between the candidates. More precisely, if the candidate with the larger mask has a larger likelihood of being chosen, i.e., has a larger architecture weight than the candidate with the smaller mask, then both masks should be expanded. In the opposite case, both masks should be reduced. The distance between the masks should decrease as the training progresses, and at the end of the search, the distance should be close to zero. Due to the rules above, by the end of the search, we expect that the final values obtained are close to each other and tend to an optimal solution. After the search, we sample a single candidate (from those two modified candidates), cf. wu2019fbnet. Fig. 2(b) visualizes the masks of two candidates in the case of using IRB. The exact procedure of Prunode masking is presented in Algorithm 1.

Our routine extensively searches through discrete inner hidden dimension parameter values while keeping memory usage and computation costs low. Our proposed method has a computation cost and memory usage of $O(1)$ with respect to the number of all possible values, as only two candidates are evaluated every time. Further, as we prune most of the suboptimal candidates, the SuperNet architecture tends to the sampled one. As a result, co-adaptation among operations is minimized.
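The mask update rules can be sketched as follows. This is a simplified illustration of the idea, not a reproduction of Algorithm 1: the step size, the momentum value, the linear shrinking of the mask distance with training progress, and the class and attribute names are all assumptions, and the resetting of architecture weights is left out.

```python
from dataclasses import dataclass, field


def snap(value: float, granularity: int) -> int:
    """Round to the nearest multiple of `granularity`."""
    return int(round(value / granularity)) * granularity


@dataclass
class PrunodeMasks:
    max_channels: int          # hidden channels at the maximal expansion ratio
    granularity: int = 32      # mask sizes stay multiples of this
    momentum: float = 0.9      # hypothetical value
    step: float = 16.0         # hypothetical step size, in channels
    velocity: float = field(default=0.0, init=False)

    def __post_init__(self) -> None:
        g = self.granularity
        self.large = self.max_channels                        # all channels
        self.small = max(g, snap(self.max_channels / 2, g))   # about half
        self.position = float(self.small)                     # unrounded small mask
        self.max_distance = self.large - self.small

    def update(self, w_small: float, w_large: float, progress: float) -> None:
        """Call after each training iteration; progress runs from 0 to 1."""
        g = self.granularity
        if self.large - self.small <= g:
            return  # consecutive choices reached, nothing left to update
        # Expand both masks if the large candidate is preferred, else shrink.
        direction = 1.0 if w_large > w_small else -1.0
        self.velocity = (self.momentum * self.velocity
                         + (1.0 - self.momentum) * direction * self.step)
        # The target distance between the masks shrinks with progress.
        distance = max(g, snap(self.max_distance * (1.0 - progress), g))
        self.position = min(max(self.position + self.velocity, g),
                            self.max_channels - g)
        self.small = snap(self.position, g)
        self.large = min(self.small + distance, self.max_channels)


masks = PrunodeMasks(max_channels=256)
for iteration in range(100):
    # Pretend the large candidate always has the higher architecture weight.
    masks.update(w_small=0.0, w_large=1.0, progress=iteration / 100)
print(masks.small, masks.large)
```

Only two integers per Prunode are tracked, which is why the cost does not depend on how many channel counts are admissible.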

3.2.2 Pruning blocks within a stochastic layer

Each block in a layer has its architecture weight initialized to the same value, which depends on the number of blocks in that layer. Architecture weights are not modified during the first epochs (the warmup phase). After the warmup phase is finished, the SuperNet can be pruned, i.e., when a block in a layer has a probability of being chosen below a threshold, it is removed from the layer. This threshold changes linearly during the training, from an initial to a final value, and depends only on the current epoch of the training.

The initial threshold should not be higher than $1/N_{\max}$, where $N_{\max}$ is the highest number of blocks in a layer, so that no blocks are removed immediately after the warmup phase. The final threshold should not be lower than 0.5; thus, during the training, all blocks but one are removed from each layer. An example of block pruning is presented in Fig. 2(a).
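A minimal sketch of threshold-based block pruning is given below. The exact interpolation formula, the epoch counts, the use of a plain softmax as the probability of a block being chosen, and the rule of never emptying a layer are assumptions made for illustration.

```python
import math


def pruning_threshold(epoch, warmup_epochs, total_epochs, initial, final):
    """Linear schedule from `initial` to `final`; None during warmup."""
    if epoch < warmup_epochs:
        return None
    last = total_epochs - 1
    if last <= warmup_epochs:
        return final
    fraction = (epoch - warmup_epochs) / (last - warmup_epochs)
    return initial + (final - initial) * fraction


def block_probabilities(arch_weights):
    """Softmax over the architecture weights of one layer."""
    peak = max(arch_weights.values())
    exps = {name: math.exp(w - peak) for name, w in arch_weights.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def prune_layer(arch_weights, threshold):
    """Drop every block whose probability is below the threshold."""
    probs = block_probabilities(arch_weights)
    kept = {n: w for n, w in arch_weights.items() if probs[n] >= threshold}
    if not kept:  # assumption: never empty a layer, keep the most likely block
        best = max(probs, key=probs.get)
        kept = {best: arch_weights[best]}
    return kept


# Hypothetical layer with three candidate blocks and trained architecture weights.
layer = {"k3": 2.0, "k5": 0.0, "k5_se": -2.0}
for epoch in (5, 12, 19):
    t = pruning_threshold(epoch, warmup_epochs=5, total_epochs=20,
                          initial=0.1, final=0.5)
    print(epoch, t, sorted(prune_layer(layer, t)))
```

With a final threshold of at least 0.5, at most one block can remain above it (ties aside), which is what turns the stochastic layer into a single fixed block by the end of training.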

3.2.3 Pruning unnecessary stochastic layers

Some layers might have the same input and output tensor sizes. For such layers, it is possible to remove them completely to obtain a shorter network. Removing a layer is equivalent to choosing a skip connection block. However, adding a skip connection block to the search space increases memory consumption. Moreover, using this approach we observed a bias towards finding shallower architectures than optimal. Instead of adding a skip connection block to the set of possible blocks of a layer, we introduce a GPU memory optimization. When the penultimate block is to be removed from a layer, it is replaced by a skip connection block. From that point, the output of the skip connection block is multiplied by one coefficient, and the output of the other remaining block is multiplied by a second coefficient; both coefficients are fixed for the whole training. Each layer uses the same values of these parameters. Once the skip connection block or the other block is removed from the layer, the output is no longer multiplied by any parameter. Selecting different values of the two coefficients allows reducing the bias towards shallower networks and thus gives more control to find an even better architecture.
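The bookkeeping for this late skip connection can be sketched as below. The class name `ShrinkingLayer`, the scalar stand-in blocks, and the coefficient values 0.9 and 1.1 are hypothetical; the sketch assumes a layer whose input and output shapes match, so that an identity block is admissible.

```python
from typing import Callable, Dict

Block = Callable[[float], float]


class ShrinkingLayer:
    """Stochastic layer that receives a skip block only when two blocks remain."""

    def __init__(self, blocks: Dict[str, Block],
                 skip_scale: float, block_scale: float):
        self.blocks = dict(blocks)
        self.scales = {name: 1.0 for name in self.blocks}
        self.skip_scale = skip_scale      # multiplies the skip output
        self.block_scale = block_scale    # multiplies the other block's output
        self.skip_added = False

    def remove(self, name: str) -> None:
        del self.blocks[name]
        del self.scales[name]
        if len(self.blocks) != 1:
            return
        (survivor,) = self.blocks
        if not self.skip_added:
            # Penultimate block removed: a skip connection takes its place.
            self.blocks["skip"] = lambda x: x
            self.scales = {survivor: self.block_scale, "skip": self.skip_scale}
            self.skip_added = True
        else:
            # Skip or the other block was removed: stop scaling the output.
            self.scales[survivor] = 1.0

    def forward(self, x: float, coeffs: Dict[str, float]) -> float:
        return sum(coeffs[n] * self.scales[n] * block(x)
                   for n, block in self.blocks.items())


layer = ShrinkingLayer({"k3": lambda x: 2.0 * x, "k5": lambda x: 3.0 * x},
                       skip_scale=0.9, block_scale=1.1)
layer.remove("k5")                                   # "skip" appears here
print(sorted(layer.blocks), layer.forward(1.0, {"k3": 0.5, "skip": 0.5}))
```

Because the skip block only exists once a single real block is left, the layer never holds more blocks in memory than it started with.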

4 Experiments

We test our methodology on ImageNet-1k imagenet-1k as there are many good networks that can potentially be tuned. In particular, recent GPUNet networks wang2022gpunet set up a SOTA Pareto frontier in terms of latency and inference accuracy for small networks. Their architecture scheme is well suited to showcase the value of our proposed method. The networks created during these experiments are called PruNet, and Fig. 3 presents their structure. In this section, we delve into the details of how we found the PruNet family and analyze the search cost. We also show that our networks transfer well to the COCO object detection task.


Figure 3: PruNet networks: Image input resolution is set to 288x288 for all networks. A thick border represents a block using Squeeze and Excitation. A darker shade of color represents kernel size = 5. Width is proportional to the expansion ratio value. Each architecture is followed by Convolution 1x1 with 1280 output channels, RELU activation function, Adaptive Average Pool, and Linear layer.

4.1 Image Classification on ImageNet

4.1.1 Finding PruNet network family

Inspired by the GPUNet-1 architecture, we define our SuperNet, as described in Section 3.1 and Table 1, with an image resolution of 288x288, a 2D convolution with kernel size of 3, stride of 2, and Swish ramachandran2017searching activation function as the prologue, and then define 6 stages followed by an epilogue. At each stage, we use the same type of building block (convolution, Fused Inverted Residual Block (Fused-IRB), or IRB), activation function, and number of channels as in GPUNet-1. Within these constraints, in stages 2 to 6, we define stochastic layers that consist of four multi-blocks (kernel size {3,5}, SE {True, False}), all with a maximum expansion of 8 and a granularity of 32 channels. SE stands for Squeeze and Excitation block, cf. hu2018squeeze. For stage 1, we define only two choices: kernel size {3,5}. Compared to GPUNet-1, each stage contains one additional layer that can be skipped during the training. With the SuperNet defined this way, we can sample GPUNet-1; hence the final result is expected to be improved.

To generate the entire Pareto frontier, we experiment with 7 different values of the trade-off coefficient of the loss function (Section 3.2), which changes the trade-off between latency and accuracy. Since proper layer skipping is crucial for finding a good network, and the last 130 epochs are relatively inexpensive due to the progressive pruning of the search space, we test 3 variants of layer skipping for each value. On the basis of preliminary experiments, we chose the values of the two coefficients from Section 3.2.3, setting one of them to 1.1. For each value of the trade-off coefficient, we decide which layer-skipping variant was the best based on the final loss from the SuperNet search. The architecture with the best loss was then trained from scratch. Table 2 compares the PruNet results against other baselines. For all the considered networks of comparable accuracy, PruNet has significantly lower latency. It is worth noting that it does not necessarily have a lower number of parameters or FLOPs. In particular, comparing PruNets and GPUNets (both among distilled and non-distilled networks), we observe that the obtained networks are uniformly better, meaning we get higher accuracy with lower latency.

                                          Top1      TensorRT latency  #Params  #FLOPS  PruNet   PruNet
Models                                    ImageNet  FP16 V100 (ms)    (10^6)   (10^9)  speedup  accuracy gain
Without distillation
EfficientNet-B0 tan2019efficientnet       77.1      0.94              5.28     0.38    1.96x    1.3
EfficientNetX-B0-GPU li2021searching      77.3      0.96              7.6      0.91    2x       1.1
REGNETY-1.6GF radosavovic2020designing    78        2.55              11.2     1.6     5.31x    0.4
PruNet -0                                 78.4      0.48              11.3     2.10
OFA 389 cai2019once                       79.1      0.94              8.4      0.39    1.68x    0.5
EfficientNetX-B1-GPU                      79.4      1.37              9.6      1.58    2.45x    0.2
OFA 482                                   79.6      1.06              9.1      0.48    1.89x    0.0
PruNet -1                                 79.6      0.56              14.8     2.57
FBNetV3-B dai2020fbnetv3                  79.8      1.5               8.6      0.46    2.27x    0.9
EfficientNetX-B2-GPU                      80.0      1.46              10       2.3     2.21x    0.7
OFA 595                                   80.0      1.09              9.1      0.6     1.65x    0.7
EfficientNet-B2                           80.3      1.65              9.2      1       2.5x     0.4
ResNet-50 he2016deep                      80.3      1.1               28.09    4       1.67x    0.4
GPUNet-1 wang2022gpunet                   80.5      0.69              12.7     3.3     1.04x    0.2
PruNet -2                                 80.7      0.66              18.6     3.31
REGNETY-32GF                              81        4.97              145      32.3    6.45x    0.2
PruNet -3                                 81.2      0.77              20.8     4.05
EfficientNetX-B3-GPU                      81.4      1.9               13.3     4.3     2.09x    0.1
PruNet -4                                 81.5      0.91              24.0     4.25
EfficientNet-B3                           81.6      2.04              12       1.8     1.91x    0.3
PruNet -5                                 81.9      1.07              27.5     5.29
ResNet-101 he2016deep                     82.0      1.7               45       7.6     1.33x    0.3
FBNetV3-F                                 82.1      1.97              13.9     1.18    1.54x    0.2
GPUNet-2 wang2022gpunet                   82.2      1.57              25.8     8.38    1.23x    0.1
PruNet -6                                 82.3      1.28              31.1     7.36
With distillation
PruNet -0 (distilled)                     80.0      0.48              11.3     2.10
GPUNet-0 (distilled)                      80.7      0.61              11.9     3.25    1.09x    0.3
PruNet -1 (distilled)                     81.0      0.56              14.8     2.57
GPUNet-1 (distilled)                      81.9      0.69              12.7     3.3     1.04x    0.3
PruNet -2 (distilled)                     82.2      0.66              18.5     3.31
PruNet -3 (distilled)                     82.6      0.77              20.8     4.05
PruNet -4 (distilled)                     83.1      0.91              24.0     4.25
PruNet -5 (distilled)                     83.4      1.07              27.5     5.29
GPUNet-2 (distilled)                      83.5      1.57              25.8     8.38    1.23x    0.5
EfficientNetV2-S tan2021efficientnetv2    83.9      2.67              22       8.8     2.09x    0.1
PruNet -6 (distilled)                     84.0      1.28              31.1     7.36
Table 2: PruNet image classification results. All latency measurements are made using batch size 1.

4.1.2 Ablation study

Each of the optimizations has a different impact on search performance. The main advantage of the first one (3.2.1) is memory efficiency: Prunode allows a significant increase in the size of the search space without increasing the demand for compute and memory resources. The second optimization (3.2.2) can save a lot of computational time without compromising the quality of the results. The last optimization (3.2.3) saves the memory that would be needed for the skip connection in the standard FBNetV1 approach and increases the quality of the final results. Table 3 shows the cost impact in two scenarios: a fixed GPU memory budget and a fixed (PruNet) search space. Appendix B contains an analysis of the quality of the results.

Scenario            Metric             FBNetV1  1st opt (3.2.1)  2nd opt (3.2.2)  3rd opt (3.2.3)  All opts
hard memory limit   search space size
                    memory needed
                    time to train      140h     140h             62h              140h             62h
same search space   search space size
                    memory needed
                    time to train      ——-      190h             ——-              ——-              62h
Table 3: Cost analysis

4.2 Object Detection on COCO


Figure 4: PruNet as a backbone outperforms GPUNet and EfficientNet on COCO object detection task on inference latency relative to mean Average Precision (mAP)

The experiments were conducted on the COCO 2017 detection dataset lin2014microsoft. We chose EfficientDet tan2020efficientdet as a baseline model. We replaced the original EfficientNet backbone with GPUNet and PruNet for broad comparison. All backbones were pretrained without distillation. Figure 4 shows that our architectures can be successfully transferred to other Computer Vision tasks. PruNet turned out to be faster and more accurate, as it was on the image classification task.

5 Conclusions

We introduced Prunode, a stochastic bi-path building block which can be used to search the inner hidden dimension of blocks in any differentiable NAS with $O(1)$ cost. Together with two novel layer pruning techniques, we show that our proposed inference-aware method establishes a new SOTA Pareto frontier (PruNet) in TensorRT inference latency and ImageNet-1K top-1 accuracy, and enables fine-granularity sampling of the Pareto frontier to better fit external deployment constraints. Further, our results presented in Table 2 confirm that FLOPs and the number of parameters are not the right proxies for latency, highlighting the importance of the correct metric for architectural evaluation. Although we only present results for computer vision tasks, our methods can be generalized to searching models for natural language processing (NLP), speech, and recommendation system tasks.


Appendix A Experiments details

A.1 Machine and software setup

The experiments were performed on nodes equipped with 8 NVIDIA Tesla A100 GPUs (DGX-A100). Python 3.6.8 and PyTorch Image Models rw2019timm v0.4.12 were used inside the pytorch/pytorch:1.9.1-cuda11.1-cudnn8-runtime docker container.

By inference latency we mean the Median GPU Compute Time measured via the trtexec command with FP16 precision and a batch size of 1, using TensorRT on a single NVIDIA V100 GPU with driver version 450.51.06 inside a docker container.

A.2 ImageNet experiments

The experiments were conducted on the ImageNet-1k imagenet-1k image classification dataset (which has a custom non-commercial use license). It consists of approximately 1.28 million training samples and 50 thousand validation samples, which span 1,000 classes. For all experiments we used a weight decay of 1e-5 and AutoAugment cubuk2018autoaugment with an augmentation magnitude of and standard deviation (corresponding to the probability of applying the operation) for both architecture search and training from scratch. The experiments were performed in automatic mixed precision (AMP).

A.2.1 Architecture search details

The architecture search was performed on of randomly selected classes from the original dataset. The input images were scaled to a resolution of . The search lasted epochs, with a total batch size of and a cosine learning rate scheduler that had an initial value of . We used the Adam kingma2014adam optimizer for the architecture parameters and the RMSprop optimizer with an initial learning rate of for the weights. We divided the search into two phases. In the first phase, we trained only the regular weights for the first epochs; this phase was run only once to save computation time. In the second phase, which lasted the remaining 130 epochs, 80% of the training dataset was used in each epoch to train the regular weights and the remaining 20% was used to train the architecture weights. The second phase was run in many variants, but it always started from the common checkpoint saved after epochs. During this phase, we progressively pruned the search space using a pruning threshold (see Section 3.2.2) that increased linearly from to . At the end of each epoch, we removed blocks below the threshold from the search space. The momentum used when pruning the internal structure of a Prunode was equal to . The coefficient in the loss function varied across runs, with analyzed values of , but all runs used the same coefficient value of . The latency term LAT of the loss function was measured in s.
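The progressive pruning described above can be illustrated with a minimal sketch. It assumes that each candidate block in a stochastic layer carries a scalar score that is compared against the threshold; the actual criterion is the one defined in Section 3.2.2. The function names, the block names, the score values, the threshold range, and the rule of always keeping at least one block are hypothetical and chosen only for illustration.

```python
def linear_threshold(epoch, first_epoch, last_epoch, start, end):
    """Linearly interpolate the pruning threshold between two epochs."""
    if last_epoch == first_epoch:
        return end
    t = (epoch - first_epoch) / (last_epoch - first_epoch)
    t = min(max(t, 0.0), 1.0)  # clamp outside the scheduled range
    return start + t * (end - start)


def prune_layer(block_scores, threshold):
    """Drop candidate blocks whose score is below the threshold.

    Assumption (not from the paper): the best block is always kept so
    that a layer never becomes empty.
    """
    kept = {name: s for name, s in block_scores.items() if s >= threshold}
    if not kept:
        best = max(block_scores, key=block_scores.get)
        kept = {best: block_scores[best]}
    return kept


# Hypothetical scores for the candidate blocks of one stochastic layer.
scores = {"irb_k3_e4": 0.50, "irb_k5_e4": 0.30, "irb_k3_e6": 0.15, "skip": 0.05}
for epoch in (0, 5, 10):
    thr = linear_threshold(epoch, first_epoch=0, last_epoch=10, start=0.0, end=0.2)
    print(epoch, round(thr, 3), sorted(prune_layer(scores, thr)))
```

Applying the check once per epoch mirrors the end-of-epoch pruning step in the text; the search space can only shrink as the threshold rises.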

A.2.2 Training details

The hyperparameter values were based on GPUNet-1 wang2022gpunet. For a fair comparison, we trained GPUNet-0, GPUNet-1 and GPUNet-2 using exactly the same hyperparameters (including batch size) as we used for PruNets.

After the architecture search, the sampled network was trained again from scratch. The training lasted for epochs with a total batch size of and an initial learning rate of . The learning rate decayed by times every epochs. crop_pct was set to . Exponential Moving Average (EMA) was used with a decay factor of . We used drop path larsson2016fractalnet with a base drop path rate of . All PruNet and GPUNet networks were trained both with and without distillation hinton2015distilling. Knowledge distillation is a technique that transfers knowledge from a large pre-trained model to a smaller one that can be deployed under real-world resource constraints. For training with distillation, we used different teachers and different crop_pct values, as shown in Table 4.
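As a rough illustration of two of the training components mentioned above, the following sketch shows a step learning rate decay and an EMA update of the weights. It assumes that the decay multiplies the learning rate by a fixed factor every fixed number of epochs and that EMA is applied element-wise to the weights; the function names and all numeric values are hypothetical and are not the values used in our experiments.

```python
def step_decay_lr(initial_lr, epoch, decay_factor, decay_every):
    """Learning rate after multiplying by decay_factor every decay_every epochs."""
    return initial_lr * decay_factor ** (epoch // decay_every)


def ema_update(ema_weights, weights, decay):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Hypothetical values, for illustration only.
for epoch in (0, 2, 4):
    print(epoch, step_decay_lr(0.1, epoch, decay_factor=0.5, decay_every=2))

ema = [0.0, 0.0]
for current in ([1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, current, decay=0.9)
print(ema)
```

The EMA copy changes slowly when the decay factor is close to 1, which is why it is typically the EMA weights, rather than the raw weights, that are evaluated.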

model | resolution | teacher architecture | teacher resolution | crop_pct
GPUNet-1 & PruNet | | EfficientNet-B3 | | 0.904
GPUNet-0 | | EfficientNet-B4 | | 0.922
GPUNet-2 | | EfficientNet-B5 | | 0.934
Table 4: A different teacher was selected for each image resolution. Additionally, crop_pct was changed to match the crop_pct of the teacher.
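The distillation objective itself is not spelled out in this appendix. A minimal sketch, assuming the standard soft-target formulation of hinton2015distilling (a weighted sum of the hard-label cross entropy and a temperature-scaled KL divergence between teacher and student), is given below. The temperature, the weighting alpha, and the function names are hypothetical and do not reflect the settings used for the networks in Table 4.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """(1 - alpha) * CE(student, label) + alpha * T^2 * KL(teacher || student)."""
    # Hard-label cross entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-target term: both distributions are softened by the temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s))
             for t, s in zip(p_teacher, p_student) if t > 0.0)
    return (1.0 - alpha) * ce + alpha * temperature ** 2 * kl


# Hypothetical logits for a 3-class problem.
student = [2.0, 0.5, -1.0]
teacher = [3.0, 0.0, -2.0]
print(distillation_loss(student, teacher, label=0))
```

The factor T^2 keeps the gradient magnitude of the soft-target term comparable across temperatures, following the original formulation.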

A.3 COCO experiments

The object detection experiments were conducted on the MS COCO 2017 dataset lin2014microsoft (licensed under a Creative Commons Attribution 4.0 License). We used EfficientDet tan2020efficientdet as the baseline model and replaced the original EfficientNet tan2019efficientnet backbone with PruNet and GPUNet. The training lasted for epochs with a batch size of . The learning rate was warmed up for the first epochs, with the warm-up value set to 1e-4. Then, a cosine learning rate scheduler was used with an initial learning rate of . The optimizer was SGD with a momentum of and a weight decay of 4e-5. Gradient clipping with a value of 10.0 was applied. The training was performed in automatic mixed precision (AMP). We used Exponential Moving Average (EMA) with a decay factor of .
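The warm-up followed by cosine decay can be sketched as a function of the epoch. This is an illustration under stated assumptions: the warm-up is taken to be a linear ramp starting at the warm-up value of 1e-4 (the text does not say whether the ramp is linear or constant), and the total number of epochs, the number of warm-up epochs, and the base learning rate used below are hypothetical placeholders.

```python
import math


def lr_at_epoch(epoch, total_epochs, warmup_epochs, warmup_lr, base_lr, min_lr=0.0):
    """Linear warm-up from warmup_lr to base_lr, then cosine decay to min_lr."""
    if epoch < warmup_epochs:
        frac = epoch / warmup_epochs
        return warmup_lr + frac * (base_lr - warmup_lr)
    # Fraction of the post-warm-up schedule that has elapsed, in [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule: 100 epochs, 5 warm-up epochs, base learning rate 0.1.
for epoch in (0, 5, 50, 100):
    print(epoch, lr_at_epoch(epoch, 100, 5, warmup_lr=1e-4, base_lr=0.1))
```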


Appendix B Detailed analysis of pruning unnecessary stochastic layers

Proper selection of the parameters and (which multiply the output of the skip connection and the output of a block; for more details see Section 3.2.3) can influence the final length of the sampled network. In particular, Table 5 shows that for large enough, the number of layers is inversely related to for the considered cases. To further analyze the impact of our methodology, we ran additional searches with – these values correspond to the base approach without our modification. Table 5 makes clear that for , the base approach samples networks with fewer layers than the searches we performed. To show that our approach is justified, we trained all 28 architectures found during the searches described above and drew the Pareto frontiers defined by all four sets of and . Figure 5 shows that our search technique can find networks with up to better latency at the same accuracy compared to the base approach. Our methodology uses the final loss of the search as a zero-cost filter to select architectures that are good candidates for evaluation. Figure 6 presents the Pareto frontier of search loss in relation to final latency. Both figures show that if the search loss is significantly better for similar searches (in this case, for similar values), we can also expect better final accuracy; however, the correlation is not strong. Therefore, without additional compute overhead, our methodology can find networks that are almost optimal, possibly still leaving room for improvement.

Loss #IRBs Loss #IRBs Loss #IRBs Loss #IRBs
6.7149 6 6.6316 7 6.5968 9 6.6241 11
4.2938 9 4.1876 10 4.2139 12 4.2039 13
3.6352 11 3.6015 13 3.6074 13 3.6138 14
3.0027 13 2.9932 14 2.9762 14 2.9692 15
2.3528 15 2.3394 15 2.316 15 2.3723 15
1.7154 16 1.7201 17 1.7296 16 1.6946 16
1.0573 18 1.0573 18 1.0687 18 1.0531 18
Table 5: Search results for different and . For each combination of we ran a single search, and for each search the total loss is reported. The total loss is the sum of the cross-entropy loss and the latency loss (). #IRBs indicates the number of Fused-IRB and IRB blocks in the final architecture. The number of other layers is the same for all architectures. PruNet architectures are shown in bold.
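The Pareto frontiers used in this analysis can be extracted from a set of trained architectures with a simple sweep. The sketch below assumes that each architecture is summarized by a latency (lower is better) and an accuracy (higher is better); the function name and the example values are hypothetical and are not results from our experiments.

```python
def pareto_frontier(models):
    """Return the models not dominated in (latency, accuracy).

    models: iterable of (name, latency, accuracy) tuples.
    A model is kept only if it is strictly more accurate than every
    model that is at least as fast.
    """
    frontier = []
    best_acc = float("-inf")
    # Sort by latency; among equal latencies, visit the most accurate first.
    for name, latency, acc in sorted(models, key=lambda m: (m[1], -m[2])):
        if acc > best_acc:
            frontier.append((name, latency, acc))
            best_acc = acc
    return frontier


# Hypothetical (name, latency, top-1 accuracy) triples.
candidates = [
    ("a", 1.0, 70.0),
    ("b", 1.2, 69.0),
    ("c", 1.5, 75.0),
    ("d", 1.5, 74.0),
    ("e", 2.0, 75.0),
]
print([name for name, _, _ in pareto_frontier(candidates)])
```

The same routine applies to the search-loss frontier by passing the negated search loss in place of accuracy, so that lower loss counts as better.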


Figure 5: Graph focuses on the architectures being searched using


Figure 6: Graph focuses on the architectures being searched using , which shows the relationship between the final CE of the search and the final latency of the architecture