1 Introduction
Neural architecture search (NAS) has attracted much attention recently [48, 16, 29, 13]. However, its prohibitive time and computational resource cost is a notable problem that prevents its deployment in many realistic scenarios. For example, the reinforcement learning (RL) based NAS method [49] requires 2000 GPU days and the evolutionary algorithm based method [32] requires 3150 GPU days. Recent differentiable search methods, e.g., DARTS [29], reduce the cost to some extent. However, DARTS still requires 96 GPU hours to search on the small proxy dataset CIFAR10, and it is impractical to search on large-scale datasets like ImageNet [8] directly. The inefficiency of DARTS results from its strategy of aggregating multiple features generated by different candidate operations. Following [29, 45], we use a directed acyclic graph (DAG) to represent the network architecture and let the node/edge terminology denote the latent representation/candidate operation, respectively. As illustrated in Figure 2 (left), multiple convolution candidates operate on the same input feature map and each generates its own feature map. The final output is the aggregation of these feature maps via a weighted sum. Conducting multiple convolution operations and storing the generated features bring a huge computation burden and memory cost.
In this work, we propose a novel and simple search method that significantly reduces the cost of the searching procedure. The key idea of our method is to calculate the weighted sum of convolution kernels rather than of output features, as illustrated in Figure 2 (right). We propose this strategy by exploiting the “additivity” of convolution, which has been discussed in ACNet [9] recently. The “additivity” states that if several 2D kernels with compatible sizes operate on the same input to produce outputs of the same resolution and their outputs are summed up, we can add up these kernels at the corresponding positions to obtain an equivalent kernel that will produce the same output [9]. However, kernels with different shapes cannot be added up directly. To solve this problem, we propose a novel strategy, named probabilistic kernel mask, which masks off the invalid area of a bigger kernel to represent a smaller kernel, as detailed in the following sections.
Based on the above “additivity” property, we develop a novel design for encoding the supernet that enables us to conduct the convolution operation on the input feature only once and obtain a single feature map between two intermediate nodes. Thus, the computation and memory cost can be reduced significantly compared to previous search methods [29, 25, 4]. Our work suggests a new transition in searching for appropriate architectures: from evaluating feature maps to evaluating the convolution kernels that generate them.
Because we take the convolution kernels as our direct search target, we can search in a more fine-grained way. Specifically, we can search for a convolution consisting of different kernel sizes. Using different kernel sizes within a single convolution, we can increase the range of receptive fields, which means we can incorporate multi-scale information within a single layer without changing the model’s macro architecture. The idea of multi-scale representation has drawn great interest in the computer vision community and has been applied to many vision tasks, such as classification [37, 20, 6], object detection [26, 35] and semantic segmentation [2, 27]. Most of these works obtain multi-scale information by fusing multiple feature maps with different resolutions. In this work, we exploit multi-scale information from the perspective of convolution kernel sizes. This work makes the following contributions:

We propose a novel searching strategy that directly encodes the supernet from the perspective of convolution kernels and “shrinks” multiple convolution kernel candidates into a single one before these candidates operate on the input feature. The memory for storing intermediate features and the resource budget for conducting convolution operations are both reduced remarkably.

Our search strategy is able to search in a more fine-grained way than previous methods and mixes up multiple kernel sizes within a single convolution without the constraint in MixNet [42]. The search space and the capacity for representing possible networks are significantly enlarged.

Extensive experiments on both classification and object detection tasks are conducted. The results show the proposed searching method can discover new state-of-the-art lightweight CNNs while reducing the searching cost by about three orders of magnitude compared to existing SOTA methods.
2 Related Work
Efficient Search Methods
Neural architecture search (NAS) has relieved substantial handcraft efforts for designing neural network architectures and has been explored in many computer vision tasks, such as classification [48, 49, 29], detection [15, 5], semantic segmentation [27, 15] and GANs [16]. However, the prohibitive cost of NAS is still a notable problem. For example, the reinforcement learning (RL) based NAS method [49] requires 2000 GPU days and the evolutionary algorithm based method [32] requires 3150 GPU days. The gradient-based methods relax a discrete architecture choice to a continuous search space, allowing search of the architecture using gradient descent [29, 44, 12, 10]. Although gradient-based methods are more efficient than RL and evolutionary based ones, the adopted relaxation still brings a heavy computation and memory burden for calculating and storing the multiple features generated by all possible candidates. ProxylessNAS [1] proposes to binarize the architecture parameters and force only one path to be active at runtime, which reduces the required memory, though it requires customized GPU memory management. However, at least 200 GPU hours are still needed in [1]. Single-Path NAS [36] is a differentiable search algorithm with only a single path between two intermediate nodes. It views the small kernel as the core of the large kernel. Single-Path NAS chooses a candidate based on the L2 norm of the convolution weights. Specifically, it formulates a condition function in which the L2 norm of the convolution weights is compared to a threshold that controls the choice of convolution kernels. Our proposed method is very different from Single-Path NAS. First, we directly use explicit architecture parameters to represent the importance of all candidates, while Single-Path NAS uses the comparison results of convolution weights against threshold values. Second, the complexity of the condition function used in Single-Path NAS increases linearly with the number of kernel candidates, while our method is readily applicable to any number of candidates. Third, Single-Path NAS searches for a single kernel size within a convolution, rather than multiple kernel sizes as done in ours.
Multi-Scale Representation
Multi-scale representation has been widely explored in computer vision [26, 20, 6, 14]. Some works introduce multi-scale information from the perspective of the macro architecture and design model architectures with multi-branch topology [20, 43, 37]. Others propose to redesign the convolution operation [6, 14] and combine multi-scale information in a single convolutional layer, without modifying the macro architecture.
In the recent MixConv [42], Tan et al. also proposed to mix multiple kernel sizes within a single convolution. However, the channels in their method are always split uniformly among the kernel candidates. Thus, the search space used in [42] is limited due to such a fixed allocation of different kernels. It is reasonable that the convolutional layers’ preferences for kernel size differ across the network, so keeping a fixed ratio at different depths would not achieve optimal performance. Nonetheless, it is not practicable to manually fine-tune an ad-hoc ratio for a specific layer because of the non-trivial burden introduced by endless trial and error. In this work, we remove the “uniform partition” constraint and search for every kernel independently, which means the search space is significantly enlarged; specifically, the search space used in MixNet [42] is a subset of ours. Free of this constraint, the convolution operations are mixed up more robustly and flexibly than in MixNet [42].
3 Method
We start with preliminaries on the additivity property of convolutions [9], which is the theoretical basis for our efficient search strategy. We then introduce the probabilistic masks, which are designed to represent the supernet from the perspective of convolution kernels. Finally, in order to discover appropriate models under different computation resource budgets, we employ a resource-aware search objective function following [10].
3.1 Additivity of Convolution
Consider N 2D convolutional kernels K_1, …, K_N that operate on the same input I separately. If these 2D kernels have the same stride and compatible sizes, the sum of their outputs can be obtained in an equivalent way: adding up these kernels at the corresponding positions to formulate a single kernel, and then conducting the convolution operation on the input with this generated single kernel to get the final output. Here, compatible means that the smaller kernel can be generated by slicing the larger kernel. For example, 1×3 and 3×1 kernels are compatible with a 3×3 kernel [9]. Such ‘additivity’ of convolution can be formally represented as

I ∗ K_1 + I ∗ K_2 + … + I ∗ K_N = I ∗ (K_1 ⊕ K_2 ⊕ … ⊕ K_N),  (1)

where ⊕ denotes the element-wise addition of the kernel parameters at corresponding positions and + denotes the element-wise addition of the resulting features.
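The additivity in Eq. (1) is simply the linearity of convolution and is easy to verify numerically. Below is a minimal NumPy sketch (the `conv2d_valid` helper and the kernel sizes are our own illustrative choices, not from the paper): a 1×1 kernel is made compatible with a 3×3 one by zero-padding it into the centre of a 3×3 grid, after which summing the two branch outputs matches a single convolution with the summed kernel.

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 2D cross-correlation with 'valid' padding (illustrative helper)."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k3 = rng.standard_normal((3, 3))
k1 = rng.standard_normal((1, 1))

# Make the 1x1 kernel compatible with the 3x3 one by zero-padding it into
# the centre of a 3x3 grid, so both branches produce same-size outputs.
k1_padded = np.zeros((3, 3))
k1_padded[1, 1] = k1[0, 0]

# Summing the two branch outputs ...
sum_of_outputs = conv2d_valid(x, k3) + conv2d_valid(x, k1_padded)
# ... equals one convolution with the element-wise summed kernel (Eq. 1).
output_of_sum = conv2d_valid(x, k3 + k1_padded)
assert np.allclose(sum_of_outputs, output_of_sum)
```

The same identity holds for any number of compatible kernels, which is what allows the supernet to collapse all candidates into one convolution.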
To the best of our knowledge, this is the first work that introduces the additivity of convolution to the NAS field. We experimentally show that using this property helps reduce the searching time remarkably.
3.2 Meta Convolution Kernels
We propose an efficient search algorithm that can significantly improve the efficiency of the search process based on the additivity of convolution discussed above. The key part of our search strategy is that we use the weighted sum of kernels, rather than the weighted sum of feature maps used in previous works [29, 4, 10], to represent the aggregation of multiple outputs generated by all edges (candidate operations).
Let {K_1, K_2, …, K_n} denote a set of candidate kernels, where w_i and h_i represent the width and height of the i-th kernel K_i, respectively. We use the architecture parameters α = {α_1, α_2, …, α_n} to encode the over-parameterized kernel at the search stage. Each α_i determines the probability of selecting the corresponding K_i as the candidate.
3.2.1 Continuous Relaxation and Reformulation
The previous gradient-based NAS methods [29, 4, 10] relax the categorical choice to a softmax over multiple candidate operations. It can be formulated as:

O = Σ_{i=1}^{n} (exp(α_i) / Σ_{j=1}^{n} exp(α_j)) · o_i(x),  (2)

where O denotes the output, a weighted sum of the features o_i(x) generated by the multiple candidate operations on input x. As stated above, multiple output features need to be calculated and stored between two nodes, and the weighted sum over these features is taken as the final output of a node.
Based on the additivity of convolution in Eq. (1), we reformulate Eq. (2) as:

O = Σ_{i=1}^{n} p_i · (x ∗ K_i) = x ∗ (⊕_{i=1}^{n} p_i · K_i),  (3)

where p_i = exp(α_i) / Σ_{j=1}^{n} exp(α_j) is the softmax weight of the i-th kernel candidate and the outer ⊕ denotes the element-wise addition of the kernel parameters at corresponding positions. Through such a reformulation, we can combine multiple candidate kernels into a single one before they operate on the features. Thus, we only need to conduct the convolution operation once and generate a single output feature between two intermediate nodes, avoiding the intrinsic inefficiency introduced by multiple paths.
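This reformulation can be checked numerically. The sketch below (names and toy sizes are our own assumptions) compares the DARTS-style multi-path computation of Eq. (2) with the collapsed single-path computation of Eq. (3), using three compatible candidates embedded in a shared grid:

```python
import numpy as np

def conv2d_valid(x, k):
    """Plain 2D cross-correlation with 'valid' padding (illustrative helper)."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((10, 10))

# Three compatible candidates, already embedded in a shared 5x5 grid.
candidates = [rng.standard_normal((5, 5)) for _ in range(3)]
alpha = rng.standard_normal(3)               # architecture parameters
p = np.exp(alpha) / np.exp(alpha).sum()      # softmax weights

# Eq. (2) style: one convolution per candidate, then a weighted sum.
multi_path = sum(w * conv2d_valid(x, k) for w, k in zip(p, candidates))

# Eq. (3) style: collapse the kernels first, then convolve only once.
merged_kernel = sum(w * k for w, k in zip(p, candidates))
single_path = conv2d_valid(x, merged_kernel)
assert np.allclose(multi_path, single_path)
```

Only one convolution and one output feature map are needed in the second form, which is the source of the efficiency gain.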
3.2.2 Candidate Kernel Formulation
Now, we introduce the details of our search strategy. It consists of three steps. The first step is to determine the meta kernels; the second step is to generate probabilistic masks over the meta kernels; and the third step is to sample all the candidate kernels from the meta kernels via the probabilistic masks.
Step 1: Build meta kernels
We first build a special kernel K_meta with the shape

W × H, where W = max_i w_i and H = max_i h_i.  (4)

This implies that all kernels in the candidate set are compatible with K_meta. We name K_meta the meta kernel because all of the candidates in the set originate from it. For example, for a candidate set of 3×3, 5×5 and 7×7 kernels, the corresponding meta kernel has the shape 7×7.
Step 2: Learn the probability mask
Given the kernel candidate set, there is a corresponding mask set {M_1, M_2, …, M_n}, which serves as the intermediary for over-parameterizing the candidate kernels with the architecture parameters α. Each mask M_i has the same shape as the meta kernel K_meta. The elements of M_i are defined as:

M_i(u, v) = p_i if (u, v) ∈ RoI_i, and 0 otherwise,  (5)

where p_i is the sampling probability of K_i at the search stage, and RoI_i is defined as the mapping of K_i in the mask M_i, as illustrated in Figure 3. The mapping area in M_i is determined following two principles: (1) the center of RoI_i is located at the center of M_i, and (2) the shape of RoI_i is the same as its corresponding kernel candidate K_i. Note that the extra memory introduced by the masks is negligible compared to that introduced by the feature maps from multiple paths, as used in previous works.
Step 3: Generate all the candidate kernels
Now, every candidate kernel K_i can be generated by multiplying its corresponding mask with the meta kernel: K_i = M_i ⊙ K_meta. Based on the above formulation, we add an extra mask M_0 into the mask set, which serves to control the total number of filters in a layer. We name M_0 None, as all elements in M_0 are equal to zero. With the help of M_0, redundant filters can be pruned at the search stage.
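The three steps above can be sketched as follows. This is a simplified illustration under our own assumptions (a square meta kernel, a hypothetical candidate set {3, 5, 7}, and a single filter): it builds the probability masks of Eq. (5), appends the all-zero None mask, and derives every candidate from one shared meta kernel by masking.

```python
import numpy as np

def make_mask(meta_shape, kernel_size, prob):
    """Mask holding `prob` inside a centred kernel_size x kernel_size RoI."""
    m = np.zeros(meta_shape)
    c, r = meta_shape[0] // 2, kernel_size // 2
    m[c - r:c + r + 1, c - r:c + r + 1] = prob
    return m

candidate_sizes = [3, 5, 7]                  # hypothetical candidate set
meta_shape = (7, 7)                          # largest candidate size (Eq. 4)
rng = np.random.default_rng(2)
meta_kernel = rng.standard_normal(meta_shape)

# One architecture parameter per candidate, plus one for the "None" choice.
alpha = rng.standard_normal(len(candidate_sizes) + 1)
p = np.exp(alpha) / np.exp(alpha).sum()

masks = [make_mask(meta_shape, s, p[i]) for i, s in enumerate(candidate_sizes)]
masks.append(np.zeros(meta_shape))           # the all-zero "None" mask

# Every candidate is sliced out of the single meta kernel by masking, and the
# weighted combination is again just one 7x7 kernel: one conv at search time.
effective_kernel = sum(m * meta_kernel for m in masks)
assert effective_kernel.shape == meta_shape
```

Because the masks only store scalar probabilities over a 7×7 grid, their memory footprint is negligible next to the feature maps a multi-path supernet would keep.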
Note that in the above discussion, for the sake of simplicity, we take the search process of a single filter within a convolutional layer as an example. However, it is easy to extend to all filters because every kernel is treated independently at the search stage. Furthermore, benefiting from our fine-grained search strategy, the vanilla depthwise convolution and the mixed convolution proposed in [42] are both special cases of our search space.
3.3 Search with Costaware Objective
In order to let our proposed method generate models adaptively under different circumstances, we incorporate a cost-aware constraint into our objective to formulate a multi-objective search algorithm. Formally, we use FLOPs as the proxy of computation consumption and the corresponding searching loss is defined as
L_cost = max(0, |F(a) − T| − δ),  (6)

where T is the computation cost budget, which can be adapted according to different needs, F(·) counts the FLOPs of a specific architecture a sampled from the search space at the search stage, and δ is a slack variable. As the FLOPs of a sampled network is a discrete value, it is reasonable to confine the FLOPs within a small range rather than to a single point. We regard FLOPs as our cost-aware supervision in this work; other metrics such as latency, as used in [44, 40, 24], can readily replace FLOPs as the objective.
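Under this reading of Eq. (6), the cost term is a hinge penalty that is zero whenever the sampled FLOPs fall within δ of the target T. A minimal sketch (the budget and slack values here are illustrative assumptions, not the paper's exact settings):

```python
def cost_loss(flops, target, slack):
    """Hinge penalty: zero inside [target - slack, target + slack]."""
    return max(0.0, abs(flops - target) - slack)

# Illustrative numbers: a 260M-FLOPs budget with a 10% slack band
# (the paper sets its slack variable to 0.1; the units are our assumption).
target, slack = 260e6, 0.1 * 260e6

on_budget = cost_loss(262e6, target, slack)    # inside the band: no penalty
over_budget = cost_loss(400e6, target, slack)  # outside: linear penalty
assert on_budget == 0.0 and over_budget > 0.0
```

The dead zone of width 2δ is what lets discrete-valued FLOPs satisfy the constraint without chasing a single exact number.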
3.4 Differentiable Search Algorithm
With the cost-aware loss L_cost introduced, we search for network architectures that minimize the following multi-objective loss:

L(a, w_a) = L_CE(a, w_a) + λ · L_cost(a),  (7)

where a represents an architecture in the search space and w_a denotes the convolution weights of the corresponding model. We adopt the differentiable search method to solve the problem of finding the optimal kernels.
The probability p_i of sampling the i-th kernel candidate K_i in Eq. (5) is computed as

p_i = exp(α_i) / Σ_{j=1}^{n} exp(α_j).  (8)
Instead of directly relaxing the categorical choice of a particular kernel to a softmax over all possible candidates as in Eq. (3), we formulate the search stage as a sampling process, as done in [10, 44].
Although the objective function is differentiable with respect to the kernel weights K, it is not differentiable with respect to the architecture parameters α due to the sampling process. To sidestep this problem, we adopt the Gumbel-Softmax function [31, 22], as used in recent NAS works [44, 1, 10, 46]. The sampling probability in Eq. (8) can be rewritten as

p_i = exp((log π_i + g_i)/τ) / Σ_{j=1}^{n} exp((log π_j + g_j)/τ),  (9)

where g_i is sampled from the Gumbel(0, 1) distribution, τ is the softmax temperature, and π_i is the class probability of the categorical distribution calculated by Eq. (8).
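A small NumPy sketch of the Gumbel-Softmax relaxation in Eq. (9) (the temperature value and seeding are our own choices): Gumbel(0, 1) noise is added to the log class probabilities from Eq. (8) before a temperature-scaled softmax, keeping the sampling step differentiable with respect to α.

```python
import numpy as np

def gumbel_softmax(alpha, tau=1.0, rng=None):
    """Differentiable sample of kernel probabilities (a sketch of Eq. 9)."""
    rng = np.random.default_rng(3) if rng is None else rng
    pi = np.exp(alpha) / np.exp(alpha).sum()             # class probs, Eq. (8)
    g = -np.log(-np.log(rng.uniform(size=alpha.shape)))  # Gumbel(0, 1) noise
    logits = (np.log(pi) + g) / tau
    e = np.exp(logits - logits.max())                    # stable softmax
    return e / e.sum()

alpha = np.array([0.5, -1.0, 2.0])   # architecture parameters
p = gumbel_softmax(alpha, tau=1.0)
assert np.isclose(p.sum(), 1.0) and (p > 0).all()
```

As τ → 0 the output approaches a one-hot sample of a single kernel candidate, while larger τ yields a softer, smoother mixture.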
After the searching process, we derive the architecture from the architecture parameters α. Our pipeline is summarized in Algorithm 1. We will show in the experiment section that our proposed search algorithm costs orders of magnitude less search time than previous RL-based NAS and gradient-based multi-path NAS methods while achieving better performance.
4 Experiments
In this section, we aim to validate the effectiveness of our proposed search method. We first conduct ablation studies to investigate the effectiveness of mixing multiple kernel sizes without any constraints, which is more flexible than MixConv [42]. Then, we compare our searched models with state-of-the-art models, both manually designed and discovered by NAS methods. Besides, we further conduct object detection experiments to show the advantage of our models as backbone feature extractors.
4.1 Implementation Details
We conduct experiments on the widely used ImageNet [8] benchmark. We use normal data augmentation including random horizontal flipping with 0.5 probability, scaling of hue/saturation/brightness, and resizing and cropping, following [17]. We do not use mixup [47] or AutoAugment [7] for a fair comparison. The models are trained for 250 epochs from scratch as done in [29, 10, 4]. We train the models on 8 Nvidia 2080Ti GPUs with a total batch size of 1024. The learning rate is initialized as 0.65 and decayed to 0 at the end of the training stage, following the cosine rule, and weight decay is applied. At the evaluation phase, we adopt the popular settings, i.e., resizing the image and then center cropping a single patch. We set λ as 2.0 and δ as 0.1.

4.2 Ablation Study
Because we search for an appropriate ratio of different kernel sizes within a single depthwise convolution, we conduct a series of ablation studies to demonstrate that our proposed method achieves a better FLOPs-accuracy trade-off than both vanilla depthwise convolution and MixConv [42], where multiple kernels are mixed up in a manually designed partition.
4.2.1 Settings
Following MixConv [42], we design three kinds of baseline settings to implement the depthwise convolution:

Single kernel size within a depthwise convolution.

Multiple kernel sizes in a uniform partition way.

Multiple kernel sizes in an exponential partition way.
Note that the above three baseline models are three special cases of our search space because our search algorithm aims to find the proper ratio of different kernels within a single convolution operation.
To perform an apples-to-apples comparison, we reproduce all baseline methods under the same training/testing settings for internal ablation studies. Following MixConv [42], we conduct all experiments on the widely used MobileNetV1 [18] network.
For baseline A, we start with the original MobileNetV1 and then replace its 3×3 depthwise convolutions with larger single-size ones, respectively. For baselines B and C, we adjust the number of kernel types from 1 to 6. The kernel sizes increase from 3×3 with a step of 2. For example, when the number of kernel types is 6, the corresponding kernel candidate set is {3×3, 5×5, 7×7, 9×9, 11×11, 13×13}. The candidate sets on which we conduct our search method are the same as for baselines B and C for fair comparison.
4.2.2 Results
The experimental results are illustrated in Figure 4. For baseline A, similar to [42], we find that the model top-1 accuracy goes up when the kernel size is first enlarged but starts dropping once the kernel becomes too large. This can be explained as follows: in the extreme case where the kernel size equals the size of the input feature, the convolutional layer simply becomes a fully-connected layer, which is known to be harmful to performance [20, 42].
For baselines B and C, we observe that depthwise convolution with multiple kernels achieves a better FLOPs-accuracy trade-off than vanilla depthwise convolution, and the performances of baselines B and C are similar under the same FLOPs. Besides, baseline B can be seen as a special case of uniform sampling.
Furthermore, with the same kernel candidate set, our discovered models outperform both uniform and exponential allocations of kernels of different sizes under the same FLOPs constraint. We attribute the performance gain to the finer granularity of our search approach, which can choose a suitable ratio of kernel sizes at different depths of the architecture.
4.3 Comparing with SOTAs
To further demonstrate the effectiveness of our search method, we compare it with state-of-the-art NAS methods.
4.3.1 Settings
Following [44, 36, 1], we adopt the inverted residual bottleneck [34] (MBConv) as our macro structure. The MBConv block is a sequence of a 1×1 pointwise convolution, a depthwise convolution, and another 1×1 pointwise convolution. Different from previous works that search for a single kernel size in a depthwise convolution layer, our method searches for multiple kernel sizes.
The recent MixNet [42] also proposes to search among mixed kernels. However, the ratio of different kernel sizes in their search space is fixed as uniform. In our search space, there are no constraints on the ratio of different kernel sizes, so the search space is further enlarged; the search space in MixNet [42] is a subset of ours. The experimental results also show that our searched models achieve a better performance-cost trade-off than MixNet.
4.3.2 FLOPs vs. Accuracy
Evaluation results of our proposed metaKernel and comparisons with state-of-the-art approaches are summarized in Table 1 and Figure 5. The metaKernelA and metaKernelB models are obtained by setting different resource targets T in Eq. (6). We set the target values as 260M and 370M respectively, which are intentionally set around the FLOPs of the state-of-the-art MixNet [42] models for fair comparison under similar FLOPs.
As shown in Table 1, our metaKernelA achieves 75.9% Top-1/92.9% Top-5 accuracy with 254M FLOPs and metaKernelB achieves 77.0% Top-1/93.4% Top-5 accuracy with 357M FLOPs. They outperform state-of-the-art manually designed models by a large margin. Specifically, our metaKernelA is better than MobileNetV2 (+3.8%) and ShuffleNetV2 (+3.2%), with fewer FLOPs.
Compared to recently proposed automated models generated by NAS methods, our metaKernel models perform better under similar FLOPs. Specifically, compared to RL-based methods, our metaKernelA achieves 0.7% higher Top-1 accuracy than MnasNet-A1 [40] with 58M fewer FLOPs; 0.3% higher Top-1 accuracy than MnasNet-A2 with 86M fewer FLOPs; and 1.2% higher Top-1 accuracy than ProxylessNAS-R [1] with 66M fewer FLOPs. Compared with gradient-based methods, metaKernelA is better than ProxylessNAS-G (+1.6%), Single-Path NAS [36] (+0.8%), and FBNet-A/B/C (+2.8%/+1.7%/+0.9%), respectively.
4.3.3 Searching Hours vs. Accuracy
The comparison of GPU hours used for the searching process is illustrated in Figure 1. Our search method is faster than most of the methods by a large margin. Compared to the recent state-of-the-art MixNet [42] that also uses multi-scale representation, our metaKernelA achieves slightly higher accuracy (+0.1%) than MixNet-S, and metaKernelB achieves the same accuracy as MixNet-M while costing 3M fewer FLOPs. Remarkably, for achieving very similar results, our search method needs about three orders of magnitude fewer GPU hours than MixNet.
As mentioned in [44, 1], MnasNet [40] does not report the exact GPU hours for its searching stage. In this work, we adopt the search cost of MnasNet estimated in ProxylessNAS [1]. For MobileNetV3 [19] and MixNet [42], as they use the same search framework as MnasNet, we roughly estimate their search costs to be similar to that of MnasNet.

Model  Search method  Search space  Search dataset  Search cost (GPU hours)  #Params  #FLOPs  Top-1 (%)  Top-5 (%)
MobileNetV2 [34]  manual        3.4M  300M  72.0  91.0  
MobileNetV2()  manual        6.9M  585M  74.7  92.5  
ShuffleNetV2() [30]  manual        3.5M  299M  72.6    
CondenseNet(G=C=4) [21]  manual        2.9M  274M  71.0  90.0  
CondenseNet(G=C=8)  manual        4.8M  529M  73.8  91.7  
EfficientNetB0 [41]  manual        5.3M  390M  76.3  93.2  
NASNetA [49]  RL  cell  CIFAR10  48K  5.3M  564M  74.0  91.6  
PNASNet [28]  SMBO  cell  CIFAR10  6K  5.1M  588M  74.2  91.9  
AmoebaNetA [32]  evolution  cell  CIFAR10  75k  5.1M  555M  74.5  92.0  
DARTS [29]  gradient  cell  CIFAR10  96  4.7M  574M  73.3  91.3  
PDARTS [4]  gradient  cell  CIFAR10  7.2  4.9M  557M  75.6  92.6  
GDAS [10]  gradient  cell  CIFAR10  4.08  4.4M  497M  72.5  90.9  
MnasNetA1 [40]  RL  stagewise  ImageNet  40K  3.9M  312M  75.2  92.5  
MnasNetA2  RL  stagewise  ImageNet  40K  4.8M  340M  75.6  92.7  
SinglePath NAS [36]  gradient  layerwise  ImageNet  30  4.3M  365M  75.0  92.2  
ProxylessNASR [1]  RL  layerwise  ImageNet  200  4.1M  320M  74.6  92.2  
ProxylessNASG  gradient  layerwise  ImageNet  200      74.2  91.7  
FBNetA [44]  gradient  layerwise  ImageNet  216  4.3M  249M  73.0    
FBNetB [44]  gradient  layerwise  ImageNet  216  4.5M  295M  74.1    
FBNetC [44]  gradient  layerwise  ImageNet  216  5.5M  375M  74.9    
MobileNetV3Large [19]  RL  stagewise  ImageNet  40K  5.4M  219M  75.2    
MobileNetV3Large()  RL  stagewise  ImageNet    7.5M  356M  76.2    
MobileNetV3Small  RL  stagewise  ImageNet  40K  2.9M  66M  67.4    
MixNetS [42]  RL  kernelwise  ImageNet  40K  4.1M  256M  75.8  92.8  
MixNetM  RL  kernelwise  ImageNet  40K  5.0M  360M  77.0  93.3  
metaKernelA(ours)  gradient  kernelwise  ImageNet  40  5.8M  254M  75.9  92.9  
metaKernelB(ours)  gradient  kernelwise  ImageNet  40  7.2M  357M  77.0  93.4 
backbone  FLOPs  mAP  aero  bike  bird  boat  bottle  bus  car  cat  chair  cow  table  dog  horse  mbike  persn  plant  sheep  sofa  train  tv 
MBv1  10.16G  75.9  83.9  79.3  75.1  65.8  55.9  84.3  85.7  85.4  58.4  80.9  70.4  82.0  84.9  84.5  79.7  48.3  77.8  76.6  84.1  74.3 
MBv2  9.10G  75.8  84.5  83.4  76.1  68.3  58.7  78.9  84.8  86.5  54.4  80.7  70.9  84.0  85.0  83.6  76.8  48.7  78.7  72.8  85.0  73.7 
MBv3Small  8.19G  69.3  77.4  76.9  67.0  62.0  43.7  76.3  79.1  82.1  47.2  75.4  65.2  78.4  81.0  79.7  72.5  39.4  68.7  67.1  76.1  70.0 
MBv3Large  8.78G  76.7  84.1  84.1  77.0  69.9  57.9  84.8  85.1  88.1  56.3  84.8  64.8  84.3  87.9  84.7  77.0  46.3  80.5  73.9  86.2  76.8 
MBv3Large  9.28G  77.7  86.1  84.1  76.8  71.6  60.9  85.6  86.8  88.8  55.6  84.2  70.2  85.7  86.2  85.0  77.6  47.5  81.6  73.5  88.1  77.6 
metaKerneltiny  8.73G  76.2  79.6  83.5  76.4  68.1  54.3  83.2  85.6  87.1  56.0  82.8  71.9  84.9  85.0  85.7  76.4  47.1  81.7  75.0  85.5  74.7 
metaKernelA  8.91G  77.3  86.1  84.8  76.8  68.6  59.2  83.6  86.3  87.1  56.9  85.2  67.2  86.6  87.2  86.0  77.7  49.1  80.8  74.5  86.3  76.7 
metaKernelB  9.27G  78.0  86.8  84.5  77.5  69.8  58.2  85.3  86.4  88.3  60.4  84.5  72.9  85.8  86.7  86.8  78.0  51.3  80.0  73.9  86.4  77.4 
4.4 Object Detection
To further validate the effectiveness of our metaKernel models, we conduct object detection experiments on the PascalVOC [11] dataset. Following the broadly used strategy, we combine the VOC2007 trainval and VOC2012 trainval sets as the training data and test the performance of our model on the VOC2007 test set. We adopt our metaKernel as a drop-in backbone feature extractor in YOLOv3 [33]. All backbone models are pretrained on ImageNet and finetuned on PascalVOC for 200 epochs.
We first train the model at the initial learning rate for 160 epochs, then continue training for 20 epochs after decaying the learning rate, and for another 20 epochs after a second decay. The results on the VOC2007 test set are shown in Table 2. Our metaKernel-tiny model outperforms the MobileNetV1 [18] and MobileNetV2 [34] feature extractors by 0.3 mAP and 0.4 mAP, respectively, while consuming fewer FLOPs. And our metaKernelB performs better than the larger MobileNetV3-Large variant that has nearly the same FLOPs as our model.
4.5 Visualization on Kernel Size Distribution
As our search algorithm can determine the proportion of different kernel sizes automatically, we investigate the intrinsic preference of CNNs for kernel sizes. We plot the distribution of each kernel size in Figure 6. We observe that at shallow layers, the network tends to choose smaller kernel sizes; as the layers go deeper, large kernels begin to occupy a larger proportion. Interestingly, our findings are consistent with those of MixConv [42]. These findings may inspire future work on understanding CNNs.
5 Discussion
In this work, we treat every kernel within a single depthwise convolution independently and search for a mixed convolution. In this way, our proposed search strategy can search in a more fine-grained way. Furthermore, our method is also compatible with atrous convolution [3] and asymmetric convolution [23, 38, 39], following the same rule as discussed in Sec. 3.2.2, without requiring any other adaptation. Thus, our method can be incorporated into existing works [29, 46, 10] to further improve searching efficiency.
6 Conclusion
In this work, we propose an efficient search strategy that reduces the search cost dramatically. We encode the supernet from the perspective of convolution kernels rather than feature maps, which drops the requirement for memory and computation resources remarkably. Specifically, our search process is faster than MnasNet by about three orders of magnitude. Our proposed method digs into a more fine-grained search space, i.e., that of convolutional kernels. We demonstrate experimentally that our discovered models achieve better performance on ImageNet under the same computation resource constraints. We hope that our research will be beneficial in accelerating the search procedure and further promote the development of NAS.
References
 [1] (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §2, §3.4, §4.3.1, §4.3.2, §4.3.3, Table 1.
 [2] (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §1.
 [3] (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §5.
 [4] (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. arXiv preprint arXiv:1904.12760. Cited by: §1, §3.2.1, §3.2, §4.1, Table 1.
 [5] (2019) Detnas: neural architecture search on object detection. arXiv preprint arXiv:1903.10979. Cited by: §2.

[6] (2019) Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution. arXiv preprint arXiv:1904.05049. Cited by: §1, §2.
 [7] (2018) Autoaugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501. Cited by: §4.1.
 [8] (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §4.1.
 [9] (2019) ACNet: strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1911–1920. Cited by: §1, §3.1, §3.
 [10] (2019) Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1761–1770. Cited by: §2, §3.2.1, §3.2, §3.4, §3.4, §3, §4.1, Table 1, §5.
 [11] (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §4.4.
 [12] (2019) Densely connected search space for more flexible neural architecture search. arXiv preprint arXiv:1906.09607. Cited by: §2.
 [13] (2019) AdversarialNAS: adversarial neural architecture search for gans. arXiv preprint arXiv:1912.02037. Cited by: §1.
 [14] (2019) Res2Net: a new multiscale backbone architecture. arXiv preprint arXiv:1904.01169. Cited by: §2.
 [15] (2019) Nasfpn: learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7036–7045. Cited by: §2.
 [16] (2019) AutoGAN: neural architecture search for generative adversarial networks. arXiv preprint arXiv:1908.03835. Cited by: §1, §2.
 [17] (2019) Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567. Cited by: §4.1.
[18] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §4.2.1, §4.4.
[19] (2019) Searching for MobileNetV3. arXiv preprint arXiv:1905.02244. Cited by: §4.3.3, Table 1.
[20] (2017) Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844. Cited by: §1, §2, §4.2.2.
[21] (2018) CondenseNet: an efficient DenseNet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2752–2761. Cited by: Table 1.
[22] (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §3.4.
[23] (2014) Flattened convolutional neural networks for feed-forward acceleration. arXiv preprint arXiv:1412.5474. Cited by: §5.
[24] (2019) Partial order pruning: for best speed/accuracy trade-off in neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9145–9153. Cited by: §3.3.
[25] (2019) DARTS+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035. Cited by: §1.
[26] (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125. Cited by: §1, §2.
[27] (2019) Auto-DeepLab: hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 82–92. Cited by: §1, §2.
 [28] (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: Table 1.
[29] (2018) DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: Figure 2, §1, §1, §1, §2, §2, §3.2.1, §3.2, §4.1, Table 1, §5.
[30] (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: Table 1.
[31] (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. Cited by: §3.4.
[32] (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §1, §2, Table 1.
[33] (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §4.4.
[34] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §4.3.1, §4.4, Table 1.
[35] (2018) SNIPER: efficient multi-scale training. In Advances in Neural Information Processing Systems, pp. 9310–9320. Cited by: §1.
[36] (2019) Single-path NAS: designing hardware-efficient ConvNets in less than 4 hours. arXiv preprint arXiv:1904.02877. Cited by: §2, §4.3.1, §4.3.2, Table 1.
[37] (2019) Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212. Cited by: §1, §2.
[38] (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826. Cited by: §5.
[39] (2015) Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067. Cited by: §5.
[40] (2019) MnasNet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §3.3, §4.3.2, §4.3.3, Table 1.
 [41] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: Table 1.
 [42] (2019) MixConv: mixed depthwise convolutional kernels. CoRR, abs/1907.09595. Cited by: 2nd item, §2, §3.2.2, Figure 4, §4.2.1, §4.2.1, §4.2.2, §4.2, §4.3.1, §4.3.2, §4.3.3, §4.3.3, §4.5, Table 1, §4.
[43] (2019) Deep high-resolution representation learning for visual recognition. arXiv preprint arXiv:1908.07919. Cited by: §2.
[44] (2019) FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10734–10742. Cited by: §2, §3.3, §3.4, §3.4, §4.3.1, §4.3.3, Table 1.
 [45] (2019) Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569. Cited by: §1.
 [46] (2018) SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926. Cited by: §3.4, §5.
 [47] (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §4.1.
 [48] (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §1, §2.
[49] (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710. Cited by: §1, §2, Table 1.