Efficient Differentiable Neural Architecture Search with Meta Kernels

12/10/2019 · by Shoufa Chen, et al. · National University of Singapore

The searching procedure of neural architecture search (NAS) is notoriously time consuming and cost prohibitive. To make the search space continuous, most existing gradient-based NAS methods relax the categorical choice of a particular operation to a softmax over all possible operations and calculate the weighted sum of multiple features, resulting in a large memory requirement and a huge computation burden. In this work, we propose an efficient and novel search strategy with meta kernels. We directly encode the supernet from the perspective of convolution kernels and "shrink" multiple convolution kernel candidates into a single one before these candidates operate on the input feature. In this way, only a single feature is generated between two intermediate nodes. The memory for storing intermediate features and the resource budget for conducting convolution operations are both reduced remarkably. Despite its high efficiency, our search strategy can search in a more fine-grained way than existing works and increases the capacity for representing possible networks. We demonstrate the effectiveness of our search strategy by conducting extensive experiments. Specifically, our method achieves 77.0% top-1 accuracy on ImageNet, outperforming both EfficientNet and MobileNetV3 under the same FLOPs constraints. Compared to models discovered by state-of-the-art NAS methods, our method achieves the same (sometimes even better) performance, while being faster by three orders of magnitude.


1 Introduction

Neural architecture search (NAS) has attracted lots of attention recently [48, 16, 29, 13]. However, its prohibitive time and computational resource cost is a remarkable problem that prevents its deployment in many realistic scenarios. For example, the reinforcement learning (RL) based NAS method [49] requires 2000 GPU days and the evolutionary algorithm based method [32] requires 3150 GPU days. Recent differentiable search methods, e.g., DARTS [29], reduce the cost to some extent. However, DARTS still requires 96 GPU hours to search on the small proxy dataset CIFAR-10, and it is impractical to search on large-scale datasets like ImageNet [8] directly.

The inefficiency of DARTS results from its strategy of aggregating multiple features generated by different candidate operations. Following [29, 45], we use a directed acyclic graph (DAG) to represent the network architecture and let the node/edge terminology denote the latent representation/candidate operation, respectively, in the network. As illustrated in Figure 2 (left), multiple convolution candidates operate on the same input feature map and each generates its own feature map. The final output is the aggregation of these feature maps via a weighted sum. Conducting multiple convolution operations and storing the generated features bring a huge computation burden and memory cost.

Figure 2: Illustration of the original DARTS [29] supernet formulation (left) and our proposed efficient search formulation strategy (right).

In this work, we propose a novel and simple search method that reduces the cost of the searching procedure significantly. The key idea of our method is to calculate the weighted sum of convolution kernels rather than output features, as illustrated in Figure 2 (right). We propose this strategy by exploiting the "additivity" of convolution, which has been discussed in ACNet [9] recently. The "additivity" states that if several 2D kernels with compatible sizes operate on the same input to produce outputs of the same resolution and their outputs are summed up, we can add up these kernels on the corresponding positions to obtain an equivalent kernel that will produce the same output [9]. However, kernels with different shapes cannot be added up directly. To solve this problem, we propose a novel strategy, named probabilistic kernel mask, which masks off the invalid area of a bigger kernel to represent a smaller kernel, as described in the following section.

Based on the above "additivity" property, we develop a novel design for encoding the supernet that enables us to conduct the convolution operation on the input feature only once and obtain a single feature map between two intermediate nodes. Thus, the computation and memory cost can be reduced significantly compared to previous search methods [29, 25, 4]. Our work suggests a new transition in searching for appropriate architectures: from evaluating feature maps to evaluating the convolution kernels that generate them.

Because we take the convolution kernels as our direct search target, we can search in a more fine-grained way. Specifically, we can search for a convolution consisting of different kernel sizes. Using different kernel sizes within a single convolution increases the range of receptive fields, which means we can incorporate multi-scale information within a single layer without changing the model's macro architecture. The idea of multi-scale representation has drawn great interest in the computer vision community and has been applied to many vision tasks, such as classification [37, 20, 6], object detection [26, 35] and semantic segmentation [2, 27]. Most of these works obtain multi-scale information by fusing multiple feature maps with different resolutions. In this work, we exploit multi-scale information from the perspective of convolution kernel sizes.

This work makes the following contributions:

  • We propose a novel searching strategy that directly encodes the supernet from the perspective on convolution kernels and “shrink” multiple convolution kernel candidates into a single one before these candidates operate on the input feature. The memory for storing intermediate features and the resource budget for conducting convolution operations are both reduced remarkably.

  • Our search strategy is able to search in a more fine-grained way than previous methods and mix up multiple kernel sizes within a single convolution without the constraint in MixNet [42]. The search space and the capacity for representing possible networks are significantly enlarged.

  • Extensive experiments on both classification and object detection tasks are conducted. The results show that the proposed search method can discover new state-of-the-art light-weight CNNs while reducing the search cost by about three orders of magnitude compared to the existing SOTA.

2 Related Work

Efficient Search methods

Neural architecture search (NAS) has relieved substantial handcrafted effort in designing neural network architectures and has been explored in many computer vision tasks, such as classification [48, 49, 29], detection [15, 5], semantic segmentation [27, 15] and GANs [16]. However, the prohibitive cost of NAS is still a remarkable problem. For example, the reinforcement learning (RL) based NAS method [49] requires 2000 GPU days and the evolutionary algorithm based method [32] requires 3150 GPU days.

The gradient-based methods relax a discrete architecture choice to a continuous search space, allowing the architecture to be searched using gradient descent [29, 44, 12, 10]. Although gradient-based methods are more efficient than RL and evolution based ones, the adopted relaxation still brings a heavy computation and memory burden for calculating and storing the multiple features generated by all possible candidates. ProxylessNAS [1] proposes to binarize the architecture parameters and force only one path to be active at run time, which reduces the required memory, though it requires GPU memory management. However, at least 200 GPU hours are still needed in [1].

Single-Path NAS [36] is a differentiable search algorithm with only a single path between two intermediate nodes. It views the small kernel as the core of the large kernel. Single-Path NAS chooses a candidate based on the L2 norm of the convolution weights. Specifically, it formulates a condition function in which the L2 norm of the convolution weights is compared to a threshold that controls the choice of convolution kernels. Our proposed method is very different from Single-Path NAS. First, we directly use explicit architecture parameters to represent the importance of all candidates, while Single-Path NAS relies on the comparison between convolution weights and threshold values. Second, the complexity of the condition function used in Single-Path NAS increases linearly with the number of kernel candidates, while our method is readily applicable to any number of candidates. Third, Single-Path NAS searches for a single kernel size within a convolution, rather than multiple kernel sizes as done in ours.

Multi-Scale Representation

Multi-scale representation has been widely explored in computer vision [26, 20, 6, 14]. Some works introduce multi-scale information from the perspective of the macro architecture and design models with multi-branch topologies [20, 43, 37]. Others propose to re-design the convolution operation [6, 14] and combine multi-scale information in a single convolutional layer, without modifying the macro architecture.

In the recent MixConv [42], Tan et al. also proposed to mix multiple kernel sizes within a convolution. However, the convolution candidates in their method are always split uniformly. Thus, the search space used in [42] is limited due to such a fixed allocation of different kernels. It is reasonable to expect that the convolutional layers' preferences for kernel size differ across the network, so keeping a fixed ratio at different depths would not achieve optimal performance. Nonetheless, it is not practicable to manually fine-tune an ad-hoc ratio for a specific layer because of the non-trivial burden introduced by endless trial and error. In this work, we remove the "uniform partition" constraint and search for every kernel independently, which means the search space is significantly enlarged; specifically, the search space used in MixNet [42] is a subset of ours. Free from this constraint, the convolution operations are mixed up more robustly and flexibly than in MixNet [42].

3 Method

We start with preliminaries on the additivity property of convolutions [9], which is the theoretical basis for our efficient search strategy. We then introduce the probability masks, which are designed for representing the supernet from the perspective of convolution kernels. Finally, in order to discover appropriate models under different computation resource budgets, we employ a resource-aware search objective function following [10].

3.1 Additivity of Convolution

Consider two 2D convolutional kernels $K^{(1)}$ and $K^{(2)}$ that operate on the same input $I$ separately. If these 2D kernels have the same stride and compatible sizes, the sum of their outputs can be obtained in an equivalent way: adding up these kernels on the corresponding positions to formulate a single kernel, and then convolving the input with this merged kernel to get the final output. Here, compatible means that the smaller kernel can be generated by slicing the larger kernel; for example, a $3\times 3$ kernel is compatible with a $5\times 5$ kernel [9]. Such "additivity" of convolution can be formally represented as

$$I \ast K^{(1)} + I \ast K^{(2)} = I \ast \left(K^{(1)} \oplus K^{(2)}\right), \tag{1}$$

where $\oplus$ denotes the element-wise addition of the kernel parameters on the corresponding positions and $+$ denotes the element-wise addition of the resulting features.

To the best of our knowledge, this is the first work that introduces the additivity of convolution to the NAS field. We show experimentally that exploiting this property helps reduce the searching time remarkably.
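As an illustration of this property, the following minimal PyTorch sketch (our own example, not code from the paper) numerically verifies Eq. (1): summing the outputs of a 5×5 kernel and a compatible 3×3 kernel equals a single convolution with their element-wise sum, once the smaller kernel is zero-padded to the larger size.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not the authors' code) verifying the additivity of
# convolution in Eq. (1) for a 5x5 kernel and a compatible 3x3 kernel.
torch.manual_seed(0)
x = torch.randn(1, 1, 16, 16)        # input feature map I
k5 = torch.randn(1, 1, 5, 5)         # kernel K^(1)
k3 = torch.randn(1, 1, 3, 3)         # kernel K^(2), compatible with 5x5

# Embed the 3x3 kernel at the centre of a 5x5 grid so the two kernels align.
k3_padded = F.pad(k3, (1, 1, 1, 1))

out_sum = F.conv2d(x, k5, padding=2) + F.conv2d(x, k3_padded, padding=2)
out_merged = F.conv2d(x, k5 + k3_padded, padding=2)

print(torch.allclose(out_sum, out_merged, atol=1e-5))  # True
```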

3.2 Meta Convolution Kernels

We propose an efficient search algorithm that significantly improves the efficiency of the search process based on the additivity of convolution discussed above. The key idea of our search strategy is to use the weighted sum of kernels, rather than the weighted sum of feature maps used in previous works [29, 4, 10], to represent the aggregation of the multiple outputs generated by all edges (candidate operations).

Let $\mathcal{K} = \{K_1, K_2, \dots, K_n\}$ denote a set of candidate kernels, where $k_i \times k_i$ represents the width and height of the $i$-th kernel $K_i$. We use the architecture parameters $\alpha = \{\alpha_1, \alpha_2, \dots, \alpha_n\}$ to encode the over-parameterized kernel at the search stage, where $\alpha_i$ represents the probability of selecting $K_i$ as the candidate.

3.2.1 Continuous Relaxation and Reformulation

The previous gradient-based NAS methods [29, 4, 10] relax the categorical choice to a softmax over multiple candidate operations. It can be formulated as:

$$O = \sum_{i=1}^{n} \frac{\exp(\alpha_i)}{\sum_{j=1}^{n} \exp(\alpha_j)} \, \big(I \ast K_i\big), \tag{2}$$

where $O$ denotes the output, a weighted sum of the features produced by the candidate operations. As stated above, multiple output features need to be calculated and stored between two nodes, and the weighted sum over these multiple features is taken as the final output of a node.

Based on the additivity of convolution in Eq. (1), we reformulate Eq. (2) as:

$$O = I \ast \left(\bigoplus_{i=1}^{n} \frac{\exp(\alpha_i)}{\sum_{j=1}^{n} \exp(\alpha_j)} \, K_i\right), \tag{3}$$

where $\bigoplus$ denotes the element-wise addition of the kernel parameters on the corresponding positions. Through such reformulation, we can combine multiple candidate kernels into a single one before they operate on the features. Thus, we only need to conduct the convolution operation once and generate a single output feature between two intermediate nodes, avoiding the intrinsic inefficiency introduced by multi-paths.
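A minimal sketch of how this kernel-level relaxation could look in PyTorch is given below. The helper name shrink_kernels and the depthwise toy setup are our own illustrative assumptions, not the authors' implementation: compatible kernels are zero-padded to the largest size and combined with softmax weights before a single convolution is applied.

```python
import torch
import torch.nn.functional as F

def shrink_kernels(kernels, alphas):
    """Combine compatible square kernels into a single kernel via a
    softmax-weighted sum (in the spirit of Eq. 3), so that one convolution
    replaces the multi-path weighted sum of feature maps.
    Each kernel has shape (C_out, C_in, k_i, k_i)."""
    max_k = max(k.shape[-1] for k in kernels)
    weights = F.softmax(alphas, dim=0)
    merged = 0.0
    for w, k in zip(weights, kernels):
        pad = (max_k - k.shape[-1]) // 2
        merged = merged + w * F.pad(k, (pad, pad, pad, pad))  # centre-align, zero-pad
    return merged

# Toy usage: three depthwise candidates (3x3, 5x5, 7x7) for 8 channels.
alphas = torch.zeros(3, requires_grad=True)            # architecture parameters
kernels = [torch.randn(8, 1, k, k) for k in (3, 5, 7)]
x = torch.randn(1, 8, 32, 32)
y = F.conv2d(x, shrink_kernels(kernels, alphas), padding=3, groups=8)
print(y.shape)  # torch.Size([1, 8, 32, 32])
```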

3.2.2 Candidate Kernel Formulation

Now, we introduce details of our search strategy. It consists of three steps. The first step is to determine the meta kernels; the second step is to generate probabilistic masks over the meta kernels; and the third step is to sample all the candidate kernels from meta kernels by the probabilistic masks.

Step 1: Build meta kernels

We first build a special kernel $\hat{K}$ with the shape of

$$k_{\max} \times k_{\max}, \quad k_{\max} = \max\{k_1, k_2, \dots, k_n\}, \tag{4}$$

where $k_i$ is the size of the $i$-th candidate kernel. This implies that all kernels in the set $\mathcal{K}$ are compatible with $\hat{K}$. We name this kernel the meta kernel because all of the candidates in the set originate from it. For example, for a candidate set of $\{3\times 3, 5\times 5, 7\times 7\}$ kernels, the corresponding meta kernel has the shape of $7\times 7$.

Figure 3: Illustration of two mask examples. The area in green is the mapping of the corresponding kernel $K_i$ in the mask $M_i$; we denote this mapping area as the RoI of $K_i$. For a specific mask, all elements in the RoI area are filled with the same probability value. Elements in the diagonally hatched grey area are filled with zero. Best viewed in color.
Step 2: Learn the probability mask

Given the kernel candidate set $\mathcal{K}$, there is a corresponding mask set $\mathcal{M} = \{M_1, M_2, \dots, M_n\}$, which serves as the intermediary for over-parameterizing the candidate kernels with the architecture parameters $\alpha$. Each mask $M_i$ has the same shape as the meta kernel $\hat{K}$, i.e., $k_{\max} \times k_{\max}$. The elements of $M_i$ are defined as:

$$M_i[x, y] = \begin{cases} p_i, & (x, y) \in \mathrm{RoI}(K_i) \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$

where $p_i$ is the sampling probability of $K_i$ at the search stage, and the RoI of $K_i$ is defined as the mapping of $K_i$ in the mask $M_i$, as illustrated in Figure 3. The mapping area in $M_i$ is determined following two principles: (1) the center of the RoI is located at the center of $M_i$, and (2) the shape of the RoI is the same as its corresponding kernel candidate $K_i$. Note that the extra memory introduced by $\mathcal{M}$ is negligible compared to that introduced by the feature maps from multi-paths, as used in previous works.

Step 3: Generate all the candidate kernels

Now, every candidate kernel $K_i$ can be generated by multiplying its corresponding mask $M_i$ and the meta kernel $\hat{K}$, i.e., $K_i = M_i \odot \hat{K}$. Based on the above formulation, we add an extra mask $M_0$ into the mask set $\mathcal{M}$, which serves to control the total number of filters in a layer. We name $M_0$ the None mask, as all of its elements are equal to zero. With the help of $M_0$, some redundant filters can be pruned at the search stage.

Note that in the above discussion, for the sake of simplicity, we take the search process of a single filter within a convolution layer as an example. However, it is easy to extend to all filters because every kernel is treated independently at the search stage. Furthermore, benefiting from our fine-grained search strategy, both the vanilla depthwise convolution and the mixed convolution proposed in [42] are special cases of our search space.
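To make the mask construction concrete, the sketch below shows one possible way to build the probability masks and generate candidate kernels from a single meta kernel. The function build_masks, the candidate sizes, and the probability values are illustrative assumptions rather than the authors' code; in the actual method the probabilities come from the architecture parameters (Eqs. 8 and 9).

```python
import torch

def build_masks(candidate_sizes, probs, k_max):
    """Build one k_max x k_max mask per candidate kernel size, plus an all-zero
    'None' mask. Each mask's centred RoI of size k_i x k_i is filled with the
    sampling probability p_i; everything outside the RoI is zero (Eq. 5)."""
    masks = [torch.zeros(k_max, k_max)]              # M_0: the 'None' mask
    for k, p in zip(candidate_sizes, probs):
        m = torch.zeros(k_max, k_max)
        start = (k_max - k) // 2
        m[start:start + k, start:start + k] = p      # RoI centred in the mask
        masks.append(m)
    return torch.stack(masks)

# Illustrative probabilities; during search they come from Eq. (8)/(9).
meta_kernel = torch.randn(7, 7)                      # meta kernel for {3x3, 5x5, 7x7}
masks = build_masks([3, 5, 7], probs=[0.2, 0.5, 0.3], k_max=7)
candidates = masks * meta_kernel                     # K_i = M_i (element-wise) meta kernel
merged = candidates.sum(dim=0)                       # shrink into a single kernel
print(masks.shape, merged.shape)                     # torch.Size([4, 7, 7]) torch.Size([7, 7])
```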

3.3 Search with Cost-aware Objective

In order to let our proposed method generate models adaptively under different circumstances, we incorporate a cost-aware constraint into our objective to formulate a multi-objective search algorithm. Formally, we use FLOPs as the proxy of the computation consumption, and the corresponding searching loss is defined as

$$\mathcal{L}_{\text{cost}} = \max\!\left(\big|\mathrm{FLOPs}(a) - T\big| - \delta,\; 0\right), \tag{6}$$

where $T$ is the computation cost budget, which can be adapted according to different needs, $\mathrm{FLOPs}(\cdot)$ counts the FLOPs of a specific architecture $a$ sampled from the search space at the search stage, and $\delta$ is a slack variable. Since the FLOPs of a sampled network is a discrete value, it is reasonable to confine the FLOPs to a small range around the target rather than a single point. We regard FLOPs as our cost-aware supervision in this work; other metrics, such as the latency used in [44, 40, 24], can readily replace FLOPs as the objective.
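For illustration, one plausible way to implement such a FLOPs-band penalty is sketched below: the penalty is zero when the sampled architecture's FLOPs fall within a slack band around the target and grows linearly outside it. The function name and the exact functional form are our assumptions; the paper's precise formulation may differ.

```python
import torch

def cost_aware_loss(flops, target, slack):
    """Zero inside the band [target - slack, target + slack], linear outside.
    An assumed form of the cost-aware penalty, not necessarily the paper's."""
    return torch.clamp(torch.abs(flops - target) - slack, min=0.0)

print(cost_aware_loss(torch.tensor(280e6), target=260e6, slack=10e6))  # tensor(10000000.)
```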

3.4 Differentiable Search Algorithm

By introducing the cost-aware loss $\mathcal{L}_{\text{cost}}$, we search for the network architectures that minimize the following multi-objective loss:

$$\min_{a,\, w_a} \; \mathcal{L}_{\text{CE}}(a, w_a) + \lambda\, \mathcal{L}_{\text{cost}}(a), \tag{7}$$

where $a$ represents an architecture in the search space, $w_a$ denotes the convolution weights of the corresponding model, $\mathcal{L}_{\text{CE}}$ is the classification (cross-entropy) loss, and $\lambda$ balances the two terms. We adopt the differentiable search method to solve the problem of finding the optimal kernels.

The probability of sampling the $i$-th kernel candidate in Eq. (5) is computed as

$$p_i = \frac{\exp(\alpha_i)}{\sum_{j} \exp(\alpha_j)}. \tag{8}$$

Instead of directly relaxing the categorical choice of a particular kernel to a softmax over all possible candidates as in Eq. (3), we formulate the search stage as a sampling process, as done in [10, 44].

  Input: Search space $\mathcal{A}$, FLOPs target $T$, randomly initialized architecture parameters $\alpha$ and convolution kernel parameters $K$, training dataset $\mathcal{D}$
  while not converged do
     1. Generate all kernel candidates from the meta kernel by means of the probability mask set $\mathcal{M}$
     2. Aggregate the multiple paths into a single one based on Eq. (3)
     3. Calculate the loss based on Eq. (7)
     4. Update the kernel weights $K$ by descending the gradient $\nabla_{K}\mathcal{L}$
     5. Update the architecture parameters $\alpha$ by descending the gradient $\nabla_{\alpha}\mathcal{L}$
  end while
  Derive the final kernel combination from the learned $\alpha$.
Algorithm 1 The metaKernel search algorithm

Although the objective function is differentiable with respect to the kernel weights $K$, it is not differentiable with respect to the architecture parameters $\alpha$ due to the sampling process. In order to sidestep this problem, we adopt the Gumbel-Softmax function [31, 22], as used in recent NAS works [44, 1, 10, 46]. The sampling probability in Eq. (8) can be rewritten as

$$p_i = \frac{\exp\!\big((\log \pi_i + g_i)/\tau\big)}{\sum_{j}\exp\!\big((\log \pi_j + g_j)/\tau\big)}, \tag{9}$$

where $g_i$ is sampled from the Gumbel(0, 1) distribution, $\pi_i$ is the class probability of the categorical distribution calculated by Eq. (8), and $\tau$ is the softmax temperature.
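The snippet below is a minimal sketch of this reparameterization using PyTorch's built-in gumbel_softmax; the logits, temperature and dummy loss are placeholder values, not the paper's settings.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: Gumbel-Softmax makes the discrete kernel sampling in Eq. (9)
# differentiable with respect to the architecture parameters alpha.
alphas = torch.zeros(4, requires_grad=True)              # e.g., 3 kernel sizes + 'None' mask
probs = F.gumbel_softmax(alphas, tau=1.0, hard=False)    # soft, differentiable sample

loss = (probs * torch.randn(4)).sum()                    # dummy loss over the sampled weights
loss.backward()                                          # gradients flow back to alphas
print(probs, alphas.grad)
```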

After the searching process, we can derive the architecture from the architecture parameters $\alpha$. Our pipeline is summarized in Algorithm 1. We will show in the experiment section that our proposed search algorithm costs orders of magnitude less search time than previous RL based and gradient-based multi-path NAS methods, while achieving better performance.
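Putting the pieces together, the following sketch outlines one search iteration in the spirit of Algorithm 1. All helper names (build_masks from the earlier sketch, model_forward, estimate_flops) and the hyper-parameters are hypothetical placeholders, not the authors' implementation; in practice the kernel weights and the architecture parameters may be updated by separate optimizers.

```python
import torch
import torch.nn.functional as F

def search_step(images, labels, meta_kernel, alphas, optimizer,
                model_forward, estimate_flops, target_T, slack, lam, tau=1.0):
    """One illustrative search iteration (Algorithm 1, steps 1-5)."""
    probs = F.gumbel_softmax(alphas, tau=tau)                # Eq. (9): sample probabilities
    # probs[0] corresponds to the all-zero 'None' mask and contributes nothing.
    masks = build_masks([3, 5, 7], probs[1:], meta_kernel.shape[-1])   # step 1
    kernel = (masks * meta_kernel).sum(dim=0)                # step 2: shrink to a single kernel
    logits = model_forward(images, kernel)                   # one convolution per edge
    cost = torch.clamp((estimate_flops(kernel) - target_T).abs() - slack, min=0.0)
    loss = F.cross_entropy(logits, labels) + lam * cost      # step 3: Eq. (7)
    optimizer.zero_grad()
    loss.backward()                                          # steps 4-5: update K and alpha
    optimizer.step()
    return loss.item()
```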

4 Experiments

In this section, we aim to validate the effectiveness of our proposed search method. We first conduct ablation studies to investigate the effectiveness of mixing multiple kernel sizes without any constraints, which is more flexible than MixConv [42]. Then, we compare our searched models with state-of-the-art models, both manually designed and discovered by NAS methods. Finally, we conduct object detection experiments to show the advantage of our models as backbone feature extractors.

4.1 Implementation Details

We conduct experiments on the widely used ImageNet [8] benchmark. We use standard data augmentation, including random horizontal flipping with probability 0.5, scaling of hue/saturation/brightness, and resizing and cropping, following [17]. We do not use mixup [47] or AutoAugment [7] for a fair comparison. The models are trained for 250 epochs from scratch, as done in [29, 10, 4]. We train the models on 8 Nvidia 2080Ti GPUs with a total batch size of 1024. The learning rate is initialized as 0.65 and decayed to 0 at the end of the training stage, following the cosine rule. We also apply weight decay. At the evaluation phase, we adopt the popular setting of resizing the image and then center cropping a single patch. The two remaining hyper-parameters are set to 2.0 and 0.1, respectively.

4.2 Ablation Study

Because we search for an appropriate ratio of different kernel sizes within a single depthwise convolution, we conduct a series of ablation studies to demonstrate that our proposed method achieves a better FLOPs-accuracy trade-off than both the vanilla depthwise convolution and MixConv [42], where multiple kernels are mixed up with a manually designed partition.

4.2.1 Settings

Following MixConv [42], we design three kinds of baseline settings to implement the depthwise convolution:

  1. Single kernel size within a depthwise convolution.

  2. Multiple kernel sizes in a uniform partition way.

  3. Multiple kernel sizes in an exponential partition way.

Note that the above three baseline models are three special cases of our search space because our search algorithm aims to find the proper ratio of different kernels within a single convolution operation.

To perform an apples-to-apples comparison, we reproduce all baseline methods under the same training/testing settings for internal ablation studies. Following MixConv [42], we conduct all experiments on the widely used MobileNetV1 [18] network.

For baseline A, we start with the original MobileNetV1 and then replace its 3×3 depthwise convolutions with ones of a single larger kernel size. For baselines B and C, we adjust the number of kernel types from 1 to 6; the kernel sizes increase from 3×3 with a step of 2. For example, when the number of kernel types is 6, the corresponding kernel candidate set is {3×3, 5×5, 7×7, 9×9, 11×11, 13×13}. The candidate sets on which we conduct our search method are the same as for baselines B and C for a fair comparison.

Figure 4: Comparison results on ImageNet. The searched model from our method achieves higher accuracy than both vanilla depthwise conv and the MixConv [42] under the same FLOPs requirements.

4.2.2 Results

The experimental results are illustrated in Figure 4. For baseline A, similar to [42], we find that the model's top-1 accuracy goes up as the kernel size is enlarged, but starts dropping once the kernel size becomes too large. This can be explained by the extreme case in which the kernel size equals the size of the input feature: the convolutional layer then simply becomes a fully-connected layer, which is known to be harmful to performance [20, 42].

For baselines B and C, we observe that depthwise convolution with multiple kernel sizes achieves a better FLOPs-accuracy trade-off than the vanilla depthwise convolution, and the performances of baselines B and C are similar under the same FLOPs. Besides, baseline B can be seen as a special case of uniform sampling.

Furthermore, with the same kernel candidate set, our discovered models outperform both the uniform and the exponential allocation of kernels of different sizes under the same FLOPs constraint. We attribute the performance gain to the finer granularity of our search approach, which can choose a suitable ratio of kernel sizes at different depths of the architecture.

4.3 Comparing with SOTAs

To further demonstrate the effectiveness of our search method, we compare it with state-of-the-art NAS methods.

4.3.1 Settings

Following [44, 36, 1], we adopt the inverted residual bottleneck [34] (MBConv) as our macro structure. The MBConv block is a sequence of a 1×1 pointwise (expansion) convolution, a depthwise convolution, and a 1×1 (projection) convolution. Different from previous works that search for a single kernel size in a depthwise convolution layer, our method searches for multiple kernel sizes.

The recent MixNet [42] also proposes to search among mixed kernels. However, the ratio of different kernel sizes in its search space is fixed to a uniform split. In our search space, there are no constraints on the ratio of different kernel sizes, so the search space is further enlarged; the search space in MixNet [42] is a subset of ours. The experimental results also show that our searched models achieve a better performance-cost trade-off than MixNet.

4.3.2 FLOPs vs. Accuracy

Evaluation results of our proposed metaKernel models and comparisons with state-of-the-art approaches are summarized in Table 1 and Figure 5. metaKernel-A and metaKernel-B are obtained by setting different resource targets T in Eq. (6). We set the target values as 260M and 370M FLOPs respectively, which are intentionally chosen around the FLOPs of the state-of-the-art MixNet [42] models for a fair comparison under similar FLOPs.

As shown in Table 1, our metaKernel-A achieves 75.9% Top-1/92.9% Top-5 accuracy with 254M FLOPs, and metaKernel-B achieves 77.0% Top-1/93.4% Top-5 accuracy with 357M FLOPs. They outperform state-of-the-art manually designed models by a large margin. Specifically, our metaKernel-A is better than MobileNetV2 (+3.8%) and ShuffleNetV2 (+3.2%), with fewer FLOPs.

Figure 5: FLOPs versus Top-1 accuracy on ImageNet.

Compared to recently proposed automated models generated by NAS methods, our metaKernel models perform better under similar FLOPs. Specifically, compared to RL based methods, our metaKernel-A achieves 0.7% higher Top-1 accuracy than MnasNet-A1 [40] with 58M fewer FLOPs, 0.3% higher Top-1 accuracy than MnasNet-A2 with 86M fewer FLOPs, and 1.2% higher Top-1 accuracy than ProxylessNAS-R [1] with 66M fewer FLOPs. Compared with gradient-based methods, metaKernel-A is better than ProxylessNAS-G (+1.6%), Single-Path NAS [36] (+0.8%), and FBNet-A/B/C (+2.8%/+1.7%/+0.9%), respectively.

4.3.3 Searching Hours vs. Accuracy

The comparison of GPU hours used for the searching process is illustrated in Figure 1. Our search method is faster than most of the methods by a large margin. Compared to the recent state-of-the-art MixNet [42], which also uses multi-scale representations, our metaKernel-A achieves slightly higher accuracy (+0.1%) than MixNet-S, and metaKernel-B achieves the same accuracy as MixNet-M while costing 3M fewer FLOPs. Remarkably, for achieving very similar results, our search method needs about three orders of magnitude fewer GPU hours than MixNet.

As mentioned in [44, 1], MnasNet [40] does not report the exact GPU hours for its searching stage. In this work, we adopt the search cost of MnasNet estimated in ProxylessNAS [1]. As MobileNetV3 [19] and MixNet [42] use the same search framework as MnasNet, we roughly estimate their search cost to be similar to that of MnasNet.

| Model | Search method | Search space | Search dataset | GPU hours | #Params | #FLOPs | Top-1 acc (%) | Top-5 acc (%) |
|---|---|---|---|---|---|---|---|---|
| MobileNetV2 [34] | manual | - | - | - | 3.4M | 300M | 72.0 | 91.0 |
| MobileNetV2() | manual | - | - | - | 6.9M | 585M | 74.7 | 92.5 |
| ShuffleNetV2() [30] | manual | - | - | - | 3.5M | 299M | 72.6 | - |
| CondenseNet (G=C=4) [21] | manual | - | - | - | 2.9M | 274M | 71.0 | 90.0 |
| CondenseNet (G=C=8) | manual | - | - | - | 4.8M | 529M | 73.8 | 91.7 |
| EfficientNet-B0 [41] | manual | - | - | - | 5.3M | 390M | 76.3 | 93.2 |
| NASNet-A [49] | RL | cell | CIFAR-10 | 48K | 5.3M | 564M | 74.0 | 91.6 |
| PNASNet [28] | SMBO | cell | CIFAR-10 | 6K | 5.1M | 588M | 74.2 | 91.9 |
| AmoebaNet-A [32] | evolution | cell | CIFAR-10 | 75K | 5.1M | 555M | 74.5 | 92.0 |
| DARTS [29] | gradient | cell | CIFAR-10 | 96 | 4.7M | 574M | 73.3 | 91.3 |
| P-DARTS [4] | gradient | cell | CIFAR-10 | 7.2 | 4.9M | 557M | 75.6 | 92.6 |
| GDAS [10] | gradient | cell | CIFAR-10 | 4.08 | 4.4M | 497M | 72.5 | 90.9 |
| MnasNet-A1 [40] | RL | stage-wise | ImageNet | 40K | 3.9M | 312M | 75.2 | 92.5 |
| MnasNet-A2 | RL | stage-wise | ImageNet | 40K | 4.8M | 340M | 75.6 | 92.7 |
| Single-Path NAS [36] | gradient | layer-wise | ImageNet | 30 | 4.3M | 365M | 75.0 | 92.2 |
| ProxylessNAS-R [1] | RL | layer-wise | ImageNet | 200 | 4.1M | 320M | 74.6 | 92.2 |
| ProxylessNAS-G | gradient | layer-wise | ImageNet | 200 | - | - | 74.2 | 91.7 |
| FBNet-A [44] | gradient | layer-wise | ImageNet | 216 | 4.3M | 249M | 73.0 | - |
| FBNet-B [44] | gradient | layer-wise | ImageNet | 216 | 4.5M | 295M | 74.1 | - |
| FBNet-C [44] | gradient | layer-wise | ImageNet | 216 | 5.5M | 375M | 74.9 | - |
| MobileNetV3-Large [19] | RL | stage-wise | ImageNet | 40K | 5.4M | 219M | 75.2 | - |
| MobileNetV3-Large() | RL | stage-wise | ImageNet | - | 7.5M | 356M | 76.2 | - |
| MobileNetV3-Small | RL | stage-wise | ImageNet | 40K | 2.9M | 66M | 67.4 | - |
| MixNet-S [42] | RL | kernel-wise | ImageNet | 40K | 4.1M | 256M | 75.8 | 92.8 |
| MixNet-M | RL | kernel-wise | ImageNet | 40K | 5.0M | 360M | 77.0 | 93.3 |
| metaKernel-A (ours) | gradient | kernel-wise | ImageNet | 40 | 5.8M | 254M | 75.9 | 92.9 |
| metaKernel-B (ours) | gradient | kernel-wise | ImageNet | 40 | 7.2M | 357M | 77.0 | 93.4 |
Table 1: Performance of metaKernel and state-of-the-art baseline architectures on ImageNet. For baseline models, we directly cite the parameter size, FLOPs, and Top-1/Top-5 accuracy on the ImageNet validation set from their original papers. Search cost data without a superscript is also taken from the original papers. The superscript notes indicate: the search cost for MnasNet is obtained from [1], in which Han et al. tested on V100 GPUs with the configuration described in [40]; both MobileNetV3 [19] and MixNet [42] use the same search framework as MnasNet [40], so their search cost is roughly estimated based on MnasNet [40]; some entries denote TPU hours; and some data is estimated in [44].
backbone FLOPs mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike persn plant sheep sofa train tv
MBv1 10.16G 75.9 83.9 79.3 75.1 65.8 55.9 84.3 85.7 85.4 58.4 80.9 70.4 82.0 84.9 84.5 79.7 48.3 77.8 76.6 84.1 74.3
MBv2 9.10G 75.8 84.5 83.4 76.1 68.3 58.7 78.9 84.8 86.5 54.4 80.7 70.9 84.0 85.0 83.6 76.8 48.7 78.7 72.8 85.0 73.7
MBv3-Small 8.19G 69.3 77.4 76.9 67.0 62.0 43.7 76.3 79.1 82.1 47.2 75.4 65.2 78.4 81.0 79.7 72.5 39.4 68.7 67.1 76.1 70.0
MBv3-Large 8.78G 76.7 84.1 84.1 77.0 69.9 57.9 84.8 85.1 88.1 56.3 84.8 64.8 84.3 87.9 84.7 77.0 46.3 80.5 73.9 86.2 76.8
MBv3-Large 9.28G 77.7 86.1 84.1 76.8 71.6 60.9 85.6 86.8 88.8 55.6 84.2 70.2 85.7 86.2 85.0 77.6 47.5 81.6 73.5 88.1 77.6
metaKernel-tiny 8.73G 76.2 79.6 83.5 76.4 68.1 54.3 83.2 85.6 87.1 56.0 82.8 71.9 84.9 85.0 85.7 76.4 47.1 81.7 75.0 85.5 74.7
metaKernel-A 8.91G 77.3 86.1 84.8 76.8 68.6 59.2 83.6 86.3 87.1 56.9 85.2 67.2 86.6 87.2 86.0 77.7 49.1 80.8 74.5 86.3 76.7
metaKernel-B 9.27G 78.0 86.8 84.5 77.5 69.8 58.2 85.3 86.4 88.3 60.4 84.5 72.9 85.8 86.7 86.8 78.0 51.3 80.0 73.9 86.4 77.4
Table 2: Results on VOC2007 test. The second MBv3-Large entry (9.28G FLOPs) denotes MobileNetV3-Large with multiplier 1.25. Note that the FLOPs of the total detection network are dominated by the detection head.

4.4 Object Detection

To further validate the effectiveness of our metaKernel models, we conduct object detection experiments on the PascalVOC [11] dataset. Following the widely used protocol, we combine the VOC2007 trainval and VOC2012 trainval sets as the training data and test the performance of our models on VOC2007 test. We adopt our metaKernel models as drop-in backbone feature extractors in YOLOv3 [33]. All backbone models are pre-trained on ImageNet and fine-tuned on PascalVOC for 200 epochs.

We train the model in three stages: 160 epochs first, followed by two further stages of 20 epochs each. The results on VOC2007 test are shown in Table 2. Our metaKernel-tiny model outperforms the MobileNetV1 [18] and MobileNetV2 [34] feature extractors by 0.3 mAP and 0.4 mAP, respectively, while consuming fewer FLOPs. Our metaKernel-B performs better than the larger MobileNetV3 variant, which has nearly the same FLOPs as our model.

4.5 Visualization on Kernel Size Distribution

As our search algorithm can automatically determine the number of kernels of each size, we investigate the intrinsic preference of CNNs for kernel sizes. We plot the distribution of each kernel size in Figure 6. We observe that at shallow layers the network tends to choose smaller kernel sizes, while as the layers go deeper, large kernels begin to occupy a larger proportion. Interestingly, our findings are consistent with those of MixConv [42]. These findings may inspire future work on understanding CNNs.

Figure 6: Illustration of the distribution of different kernel sizes across the network. Small kernels account for a large proportion in the shallow layers, and larger kernels are preferred in the deep layers.

5 Discussion

In this work, we treat every kernel within a single depthwise convolution independently and search for a mixed convolution. In this way, our proposed search strategy can search in a more fine-grained manner. Furthermore, our method is also compatible with atrous convolution [3] and asymmetric convolution [23, 38, 39], following the same rule as discussed in Sec. 3.2.2 and not requiring any other adaptation. Hence, our method can be plugged into existing works [29, 46, 10] to further improve their searching efficiency.

6 Conclusion

In this work, we propose an efficient search strategy that reduces the search cost dramatically. We encode the supernet from the perspective of convolution kernels rather than feature maps, which reduces the memory and computation requirements remarkably. Specifically, our search process is faster than MnasNet by about three orders of magnitude. Our proposed method digs into a more fine-grained search space, i.e., convolutional kernels. We demonstrate experimentally that our discovered models achieve better performance on ImageNet under the same computation resource constraints. We hope that our research will be beneficial in accelerating the search procedure and further promote the development of NAS.

References

  • [1] H. Cai, L. Zhu, and S. Han (2018) Proxylessnas: direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332. Cited by: §2, §3.4, §4.3.1, §4.3.2, §4.3.3, Table 1.
  • [2] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §1.
  • [3] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §5.
  • [4] X. Chen, L. Xie, J. Wu, and Q. Tian (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. arXiv preprint arXiv:1904.12760. Cited by: §1, §3.2.1, §3.2, §4.1, Table 1.
  • [5] Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan, and J. Sun (2019) Detnas: neural architecture search on object detection. arXiv preprint arXiv:1903.10979. Cited by: §2.
  • [6] Y. Chen, H. Fang, B. Xu, Z. Yan, Y. Kalantidis, M. Rohrbach, S. Yan, and J. Feng (2019) Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution. arXiv preprint arXiv:1904.05049. Cited by: §1, §2.
  • [7] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2018) Autoaugment: learning augmentation policies from data. arXiv preprint arXiv:1805.09501. Cited by: §4.1.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §4.1.
  • [9] X. Ding, Y. Guo, G. Ding, and J. Han (2019) ACNet: strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1911–1920. Cited by: §1, §3.1, §3.
  • [10] X. Dong and Y. Yang (2019) Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1761–1770. Cited by: §2, §3.2.1, §3.2, §3.4, §3.4, §3, §4.1, Table 1, §5.
  • [11] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §4.4.
  • [12] J. Fang, Y. Sun, Q. Zhang, Y. Li, W. Liu, and X. Wang (2019) Densely connected search space for more flexible neural architecture search. arXiv preprint arXiv:1906.09607. Cited by: §2.
  • [13] C. Gao, Y. Chen, S. Liu, Z. Tan, and S. Yan (2019) AdversarialNAS: adversarial neural architecture search for gans. arXiv preprint arXiv:1912.02037. Cited by: §1.
  • [14] S. Gao, M. Cheng, K. Zhao, X. Zhang, M. Yang, and P. Torr (2019) Res2Net: a new multi-scale backbone architecture. arXiv preprint arXiv:1904.01169. Cited by: §2.
  • [15] G. Ghiasi, T. Lin, and Q. V. Le (2019) Nas-fpn: learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7036–7045. Cited by: §2.
  • [16] X. Gong, S. Chang, Y. Jiang, and Z. Wang (2019) AutoGAN: neural architecture search for generative adversarial networks. arXiv preprint arXiv:1908.03835. Cited by: §1, §2.
  • [17] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li (2019) Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567. Cited by: §4.1.
  • [18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §4.2.1, §4.4.
  • [19] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. arXiv preprint arXiv:1905.02244. Cited by: §4.3.3, Table 1.
  • [20] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger (2017) Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844. Cited by: §1, §2, §4.2.2.
  • [21] G. Huang, S. Liu, L. Van der Maaten, and K. Q. Weinberger (2018) Condensenet: an efficient densenet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2752–2761. Cited by: Table 1.
  • [22] E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §3.4.
  • [23] J. Jin, A. Dundar, and E. Culurciello (2014) Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474. Cited by: §5.
  • [24] X. Li, Y. Zhou, Z. Pan, and J. Feng (2019) Partial order pruning: for best speed/accuracy trade-off in neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9145–9153. Cited by: §3.3.
  • [25] H. Liang, S. Zhang, J. Sun, X. He, W. Huang, K. Zhuang, and Z. Li (2019) Darts+: improved differentiable architecture search with early stopping. arXiv preprint arXiv:1909.06035. Cited by: §1.
  • [26] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §1, §2.
  • [27] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 82–92. Cited by: §1, §2.
  • [28] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: Table 1.
  • [29] H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: Figure 2, §1, §1, §1, §2, §2, §3.2.1, §3.2, §4.1, Table 1, §5.
  • [30] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: Table 1.
  • [31] C. J. Maddison, A. Mnih, and Y. W. Teh (2016) The concrete distribution: a continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712. Cited by: §3.4.
  • [32] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §1, §2, Table 1.
  • [33] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §4.4.
  • [34] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §4.3.1, §4.4, Table 1.
  • [35] B. Singh, M. Najibi, and L. S. Davis (2018) SNIPER: efficient multi-scale training. In Advances in Neural Information Processing Systems, pp. 9310–9320. Cited by: §1.
  • [36] D. Stamoulis, R. Ding, D. Wang, D. Lymberopoulos, B. Priyantha, J. Liu, and D. Marculescu (2019) Single-path nas: designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877. Cited by: §2, §4.3.1, §4.3.2, Table 1.
  • [37] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212. Cited by: §1, §2.
  • [38] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §5.
  • [39] C. Tai, T. Xiao, Y. Zhang, X. Wang, et al. (2015) Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067. Cited by: §5.
  • [40] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §3.3, §4.3.2, §4.3.3, Table 1.
  • [41] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: Table 1.
  • [42] M. Tan and Q. V. Le (2019) MixConv: mixed depthwise convolutional kernels. CoRR, abs/1907.09595. Cited by: 2nd item, §2, §3.2.2, Figure 4, §4.2.1, §4.2.1, §4.2.2, §4.2, §4.3.1, §4.3.2, §4.3.3, §4.3.3, §4.5, Table 1, §4.
  • [43] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, et al. (2019) Deep high-resolution representation learning for visual recognition. arXiv preprint arXiv:1908.07919. Cited by: §2.
  • [44] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019) Fbnet: hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10734–10742. Cited by: §2, §3.3, §3.4, §3.4, §4.3.1, §4.3.3, Table 1.
  • [45] S. Xie, A. Kirillov, R. Girshick, and K. He (2019) Exploring randomly wired neural networks for image recognition. arXiv preprint arXiv:1904.01569. Cited by: §1.
  • [46] S. Xie, H. Zheng, C. Liu, and L. Lin (2018) SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926. Cited by: §3.4, §5.
  • [47] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §4.1.
  • [48] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §1, §2.
  • [49] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §1, §2, Table 1.