Fine-Grained Neural Architecture Search

11/18/2019 ∙ by Heewon Kim, et al. ∙ Qualcomm, Seoul National University

We present an elegant framework for fine-grained neural architecture search (FGNAS), which allows multiple heterogeneous operations to be employed within a single layer and can even generate compositional feature maps using several different base operations. FGNAS runs efficiently despite a significantly larger search space than other methods because it trains networks end-to-end with a stochastic gradient descent method. Moreover, the proposed framework allows the network to be optimized under predefined resource constraints in terms of the number of parameters, FLOPs, and latency. FGNAS has been applied to two crucial and resource-demanding computer vision tasks, large-scale image classification and image super-resolution, and demonstrates state-of-the-art performance through flexible operation search and channel pruning.


1 Introduction

Deep convolutional neural networks (CNNs) have recently achieved great success in various fields including computer vision, natural language processing, pattern recognition, bioinformatics, and many others. However, the arbitrary complexity of target problems and the requirement of extensive hyperparameter search make it inevitable to manually explore the ideal deep network architectures customized for the given tasks. Consequently, neural architecture search (NAS) approaches have been studied actively, and the models identified by NAS techniques [57, 39, 47, 18] have started to surpass the performance of traditional deep neural networks [43, 16, 24] designed by humans. Despite such successful results, it is still challenging to optimize deep neural networks even with sophisticated AutoML techniques because the search space of existing NAS methods is limited while their search cost is high.


Figure 1: Overview of our search method. (a) Flexibility comparison of searched architectures: FGNAS performs operation search for each channel, which leads to more flexible network architectures than other approaches. The example illustrates the architecture search result in a single layer with 5 channels, where color encodes an operation. Note that the flexibility of the output models depends on the search granularity of the individual algorithms. (b) Our definition of a convolution operation: we define an operation as a sequence of a convolution, a normalization, and an activation function, denoted by $\mathcal{O}$. Note that the operation search even includes channel pruning, which is equivalent to no-operation.
| | AMC [18] | NetAdapt [52] | Huang et al. [26] | MnasNet [47] | ProxylessNAS & FBNet [5, 50] | FGNAS (Ours) |
| Structure search (prune channels) | ✓ | ✓ | ✓ | | | ✓ |
| Operation search (find efficient operations) | | | | ✓ | ✓ | ✓ |
| Layer-wise optimization | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Channel-wise optimization | | | | | | ✓ |
| Optimization method | RL | trial-and-error | policy-gradient | RL | gradient-based | gradient-based |
Table 1: Comparisons of automated neural architecture search (or pruning) techniques. Our algorithm conceptually has the largest search space among all the compared methods, and efficiently optimizes candidate models by gradient-based search strategies.

Researchers have aimed to develop flexible and scalable NAS techniques with large search spaces and to identify unique models different from manually designed structures [57]. However, NAS methods often suffer from huge computational cost and reduce their search space significantly for practical reasons. For example, [57, 3, 39, 33, 35] search for two cells as basic building blocks and construct full models by stacking them. To tackle the redundancy between cells and increase the diversity of full models, MnasNet [47] adopts blocks, a smaller search unit than cells. Recently, FBNet [50] and ProxylessNAS [5] reduce their search units further to individual layers. Although the resulting models become more flexible by decreasing the granularity of search units and increasing the diversity of the generated models through their composition, those methods are still limited to allocating a single operation per layer, and the number of operation configurations of the whole network is only proportional to the number of layers.

In contrast, we present a flexible and scalable neural architecture search algorithm. The search unit of our algorithm is the channel, which is even smaller than a layer; each channel chooses a different operation (we define an operation as a series of a convolution, a normalization, and an activation function application), which also includes no-operation, equivalent to channel pruning. This search strategy improves the flexibility of the resulting models because a large number of configurations can be generated even within a single layer, and the number grows exponentially as layers are added. Such an extremely flexible framework incurs only small overhead, which allows us to maintain various operations during search and increase the search space significantly. Figure 1 illustrates the proposed fine-grained neural architecture search (FGNAS) approach, where our per-channel search algorithm generates a feature map given by a composition of multiple operations and also reduces the number of channels by pruning.

FGNAS is trained to maximize the validation accuracy efficiently and stably by a stochastic gradient descent method. Moreover, it is convenient to regularize individual channels by incorporating FLOPs and latency into the training objective. Therefore, the proposed algorithm has a great deal of flexibility and scalability to maximize the accuracy of searched models while allowing various aspects of the optimization to be considered. Our overall contribution is summarized as follows:

  • We propose a flexible and scalable fine-grained neural architecture search algorithm, which performs per-channel operation search including channel pruning efficiently and is optimized end-to-end by a stochastic gradient descent method.

  • Our framework conveniently handles diverse objectives of neural architecture search such as the number of parameters, FLOPs, and latency, in addition to accuracy.

  • The resulting models from our algorithm achieve outstanding performance improvements with respect to various evaluation metrics in image classification and single image super-resolution problems.

The rest of this paper is organized as follows. We first discuss existing work related to deep neural network optimization and neural architecture search in Section 2. Section 3 describes the proposed algorithm in detail including the training methods, and Section 4 presents experimental results in comparison to existing methods.

2 Related Work

This section describes existing efficient convolutional network designs and neural architecture search techniques in detail. Table 1 presents a snapshot of the algorithms discussed in this section.

Efficient Convolution Networks

Designing compact convolutional neural networks has been an active research problem in the last few years. While hand-crafted models achieve efficient convolutional operations by revising network structures [27, 21, 54, 42, 23], simple rule-based network quantization [14] and pruning techniques [15, 9, 13, 14, 38, 36, 31, 19] successfully reduce the redundancy of deep and complex pretrained models. Recent pruning methods automatically remove filters and/or activations using reinforcement learning [18], trial-and-error [52], and policy gradients [26]. They optimize a network in a layer-by-layer fashion, which is inefficient in dealing with inter-layer relationships, while our FGNAS optimizes all layers jointly using a gradient-based method.

Figure 2: The proposed efficient per-channel operation search framework using binary masks $\mathbf{m}_k^l$. The $k$-th operation $\mathcal{O}_k^l$ produces a tensor with $C^l$ channels from the input feature map $\mathbf{x}^{l-1}$ of the $l$-th layer. The binary vector $\mathbf{m}_k^l$ selects a subset of channels in the output tensor, and each masked tensor together with the optional skip-connection feature map is aggregated into the final output feature map $\mathbf{x}^l$ of the $l$-th layer. This framework facilitates efficient search for flexible architectures. Note that we only need to learn the binary masks for neural architecture search, which are implemented by gating functions. Figure 4 illustrates the details of the forward and backward passes through the gating functions.

Neural Architecture Search (NAS)

Automatic architecture search techniques conceptually have more flexibility in the identified models than hand-crafted methods. NASNet [57] and MetaQNN [3] adopt reinforcement learning for non-differentiable optimization. ENAS [39] employs an RNN controller to search for the optimal model by drawing a series of sample models and maximizing their expected reward, while PNAS [33] performs a progressive architecture search by predicting the accuracy of candidate models. Evolutionary search [40] employs tournament selection; although it is the first algorithm to surpass the state-of-the-art classification accuracy, it requires significantly more computational resources. DARTS [35] relaxes the discrete architecture representation to a continuous one and addresses the scalability issue by making the objective function differentiable. MnasNet [47] and DPP-Net [12] are optimized with respect to accuracy and run-time via reinforcement learning and a performance predictor, respectively. EfficientNet [48] improves network efficiency by simply scaling the depth, width, and resolution of a backbone network. MobileNetV3 [20] adopts block-wise search [47] with layer-wise pruning [52] and presents a novel architecture design with Squeeze-and-Excitation [22]. Recently, multiple-choice gating functions are often adopted for differentiable and multi-objective search techniques. ProxylessNAS [5] and FBNet [50] search for efficient convolution operations in each layer. MixConv [49] finds a new depth-wise convolution operation that has multiple kernel sizes within a layer. Our FGNAS presents per-channel convolution operation search, which constructs maximally flexible layer configurations as illustrated in Figure 1 and runs efficiently through differentiable optimization.

3 Proposed Algorithm

This section first presents our efficient search formulation via binary masking and discusses the gating function that enables end-to-end differentiable search. Then, we present the objective function of our algorithm based on a resource regularizer that directly penalizes each channel, and describe the exact search space.

3.1 Formulation of Operation Search

Figure 3: The structure of multiple operations derived from a convolution. The multiple operations share the feature map of the normalization to reduce the search cost. Since an operation consists of a sequence of a convolution, a normalization, and an activation function, the dashed boxes may be omitted depending on the backbone networks.
Figure 4: Illustration of the forward and backward passes used to optimize the gating function parameters $\boldsymbol{\alpha}$. (a) The gating function $g$ produces a binary value in the forward pass and a softmax probability in the backward pass for gradient-descent optimization. (b) The collection of gating functions $\mathbf{g}_k^l$ is a relaxed version of the binary mask $\mathbf{m}_k^l$ in (2). (c) The gating functions control the searched architecture by determining the active channels in the forward pass. During the gradient-descent optimization in the backward pass, the resource regularizer penalizes a channel with high resource consumption, while the task-specific loss attempts to keep the channel alive if it performs well on the target task.

Although FGNAS has a large search space and generates flexible output models, a critical concern is how to perform NAS efficiently through a proper configuration of the search space. To tackle this challenge, FGNAS constructs a feature map using a composition of multiple operations as illustrated in Figure 2, where the composition allows us to generate a large number of virtual operations and increases the flexibility of searched models. Given the input tensor of the $l$-th layer, denoted by $\mathbf{x}^{l-1}$, the output of the layer, $\mathbf{x}^l$, is expressed as

$$\mathbf{x}^l = \frac{1}{K^l} \sum_{k=1}^{K^l} \tilde{\mathbf{o}}_k^l, \qquad (1)$$

where $K^l$ is the number of operations at the $l$-th layer considered in our search and

$$\tilde{\mathbf{o}}_k^l = \mathbf{m}_k^l \odot \mathcal{O}_k^l(\mathbf{x}^{l-1}). \qquad (2)$$

Note that $\mathbf{m}_k^l \in \{0, 1\}^{C^l}$ is a binary vector, $\mathcal{O}_k^l$ represents the $k$-th operation producing a tensor with $C^l$ channels, and $\odot$ denotes the channel-wise binary masking operator. In other words, the output tensor is given by the average of the masked tensors, where the mask of each tensor is learned by our search algorithm, which also allows channel pruning by masking out the same channels in all output tensors. In addition to the operation search, we optionally consider identity connections from a preceding layer, which leads to the following modification of (1):

$$\mathbf{x}^l = \frac{1}{K^l + 1} \left( \sum_{k=1}^{K^l} \tilde{\mathbf{o}}_k^l + \mathbf{x}^{s} \right), \qquad (3)$$

where $\mathbf{x}^{s}$ denotes the feature map from which the identity connection originates.

In our algorithm, each operation is defined by a series of a convolution, a normalization, and an activation function application, as illustrated in Figure 1 (b). Figure 3 presents our efficient operation structure, which increases the number of operations with little additional cost because all three operations in Figure 3 share the feature map of the preceding normalization. For the parts of the backbone networks where convolutional layers are not followed by normalization and activation layers, an operation is simply equivalent to a convolution.
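To make the formulation concrete, the following is a minimal PyTorch-style sketch of the per-channel masked composition in (1)-(3). It is illustrative only: the class and helper names (FGNASLayer, make_op) are ours, the candidate operations are fixed to a few kernel sizes, and the way the optional skip tensor enters the average is an assumption rather than the authors' implementation.

```python
# Minimal sketch of per-channel masked composition of operations (Eqs. (1)-(3)).
import torch
import torch.nn as nn


def make_op(in_ch, out_ch, kernel_size):
    """One candidate operation: convolution -> normalization -> activation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class FGNASLayer(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.ops = nn.ModuleList([make_op(in_ch, out_ch, k) for k in kernel_sizes])
        # One binary mask per operation, one entry per output channel (Eq. (2)).
        self.masks = nn.ParameterList(
            [nn.Parameter(torch.ones(out_ch), requires_grad=False) for _ in kernel_sizes]
        )

    def forward(self, x, skip=None):
        # Masked tensors: m_k^l applied channel-wise to O_k^l(x^{l-1}).
        outs = [m.view(1, -1, 1, 1) * op(x) for m, op in zip(self.masks, self.ops)]
        if skip is not None:          # optional identity connection (Eq. (3));
            outs.append(skip)         # skip is assumed to match the output shape
        return torch.stack(outs, dim=0).mean(dim=0)   # average of masked tensors (Eq. (1))
```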

3.2 Per-Channel Differentiable Gating Functions

To relax the binary mask in (2), we introduce a relaxed gating function $g_{k,c}^l$ and define the collection of gating functions, denoted by $\mathbf{g}_k^l$, as

$$\mathbf{g}_k^l = \left[\, g_{k,1}^l, \; g_{k,2}^l, \; \dots, \; g_{k,C^l}^l \,\right], \qquad (4)$$

where $l$ and $k$ denote the layer and operation indices, respectively, and $C^l$ is the number of channels. The relaxed gating function for each channel $c$, parametrized by $\boldsymbol{\alpha}_{k,c}^l$, is given by

$$g_{k,c}^l(\boldsymbol{\alpha}_{k,c}^l) = \mathbb{1}\!\left[ \mathrm{softmax}(\boldsymbol{\alpha}_{k,c}^l)_{\mathrm{on}} > \mathrm{softmax}(\boldsymbol{\alpha}_{k,c}^l)_{\mathrm{off}} \right], \qquad (5)$$

where $\mathbb{1}[\cdot]$ is an indicator function that returns 1 when its input is true and 0 otherwise, and $\mathrm{softmax}(\boldsymbol{\alpha}_{k,c}^l)_{d}$ denotes the value corresponding to dimension $d$ after applying a softmax function. Figure 4 (a) and (b) illustrate $g_{k,c}^l$ and $\mathbf{g}_k^l$, respectively.

Using the relaxed gating function, we reformulate the channel-wise tensor masking in (2) as

$$\tilde{\mathbf{o}}_k^l = \mathbf{g}_k^l \odot \mathcal{O}_k^l(\mathbf{x}^{l-1}). \qquad (6)$$

This relaxed gating function allows us to update the architecture by a gradient-descent optimization method because the backward function is differentiable.
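As a concrete illustration, here is a minimal sketch of such a gate in PyTorch: binary in the forward pass and softmax-relaxed in the backward pass via a straight-through estimator. The two-logit parameterization of alpha and the class name ChannelGate are assumptions made for illustration, not the authors' code.

```python
# Sketch of a per-channel relaxed gate (Eqs. (4)-(6)) with a straight-through estimator.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelGate(nn.Module):
    def __init__(self, num_channels):
        super().__init__()
        # alpha[c] holds two logits per channel: ("off", "on").
        self.alpha = nn.Parameter(torch.zeros(num_channels, 2))

    def forward(self, x):
        p_on = F.softmax(self.alpha, dim=-1)[:, 1]   # relaxed gate value per channel
        hard = (p_on > 0.5).float()                  # binary gate used in the forward pass
        # Straight-through: the forward value equals `hard`, gradients flow through `p_on`.
        gate = hard.detach() - p_on.detach() + p_on
        return gate.view(1, -1, 1, 1) * x, gate


# Usage: mask the output tensor of one candidate operation as in Eq. (6).
gate_module = ChannelGate(num_channels=64)
masked, g = gate_module(torch.randn(8, 64, 32, 32))
```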

3.3 Resource Regularizer on Channels

The proposed approach aims to maximize the accuracy on a target task while minimizing the resource usage of the identified model. Hence, our objective function is composed of two terms: one is the task-specific loss, and the other is a regularizer penalizing the overhead of networks such as parameters, FLOPs, and latency. To search for operations per channel, the proposed regularizer computes the amount of resource usage of each channel, which changes over iterations due to the gradual update of the architecture. Figure 4 (c) illustrates an overview of the resource regularizer, and the rest of this section discusses the details.

Let $\mathcal{L}(\cdot)$ denote a loss function for an arbitrary task (in our work, the tasks are image classification and super-resolution) and $\mathcal{R}(\cdot)$ be a differentiable regularizer that estimates the resources of the current model identified by our search algorithm. Then, the objective function is formally given by

$$\min_{\mathbf{w}, \, \boldsymbol{\alpha}} \; \mathcal{L}(\mathbf{w}, \boldsymbol{\alpha}) + \lambda \, \mathcal{R}(\boldsymbol{\alpha}), \qquad (7)$$

where $\mathbf{w}$ and $\boldsymbol{\alpha}$ are the learnable parameters of the neural network and the gating functions, respectively, and $\lambda$ is a hyper-parameter balancing the two terms. Specifically, the regularizer is given by

$$\mathcal{R}(\boldsymbol{\alpha}) = \sum_{l=1}^{L} \sum_{k=1}^{K^l} f_r\!\left( C_{\mathrm{in}}^{l,k}, \; C_{\mathrm{out}}^{l,k} \right), \qquad (8)$$

where $f_r(\cdot)$ is a resource measurement function of the $k$-th operation, $r$ indicates the type of resource, and $L$ is the number of layers. Note that $C_{\mathrm{in}}^{l,k}$ and $C_{\mathrm{out}}^{l,k}$ are the numbers of input and output channels of the $k$-th operation in the $l$-th layer, respectively, and they are differentiable via the gating functions, defined as

$$C_{\mathrm{out}}^{l,k} = \left\| \mathbf{g}_k^l \right\|_1 \qquad (9)$$

and

$$C_{\mathrm{in}}^{l,k} = \left\| \, b\!\left( \sum_{k'=1}^{K^{l-1}} \mathbf{g}_{k'}^{l-1} \right) \right\|_1, \qquad (10)$$

where $\|\cdot\|_1$ denotes the $\ell_1$ norm of a vector. The function $b(\cdot)$ produces a binary vector, valued 1 for the non-zero elements of the input vector, in the forward pass, but acts as an identity function in the backward pass. A skip connection from an earlier layer affects (10) because we need to consider an extra term in the summation.
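The sketch below shows how such a regularizer could be assembled in PyTorch on top of the gating sketch above: the per-operation channel counts follow (9) and (10), and a standard dense-convolution FLOPs formula stands in for the resource function $f_r$. The function names and the data layout of `layers` are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the differentiable resource regularizer (Eqs. (8)-(10)).
import torch


def binarize_st(v):
    """b(.): 1 for non-zero entries in the forward pass, identity in the backward pass."""
    hard = (v != 0).float()
    return hard.detach() - v.detach() + v


def conv_flops(c_in, c_out, kernel_size, out_h, out_w):
    """Resource function f_FLOPs of one dense convolution; differentiable in c_in, c_out."""
    return c_in * c_out * kernel_size * kernel_size * out_h * out_w


def resource_regularizer(layers):
    """layers: list of dicts holding relaxed gate vectors and static shape info per layer."""
    total = 0.0
    for prev, cur in zip(layers[:-1], layers[1:]):
        # Eq. (10): active input channels = || b( sum_k g_k^{l-1} ) ||_1
        c_in = binarize_st(torch.stack(prev["gates"]).sum(0)).sum()
        for g, k in zip(cur["gates"], cur["kernel_sizes"]):
            c_out = g.sum()                           # Eq. (9): || g_k^l ||_1
            total = total + conv_flops(c_in, c_out, k, *cur["out_size"])
    return total                                       # plugged into Eq. (7) with weight lambda
```

In a real run, `layers[i]["gates"]` would hold the relaxed gate vectors produced by the gating modules of layer $i$.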

On the other hand, the parameter and FLOPs measurement functions are well-defined functions of the convolution kernel sizes, the numbers of channels, the feature map resolution, and so on, and they are differentiable with respect to the number of active channels by the definitions in (9) and (10). However, it is not straightforward to define the latency measurement function on specific devices such as the Google Pixel 1 and Samsung Galaxy S8. We address this problem by fitting affine functions, parameterized per configuration, of the relation between latency and FLOPs; it turns out that convolution operations present strong correlations between latency and FLOPs under a particular condition given by the combination of input feature map size, kernel size, stride, convolutional groups, and so on. By approximating latency as a function of FLOPs, (8) with the latency resource type naturally penalizes all channels to minimize the run-time of networks.
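A minimal sketch of this latency surrogate is shown below: an affine function of FLOPs is fit per convolution configuration and then used as the resource function in (8). The measurement pairs are placeholders standing in for on-device profiling data, not values from the paper.

```python
# Sketch of approximating device latency as an affine function of FLOPs.
import numpy as np

# Hypothetical profiled pairs (FLOPs, latency in ms) for one configuration
# (fixed input size, kernel size, stride, groups) at several channel widths.
flops = np.array([50e6, 100e6, 200e6, 400e6])
latency_ms = np.array([1.1, 1.9, 3.6, 7.0])

# Affine fit: latency ~= a * FLOPs + b.
a, b = np.polyfit(flops, latency_ms, deg=1)


def latency_estimate(f):
    """Differentiable surrogate used as the latency resource function in Eq. (8)."""
    return a * f + b


print(latency_estimate(300e6))   # predicted latency for an intermediate width
```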

3.4 Search Space

FGNAS searches for operations in each channel; the granularity of the architecture search is as small as a channel. Consequently, the possible combinations of operations in FGNAS are far more numerous than those of any other NAS technique. Specifically, since every channel can be produced by any subset of the candidate operations (including the empty set, i.e., pruning), the search space of a single layer is $(2^{K^l})^{C^l} = 2^{K^l C^l}$, where $K^l$ is the number of operations and $C^l$ is the number of channels at the $l$-th layer, with minor variations depending on the network configuration (e.g., the existence of skip connections). This is truly beyond the range comparable to other approaches because most NAS techniques are limited to a per-layer search strategy and explore a few building blocks instead of directly optimizing the whole model.
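For a sense of scale, the following back-of-the-envelope calculation (illustrative values, not taken from the paper) evaluates this count for a single layer with $K^l = 3$ candidate operations and $C^l = 64$ channels:

$$\left(2^{K^l}\right)^{C^l} = 2^{K^l C^l} = 2^{3 \times 64} = 2^{192} \approx 6.3 \times 10^{57}.$$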

Table 2 illustrates the search space of operations in our search algorithm. The backbone networks for image classification include VGG, ResNet, DenseNet, EfficientNet, and MobileNetV2, while EDSR is employed for image super-resolution. Note that we insert a 1×1 convolution operation after an identity connection to reduce the number of input channels to the first convolution operation of a residual (or dense) block.

Factor Search Space
Convolution types Normal, Depth-wise
Convolution kernel sizes 1, 3, 5, 7, 9, 11
Normalization method BN
Activation functions ReLU, PReLU, tanh
The number of channels 0, 1, 2, …, C^l

Table 2: The search space of operations.
Model Type Search Cost (GPU-days) Top-1 Acc. Parameters
DenseNet-BC [24] manual - 96.5 % 25.6 M
Hierarchical Evolution [34] evolution 300 96.3 % 15.7 M
P-DARTS (large) [6] + cutout gradient-based 0.3 97.8 % 10.5 M
ProxylessNAS-G [5] + cutout gradient-based 4.0 97.9 % 5.7 M
ENAS [39] + cutout RL 0.5 97.1 % 4.6 M
EfficientNet-B0 [48] model scaling - 98.1 % 4.0 M
EfficientNet-B0-FGNAS (Large) + cutout gradient-based 0.1 98.2 % 3.9 M
P-DARTS [6] + cutout gradient-based 0.3 97.5 % 3.4 M
NASNet-A [58] + cutout RL 1800 97.4 % 3.3 M
DARTS [35] (first order) + cutout gradient-based 1.5 97.0 % 3.3 M
DARTS [35] (second order) + cutout gradient-based 4 97.2 % 3.3 M
AmoebaNet-A [40] + cutout evolution 3150 96.6 % 3.2 M
PNAS [33] SMBO 225 96.6 % 3.2 M
SNAS [51] + mild constraint + cutout gradient-based 1.5 97.0 % 2.9 M
SNAS [51] + moderate constraint + cutout gradient-based 1.5 97.2 % 2.8 M
AmoebaNet-B [40] + cutout evolution 3150 97.5 % 2.8 M
EfficientNet-B0-FGNAS (Small) + cutout gradient-based 0.5 97.8 % 2.7 M
Table 3: Comparison with state-of-the-art architectures on CIFAR-10.
Model Search Space Method Type Top-1 Acc. Parameters FLOPs CPU
MobileNetV2 (224) No Search Baseline manual 72.0 % 3.4 M 600 M 75 ms
+ Channel Pruning Multiplier (0.75) [42] manual 69.8 % 2.6 M 418 M 56 ms
NetAdapt [52] trial-and-error 70.9 % - - 64 ms
FGNAS (P) gradient-based 70.9 % 3.5 M 410 M 53 ms
+ 5×5 DConv FGNAS gradient-based 71.4 % 3.1 M 378 M 53 ms
Table 4: Comparison with channel pruning methods on ImageNet. The NetAdapt accuracy is a reported result from [52], obtained at a latency similar to that of Multiplier (0.75).

4 Experiment

This section first presents the benchmark datasets for the image classification and super-resolution tasks and describes the implementation details of our algorithm. Then, we present the experimental results including performance analysis.

4.1 Dataset

CIFAR-10 [30] and ILSVRC2012 [41] are popular datasets for image classification. The former contains 50K training and 10K testing 32×32 images in 10 classes. The latter consists of 1.2M training and 50K validation images in 1,000 object categories, which are a subset of ImageNet [8]. DIV2K [1] is a training dataset for image super-resolution, which contains 800 2K-resolution images, while we evaluate super-resolution algorithms on Set5 [4], Set14 [53], B100 [37], and Urban100 [25].

4.2 Implementation Details

Search steps

The proposed algorithm searches for architectures in four steps: (1) determine a backbone network and the candidate operations for each layer, (2) pre-train the network without gating functions, (3) search for architectures by learning the gating function parameters until the resource usage of the searched architecture reaches the target, and (4) fine-tune the searched architecture with fixed gating function parameters.

CIFAR-10

The backbone network is EfficientNet-B0 [48], whose architecture is designed for ImageNet and transferred to CIFAR-10. The search space consists of kernel sizes 1, 3, and 5 in the depth-wise convolution layers and the number of channels in all layers. We train the model for 160 epochs with mini-batch size 128 and initial learning rate 0.01. The resource of interest is the number of network parameters, which is penalized by the resource regularizer with hyper-parameter λ. We use the standard SGD optimizer with Nesterov momentum [45] and Cutout augmentation [10], with weight decay and momentum set to 0.0001 and 0.9, respectively.
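For reference, a sketch of this optimizer setup in PyTorch is given below; the placeholder model stands in for the EfficientNet-B0 backbone wrapped with FGNAS gates, which is not reproduced here.

```python
# Sketch of the stated CIFAR-10 optimization setting: SGD with Nesterov
# momentum, lr 0.01, momentum 0.9, weight decay 1e-4, mini-batch size 128.
import torch

model = torch.nn.Linear(10, 10)   # placeholder for the gated EfficientNet-B0 backbone

optimizer = torch.optim.SGD(
    model.parameters(),           # in practice: backbone weights and gate parameters
    lr=0.01,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True,
)
# Training runs for 160 epochs with Cutout augmentation applied to the inputs.
```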

Model Search Space Method Top-1 Acc. FLOPs
VGG-16 No Search Baseline 93.7 % 627 M
+ Channel pruning FGNAS (P) 93.6 % 149 M
+ 1×1 to 11×11 Conv. FGNAS 93.6 % 119 M
+ ReLU, PReLU, Tanh FGNAS 93.6 % 110 M

Table 5: Ablation study of search space on CIFAR-10.

ImageNet

MobileNetV2 [42] is the backbone network, whose architecture is compactly designed for ImageNet classification. The search space consists of kernel sizes 3 and 5 in the depth-wise convolution layers and the number of channels in all layers. We train models with mini-batch size 256 and an initial learning rate of 0.01. The training runs for 400 epochs, and the learning rate is divided by 10 at 50% and 75% of the total number of training epochs. The resource of interest is the network latency, and the hyper-parameter λ of the resource regularizer is set to 0.0012. We evaluate our models on a Google Pixel 1 CPU using Google's TensorFlow Lite engine.
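The stated step schedule can be expressed compactly with a standard multi-step scheduler; the sketch below assumes an SGD optimizer and a placeholder model, since the paper does not spell out the ImageNet optimizer beyond the learning rate.

```python
# Sketch of the stated ImageNet schedule: 400 epochs, initial lr 0.01,
# divided by 10 at 50% and 75% of training.
import torch

model = torch.nn.Linear(8, 8)     # placeholder for the gated MobileNetV2 backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # optimizer assumed

epochs = 400
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[epochs // 2, epochs * 3 // 4], gamma=0.1
)

for epoch in range(epochs):
    # ... one training epoch with mini-batch size 256 ...
    scheduler.step()
```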

DIV2K

The backbone network is a small version of EDSR [32], which has 64 channels in each layer and 16 residual blocks. The search space consists of ReLU, PReLU, and tanh in the activation layers and the number of channels in all layers. The model is pre-trained for 300 epochs using Adam [29] with mini-batch size 16, 96×96-pixel patches, and fixed settings for the learning rate and the Adam coefficients β1, β2, and ε. The resource of interest is the network FLOPs, which is penalized by the resource regularizer with hyper-parameter λ. The image restoration performance measures are PSNR and SSIM on the Y channel of the YCbCr color space with scaling factor 2.
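For completeness, the evaluation protocol (PSNR on the Y channel of YCbCr) can be sketched as follows; the exact RGB-to-Y convention is our assumption, using the standard BT.601 luma weights on full-range values.

```python
# Sketch of Y-channel PSNR evaluation for x2 super-resolution.
import numpy as np


def rgb_to_y(img):
    """Luma (BT.601 weights) from an RGB image with values in [0, 255], shape (H, W, 3)."""
    return 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]


def psnr_y(sr, hr):
    """PSNR between a super-resolved image and its ground truth on the Y channel."""
    diff = rgb_to_y(sr) - rgb_to_y(hr)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```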

Figure 5: Performance comparison between our algorithms starting from VGG-16 on CIFAR-10. FGNAS finds more efficient networks than the channel-pruned networks produced by FGNAS (P).


Figure 6: Analysis of the architectures searched by FGNAS from VGG-16 on CIFAR-10. Blue, red, and yellow denote the 627M-FLOPs (baseline), 250M-FLOPs, and 110M-FLOPs networks, respectively. (a) The number of channels at each layer. (b) The number of operation types at each layer; if two or more operations produce a channel, we count the combination as a new operation type for visualization. (c) Frequency of convolution kernel sizes in operations.


Figure 7: Per-layer analysis of Figure 6 (c). (a), (b), and (c) present the number of channels produced by convolutions with 1×1, 3×3, and 5×5 kernel sizes, respectively. The baseline network has only convolutions with 3×3 kernels.

4.3 Image Classification

Results on CIFAR-10

Table 3 presents the performance comparison with state-of-the-art architectures. FGNAS (Large) outperforms the backbone network EfficientNet-B0 [48] with a smaller number of parameters, and FGNAS (Small) has 2.1× fewer parameters than ProxylessNAS-G [5] with comparable accuracy. The search cost of the proposed algorithm is small, although it takes more time to find smaller networks.

Model Method Type Top-1 Acc. FLOPs
VGG-16 Baseline manual 93.7 % 627 M
Huang et al. [26] policy-gradient 90.9 % 222 M
Slimming [36] rule-based 93.6 % 211 M
FGNAS (P) gradient-based 93.6 % 149 M
VGG-19 Baseline manual 94.0 % 797 M
Slimming [36] rule-based 93.8 % 391 M
DCP [56] gradient-based 94.2 % 398 M
FGNAS (P) gradient-based 94.3 % 348 M
ResNet-18 Baseline manual 91.5 % 26.0 G
Huang et al. [26] policy-gradient 90.7 % 6.2 G
FGNAS (P) gradient-based 92.5 % 1.3 G
ResNet-20 Baseline manual 92.2 % 81 M
Soft Filter [17] rule-based 91.2 % 57 M
FGNAS (P) gradient-based 91.7 % 34 M
DenseNet-40 Baseline manual 94.3 % 566 M
Slimming [36] rule-based 93.5 % 188 M
FGNAS (P) gradient-based 93.6 % 149 M
Table 6: Channel pruning performance comparison on CIFAR-10.
Type | Channel Pruning | Multiple-operation | Top-1 Acc. | FLOPs
(1) | | | 91.0 % | 278 M
(2) | ✓ | | 91.6 % | 131 M
Ours | ✓ | ✓ | 92.5 % | 61 M

Table 7: Ablation study of the per-channel gating function with VGG-16 on CIFAR-10. "Multiple-operation" indicates that two or more operations can produce a channel.
Model Type Set5 Set14 B100 Urban100 Parameters FLOPs
(PSNR/SSIM) (PSNR/SSIM) (PSNR/SSIM) (PSNR/SSIM)
SRCNN [11] manual 36.66 dB / 0.9542 32.42 dB / 0.9063 31.36 dB / 0.8879 29.50 dB / 0.8946 57 K 105.4 G
VDSR [28] manual 37.53 dB / 0.9587 33.03 dB / 0.9124 31.90 dB / 0.8960 30.76 dB / 0.9140 665 K 1,225.2 G
CARN-M [2] manual 37.53 dB / 0.9583 33.26 dB / 0.9141 31.92 dB / 0.8960 31.23 dB / 0.9144 412 K 182.4 G
CARN [2] manual 37.76 dB / 0.9590 33.52 dB / 0.9166 32.09 dB / 0.8978 31.92 dB / 0.9256 1,592 K 445.6 G
MemNet [46] manual 37.78 dB / 0.9597 33.28 dB / 0.9142 32.08 dB / 0.8978 31.51 dB / 0.9312 677 K 5,324.8 G
EDSR [32] manual 38.11 dB / 0.9601 33.92 dB / 0.9198 32.32 dB / 0.9013 32.93 dB / 0.9351 40,712 K 18,769.5 G
RDN [55] manual 38.24 dB / 0.9614 34.01 dB / 0.9212 32.34 dB / 0.9017 32.89 dB / 0.9353 22,114 K 10,192.4 G
FALSR-B [7] evolution 37.61 dB / 0.9585 33.29 dB / 0.9143 31.97 dB / 0.8967 31.28 dB / 0.9191 326 K 149.4 G
ESRN-V [44] evolution 37.85 dB / 0.9600 33.42 dB / 0.9161 32.10 dB / 0.8987 31.79 dB / 0.9248 324 K 146.8 G
EDSR-FGNAS gradient-based 37.86 dB / 0.9593 33.44 dB / 0.9157 32.11 dB / 0.8987 31.85 dB / 0.9254 212 K 97.6 G
Table 8: The image super-resolution benchmark for NAS approaches at scaling factor 2. FLOPs are measured for producing an HD image.

Results on ImageNet

Table 4 presents the performance comparison with the MobileNetV2 multiplier [42] and NetAdapt [52], which successfully prune channels of efficiently designed networks [42, 20]. For a fair comparison, we evaluate the proposed algorithm as a channel pruning method, referred to as FGNAS (P), whose search space is only the number of channels in all layers. FGNAS (P) is faster than the other channel pruning methods in terms of both FLOPs and latency, and FGNAS achieves 1.6% higher Top-1 accuracy than Multiplier. The model latency reaches the target latency within 40 epochs at the search stage, which indicates the search cost of the proposed algorithm.

Ablation study of search space

Our search method easily enlarges the search space by adding operations to the layers of the backbone network to find more efficient architectures. Table 5 shows that the proposed algorithm finds faster networks in the larger search space with the same Top-1 accuracy. Figure 5 shows FLOPs/accuracy curves of our search methods. FGNAS consistently outperforms FGNAS (P) while reducing the network run-time, and finds an architecture with 5.7× fewer FLOPs than the original VGG-16 on CIFAR-10.

Searched architecture analysis

To analyze the performance improvement from flexible architectures, we visualize two FGNAS architectures with 250M and 110M FLOPs, searched from VGG-16 on CIFAR-10. The search space consists of kernel sizes 1, 3, 5, 7, 9, and 11 in convolutions, ReLU, PReLU, and tanh as activation functions, and the number of channels in all layers. The networks searched by FGNAS and the original VGG-16 differ in accuracy by less than 0.3%. Figure 6 (a) shows that the 3rd, 5th, 8th, and 10th layers, which are located right after pooling operations, retain more channels than the subsequent layers, and the 110M-FLOPs network prunes most of the channels in the 10th to 12th layers of the 250M-FLOPs network. As illustrated in Figure 6 (b), the 110M-FLOPs network has many more operation types within a layer, which leads to complex layer configurations; note that the 5th layer has 31 different operation types. Figure 6 (c) shows that 1×1 convolutions appear more frequently for network efficiency. Figure 7 (a) shows that convolutions with 1×1 kernels produce more channels in the 8th to 13th layers, where the feature map resolutions are 4×4 and 2×2 pixels. On the other hand, the 1st to 8th layers prefer 3×3 convolutions to 1×1 and prune most channels at the 10th layer, as illustrated in Figure 7 (b). The channels from 5×5 convolutions mainly remain in the 3rd, 5th, and 8th layers, located right after pooling operations.

Channel pruning results on CIFAR-10

We evaluate the channel pruning performance of our algorithm, FGNAS (P), on diverse backbone networks: VGGNet [43], ResNet [16], and DenseNet [24]. Since the original networks are designed for ImageNet, we adopt the modified networks for CIFAR-10 [36, 26]. Table 6 shows that the proposed algorithm outperforms the existing pruning methods [26, 36, 56, 17] even with fewer FLOPs. Huang et al. [26] remove channels layer-by-layer with RL-based policy gradient estimation, whose search cost is 30 GPU-days using an Nvidia K40. Since FGNAS (P) searches over all layers simultaneously using differentiable gating functions, its search cost is 1 GPU-hour using a GeForce 1080 Ti on CIFAR-10. We reproduced the DenseNet-40 result of Slimming [36] for a fair comparison.

Ablation study of gating function

We evaluate the proposed search algorithm with modifications of the gating function that exclude its advantages one by one. Table 7 shows that each advantage significantly improves the performance of the searched architectures. Note that the Type (2) gating function in Table 7 searches for an operation per channel, while the gating functions in ProxylessNAS [5] and FBNet [50] choose one operation per layer.

4.4 Image Super-Resolution

To further verify the practical effectiveness of our approach, we evaluate our search method on image super-resolution (SR) tasks. The primary metric for this task is the FLOPs of networks because FLOPs are easy to calculate regardless of the input image resolution, which is arbitrary in SR problems.

Results

Table 8 shows the FLOPs of networks producing an HD image (1280×720 resolution) at scaling factor 2. Since SR networks require a substantially larger amount of FLOPs than conventional image classification networks, our search algorithm aims to find faster networks. FGNAS achieves 1.5× fewer FLOPs and parameters than the state-of-the-art NAS approaches [7, 44], as illustrated in Table 8. Note that FGNAS is even faster than SRCNN [11], which consists of 3 convolution layers. The searched residual blocks retain a large number of channels and activation operations, and the number of channels for skip connections gradually increases with network depth. The search cost is 0.5 GPU-days with a GeForce 2080 Ti.

5 Conclusion

We presented a novel architecture search technique, referred to as FGNAS, which provides a unified framework of structure and operation search via channel pruning. The proposed approach can be optimized by a gradient-based method, and we formulate a differentiable regularizer of neural networks with respect to resources, which facilitates efficient and stable optimization with diverse task-specific and resource-aware loss functions.

References

  • [1] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In CVPRW, 2017.
  • [2] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. arXiv:1803.08664, 2018.
  • [3] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. ICLR, 2017.
  • [4] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie-Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, 2012.
  • [5] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
  • [6] Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In ICCV, 2019.
  • [7] Xiangxiang Chu, Bo Zhang, Hailong Ma, Ruijun Xu, Jixiang Li, and Qingyuan Li. Fast, accurate and lightweight super-resolution with neural architecture search. ArXiv, abs/1901.07261, 2019.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
  • [9] Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
  • [10] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552, 2017.
  • [11] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. TPAMI, 38:295–307, 2014.
  • [12] Jin-Dong Dong, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min Sun. Dpp-net: Device-aware progressive search for pareto-optimal neural architectures. In ECCV, 2018.
  • [13] Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan. More is less: A more complicated network with less inference complexity. In CVPR, 2017.
  • [14] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. ICLR, 2016.
  • [15] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS. 2015.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016.
  • [17] Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI, 2018.
  • [18] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In ECCV, 2018.
  • [19] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, Oct 2017.
  • [20] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3. In ICCV, 2019.
  • [21] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint, arXiv:1704.04861, 2017.
  • [22] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
  • [23] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. CVPR, 2018.
  • [24] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CVPR, 2017.
  • [25] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR. IEEE, 2015.
  • [26] Q. Huang, K. Zhou, S. You, and U. Neumann. Learning to prune filters in convolutional neural networks. In WACV, 2018.
  • [27] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint, arXiv:1602.07360, 2016.
  • [28] Jiwon Kim, Jungkwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
  • [29] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
  • [30] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research).
  • [31] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. ICLR, 2017.
  • [32] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. arXiv preprint, arXiv:1707.02921, 2017.
  • [33] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018.
  • [34] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. In ICLR, 2018.
  • [35] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.
  • [36] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, Oct 2017.
  • [37] David R. Martin, Charless C. Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
  • [38] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient transfer learning. ICLR, 2017.
  • [39] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. PMLR, 2018.
  • [40] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. AAAI, 2019.
  • [41] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
  • [42] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
  • [43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint, arXiv:1409.1556, 2014.
  • [44] Dehua Song, Chang Xu, Xu Jia, Yiyi Chen, Chunjing Xu, and Yunhe Wang. Efficient residual dense block search for image super-resolution. ArXiv, abs/1909.11409, 2019.
  • [45] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
  • [46] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. Memnet: A persistent memory network for image restoration. In ICCV, 2017.
  • [47] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint, arXiv:1807.11626, 2018.
  • [48] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
  • [49] Mingxing Tan and Quoc V. Le. MixConv: Mixed depthwise convolutional kernels. In BMVC, 2019.
  • [50] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, 2019.
  • [51] Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: stochastic neural architecture search. In ICLR, 2019.
  • [52] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. Netadapt: Platform-aware neural network adaptation for mobile applications. In ECCV, 2018.
  • [53] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In ICCS, 2010.
  • [54] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
  • [55] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In CVPR, 2018.
  • [56] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jin-Hui Zhu. Discrimination-aware channel pruning for deep neural networks. In NIPS, 2018.
  • [57] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
  • [58] Barret Zoph, V. Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. CVPR, 2018.