Deep convolutional neural networks (CNNs) have recently achieved great success in various fields including computer vision, natural language processing, pattern recognition, bioinformatics, and many others. However, the arbitrary complexity of target problems and the requirement of extensive hyperparameter search make it impractical to manually explore the ideal deep network architectures customized for given tasks. Consequently, neural architecture search (NAS) approaches have been studied actively, and the models identified by NAS techniques [57, 39, 47, 18] have started to surpass the performance of traditional deep neural networks [43, 16, 24] designed by humans. Despite such successful results, it is still a challenging problem to optimize deep neural networks even with sophisticated AutoML techniques because the search space of the existing NAS methods is limited while their search cost is high.
| | | AMC | NetAdapt | Huang et al. | MnasNet | ProxylessNAS & FBNet [5, 50] | FGNAS (Ours) |
|---|---|---|---|---|---|---|---|
| Structure search | Prune channels | ✓ | ✓ | ✓ | | | ✓ |
| Operation search | Find efficient operations | | | | ✓ | ✓ | ✓ |
Researchers have aimed to develop flexible and scalable NAS techniques with large search spaces and to identify unique models that differ from manually designed structures. However, NAS methods often suffer from huge computational cost and reduce their search space significantly for practical reasons. For example, [57, 3, 39, 33, 35] search for two cells as basic building blocks and construct full models by stacking them. To tackle the redundancy between cells and increase the diversity of full models, MnasNet adopts blocks, a smaller search unit than cells. Recently, FBNet and ProxylessNAS reduce their search units further to individual layers. Although the resulting models become more flexible by decreasing the granularity of search units and increasing the diversity of the generated models through their composition, those methods are limited to allocating a single operation per layer, so the number of operation configurations of the whole network is proportional to the number of layers.
On the contrary, we present a flexible and scalable neural architecture search algorithm. The search unit of our algorithm is the channel, which is even smaller than a layer; each channel chooses a different operation (we define an operation as a series of a convolution, a normalization, and an activation function), which also includes no-operation, equivalent to channel pruning. This search strategy improves the flexibility of the resulting models because a large number of configurations can be generated even within a single layer, and this number increases exponentially with additional layers. Such an extremely flexible framework incurs small overhead, which makes it possible to maintain various operations for search and to increase the search space significantly. Figure 1 illustrates the proposed fine-grained neural architecture search (FGNAS) approach, where our per-channel search algorithm generates a feature map given by a composition of multiple operations and also reduces the number of channels by pruning.
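To make the counting argument above concrete, the following is a minimal sketch of the per-channel search unit, where each output channel independently picks one candidate operation, including a "none" choice that prunes the channel. The operation names and channel count are illustrative, not the paper's actual search space.

```python
# Per-channel operation assignment: each channel selects one candidate,
# and "none" corresponds to channel pruning. Names are hypothetical.
OPS = ["conv3x3", "conv5x5", "dwconv3x3", "none"]

def layer_configurations(num_channels: int) -> int:
    """Number of distinct per-channel operation assignments in one layer."""
    return len(OPS) ** num_channels

# A concrete assignment for a 4-channel layer: channel 2 is pruned ("none").
assignment = ["conv3x3", "conv5x5", "none", "dwconv3x3"]
assert len(assignment) == 4
assert layer_configurations(4) == 4 ** 4  # 256 configurations in a single layer
```

Even this toy layer admits hundreds of configurations, and stacking layers multiplies the counts, which is the exponential growth the text refers to.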
FGNAS is trained to maximize the validation accuracy efficiently and stably by a stochastic gradient descent method. Moreover, it is convenient to regularize individual channels by incorporating FLOPs and latency into the training objective. Therefore, the proposed algorithm has a great deal of flexibility and scalability to maximize the accuracy of searched models while facilitating the consideration of various optimization criteria. Our overall contributions are summarized as follows:
We propose a flexible and scalable fine-grained neural architecture search algorithm, which efficiently performs per-channel operation search, including channel pruning, and is optimized end-to-end by a stochastic gradient descent method.
Our framework conveniently handles diverse objectives of neural architecture search, such as the number of parameters, FLOPs, and latency, in addition to accuracy.
The resulting models from our algorithm achieve outstanding performance improvements with respect to various evaluation metrics in image classification and single image super-resolution problems.
The rest of this paper is organized as follows. We first discuss existing works related to deep neural network optimization and neural architecture search in Section 2. Section 3 describes the proposed algorithm in detail, including the training methods, and Section 4 presents experimental results in comparison to the existing methods.
2 Related Work
This section describes existing efficient convolutional network designs and neural architecture search techniques in detail. Table 1 presents a snapshot of the algorithms discussed in this section.
Efficient Convolution Networks
Designing compact convolutional neural networks has been an active research problem in the last few years. While the hand-crafted models achieve efficient convolutional operations by revising network structures [27, 21, 54, 42, 23], simple rule-based network quantization and pruning techniques [15, 9, 13, 14, 38, 36, 31, 19] reduce the redundancy of deep and complex pretrained models successfully. Recent pruning methods automatically remove filters and/or activations using reinforcement learning, trial-and-error, and policy-gradient methods. They optimize a network in a layer-by-layer fashion, which is inefficient in dealing with inter-layer relationships, whereas our FGNAS optimizes all layers jointly using a gradient-based method.
Neural Architecture Search (NAS)
Automatic architecture search techniques conceptually have more flexibility in the identified models than hand-crafted methods. NASNet and MetaQNN adopt reinforcement learning for non-differentiable optimization. ENAS employs an RNN controller to search for the optimal model by drawing a series of sample models and maximizing their expected reward, while PNAS performs a progressive architecture search by predicting the accuracy of candidate models. Evolutionary search employs a tournament selection; although it is the first algorithm to surpass the state-of-the-art classification accuracy, it requires significantly more computational resources. DARTS relaxes the discrete architecture representation to a continuous one and addresses the scalability issue by making the objective function differentiable. MnasNet and DPP-Net are optimized with respect to accuracy and run-time via reinforcement learning and a performance predictor, respectively. EfficientNet improves network efficiency by simply scaling the depth, width, and resolution of a backbone network. MobileNetV3 adopts block-wise search with layer-wise pruning and presents a novel architecture design with Squeeze-and-Excitation. Recently, multiple-choice gating functions have often been adopted for differentiable and multi-objective search techniques. ProxylessNAS and FBNet search for efficient convolution operations in each layer. MixConv finds a new depth-wise convolution operation that has multiple kernel sizes within a layer. Our FGNAS presents per-channel convolution operation search, which constructs maximally flexible layer configurations as illustrated in Figure 1 and runs efficiently through differentiable optimization.
3 Proposed Algorithm
This section first presents our efficient search formulation via binary masking and discusses the gating function that allows performing an end-to-end differentiable search. Then, we present the objective function of our algorithm based on a resource regularizer, which directly penalizes each channel, and describe the exact search space.
3.1 Formulation of Operation Search
Figure 4: (a) A gating function $g$ produces a binary value in the forward pass and a softmax probability in the backward pass for gradient-descent optimization. (b) The collection of gating functions $\mathcal{G}$ is a relaxed version of the binary mask in (2). (c) $\mathcal{G}$ controls searched architectures by determining active channels in the forward pass. During the gradient-descent optimization procedure in the backward pass, the resource regularizer plays a role in penalizing a channel with high resource consumption, while the task-specific loss attempts to keep the channel alive if it performs well in the target task.
Although FGNAS has a large search space and generates flexible output models, a critical concern is how to perform NAS efficiently through proper configuration of the search space. To tackle this challenge, FGNAS constructs a feature map using a composition of multiple operations as illustrated in Figure 2, where the composition allows generating a large number of virtual operations and increases the flexibility of searched models. Given an input tensor in the $l$-th layer, denoted by $X^l$, the output of the layer, $X^{l+1}$, is expressed as

$$X^{l+1} = \frac{1}{N^l} \sum_{n=1}^{N^l} m_n^l \odot O_n^l(X^l), \qquad (1)$$

where $N^l$ is the number of operations at the $l$-th layer considered in our search and

$$m_n^l \in \{0, 1\}^{C^l}. \qquad (2)$$

Note that $m_n^l$ is a binary vector, $O_n^l$ represents the $n$-th operation producing a tensor with $C^l$ channels, and $\odot$ denotes the channel-wise binary masking operator. In other words, the output tensor is given by the average of the masked tensors, where the mask of each tensor is learned by our search algorithm, which also allows channel pruning by masking out the same channels in all output tensors. In addition to the operation search, we optionally consider identity connections from a preceding layer, which modifies (1) as

$$X^{l+1} = \frac{1}{N^l + 1} \left( \sum_{n=1}^{N^l} m_n^l \odot O_n^l(X^l) + m_0^l \odot X^s \right), \qquad (3)$$

where $X^s$ denotes the feature map from which the identity connection originates.
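A minimal sketch of the masked composition in (1), with toy 1-D "channels" and two stand-in operations; all names are illustrative. The layer output is the average of per-operation outputs, each gated by a binary channel mask, and a channel masked out in all operations is effectively pruned.

```python
def masked_layer(x, operations, masks):
    """x: list of input channel values; operations: callables each returning
    a list of C output channel values; masks: binary vectors of length C."""
    n_ops = len(operations)
    C = len(masks[0])
    out = [0.0] * C
    for op, mask in zip(operations, masks):
        y = op(x)
        for c in range(C):
            out[c] += mask[c] * y[c]          # channel-wise binary masking
    return [v / n_ops for v in out]           # average of masked tensors

# Two toy "operations" producing 3 output channels each.
op_a = lambda x: [sum(x), 2 * sum(x), 3 * sum(x)]
op_b = lambda x: [1.0, 1.0, 1.0]
# Channel 1 is masked out in both operations -> effectively pruned.
masks = [[1, 0, 1], [1, 0, 0]]
y = masked_layer([1.0, 1.0], [op_a, op_b], masks)
assert y == [1.5, 0.0, 3.0]
```

Channel 0 mixes both operations, channel 2 receives only the first one, and channel 1 is zeroed everywhere, which is exactly the pruning case described above.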
In our algorithm, each operation is defined by a series of a convolution, a normalization, and an activation function as illustrated in Figure 1 (b). Figure 3 presents our efficient operation structure, which increases the number of operations with little additional cost because all three operations in Figure 3 share the feature map preceding the normalization. For the parts of backbone networks where convolutional layers are not followed by normalization and activation layers, an operation is simply equivalent to a convolution.
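The cost-sharing idea can be sketched as follows; the structure and names are illustrative stand-ins, not the paper's implementation. All candidate operations read one shared upstream feature map computed once, so each additional candidate only pays for its own computation.

```python
# Toy sketch of shared computation among candidate operations (Figure 3).
shared_compute_calls = 0

def shared_stage(x):
    """Stand-in for the feature map computation shared by all candidates."""
    global shared_compute_calls
    shared_compute_calls += 1
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def candidate(gain):
    """Stand-in for one candidate operation (e.g., a conv followed by ReLU)."""
    return lambda x: [max(gain * v, 0.0) for v in x]

x = [1.0, -1.0, 3.0]
shared = shared_stage(x)                                  # computed once ...
outputs = [candidate(g)(shared) for g in (1.0, 2.0, 3.0)] # ... reused by all
assert shared_compute_calls == 1
assert len(outputs) == 3
```

This mirrors the argument in the text: enlarging the candidate set adds little overhead beyond the candidates' own convolutions.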
3.2 Per-Channel Differentiable Gating Functions
To relax the binary mask in (2), we introduce a relaxed gating function $g$, and define a collection of the gating functions, denoted by $\mathcal{G}$, as

$$\mathcal{G} = \left\{ g_{l,n,c}(\alpha_{l,n,c}) \;\middle|\; 1 \le l \le L,\; 1 \le n \le N^l,\; 1 \le c \le C^l \right\}, \qquad (4)$$

where $l$ and $n$ denote the layer and operation index, respectively, and $C^l$ is the number of channels. A relaxed gating function for each channel, parametrized by $\alpha$, is given by

$$g(\alpha) = \mathbb{1}\left[ \sigma(\alpha)_1 > \sigma(\alpha)_0 \right], \qquad (5)$$

where $\mathbb{1}[\cdot]$ is an indicator function that returns 1 when its input is true and 0 otherwise, and $\sigma(\alpha)_d$ denotes the value corresponding to dimension $d$ after applying a softmax function. Figure 4 (a) and (b) illustrate $g$ and $\mathcal{G}$, respectively.

Using the relaxed gating function, we reformulate the channel-wise tensor masking in (2) as

$$m_{n,c}^l = g_{l,n,c}(\alpha_{l,n,c}). \qquad (6)$$

This relaxed gating function allows updating the architecture by a gradient-descent optimization method because the backward function is differentiable.
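A minimal sketch of this binary-forward / soft-backward behavior in the style of a straight-through estimator: the forward pass makes a hard keep/prune decision, while the gradient is taken through the softmax probability. The two-logit parameterization and function names are assumptions for illustration; in practice this would be a custom autograd function.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def gate_forward(alpha):
    """alpha = (logit_off, logit_on) for one channel of one operation."""
    p_off, p_on = softmax(alpha)
    return 1.0 if p_on > p_off else 0.0       # hard binary decision

def gate_backward(alpha, grad_out):
    """Gradient surrogate: differentiate the soft probability p_on instead
    of the non-differentiable indicator (straight-through relaxation)."""
    p_off, p_on = softmax(alpha)
    # d p_on / d alpha = (-p_on * p_off, p_on * (1 - p_on))
    return (-p_on * p_off * grad_out, p_on * (1 - p_on) * grad_out)

assert gate_forward((0.0, 1.0)) == 1.0   # "on" logit dominates -> keep channel
assert gate_forward((2.0, -1.0)) == 0.0  # "off" dominates -> prune channel
```

The forward output is always exactly 0 or 1, so the searched architecture is discrete, while the backward rule stays smooth in the gating parameters.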
3.3 Resource Regularizer on Channels
The proposed approach aims to maximize the accuracy of a target task while minimizing the resource usage of the identified model. Hence, our objective function is composed of two terms: one is the task-specific loss and the other is a regularizer penalizing the overhead of networks such as parameters, FLOPs, and latency. To search for operations per channel, the proposed regularizer computes the amount of resource usage of each channel, which changes over iterations due to the gradual update of architectures. Figure 4 (c) illustrates an overview of the resource regularizer, and the rest of this section discusses the details.
Let $\mathcal{L}$ denote a loss function for an arbitrary task (in our work, the tasks are image classification and super-resolution) and $\mathcal{R}$ be a differentiable regularizer that estimates the resources of the current model identified by our search algorithm. Then, the objective function is formally given by

$$\min_{w, \alpha}\; \mathcal{L}(w, \alpha) + \lambda \mathcal{R}(\alpha), \qquad (7)$$

where $w$ and $\alpha$ are the learnable parameters of the neural network and the gating functions $\mathcal{G}$, respectively, and $\lambda$ is the hyper-parameter balancing the two terms. Specifically, the regularizer is given by

$$\mathcal{R}(\alpha) = \sum_{l=1}^{L} \sum_{n=1}^{N^l} r_t\!\left(c_{\text{in}}^{l,n}, c_{\text{out}}^{l,n}\right), \qquad (8)$$

where $r_t(\cdot)$ is a resource measurement function of the $n$-th operation, $t$ indicates the type of the resource, and $L$ is the number of layers. Note that $c_{\text{in}}^{l,n}$ and $c_{\text{out}}^{l,n}$ are the numbers of input and output channels of the $n$-th operation, respectively, and they are differentiable via the gating functions, defined as

$$c_{\text{out}}^{l,n} = \left\| m_n^l \right\|_1, \qquad (9)$$

$$c_{\text{in}}^{l,n} = \left\| b\!\left( \sum_{n'=1}^{N^{l-1}} m_{n'}^{l-1} \right) \right\|_1, \qquad (10)$$

where $\|\cdot\|_1$ denotes the $\ell_1$ norm of a vector. The function $b(\cdot)$ produces a binary vector, valued 1 for the non-zero elements of the input vector in the forward pass, but is an identity function in the backward pass. A skip connection from an earlier layer affects (10) because we need to consider an extra term in the summation.
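A minimal sketch of how such a channel-level regularizer can reach every gate: the expected FLOPs of one convolution are computed from the (soft) numbers of active input and output channels, so the penalty is a function of each channel's gate value. The function and argument names are illustrative.

```python
def expected_conv_flops(in_gates, out_gates, kernel_size, out_hw):
    """in_gates/out_gates: per-channel keep values (binary or soft, in [0,1]);
    out_hw: (height, width) of the output feature map.
    Returns the multiply-accumulate count of a plain convolution."""
    c_in = sum(in_gates)    # soft count of active input channels, cf. (10)
    c_out = sum(out_gates)  # soft count of active output channels, cf. (9)
    h, w = out_hw
    return c_in * c_out * kernel_size * kernel_size * h * w

# 3x3 conv on an 8x8 output map: 2 of 4 input and 3 of 4 output channels kept.
flops = expected_conv_flops([1, 1, 0, 0], [1, 1, 1, 0], 3, (8, 8))
assert flops == 2 * 3 * 9 * 64  # 3456 multiply-accumulates
```

Because the count is a sum over gates, its gradient with respect to each gate is the marginal cost of keeping that particular channel, which is exactly the per-channel penalty described above.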
On the other hand, the measurement functions for the number of parameters and FLOPs are well-defined functions of convolution kernel sizes, numbers of channels, feature map resolutions, etc. They are differentiable with respect to the number of active channels by the definitions in (9) and (10). However, it is not straightforward to define the latency measurement function on specific devices such as Google Pixel 1 and Samsung Galaxy S8. We address this problem by fitting affine functions, latency ≈ a · FLOPs + b, to the relation between latency and FLOPs; it turns out that convolution operations present strong correlations between latency and FLOPs under a particular condition given by the combination of input feature map size, kernel size, stride, convolutional groups, and so on. By approximating latency as a function of FLOPs, the regularizer (8) with the latency resource type naturally penalizes all channels to minimize the run-time of networks.
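The affine fit can be sketched with ordinary least squares; the measurements below are made-up toy numbers, not real device data, and the per-condition fitting (one affine model per feature-map/kernel/stride/group combination) is implied by the text.

```python
def fit_affine(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Perfectly linear toy data: latency(ms) = 0.5 * GFLOPs + 2.0
flops = [1.0, 2.0, 4.0, 8.0]
latency = [2.5, 3.0, 4.0, 6.0]
a, b = fit_affine(flops, latency)
assert abs(a - 0.5) < 1e-9 and abs(b - 2.0) < 1e-9
```

Once a and b are fitted for a given layer condition, latency becomes a differentiable function of the (soft) FLOPs estimate and can enter the regularizer directly.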
3.4 Search Space
FGNAS searches for an operation in each channel; the granularity of the architecture search is as small as a channel. Consequently, the number of possible combinations of operations in FGNAS is significantly larger than in other NAS techniques. Specifically, the search space in a single layer is $(N^l)^{C^l}$, where $N^l$ is the number of operations and $C^l$ is the number of channels at the $l$-th layer, with minor variations depending on the network configuration (e.g., the existence of skip connections). This is far beyond the range of other approaches because most NAS techniques are limited to a per-layer search strategy and explore a few building blocks instead of directly optimizing the whole model.
Table 2 illustrates the search space of operations in our search algorithm. The backbone networks for image classification include VGG, ResNet, DenseNet, EfficientNet, and MobileNetV2, while EDSR is employed for image super-resolution. Note that we insert a 1×1 convolution operation after an identity connection to reduce the number of input channels to the first convolution operation of a residual (or dense) block.
| Search dimension | Candidates |
|---|---|
| Convolution types | Normal, Depth-wise |
| Convolution kernel sizes | 1, 3, 5, 7, 9, 11 |
| Activation functions | ReLU, PReLU, tanh |
| Number of channels | 0, 1, 2, …, $C^l$ |
| Model | Type | Search Cost (GPU-days) | Top-1 Acc. | Parameters |
|---|---|---|---|---|
| DenseNet-BC | manual | - | 96.5 % | 25.6 M |
| Hierarchical Evolution | evolution | 300 | 96.3 % | 15.7 M |
| P-DARTS (large) + cutout | gradient-based | 0.3 | 97.8 % | 10.5 M |
| ProxylessNAS-G + cutout | gradient-based | 4.0 | 97.9 % | 5.7 M |
| ENAS + cutout | RL | 0.5 | 97.1 % | 4.6 M |
| EfficientNet-B0 | model scaling | - | 98.1 % | 4.0 M |
| EfficientNet-B0-FGNAS (Large) + cutout | gradient-based | 0.1 | 98.2 % | 3.9 M |
| P-DARTS + cutout | gradient-based | 0.3 | 97.5 % | 3.4 M |
| NASNet-A + cutout | RL | 1800 | 97.4 % | 3.3 M |
| DARTS (first order) + cutout | gradient-based | 1.5 | 97.0 % | 3.3 M |
| DARTS (second order) + cutout | gradient-based | 4 | 97.2 % | 3.3 M |
| AmoebaNet-A + cutout | evolution | 3150 | 96.6 % | 3.2 M |
| PNAS | SMBO | 225 | 96.6 % | 3.2 M |
| SNAS + mild constraint + cutout | gradient-based | 1.5 | 97.0 % | 2.9 M |
| SNAS + moderate constraint + cutout | gradient-based | 1.5 | 97.2 % | 2.8 M |
| AmoebaNet-B + cutout | evolution | 3150 | 97.5 % | 2.8 M |
| EfficientNet-B0-FGNAS (Small) + cutout | gradient-based | 0.5 | 97.8 % | 2.7 M |
| Model | Search Space | Method | Type | Top-1 Acc. | Parameters | FLOPs | CPU |
|---|---|---|---|---|---|---|---|
| MobileNetV2 (224) | No Search | Baseline | manual | 72.0 % | 3.4 M | 600 M | 75 ms |
| | + Channel Pruning | Multiplier (0.75) | manual | 69.8 % | 2.6 M | 418 M | 56 ms |
| | + Channel Pruning | NetAdapt | trial-and-error | 70.9 % | - | - | 64 ms |
| | + Channel Pruning | FGNAS (P) | gradient-based | 70.9 % | 3.5 M | 410 M | 53 ms |
| | + 5×5 DConv | FGNAS | gradient-based | 71.4 % | 3.1 M | 378 M | 53 ms |

Comparison with channel pruning methods on ImageNet. The NetAdapt number is a reported result with latency similar to Multiplier (0.75).
4 Experiments

This section first presents the benchmark datasets for the image classification and super-resolution tasks and describes the implementation details of our algorithm. Then, we present the experimental results, including performance analysis.

4.1 Datasets

CIFAR-10 and ILSVRC2012 are popular datasets for image classification. The former contains 50K training and 10K testing 32×32 images in 10 classes. The latter consists of 1.2M training and 50K validation images in 1,000 object categories, which are a subset of ImageNet. DIV2K is a training dataset for image super-resolution, which contains 800 2K-resolution images, while we evaluate super-resolution algorithms on Set5, Set14, B100, and Urban100.
4.2 Implementation Details
The proposed algorithm searches for architectures in four steps: (1) determine a backbone network and the operations for each layer; (2) pre-train the network without gating functions; (3) search for architectures by learning the gating function parameters until the resource usage of the searched architecture reaches the target; (4) fine-tune the searched architecture with the gating function parameters fixed.
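The four steps above can be sketched as a skeletal driver; all functions are stubs standing in for real training code, and the 10%-per-epoch shrink rate is an arbitrary illustration. The point is the ordering of the stages and the resource-based stopping criterion of the search stage.

```python
log = []

def pretrain(net):
    log.append("pretrain")
    return net

def resource(net):
    return net["flops"]

def search_step(net):
    net["flops"] *= 0.9          # one gating-update epoch shrinks the model
    log.append("search")
    return net

def finetune(net):
    log.append("finetune")
    return net

def fgnas_pipeline(backbone, target_flops):
    net = pretrain(backbone)                  # step 2: pre-train without gates
    while resource(net) > target_flops:       # step 3: learn gating parameters
        net = search_step(net)                #         until the target is met
    return finetune(net)                      # step 4: fine-tune, gates fixed

net = fgnas_pipeline({"flops": 100.0}, target_flops=60.0)
assert log[0] == "pretrain" and log[-1] == "finetune"
assert net["flops"] <= 60.0
```

Step (1), choosing the backbone and candidate operations, corresponds to constructing the `backbone` argument before the pipeline runs.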
The backbone network is EfficientNet-B0, of which the architecture is designed for ImageNet and transferred to CIFAR-10. The search space consists of kernel sizes 1, 3, and 5 in depth-wise convolution layers and the number of channels in all layers. We train the model for 160 epochs with mini-batch size 128 and initial learning rate 0.01. The resource of interest is the number of network parameters, which is penalized by the resource regularizer. We use the standard SGD optimizer with Nesterov momentum and Cutout augmentation, and we set the weight decay and momentum to 0.0001 and 0.9, respectively.
| Model | Search Space | Method | Top-1 Acc. | FLOPs |
|---|---|---|---|---|
| VGG-16 | No Search | Baseline | 93.7 % | 627 M |
| | + Channel pruning | FGNAS (P) | 93.6 % | 149 M |
| | + 1×1–11×11 Conv. | FGNAS | 93.6 % | 119 M |
| | + ReLU, PReLU, Tanh | FGNAS | 93.6 % | 110 M |
MobileNetV2 is the backbone network, of which the architecture is compactly designed for ImageNet classification. The search space consists of kernel sizes 3 and 5 in depth-wise convolution layers and the number of channels in all layers. We train models with mini-batch size 256 and initial learning rate 0.01. The training runs for 400 epochs, and the learning rate is divided by 10 at 50% and 75% of the total number of training epochs. The resource of interest is network latency, and the hyper-parameter for the resource regularizer is set to 0.0012. We evaluate our models on a Google Pixel 1 CPU using Google's TensorFlow Lite engine.
The backbone network is a small version of EDSR, where each layer has 64 channels and the architecture has 16 residual blocks. The search space consists of ReLU, PReLU, and tanh in activation layers and the number of channels in all layers. The model is pre-trained for 300 epochs using Adam, where the mini-batch size is 16 and the patch size is 96×96 pixels. The resource of interest is network FLOPs, which is penalized by the resource regularizer. The image restoration performance measures are PSNR and SSIM on the Y channel of the YCbCr color space with scaling factor 2.
4.3 Image Classification
Results on CIFAR-10
Table 3 presents the performance comparison with state-of-the-art architectures. FGNAS (Large) outperforms the backbone network EfficientNet-B0 with a smaller number of parameters, and FGNAS (Small) has 2.1× fewer parameters than ProxylessNAS-G with comparable accuracy. The search cost of the proposed algorithm is small, although finding smaller networks requires more time.
| Model | Method | Type | Top-1 Acc. | FLOPs |
|---|---|---|---|---|
| VGG-16 | Baseline | manual | 93.7 % | 627 M |
| | Huang et al. | policy-gradient | 90.9 % | 222 M |
| | Slimming | rule-based | 93.6 % | 211 M |
| | FGNAS (P) | gradient-based | 93.6 % | 149 M |
| VGG-19 | Baseline | manual | 94.0 % | 797 M |
| | Slimming | rule-based | 93.8 % | 391 M |
| | DCP | gradient-based | 94.2 % | 398 M |
| | FGNAS (P) | gradient-based | 94.3 % | 348 M |
| ResNet-18 | Baseline | manual | 91.5 % | 26.0 G |
| | Huang et al. | policy-gradient | 90.7 % | 6.2 G |
| | FGNAS (P) | gradient-based | 92.5 % | 1.3 G |
| ResNet-20 | Baseline | manual | 92.2 % | 81 M |
| | Soft Filter | rule-based | 91.2 % | 57 M |
| | FGNAS (P) | gradient-based | 91.7 % | 34 M |
| DenseNet-40 | Baseline | manual | 94.3 % | 566 M |
| | Slimming | rule-based | 93.5 % | 188 M |
| | FGNAS (P) | gradient-based | 93.6 % | 149 M |
| Type | Channel Pruning | Multiple-operation | Top-1 Acc. | FLOPs |
|---|---|---|---|---|
| (1) | ✓ | | 91.0 % | 278 M |
| (2) | | ✓ | 91.6 % | 131 M |
| Ours | ✓ | ✓ | 92.5 % | 61 M |
| Model | Type | Set5 (PSNR / SSIM) | Set14 (PSNR / SSIM) | B100 (PSNR / SSIM) | Urban100 (PSNR / SSIM) | Parameters | FLOPs |
|---|---|---|---|---|---|---|---|
| SRCNN | manual | 36.66 dB / 0.9542 | 32.42 dB / 0.9063 | 31.36 dB / 0.8879 | 29.50 dB / 0.8946 | 57 K | 105.4 G |
| VDSR | manual | 37.53 dB / 0.9587 | 33.03 dB / 0.9124 | 31.90 dB / 0.8960 | 30.76 dB / 0.9140 | 665 K | 1,225.2 G |
| CARN-M | manual | 37.53 dB / 0.9583 | 33.26 dB / 0.9141 | 31.92 dB / 0.8960 | 31.23 dB / 0.9144 | 412 K | 182.4 G |
| CARN | manual | 37.76 dB / 0.9590 | 33.52 dB / 0.9166 | 32.09 dB / 0.8978 | 31.92 dB / 0.9256 | 1,592 K | 445.6 G |
| MemNet | manual | 37.78 dB / 0.9597 | 33.28 dB / 0.9142 | 32.08 dB / 0.8978 | 31.51 dB / 0.9312 | 677 K | 5,324.8 G |
| EDSR | manual | 38.11 dB / 0.9601 | 33.92 dB / 0.9198 | 32.32 dB / 0.9013 | 32.93 dB / 0.9351 | 40,712 K | 18,769.5 G |
| RDN | manual | 38.24 dB / 0.9614 | 34.01 dB / 0.9212 | 32.34 dB / 0.9017 | 32.89 dB / 0.9353 | 22,114 K | 10,192.4 G |
| FALSR-B | evolution | 37.61 dB / 0.9585 | 33.29 dB / 0.9143 | 31.97 dB / 0.8967 | 31.28 dB / 0.9191 | 326 K | 149.4 G |
| ESRN-V | evolution | 37.85 dB / 0.9600 | 33.42 dB / 0.9161 | 32.10 dB / 0.8987 | 31.79 dB / 0.9248 | 324 K | 146.8 G |
| EDSR-FGNAS | gradient-based | 37.86 dB / 0.9593 | 33.44 dB / 0.9157 | 32.11 dB / 0.8987 | 31.85 dB / 0.9254 | 212 K | 97.6 G |
Results on ImageNet
Table 4 presents the performance comparison with MobileNetV2 Multiplier and NetAdapt, which successfully prune channels of efficiently designed networks [42, 20]. For a fair comparison, we evaluate the proposed algorithm as a channel pruning method, referred to as FGNAS (P), of which the search space is only the number of channels in all layers. FGNAS (P) is faster than the other channel pruning methods in both FLOPs and latency, and FGNAS achieves 1.6% higher Top-1 accuracy than Multiplier. The model latency reaches the target latency within 40 epochs at the search stage, which indicates the search cost of the proposed algorithm.
Ablation study of search space
Our search method easily enlarges the search space by adding operations to the layers of backbone networks to find more efficient architectures. Table 5 shows that the proposed algorithm finds faster networks in the larger search space with the same Top-1 accuracy. Figure 5 draws FLOPs/accuracy curves of our search methods. FGNAS consistently outperforms FGNAS (P) while reducing the network run-time, and finds an architecture with 5.7× fewer FLOPs than the original VGG-16 on CIFAR-10.
Searched architecture analysis
To analyze the performance improvement from flexible architectures, we visualize two FGNAS architectures with 250M and 110M FLOPs, searched from VGG-16 on CIFAR-10. The search space consists of kernel sizes 1, 3, 5, 7, 9, and 11 in convolutions, ReLU, PReLU, and tanh in activation functions, and the number of channels in all layers. The networks searched by FGNAS differ from the original VGG-16 by less than 0.3% in accuracy. Figure 6 (a) shows that the 3rd, 5th, 8th, and 10th layers, located right after pooling operations, retain more channels than the subsequent layers, and the 110M FLOPs network prunes most of the channels in the 10th–12th layers of the 250M FLOPs network. As illustrated in Figure 6 (b), the 110M FLOPs network has a much larger number of operation types within a layer, which leads to complex layer configurations; note that the 5th layer has 31 different operation types. Figure 6 (c) shows that 1×1 convolutions appear more frequently for network efficiency. Figure 7 (a) shows that convolutions with 1×1 kernels produce more channels in the 8th–13th layers, where the feature map resolutions are 4×4 and 2×2 pixels. On the other hand, the 1st–8th layers prefer 3×3 convolutions over 1×1 and prune most channels at the 10th layer, as illustrated in Figure 7 (b). The channels from convolutions with 5×5 kernels mainly remain in the 3rd, 5th, and 8th layers, located right after pooling operations.
Channel pruning results on CIFAR-10
We evaluate the channel pruning performance of our algorithm, FGNAS (P), on diverse backbone networks: VGGNet, ResNet, and DenseNet. Since the original networks are designed for ImageNet, we adopt the versions modified for CIFAR-10 [36, 26]. Table 6 shows that the proposed algorithm outperforms the existing pruning methods [26, 36, 56, 17] even with fewer FLOPs. Huang et al. removes channels layer-by-layer with RL-based policy gradient estimation, whose search cost is 30 GPU-days on an Nvidia K40. Since FGNAS (P) searches over all layers simultaneously using differentiable gating functions, its search cost is 1 GPU-hour on a GeForce 1080 Ti for CIFAR-10. We reproduced the DenseNet-40 result of Slimming for a fair comparison.
Ablation study of gating function
We evaluate the proposed search algorithm with modifications of the gating function that exclude its advantages one by one. Table 7 shows that each component significantly improves the performance of the searched architectures. Note that the Type (2) gating function in Table 7 searches for an operation per channel, while the gating functions in ProxylessNAS and FBNet choose one operation per layer.
4.4 Image Super-Resolution
To verify the practical effectiveness of our approach, we evaluate our search method on image super-resolution (SR) tasks. The primary metric for this task is the FLOPs of networks because FLOPs are easy to calculate regardless of input image resolutions, which are arbitrary in SR problems.
Table 8 shows the FLOPs of networks producing an HD image (1280×720 resolution) with scaling factor 2. Since SR networks require a substantially larger amount of FLOPs than conventional image classification networks, our search algorithm aims to find faster networks. FGNAS achieves 1.5× fewer FLOPs and parameters than the state-of-the-art NAS approaches [7, 44] as illustrated in Table 8. Note that FGNAS is even faster than SRCNN, which consists of 3 convolution layers. The searched residual blocks have a large number of channels and operations for activation, and the number of channels for skip connections gradually increases with network depth. The search cost is 0.5 GPU-day with a GeForce 2080 Ti.
5 Conclusion

We presented a novel architecture search technique, referred to as FGNAS, which provides a unified framework of structure and operation search via channel pruning. The proposed approach can be optimized by a gradient-based method, and we formulate a differentiable regularizer of neural networks with respect to resources, which facilitates efficient and stable optimization with diverse task-specific and resource-aware loss functions.
-  Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In CVPRW, 2017.
-  Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. arXiv:1803.08664, 2018.
-  Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures using reinforcement learning. ICLR, 2017.
-  Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie-Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, 2012.
-  Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.
-  Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In ICCV, 2019.
-  Xiangxiang Chu, Bo Zhang, Hailong Ma, Ruijun Xu, Jixiang Li, and Qingyuan Li. Fast, accurate and lightweight super-resolution with neural architecture search. arXiv:1901.07261, 2019.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
-  Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
-  Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552, 2017.
-  Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. TPAMI, 38:295–307, 2014.
-  Jin-Dong Dong, An-Chieh Cheng, Da-Cheng Juan, Wei Wei, and Min Sun. Dpp-net: Device-aware progressive search for pareto-optimal neural architectures. In ECCV, 2018.
-  Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan. More is less: A more complicated network with less inference complexity. In CVPR, 2017.
-  Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. ICLR, 2016.
-  Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS. 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016.
-  Yang He, Guoliang Kang, Xuanyi Dong, Yanwei Fu, and Yi Yang. Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI, 2018.
-  Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In ECCV, 2018.
-  Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, Oct 2017.
-  Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3. In ICCV, 2019.
-  Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint, arXiv:1704.04861, 2017.
-  Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
-  Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. CVPR, 2018.
-  Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CVPR, 2017.
-  Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In CVPR. IEEE, 2015.
-  Q. Huang, K. Zhou, S. You, and U. Neumann. Learning to prune filters in convolutional neural networks. In WACV, 2018.
-  Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size. arXiv preprint, arXiv:1602.07360, 2016.
-  Jiwon Kim, Jungkwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, 2016.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
-  Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research).
-  Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. ICLR, 2017.
-  Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. arXiv preprint, arXiv:1707.02921, 2017.
-  Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In ECCV, 2018.
-  Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical representations for efficient architecture search. In ICLR, 2018.
-  Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. In ICLR, 2019.
-  Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, Oct 2017.
-  David R. Martin, Charless C. Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.
-  Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient transfer learning. ICLR, 2017.
-  Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. PMLR, 2018.
-  Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. AAAI, 2019.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
-  Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint, arXiv:1409.1556, 2014.
-  Dehua Song, Chang Xu, Xu Jia, Yiyi Chen, Chunjing Xu, and Yunhe Wang. Efficient residual dense block search for image super-resolution. arXiv:1909.11409, 2019.
-  Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
-  Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. Memnet: A persistent memory network for image restoration. In ICCV, 2017.
-  Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint, arXiv:1807.11626, 2018.
-  Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
-  Mingxing Tan and Quoc V. Le. MixConv: Mixed depthwise convolutional kernels. In BMVC, 2019.
-  Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, 2019.
-  Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. SNAS: stochastic neural architecture search. In ICLR, 2019.
-  Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. Netadapt: Platform-aware neural network adaptation for mobile applications. In ECCV, 2018.
-  Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In Curves and Surfaces, 2010.
-  Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
-  Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In CVPR, 2018.
-  Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jin-Hui Zhu. Discrimination-aware channel pruning for deep neural networks. In NIPS, 2018.
-  Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
-  Barret Zoph, V. Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. CVPR, 2018.