Network Slimming by Slimmable Networks: Towards One-Shot Architecture Search for Channel Numbers

03/27/2019
by   Jiahui Yu, et al.

We study how to set channel numbers in a neural network to achieve better accuracy under constrained resources (e.g., FLOPs, latency, memory footprint or model size). A simple and one-shot solution, named AutoSlim, is presented. Instead of training many network samples and searching with reinforcement learning, we train a single slimmable network to approximate the network accuracy of different channel configurations. We then iteratively evaluate the trained slimmable model and greedily slim the layer with minimal accuracy drop. By this single pass, we can obtain the optimized channel configurations under different resource constraints. We present experiments with MobileNet v1, MobileNet v2, ResNet-50 and RL-searched MNasNet on ImageNet classification. We show significant improvements over their default channel configurations. We also achieve better accuracy than recent channel pruning methods and neural architecture search methods. Notably, by setting optimized channel numbers, our AutoSlim-MobileNet-v2 at 305M FLOPs achieves 74.2% top-1 accuracy, 2.4% better than the default MobileNet-v2 (301M FLOPs), and even 0.2% better than RL-searched MNasNet (317M FLOPs). Our AutoSlim-ResNet-50 at 570M FLOPs, without depthwise convolutions, achieves 1.3% better accuracy than MobileNet-v1 (569M FLOPs). Code and models will be available at: https://github.com/JiahuiYu/slimmable_networks


1 Introduction

The channel configuration (a.k.a. filter numbers or channel numbers) of a neural network plays a critical role in its affordability on resource-constrained platforms, such as mobile phones, wearables and Internet of Things (IoT) devices. The most common constraints [1, 2, 3, 4, 5], i.e., latency, FLOPs and runtime memory footprint, are all bound to the number of channels. For example, in a single convolution or fully-connected layer, the FLOPs (number of Multiply-Adds) increase linearly with the number of output channels. The memory footprint can also be reduced [6] by reducing the number of channels in bottleneck convolutions for most vision applications [6, 7, 8, 9].
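To make this relation concrete, here is a minimal sketch of the Multiply-Add count of a single convolution layer; the function name and the example layer sizes are illustrative choices, not taken from the paper.

```python
def conv_madds(h_out, w_out, k, c_in, c_out, groups=1):
    """Multiply-Adds of a conv layer: linear in the number of output channels."""
    return h_out * w_out * (k * k * c_in // groups) * c_out

# Example: a 3x3 convolution on a 56x56 feature map with 64 input channels.
# Doubling c_out from 128 to 256 doubles the Multiply-Adds.
print(conv_madds(56, 56, 3, 64, 128))  # 231,211,008
print(conv_madds(56, 56, 3, 64, 256))  # 462,422,016
```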

Despite its importance, the number of channels has been chosen mostly based on heuristics. LeNet-5 [10] selected 6 channels in its first convolution layer, which are then projected to 16 channels after sub-sampling. AlexNet [11] adopted five convolutions with 96, 256, 384, 384 and 256 channels. A commonly used heuristic, the "half size, double channel" rule, was introduced in VGG nets [12], if not earlier: when the spatial size of the feature map is halved, the number of filters is doubled. This heuristic has been more or less used in follow-up network architecture designs including ResNets [13, 14], Inception nets [15, 16, 17], MobileNets [6, 7] and networks for many vision applications [18, 19, 20, 21, 22]. Other heuristics have also been explored. For example, the pyramidal rule [23, 24] suggested gradually increasing the channels in all convolutions layer by layer, regardless of spatial size. Figure 1 visually summarizes these heuristics for setting channel numbers in a neural network.
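As a rough illustration of the "half size, double channel" rule, the sketch below generates a VGG-like channel schedule; the stage count, base width and cap are arbitrary values for illustration, not taken from any specific network.

```python
def half_size_double_channel(base_channels=64, num_stages=5, cap=512):
    """Each time the feature map is downsampled (halved), the channel count
    doubles, optionally capped as in VGG-style networks."""
    return [min(base_channels * (2 ** s), cap) for s in range(num_stages)]

print(half_size_double_channel())  # [64, 128, 256, 512, 512]
```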

Figure 1: Various heuristics for setting channel numbers across the entire network [12, 23, 24] and inside network building blocks [6, 13, 23, 24, 25, 26].

Beyond the macro-level heuristics across the entire network, recent works [6, 13, 24, 25, 26] have also dug into the channel configuration of micro-level building blocks (a network building block is usually composed of several 1×1 and 3×3 convolutions). These micro-level heuristics have led to better speed-accuracy trade-offs. The first of its kind, the bottleneck residual block, was introduced in ResNet [13]. It is composed of 1×1, 3×3 and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then restoring dimensions, leaving the 3×3 layer a bottleneck (4× reduction). MobileNet v2 [6], however, argued that the bottleneck design is not efficient and proposed the inverted residual block, where 1×1 layers first expand the features (6× expansion) and then project them back after an intermediate depthwise convolution. Furthermore, MNasNet [25] and ProxylessNAS nets [26] included a 3× expansion version of the inverted residual block in their search spaces, and achieved even better accuracy under similar runtime latency.

Apart from these human-designed heuristics, efforts to automatically optimize the channel configuration have been made, explicitly or implicitly. A recent work [27] suggested that many network pruning methods [1, 28, 29, 30, 31, 32] can be thought of as performing network architecture search for channel numbers. Liu et al. [27] showed that training these pruned architectures from scratch leads to similar or even better performance than fine-tuning and pruning from a large model. More recently, MNasNet [25] proposed to directly search network architectures, including filter sizes, using reinforcement learning algorithms [33, 34]. Although the search is performed on a factorized hierarchical search space, massive network samples and computational cost [25] are required to obtain an optimized network architecture.

In this work, we study how to set channel numbers in a neural network to achieve better accuracy under constrained resources. To start, the first and most brute-force approach that comes to mind is exhaustive search: training all possible channel configurations of a deep neural network for full epochs (e.g., MobileNets [6, 7] are trained for approximately 480 epochs on ImageNet), and then simply selecting the best performers that satisfy the efficiency constraints. However, this is undoubtedly impractical since the cost of the brute-force approach is prohibitive. For example, consider an n-layer convolutional network with a search space limited to 10 candidate channel numbers per layer: there are 10^n candidate network architectures in total.
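A back-of-the-envelope check of this combinatorial explosion, assuming the illustrative setting above (10 width candidates per layer):

```python
candidates_per_layer = 10
for num_layers in (10, 20, 50):
    # Number of distinct channel configurations grows exponentially with depth.
    print(num_layers, candidates_per_layer ** num_layers)
# 10 layers -> 1e10 configurations; 20 -> 1e20; 50 -> 1e50
```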

To address this challenge, we present a simple and one-shot solution, AutoSlim. Our main idea lies in training a slimmable network [35] to approximate the network accuracy of different channel configurations. Yu et al. [35, 36] introduced slimmable networks that can run at arbitrary widths with equal or even better performance than the same architectures trained individually. Although their original motivation is to provide instant and adaptive accuracy-efficiency trade-offs, we find slimmable networks especially suitable as benchmark performance estimators for several reasons: (1) Training slimmable models (using the sandwich rule [36]) is much faster than the brute-force approach. (2) A trained slimmable model can execute at arbitrary widths, which can be used to approximate the relative performance among different channel configurations. (3) The same trained slimmable model can be applied to the search of optimal channels under different resource constraints.

In AutoSlim, we first train a slimmable model for a few epochs (e.g., 10% to 20% of the full training epochs) to quickly obtain a benchmark performance estimator. We then iteratively evaluate the trained slimmable model and greedily slim the layer with minimal accuracy drop on a validation set (for ImageNet, we randomly hold out part of the training set as the validation set). After this single pass, we obtain the optimized channel configurations under different resource constraints (e.g., network FLOPs limited to 150M, 300M and 600M). Finally we train these optimized architectures individually or jointly (as a single slimmable network) for the full training epochs. We experiment with various networks including MobileNet v1, MobileNet v2, ResNet-50 and RL-searched MNasNet on the challenging setting of 1000-class ImageNet classification. We compare our results with two baselines: (1) the default channel configurations of these networks, and (2) channel pruning methods on the same network architectures [29, 30, 37, 38].
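The overall pipeline can be summarized in a short sketch; the helper names below (train_slimmable, greedy_slim, train_from_scratch) are placeholders for illustration and do not correspond to actual APIs in the released code.

```python
# A minimal sketch of the AutoSlim pipeline, assuming placeholder helpers.
def autoslim(arch, flops_targets, short_epochs, full_epochs):
    # 1) Train a slimmable model briefly to obtain a cheap accuracy estimator.
    slimmable = train_slimmable(arch, epochs=short_epochs)

    # 2) Greedily slim layer by layer; record a config whenever a FLOPs target is crossed.
    configs = greedy_slim(slimmable, flops_targets)

    # 3) Train each optimized configuration from scratch for full epochs.
    return [train_from_scratch(arch, cfg, epochs=full_epochs) for cfg in configs]
```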

Our contributions are summarized as follows:

  • We present the first one-shot approach on network architecture search for channel numbers with experiments on large-scale ImageNet classification.

  • We demonstrate the importance of channel configuration in neural networks and the effectiveness of our approach on addressing this challenging problem.

  • We achieve the state-of-the-art speed-accuracy trade-offs by setting the optimized channel configurations using AutoSlim.

2 Related Work

2.1 Architecture Search for Channel Numbers

In this part, we mainly discuss previous methods on automatic architecture search for channel numbers. Human-designed heuristics have been introduced in Section 1 and visually summarized in Figure 1.

Channel Pruning. Channel pruning (a.k.a. network slimming) methods [1, 30, 39, 40, 41] aim at reducing the effective channels of a large neural network to speed up its inference. Training-based, inference-time and initialization-time pruning methods have all been proposed in the literature [1, 30, 39, 40, 41, 42]. Here we selectively review two methods [1, 30]. He et al. [30] proposed an inference-time approach based on an iterative two-step algorithm: LASSO-based channel selection and least-square feature reconstruction. Liu et al. [1], on the other hand, trained neural networks with an ℓ1 regularization on the scaling factors in batch normalization (BN) [43]. By pushing the factors towards zero, insignificant channels can be identified and removed. In a recent work [27], Liu et al. suggested that many network pruning methods [1, 28, 29, 30, 31, 32] can be thought of as performing network architecture search for channel numbers. In experiments, Liu et al. [27] showed that training these pruned architectures from scratch leads to similar or even better performance than iteratively fine-tuning and pruning a large model. Thus, Liu et al. [27] concluded that training a large, over-parameterized model is not necessary to obtain an efficient final model. In our work, we take channel pruning methods [29, 30, 37] as one of our baselines.

Neural Architecture Search (NAS). Recently there has been a growing interest in automating neural network architecture design [25, 26, 44, 45, 46, 47, 48, 49, 50, 51]. Significant improvements have been achieved by these automatically searched architectures in many vision and language tasks [47, 52]. However, most neural architecture search methods [44, 45, 46, 47, 48, 49, 50, 51] did not include the channel configuration in their search space and instead applied human-designed heuristics. More recently, RL-based search algorithms have also been applied to prune channels [37] or to search for filter numbers [25] directly. He et al. proposed AutoML for Model Compression (AMC) [37], which leveraged reinforcement learning (deep deterministic policy gradient [53]) to provide the model compression policy. MNasNet [25] proposed to directly search network architectures, including filter sizes, for mobile devices. In the search, each sampled model is trained for a few epochs using an aggressive learning rate schedule and evaluated on a validation set; in total, Tan et al. sampled about 8,000 models during architecture search. Further, ProxylessNAS [26] proposed to directly learn architectures for large-scale target tasks and target hardware platforms, based on DARTS [50]. For each residual block, ProxylessNAS [26] followed the channel configuration of MNasNet [25], while inside each block the choices are inverted residual blocks with different kernel sizes and expansion ratios. The memory consumption issue [26, 50] was addressed by binarizing the architecture parameters and forcing only one path to be active.

2.2 Slimmable Networks

Slimmable networks were first introduced in [35]. A general slimmable training algorithm and switchable batch normalization were introduced to train a single neural network executable at different widths, permitting instant and adaptive accuracy-efficiency trade-offs at runtime. However, one drawback of switchable batch normalization is that the width can only be chosen from a predefined set of widths. This drawback was addressed in [36], where the authors introduced universally slimmable networks, extending slimmable networks to execute at arbitrary widths and generalizing to networks both with and without batch normalization layers. Meanwhile, two improved training techniques, the sandwich rule and inplace distillation, were proposed [36] to enhance the training process and boost testing accuracy. Moreover, with the proposed methods, one can train nonuniform universally slimmable networks, where the width ratio is not uniformly applied to all layers. In other words, each layer in a nonuniform universally slimmable network can adjust its number of channels independently during inference. In this work, we simply refer to nonuniform universally slimmable networks as slimmable networks, if not explicitly noted. While the original motivation [35, 36] of slimmable networks is to provide instant and adaptive accuracy-efficiency trade-offs at runtime for different devices, we present an approach that uses slimmable networks for searching the channel configurations of deep neural networks.

3 Network Slimming by Slimmable Networks

Figure 2: The flow diagram of our proposed approach AutoSlim.

In this section, we first present an overview of our proposed approach for searching the channel configuration of neural networks. We then discuss and analyze the differences between our approach and other baselines, i.e., network pruning methods and network architecture search methods. Afterwards we present each individual module of our proposed solution and discuss its non-trivial details.

3.1 Overview

The goal of channel configuration search is to optimize the number of channels in each layer, such that the network architecture with optimized channel configuration can achieve better accuracy under constrained resources. The constraints can be FLOPs, latency, memory footprint or model size. Our approach is conceptually simple, and it has two essential steps:

(1) Given a network architecture (e.g., MobileNets, ResNets), we first train a slimmable model for a few epochs (e.g., 10% to 20% of the full training epochs). During training, many different sub-networks with diverse channel configurations are sampled and trained. Thus, after training, one can directly sample sub-network architectures for instant inference, using the corresponding computational graph and the same trained weights.

(2) Next, we iteratively evaluate the trained slimmable model on the validation set. In each iteration, we decide which layer to slim by comparing the feed-forward evaluation accuracies on the validation set. We greedily slim the layer with minimal accuracy drop, until the efficiency constraint is reached. No training is required in this step.

The flow diagram of our approach is shown in Figure 2. Our approach is also flexible with respect to different resource constraints, since the FLOPs, latency, memory footprint and model size are all deterministic given a channel configuration and a runtime environment. By a single pass of greedy slimming in step (2), we can obtain the (FLOPs, latency, memory footprint, model size, accuracy) tuples of different channel configurations. It is noteworthy that the latency and accuracy are relative values, since the latency may differ across hardware and the accuracy can be improved by training the network for full epochs. In the setting of optimizing channel numbers, we benefit from these relative values as performance estimators.

Figure 3: The flow diagram of network pruning methods [1].
Figure 4: The flow diagram of network architecture search methods [25, 26, 47, 52].

Discussion. We compare the flow diagram of our approach with those of the baselines, i.e., network pruning methods and network architecture search methods.

Many network channel pruning methods [1, 4, 29, 32] follow a typical iterative training-pruning-finetuning pipeline, as shown in Figure 3. For example, Liu et al. [1] trained neural networks with an ℓ1 regularization on the scaling factors in batch normalization (BN). After training, the method identifies channels whose scaling factors are near zero for pruning. Pruning temporarily leads to accuracy loss, thus a fine-tuning process and a repetitive multi-pass procedure are introduced to recover the final accuracy. Compared with our approach, a notable difference is that most network channel pruning methods are grounded on the importance of trained weights, thus the slimmed layer usually consists of channels with discrete indices (e.g., the 4th, 7th and 9th channels are kept as important channels while all others are pruned). In our approach, after slimmable training, the importance of a weight is implicitly ranked by its index. Thus our approach focuses more on the importance of channel numbers, and we always keep the lower-index channels (e.g., channels 1 to 3 are kept while channels 4 to 10 are slimmed in step (2)). We demonstrate the advantage of our approach with empirical evidence on ImageNet classification with various network architectures.
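The distinction can be illustrated with a toy slicing example; this is a sketch of the indexing behavior only, and the tensor shapes are made up for illustration.

```python
import torch

weight = torch.randn(10, 16, 3, 3)  # conv weight: (out_channels, in_channels, kH, kW)

# Channel pruning keeps an arbitrary set of "important" channel indices:
kept_by_pruning = weight[[3, 6, 8]]   # e.g., the 4th, 7th and 9th channels

# Slimmable-style slimming always keeps the lowest-index channels:
kept_by_slimming = weight[:3]         # the 1st to 3rd channels

print(kept_by_pruning.shape, kept_by_slimming.shape)  # both torch.Size([3, 16, 3, 3])
```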

Network architecture search methods [25, 26, 47, 52] commonly consist of three major components: a search space, a search strategy, and a performance estimation strategy. A typical pipeline is shown in Figure 4. First the search space is defined, based on which the search agent samples network architectures. Each architecture is then passed to a performance estimator, which returns rewards (e.g., predictive accuracy after training and/or network runtime latency) to the search agent. In this process, the search agent learns from the repetitive loop to design better network architectures. One major drawback of network architecture search methods is their high computational and time cost [46, 50]. Although differentiable architecture search methods [50, 54] have recently been proposed, they cannot be applied to the search of channel numbers directly. Most of them [50, 54] still use human-designed heuristics for setting channel numbers, which may introduce human bias.

3.2 Training Slimmable Networks

Warmup. We warm up with a brief review of training techniques for slimmable networks; more details can be found in [35, 36]. Slimmable networks were first introduced and trained with switchable batch normalization [43], which employs individual BNs for different sub-networks. During training, features are normalized with the current mini-batch mean and variance, so a simple modification to switchable batch normalization was introduced in [36]: re-calibrating BN statistics after training. With this simple modification, one can train universally slimmable networks [36] that can run with arbitrary channel numbers. Moreover, two improved training techniques, the sandwich rule and inplace distillation, were introduced to enhance the training process and boost testing accuracy. We use all these techniques when training slimmable models by default.
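For intuition, a minimal sketch of one training step with the sandwich rule and inplace distillation is shown below. It assumes a hypothetical `model.set_width(w)` interface for switching the active width; the real implementation in the released repository differs in details.

```python
import random
import torch
import torch.nn.functional as F

def sandwich_step(model, images, labels, optimizer, n_random=2,
                  min_width=0.25, max_width=1.0):
    """One SGD step with the sandwich rule: always train the widest and narrowest
    widths plus a few random widths; narrower widths are distilled in place from
    the widest model's predictions. `model.set_width` is an assumed interface."""
    optimizer.zero_grad()

    # Widest model: trained with the ground-truth labels (also acts as teacher).
    model.set_width(max_width)
    logits_max = model(images)
    F.cross_entropy(logits_max, labels).backward()
    teacher = logits_max.detach().softmax(dim=1)

    # Narrowest width + randomly sampled widths: inplace distillation from the teacher.
    widths = [min_width] + [random.uniform(min_width, max_width) for _ in range(n_random)]
    for w in widths:
        model.set_width(w)
        logits = model(images)
        torch.mean(torch.sum(-teacher * F.log_softmax(logits, dim=1), dim=1)).backward()

    optimizer.step()
```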

Assumption. Our approach relies on the assumption that a slimmable model is a good accuracy estimator for individually trained models with the same channel configurations. More specifically, we are interested in the relative ranking of accuracy among networks with different channel configurations, and we use the instant inference accuracy of a slimmable model as the performance estimator. We note that assumptions and approximations commonly exist in other related methods. For example, network channel pruning methods [1, 30] may assume that weights with smaller norm are less informative and can be pruned, which may not be the case as shown in [39]. Recently the Lottery Ticket Hypothesis [42] was also introduced. Network architecture search methods [25, 26] may assume transferability among different datasets, accuracy approximations using aggressive learning rates and fewer training epochs, and approximations in runtime latency modeling.

The Search Space. The executable sub-networks in a slimmable model compose the search space of channel configurations for a given network architecture. To train a slimmable model, we simply apply two width multipliers [7, 36] as the upper and lower bounds of channel numbers; for example, for all mobile networks [6, 7, 25, 26], we train a slimmable model that can execute at any width between the two bounds. In each training iteration, we randomly and independently sample the number of channels in each layer. It is noteworthy that in residual networks, we first sample the channel number of the residual identity pathway and then randomly and independently sample the channel numbers inside each residual block. Moreover, we make all layers in a neural network slimmable, including the first convolution layer and the last fully-connected layer. In each layer, we divide the channels evenly into groups (e.g., 10 groups) to reduce the search space. In other words, during training or slimming, we sample or remove an entire group instead of an individual channel. We note that even with channel grouping, the search space remains large.
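A minimal sketch of the per-layer sampling with channel grouping might look like the following; the group count and width bounds are illustrative defaults, not the exact values used in the paper.

```python
import random

def sample_channel_config(max_channels_per_layer, num_groups=10,
                          min_ratio=0.25, max_ratio=1.0):
    """Independently sample, for each layer, a number of channel groups between
    the lower and upper width bounds; channels are allocated in whole groups."""
    config = []
    for c_max in max_channels_per_layer:
        group_size = max(c_max // num_groups, 1)
        lo = max(int(round(num_groups * min_ratio)), 1)
        hi = int(round(num_groups * max_ratio))
        config.append(group_size * random.randint(lo, hi))
    return config

# e.g., a toy 4-layer network with these maximum widths:
print(sample_channel_config([40, 80, 160, 320]))
```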

We implement a distributed training framework with synchronized stochastic gradient descent (SGD) on PyTorch [55]. We set different random seeds in different processes such that each GPU samples diverse channel configurations in each SGD training step. All other techniques introduced in [35] and the distributed training techniques introduced in [56] are used by default. All code will be released.

| Group | Model | Parameters | Memory | CPU Latency | FLOPs | Top-1 Err. (gain) |
|---|---|---|---|---|---|---|
| 200M FLOPs | ShuffleNet v1 [9] | 1.8M | 4.9M | 46ms | 138M | 32.6 |
| | ShuffleNet v2 [8] | - | - | - | 146M | 30.6 |
| | MobileNet v1 [7] | 1.3M | 3.8M | 33ms | 150M | 36.7 |
| | MobileNet v2 [6] | 2.6M | 8.5M | 71ms | 209M | 30.2 |
| | AMC-MobileNet v2 [37] | 2.3M | 7.3M | 68ms | 211M | 29.2 (1.0) |
| | MNasNet [25] | 3.1M | 7.9M | 65ms | 216M | 28.5 |
| | AutoSlim-MobileNet v1 | 1.9M | 4.2M | 33ms | 150M | 32.1 (4.6) |
| | AutoSlim-MobileNet v2 | 4.1M | 9.1M | 70ms | 207M | 27.0 (3.2) |
| | AutoSlim-MNasNet | 4.0M | 7.5M | 62ms | 217M | 26.8 (1.7) |
| 300M FLOPs | ShuffleNet v1 [9] | 3.4M | 8.0M | 60ms | 292M | 28.5 |
| | ShuffleNet v2 [8] | - | - | - | 299M | 27.4 |
| | MobileNet v1 [7] | 2.6M | 6.4M | 48ms | 325M | 31.6 |
| | MobileNet v2 [6] | 3.5M | 10.2M | 81ms | 300M | 28.2 |
| | NetAdapt-MobileNet v1 [38] | - | - | - | 285M | 29.9 (1.7) |
| | AMC-MobileNet v1 [37] | 1.8M | 5.6M | 46ms | 285M | 29.5 (2.1) |
| | MNasNet [25] | 4.3M | 9.8M | 76ms | 317M | 26.0 |
| | AutoSlim-MobileNet v1 | 4.0M | 6.8M | 43ms | 325M | 28.5 (3.1) |
| | AutoSlim-MobileNet v2 | 5.7M | 10.9M | 77ms | 305M | 25.8 (2.4) |
| | AutoSlim-MNasNet | 6.0M | 10.3M | 71ms | 315M | 25.4 (0.6) |
| 500M FLOPs | ShuffleNet v1 [9] | 5.4M | 11.6M | 92ms | 524M | 26.3 |
| | ShuffleNet v2 [8] | - | - | - | 591M | 25.1 |
| | MobileNet v1 [7] | 4.2M | 9.3M | 64ms | 569M | 29.1 |
| | MobileNet v2 [6] | 5.3M | 14.3M | 106ms | 509M | 25.6 |
| | MNasNet [25] | 6.8M | 14.2M | 95ms | 535M | 24.5 |
| | NASNet-A [47] | - | - | - | 564M | 26.0 |
| | PNASNet-5 [48, 8] | - | - | - | 588M | 25.8 |
| | Graph-HyperNetwork [57] | - | - | - | 569M | 27.0 |
| | AutoSlim-MobileNet v1 | 4.6M | 9.5M | 66ms | 572M | 27.0 (2.1) |
| | AutoSlim-MobileNet v2 | 6.5M | 14.8M | 103ms | 505M | 24.6 (1.0) |
| | AutoSlim-MNasNet | 8.3M | 14.2M | 95ms | 532M | 24.6 (-0.1) |
| Heavy Models | ResNet-50 [13] | 25.5M | 36.6M | 197ms | 4.1G | 23.9 |
| | ResNet-50 0.75x [13, 35] | 14.7M | 23.1M | 133ms | 2.3G | 25.1 |
| | ResNet-50 0.5x [13, 35] | 6.8M | 12.5M | 81ms | 1.1G | 27.9 |
| | ResNet-50 0.25x [13, 35] | 1.9M | 4.8M | 44ms | 278M | 35.0 |
| | He-ResNet-50 [30, 27] | - | - | - | 2.0G | 27.2 |
| | ThiNet-ResNet-50 [29, 27] | - | - | - | 2.9G | 27.0 |
| | ThiNet-ResNet-50 [29, 27] | - | - | - | 2.1G | 28.0 |
| | ThiNet-ResNet-50 [29, 27] | - | - | - | 1.2G | 30.6 |
| | AutoSlim-ResNet-50 | 23.1M | 32.3M | 165ms | 3.0G | 24.0 |
| | AutoSlim-ResNet-50 | 20.6M | 27.6M | 133ms | 2.0G | 24.4 |
| | AutoSlim-ResNet-50 | 13.3M | 18.2M | 91ms | 1.0G | 26.0 |
| | AutoSlim-ResNet-50 | 7.4M | 11.5M | 69ms | 570M | 27.8 |

Table 1: ImageNet classification results with various network architectures. Blue indicates network pruning methods [27, 29, 30, 37, 38], cyan indicates network architecture search methods [25, 47, 48, 57], and red indicates our results using AutoSlim.

3.3 Greedy Slimming

After training a slimmable model, we evaluate it on the validation set (on ImageNet [58] we randomly hold out part of the training set as the validation set). We start with the largest model and compare the network accuracy among the architectures in which each layer, in turn, is slimmed by one channel group. We then greedily slim the layer with minimal accuracy drop. During the iterative slimming, we obtain optimized channel configurations under different resource constraints, and we stop when reaching the strictest constraint (e.g., 50M FLOPs or 30ms CPU latency).
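A minimal sketch of this greedy loop is given below; `evaluate`, `flops_of` and the stopping constraint are hypothetical helpers and values used only for illustration.

```python
def greedy_slim(slimmable, init_config, group_sizes, min_flops=50e6):
    """Greedily remove one channel group at a time from the layer whose removal
    hurts validation accuracy the least, recording every visited configuration."""
    config = list(init_config)
    visited = [(flops_of(config), evaluate(slimmable, config))]
    while flops_of(config) > min_flops:
        candidates = []
        for i, c in enumerate(config):
            if c > group_sizes[i]:  # keep at least one group per layer
                trial = config[:i] + [c - group_sizes[i]] + config[i + 1:]
                candidates.append((evaluate(slimmable, trial), trial))
        if not candidates:
            break
        best_acc, config = max(candidates, key=lambda x: x[0])
        visited.append((flops_of(config), best_acc))
    return visited  # (FLOPs, accuracy) pairs traced out by the single pass
```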

Large Batch Size. During greedy slimming, no training is involved, so we directly put the model in evaluation mode (no gradients are required), which enables a much larger batch size (during slimming we use a large mini-batch on each of several V100 GPUs). A large batch size brings two benefits. First, previous work [36] shows that BN statistics are accurate as long as they are calibrated with a sufficiently large batch; thus the post-statistics of BN in our greedy slimming can be computed online without additional cost. Second, with a large batch size we can simply use the prediction accuracy of a single feed-forward pass as the performance estimator. In practice we find this speeds up greedy slimming and simplifies the implementation without affecting the final performance.
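The online BN calibration during evaluation can be sketched as follows in PyTorch; this is a generic recipe consistent with [36], not the exact code from the repository.

```python
import torch

@torch.no_grad()
def calibrate_bn(model, loader, num_batches=1):
    """Recompute BN statistics for the currently active width by running a few
    large batches in training mode (so BN tracks batch statistics) without
    updating any weights."""
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # use a cumulative moving average over the batches
    model.train()
    for i, (images, _) in enumerate(loader):
        if i >= num_batches:
            break
        model(images)
    model.eval()
```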

Training Optimized Networks. Similar to architecture search methods, after the search we train the optimized network architectures from scratch. By default we search for networks at approximately 200M, 300M and 500M FLOPs, and train the resulting architectures as a single slimmable model.

4 Experiments

4.1 Main Results

Table 1 summarizes our results on ImageNet [58] classification with various network architectures, including MobileNet v1 [7], MobileNet v2 [6], MNasNet [25], and one large model, ResNet-50 [13]. We compare our results with their default channel configurations and with recent channel pruning methods [29, 30, 37]. The top-1 errors of the baselines are taken from the corresponding works [6, 7, 13, 25, 29, 30, 37]. For a clear view, we divide the network architectures into four groups, namely 200M FLOPs, 300M FLOPs, 500M FLOPs and heavy models (basically ResNet-50 based models). We evaluate latency in the same hardware environment with a single-core CPU to ensure fairness. Device memory is reported as the sum of all feature maps and weights. We note that the memory footprint can be largely optimized by improving memory reuse and implementing dedicated operators; for example, the inverted residual block can be optimized by splitting channels into groups and performing partial execution multiple times [6]. For all network architectures we train 50 epochs with a squeezed learning rate schedule to obtain a slimmable model for greedy slimming. After the search, we train the optimized network architectures for full epochs (300 epochs with a linearly decaying learning rate for mobile networks, 100 epochs with a step learning rate schedule for ResNet-50 based models), with other training settings following previous works [6, 7, 8, 9, 13, 35, 36] (weight initialization, weight decay, data augmentation, training/testing image resolution, optimizer, hyper-parameters of batch normalization). We exclude the parameters and FLOPs of batch normalization layers [43] following common practice, since they can be fused into convolution layers.

As shown in Table 1, our models have better top-1 accuracy compared with the default channel configuration of MobileNet v1, MobileNet v2 and ResNet-50 across different computational budgets. We even have improvements over RL-searched MNasNet [25], where the filter numbers are already included in its search space. Notably, by setting optimized channel numbers, our AutoSlim-MobileNet-v2 at 305M FLOPs achieves 74.2% top-1 accuracy, 2.4% better than default MobileNet-v2 (301M FLOPs), and even 0.2% better than RL-searched MNasNet (317M FLOPs). Our AutoSlim-ResNet-50 at 570M FLOPs, without depthwise convolutions, achieves 1.3% better accuracy than MobileNet-v1 (569M FLOPs).

4.2 Visualization and Discussion

In this part, we visualize our optimized channel configurations and discuss some insights from the results.

Comparison with Default Channel Numbers. We first compare our results with default channels in MobileNet v2 [6]. We show the optimized number of channels (left) and the percentage compared with default channels (right) in Figure 5. Compared with default MobileNet v2, our optimized configuration has fewer channels in shallow layers and more channels in deep ones.

Figure 5: The optimized number of channels (left) and the percentage compared with the default channels (right) of MobileNet v2. The channels of depthwise convolutions are omitted in the figure, since their output channels always equal those of the preceding convolution.

Comparison with Width Multiplier Heuristic. Applying a width multiplier [7], a global hyper-parameter across all layers, is a commonly used heuristic to trade off between model accuracy and efficiency [6, 7, 8, 9]. We search for optimal channels at 207M, 305M and 505M FLOPs, corresponding to MobileNet v2 0.75x, 1.0x and 1.3x. Figure 6 shows that under different budgets, AutoSlim applies a different width scaling to each layer.
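For reference, the width multiplier heuristic scales every layer by the same global factor, in contrast to the per-layer widths found by AutoSlim. A tiny sketch is shown below; the default channels are made up, and the rounding is a simplified variant of the divisible rounding used in MobileNet reference implementations.

```python
def apply_width_multiplier(default_channels, multiplier, divisor=8):
    """Scale every layer's channel count by one global multiplier, rounding to a
    hardware-friendly divisor (simplified from the MobileNet reference code)."""
    return [max(divisor, int(round(c * multiplier / divisor)) * divisor)
            for c in default_channels]

print(apply_width_multiplier([32, 96, 144, 320, 1280], 0.75))
# [24, 72, 112, 240, 960]
```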

Figure 6: The channel configurations of AutoSlim-MobileNet-v2 at 207M, 305M and 505M FLOPs.

Comparison with Model Pruning Methods. Next, we compare our optimized channel configuration with the model pruning method AMC [37]. In Figure 7, we show the number of channels in all layers of the optimized MobileNet v2. We observe several characteristics of our optimized channel configurations. First, AutoSlim-MobileNet-v2 has many more channels in deep layers, especially for the deep depthwise convolutions; for example, it has far more channels in the second-to-last layer than AMC-MobileNet-v2. Second, AutoSlim-MobileNet-v2 has fewer channels in shallow layers; for example, its first convolution layer is narrower than that of AMC-MobileNet-v2. It is noteworthy that although shallow layers have a small number of channels, the spatial size of their feature maps is large, so overall these layers account for large computational overheads.

4.3 CIFAR10 Experiments

In addition to the ImageNet dataset, we also conduct experiments on the CIFAR10 [59] dataset. We use the same weight decay hyper-parameter, initial learning rate and learning rate schedule as in the ImageNet experiments. We note that these training settings may not be optimal for CIFAR10; nevertheless, we report this ablative study with the same hyper-parameters and settings. We first report the performance of MobileNet v2 [6] with its default channel configurations. We then search with the proposed AutoSlim to obtain optimized channel configurations at the same FLOPs (we hold out part of the training set as a validation set during the search). Finally we train the optimized architectures individually with the same settings as the baselines. Table 2 shows that AutoSlim models achieve higher accuracy than the baselines on CIFAR10.

Figure 7: The channel configurations of AutoSlim-MobileNet-v2 compared with AMC-MobileNet-v2 [37].
| Model | Parameters | FLOPs | Top-1 Err. |
|---|---|---|---|
| MobileNet v2 | 2.2M | 88M | 8.1 |
| MobileNet v2 | 1.3M | 59M | 8.6 |
| MobileNet v2 | 0.7M | 28M | 10.4 |
| AutoSlim-MobileNet v2 | 1.5M | 88M | 6.8 (1.3) |
| AutoSlim-MobileNet v2 | 0.7M | 59M | 7.0 (1.6) |
| AutoSlim-MobileNet v2 | 0.3M | 28M | 8.0 (2.4) |

Table 2: CIFAR10 classification results with default MobileNet v2 and AutoSlim-MobileNet-v2.

| Model | Search On | FLOPs | Top-1 Err. |
|---|---|---|---|
| MobileNet v2 | - | 59M | 8.6 |
| AutoSlim-MobileNet v2 | CIFAR10 | 59M | 7.0 (1.6) |
| AutoSlim-MobileNet v2 | ImageNet | 63M | 9.9 (-1.3) |

Table 3: CIFAR10 results with AutoSlim-MobileNet-v2 searched on CIFAR10 or ImageNet.

We further study the transferability of the network architectures learned on ImageNet to the CIFAR10 dataset, and compare them with the channel configuration searched on CIFAR10 directly. The results are shown in Table 3. They suggest that the channel configuration optimized on ImageNet does not generalize to CIFAR10. Compared with the architecture optimized for ImageNet, we observe that the architecture optimized for CIFAR10 has much fewer channels in deep layers, which we conjecture may lead to better generalization on the test set for small datasets like CIFAR10. It may also be due to the inconsistent image resolutions between ImageNet (224x224) and CIFAR10 (32x32).

5 Conclusion

We presented the first one-shot approach to network architecture search for channel numbers, with extensive experiments on large-scale ImageNet classification. Our proposed solution, AutoSlim, automates the design of efficient network architectures for resource-constrained devices.

References