DenseNAS
Densely Connected Search Space for More Flexible Neural Architecture Search (CVPR2020)
In recent years, neural architecture search (NAS) has dramatically advanced the development of neural network design. While most previous works are computationally intensive, differentiable NAS methods reduce the search cost by constructing a super network in a continuous space that covers all candidate architectures. However, few of them can search for the network width (the number of filters/channels), because it is intractable to integrate architectures with different widths into one super network following the conventional differentiable NAS paradigm. In this paper, we propose a novel differentiable NAS method which can search for the width and the spatial resolution of each block simultaneously. We achieve this by constructing a densely connected search space and name our method DenseNAS. Blocks with different width and spatial resolution combinations are densely connected to each other. The best path in the super network is selected by optimizing the transition probabilities between blocks. As a result, the overall depth distribution of the network is optimized globally in a graceful manner. In the experiments, DenseNAS obtains an architecture with 75.9% top-1 accuracy on ImageNet and a latency as low as 24.3 ms on a single TITAN-XP. The total search time is merely 23 hours on 4 GPUs.
Designing deep neural networks has been an important topic for deep learning. Better designed network architectures usually lead to significant performance improvement. In recent years, neural architecture search (NAS)
zoph2016neural ; zoph2017learning ; pham2018efficient ; Real2018Regularized has demonstrated success in designing neural architectures automatically. Many architectures produced by NAS methods have achieved higher accuracy than those manually designed in tasks such as image classification, semantic segmentation and object detection. NAS methods not only boost the model performance, but also liberate human experts from the tedious architecture tweaking work. The more elements in the architecture design process can be searched automatically, the less burden human experts bear. What elements can be searched for depends on how the search space is constructed. In most previous works, there are two main kinds of search space. One is to repeat the cell structure to construct the network zoph2017learning ; pham2018efficient ; Real2018Regularized ; liu2018darts and search for the topological connection between different nodes in each cell. The other one MnasNet ; cai2018proxylessnas ; fbnet ; eatnas stacks the mobile convolution blocks sandler2018mobilenetv2
with more succinct connections in the network. How to best search among operation types and connection patterns are widely explored in many previous works. Searching for network scale (width and depth) is less straightforward. While Reinforcement Learning (RL)
zoph2017learning ; MnasNet and Evolutionary Algorithm (EA)
Real2018Regularized based NAS methods can easily search for the depth and width due to their ability to handle a discrete space, they are extremely computationally expensive. Differentiable liu2018darts ; cai2018proxylessnas ; xie2018snas and one-shot brock2017smash ; Understanding methods produce high-performance architectures with much less search cost, but network scale search is intractable for these methods. Their search process relies on a super network which covers all the possible sub-architectures in the search space. Searching for the network scale requires integrating architectures with different scales into the super network. While the depth search (the number of total layers) can be handled by equipping each layer in the super network with the identity connection as a candidate operation cai2018proxylessnas ; fbnet , searching for the width is more difficult: once the number of output channels in a layer changes, the number of input channels in the following layer must change accordingly. Therefore, designing a super network that supports width search remains a challenging problem. Yet the scale of a network is so crucial that its optimization should not be left out of the NAS process and dealt with manually in an ad hoc manner. Inappropriate width or depth choices cause drastic accuracy degradation or unsatisfactory model size. In particular, even slight changes to the width of the architecture can give rise to an explosive increase of the model size. In this paper, we aim at solving the width search problem by developing a novel differentiable NAS method: DenseNAS. Our solution is to construct a new densely connected search space and design a super network to be a continuous representation of this search space. As shown in Fig. 1
, each block in the super network is connected to several adjacent blocks. From the beginning to the end of the network, the number of filters (i.e., the width) of each block increases gradually with a small stride. This fine-grained width distribution guarantees that the search space covers as many width values as possible.
In the search space, there are multiple blocks with several widths under the same spatial resolution setting. We relax the search space by assigning a probability parameter to each output path of each block. During the search process, the probability distribution of the output paths is optimized. The best width growing path in the super network is selected using this probability distribution to derive the final architecture. Because the spatial resolution in each block is associated with the width, the block widths and the layers that carry out spatial down-sampling are optimized and determined at the same time.
In summary, starting from the network width search problem in differentiable NAS, we propose a new densely connected search space. The novel search space design even enables flexible architecture search beyond network widths, e.g., the number of blocks and the layers that perform spatial down-sampling. As a result, the overall distribution of depths of the whole network is globally and automatically optimized. With DenseNAS, we obtain an architecture with 75.9% top-1 accuracy on ImageNet with low latency on the GPU device (24.3 ms on one TITAN-XP). The search cost is only 92 GPU hours, i.e., 23 hours on 4 GPUs.
Recently, the emergence of differentiable NAS methods has greatly reduced the search cost while achieving superior results. DARTS liu2018darts is the first work to utilize a gradient-based method to search neural architectures. They construct a super network and relax the architecture representation by assigning continuous weights to the candidate operations. They search on a small dataset, e.g., CIFAR-10 krizhevsky2009learning , and then transfer the architecture to a large dataset, e.g., ImageNet DBLP:conf/cvpr/DengDSLL009 . ProxylessNAS cai2018proxylessnas reduces the memory consumption by adopting a path-dropping strategy: only a set of paths in the super network is selected for updating during the search. They carry out the search directly on the large-scale dataset, i.e., ImageNet. FBNet fbnet searches on a subset of ImageNet and uses the Gumbel Softmax function JangGP17 ; MaddisonMT17 to better optimize the distribution of architecture probabilities. Although the differentiable NAS methods mentioned above achieve remarkable results, the width of the architecture is manually set. It is challenging to integrate architectures with multiple widths into the super network. Hence adjusting the width of the network still requires many trials by experienced engineers.
NASNet zoph2017learning is the first work that proposes the cell structure to construct the search space. They search for the operation types and the topological connections in the cell and repeat the cell to form the whole architecture. The depth of the architecture (i.e., the number of repetitions of the cell), the widths and the occurrences of down-sampling operations are all set by hand. Afterwards, many works liu2017progressive ; pham2018efficient ; Real2018Regularized ; liu2018darts adopt a similar cell-based search space. MnasNet MnasNet uses a block-wise search space. ProxylessNAS cai2018proxylessnas , FBNet fbnet and ChamNet ChamNet simplify the search space by searching mostly for the expansion ratios and kernel sizes of the mobile inverted bottleneck convolution (i.e., MBConv) sandler2018mobilenetv2 layers. Auto-DeepLab auto_deeplab creatively designs a two-level hierarchical search space for a segmentation network. The search space is also based on the cell structure and contains complicated operations on the spatial resolution. Our work is also fundamentally different from DenseNet huang2017densely . Even though the blocks in our super network are densely connected, only one path will be selected to derive the final architecture, which contains no densely connected blocks, as shown in Fig. 4.
In this work, we use the differentiable neural architecture search method liu2018darts ; xie2018snas ; cai2018proxylessnas ; fbnet to solve the architecture design problem. In this section, we first introduce how we design the search space, motivated by the width search problem. We then demonstrate how the search space is relaxed into a continuous representation. Finally, we explain our search algorithm.
Considering that the cell-based search space zoph2017learning ; liu2017progressive ; liu2018darts usually leads to complicated architectures which are not latency-friendly, we design our search space based on the mobile inverted bottleneck convolution (i.e., MBConv) proposed in MobileNetV2 sandler2018mobilenetv2 . As shown in Fig. 3, we define the search space on three different levels (the layer, the block and the network). At the layer level, each layer consists of all the candidate operations. At the block level, one block can be separated into two components: the head layers and the stacking layers. At the network level, the whole network is constructed using blocks with incremental widths. Next we describe the layer, the block and the network structures in detail.
We define the layer to be the elementary structure in our search space. One layer represents a set of candidate operations. The candidate operations are defined as a set of MBConv layers (as shown in Fig. 2) with kernel sizes of {3, 5, 7} and expansion ratios of {3, 6}. We include the skip connection as a candidate operation for the depth search. If the skip connection is chosen, it means the corresponding layer is removed from the resulting architecture, effectively reducing its depth. The set of operations in our search space is shown in Tab. 2.
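The candidate set above (three kernel sizes times two expansion ratios, plus a skip connection) can be enumerated in a few lines. This is an illustrative sketch with names of our own choosing, not the paper's actual code:

```python
# Candidate operation set for one searchable layer (names are our own).
KERNEL_SIZES = [3, 5, 7]
EXPANSION_RATIOS = [3, 6]

def candidate_ops():
    """MBConv variants plus a skip connection used for depth search."""
    ops = [f"mbconv_k{k}_e{e}" for k in KERNEL_SIZES for e in EXPANSION_RATIOS]
    ops.append("skip")  # choosing 'skip' removes the layer from the final net
    return ops
```

The 3 × 2 MBConv variants plus the skip connection give seven candidates per layer.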
Each block is composed of several layers. We divide the block into two parts: the head layers and the stacking layers. We set a fixed width and a corresponding spatial resolution for each block. The head layers take input tensors with various numbers of channels and spatial resolutions, and transform them all to one tensor with the block's number of channels and spatial resolution. The head layers exclude the skip connection from their candidate set, as they are required in every block. Following the head layers are a number of stacking layers (in our case, three). The operations in the stacking layers are carried out with the same number of channels and spatial resolution.
Previous works MnasNet ; cai2018proxylessnas ; fbnet that use MBConv-based blocks to form the architecture set a fixed number of blocks, and the resulting architecture contains all the blocks. We instead design more blocks with various widths in our search space, and allow the searched architecture to contain only a subset of the blocks, giving the search algorithm the freedom to choose blocks of certain widths while discarding others.
We define the whole super network to consist of N blocks B_1, B_2, …, B_N. As Fig. 3 shows, we partition the entire network into several stages. Each stage holds a different range of widths and a fixed spatial resolution. From the beginning to the end of the super network, the width of the blocks grows gradually with a small stride. In the early stages of the network we set a smaller growing stride for the width, because a large width early in the network would cause huge computational cost. The growing stride becomes larger as we move to later stages. As shown in Fig. 3, the width growing stride is 8 in stage 3, 16 in stage 4 and 64 in stage 5. This design of the super network allows searching over different widths in each block, differentiating our approach from all existing ones.
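As a sketch of this width schedule, the widths inside one stage can be generated from a starting width and a per-stage stride. The starting widths and block counts below are hypothetical; only the strides (8 and 16) come from the text:

```python
def stage_widths(start, stride, num_blocks):
    """Block widths inside one stage grow by a fixed stride."""
    return [start + i * stride for i in range(num_blocks)]

# Hypothetical example: a stage-3-like range with stride 8,
# followed by a stage-4-like range with stride 16.
stage3 = stage_widths(32, 8, 4)    # [32, 40, 48, 56]
stage4 = stage_widths(64, 16, 4)   # [64, 80, 96, 112]
```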
Each block B_i in the super network connects to several subsequent blocks. We define the connection between block B_i and a subsequent block B_j (j > i) as C_{i,j}, and denote the spatial resolutions of B_i and B_j as R_i and R_j respectively (normally R_i ≥ R_j). We constrain connections to exist only between blocks whose spatial resolutions differ by no more than a factor of two; therefore, C_{i,j} exists when j > i and R_i / R_j ≤ 2. The search space is constructed based on these densely connected blocks, but only one path will be selected to derive the final architecture. In our method, not only the number of layers in each block but also the block widths and the number of blocks are searched for, and the layers that carry out spatial down-sampling are determined in the meantime. Our goal is to find a good path in the search space which represents the best depth and width configuration of the architecture.
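The connectivity rule can be stated as a small predicate; a minimal sketch with resolutions given as spatial sizes (the function name is our own):

```python
def connected(i, j, resolution):
    """C(i, j) exists only for a later block j whose spatial resolution is
    no more than a factor of two smaller than block i's."""
    return (j > i
            and resolution[i] >= resolution[j]
            and resolution[i] / resolution[j] <= 2)

# Example: five blocks with spatial resolutions 28, 28, 14, 14, 7.
res = [28, 28, 14, 14, 7]
```

Under these resolutions, block 1 connects to block 2 (ratio 2), but block 0 does not connect to block 4 (ratio 4).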
The operations of the first two layers in the network are fixed; all the remaining layers are searched for by our method. Inspired by MobileNetV2 sandler2018mobilenetv2 , the first layer is set as a plain convolution and the second layer as an MBConv layer, both with fixed output shapes. The number of blocks and the distribution of widths in the search space can all be optimized through the loss function. Our design of the search space (i.e., the super network) is illustrated in Fig. 3.

As we construct the search space, we relax the architectures into continuous representations. The relaxation is implemented at the layer and block levels. After relaxation, we can search for architectures via back-propagation.
Let O be the set of candidate operations described in Sec. 3.1.1. We assign an architecture parameter α_o^ℓ to each candidate operation o ∈ O in layer ℓ. We relax the layer by making its output a weighted sum of the candidate operations. Each architecture weight w_o^ℓ is computed as a softmax of the architecture parameter over all the operations in the layer:
w_o^ℓ = exp(α_o^ℓ) / Σ_{o′ ∈ O} exp(α_{o′}^ℓ)    (1)
The output of layer ℓ, given its input x_ℓ, can then be expressed as
x_{ℓ+1} = Σ_{o ∈ O} w_o^ℓ · o(x_ℓ)    (2)
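Eqs. (1)-(2) amount to a softmax over the layer's architecture parameters followed by a weighted sum. A minimal numeric sketch, with scalars standing in for tensors and operation outputs:

```python
import math

def softmax(params):
    """Eq. (1): architecture weights from architecture parameters."""
    exps = [math.exp(p) for p in params]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_layer_output(alphas, op_outputs):
    """Eq. (2): the layer output is the weighted sum of the candidate
    operations' outputs under the softmax weights."""
    return sum(w * y for w, y in zip(softmax(alphas), op_outputs))

# With equal parameters, every operation gets the same weight,
# so the mixed output is the plain average of the candidates.
out = mixed_layer_output([0.0, 0.0], [2.0, 4.0])
```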
We set b_i to be the output tensor of the i-th block B_i. As described in Sec. 3.1.3, each block connects to several subsequent blocks. To relax the block connections into a continuous representation, we assign each output path of a block a block-level architecture parameter: for the path from block B_i to B_j, we introduce a parameter β_{i,j}. Similar to how we compute the weight of each operation above, we compute the probability p_{i,j} of each path using a softmax over all output paths of block B_i:
p_{i,j} = exp(β_{i,j}) / Σ_{k > i} exp(β_{i,k})    (3)
For block B_i, we assume it takes input tensors from its predecessor blocks B_m (m < i). As shown in Fig. 3, the input tensors from these blocks differ in number of channels and spatial resolution. Therefore, each input tensor is transformed by a head layer in B_i to a uniform size. Let H_m^i denote the transformation applied by the corresponding head layer of B_i to the input tensor b_m from B_m. The input of block B_i is the path-probability-weighted sum of the transformed tensors:
x_i = Σ_{m < i} p_{m,i} · H_m^i(b_m)    (4)
It is worth noting that the path probabilities are normalized on the block output dimension but applied on the block input dimension (more specifically, on the head layers). The head layer is essentially a weighted-sum mixture of the candidate operations: the layer-level parameters α control which operation is selected, while the outer block-level parameters β determine which blocks are connected.
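Eqs. (3)-(4) can be sketched in the same spirit: a softmax over a block's output paths, and a probability-weighted sum of head-layer-transformed predecessor outputs (again with scalars standing in for tensors; names are our own):

```python
import math

def path_probs(betas):
    """Eq. (3): softmax over all output paths of one block."""
    exps = [math.exp(b) for b in betas]
    s = sum(exps)
    return [e / s for e in exps]

def block_input(probs, head_outputs):
    """Eq. (4): predecessors' outputs, each already transformed by the
    matching head layer, scaled by the incoming path probability and summed."""
    return sum(p * h for p, h in zip(probs, head_outputs))

# Two equally likely incoming paths carrying values 2.0 and 4.0.
x = block_input(path_probs([0.0, 0.0]), [2.0, 4.0])
```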
Benefiting from the continuously relaxed representation of the search space, we can search for the architecture by updating the architecture parameters (introduced in Sec. 3.2) using gradient descent. We find that in the beginning of the search all the operation weights are under-trained, so the operations or architectures which converge faster are more likely to be strengthened, which leads to shallow architectures. Moreover, the distribution of architecture parameters in this preliminary stage has a great impact on the later training. To tackle this, we split our search procedure into two stages. In the first stage, we train only the weights of the super network for enough epochs until the operations are sufficiently trained and the model reaches a reasonable accuracy. In the second stage, we activate the architecture optimization: we optimize the operation weights w by descending ∇_w L_train on the training set, and optimize the architecture parameters (α, β) by descending ∇_{α,β} L_val on the validation set. We alternate the optimization of weights and architecture parameters by epoch. When the search procedure is finished, we derive the final architecture based on the architecture parameters. At the layer level, we select the candidate operation with the maximum architecture weight, i.e., argmax_{o ∈ O} w_o^ℓ. At the network level, we use the Viterbi algorithm forney1973viterbi to select the path connecting the blocks with the highest total transition probability, based on the output path probabilities of each block.
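Because the derived architecture is a single chain of blocks, picking the highest-probability path reduces to dynamic programming over log-probabilities, as in the Viterbi algorithm. A minimal sketch under our own simplifications (block 0 is the entry, the last block is the exit, and `trans[(i, j)]` is the optimized probability of path i → j):

```python
import math

def best_path(num_blocks, trans):
    """Max-probability chain of blocks: the product of transition
    probabilities is maximized as a sum of logs via dynamic programming."""
    score, back = {0: 0.0}, {}
    for j in range(1, num_blocks):
        # All reachable predecessors i with an edge to j.
        cands = [(score[i] + math.log(p), i)
                 for (i, k), p in trans.items() if k == j and i in score]
        if not cands:
            continue
        score[j], back[j] = max(cands)
    # Backtrack from the final block to the entry.
    path, j = [num_blocks - 1], num_blocks - 1
    while j != 0:
        j = back[j]
        path.append(j)
    return path[::-1]
```

For example, with transition probabilities {(0,1): 0.9, (0,2): 0.1, (1,2): 0.9}, the chain 0 → 1 → 2 (probability 0.81) beats the direct jump 0 → 2 (probability 0.1).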
Similar to cai2018proxylessnas ; fbnet , we integrate multi-objective optimization into the search process. Taking latency as an example, we create a lookup table which records the latency of each operation. The latency of each module is measured separately on the target device. For a stacking layer ℓ, the expected latency is computed as:
lat^ℓ = Σ_{o ∈ O} w_o^ℓ · lat_o^ℓ    (5)
where lat_o^ℓ refers to the pre-measured latency of operation o in layer ℓ. For a head layer of block B_j that takes its input tensor from block B_i's output, the latency is estimated as:
lat_{i,j} = p_{i,j} · Σ_{o ∈ O} w_o^{i,j} · lat_o^{i,j}    (6)
where p_{i,j} is the probability (weight) of the connection from block B_i to block B_j. The latency of the whole architecture can thus be obtained by summing over all stacking and head layers:
lat(N) = Σ_ℓ lat^ℓ    (7)
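Eqs. (5)-(7) make the latency estimate a differentiable function of the architecture weights; a minimal numeric sketch (function names are our own):

```python
def stack_layer_latency(weights, op_latency):
    """Eq. (5): expected latency of a stacking layer under the current
    operation-weight distribution (latencies come from the lookup table)."""
    return sum(w * t for w, t in zip(weights, op_latency))

def head_layer_latency(path_prob, weights, op_latency):
    """Eq. (6): a head layer's expected latency is additionally scaled by
    the probability of its incoming path."""
    return path_prob * stack_layer_latency(weights, op_latency)

def network_latency(layer_latencies):
    """Eq. (7): the whole-network estimate sums over all layers."""
    return sum(layer_latencies)
```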
We design a loss function with a latency-based regularization term to achieve the multi-objective optimization:
L = L_CE + λ · log_τ lat(N)    (8)
where λ and τ are the hyper-parameters that control the magnitude of the latency term.
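Under the reconstruction of Eq. (8) used here, a log-latency penalty with strength λ and log base τ (the exact form is our assumption), the loss is one line:

```python
import math

def search_loss(ce_loss, latency, lam, tau):
    """Assumed form of Eq. (8): cross-entropy plus lam * log_tau(latency).
    A larger lam (or smaller tau) penalizes slow architectures more."""
    return ce_loss + lam * math.log(latency) / math.log(tau)
```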
The super network includes all the possible paths and operations in the search space. To decrease the memory consumption and accelerate the search process, we adopt the drop-path training strategy. The one-shot search method Understanding drops out some paths when training the super network; this technique makes the performance prediction of the stand-alone model more accurate. In this work, when training the operation weights, we sample one path of the candidate operations in each layer according to the architecture weight distribution. The drop-path training not only accelerates the search but also weakens the coupling effect between operation weights for different architectures in the search space. Following ProxylessNAS cai2018proxylessnas , we sample two operations in each layer according to the architecture weight distribution to update the architecture parameters. To keep the architecture weights of the operations that are not sampled unchanged, we compute a re-balancing bias to adjust the newly updated parameters:
δ^ℓ = log( Σ_{o ∈ S} exp(α_o^ℓ) ) − log( Σ_{o ∈ S} exp(α′_o^ℓ) )    (9)
where S refers to the set of sampled operations, α_o^ℓ refers to an originally sampled architecture parameter in layer ℓ and α′_o^ℓ refers to the corresponding updated parameter. Adding the bias δ^ℓ to the updated parameters keeps the softmax weights of the unsampled operations unchanged.
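The re-balancing bias of Eq. (9) can be checked numerically: adding it to the updated sampled parameters restores the softmax normalizer, so the weights of the unsampled operations are untouched. A minimal sketch (names are our own):

```python
import math

def rebalance_bias(old_sampled, new_sampled):
    """Eq. (9): bias added to the updated sampled parameters so that the
    softmax weights of the unsampled operations stay unchanged."""
    def log_sum_exp(xs):
        return math.log(sum(math.exp(x) for x in xs))
    return log_sum_exp(old_sampled) - log_sum_exp(new_sampled)

# After adding the bias, the sampled group contributes the same mass
# to the softmax denominator as before the update.
old, new = [0.0, 1.0], [0.5, 1.5]
delta = rebalance_bias(old, new)
```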
To demonstrate the effectiveness of our proposed method, we apply it to the ImageNet classification problem DBLP:conf/cvpr/DengDSLL009 to search for an architecture of high accuracy and low latency.
Before the search process, we build a lookup table for the module latencies of the super network as described in Sec. 3.3.2. We fix the network input shape and set the batch size to 32 for latency measurement. Each module of the network is measured on one TITAN-XP 2000 times and the average latency is recorded. All models and experiments are implemented using PyTorch
paszke2017automatic . For the search process, we randomly choose 100 classes from the original 1000 classes of the ImageNet training set. We sample 20% of the data in each class of this ImageNet subset to form the validation dataset; the remaining data is used for training. The original validation dataset of ImageNet is only used for testing our final searched architecture. The search process runs for 250 epochs in total. In the first search stage, we only train the operation weights in the super network for 150 epochs on the divided training dataset; only one path of the mixed operation is sampled in each step to update the operation weights. For the last 100 epochs, the updating of the architecture parameters and the operation weights alternates each epoch. For the training data preprocessing, we use the standard GoogleNet DBLP:conf/cvpr/SzegedyLJSRAEVR15 data augmentation. Training is distributed over 4 GPUs. We use the SGD optimizer with 0.9 momentum and weight decay to update the operation weights, decaying the learning rate with the cosine annealing schedule DBLP:conf/iclr/LoshchilovH17 . We use the Adam optimizer DBLP:conf/iclr/2015 with weight decay and a fixed learning rate to update the architecture parameters.
For retraining the final derived architecture, we use the same data augmentation strategy as in the search process, on the whole ImageNet dataset. We train the model for 240 epochs with a batch size of 1024 on 8 GPUs. The optimizer is SGD with 0.9 momentum and weight decay. Label smoothing with weight 0.1 is used in both the search and the retraining process. The learning rate decays from 0.5 following the cosine annealing schedule.
Our ImageNet results are shown in Tab. 1. We set the GPU latency as our secondary optimization objective. Our models achieve superior accuracy with low latency; they significantly outperform manually designed models in terms of accuracy. DenseNAS-A achieves 75.9% top-1 accuracy, better than 1.4-MobileNetV2 (+1.2%) with lower latency (-3.7 ms, a relative 15.2%). Comparing DenseNAS-B with NASNet-A zoph2017learning , AmoebaNet-A Real2018Regularized and DARTS liu2018darts , we achieve higher accuracy with far fewer FLOPs and lower latency. Compared with other NAS methods, DenseNAS achieves superior accuracy with similar latency, yet the whole search process takes only about 23 hours on 4 GPUs (92 GPU hours in total), which is 522× faster than NASNet, 826× faster than AmoebaNet, 989× faster than MnasNet, and around 2.3× faster than FBNet fbnet and ProxylessNAS cai2018proxylessnas . For FBNet and ProxylessNAS, the widths of the blocks in the search space are set and adjusted by hand; the widths of our networks are all searched automatically.
Model | Latency | Top-1(%) | Top-5(%) |
---|---|---|---|
Fixed-A | 25.3ms | 74.7 | 92.0 |
DenseNAS-A | 24.3ms | 75.9 | 92.6 |
Fixed-B | 22.6ms | 73.4 | 91.2 |
DenseNAS-B | 21.1ms | 74.7 | 92.0 |
To demonstrate the effectiveness and efficiency of our proposed method, we carry out the same search process under the fixed-block search space. The design of the fixed-block search space entirely follows the popular human-designed network MobileNetV2 sandler2018mobilenetv2 . The number of bottleneck blocks in the search space is set to 7 and the widths of the blocks are set as [16, 24, 32, 64, 96, 160, 320], which are the same as MobileNetV2. The block connections are abandoned and all the blocks in the search space are used for deriving the final architecture. The other search settings are the same as our proposed method for fair comparison. The results are shown in Tab. 2.
Our proposed method can search for architectures according to various latency requirements. We conduct search experiments under different latency settings. For the loss function defined in Eq. (8), we fix one of the two hyper-parameters to 15 and vary the other to target different latencies. For the first 150 epochs, only the operation weights are updated. The super network with pre-trained operation weights can be shared between search processes under different latency settings, saving search time.
The architectures obtained by DenseNAS are shown in Fig. 4. Faster models tend to be shallower. The block widths and the number of blocks are all searched automatically. Fig. 5 further shows that DenseNAS models outperform the models searched under the fixed-block search space and MobileNetV2 models across different latency settings.
The proposed DenseNAS is a differentiable NAS method that searches for network widths. DenseNAS can also optimize the spatial down-sampling positions and the distribution of depths at the network scale rather than the block scale. DenseNAS makes neural architecture design more automatic. The results of large-scale experiments reflect its great efficiency and effectiveness. In future work, we would like to study our DenseNAS method on dense prediction tasks, such as designing feature pyramid DBLP:conf/cvpr/LazebnikSP06 ; DBLP:journals/pami/HeZR015 ; DBLP:conf/cvpr/LinDGHHB17 and semantic segmentation DBLP:journals/remotesensing/PapadomanolakiV19 ; DBLP:journals/pami/ChenPKMY18 ; DBLP:journals/corr/ChenPSA17 networks, because DenseNAS has the advantage of searching for the spatial down-sampling positions, and the performance of dense prediction tasks is more sensitive to the spatial resolution of feature maps.
We thank Liangchen Song for the discussion and assistance.
Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 549–558, 2018.
2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), 17-22 June 2006, New York, NY, USA, pages 2169–2178, 2006.
I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.