Deep convolutional neural networks have achieved great success in computer vision tasks such as image classification [1, 2, 3], semantic segmentation [4, 5, 6], object detection [7, 8, 9] and pose estimation [10, 11]. Image classification has always served as a fundamental task for neural architecture design. It is common to use networks designed and pre-trained on the classification task as the backbone and fine-tune them for segmentation or detection tasks. However, the backbone plays an important role in the performance on these tasks, and the differences between these tasks call for different backbone design principles. For example, segmentation tasks require high-resolution features, and object detection tasks need to make both localization and classification predictions from each convolutional feature. Such distinctions make neural architectures designed for classification tasks fall short. Some attempts [12, 13] have been made to tackle this problem by manually modifying architectures designed for classification to better accommodate the characteristics of new tasks.
Handcrafted neural architecture design is inefficient, requires a lot of human expertise, and may not find the best-performing networks. Recently, neural architecture search (NAS) methods [14, 15, 16] have risen in popularity. Automatic deep learning methods aim to relieve engineers of tedious trial and error in architecture design and to push performance beyond that of manually designed architectures. Early NAS works [14, 17, 18] explore the search problem on classification tasks. As NAS methods develop, some works [19, 20, 21] propose to use NAS to specialize the backbone architecture design for semantic segmentation or object detection tasks. Nevertheless, backbone pre-training remains an inevitable but costly procedure. Though some recent works demonstrate that pre-training is not always necessary for accuracy, training from scratch on the target task still takes more iterations than fine-tuning from a pre-trained model. For NAS methods, the pre-training cost of evaluating the networks in the search space is non-negligible. One-shot search methods [23, 24, 21] integrate all possible architectures in one super network, but pre-training the super network and the searched network still bears a huge computation cost.
As ImageNet pre-training has been a standard practice for many computer vision tasks, there are many models trained on ImageNet available in the community. To take full advantage of these pre-trained models, we propose a Fast Network Adaptation (FNA++) method based on a novel parameter remapping paradigm. Our method can adapt both the architecture and parameters of one network to a new task with negligible cost. Fig. 1 shows the whole framework. The adaptation is performed on both the architecture level and the parameter level. We adopt NAS methods [14, 26, 27] to implement the architecture-level adaptation. We select a manually designed network, pre-trained on ImageNet, as the seed network. Then, we expand the seed network to a super network, which is the representation of the search space in FNA++. We initialize new parameters in the super network by mapping those from the seed network using the proposed parameter remapping mechanism. Compared with previous NAS methods [28, 19, 21] for segmentation or detection tasks that search from scratch, our architecture adaptation is much more efficient thanks to the parameter-remapped super network. With architecture adaptation finished, we obtain a target architecture for the new task. Similarly, we remap the parameters of the super network, which are trained during architecture adaptation, to the target architecture. Then we fine-tune the parameters of the target architecture on the target task with no need for backbone pre-training on a large-scale dataset.
We demonstrate FNA++’s effectiveness and efficiency via experiments on semantic segmentation, object detection and human pose estimation tasks. We adapt the manually designed network MobileNetV2 to the semantic segmentation framework DeepLabv3, the object detection frameworks RetinaNet and SSDLite [8, 29], and the human pose estimation framework SimpleBaseline. Networks adapted by FNA++ surpass both manually designed and NAS networks in terms of both performance and model MAdds. Compared to NAS methods, the total computation cost of FNA++ is 1737× less than DPC, 6.8× less than Auto-DeepLab and 8.0× less than DetNAS. To demonstrate the generalizability of our method, we implement FNA++ on diverse networks, including ResNets and NAS networks, i.e., FBNet and ProxylessNAS, which are searched on the ImageNet classification task. Experimental results show that FNA++ can further promote the performance of ResNets and NAS networks on the new task (object detection in our experiment).
Our main contributions can be summarized as follows:
We propose a novel FNA++ method that automatically fine-tunes both the architecture and the parameters of an ImageNet pre-trained network on target tasks. FNA++ is based on a novel parameter remapping mechanism which is performed for both architecture adaptation and parameter adaptation.
FNA++ promotes the performance on semantic segmentation, object detection and human pose estimation tasks with much lower computation cost than previous NAS methods, e.g., 1737× less than DPC, 6.8× less than Auto-DeepLab and 8.0× less than DetNAS.
Our preliminary version of this manuscript was previously published as a conference paper. We make some improvements to the preliminary version as follows. First, we generalize the paradigm of parameter remapping so that it is applicable to more architectures, e.g., ResNet and NAS networks with various depths, widths and kernel sizes. Second, we improve the remapping mechanism for parameter adaptation and achieve better results than our former version over different frameworks and tasks with no increase in computation cost. Third, we implement FNA++ on one more task (SimpleBaseline for human pose estimation) and achieve great performance.
The remaining part of the paper is organized as follows. In Sec. 2, we describe the related works from three aspects, neural architecture search, backbone design and parameter remapping. Then we introduce our method in Sec. 3, including the proposed parameter remapping mechanism and the detailed adaptation process. In Sec. 4, we evaluate our method on different tasks and frameworks. The method is also implemented on various networks. A series of experiments are performed to study the proposed method comprehensively. We finally conclude in Sec. 5.
2 Related Work
2.1 Neural Architecture Search
Early NAS works automate network architecture design by applying reinforcement learning (RL) [32, 14, 17] or evolutionary algorithms (EA) [33, 26] to the search process. The RL/EA-based methods obtain architectures with better performance than handcrafted ones but usually bear tremendous search cost. Afterwards, ENAS proposes to use parameter sharing to decrease the search cost, but the sharing strategy may introduce inaccuracy in evaluating the architectures. NAS methods based on the one-shot model [23, 24, 34] lighten the search procedure by introducing a super network as a representation of all possible architectures in the search space. Recently, differentiable NAS [27, 18, 30, 35, 36] has attracted great attention in this field, achieving remarkable results with far lower search cost than previous methods. Differentiable NAS assigns architecture parameters to the super network and updates the architecture parameters by gradient descent. The final architecture is derived based on the distribution of architecture parameters. We use the differentiable NAS method to implement network architecture adaptation, which adjusts the backbone architecture automatically to new tasks, with the remapped seed parameters accelerating the search. In experiments, we also perform random search and still achieve great performance, which demonstrates that FNA++ is agnostic to the NAS method and can be equipped with diverse NAS methods.
2.2 Backbone Design
As deep neural network design [37, 38, 2] develops, the backbones of semantic segmentation or object detection networks evolve accordingly. Most previous methods [7, 9, 8, 6] directly reuse the networks designed on classification tasks as the backbones. However, the reused architecture may not meet the demands of the new task characteristics. Some works improve the backbone architectures by modifying existing networks. PeleeNet proposes a variant of DenseNet for real-time object detection on mobile devices. DetNet applies dilated convolutions in the backbone to enlarge the receptive field, which helps to detect objects more precisely. BiSeNet and HRNet design multiple paths to learn both high- and low-resolution representations for better dense prediction. Recently, some works propose to use NAS methods to redesign the backbone networks automatically. Auto-DeepLab searches for architectures with cell structures of diverse spatial resolutions under a hierarchical search space. The searched resolution change patterns benefit dense image prediction problems. CAS proposes to search for the semantic segmentation architecture under a lightweight framework while taking inference speed optimization into account. DetNAS searches for the backbone of the object detection network under a ShuffleNet [43, 44]-based search space, using the one-shot NAS method to decrease the search cost. However, pre-training the super network and the final searched network on ImageNet bears a huge cost. Benefiting from the proposed parameter remapping mechanism, our FNA++ adapts the architecture to new tasks with a negligible cost.
2.3 Parameter Remapping
Net2Net  proposes the function-preserving transformations to remap the parameters of one network to a new deeper or wider network. This remapping mechanism accelerates the training of the new larger network and achieves great performance. Following this manner, EAS  uses the function-preserving transformations to grow the network depth or layer width for architecture search. The computation cost can be saved by reusing the weights of previously validated networks. Moreover, some NAS works [15, 47, 48] apply parameter sharing on child models to accelerate the search process while the sharing strategy is intrinsically parameter remapping. Our parameter remapping paradigm extends the mapping dimension to the depth-, width- and kernel- level. Compared to Net2Net which focuses on mapping parameters to a deeper and wider network, the remapping mechanism in FNA++ has more flexibility and can be performed on architectures with various depths, widths and kernel sizes. The remapping mechanism helps both the architecture and parameter adaptation achieve great performance with low computation cost.
3 Method
In this section, we first introduce the proposed parameter remapping paradigm, which is performed on three levels, i.e., network depth, layer width and convolution kernel size. Then we explain the whole procedure of the network adaptation including three main steps, network expansion, architecture adaptation and parameter adaptation. The parameter remapping paradigm is applied before architecture and parameter adaptation.
3.1 Parameter Remapping
We define parameter remapping as a paradigm that maps the parameters of one seed network to another network. We denote the seed network as $N_s$ and the new network as $N_n$, whose parameters are denoted as $W_s$ and $W_n$ respectively. The remapping paradigm is illustrated in the following three aspects. The remapping on the depth level is carried out first, and then the remapping on the width and kernel levels is conducted simultaneously. Moreover, we study different remapping strategies in the experiments (Sec. 4.9).
3.1.1 Remapping on Depth-level
We introduce diverse depth settings in our architecture adaptation process. Specifically, we adjust the number of MobileNetV2 or ResNet blocks in every stage of the network. We assume that one stage in the seed network $N_s$ has $l$ layers. The parameters of each layer can be denoted as $\{W_s^{(1)}, W_s^{(2)}, \dots, W_s^{(l)}\}$. Similarly, we assume that the corresponding stage with $m$ layers in the new network $N_n$ has parameters $\{W_n^{(1)}, W_n^{(2)}, \dots, W_n^{(m)}\}$. The remapping process on the depth level is shown in Fig. 2(a). The parameters of layers in $N_n$ which also exist in $N_s$ are copied from $N_s$. The parameters of new layers are all copied from the last layer in the stage of $N_s$. Parameter remapping in layer $i$ is formulated as
$$f(i) = \min(i, l), \qquad W_n^{(i)} = W_s^{(f(i))}, \quad \forall\, 1 \le i \le m.$$
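The depth-level rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `remap_depth` and the use of strings as stand-ins for per-layer parameters are assumptions made for the example.

```python
def remap_depth(seed_stage, new_depth):
    """Return per-layer parameters for a stage with `new_depth` layers.

    seed_stage: list of per-layer parameter objects of one seed stage.
    Layer i of the new stage (1-indexed) reuses seed layer min(i, l),
    where l is the number of layers in the seed stage.
    Note: entries are shared references; copy them before training.
    """
    if not seed_stage:
        raise ValueError("seed stage must contain at least one layer")
    l = len(seed_stage)
    return [seed_stage[min(i, l) - 1] for i in range(1, new_depth + 1)]


# Hypothetical stage with three layers.
seed = ["W1", "W2", "W3"]
deeper = remap_depth(seed, 5)      # extra layers reuse the last seed layer
shallower = remap_depth(seed, 2)   # only the first two seed layers are kept
```

A deeper stage repeats the last seed layer, while a shallower stage simply truncates the seed stage.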
3.1.2 Remapping on Width-level
As shown in Fig. 3, in the MBConv block of the MobileNetV2 network, the first point-wise convolution expands the low-dimensional features to a high dimension. This practice can be used for expanding the width and capacity of one neural network. We allow diverse expansion ratios for architecture adaptation. We denote the parameters of one convolution in $N_s$ as $W^s \in \mathbb{R}^{p \times q \times h \times w}$ and that in $N_n$ as $W^n \in \mathbb{R}^{r \times s \times h \times w}$, where $p$, $q$ (respectively $r$, $s$) denote the output and input dimensions of the parameter and $h$, $w$ denote the spatial dimensions. The width-level remapping is illustrated in Fig. 2(b). If the channel number of $W^n$ is smaller, the first $r$ or $s$ channels of $W^s$ are directly remapped to $W^n$. If the channel number of $W^n$ is larger than that of $W^s$, the parameters of $W^s$ are remapped to the first $p$ or $q$ channels in $W^n$. The parameters of the other channels in $W^n$ are initialized with $0$. The above remapping process can be formulated as follows.
$$W^n_{i,j,:,:} = \begin{cases} W^s_{i,j,:,:} & \text{if } i \le p \text{ and } j \le q, \\ 0 & \text{otherwise,} \end{cases} \qquad 1 \le i \le r,\ 1 \le j \le s.$$
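As a rough sketch of the width-level rule, the snippet below remaps a weight matrix to a different number of output and input channels. To stay dependency-free, the spatial dimensions are dropped (as for a point-wise convolution) and nested lists stand in for tensors; the name `remap_width` is hypothetical.

```python
def remap_width(seed_w, out_ch, in_ch):
    """Remap a seed weight matrix (p x q) to a new one (out_ch x in_ch).

    Channels present in both matrices are copied from the seed;
    any additional output or input channels are initialized with zeros.
    """
    p, q = len(seed_w), len(seed_w[0])
    return [
        [seed_w[i][j] if i < p and j < q else 0.0 for j in range(in_ch)]
        for i in range(out_ch)
    ]


# Hypothetical 2 x 2 seed weights.
seed_w = [[1.0, 2.0], [3.0, 4.0]]
wider = remap_width(seed_w, 3, 3)     # seed in the top-left corner, zeros elsewhere
narrower = remap_width(seed_w, 1, 2)  # only the first output channel is kept
```

The same indexing carries over to four-dimensional convolution weights by applying it to the first two axes and leaving the spatial axes untouched.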
In our ResNet adaptation, we allow architectures with a larger receptive field by introducing grouped convolutions with larger kernel sizes, which do not introduce much additional MAdds. For architecture adaptation, the parameters of the plain convolution in the seed network $N_s$ need to be remapped to the new parameters of the grouped convolution in the super network $N_{sup}$. We assume the group number in the grouped convolution is $g$. The input channel number of the grouped convolution is $1/g$ of that of the plain convolution, i.e., $s = q/g$ and $r = p$. The parameters of the plain convolution $W^s$ are remapped to $W^n$ of the grouped convolution with the corresponding input dimension. This process can be formulated as
$$W^n_{i,j,:,:} = W^s_{i,\ \lfloor (i-1)/(p/g) \rfloor \cdot s + j,\ :,:}, \qquad 1 \le i \le r,\ 1 \le j \le s.$$
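One way to read "the corresponding input dimension" is that each output channel keeps only the slice of input channels belonging to its own group. The sketch below implements that reading; it is an assumption made for illustration, spatial dimensions are omitted, and the name `remap_to_grouped` is hypothetical.

```python
def remap_to_grouped(seed_w, groups):
    """Slice a plain-convolution weight (p x q) into a grouped one (p x q/groups).

    Output channel i belongs to group i // (p // groups) and keeps only the
    q // groups input channels that this group is connected to.
    """
    p, q = len(seed_w), len(seed_w[0])
    if p % groups or q % groups:
        raise ValueError("channel counts must be divisible by the group number")
    out_per_group, in_per_group = p // groups, q // groups
    new_w = []
    for i in range(p):
        start = (i // out_per_group) * in_per_group
        new_w.append(seed_w[i][start:start + in_per_group])
    return new_w


# Hypothetical 4 x 4 seed weights; entry 10*i + j encodes (output i, input j).
seed_w = [[10 * i + j for j in range(4)] for i in range(4)]
grouped = remap_to_grouped(seed_w, 2)
```

With two groups, the first two output channels keep input channels 0 and 1, and the last two keep input channels 2 and 3.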
3.1.3 Remapping on Kernel-level
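The kernel size is commonly set as $3 \times 3$ in most artificially-designed networks [2, 29]. However, the optimal kernel size settings may not be restricted to a fixed one. In a neural network, a larger kernel size can be used to expand the receptive field and capture abundant contextual features in segmentation or detection tasks, but it takes more computation than a smaller one. Our method explores how to allocate the kernel sizes in a network more flexibly. We introduce the parameter remapping on the kernel size level and show it in Fig. 2(c). We denote the weights of the convolution in the seed network $N_s$ as $W^{k_1 \times k_1}$, whose kernel size is $k_1 \times k_1$. The weights in $N_n$ are denoted as $W^{k_2 \times k_2}$ with $k_2 \times k_2$ kernel size. If the kernel size of $N_n$ is smaller than that of $N_s$, the parameters of $W^{k_2 \times k_2}$ are remapped from the central $k_2 \times k_2$ region in $W^{k_1 \times k_1}$. Otherwise, we assign the parameters of the central $k_1 \times k_1$ region in $W^{k_2 \times k_2}$ with the values of $W^{k_1 \times k_1}$. The values of the other region surrounding the central part are assigned with $0$. The remapping process on the kernel level is formulated as follows.
$$W^{k_2 \times k_2}_{:,:,h,w} = \begin{cases} W^{k_1 \times k_1}_{:,:,\,h+\delta,\,w+\delta} & \text{if } k_2 \le k_1, \\ W^{k_1 \times k_1}_{:,:,\,h-\delta,\,w-\delta} & \text{if } k_2 > k_1 \text{ and } \delta < h, w \le \delta + k_1, \\ 0 & \text{otherwise,} \end{cases} \qquad \delta = \frac{|k_1 - k_2|}{2},$$
where $h, w$ denote the indices of the spatial dimension.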
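The kernel-level rule (central crop when shrinking, zero padding around the center when enlarging) can be sketched for a single two-dimensional kernel as follows. The function name `remap_kernel` is hypothetical, and the channel dimensions are omitted for brevity.

```python
def remap_kernel(seed_k, new_size):
    """Remap a square kernel (list of rows) to `new_size` x `new_size`.

    Smaller target: take the central region of the seed kernel.
    Larger target: place the seed kernel in the center and fill the rest with zeros.
    """
    old_size = len(seed_k)
    if (old_size - new_size) % 2:
        raise ValueError("kernel sizes must have the same parity")
    if new_size <= old_size:
        d = (old_size - new_size) // 2
        return [row[d:d + new_size] for row in seed_k[d:d + new_size]]
    d = (new_size - old_size) // 2
    new_k = [[0.0] * new_size for _ in range(new_size)]
    for h in range(old_size):
        for w in range(old_size):
            new_k[h + d][w + d] = seed_k[h][w]
    return new_k


# Hypothetical 3 x 3 seed kernel.
seed_k = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k5 = remap_kernel(seed_k, 5)  # 3 x 3 values surrounded by a ring of zeros
k1 = remap_kernel(seed_k, 1)  # only the central value survives
```

With stride 1 and matching "same" padding, the zero-padded larger kernel computes the same output as the seed kernel.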
3.2 Fast Network Adaptation
We divide our neural network adaptation into three steps. Fig. 1 illustrates the whole adaptation procedure. Firstly, we expand the seed network $N_s$ to a super network $N_{sup}$, which is the representation of the search space in the subsequent architecture adaptation process. Secondly, we perform the differentiable NAS method to implement network adaptation on the architecture level and obtain the target architecture $Arch_t$. Finally, we adapt the parameters of the target architecture and obtain the target network $N_t$. The aforementioned parameter remapping mechanism is deployed before the two stages, i.e., architecture adaptation and parameter adaptation.
3.2.1 Network Expansion
We expand the seed network $N_s$ to a super network $N_{sup}$ by introducing more options of architecture elements. For every MBConv layer, we allow for more kernel size settings and more expansion ratios. As most differentiable NAS methods [27, 18, 30] do, we construct a super network as the representation of the search space. In the super network, we relax every layer by assigning each candidate operation an architecture parameter. The output of each layer is computed as a weighted sum of output tensors from all candidate operations,
$$\bar{o}^{(l)}(x) = \sum_{o \in O} \frac{\exp(\alpha_o^{l})}{\sum_{o' \in O} \exp(\alpha_{o'}^{l})}\, o(x),$$
where $O$ denotes the operation set, $\alpha_o^{l}$ denotes the architecture parameter of operation $o$ in the $l$th layer, and $x$ denotes the input tensor. We set more layers in one stage of the super network and add the identity connection to the candidate operations for depth search. The structure of the search space is detailed in Tab. I. After expanding the seed network $N_s$ to the super network $N_{sup}$, we remap the parameters of $N_s$ to $N_{sup}$ based on the paradigm illustrated in Sec. 3.1. As shown in Fig. 1, the parameters of different candidate operations (except the identity connection) in one layer of $N_{sup}$ are all remapped from the same remapping layer of $N_s$. This remapping strategy avoids the huge cost of ImageNet pre-training of the search space, i.e., the super network in differentiable NAS.
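A relaxed layer of this kind can be sketched as below. The sketch assumes the weights are obtained by a softmax over the architecture parameters, as is common in differentiable NAS; scalar functions stand in for the candidate operations, and the names `softmax` and `mixed_layer` are illustrative only.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_layer(x, ops, alphas):
    """Weighted sum of all candidate operation outputs for input x."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))


# Hypothetical candidates: identity connection plus two toy operations.
ops = [lambda x: x, lambda x: x + 1.0, lambda x: x * x]
uniform = mixed_layer(3.0, ops, [0.0, 0.0, 0.0])   # equal weights: (3 + 4 + 9) / 3
peaked = mixed_layer(3.0, ops, [0.0, 50.0, 0.0])   # dominated by the second operation
```

As one architecture parameter grows, the layer output approaches that of the corresponding single operation, which is why the final architecture can be read off from the largest parameter.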
3.2.2 Architecture Adaptation
We start the differentiable NAS process with the expanded super network $N_{sup}$ directly on the target task, i.e., semantic segmentation, object detection and human pose estimation in our experiments. As NAS works commonly do, we split a part of the data from the original training dataset as the validation set. In the preliminary search epochs, as the operation weights are not sufficiently trained, the architecture parameters cannot be updated towards a clear and correct direction. We therefore first train the operation weights of the super network for some epochs on the training dataset, which is also mentioned in some previous differentiable NAS works [30, 19]. After the weights get sufficiently trained, we start alternating the optimization of operation weights $w$ and architecture parameters $\alpha$. Specifically, we update $w$ on the training dataset by computing $\partial \mathcal{L} / \partial w$ and optimize $\alpha$ on the validation dataset with $\partial \mathcal{L} / \partial \alpha$. To control the computation cost (MAdds in our experiments) of the searched network, we define the loss function as follows.
$$\mathcal{L} = \mathcal{L}_{task} + \lambda \log_{\tau}(cost), \tag{8}$$
where $\lambda$ in the second term controls the magnitude of the MAdds optimization. The $cost$ term during search is computed as
$$cost^{l} = \sum_{o \in O} \frac{\exp(\alpha_o^{l})}{\sum_{o' \in O} \exp(\alpha_{o'}^{l})} \cdot cost_o^{l}, \qquad cost = \sum_{l} cost^{l},$$
where $cost_o^{l}$ is obtained by measuring the cost of operation $o$ in layer $l$, $cost^{l}$ is the total cost of layer $l$, computed as a weighted sum of all operation costs, and $cost$ is the total cost of the network, obtained by summing the cost of all the layers. To accelerate the search process and decouple the parameters of different sub-networks, we only sample one path in each iteration, according to the distribution of architecture parameters, for operation weight updating. When the search process terminates, we use the architecture parameters $\alpha$ to derive the target architecture $Arch_t$. The final operation type in each searched layer is determined as the one with the maximum architecture parameter $\alpha$.
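The cost-aware objective and the final architecture derivation can be sketched as follows. Several details are assumptions made for the example: the operation weights are taken as a softmax over architecture parameters, the penalty uses a logarithm with a hypothetical base `tau`, and all function names and numbers are illustrative.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over a list of architecture parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def expected_cost(alphas_per_layer, costs_per_layer):
    """Sum over layers of the softmax-weighted cost of the candidate operations."""
    total = 0.0
    for alphas, costs in zip(alphas_per_layer, costs_per_layer):
        weights = softmax(alphas)
        total += sum(w * c for w, c in zip(weights, costs))
    return total


def search_loss(task_loss, cost, lam, tau):
    """Task loss plus a log-scaled cost penalty (the base `tau` is an assumption)."""
    return task_loss + lam * math.log(cost, tau)


def derive_architecture(alphas_per_layer):
    """For every layer, pick the index of the operation with the largest alpha."""
    return [max(range(len(a)), key=lambda k: a[k]) for a in alphas_per_layer]


# Hypothetical two-layer search space with two candidate operations per layer.
alphas = [[0.0, 0.0], [0.0, 0.0]]
costs = [[10.0, 30.0], [20.0, 40.0]]
cost = expected_cost(alphas, costs)                 # 0.5 * (10 + 30) + 0.5 * (20 + 40)
loss = search_loss(1.0, cost, lam=0.01, tau=50.0)
chosen = derive_architecture([[0.1, 0.9], [2.0, -1.0]])
```

Because the expected cost is a smooth function of the architecture parameters, the penalty can be optimized by gradient descent together with the task loss.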
Tab. III: Computation cost comparison on semantic segmentation (GHs: GPU hours).
| Method | Total Cost | ArchAdapt Cost | ParamAdapt Cost |
| DPC | 62.2K GHs | 62.2K GHs | 30.0 GHs |
| Auto-DeepLab-S | 244.0 GHs | 72.0 GHs | 172.0 GHs |
| FNA++ | 35.8 GHs | 1.4 GHs | 34.4 GHs |
3.2.3 Parameter Adaptation
We obtain the target architecture $Arch_t$ from architecture adaptation. To accommodate the new tasks, the target architecture becomes different from that of the seed network $N_s$ (which was originally designed for the image classification task). Unlike the conventional training strategy, we discard the cumbersome pre-training process of $Arch_t$ on ImageNet. We remap the parameters of $N_{sup}$ to $Arch_t$ before parameter adaptation. As shown in Fig. 1, the parameters of every searched layer in $Arch_t$ are remapped from the operation with the same type in the corresponding layer in $N_{sup}$. As the shape of the parameters is the same for the same operation type, the remapping process here can be performed as a pure collection. All the other layers in $Arch_t$, including the input convolution and the head part of the network, are directly remapped from $N_{sup}$ as well. In our former conference version, the parameters of $Arch_t$ are remapped from the seed network $N_s$. We find that performing parameter remapping from $N_{sup}$ can achieve better performance than from $N_s$. We further study the remapping mechanism for parameter adaptation in experiments (Sec. 4.6). With parameter remapping on $Arch_t$ finished, we fine-tune the parameters of $Arch_t$ on the target task and obtain the final target network $N_t$.
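Since the selected operation in each searched layer already exists in the super network with identical parameter shapes, remapping for parameter adaptation reduces to a lookup. A minimal sketch follows, assuming the super network is stored as one dictionary per layer; the operation names and the function name `collect_target_params` are hypothetical.

```python
def collect_target_params(super_layers, chosen_ops):
    """Collect, for each searched layer, the parameters of the chosen operation.

    super_layers: list of dicts mapping operation name -> trained parameters.
    chosen_ops:   list with the selected operation name for every layer.
    """
    if len(super_layers) != len(chosen_ops):
        raise ValueError("one chosen operation is required per searched layer")
    return [layer[op] for layer, op in zip(super_layers, chosen_ops)]


# Hypothetical super network with two layers and two candidates per layer.
super_layers = [
    {"k3_e3": "w_0_k3_e3", "k5_e6": "w_0_k5_e6"},
    {"k3_e3": "w_1_k3_e3", "k5_e6": "w_1_k5_e6"},
]
target = collect_target_params(super_layers, ["k5_e6", "k3_e3"])
```

No reshaping, cropping or padding is needed at this stage, in contrast to the remapping from the seed network to the super network.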
4 Experiments
In this section, we first select the ImageNet pre-trained model MobileNetV2 as the seed network and apply our FNA++ method on three common computer vision tasks in Sec. 4.1 - 4.3, i.e., semantic segmentation, object detection and human pose estimation. We implement FNA++ on more network types to demonstrate the generalization ability, including ResNets in Sec. 4.4 and NAS networks in Sec. 4.5. We study the remapping mechanism for parameter adaptation in Sec. 4.6 by comparing and analyzing two remapping mechanisms. Then in Sec. 4.7, we evaluate the effectiveness of parameter remapping for the two adaptation stages. Random search experiments in Sec. 4.8 are performed to demonstrate that our method can be used as a NAS-method-agnostic mechanism. Finally, we study different remapping strategies in Sec. 4.9.
4.1 Network Adaptation on Semantic Segmentation
4.1.1 Implementation Details
The semantic segmentation experiments are conducted on the Cityscapes  dataset. In the architecture adaptation process, we map the seed network to the super network, which is used as the backbone of DeepLabv3 . We randomly sample images from the training set as the validation set for architecture parameters updating. The original validation set is not used in the search process. The image is first resized to and patches are randomly cropped as the input data. The output feature maps of the backbone are down-sampled by a factor of . Depthwise separable convolutions  are used in the ASPP module [50, 6]. As shown in Tab. I, the stages where the expansion ratio of MBConv is 6 in the original MobileNetV2 are searched and adjusted. We set the maximum numbers of layers in each searched stage of the super network as . We set a warm-up stage in the first epochs to linearly increase the learning rate from to . Then, the learning rate decays to with the cosine annealing schedule . The batch size is set as . We use the SGD optimizer with momentum and weight decay for operation weights and the Adam optimizer  with weight decay and a fixed learning rate of for architecture parameters. For the loss function defined in Eq. 8, we set as to optimize the MAdds of the searched network. The search process takes epochs in total. The architecture optimization starts after epochs. The whole search process is conducted on a single V100 GPU and takes only 1.4 hours in total.
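The learning-rate schedule used for the operation weights during search (linear warm-up followed by cosine annealing) can be sketched as follows. The function name and every numeric value in the usage lines are hypothetical placeholders, not the settings used in the experiments.

```python
import math


def learning_rate(epoch, total_epochs, warmup_epochs, lr_start, lr_peak, lr_min):
    """Linear warm-up from lr_start to lr_peak, then cosine annealing to lr_min."""
    if total_epochs <= warmup_epochs:
        raise ValueError("total_epochs must exceed warmup_epochs")
    if epoch < warmup_epochs:
        return lr_start + (lr_peak - lr_start) * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return lr_min + 0.5 * (lr_peak - lr_min) * (1.0 + math.cos(math.pi * progress))


# Placeholder settings: 10 epochs in total, 2 of them for warm-up.
schedule = [learning_rate(e, 10, 2, 0.0, 0.1, 0.0) for e in range(11)]
```

The schedule rises linearly during warm-up, reaches its peak when warm-up ends, and then decays smoothly to the minimum at the last epoch.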
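Tab. V: Computation cost comparison on object detection (GDs: GPU days).
| Method | Total Cost | Super Network: Pre-training | Super Network: Fine-tuning | Super Network: Search | Target Network: Pre-training | Target Network: Fine-tuning |
| DetNAS | 68 GDs | 12 GDs | 12 GDs | 20 GDs | 12 GDs | 12 GDs |
| FNA++ (RetinaNet) | 8.5 GDs | - | - | 5.3 GDs | - | 3.2 GDs |
| FNA++ (SSDLite) | 21.0 GDs | - | - | 5.7 GDs | - | 15.3 GDs |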
In the parameter adaptation process, we remap the parameters of the super network to the target architecture obtained in the aforementioned architecture adaptation. The training data is cropped as a patch from the rescaled image. The scale is randomly selected from . Random left-right flipping is used. We update the statistics of the batch normalization (BN) for iterations before the parameter fine-tuning process. We use the same SGD optimizer as the search process. The learning rate linearly increases from to and then decays to with the polynomial schedule. The batch size is set as . The whole parameter adaptation process is conducted on TITAN-Xp GPUs and takes K iterations, which costs only hours in total.
4.1.2 Experimental Results
Our semantic segmentation results are shown in Tab. II. The FNA++ network achieves mIOU on Cityscapes with the DeepLabv3 framework, mIOU better than the manually designed seed network MobileNetV2 with fewer MAdds. Compared with the NAS method DPC (with MobileNetV2 as the backbone), which searches a multi-scale module for semantic segmentation tasks, FNA++ gets mIOU promotion with B fewer MAdds. For fair comparison with Auto-DeepLab, which searches the backbone architecture on DeepLabv3 and retrains the searched network on DeepLabv3+, we adapt the parameters of the target architecture to the DeepLabv3+ framework. Compared with Auto-DeepLab-S, FNA++ achieves far better mIOU with fewer MAdds, Params and training iterations. With the output stride of 16, FNA++ promotes the mIOU by with only MAdds of Auto-DeepLab-S. With the improved remapping mechanism for parameter adaptation, FNA++ achieves better performance than our former version. We compare the computation cost in Tab. III. With the remapping mechanism, FNA++ greatly decreases the computation cost for adaptation, only taking 35.8 GPU hours, 1737× less than DPC and 6.8× less than Auto-DeepLab.
4.2 Network Adaptation on Object Detection
4.2.1 Implementation Details
We further implement our FNA++ method on object detection tasks. We adapt the MobileNetV2 seed network to two commonly used detection systems, RetinaNet and SSDLite [8, 29], on the MS-COCO dataset. Our implementation is based on the PyTorch framework and the MMDetection toolkit. In the search process of architecture adaptation, we randomly sample data from the original trainval35k set as the validation set.
RetinaNet. We describe the details in the search process of architecture adaptation as follows. The maximum layer numbers in each searched stage are set as , as Tab. I shows. For the input image, the short side is resized to while the maximum long side is set as . For operation weights, we use the SGD optimizer with weight decay and momentum. We set a warm-up stage in the first iterations to linearly increase the learning rate from to . Then we decay the learning rate by a factor of at the 8th and 11th epoch. For the architecture parameters, we use the Adam optimizer  with weight decay and a fixed learning rate . For the multi-objective loss function, we set as in Eq. 8. We begin optimizing the architecture parameters after epochs. All the other training settings are the same as the RetinaNet implementation in MMDetection . For fine-tuning of the parameter adaptation, we use the SGD optimizer with weight decay and momentum. The same warm-up procedure is set in the first iterations to increase the learning rate from to . Then we decay the learning rate by at the 8th and 11th epoch. The whole architecture search process takes epochs, hours on 8 TITAN-Xp GPUs with the batch size of 16 and the whole parameter fine-tuning takes 12 epochs, about hours on 8 TITAN-Xp GPUs with 32 batch size.
SSDLite. We resize the input images to ones. For operation weights in the search process, we use the standard RMSProp optimizer with weight decay. The warm-up stage in the first iterations increases learning rate from to . Then we decay the learning rate by at the , and epoch. The architecture optimization starts after epochs. We set as for the loss function. The other search settings are the same as the RetinaNet experiment. For parameter adaptation, the initial learning rate is and decays after , and epochs by a factor of . The other training settings follow the SSD implementation in MMDetection. The search process takes epochs in total, hours on TITAN-Xp GPUs with batch size. The parameter adaptation takes epochs, hours on TITAN-Xp GPUs with batch size.
4.2.2 Experimental Results
We show the results on the MS-COCO dataset in Tab. IV. For the RetinaNet framework, compared with two manually designed networks, ShuffleNetV2-10 [44, 21] and MobileNetV2 , FNA++ achieves higher mAP with similar MAdds. Compared with DetNAS  which searches the backbone of the detection network, FNA++ achieves higher mAP with M fewer Params and B fewer MAdds. As shown in Tab. V, our total computation cost is only of DetNAS on RetinaNet. For SSDLite in Tab. IV, FNA++ surpasses both the manually designed network MobileNetV2 and the NAS-searched network MnasNet-92 , while MnasNet takes around 3.8K GPU days to search for the backbone network on ImageNet . The total computation cost of MnasNet is far larger than ours and is unaffordable for most researchers or engineers. The specific cost FNA++ takes on SSDLite is shown in Tab. V. It is difficult to train the small network due to the simplification . Therefore, experiments on SSDLite need longer training schedules and take larger computation cost than RetinaNet. The experimental results further demonstrate the efficiency and effectiveness of direct adaptation on the target task with parameter remapping and architecture search.
4.3 Network Adaptation on Human Pose Estimation
4.3.1 Implementation Details
We apply FNA++ on the human pose estimation task. The experiments are performed on the MPII dataset with the SimpleBaseline framework. The MPII dataset contains around 25K images with about 40K people. For the search process in architecture adaptation, we randomly sample data from the original training set as the validation set for architecture parameter optimization. The other data is used as the training set for search. For architecture parameters, we use the Adam optimizer with a fixed learning rate of and weight decay. We set in Eq. 8 as for MAdds optimization. The input image is cropped and then resized to following the standard training settings [10, 11]. The batch size is set as 32. All the other training hyper-parameters are the same as SimpleBaseline. The search process takes epochs in total and the architecture parameter updating starts after 70 epochs. For parameter adaptation, we use the same training settings as SimpleBaseline. PCKh@0.5 is used as the evaluation metric.
4.3.2 Experimental Results
The architecture adaptation takes 16 hours on one TITAN X GPU and the parameter adaptation takes 5.5 hours on one TITAN X GPU, so the total computation cost is 21.5 GPU hours. As shown in Tab. VI, FNA++ promotes the PCKh@0.5 by with similar model MAdds. As we aim at validating the effectiveness of FNA++ on networks, we do not tune the training hyper-parameters and simply follow the default ResNet-50 training settings in SimpleBaseline for both MobileNetV2 and the FNA++ network training.
|block type||kernel size||group number|
4.4 Network Adaptation on ResNet
To evaluate the generalization ability on different network types, we perform our method on ResNets , including ResNet-18 and ResNet-50. As ResNets are composed of plain convolutions, kernel size enlargement will cause huge MAdds increase. We propose to search for diverse kernel sizes in ResNets without much MAdds increase by introducing grouped convolutions . The searchable ResNet blocks are shown in Fig. 5. We allow the first convolution in the basic block and the second convolution in the bottleneck block to be searched. All the optional block types in the designed ResNet search space are shown in Tab. VII. As the kernel size enlarges, we set more groups in the convolution block to maintain the MAdds.
We perform the adaptation on ResNet-18 and -50 to the RetinaNet framework. For ResNet-18, the input image for search is resized to ones with the short side to and the long side not exceeding (shortly denoted as in MMDetection). The SGD optimizer for operation weights is used with weight decay and initial learning rate. $\lambda$ in Eq. 8 is set as . All the other search and training settings are the same as the MobileNetV2 experiments on RetinaNet. The total adaptation cost is only GPU days, including hours on TITAN-Xp GPUs for search and hours on GPUs for parameter adaptation. For ResNet-50, the batch size is set as in total for search. The input image is also resized to . For the SGD optimizer, the initial learning rate is and the weight decay is . The other hyper-parameters for search are the same as those for ResNet-18. For the training in parameter adaptation, we first recalculate the running statistics of BN for iterations with the synchronized batch normalization across GPUs (SyncBN). Then we freeze the BN layers (freezing BN means using the running statistics of BN during training and not updating the BN parameters; it is implemented as .eval() in PyTorch) and train the target architecture on MS-COCO using the same hyper-parameters as ResNet-50 training in MMDetection. The architecture adaptation takes hours and parameter adaptation takes hours on TITAN-Xp GPUs, GPU days in total. The results are shown in Tab. VIII. Compared with the original ResNet-18 and -50, FNA++ can further promote the mAP by and with fewer Params and MAdds.
4.5 Network Adaptation on NAS networks
Our proposed parameter remapping paradigm can be applied to various types of networks. We further apply FNA++ to two popular NAS networks, i.e., FBNet-C and Proxyless (mobile). The search space is constructed as shown in Tab. IX. FBNet and ProxylessNAS search for architectures on the ImageNet classification task. To compare with the seed networks FBNet-C and Proxyless (mobile), we re-implement the two NAS networks and deploy them in the RetinaNet framework. We then train them on the MS-COCO dataset from the ImageNet pre-trained parameters, using the same training hyper-parameters as ours. The results are shown in Tab. X. Though the NAS networks already perform far better than the handcrafted MobileNetV2 on the detection task, our FNA++ networks further improve the mAP at a MAdds cost similar to that of the NAS seed networks. This experiment demonstrates that FNA++ can not only improve manually designed networks but also improve NAS networks that were not searched on the target task. In real applications, when a new task arises, FNA++ helps to adapt the network at low cost, avoiding both the cumbersome cost of extra pre-training and the huge cost of searching from scratch. We visualize the architectures in Fig. 6.
|Remapping source||MAdds||Params||mAP (%)|
|from seed network||132.99B||11.91M||35.6|
|from super network||132.99B||11.91M||36.0|
4.6 Study the Remapping Mechanism for Parameter Adaptation
In our preliminary version, with the target architecture obtained by architecture adaptation, we remap the parameters of the seed network to the target architecture for the subsequent parameter adaptation. As we explore the mechanism of parameter remapping, we find that parameters remapped from the super network can bring further performance gains for parameter adaptation. However, updating the batch normalization (BN) parameters during search may cause instability and damage the training performance of the sub-architectures in the super network. For this reason, the BN parameters are usually disabled during search in many differentiable/one-shot NAS methods [27, 24]. We enable BN parameter updating in the search process, including the learnable affine parameters and the global mean/variance statistics, so that the parameters from the super network can be fully used for parameter adaptation. Experiments show that updating the BN parameters has little effect on the search performance.
|Row||Method||MAdds||mIOU (%)|
|(1)||Remap → ArchAdapt → RemapSuper → ParamAdapt (FNA++)||24.17B||77.1|
|(2)||Remap → ArchAdapt → Remap → ParamAdapt (FNA)||24.17B||76.6|
|(3)||RandInit → ArchAdapt → Remap → ParamAdapt||24.29B||76.0|
|(4)||Remap → ArchAdapt → RandInit → ParamAdapt||24.17B||73.0|
|(5)||RandInit → ArchAdapt → RandInit → ParamAdapt||24.29B||72.4|
|(6)||Remap → ArchAdapt → Pretrain → ParamAdapt||24.17B||76.5|
As shown in Tab. XI and Tab. XII, remapping from the super network yields better performance on both the object detection framework RetinaNet and the semantic segmentation framework DeepLabv3. However, for SSDLite [8, 29], remapping parameters from the super network achieves the same mAP as remapping from the seed network. We attribute this to the long training schedule of SSDLite, i.e., 60 epochs. To check this, we further train RetinaNet with a long schedule (24 epochs in MMDetection). The results in Tab. XI show that the gain of remapping from the super network over remapping from the seed network shrinks under the longer schedule. This indicates that remapping from the super network for parameter adaptation is most effective in short training scenarios. The conclusion is somewhat similar to the earlier finding that longer training schedules from scratch can achieve results comparable to training from a pre-trained model. We compare the training loss and mAP with different remapping mechanisms in Fig. 7. Training with initial parameters remapped from the super network converges much faster in early epochs than training with parameters remapped from the seed network, and it achieves a higher final mAP under short training schedules. The two remapping mechanisms achieve similar results under long training schedules, e.g., SSDLite training. We therefore suggest remapping the parameters from the super network when computation resources are constrained.
4.7 Effectiveness of Parameter Remapping
To evaluate the effectiveness of the parameter remapping paradigm in our method, we optionally remove the parameter remapping process before each of the two stages, i.e., architecture adaptation and parameter adaptation. The experiments are conducted with the DeepLabv3 semantic segmentation framework on the Cityscapes dataset.
Tab. XIII shows the complete set of experiments we perform on parameter remapping. Row (1) denotes the procedure of FNA++ and Row (2) denotes the former version, which remaps the seed parameters for parameter adaptation. In Row (3), we remove the parameter remapping process before architecture adaptation. In other words, the search is performed from scratch without using the pre-trained network. The mIOU in Row (3) drops by 0.6% compared to Row (2). In Row (4), we instead remove the parameter remapping before parameter adaptation, i.e., we train the target architecture from scratch on the target task. The mIOU decreases by 3.6% compared with Row (2). When we remove the parameter remapping before both stages in Row (5), the performance is the worst. In Row (6), we first pre-train the searched architecture on ImageNet and then fine-tune it on the target task. It is worth noting that FNA in Row (2) achieves a slightly higher mIOU (by 0.1%) than the ImageNet pre-trained one in Row (6). We conjecture that this may benefit from the regularization effect of parameter remapping before the parameter adaptation stage.
All the experiments are conducted using the same search and training settings for fair comparison. With parameter remapping applied before both stages, the adaptation achieves the best results in Tab. XIII. In particular, the remapping before parameter adaptation tends to provide greater performance gains than the remapping before architecture adaptation. These results demonstrate the importance and effectiveness of the proposed parameter remapping scheme.
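The drops discussed in this section follow directly from the mIOU column of the ablation rows. The short check below recomputes them; the dictionary simply copies the reported mIOU values, and the helper name `drop` is introduced here only for illustration.

```python
# mIOU (%) per ablation row, copied from the reported results.
miou = {1: 77.1, 2: 76.6, 3: 76.0, 4: 73.0, 5: 72.4, 6: 76.5}


def drop(reference_row, ablated_row):
    """mIOU lost by `ablated_row` relative to `reference_row`, in points."""
    return round(miou[reference_row] - miou[ablated_row], 1)


print(drop(2, 3))  # no remapping before architecture adaptation
print(drop(2, 4))  # no remapping before parameter adaptation
print(drop(2, 5))  # no remapping before either stage
print(drop(2, 6))  # ImageNet pre-training instead of remapping
```

The comparison makes the asymmetry explicit: removing the remapping before parameter adaptation costs several times more mIOU than removing it before architecture adaptation.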
|Row||Method||MAdds (B)||mAP (%)|
|(2)||Remap → DiffSearch → Remap → ParamAdapt||133.03||33.9|
|(3)||Remap → RandSearch → Remap → ParamAdapt||133.11||33.5|
|(4)||RandInit → RandSearch → Remap → ParamAdapt||133.08||31.5|
|(5)||Remap → RandSearch → RandInit → ParamAdapt||133.11||25.3|
|(6)||RandInit → RandSearch → RandInit → ParamAdapt||133.08||24.9|
4.8 Random Search Experiments
We carry out the Random Search (RandSearch) experiments with the RetinaNet framework on the MS-COCO dataset. All the results are shown in Tab. XIV. In Row (3), we simply replace the original differentiable NAS (DiffSearch) method in FNA++ with the random search method. The random search is given the same computation cost as the search in FNA++ for a fair comparison. We observe that FNA++ with RandSearch achieves results comparable to our original method. This further confirms that FNA++ is a general framework for network adaptation with strong generalization ability. NAS is only an implementation tool for architecture adaptation, and the whole FNA++ framework can be treated as a mechanism that is agnostic to the NAS method. It is worth noting that even with random search, FNA++ still outperforms DetNAS by 0.2% mAP with 150M fewer MAdds.
We further conduct ablation studies on the parameter remapping scheme, similar to those in Sec. 4.7, in Rows (4) - (6). All the experiments further support the effectiveness of the parameter remapping scheme.
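As a rough illustration of how a random search can stand in for the differentiable search, the sketch below samples architectures uniformly and keeps the best one that fits a cost budget. Everything in it is hypothetical: the candidate operations, the toy `cost_proxy` standing in for MAdds, and the `score_fn` callback, which in practice would evaluate a sub-architecture with remapped weights on validation data. It is not the exact procedure used in the experiments.

```python
import random

# Hypothetical per-layer candidates: (kernel size, expansion ratio).
CANDIDATES = [(3, 3), (3, 6), (5, 3), (5, 6), (7, 3), (7, 6)]


def sample_architecture(num_layers, rng):
    """Draw one candidate operation per layer uniformly at random."""
    return [rng.choice(CANDIDATES) for _ in range(num_layers)]


def cost_proxy(arch):
    """Toy stand-in for MAdds: kernel area times expansion, summed over layers."""
    return sum(k * k * e for k, e in arch)


def random_search(num_layers, num_samples, budget, score_fn, seed=0):
    """Return the best-scoring sampled architecture whose cost fits the budget."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("-inf")
    for _ in range(num_samples):
        arch = sample_architecture(num_layers, rng)
        if cost_proxy(arch) > budget:
            continue  # skip architectures that exceed the budget
        score = score_fn(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


# Toy usage: prefer larger kernels, subject to the cost budget.
arch, score = random_search(4, 200, budget=600,
                            score_fn=lambda a: sum(k for k, _ in a))
print(arch, score)
```

Because only the sampling loop changes, the remapping steps before and after the search stay untouched, which is what makes the framework independent of the particular NAS method.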
4.9 Study Parameter Remapping Strategies
We explore more strategies for the parameter remapping paradigm. All the experiments are conducted with the DeepLabv3 framework on the Cityscapes dataset. We explore the following aspects. For simplicity, we denote the weights of the seed network and the new network on the remapping dimension (output/input channel) as $\mathbf{W}_s$ and $\mathbf{W}_n$, respectively.
4.9.1 Remapping with BN Statistics on Width-level
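We review the formulation of batch normalization as follows,
$$y_l = \gamma \cdot \frac{x_l - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$
where $x_l$ denotes the $p$-dimensional input tensor of the $l$-th layer, $\mu$ and $\sigma^2$ denote the mean and variance statistics, $\epsilon$ is a small constant for numerical stability, $\beta$ is the learnable shift, and $\gamma$ denotes the learnable parameter which scales the normalized data on the channel dimension. We compute the absolute values of $\gamma$ as $|\gamma|$. When remapping the parameters on the width-level, we sort the values of $|\gamma|$ and map the parameters with the sorted top-$k$ indices. More specifically, we define a weights remapping function in Algo. 1, where the reference vector is $|\gamma|$.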
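Since Algo. 1 is not reproduced in this section, the following is only a sketch of what such a weights remapping function might look like: it selects the channels with the largest reference values and copies them to the new network. The function name `remap_width`, the list-based weight representation, and the choice to keep the selected channels in their original order are assumptions made for illustration.

```python
def remap_width(seed_weights, reference, num_keep):
    """Select `num_keep` channels of `seed_weights` by largest reference value.

    seed_weights: one entry per channel (any per-channel weight object).
    reference:    one importance score per channel, e.g. |gamma| of BN.
    """
    if len(seed_weights) != len(reference):
        raise ValueError("one reference value per channel is required")
    if not 0 < num_keep <= len(seed_weights):
        raise ValueError("num_keep must be in [1, number of channels]")
    # Indices of the num_keep largest reference values.
    order = sorted(range(len(reference)), key=lambda i: reference[i], reverse=True)
    # Assumption: the kept channels stay in their original relative order.
    top = sorted(order[:num_keep])
    return [seed_weights[i] for i in top]


gamma = [0.9, -0.1, 0.4, -1.2]
seed = ["w0", "w1", "w2", "w3"]
new = remap_width(seed, [abs(g) for g in gamma], num_keep=2)
print(new)  # ['w0', 'w3']
```

Passing a different reference vector changes which channels are kept without touching the selection logic, so the same function can serve other importance measures.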
4.9.2 Remapping with Weight Importance on Width-level
We attempt to use a canonical form of the convolution weights to measure the importance of parameters, and then remap the seed network parameters of greater importance to the new network. The remapping operation is conducted based on Algo. 1 as well. We experiment with two canonical forms of weights to compute the reference vector: the standard deviation of the seed weights, $\mathrm{std}(\mathbf{W}_s)$, and the norm of the seed weights, $\|\mathbf{W}_s\|$.
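A reference vector of this kind assigns one score to each channel of the seed weights. The sketch below computes two such scores per channel; the helper names are hypothetical, each channel is assumed to be given as a flat list of its weights, and the L1 norm is used as one possible choice of norm.

```python
import statistics


def channel_std(channel_weights):
    """Population standard deviation of one channel's flattened weights."""
    return statistics.pstdev(channel_weights)


def channel_l1(channel_weights):
    """L1 norm of one channel's flattened weights (assumed choice of norm)."""
    return sum(abs(w) for w in channel_weights)


def reference_vector(seed_weights, measure):
    """One importance score per channel, computed with `measure`."""
    return [measure(channel) for channel in seed_weights]


# Three toy channels with four weights each.
seed = [
    [0.5, -0.5, 0.5, -0.5],
    [0.1, 0.1, 0.1, 0.1],
    [2.0, 0.0, -2.0, 0.0],
]
print(reference_vector(seed, channel_std))
print(reference_vector(seed, channel_l1))
```

The two measures need not agree in general: a channel with constant non-zero weights has zero standard deviation but a non-zero norm, so the choice of reference vector can change which channels are remapped.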
4.9.3 Remapping with Dilation on Kernel-level
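We experiment with another strategy of parameter remapping on the kernel-level. Different from the method defined in Sec. 3.1, we remap the parameters in a dilated manner, as shown in Fig. 4.9.1. The values in the convolution kernel that are not remapped are all set to $0$. It is formulated as
$$\mathbf{W}_n[:, :, h, w] = \begin{cases} \mathbf{W}_s[:, :, h/d, w/d] & \text{if } d \mid h \text{ and } d \mid w, \\ 0 & \text{otherwise,} \end{cases}$$
where $\mathbf{W}_n$ and $\mathbf{W}_s$ denote the weights of the new network and the seed network respectively, $h, w$ denote the (zero-based) spatial indices, and $d$ denotes the dilation rate.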
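The dilated kernel remapping can be sketched for a single 2D kernel slice as follows. This is an illustrative sketch rather than the authors' implementation: the function name `remap_kernel_dilated`, the zero-based indexing, and zero as the fill value for positions that receive no seed weight are assumptions.

```python
def remap_kernel_dilated(seed_kernel, dilation):
    """Spread a square 2D kernel onto a larger zero-filled kernel.

    Entry (i, j) of the seed kernel is copied to (i * dilation, j * dilation);
    every other position of the new kernel stays 0.
    """
    if dilation < 1:
        raise ValueError("dilation must be a positive integer")
    k_seed = len(seed_kernel)
    if any(len(row) != k_seed for row in seed_kernel):
        raise ValueError("seed_kernel must be square")
    k_new = dilation * (k_seed - 1) + 1
    new_kernel = [[0] * k_new for _ in range(k_new)]
    for i, row in enumerate(seed_kernel):
        for j, value in enumerate(row):
            new_kernel[i * dilation][j * dilation] = value
    return new_kernel


seed = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for row in remap_kernel_dilated(seed, dilation=2):
    print(row)
```

With a 3×3 seed kernel, a dilation of 2 gives a 5×5 kernel and a dilation of 3 gives a 7×7 kernel. The resulting kernel computes the same output as the seed kernel applied as a dilated convolution with that rate, so the new layer initially covers a wider receptive field without adding new non-zero weights.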
Tab. XV shows the experimental results; all the searched models have similar MAdds. The network adaptation with the parameter remapping paradigm defined in Sec. 3.1 achieves the best results. Furthermore, the remapping operation of FNA++ is easier to implement than the alternatives described above. We have explored only a limited number of ways to implement the parameter remapping paradigm. How to conduct the remapping more efficiently remains a meaningful direction for future work.
In this paper, we propose a fast neural network adaptation method (FNA++) with a novel parameter remapping paradigm and an architecture search method. We adapt the manually designed network MobileNetV2 to semantic segmentation, object detection and human pose estimation tasks on both the architecture level and the parameter level. The generalization ability of FNA++ is further demonstrated on both ResNets and NAS networks. The parameter remapping paradigm takes full advantage of the seed network parameters, which greatly accelerates both the architecture search and the parameter fine-tuning process. With FNA++, researchers and engineers can quickly adapt more pre-trained networks to various frameworks on different tasks. As there are many ImageNet pre-trained models available in the community, adaptation can be conducted at low cost and extended to more applications, e.g., face recognition, depth estimation, etc. For real scenarios with dynamic dataset or task demands, FNA++ is a good solution to adapt or update the network at negligible cost. For researchers with constrained computation resources, FNA++ can be an efficient tool to perform various explorations on computation-consuming tasks.
We thank Liangchen Song for the discussion and assistance.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NeurIPS, 2012.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in LNCS, 2015.
-  L. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv:1706.05587, 2017.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in NeurIPS, 2015.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in ECCV, 2016.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in ICCV, 2017.
-  B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose estimation and tracking,” in ECCV, 2018.
-  K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in CVPR, 2019.
-  Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun, “Detnet: Design backbone for object detection,” in ECCV, 2018.
-  J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao, “Deep high-resolution representation learning for visual recognition,” arXiv:1908.07919, 2019.
-  B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” arXiv:1707.07012, 2017.
-  H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” in ICML, 2018.
-  C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. L. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” in ECCV, 2018.
-  M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” arXiv:1807.11626, 2018.
-  H. Cai, L. Zhu, and S. Han, “ProxylessNAS: Direct neural architecture search on target task and hardware,” in ICLR, 2019.
-  C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei, “Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation,” in CVPR, 2019.
-  Y. Zhang, Z. Qiu, J. Liu, T. Yao, D. Liu, and T. Mei, “Customizable architecture search for semantic segmentation,” in CVPR, 2019.
-  Y. Chen, T. Yang, X. Zhang, G. Meng, C. Pan, and J. Sun, “Detnas: Neural architecture search on object detection,” in NeurIPS, 2019.
-  K. He, R. B. Girshick, and P. Dollár, “Rethinking imagenet pre-training,” arXiv:1811.08883, 2018.
-  A. Brock, T. Lim, J. M. Ritchie, and N. Weston, “SMASH: one-shot model architecture search through hypernetworks,” arXiv:1708.05344, 2017.
-  G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. V. Le, “Understanding and simplifying one-shot architecture search,” in ICML, 2018.
-  J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
-  E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” arXiv:1802.01548, 2018.
-  H. Liu, K. Simonyan, and Y. Yang, “DARTS: Differentiable architecture search,” in ICLR, 2019.
-  L. Chen, M. D. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens, “Searching for efficient multi-scale architectures for dense image prediction,” in NeurIPS, 2018.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in CVPR, 2018.
-  B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, “Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search,” arXiv:1812.03443, 2018.
-  J. Fang, Y. Sun, K. Peng, Q. Zhang, Y. Li, W. Liu, and X. Wang, “Fast neural network adaptation via parameter remapping and architecture search,” in ICLR, 2020.
-  B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv:1611.01578, 2016.
-  H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu, “Hierarchical representations for efficient architecture search,” arXiv:1711.00436, 2017.
-  X. Dong and Y. Yang, “One-shot neural architecture search via self-evaluated template network,” in ICCV, 2019.
-  J. Fang, Y. Sun, Q. Zhang, Y. Li, W. Liu, and X. Wang, “Densely connected search space for more flexible neural architecture search,” in CVPR, 2020.
-  X. Dong and Y. Yang, “Searching for a robust neural architecture in four gpu hours,” in CVPR, 2019.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in CVPR, 2016.
-  R. J. Wang, X. Li, and C. X. Ling, “Pelee: A real-time object detection system on mobile devices,” in NeurIPS, 2018.
-  G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017.
-  F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.
-  C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet: Bilateral segmentation network for real-time semantic segmentation,” in ECCV, 2018.
-  X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” arXiv:1707.01083, 2017.
-  N. Ma, X. Zhang, H. Zheng, and J. Sun, “Shufflenet V2: practical guidelines for efficient CNN architecture design,” 2018.
-  T. Chen, I. J. Goodfellow, and J. Shlens, “Net2net: Accelerating learning via knowledge transfer,” arXiv:1511.05641, 2015.
-  H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang, “Efficient architecture search by network transformation,” in AAAI, 2018.
-  J. Fang, Y. Chen, X. Zhang, Q. Zhang, C. Huang, G. Meng, W. Liu, and X. Wang, “EAT-NAS: elastic architecture transfer for accelerating large-scale neural architecture search,” arXiv:1901.05884, 2019.
-  T. Elsken, J. H. Metzen, and F. Hutter, “Efficient multi-objective neural architecture search via lamarckian evolution,” in ICLR, 2019.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” TPAMI, 2017.
-  I. Loshchilov and F. Hutter, “SGDR: stochastic gradient descent with warm restarts,” in ICLR, 2017.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015.
-  L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in ECCV, 2018.
-  T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in ECCV, 2014.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
-  K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, and D. Lin, “Mmdetection: Open mmlab detection toolbox and benchmark,” arXiv:1906.07155, 2019.
-  Z. Liu, T. Zheng, G. Xu, Z. Yang, H. Liu, and D. Cai, “Training-time-friendly network for real-time object detection,” arXiv:1909.00700, 2019.
-  M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in CVPR, 2014.
-  T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in ECCV, 2014.