Neural Architecture Search for Lightweight Non-Local Networks

04/04/2020 ∙ by Yingwei Li, et al.

Non-Local (NL) blocks have been widely studied in various vision tasks. However, embedding NL blocks in mobile neural networks has rarely been explored, mainly due to two challenges: 1) NL blocks generally have a heavy computation cost, which makes them difficult to apply where computational resources are limited, and 2) it is an open problem to discover an optimal configuration for embedding NL blocks into mobile neural networks. We propose AutoNL to overcome these two obstacles. First, we propose a Lightweight Non-Local (LightNL) block by squeezing the transformation operations and incorporating compact features. With these design choices, the proposed LightNL block is 400x computationally cheaper than its conventional counterpart without sacrificing performance. Second, by relaxing the structure of the LightNL block to be differentiable during training, we propose an efficient neural architecture search algorithm to learn an optimal configuration of LightNL blocks in an end-to-end manner. Notably, using only 32 GPU hours, the searched AutoNL model achieves 77.7% top-1 accuracy on ImageNet under a typical mobile setting (350M FLOPs), significantly outperforming previous mobile models including MobileNetV2 (+5.7%) and FBNet (+2.8%). Code is available at https://github.com/LiYingwei/AutoNL.

1 Introduction

The Non-Local (NL) block [2, 39] aims to capture long-range dependencies in deep neural networks and has been used in a variety of vision tasks such as video classification [39], object detection [39], semantic segmentation [46, 48], image classification [2], and adversarial robustness [41]. Despite this remarkable progress, the general use of non-local modules under resource-constrained scenarios such as mobile devices remains underexplored. This may be due to the following two factors.

(Author note: work done during an internship at Bytedance AI Lab.)

First, NL blocks compute the response at each position by attending to all other positions and computing a weighted average of the features at all positions, which incurs a large computation burden. Several efforts have been made to reduce this overhead. For instance, [8, 22] use the associative law to reduce the memory and computation cost of matrix multiplication; Yue et al. [44] use Taylor expansion to optimize the non-local module; Cao et al. [4] compute the affinity matrix via a convolutional layer; Bello et al. [2] design a novel attention-augmented convolution. However, these methods either still incur a relatively large computation overhead (by using heavy operators such as large matrix multiplications) or produce a less accurate outcome (e.g., the simplified NL blocks of [4]), making them undesirable for mobile-level vision systems.

Figure 1: ImageNet Accuracy vs. Computation Cost. Details can be found in Table 3.

Second, NL blocks are usually implemented as individual modules that are plugged into a few manually selected layers (usually relatively deep ones). While densely embedding them into a deep network is intractable due to the high computational complexity, it remains unclear where to insert them economically. Existing methods have not fully exploited the capacity of NL blocks for relational modeling under mobile settings.

Taking these two factors into account, we aim to answer the following questions in this work: is it possible to develop an efficient NL block for mobile networks? What is the optimal configuration for embedding those modules into mobile neural networks? We propose AutoNL to address these two questions. First, we design a Lightweight Non-Local (LightNL) block, which is, to the best of our knowledge, the first work to apply non-local techniques to mobile networks. We achieve this with two critical design choices: 1) lighten the transformation operators (e.g., convolutions) and 2) utilize compact features. As a result, the proposed LightNL blocks are usually 400x computationally cheaper than conventional NL blocks [39], which makes them well suited to mobile deep learning systems. Second, we propose a novel neural architecture search algorithm. Specifically, we relax the structure of LightNL blocks to be differentiable so that our search algorithm can simultaneously determine the compactness of the features and the locations of LightNL blocks during end-to-end training. We also reuse intermediate search results by acquiring various affinity matrices in one shot, which removes redundant computation and speeds up the search process.

Our proposed search algorithm is fast and delivers high-performance lightweight models. As shown in Figure 1 and Table 3, our searched small AutoNL model achieves 76.5% ImageNet top-1 accuracy with 267M FLOPs, which is faster than MobileNetV3 [17] with comparable performance (76.6% top-1 accuracy with 356M FLOPs). Also, our searched large AutoNL model achieves 77.7% ImageNet top-1 accuracy with 353M FLOPs, which has a similar computation cost to MobileNetV3 but improves the top-1 accuracy by 1.1%.

To summarize, our contributions are three-fold: (1) We design a lightweight and search-compatible NL block for visual recognition models on mobile devices and resource-constrained platforms; (2) We propose an efficient neural architecture search algorithm to automatically learn an optimal configuration of the proposed LightNL blocks; (3) Our model achieves state-of-the-art performance on the ImageNet classification task under mobile settings.

Figure 2: Original NL vs. LightNL Block. (a) A typical architecture of the NL block contains several heavy operators, such as convolution ops and large matrix multiplications. (b) The proposed LightNL block contains much more lightweight operators, such as depthwise convolution ops and small matrix multiplications.

2 Related Work

Attention mechanism. The attention mechanism has been successfully applied to natural language processing in recent years [1, 36, 11]. Wang et al. [39] bridge the attention mechanism and the non-local operator, and use it to model long-range relationships in computer vision applications. Attention mechanisms can be applied along two orthogonal directions: channel attention and spatial attention. Channel attention [18, 37, 27] aims to model the relationships between different channels with different semantic concepts. By focusing on a part of the channels of the input feature and deactivating non-related concepts, the models can focus on the concepts of interest. Due to its simplicity and effectiveness [18], it is widely used in neural architecture search [33, 34, 17, 9].

Our work explores both directions, spatial and channel attention. Although existing works [8, 44, 4, 2, 38] exploit various techniques to improve efficiency, they are still too computationally heavy under mobile settings. To alleviate this problem, we design a lightweight spatial attention module that has a low computational cost and can be easily integrated into mobile neural networks.

Efficient mobile architectures. There are many handcrafted neural network architectures [19, 42, 16, 30, 45, 25] for mobile applications. Among them, the MobileNet family [16, 30] and the ShuffleNet family [45, 25] stand out due to their superior efficiency and performance. MobileNetV2 [30] proposes the inverted residual block to improve both efficiency and performance over MobileNetV1 [16]. ShuffleNet [25] proposes to use efficient shuffle operations along with group convolutions to design efficient networks. These methods usually require trial and error by experts during model design.

Neural Architecture Search. Recently, using neural architecture search (NAS) to design efficient network architectures for various applications has received much attention [35, 13, 23, 43, 21]. A critical part of NAS is to design proper search spaces. Guided by a meta-controller, early NAS methods use either reinforcement learning [50] or evolutionary algorithms [29] to discover better architectures. These methods are computationally inefficient, requiring thousands of GPU days to search. ENAS [28] shares parameters across sampled architectures to reduce the search cost. DARTS [24] proposes a continuous relaxation of the architecture parameters and conducts one-shot search and evaluation. These methods all adopt a NASNet-like [50] search space. Recently, more expert knowledge from handcrafted network architectures has been introduced into NAS. Using MobileNetV2 basic blocks in the search space [3, 40, 33, 34, 31, 14, 26] significantly improves the performance of searched architectures. [3, 14] reduce GPU memory consumption by executing only part of the super-net in each forward pass during training. [26] proposes an ensemble perspective of the basic block and simultaneously searches and trains the target architecture in a fine-grained search space. [31] proposes a super-kernel representation to incorporate all architectural hyper-parameters (e.g., kernel sizes and expansion ratios in MobileNetV2 blocks) in a unified search framework that reuses model parameters and computations. In our proposed search algorithm, we focus on seeking an optimal configuration of LightNL blocks in low-cost neural networks, which brings significant performance gains.

3 AutoNL

In this section, we present AutoNL: we first elaborate on how to design a Lightweight Non-Local (LightNL) block in Section 3.1; then we introduce a novel neural architecture search algorithm in Section 3.2 to automatically search for an optimal configuration of LightNL blocks.

3.1 Lightweight Non-Local Blocks

In this section, we first revisit the NL blocks, then we introduce our proposed Lightweight Non-Local (LightNL) block in detail.

Revisit NL blocks. The core component of the NL block is the non-local operation. Following [39], a generic non-local operation can be formulated as

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \qquad (1)$$

where $i$ indexes the position of the input feature $x$ whose response is to be computed, $j$ enumerates all possible positions in $x$, $f(x_i, x_j)$ outputs the affinity between $x_i$ and its context features $x_j$, $g(x_j)$ computes an embedding of the input feature at position $j$, and $C(x)$ is the normalization term. Following [39], the non-local operation in Eqn. (1) is wrapped into an NL block with a residual connection from the input feature $x$. The mathematical formulation is given as

$$z_i = W_z\, y_i + x_i \qquad (2)$$

where $W_z$ denotes a learnable feature transformation.

Instantiation. The dot product is used as the functional form of $f(x_i, x_j)$ due to its simplicity in computing the correlation between features. Eqn. (1) thus becomes

$$y = \frac{1}{C(x)}\, \theta(x)\, \theta(x)^T g(x) \qquad (3)$$

Here the shape of $x$ is denoted as $(H, W, C)$, where $H$, $W$ and $C$ are the height, width and number of channels, respectively. $\theta(\cdot)$ and $g(\cdot)$ are $1 \times 1$ convolutional layers with $C$ filters. Before the matrix multiplications, the outputs of the convolutions are reshaped to $(H \times W, C)$.

Levi et al. [22] discover that for NL blocks instantiated in the form of Eqn. (3), employing the associative law of matrix multiplication can largely reduce the computation overhead. Based on the associative law, Eqn. (3) can be written in two equivalent forms:

$$y = \frac{1}{C(x)} \left(\theta(x)\, \theta(x)^T\right) g(x) = \frac{1}{C(x)}\, \theta(x) \left(\theta(x)^T g(x)\right) \qquad (4)$$

Although the two forms produce the same numerical results, they have different computational complexity [22]. Therefore, in computing Eqn. (3), one can always choose the form with the smaller computation cost.
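The cost difference between the two association orders can be made concrete with a small multiply-add count. The sketch below is only illustrative: it assumes both transformed features have shape (n, c) with n = H x W, ignores the normalization term, and uses hypothetical helper names (`madds_affinity_first`, `madds_context_first`, `cheaper_form`) that do not come from the paper.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in b_cols] for row in a]


def madds_affinity_first(n, c):
    """Multiply-adds for (theta theta^T) g with theta, g of shape (n, c)."""
    return n * c * n + n * n * c


def madds_context_first(n, c):
    """Multiply-adds for theta (theta^T g) with theta, g of shape (n, c)."""
    return c * n * c + n * c * c


def cheaper_form(n, c):
    """Name the association order with fewer multiply-adds."""
    if madds_affinity_first(n, c) <= madds_context_first(n, c):
        return "affinity_first"
    return "context_first"


# Tiny integer example: n = 3 positions, c = 2 channels.
theta = [[1, 2], [3, 4], [5, 6]]
g = [[1, 0], [0, 1], [1, 1]]
form_a = matmul(matmul(theta, transpose(theta)), g)
form_b = matmul(theta, matmul(transpose(theta), g))
assert form_a == form_b  # same result, different cost
```

Under these assumptions the first order costs 2n²c multiply-adds and the second 2nc², so the n x n affinity matrix is only worth materializing when there are fewer positions than channels.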

Design principles. The following part introduces two key principles to reduce the computation cost of Eqn. (3).

Design principle 1: Share and lighten the feature transformations. Instead of using two different transformations ($\theta$ and $g$) on the same input feature $x$ in Eqn. (3), we use a shared transformation in the non-local operation. In this way, the computation cost of Eqn. (3) is significantly reduced by reusing the result of $g(x)$ in computing the affinity matrix. The simplified non-local operation is

$$y = \frac{1}{C(x)}\, g(x)\, g(x)^T g(x) \qquad (5)$$

The input feature $x$ (the output of a hidden layer) can be seen as the transformation of the input data $x_0$ through a feature transformer $F(\cdot)$. Therefore Eqn. (5) can be written as

$$y = \frac{1}{C(x)}\, g(F(x_0))\, g(F(x_0))^T g(F(x_0)) \qquad (6)$$

In the scenario of using NL blocks in neural networks, $F(\cdot)$ is represented by a parameterized deep neural network. In contrast, $g(\cdot)$ is a single convolution operation. To further simplify Eqn. (6), we integrate the learning process of $g(\cdot)$ into that of $F(\cdot)$. Taking advantage of the strong capability of deep neural networks in approximating functions [15], we remove $g(\cdot)$, and Eqn. (6) is simplified as

$$y = \frac{1}{C(x)}\, x\, x^T x \qquad (7)$$

Finally, we introduce our method to simplify the learnable transformation $W_z$, another heavy transformation function in Eqn. (2). Recent works [39] instantiate it as a convolutional layer. To further reduce the computation cost of NL blocks, we propose to replace this convolution with a depthwise convolution [16], since the latter is more efficient. Eqn. (2) is then modified to be

$$z = \mathrm{DepthwiseConv}(y, W_d) + x \qquad (8)$$

where $W_d$ denotes the depthwise convolution kernel.
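To make the simplified block concrete, here is a minimal pure-Python sketch of the computation described by Eqns. (7) and (8): the feature is multiplied with itself without any learned transformation, passed through a depthwise convolution, and added back to the input. Several details are assumptions for illustration rather than the authors' implementation: the normalization term is taken to be the number of positions, the convolution uses zero padding with an odd square kernel, and the function name `light_nl_block` is hypothetical.

```python
def light_nl_block(x, w_d):
    """Sketch of a LightNL-style block.

    x   : feature as nested lists with shape H x W x C.
    w_d : one k x k kernel per channel (depthwise), shape C x k x k, k odd.
    """
    h, w, c = len(x), len(x[0]), len(x[0][0])
    flat = [x[i][j] for i in range(h) for j in range(w)]  # (H*W) x C
    n = len(flat)

    # Non-local operation y = x (x^T x) / n: the C x C product is formed first.
    gram = [[sum(row[a] * row[b] for row in flat) for b in range(c)]
            for a in range(c)]
    y = [[sum(row[a] * gram[a][b] for a in range(c)) / n for b in range(c)]
         for row in flat]

    # Depthwise convolution over y (zero padding) plus the residual connection.
    k = len(w_d[0])
    pad = k // 2
    z = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for ch in range(c):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += w_d[ch][di][dj] * y[ii * w + jj][ch]
                z[i][j][ch] = acc + x[i][j][ch]
    return z
```

With an all-zero depthwise kernel the block reduces to the identity mapping, which is the property the search procedure later relies on when deciding whether to keep a block.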

Design principle 2: Use compact features for computing affinity matrices. Since $x$ is a high-dimensional feature, directly performing matrix multiplication using the full-sized $x$ per Eqn. (7) leads to a large computation overhead. To solve this problem, we propose to downsample $x$ first to obtain a more compact feature, which replaces $x$ in Eqn. (7). Since $x$ is a three-dimensional feature with depth (channels), width and height, we propose to downsample $x$ along either the channel dimension, the spatial dimension, or both dimensions to obtain compact features $x_c$, $x_s$ and $x_{sc}$, respectively. Consequently, the computation cost of Eqn. (7) is reduced.

Therefore, based on Eqn. (7), we can simply apply the compact features $\{x_c, x_{sc}, x_s\}$ in the NL block to compute $x_c x_{sc}^T$ and $x_{sc}^T x_s$ as

$$y = \frac{1}{C(x)} \left(x_c\, x_{sc}^T\right) x_s = \frac{1}{C(x)}\, x_c \left(x_{sc}^T x_s\right) \qquad (9)$$

Note that there is a trade-off between the computation cost and the representation capacity of the output (i.e., $y$) of the non-local operation: using more compact features (with a lower downsampling ratio) reduces the computation cost, but the output fails to capture the informative context in the discarded features; on the other hand, using denser features (with a higher downsampling ratio) helps the output capture richer context, but it is more computationally demanding. Manually setting the downsampling ratios requires trial and error. To solve this issue, we propose a novel neural architecture search (NAS) method in Section 3.2 to efficiently search for the configuration of NL blocks that achieves decent performance under specific resource constraints.
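The trade-off can be quantified with a rough multiply-add estimate. The sketch below assumes three compact features with shapes (N, C_r), (N_s, C_r) and (N_s, C), where N = H x W, C_r is the number of channels kept and N_s is the number of positions left after strided spatial sampling; the function name and its parameters are hypothetical, and the numbers in the usage note are examples rather than settings from the paper.

```python
def compact_nl_madds(h, w, c, channel_ratio, spatial_stride):
    """Estimate multiply-adds of the compact non-local product.

    channel_ratio  : fraction of channels kept (1.0 keeps all channels).
    spatial_stride : stride of spatial sampling (1 keeps all positions).
    Returns the cheaper of the two association orders.
    """
    n = h * w
    c_r = max(1, int(c * channel_ratio))
    # Ceil division so that odd sizes keep the border position.
    h_s = (h + spatial_stride - 1) // spatial_stride
    w_s = (w + spatial_stride - 1) // spatial_stride
    n_s = h_s * w_s

    # (x_c x_sc^T) x_s: an N x N_s affinity first, then the context product.
    affinity_first = n * c_r * n_s + n * n_s * c
    # x_c (x_sc^T x_s): a C_r x C matrix first, then the projection.
    context_first = c_r * n_s * c + n * c_r * c
    return min(affinity_first, context_first)
```

For a 14 x 14 feature with 96 channels, this estimate gives 3,612,672 multiply-adds at full size and 564,480 when a quarter of the channels and a spatial stride of 2 are used, roughly a 6x reduction under these assumptions.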

Before introducing our NAS method, we briefly summarize the advantages of the proposed LightNL blocks. Thanks to the two design principles above, the proposed LightNL block is empirically demonstrated to be much more efficient (see Section 4) than the conventional NL block [39], making it well suited to deployment on mobile devices with limited computational budgets. In addition, since the computational complexity of the blocks can be easily adjusted through the downsampling ratios, the proposed LightNL blocks can better support deep learning models at different scales. We illustrate the structure of the conventional NL block and that of the proposed block in Figure 2.

Figure 3: MobileNetV2 vs. MobileNetV2 + LightNL. The proposed LightNL block improves the baseline by 1.6% in ImageNet top-1 accuracy (Table 1) and by 2.3 in PASCAL VOC 2012 mIoU (Table 2).
Figure 4: Illustration of the feature reuse paradigm along the channel dimension.

3.2 Neural Architecture Search for LightNL

To validate the efficacy and generalization of the proposed LightNL block for deep networks, we perform a proof-of-concept test by applying it to every MobileNetV2 block. As shown in Figure 3, even this simple way of using the proposed LightNL blocks significantly boosts the performance on both image classification and semantic segmentation. This observation motivates us to search for a better configuration of the proposed LightNL blocks in neural networks to fully utilize their representation learning capacity. As can be seen in Section 3.1, besides the insert locations in a neural network, the downsampling scheme that controls the complexity of the LightNL blocks is another important factor to be determined. We note that both the insert locations and the downsampling scheme of LightNL blocks are critical to the performance and computational cost of models. To automate the process of model design and find an optimal configuration of the proposed LightNL blocks, we propose an efficient Neural Architecture Search (NAS) method. Concretely, we propose to jointly search the configurations of LightNL blocks and the basic neural network architectural parameters (e.g., kernel size, number of channels) using a cost-aware loss function.

Insert location. Motivated by [31], we select several candidate locations for inserting LightNL blocks throughout the network and decide whether a LightNL block should be used by comparing the norm of the depthwise convolution kernel $W_d$ to a trainable latent variable $t$:

$$\hat{W}_d = \mathbb{1}\left(\lVert W_d \rVert > t\right) \cdot W_d \qquad (10)$$

where $\hat{W}_d$ replaces $W_d$ in Eqn. (8), and $\mathbb{1}(\cdot)$ is an indicator function. $\mathbb{1}(\cdot) = 1$ indicates that a LightNL block will be used with $W_d$ being the depthwise convolution kernel. Otherwise, $\hat{W}_d = 0$ when $\mathbb{1}(\cdot) = 0$, and thus Eqn. (8) degenerates to $z = x$, meaning no lightweight non-local block will be inserted.

Instead of manually selecting the value of the threshold $t$, we set it to be a trainable parameter, which is jointly optimized with the other parameters via gradient descent. To compute the gradient of $t$, we relax the indicator function $\mathbb{1}(\cdot)$ to a differentiable sigmoid function $\sigma(\cdot)$ during the back-propagation process.
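The gating rule and its relaxation can be sketched in a few lines. This is a simplified illustration, not the authors' code: it assumes the L2 norm of a flattened kernel, a hard indicator in the forward pass, and a sigmoid surrogate whose derivative supplies the gradient for the threshold in the backward pass. The names `gate_forward` and `gate_grad_t` are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic function used as the differentiable surrogate."""
    return 1.0 / (1.0 + math.exp(-v))


def l2_norm(kernel):
    """L2 norm of a flattened depthwise kernel (assumed choice of norm)."""
    return math.sqrt(sum(v * v for v in kernel))


def gate_forward(kernel, t):
    """Hard gate: keep the kernel only if its norm exceeds the threshold t."""
    keep = 1.0 if l2_norm(kernel) > t else 0.0
    return [keep * v for v in kernel], keep


def gate_grad_t(kernel, t):
    """Surrogate d(keep)/dt obtained from sigmoid(norm - t)."""
    s = sigmoid(l2_norm(kernel) - t)
    return -s * (1.0 - s)
```

For a kernel `[3.0, 4.0]` (norm 5), a threshold of 4 keeps the block while a threshold of 6 zeroes the kernel, so the block contributes nothing beyond the residual path. The surrogate gradient is always negative, meaning that raising the threshold can only push a block toward removal.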

Module compactness. As can be seen from Eqn. (7), the computational cost of a LightNL block when performing the matrix multiplication is determined by the compactness of the downsampled features. Given a search space $R$ which contains $n$ candidate downsampling ratios, i.e., $R = \{r_1, r_2, \dots, r_n\}$ where $r_1 < r_2 < \dots < r_n$, our goal is to search for an optimal downsampling ratio for each LightNL block. For the sake of clarity, we use the case of searching downsampling ratios along the channel dimension to illustrate our method. Note that searching downsampling ratios along other dimensions can be performed in the same manner.

Unlike searching for the insert locations through Eqn. (10), we encode the choice of downsampling ratios in the process of computing the affinity matrix:

$$x_{att} = \mathbb{1}(r_1)\, x_{r_1} x_{r_1}^T + \mathbb{1}(r_2)\, x_{r_2} x_{r_2}^T + \dots + \mathbb{1}(r_n)\, x_{r_n} x_{r_n}^T \qquad (11)$$

where $x_{att}$ denotes the computed affinity matrix, $x_r$ denotes the downsampled feature with downsampling ratio $r$, and $\mathbb{1}(r)$ is an indicator which holds true when $r$ is selected. By setting the constraint that only one downsampling ratio is used, Eqn. (11) can be simplified as $x_{att} = x_{r_i} x_{r_i}^T$ when $r_i$ is selected as the downsampling ratio.

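A critical step is how to formulate the condition of $\mathbb{1}(r)$ for deciding which downsampling ratio to use. A reasonable intuition is that the criterion should be able to determine whether the downsampled feature can be used to compute an accurate affinity matrix. Thus, our goal is to define a "similarity" signal that models whether the affinity matrix from the downsampled feature is close to the "ground-truth" affinity matrix, denoted as $x_{gt}$. Specifically, we write the indicators as

$$\begin{aligned}
\mathbb{1}(r_1) &= \mathbb{1}\left(\lVert x_{r_1} x_{r_1}^T - x_{gt} \rVert < t\right) \\
\mathbb{1}(r_2) &= \mathbb{1}\left(\lVert x_{r_2} x_{r_2}^T - x_{gt} \rVert < t\right) \wedge \neg \mathbb{1}(r_1) \\
&\;\;\vdots \\
\mathbb{1}(r_n) &= \mathbb{1}\left(\lVert x_{r_n} x_{r_n}^T - x_{gt} \rVert < t\right) \wedge \neg \mathbb{1}(r_1) \wedge \dots \wedge \neg \mathbb{1}(r_{n-1})
\end{aligned} \qquad (12)$$

where $\wedge$ denotes the logical operator AND and $\neg$ denotes negation. An intuitive explanation of the rationale behind Eqn. (12) is that the algorithm always selects the smallest $r$ for which the Euclidean distance between $x_r x_r^T$ and $x_{gt}$ is lower than the threshold $t$. To ensure $\mathbb{1}(r_n) = 1$ when all other indicators are zero, we set $x_{gt} = x_{r_n} x_{r_n}^T$ so that $\mathbb{1}\left(\lVert x_{r_n} x_{r_n}^T - x_{gt} \rVert < t\right) = 1$. Meanwhile, we relax the indicator function to a sigmoid when computing gradients and update the threshold $t$ via gradient descent. Since the output of the indicator changes with different input features $x$, for better training convergence we take inspiration from batch normalization [20] and use the exponential moving average of the affinity matrices in computing Eqn. (12). After the searching stage, the downsampling ratio is determined by evaluating the following indicators:

$$\begin{aligned}
\mathbb{1}(r_1) &= \mathbb{1}\left(\mathrm{EMA}\left(\lVert x_{r_1} x_{r_1}^T - x_{gt} \rVert\right) < t\right) \\
\mathbb{1}(r_2) &= \mathbb{1}\left(\mathrm{EMA}\left(\lVert x_{r_2} x_{r_2}^T - x_{gt} \rVert\right) < t\right) \wedge \neg \mathbb{1}(r_1) \\
&\;\;\vdots \\
\mathbb{1}(r_n) &= \mathbb{1}\left(\mathrm{EMA}\left(\lVert x_{r_n} x_{r_n}^T - x_{gt} \rVert\right) < t\right) \wedge \neg \mathbb{1}(r_1) \wedge \dots \wedge \neg \mathbb{1}(r_{n-1})
\end{aligned} \qquad (13)$$

where $\mathrm{EMA}(\cdot)$ denotes the exponential moving average of its argument.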
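The selection rule, "take the smallest ratio whose affinity matrix is close enough to the reference", can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: channel downsampling is modelled as keeping the leading channels, the reference affinity is the one computed from the largest ratio, and the momentum value of the moving average is a placeholder. All function names are hypothetical.

```python
import math


def affinity(x, channels):
    """x x^T using only the first `channels` channels of each position."""
    return [[sum(a[k] * b[k] for k in range(channels)) for b in x] for a in x]


def distance(m1, m2):
    """Euclidean (Frobenius) distance between two equally sized matrices."""
    return math.sqrt(sum((p - q) ** 2
                         for r1, r2 in zip(m1, m2)
                         for p, q in zip(r1, r2)))


def ema(prev, new, momentum=0.9):
    """Exponential moving average update (momentum is an assumed value)."""
    return momentum * prev + (1.0 - momentum) * new


def select_ratio(ratios, avg_distances, t):
    """Return the smallest ratio whose averaged distance is below t.

    The largest ratio is the fallback: its distance to the reference is
    zero by construction, so it is chosen when no smaller ratio qualifies.
    """
    for r, d in zip(ratios[:-1], avg_distances[:-1]):
        if d < t:
            return r
    return ratios[-1]


# Two positions with four channels each; ratios are fractions of channels kept.
x = [[1.0, 2.0, 0.0, 0.5], [2.0, 1.0, 0.5, 0.0]]
ratios = [0.25, 0.5, 1.0]
x_gt = affinity(x, 4)
dists = [distance(affinity(x, int(r * 4)), x_gt) for r in ratios]
chosen = select_ratio(ratios, dists, t=1.0)
```

In this toy example keeping one channel is too lossy for a threshold of 1.0, keeping two channels is close enough, and so the ratio 0.5 is chosen. During a real search the distances would first be smoothed over training batches with `ema` before being compared with the threshold.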

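From Eqn. (12), one can observe that the output of the indicator $\mathbb{1}(r_i)$ depends on the indicators with smaller downsampling ratios. Based on this finding, we propose to reuse the affinity matrix computed with low-dimensional features (generated with lower downsampling ratios) when computing the affinity matrix with high-dimensional features (generated with higher downsampling ratios). Concretely, $x_{r_i}$ can be partitioned into $[x_{r_{i-1}}, x_{r_i \setminus r_{i-1}}]$ for $i > 1$, where $x_{r_i \setminus r_{i-1}}$ holds the channels of $x_{r_i}$ that are not in $x_{r_{i-1}}$. The calculation of the affinity matrix using $x_{r_i}$ can be decomposed as

$$x_{r_i} x_{r_i}^T = [x_{r_{i-1}}, x_{r_i \setminus r_{i-1}}]\, [x_{r_{i-1}}, x_{r_i \setminus r_{i-1}}]^T = x_{r_{i-1}} x_{r_{i-1}}^T + x_{r_i \setminus r_{i-1}}\, x_{r_i \setminus r_{i-1}}^T \qquad (14)$$

where $x_{r_{i-1}} x_{r_{i-1}}^T$ is the reusable affinity matrix computed with a smaller downsampling ratio (recall that $r_{i-1} < r_i$). This feature reuse paradigm can largely reduce the search overhead, since computing affinity matrices for more choices of downsampling ratios does not incur any additional computation cost. The process of feature reuse is illustrated in Figure 4.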
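The reuse idea rests on a simple identity: when the channels of a larger feature extend those of a smaller one, the larger affinity matrix equals the smaller one plus the contribution of the extra channels. The sketch below checks this on a toy feature. It assumes channel downsampling keeps the leading channels, and the helper names are hypothetical.

```python
def affinity_increment(x, start, stop):
    """Contribution of channels [start, stop) to the affinity matrix x x^T."""
    return [[sum(a[k] * b[k] for k in range(start, stop)) for b in x] for a in x]


def cumulative_affinities(x, channel_counts):
    """Affinity matrices for increasing channel counts, each built by adding
    only the newly included channels to the previous result."""
    results, running, prev = [], None, 0
    for count in channel_counts:
        inc = affinity_increment(x, prev, count)
        if running is None:
            running = inc
        else:
            running = [[p + q for p, q in zip(r1, r2)]
                       for r1, r2 in zip(running, inc)]
        results.append([row[:] for row in running])
        prev = count
    return results


x = [[1.0, 2.0, 0.0, 0.5], [2.0, 1.0, 0.5, 0.0]]
reused = cumulative_affinities(x, [1, 2, 4])
direct = affinity_increment(x, 0, 4)
assert reused[-1] == direct  # incremental result matches the full product
```

Each channel is visited exactly once across all candidates, so evaluating every candidate ratio costs the same number of multiply-adds as computing the full-channel affinity alone.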

Searching process. We integrate our proposed search algorithm with Single-path NAS [31] and jointly search basic architectural parameters (following MNasNet [33]) along with the insert locations and downsampling schemes of LightNL blocks. We search downsampling ratios along both spatial and channel dimensions to achieve better compactness. To learn efficient deep learning models, the overall objective is to minimize both the standard classification loss and the model's computation complexity, which is related to both the insert locations and the compactness of LightNL blocks:

$$\min_{w, t}\; \mathcal{L}(w, t) = \mathrm{CE}(w, t) + \lambda \cdot \log\left(\mathrm{CT}(t)\right) \qquad (15)$$

where $w$ denotes model weights and $t$ denotes architectural parameters, which can be grouped into two categories: those from the LightNL blocks, including the insert positions and downsampling ratios, and those following MNasNet [33], including kernel size, number of channels, etc. $\mathrm{CE}(w, t)$ is the cross-entropy loss, $\mathrm{CT}(t)$ is the computation (i.e., FLOPs) cost, and $\lambda$ weights the cost term. We use gradient descent to optimize the above objective function in an end-to-end manner.
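A cost-aware objective of this kind can be sketched as a classification loss plus a penalty on the estimated computation. The exact form below is an assumption made for illustration: the two terms are combined as cross-entropy plus a weighted logarithm of the FLOPs, the weight `lam` is a placeholder value, and the FLOPs estimate simply adds the cost of each candidate block scaled by its gate value. None of the function names come from the paper.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]


def expected_flops(base_flops, block_flops, gates):
    """Backbone cost plus each candidate block's cost scaled by its gate."""
    return base_flops + sum(g * f for g, f in zip(gates, block_flops))


def search_loss(logits, label, flops, lam=0.1):
    """Assumed objective: cross-entropy + lam * log(FLOPs)."""
    return cross_entropy(logits, label) + lam * math.log(flops)


# Hypothetical numbers (in millions of FLOPs): two candidate blocks, one kept.
flops = expected_flops(300.0, [10.0, 5.0], [1.0, 0.0])
loss = search_loss([2.0, 0.0], label=0, flops=flops)
```

Because the gate values enter the FLOPs estimate directly, the relaxed indicators let the cost term send gradients to the architectural thresholds, so that a block is kept only when its accuracy benefit outweighs its cost.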

4 Experiments

We first demonstrate the efficacy and efficiency of LightNL by manually inserting it into lightweight models in Section 4.1. Then we apply the proposed search algorithm to the LightNL blocks in Section 4.2. The evaluation and comparison with state-of-the-art methods are done on ImageNet classification [10].

4.1 Manually Designed LightNL Networks

Models. Our experiments are based on MobileNetV2 1.0 [30]. We insert LightNL blocks after the second point-wise convolution layer in every MobileNetV2 block. We use only a subset of the channels to compute the affinity matrix for the sake of low computation cost. Also, if the feature map exceeds a size threshold, we downsample it along the spatial axis with a fixed stride. We call the transformed model MobileNetV2-LightNL for short. We compare the two models at several depth multipliers.

Training setup. Following the training schedule in MNasNet [33], we train the models using the synchronous training setup on Tesla-V100-SXM2-16GB GPUs. We use an initial learning rate of , and a batch size of (128 images per GPU). The learning rate linearly increases to in the first epochs and then is decayed by every epochs. We use a dropout of , a weight decay of and Inception image preprocessing [32] of size . Finally, we use exponential moving average on model weights with a momentum of . All batch normalization layers use a momentum of .

ImageNet classification results. We compare the results between the original MobileNetV2 and MobileNetV2-LightNL in Figure 5. We observe consistent performance gain even without tuning the hyper-parameters of LightNL blocks for models with different depth multipliers. For example, when the depth multiplier is , the original MobileNetV2 model achieves a top-1 accuracy of with M FLOPs, while our MobileNetV2-LightNL achieves with M FLOPs. According to Figure 5, it is unlikely to boost the performance of the MobileNetV2 model to the comparable performance by simply increasing the width to get a M FLOPs model. When the depth multiplier is , LightNL blocks bring a performance gain of with a marginal increase in FLOPs (M).

Figure 5: MobileNetV2 vs. MobileNetV2-LightNL. We apply LightNL blocks to MobileNetV2 with different depth multipliers, i.e., , , , , from left to right respectively. Despite inserting LightNL blocks manually, consistent performance gains can be observed for different MobileNetV2 base models.
Figure 6: Class Activation Map (CAM) [47] for MobileNetV2 and MobileNetV2-LightNL. The three columns correspond to the ground truth, predictions by MobileNetV2 and predictions by MobileNetV2-LightNL respectively. The proposed LightNL block helps the model attend to image regions with more class-specific discriminative features.

| Operator (non-local module) | Wrapper (non-local module) | FLOPs | Acc (%) |
|---|---|---|---|
| - | - | 301M | 73.4 |
| Wang et al. [39] | Wang et al. [39] | +6.2G | 75.2 |
| Levi et al. [22] | | +146M | 75.2 |
| Zhu et al. [49] | | +107M | - |
| Eqn. (3) | Wang et al. [39] | +119M | 75.2 |
| Eqn. (5) | | +93M | 75.1 |
| Eqn. (7) | | +66M | 75.0 |
| Eqn. (9) | | +38M | 75.0 |
| Eqn. (9) | Eqn. (8) | +15M | 75.0 |

Table 1: Ablation Analysis. We present the comparison of different NL blocks and different variants in our design. The base model is MobileNetV2, which achieves a top-1 accuracy of 73.4% with 301M FLOPs.

| Method | FLOPs (M) | mIoU |
|---|---|---|
| MobileNetV2 | 301 | 70.6 |
| MobileNetV2-LightNL (ours) | 316 | 72.9 |

Table 2: Comparison of FLOPs and mIoU on PASCAL VOC 2012.

Ablation study. To diagnose the proposed LightNL block, we present a step-by-step ablation study in Table 1. As shown in the table, every modification roughly preserves the model performance while reducing the computation cost. Compared with the baseline model, the proposed LightNL block improves ImageNet top-1 accuracy by 1.6% (from 73.4% to 75.0%) while adding only 15M FLOPs, which is only about 5% of the total FLOPs of MobileNetV2. Compared with the standard NL block, the proposed LightNL block is about 400x computationally cheaper (6.2G vs. 15M) with comparable performance (75.0% vs. 75.2%). Compared with Levi et al. [22], who optimize the matrix multiplication with the associative law, the proposed LightNL block is still roughly 10x computationally cheaper (146M vs. 15M). Compared with a very recent work by Zhu et al. [49], which leverages pyramid pooling to reduce the complexity, LightNL is around 7x computationally cheaper (107M vs. 15M).

CAM visualization. To illustrate the efficacy of LightNL, Figure 6 compares the class activation maps [47] of the original MobileNetV2 and MobileNetV2-LightNL. We see that LightNL helps the model focus on more relevant regions while being much cheaper computationally than its conventional counterparts, as analyzed above. For example, at the middle top of Figure 6, the model without LightNL blocks focuses on only a part of the sewing machine. When LightNL is applied, the model can "see" the whole machine, leading to more accurate and robust predictions.

PASCAL VOC segmentation results. To demonstrate the generalization ability of our method, we compare the performance of MobileNetV2 and MobileNetV2-LightNL on the PASCAL VOC 2012 semantic segmentation dataset [12]. Following Chen et al. [7], we use the classification model as a drop-in replacement for the backbone feature extractor in DeepLabv3 [6]. It is followed by an Atrous Spatial Pyramid Pooling (ASPP) module [5] with three convolutions with different atrous rates. The modified architectures have nearly the same computation cost as the backbone models due to the low computation cost of LightNL blocks. All models are initialized with ImageNet pre-trained weights and then fine-tuned with the same training protocol as in [5]. It should be emphasized that the focus of this part is to assess the efficacy of the proposed LightNL while keeping other factors fixed. Notably, we do not adopt complex training techniques such as multi-scale and left-right flipped inputs, which may lead to better performance. The results are shown in Table 2: LightNL blocks bring a gain of 2.3 in mIoU (from 70.6 to 72.9) with a minor increase in FLOPs (from 301M to 316M). The results indicate that the proposed LightNL blocks are well suited to other tasks such as semantic segmentation.

Figure 7: The searched architecture of AutoNL-L. C and S denote channel downsampling ratio and the stride of spatial downsampling respectively. We use different colors to denote the kernel size (K) of the depthwise convolution and use height to denote the expansion rate (E) of the block. We use the round corner to denote adding SE [18] to the MobileNetV2 block.

| Model | #Params | FLOPs | Top-1 (%) | Top-5 (%) |
|---|---|---|---|---|
| MobileNetV2 [30] | 3.4M | 300M | 72.0 | 91.0 |
| MBV2 (our impl.) | 3.4M | 301M | 73.4 | 91.4 |
| ShuffleNetV2 [25] | 3.5M | 299M | 72.6 | - |
| FBNet-A [40] | 4.3M | 249M | 73.0 | - |
| Proxyless [3] | 4.1M | 320M | 74.6 | 92.2 |
| MnasNet-A1 [33] | 3.9M | 312M | 75.2 | 92.5 |
| MnasNet-A2 | 4.8M | 340M | 75.6 | 92.7 |
| AA-MnasNet-A1 [2] | 4.1M | 350M | 75.7 | 92.6 |
| MobileNetV3-L | 5.4M | 217M | 75.2 | - |
| MixNet-S [34] | 4.1M | 256M | 75.8 | 92.8 |
| AutoNL-S (ours) | 4.4M | 267M | 76.5 | 93.1 |
| FBNet-C [40] | 5.5M | 375M | 74.9 | - |
| Proxyless (GPU) [3] | - | 465M | 75.1 | 92.5 |
| SinglePath [31] | 4.4M | 334M | 75.0 | 92.2 |
| SinglePath (our impl.) | 4.4M | 334M | 74.7 | 92.2 |
| FairNAS-A [9] | 4.6M | 388M | 75.3 | 92.4 |
| EfficientNet-B0 [35] | 5.3M | 388M | 76.3 | 93.2 |
| SCARLET-A [9] | 6.7M | 365M | 76.9 | 93.4 |
| MBV3-L (1.25x) [17] | 7.5M | 356M | 76.6 | - |
| MixNet-M [34] | 5.0M | 360M | 77.0 | 93.3 |
| AutoNL-L (ours) | 5.6M | 353M | 77.7 | 93.7 |

Table 3: Comparison with the state-of-the-art models on ImageNet 2012 Val set.

4.2 AutoNL

We apply the proposed neural architecture search algorithm to search for an optimal configuration of LightNL blocks. Specifically, we have five LightNL candidates for each potential insert location: two candidate numbers of channels for computing the affinity matrix, combined with two candidate strides for sampling along the spatial dimensions, plus the option of not inserting a LightNL block at the current position. Note that it is easy to enlarge the search space by including other LightNL blocks with more hyper-parameters. In addition, similar to recent work [33, 40, 3, 31], we also search for optimal kernel sizes, expansion ratios and SE ratios with the MobileNetV2 block [30] as the building block.

We search directly on the ImageNet training set and use a computation cost loss and the cross-entropy loss as guidance, both of which are differentiable thanks to the relaxation of the indicator functions during back-propagation. The search process takes about 32 GPU hours to converge.

Performance on classification. We obtain two models using the proposed neural architecture search algorithm; we denote the large one as AutoNL-L and the small one as AutoNL-S in Table 3. The architecture of AutoNL-L is presented in Figure 7.

Table 3 shows that AutoNL outperforms all the latest mobile CNNs. Compared to the handcrafted models, AutoNL-S improves the top-1 accuracy by 4.5% over MobileNetV2 [30] and by 3.9% over ShuffleNetV2 [25] while using about 10% fewer FLOPs (267M vs. 300M and 299M). Besides, AutoNL achieves better results than the latest models from NAS approaches. For example, compared to EfficientNet-B0, AutoNL-L improves the top-1 accuracy by 1.4% while using about 10% fewer FLOPs (353M vs. 388M). Our models also achieve better performance than the latest MobileNetV3 [17], which is developed with several manual optimizations in addition to architecture search.

AutoNL-L also surpasses the state-of-the-art NL method (i.e., AA-MnasNet-A1) by 2.0% with comparable FLOPs. Even AutoNL-S improves accuracy by 0.8% over it while using about 24% fewer FLOPs (267M vs. 350M). We also compare with MixNet, a very recent state-of-the-art model under mobile settings: both AutoNL-L and AutoNL-S achieve a 0.7% improvement over their MixNet counterparts with comparable FLOPs but with much less search time (32 GPU hours for our search, compared with the search cost cited from [40]).

Figure 8: Performance comparison on different input resolutions and depth multipliers under extremely low FLOPs. For MobileNetV2 [30], FBNet [40] and our searched models, the tuples of (input resolution, depth multiplier) are , , and respectively from left to right. For MNasNet [33], we show the result of input resolution with depth multiplier.

We also search for models under different combinations of input resolutions and channel sizes under extremely low FLOPs. The results are summarized in Figure 8. AutoNL achieves consistent improvement over MobileNetV2, FBNet, and MNasNet. For example, when the input resolution is and the depth multiplier is , our model achieves accuracy, outperforming MobileNetV2 by and FBNet by .

5 Conclusion

As an important building block for various vision applications, NL blocks remain underexplored in mobile settings due to their heavy computation overhead. To the best of our knowledge, AutoNL is the first method to explore the usage of NL blocks for general mobile networks. Specifically, we design a LightNL block to enable highly efficient context modeling in mobile settings. We then propose a neural architecture search algorithm to optimize the configuration of LightNL blocks. Our method significantly outperforms prior art, achieving 77.7% top-1 accuracy on ImageNet under a typical mobile setting (350M FLOPs).

Acknowledgements This work was partially supported by ONR N00014-15-1-2356.

References

  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: §2.
  • [2] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le (2019) Attention augmented convolutional networks. arXiv preprint arXiv:1904.09925. Cited by: §1, §1, §2, Table 3.
  • [3] H. Cai, L. Zhu, and S. Han (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In ICLR, Cited by: §2, §4.2, Table 3.
  • [4] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu (2019) GCNet: non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492. Cited by: §1, §2.
  • [5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI 40 (4), pp. 834–848. Cited by: §4.1.
  • [6] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §4.1.
  • [7] Y. Chen, G. Meng, Q. Zhang, S. Xiang, C. Huang, L. Mu, and X. Wang (2019) RENAS: reinforced evolutionary neural architecture search. In CVPR, pp. 4787–4796. Cited by: §4.1.
  • [8] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng (2018) A^2-nets: double attention networks. In NeurIPS, pp. 352–361. Cited by: §1, §2.
  • [9] X. Chu, B. Zhang, J. Li, Q. Li, and R. Xu (2019) ScarletNAS: bridging the gap between scalability and fairness in neural architecture search. CoRR abs/1908.06022. Cited by: §2, Table 3.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §4.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186. Cited by: §2.
  • [12] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: a retrospective. IJCV 111 (1), pp. 98–136. Cited by: §4.1.
  • [13] G. Ghiasi, T. Lin, and Q. V. Le (2019) Nas-fpn: learning scalable feature pyramid architecture for object detection. In CVPR, pp. 7036–7045. Cited by: §2.
  • [14] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2019) Single path one-shot neural architecture search with uniform sampling. CoRR abs/1904.00420. Cited by: §2.
  • [15] K. Hornik, M. Stinchcombe, and H. White (1989) Multilayer feedforward networks are universal approximators. Neural networks 2 (5), pp. 359–366. Cited by: §3.1.
  • [16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2, §3.1.
  • [17] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. arXiv preprint arXiv:1905.02244. Cited by: §1, §2, §4.2, Table 3.
  • [18] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141. Cited by: §2, Figure 7.
  • [19] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger (2017) CondenseNet: an efficient densenet using learned group convolutions. In CVPR, Cited by: §2.
  • [20] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §3.2.
  • [21] Q. Jin, L. Yang, and Z. Liao (2019) AdaBits: neural network quantization with adaptive bit-widths. arXiv preprint arXiv:1912.09666. Cited by: §2.
  • [22] H. Levi and S. Ullman (2018) Efficient coarse-to-fine non-local module for the detection of small objects. arXiv preprint arXiv:1811.12152. Cited by: §1, §3.1, §4.1, Table 1.
  • [23] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In CVPR, pp. 82–92. Cited by: §2.
  • [24] H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In ICLR, Cited by: §2.
  • [25] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In ECCV, Cited by: §2, §4.2, Table 3.
  • [26] J. Mei, Y. Li, X. Lian, X. Jin, L. Yang, A. Yuille, and J. Yang (2020) AtomNAS: fine-grained end-to-end neural architecture search. In ICLR, Cited by: §2.
  • [27] J. Park, S. Woo, J. Lee, and I. S. Kweon (2018) BAM: bottleneck attention module. In BMVC, pp. 147. Cited by: §2.
  • [28] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. In ICML, pp. 4092–4101. Cited by: §2.
  • [29] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In AAAI, pp. 4780–4789. Cited by: §2.
  • [30] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In CVPR, Cited by: §2, Figure 8, §4.1, §4.2, §4.2, Table 3.
  • [31] D. Stamoulis, R. Ding, D. Wang, D. Lymberopoulos, B. Priyantha, J. Liu, and D. Marculescu (2019) Single-path nas: designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877. Cited by: §2, §3.2, §3.2, §4.2, Table 3.
  • [32] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, Cited by: §4.1.
  • [33] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In CVPR, Cited by: §2, §2, §3.2, Figure 8, §4.1, §4.2, Table 3.
  • [34] M. Tan and Q. V. Le (2019) Mixnet: mixed depthwise convolutional kernels. In BMVC, Cited by: §2, §2, Table 3.
  • [35] M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML, pp. 6105–6114. Cited by: §2, Table 3.
  • [36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: §2.
  • [37] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. In CVPR, pp. 6450–6458. Cited by: §2.
  • [38] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L. Chen (2020) Axial-deeplab: stand-alone axial-attention for panoptic segmentation. arXiv preprint arXiv:2003.07853. Cited by: §2.
  • [39] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, pp. 7794–7803. Cited by: §1, §1, §2, §3.1, §3.1, §3.1, Table 1.
  • [40] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019) Fbnet: hardware-aware efficient convnet design via differentiable neural architecture search. In CVPR, pp. 10734–10742. Cited by: §2, Figure 8, §4.2, §4.2, Table 3.
  • [41] C. Xie, Y. Wu, L. v. d. Maaten, A. L. Yuille, and K. He (2019) Feature denoising for improving adversarial robustness. In CVPR, Cited by: §1.
  • [42] G. Xie, J. Wang, T. Zhang, J. Lai, R. Hong, and G. Qi (2018) IGCV2: interleaved structured sparse convolutional neural networks. In CVPR, Cited by: §2.
  • [43] Q. Yu, D. Yang, H. Roth, Y. Bai, Y. Zhang, A. L. Yuille, and D. Xu (2020) C2FNAS: coarse-to-fine neural architecture search for 3d medical image segmentation. In CVPR, Cited by: §2.
  • [44] K. Yue, M. Sun, Y. Yuan, F. Zhou, E. Ding, and F. Xu (2018) Compact generalized non-local network. In NeurIPS, pp. 6510–6519. Cited by: §1, §2.
  • [45] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In CVPR, pp. 6848–6856. Cited by: §2.
  • [46] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia (2018) Psanet: point-wise spatial attention network for scene parsing. In ECCV, pp. 267–283. Cited by: §1.
  • [47] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In CVPR, pp. 2921–2929. Cited by: Figure 6, §4.1.
  • [48] Y. Zhou, D. Dreizin, Y. Li, Z. Zhang, Y. Wang, and A. Yuille (2019) Multi-scale attentional network for multi-focal segmentation of active bleed after pelvic fractures. In International Workshop on Machine Learning in Medical Imaging, pp. 461–469. Cited by: §1.
  • [49] Z. Zhu, M. Xu, S. Bai, T. Huang, and X. Bai (2019) Asymmetric non-local neural networks for semantic segmentation. In ICCV, pp. 593–602. Cited by: §4.1, Table 1.
  • [50] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: §2.