1 Introduction
The Non-Local (NL) block [2, 39] aims to capture long-range dependencies in deep neural networks and has been used in a variety of vision tasks such as video classification [39], object detection [39], semantic segmentation [46, 48], image classification [2], and adversarial robustness [41]. Despite this remarkable progress, the general use of non-local modules in resource-constrained scenarios such as mobile devices remains underexplored. This may be due to the following two factors.
† Work done during an internship at Bytedance AI Lab.
First, NL blocks compute the response at each position by attending to all other positions and computing a weighted average of the features at those positions, which incurs a large computation burden. Several efforts have explored reducing this overhead. For instance, [8, 22] use the associative law to reduce the memory and computation cost of matrix multiplication; Yue et al. [44] use a Taylor expansion to optimize the non-local module; Cao et al. [4] compute the affinity matrix via a convolutional layer; and Bello et al. [2] design a novel attention-augmented convolution. However, these methods either still incur relatively large computation overhead (by using heavy operators such as large matrix multiplications) or produce less accurate results (e.g., the simplified NL blocks of [4]), making them undesirable for mobile-level vision systems.
Second, NL blocks are usually implemented as individual modules that are plugged into a few manually selected layers (usually relatively deep ones). While it is intractable to densely embed them in a deep network due to their high computational complexity, it remains unclear where to insert them economically. Existing methods have not fully exploited the capacity of NL blocks for relational modeling under mobile settings.
Taking these two factors into account, we aim to answer the following questions in this work: is it possible to develop an efficient NL block for mobile networks, and what is the optimal configuration for embedding such modules into mobile neural networks? We propose AutoNL to address these two questions. First, we design a Lightweight Non-Local (LightNL) block which, to the best of our knowledge, is the first application of non-local techniques to mobile networks. We achieve this with two critical design choices: 1) lightening the transformation operators (e.g., convolutions) and 2) using compact features. As a result, the proposed LightNL blocks are usually about 400× computationally cheaper than conventional NL blocks [39], which makes them well suited to mobile deep learning systems. Second, we propose a novel neural architecture search algorithm. Specifically, we relax the structure of LightNL blocks to be differentiable, so that our search algorithm can simultaneously determine the compactness of the features and the locations of LightNL blocks during end-to-end training. We also reuse intermediate search results by acquiring the various affinity matrices in one shot, which removes redundant computation and speeds up the search process.
Our proposed search algorithm is fast and delivers high-performance lightweight models. As shown in Figure 1, our searched small AutoNL model achieves 76.5% ImageNet top-1 accuracy with 267M FLOPs, using substantially fewer FLOPs than the scaled-up MobileNetV3 [17] at comparable performance (76.6% top-1 accuracy with 356M FLOPs; see Table 3). Our searched large AutoNL model achieves 77.7% ImageNet top-1 accuracy with 353M FLOPs, which has a similar computation cost to the scaled-up MobileNetV3 but improves top-1 accuracy by 1.1%.
To summarize, our contributions are threefold: (1) we design a lightweight and search-compatible NL block for visual recognition models on mobile devices and other resource-constrained platforms; (2) we propose an efficient neural architecture search algorithm to automatically learn an optimal configuration of the proposed LightNL blocks; (3) our model achieves state-of-the-art performance on the ImageNet classification task under mobile settings.
2 Related Work
Attention mechanism. The attention mechanism has been successfully applied to natural language processing in recent years [1, 36, 11]. Wang et al. [39] bridge the attention mechanism and the non-local operator, and use it to model long-range relationships in computer vision applications. Attention mechanisms can be applied along two orthogonal directions: channel attention and spatial attention. Channel attention [18, 37, 27] aims to model the relationships between channels carrying different semantic concepts: by focusing on a subset of the channels of the input feature and deactivating non-related concepts, the model can concentrate on the concepts of interest. Due to its simplicity and effectiveness [18], it is widely used in neural architecture search [33, 34, 17, 9]. Our work explores both directions of spatial and channel attention. Although existing works [8, 44, 4, 2, 38] exploit various techniques to improve efficiency, they remain too computationally heavy for mobile settings. To alleviate this problem, we design a lightweight spatial attention module with low computational cost that can be easily integrated into mobile neural networks.
Efficient mobile architectures. Many neural network architectures have been handcrafted for mobile applications [19, 42, 16, 30, 45, 25]. Among them, the MobileNet family [16, 30] and the ShuffleNet family [45, 25] stand out due to their superior efficiency and performance. MobileNetV2 [30] proposes the inverted residual block to improve both efficiency and performance over MobileNetV1 [16]. ShuffleNet [25] uses efficient shuffle operations along with group convolutions to design efficient networks. The methods above usually rely on trial and error by experts during the model design process.
Neural Architecture Search. Recently, using neural architecture search (NAS) to design efficient network architectures for various applications has received much attention [35, 13, 23, 43, 21]. A critical part of NAS is designing proper search spaces. Guided by a meta-controller, early NAS methods use either reinforcement learning [50] or evolutionary algorithms [29] to discover better architectures. These methods are computationally inefficient, requiring thousands of GPU days to search. ENAS [28] shares parameters across sampled architectures to reduce the search cost. DARTS [24] proposes a continuous relaxation of the architecture parameters and conducts one-shot search and evaluation. These methods all adopt a NASNet-like [50] search space. Recently, more expert knowledge from handcrafting network architectures has been introduced into NAS. Using MobileNetV2 basic blocks in the search space [3, 40, 33, 34, 31, 14, 26] significantly improves the performance of searched architectures. [3, 14] reduce GPU memory consumption by executing only part of the supernet in each forward pass during training. [26] proposes an ensemble perspective on the basic block and simultaneously searches and trains the target architecture in a fine-grained search space. [31] proposes a superkernel representation that incorporates all architectural hyperparameters (e.g., kernel sizes and expansion ratios in MobileNetV2 blocks) in a unified search framework to reuse model parameters and computations. Our proposed search algorithm focuses on finding an optimal configuration of LightNL blocks in low-cost neural networks, which brings significant performance gains.
3 AutoNL
In this section, we present AutoNL: we first elaborate on the design of the Lightweight Non-Local (LightNL) block in Section 3.1; we then introduce a novel neural architecture search algorithm in Section 3.2 that automatically searches for an optimal configuration of LightNL blocks.
3.1 Lightweight Non-Local Blocks
In this section, we first revisit NL blocks and then introduce our proposed Lightweight Non-Local (LightNL) block in detail.
Revisit NL blocks. The core component of an NL block is the non-local operation. Following [39], a generic non-local operation can be formulated as

$$y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \qquad (1)$$

where $i$ indexes the position of the input feature $x$ whose response is to be computed, $j$ enumerates all possible positions in $x$, $f(x_i, x_j)$ outputs the affinity between $x_i$ and its context features $x_j$, $g(x_j)$ computes an embedding of the input feature at position $j$, and $\mathcal{C}(x)$ is the normalization term. Following [39], the non-local operation in Eqn. (1) is wrapped into an NL block with a residual connection from the input feature $x$. The mathematical formulation is

$$z = w \ast y + x, \qquad (2)$$

where $w$ denotes a learnable feature transformation.
Instantiation. The dot product is used as the function form of $f$ due to its simplicity in computing the correlation between features. Eqn. (1) thus becomes

$$y = \big(\theta(x)\,\phi(x)^{\top}\big)\, g(x), \qquad (3)$$

where the normalization term is omitted for brevity. Here the shape of $x$ is denoted as $H \times W \times C$, where $H$, $W$ and $C$ are the height, width and number of channels, respectively. $\theta$, $\phi$ and $g$ are $1\times1$ convolutional layers, and their outputs are reshaped to $HW \times C$ before the matrix multiplications.
Levi et al. [22] discover that for NL blocks instantiated in the form of Eqn. (3), applying the associative law of matrix multiplication can largely reduce the computation overhead. Based on the associative rule, Eqn. (3) can be written in two equivalent forms:

$$y = \big(\theta(x)\,\phi(x)^{\top}\big)\, g(x) = \theta(x)\,\big(\phi(x)^{\top} g(x)\big). \qquad (4)$$

Although the two forms produce the same numerical results, they have different computational complexity [22]: the left form builds an $HW \times HW$ affinity matrix at cost $\mathcal{O}((HW)^2 C)$, while the right form first contracts over positions at cost $\mathcal{O}(HW \cdot C^2)$. Therefore, when computing Eqn. (3), one can always choose the form with the smaller computation cost for better efficiency.
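As a quick numerical check of Eqn. (4), the following sketch uses illustrative sizes; `theta`, `phi` and `g` stand in for the reshaped outputs of the $1\times1$ convolutions:

```python
import numpy as np

# Illustrative check of the associativity trick in Eqn. (4).
# theta, phi, g play the roles of the reshaped (HW x C) conv outputs.
HW, C = 196, 16  # e.g. a 14x14 feature map with 16 channels
rng = np.random.default_rng(0)
theta, phi, g = rng.standard_normal((3, HW, C))

y1 = (theta @ phi.T) @ g   # left form: builds an HW x HW affinity first
y2 = theta @ (phi.T @ g)   # right form: contracts to a C x C matrix first

assert np.allclose(y1, y2)       # identical results
mults_left = HW * HW * C * 2     # rough multiply-add counts of each form
mults_right = HW * C * C * 2
print(mults_left / mults_right)  # HW / C = 12.25x fewer operations here
```

The gap grows with the spatial size: whenever $HW \gg C$, the right-hand association is the cheaper one.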
Design principles. The following introduces two key principles for reducing the computation cost of Eqn. (3).
Design principle 1: share and lighten the feature transformations. Instead of using different transformations ($\theta$, $\phi$ and $g$) on the same input feature $x$ in Eqn. (3), we use a single shared transformation $\phi$ in the non-local operation. In this way, the computation cost of Eqn. (3) is significantly reduced by reusing the result of $\phi(x)$ when computing the affinity matrix. The simplified non-local operation is

$$y = \big(\phi(x)\,\phi(x)^{\top}\big)\, \phi(x). \qquad (5)$$
The input feature $x$ (the output of a hidden layer) can itself be seen as a transformation of the input data $I$ through a feature transformer $\mathcal{F}$, i.e., $x = \mathcal{F}(I)$. Therefore Eqn. (5) can be written as

$$y = \Big(\phi\big(\mathcal{F}(I)\big)\,\phi\big(\mathcal{F}(I)\big)^{\top}\Big)\, \phi\big(\mathcal{F}(I)\big). \qquad (6)$$
When NL blocks are used in neural networks, $\mathcal{F}$ is represented by a parameterized deep neural network, whereas $\phi$ is a single convolution operation. To further simplify Eqn. (6), we integrate the learning process of $\phi$ into that of $\mathcal{F}$. Taking advantage of the strong function-approximation capability of deep neural networks [15], we remove $\phi$, and Eqn. (6) simplifies to

$$y = \big(x\,x^{\top}\big)\, x. \qquad (7)$$
Finally, we simplify $w$, the remaining heavy transformation function in Eqn. (2), which recent works [39] instantiate as a $1\times1$ convolutional layer. To further reduce the computation cost of NL blocks, we propose to replace this convolution with a depthwise convolution [16], since the latter is more efficient. Eqn. (2) is then modified to

$$z = w \circledast y + x, \qquad (8)$$

where $w$ denotes the depthwise convolution kernel and $\circledast$ denotes depthwise convolution.
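A minimal sketch of the block wrapper in Eqn. (8), with a hand-rolled per-channel convolution; the kernel size and feature shapes below are illustrative, not the paper's settings:

```python
import numpy as np

def depthwise_conv(y, w):
    """Per-channel k x k convolution (stride 1, 'same' zero padding)."""
    H, W, C = y.shape
    k = w.shape[0]
    p = k // 2
    yp = np.pad(y, ((p, p), (p, p), (0, 0)))  # zero-pad the spatial axes
    out = np.zeros_like(y)
    for i in range(k):
        for j in range(k):
            # One kernel tap applied to all channels at once.
            out += yp[i:i + H, j:j + W, :] * w[i, j, :]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 16))  # block input
y = rng.standard_normal((14, 14, 16))  # non-local response
w = rng.standard_normal((3, 3, 16))    # one 3x3 filter per channel
z = depthwise_conv(y, w) + x           # Eqn. (8): z = w (*) y + x
```

The depthwise version costs $k^2 \cdot HWC$ multiply-adds instead of the $HWC^2$ of a $1\times1$ convolution, which is the source of the savings for typical $C \gg k^2$.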
Design principle 2: use compact features for computing affinity matrices. Since $x$ is a high-dimensional feature, directly performing the matrix multiplication with the full-sized $x$ per Eqn. (7) leads to large computation overhead. To solve this problem, we propose to first downsample $x$ to obtain more compact features that replace $x$ in Eqn. (7). Since $x$ is a three-dimensional feature with depth (channels), width and height, we downsample along the channel dimension, the spatial dimension, or both, to obtain the compact features $x_c$, $x_s$ and $x_{sc}$, respectively. Consequently, the computation cost of Eqn. (7) is reduced.

Based on Eqn. (7), we can then apply the compact features in the NL block to compute the affinity matrix and the output $y$ as

$$y = \big(x_c\, x_{sc}^{\top}\big)\, x_s. \qquad (9)$$
Note that there is a trade-off between the computation cost and the representation capacity of the output $y$ of the non-local operation: using more compact features (a lower downsampling ratio, i.e., keeping fewer elements) reduces the computation cost, but the output fails to capture the informative context in the discarded features; conversely, using denser features (a higher downsampling ratio) helps the output capture richer context but is more computationally demanding. Manually setting the downsampling ratios requires trial and error. To resolve this, we propose a novel neural architecture search (NAS) method in Section 3.2 to efficiently search for a configuration of NL blocks that achieves decent performance under specific resource constraints.
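The compact-feature operation of Eqn. (9) can be sketched as follows; plain striding stands in for whatever downsampling the block actually uses, and the $1/N$ normalization is one simple choice, both assumptions of this sketch:

```python
import numpy as np

# x: (H*W, C) feature; downsample channels by rc, spatial positions by rs.
rng = np.random.default_rng(0)
HW, C, rc, rs = 196, 16, 4, 4
x = rng.standard_normal((HW, C))

x_c = x[:, ::rc]      # (HW,    C/rc)  channel-compact
x_s = x[::rs, :]      # (HW/rs, C)     spatially compact
x_sc = x[::rs, ::rc]  # (HW/rs, C/rc)  compact in both dimensions

# Eqn. (9): affinity from compact features, then aggregate x_s.
f = x_c @ x_sc.T              # (HW, HW/rs) affinity matrix
y = (f @ x_s) / x_s.shape[0]  # (HW, C), normalized by #context positions
z = y + x                     # residual (depthwise conv of Eqn. (8) omitted)
```

Relative to the full $y = (x\,x^{\top})\,x$, the affinity multiplication shrinks by a factor of $r_c \cdot r_s$ while the output keeps its original $HW \times C$ shape.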
Before introducing our NAS method, we briefly summarize the advantages of the proposed LightNL block. Thanks to the two design principles above, the LightNL block is empirically much more efficient than the conventional NL block [39] (see Section 4), making it suitable for deployment on mobile devices with limited computational budgets. In addition, since the computational complexity of the block can be easily adjusted via the downsampling ratios, LightNL blocks can support deep learning models at different scales. We illustrate the structure of the conventional NL block and that of the proposed block in Figure 2.
3.2 Neural Architecture Search for LightNL
To validate the efficacy and generalization of the proposed LightNL block, we perform a proof test by applying it to every MobileNetV2 block. As shown in Figure 3, even this simple way of using LightNL blocks already significantly boosts performance on both image classification and semantic segmentation. This observation motivates us to search for a better configuration of LightNL blocks to fully utilize their representation learning capacity. As discussed in Section 3.1, besides the insert locations in a neural network, the downsampling scheme that controls the complexity of each LightNL block is another important factor to determine. Both the insert locations and the downsampling scheme of LightNL blocks are critical to the performance and computational cost of the resulting models. To automate model design and find an optimal configuration of LightNL blocks, we propose an efficient Neural Architecture Search (NAS) method. Concretely, we jointly search the configurations of the LightNL blocks and the basic architectural parameters of the network (e.g., kernel size, number of channels) using a cost-aware loss function.
Insert location. Motivated by [31], we select several candidate locations for inserting LightNL blocks throughout the network and decide whether a LightNL block should be used by comparing the norm of the depthwise convolution kernel $w$ with a trainable latent variable $t$:

$$w' = \mathbb{1}\big(\lVert w \rVert > t\big)\cdot w, \qquad (10)$$

where $w'$ replaces $w$ in Eqn. (8) and $\mathbb{1}(\cdot)$ is an indicator function. $\lVert w \rVert > t$ indicates that a LightNL block will be used, with $w'$ as its depthwise convolution kernel. Otherwise, $w' = 0$ when $\lVert w \rVert \le t$, and Eqn. (8) degenerates to $z = x$, meaning no lightweight non-local block is inserted.
Instead of manually selecting the value of the threshold $t$, we make it a trainable parameter that is jointly optimized with the other parameters via gradient descent. To compute the gradient of $t$, we relax the indicator function $\mathbb{1}(\lVert w \rVert > t)$ to a differentiable sigmoid function $\sigma(\lVert w \rVert - t)$ during the back-propagation process.

Module compactness. As can be seen from Eqn. (7), the computational cost of a LightNL block's matrix multiplications is determined by the compactness of the downsampled features. Given a search space containing $n$ candidate downsampling ratios $R = \{r_1, \dots, r_n\}$ with $r_1 < r_2 < \cdots < r_n$, our goal is to search for an optimal downsampling ratio for each LightNL block. For clarity, we illustrate our method with the case of searching downsampling ratios along the channel dimension; ratios along other dimensions are searched in the same manner.
Different from searching insert locations via Eqn. (10), we encode the choice of downsampling ratio in the computation of the affinity matrix:

$$f = \sum_{i=1}^{n} \mathbb{1}_i \cdot \big(x_{r_i}\, x_{r_i}^{\top}\big), \qquad (11)$$

where $f$ denotes the computed affinity matrix, $x_{r_i}$ denotes the feature downsampled with ratio $r_i$, and $\mathbb{1}_i$ is an indicator which holds true when $r_i$ is selected. Under the constraint that only one downsampling ratio is used, Eqn. (11) simplifies to $f = x_{r_j}\, x_{r_j}^{\top}$ when $r_j$ is selected as the downsampling ratio.
A critical step is how to formulate the condition of $\mathbb{1}_i$ for deciding which downsampling ratio to use. A reasonable intuition is that the criterion should determine whether the downsampled feature can produce an accurate affinity matrix. Thus, our goal is to define a similarity signal that models whether the affinity matrix from the downsampled feature is close to the "ground-truth" affinity matrix, denoted as $f^{*}$ and computed with the densest ratio $r_n$. Specifically, we write the indicator as

$$\mathbb{1}_i = \mathbb{1}\big(\lVert f_{r_i} - f^{*} \rVert < t_i\big) \;\wedge\; \bigwedge_{k<i} \neg\, \mathbb{1}\big(\lVert f_{r_k} - f^{*} \rVert < t_k\big), \qquad (12)$$

where $\wedge$ denotes the logical operator AND. Intuitively, Eqn. (12) always selects the smallest ratio $r_i$ for which the Euclidean distance between $f_{r_i}$ and $f^{*}$ is below the threshold $t_i$. To ensure $\mathbb{1}_n = 1$ when all other indicators are zero, we set $t_n = +\infty$ so that $\mathbb{1}(\lVert f_{r_n} - f^{*} \rVert < t_n) = 1$. As before, we relax the indicator functions to sigmoids when computing gradients and update the thresholds via gradient descent. Since the outputs of the indicators change with the input feature $x$, for better training convergence we take inspiration from batch normalization [20] and use exponential moving averages of the affinity matrices when computing Eqn. (12). After the search stage, the downsampling ratio is determined by evaluating

$$\mathbb{1}_i = \mathbb{1}\big(\lVert \bar{f}_{r_i} - \bar{f}^{*} \rVert < t_i\big) \;\wedge\; \bigwedge_{k<i} \neg\, \mathbb{1}\big(\lVert \bar{f}_{r_k} - \bar{f}^{*} \rVert < t_k\big), \qquad (13)$$

where $\bar{f}$ denotes the exponential moving average of $f$.
From Eqn. (12), one can observe that the output of indicator $\mathbb{1}_i$ depends on the indicators with smaller downsampling ratios. Based on this observation, we reuse the affinity matrix computed from low-dimensional features (generated with lower downsampling ratios) when computing the affinity matrix for higher-dimensional features (generated with higher downsampling ratios). Concretely, $x_{r_{i+1}}$ can be partitioned into $[x_{r_i}, \bar{x}]$, where $\bar{x}$ contains the additional channels. The calculation of the affinity matrix using $x_{r_{i+1}}$ can then be decomposed as

$$f_{r_{i+1}} = x_{r_{i+1}}\, x_{r_{i+1}}^{\top} = x_{r_i}\, x_{r_i}^{\top} + \bar{x}\, \bar{x}^{\top} = f_{r_i} + \bar{x}\, \bar{x}^{\top}, \qquad (14)$$

where $f_{r_i}$ is the reusable affinity matrix computed with the smaller downsampling ratio (recall that $r_i < r_{i+1}$). This feature-reuse paradigm largely reduces the search overhead, since computing affinity matrices for more choices of downsampling ratio incurs almost no additional computation cost. The process is illustrated in Figure 4.
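The decomposition in Eqn. (14) is exact and easy to verify numerically (channel counts here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
HW = 196
x_small = rng.standard_normal((HW, 4))  # channels kept at the smaller ratio
x_extra = rng.standard_normal((HW, 4))  # extra channels the larger ratio adds
x_large = np.concatenate([x_small, x_extra], axis=1)

f_small = x_small @ x_small.T            # affinity already computed earlier
f_large = x_large @ x_large.T            # naive recomputation from scratch
f_reuse = f_small + x_extra @ x_extra.T  # Eqn. (14): reuse + cheap correction

assert np.allclose(f_large, f_reuse)
```

Only the correction term for the extra channels is new work, so evaluating every candidate ratio costs about the same as evaluating the densest one alone.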
Searching process. We integrate our search algorithm with Single-Path NAS [31] and jointly search the basic architectural parameters (following MnasNet [33]) along with the insert locations and downsampling schemes of LightNL blocks. We search downsampling ratios along both the spatial and channel dimensions for better compactness. To learn efficient deep learning models, the overall objective minimizes both the standard classification loss and the model's computational complexity, which depends on both the insert locations and the compactness of the LightNL blocks:

$$\min_{w,\, t}\;\; \mathcal{L}_{CE}(w, t) + \lambda \cdot \mathcal{C}(t), \qquad (15)$$

where $w$ denotes the model weights and $t$ denotes the architectural parameters, which fall into two groups: those of the LightNL blocks (insert positions and downsampling ratios) and those following MnasNet [33] (kernel size, number of channels, etc.). $\mathcal{L}_{CE}$ is the cross-entropy loss and $\mathcal{C}$ is the computation (i.e., FLOPs) cost, balanced by a trade-off weight $\lambda$. We optimize this objective end-to-end with gradient descent.
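The cost-aware objective can be sketched as below; the additive form and the value of the trade-off weight `lam` are assumptions for illustration (the text above only specifies that the two terms are balanced):

```python
def search_loss(ce_loss, flops, lam=1e-9):
    """Eqn. (15)-style objective: classification loss plus a FLOPs penalty.

    lam trades accuracy against computation; its value here is illustrative.
    """
    return ce_loss + lam * flops

# A cheaper architecture with slightly worse CE loss can win overall:
assert search_loss(2.00, 300e6) < search_loss(1.95, 400e6)
```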
4 Experiments
We first demonstrate the efficacy and efficiency of LightNL by manually inserting it into lightweight models in Section 4.1. We then apply the proposed search algorithm to the LightNL blocks in Section 4.2. The evaluation and comparison with state-of-the-art methods are done on ImageNet classification [10].
4.1 Manually Designed LightNL Networks
Models. Our experiments are based on MobileNetV2 1.0 [30]. We insert LightNL blocks after the second pointwise convolution layer in every MobileNetV2 block. For the sake of low computation cost, we use a reduced number of channels to compute the affinity matrix, and we downsample large feature maps along the spatial axes. We call the transformed model MobileNetV2-LightNL for short, and compare the two models across a range of depth multipliers.

Training setup. Following the training schedule in MnasNet [33], we train the models synchronously on Tesla V100-SXM2-16GB GPUs, with a learning rate that increases linearly during a warm-up phase and is then decayed in steps, together with dropout, weight decay, and Inception image preprocessing [32]. Finally, we apply an exponential moving average to the model weights, and all batch normalization layers use a fixed momentum.
ImageNet classification results. We compare the original MobileNetV2 and MobileNetV2-LightNL in Figure 5. We observe a consistent performance gain, even without tuning the hyperparameters of the LightNL blocks, across models with different depth multipliers. For example, with a depth multiplier of 1.0, the original MobileNetV2 achieves a top-1 accuracy of 73.4% with 301M FLOPs, while our MobileNetV2-LightNL achieves 75.0% with 316M FLOPs. According to Figure 5, simply widening MobileNetV2 to a 316M-FLOPs model is unlikely to reach comparable performance. At smaller depth multipliers, LightNL blocks bring similar performance gains with an even smaller increase in FLOPs.
Table 1: Step-by-step ablation of non-local modules on ImageNet, on top of a MobileNetV2 backbone (301M FLOPs, 73.4% top-1). FLOPs of non-local variants are reported as additions over the backbone.

| Operator | Wrapper | FLOPs | Top-1 Acc (%) |
| – | – | 301M | 73.4 |
| Wang et al. [39] | Wang et al. [39] | +6.2G | 75.2 |
| Levi et al. [22] | Wang et al. [39] | +146M | 75.2 |
| Zhu et al. [49] | Wang et al. [39] | +107M | – |
| Eqn. (3) | Wang et al. [39] | +119M | 75.2 |
| Eqn. (5) | Wang et al. [39] | +93M | 75.1 |
| Eqn. (7) | Wang et al. [39] | +66M | 75.0 |
| Eqn. (9) | Wang et al. [39] | +38M | 75.0 |
| Eqn. (9) | Eqn. (8) | +15M | 75.0 |
Table 2: Semantic segmentation results on PASCAL VOC 2012.

| Method | FLOPs (M) | mIoU |
| MobileNetV2 | 301 | 70.6 |
| MobileNetV2-LightNL (ours) | 316 | 72.9 |
Ablation study. To diagnose the proposed LightNL block, we present a step-by-step ablation study in Table 1. As shown in the table, every modification preserves model performance while reducing the computation cost. Compared with the baseline model, the proposed LightNL block improves ImageNet top-1 accuracy by 1.6% (from 73.4% to 75.0%) while adding only 15M FLOPs, about 5% of the total FLOPs of MobileNetV2. Compared with the standard NL block, the proposed LightNL block is roughly 400× computationally cheaper (6.2G vs. 15M additional FLOPs) with comparable performance (75.0% vs. 75.2%). Compared with Levi et al. [22], who optimized the matrix multiplication with the associative law, the proposed LightNL block is still about 10× cheaper. Compared with the very recent work of Zhu et al. [49], which leverages pyramid pooling to reduce complexity, LightNL is around 7× cheaper.
CAM visualization. To illustrate the efficacy of LightNL, Figure 6 compares the class activation maps [47] of the original MobileNetV2 and MobileNetV2-LightNL. LightNL helps the model focus on more relevant regions while being much cheaper computationally than its conventional counterparts, as analyzed above. For example, in the middle-top image of Figure 6, the model without LightNL blocks focuses on only part of the sewing machine; with LightNL, the model can "see" the whole machine, leading to more accurate and robust predictions.
PASCAL VOC segmentation results. To demonstrate the generalization ability of our method, we compare MobileNetV2 and MobileNetV2-LightNL on the PASCAL VOC 2012 semantic segmentation dataset [12]. Following Chen et al. [7], we use the classification model as a drop-in replacement for the backbone feature extractor in DeepLabv3 [6], followed by an Atrous Spatial Pyramid Pooling (ASPP) module [5] with three convolutions with different atrous rates. The modified architectures have nearly the same computation cost as the backbone models thanks to the low cost of the LightNL blocks. All models are initialized with ImageNet-pretrained weights and then fine-tuned with the same training protocol as [5]. We emphasize that the focus here is to assess the efficacy of the proposed LightNL blocks while keeping all other factors fixed; notably, we do not adopt complex training techniques such as multi-scale and left-right flipped inputs, which may lead to better performance. As shown in Table 2, LightNL blocks bring a gain of 2.3 mIoU with a minor increase in FLOPs (15M). These results indicate that the proposed LightNL blocks are well suited to other tasks such as semantic segmentation.
Table 3: ImageNet classification results compared with state-of-the-art mobile models.

| Model | #Params | FLOPs | Top-1 | Top-5 |
| MobileNetV2 [30] | 3.4M | 300M | 72.0 | 91.0 |
| MBV2 (our impl.) | 3.4M | 301M | 73.4 | 91.4 |
| ShuffleNetV2 [25] | 3.5M | 299M | 72.6 | – |
| FBNet-A [40] | 4.3M | 249M | 73.0 | – |
| Proxyless [3] | 4.1M | 320M | 74.6 | 92.2 |
| MnasNet-A1 [33] | 3.9M | 312M | 75.2 | 92.5 |
| MnasNet-A2 | 4.8M | 340M | 75.6 | 92.7 |
| AA-MnasNet-A1 [2] | 4.1M | 350M | 75.7 | 92.6 |
| MobileNetV3-L | 5.4M | 217M | 75.2 | – |
| MixNet-S [34] | 4.1M | 256M | 75.8 | 92.8 |
| AutoNL-S (ours) | 4.4M | 267M | 76.5 | 93.1 |
| FBNet-C [40] | 5.5M | 375M | 74.9 | – |
| Proxyless (GPU) [3] | – | 465M | 75.1 | 92.5 |
| Single-Path [31] | 4.4M | 334M | 75.0 | 92.2 |
| Single-Path (our impl.) | 4.4M | 334M | 74.7 | 92.2 |
| FairNAS-A [9] | 4.6M | 388M | 75.3 | 92.4 |
| EfficientNet-B0 [35] | 5.3M | 388M | 76.3 | 93.2 |
| SCARLET-A [9] | 6.7M | 365M | 76.9 | 93.4 |
| MBV3-L (1.25x) [17] | 7.5M | 356M | 76.6 | – |
| MixNet-M [34] | 5.0M | 360M | 77.0 | 93.3 |
| AutoNL-L (ours) | 5.6M | 353M | 77.7 | 93.7 |
4.2 AutoNL
We apply the proposed neural architecture search algorithm to search for an optimal configuration of LightNL blocks. Specifically, we have five LightNL candidates for each potential insert location: one of two fractions of the channels sampled for computing the affinity matrix, combined with one of two spatial sampling strides, or not inserting a LightNL block at the current position at all (2 × 2 + 1 = 5 options). Note that it is easy to enlarge the search space by including LightNL variants with more hyperparameters. In addition, similar to recent work [33, 40, 3, 31], we also search for optimal kernel sizes, expansion ratios and SE ratios with the MobileNetV2 block [30] as the building block.
We search directly on the ImageNet training set, guided by both a computation cost loss and the cross-entropy loss, each of which is differentiable thanks to the relaxations of the indicator functions during back-propagation; as a result, the search process converges within a small number of epochs.
Performance on classification. We obtain two models using the proposed neural architecture search algorithm; we denote the large one as AutoNL-L and the small one as AutoNL-S in Table 3. The architecture of AutoNL-L is presented in Figure 7.
Table 3 shows that AutoNL outperforms all the latest mobile CNNs. Compared to handcrafted models, AutoNL-S improves top-1 accuracy by 4.5% over MobileNetV2 [30] and 3.9% over ShuffleNetV2 [25] while using about 11% fewer FLOPs. AutoNL also achieves better results than the latest NAS-derived models: for example, compared to EfficientNet-B0, AutoNL-L improves top-1 accuracy by 1.4% while using about 9% fewer FLOPs. Our models also outperform the latest MobileNetV3 [17], which was developed with several manual optimizations in addition to architecture search.
AutoNL-L also surpasses the state-of-the-art NL-based model (i.e., AA-MnasNet-A1) by 2.0% with comparable FLOPs, and even AutoNL-S improves accuracy by 0.8% while using about 24% fewer FLOPs. We also compare with MixNet, a very recent state-of-the-art model under mobile settings: both AutoNL-L and AutoNL-S achieve a 0.7% improvement at comparable FLOPs, while requiring far less search time than differentiable search methods such as FBNet [40].
We also search for models under different combinations of input resolution and channel size at extremely low FLOPs. The results are summarized in Figure 8: AutoNL achieves consistent improvements over MobileNetV2, FBNet, and MnasNet across these settings.
5 Conclusion
As an important building block for various vision applications, NL blocks under mobile settings remain underexplored due to their heavy computation overhead. To the best of our knowledge, AutoNL is the first method to explore the use of NL blocks in general mobile networks. Specifically, we design a LightNL block to enable highly efficient context modeling in mobile settings, and we propose a neural architecture search algorithm to optimize the configuration of LightNL blocks. Our method significantly outperforms prior art, achieving 77.7% top-1 accuracy on ImageNet under a typical mobile setting (350M FLOPs).
Acknowledgements This work was partially supported by ONR N000141512356.
References
 [1] (2015) Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: §2.
 [2] (2019) Attention augmented convolutional networks. arXiv preprint arXiv:1904.09925. Cited by: §1, §1, §2, Table 3.
 [3] (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In ICLR, Cited by: §2, §4.2, Table 3.
 [4] (2019) GCNet: non-local networks meet squeeze-excitation networks and beyond. arXiv preprint arXiv:1904.11492. Cited by: §1, §2.
 [5] (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI 40 (4), pp. 834–848. Cited by: §4.1.
 [6] (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §4.1.
 [7] (2019) RENAS: reinforced evolutionary neural architecture search. In CVPR, pp. 4787–4796. Cited by: §4.1.
 [8] (2018) A²-Nets: double attention networks. In NeurIPS, pp. 352–361. Cited by: §1, §2.
 [9] (2019) ScarletNAS: bridging the gap between scalability and fairness in neural architecture search. CoRR abs/1908.06022. Cited by: §2, Table 3.
 [10] (2009) Imagenet: a largescale hierarchical image database. In CVPR, Cited by: §4.
 [11] (2019) BERT: pretraining of deep bidirectional transformers for language understanding. In NAACL, pp. 4171–4186. Cited by: §2.
 [12] (2015) The pascal visual object classes challenge: a retrospective. IJCV 111 (1), pp. 98–136. Cited by: §4.1.
 [13] (2019) Nasfpn: learning scalable feature pyramid architecture for object detection. In CVPR, pp. 7036–7045. Cited by: §2.
 [14] (2019) Single path oneshot neural architecture search with uniform sampling. CoRR abs/1904.00420. Cited by: §2.
 [15] (1989) Multilayer feedforward networks are universal approximators. Neural networks 2 (5), pp. 359–366. Cited by: §3.1.

[16] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2, §3.1.
 [17] (2019) Searching for MobileNetV3. arXiv preprint arXiv:1905.02244. Cited by: §1, §2, §4.2, Table 3.
 [18] (2018) Squeezeandexcitation networks. In CVPR, pp. 7132–7141. Cited by: §2, Figure 7.

[19] (2017) CondenseNet: an efficient DenseNet using learned group convolutions. In CVPR, Cited by: §2.
 [20] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §3.2.
 [21] (2019) AdaBits: neural network quantization with adaptive bitwidths. arXiv preprint arXiv:1912.09666. Cited by: §2.
 [22] (2018) Efficient coarse-to-fine non-local module for the detection of small objects. arXiv preprint arXiv:1811.12152. Cited by: §1, §3.1, §4.1, Table 1.
 [23] (2019) Autodeeplab: hierarchical neural architecture search for semantic image segmentation. In CVPR, pp. 82–92. Cited by: §2.
 [24] (2019) DARTS: differentiable architecture search. In ICLR, External Links: Link Cited by: §2.
 [25] (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In ECCV, Cited by: §2, §4.2, Table 3.
 [26] (2020) AtomNAS: fine-grained end-to-end neural architecture search. In ICLR, External Links: Link Cited by: §2.
 [27] (2018) BAM: bottleneck attention module. In BMVC, pp. 147. Cited by: §2.
 [28] (2018) Efficient neural architecture search via parameter sharing. In ICML, pp. 4092–4101. Cited by: §2.

[29] (2019) Regularized evolution for image classifier architecture search. In AAAI, pp. 4780–4789. Cited by: §2.
 [30] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, Cited by: §2, Figure 8, §4.1, §4.2, §4.2, Table 3.
 [31] (2019) Single-path NAS: designing hardware-efficient ConvNets in less than 4 hours. arXiv preprint arXiv:1904.02877. Cited by: §2, §3.2, §3.2, §4.2, Table 3.
 [32] (2017) Inceptionv4, inceptionresnet and the impact of residual connections on learning. In AAAI, Cited by: §4.1.
 [33] (2019) MnasNet: platform-aware neural architecture search for mobile. In CVPR, Cited by: §2, §2, §3.2, Figure 8, §4.1, §4.2, Table 3.
 [34] (2019) Mixnet: mixed depthwise convolutional kernels. In BMVC, Cited by: §2, §2, Table 3.
 [35] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML, pp. 6105–6114. Cited by: §2, Table 3.
 [36] (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: §2.
 [37] (2017) Residual attention network for image classification. In CVPR, pp. 6450–6458. Cited by: §2.
 [38] (2020) Axial-DeepLab: stand-alone axial-attention for panoptic segmentation. arXiv preprint arXiv:2003.07853. Cited by: §2.
 [39] (2018) Non-local neural networks. In CVPR, pp. 7794–7803. Cited by: §1, §1, §2, §3.1, §3.1, §3.1, Table 1.
 [40] (2019) FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In CVPR, pp. 10734–10742. Cited by: §2, Figure 8, §4.2, §4.2, Table 3.
 [41] (2019) Feature denoising for improving adversarial robustness. In CVPR, Cited by: §1.
 [42] (2018) IGCV2: interleaved structured sparse convolutional neural networks. In CVPR, Cited by: §2.
 [43] (2020) C2FNAS: coarsetofine neural architecture search for 3d medical image segmentation. In CVPR, Cited by: §2.
 [44] (2018) Compact generalized non-local network. In NeurIPS, pp. 6510–6519. Cited by: §1, §2.
 [45] (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §2.
 [46] (2018) PSANet: point-wise spatial attention network for scene parsing. In ECCV, pp. 267–283. Cited by: §1.

[47] (2016) Learning deep features for discriminative localization. In CVPR, pp. 2921–2929. Cited by: Figure 6, §4.1.
 [48] (2019) Multi-scale attentional network for multi-focal segmentation of active bleed after pelvic fractures. In International Workshop on Machine Learning in Medical Imaging, pp. 461–469. Cited by: §1.
 [49] (2019) Asymmetric non-local neural networks for semantic segmentation. In ICCV, pp. 593–602. Cited by: §4.1, Table 1.
 [50] (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: §2.