Yet Another Pytorch Distributed MobileNetV2-based Networks Implementation
Non-Local (NL) blocks have been widely studied in various vision tasks. However, embedding NL blocks in mobile neural networks has rarely been explored, mainly due to two challenges: 1) NL blocks generally have a heavy computation cost, which makes them difficult to apply where computational resources are limited, and 2) it is an open problem to discover an optimal configuration for embedding NL blocks into mobile neural networks. We propose AutoNL to overcome these two obstacles. First, we propose a Lightweight Non-Local (LightNL) block by squeezing the transformation operations and incorporating compact features. With these design choices, the proposed LightNL block is 400x computationally cheaper than its conventional counterpart without sacrificing performance. Second, by relaxing the structure of the LightNL block to be differentiable during training, we propose an efficient neural architecture search algorithm to learn an optimal configuration of LightNL blocks in an end-to-end manner. Notably, using only 32 GPU hours, the searched AutoNL model achieves 77.7% top-1 accuracy on ImageNet under a typical mobile setting (350M FLOPs), significantly outperforming previous mobile models including MobileNetV2 (+5.7), among others (+2.8). Code is available at https://github.com/LiYingwei/AutoNL.
Non-Local (NL) blocks [2, 39] aim to capture long-range dependencies in deep neural networks and have been used in a variety of vision tasks such as video classification, object detection, semantic segmentation [46, 48], image classification, and adversarial robustness. Despite the remarkable progress, the general utilization of non-local modules under resource-constrained scenarios such as mobile devices remains underexplored. This may be due to the following two factors. (Work done during an internship at Bytedance AI Lab.)
First, NL blocks compute the response at each position by attending to all other positions and computing a weighted average of the features in all positions, which incurs a large computation burden. Several efforts have been made to reduce the computation overhead. For instance, [8, 22] use the associative law to reduce the memory and computation cost of matrix multiplication; Yue et al. use Taylor expansion to optimize the non-local module; Cao et al. compute the affinity matrix via a convolutional layer; Bello et al. design a novel attention-augmented convolution. However, these methods either still lead to relatively large computation overhead (by using heavy operators such as large matrix multiplications) or result in a less accurate outcome (e.g., simplified NL blocks), making them undesirable for mobile-level vision systems.
Second, NL blocks are usually implemented as individual modules which can be plugged into a few manually selected layers (usually relatively deep layers). Because densely embedding them into a deep network is intractable due to the high computational complexity, it remains unclear where to insert these modules economically. Existing methods have not fully exploited the capacity of NL blocks in relational modeling under mobile settings.
Taking these two factors into account, we aim to answer the following questions in this work: Is it possible to develop an efficient NL block for mobile networks? What is the optimal configuration to embed those modules into mobile neural networks? We propose AutoNL to address these two questions. First, we design a Lightweight Non-Local (LightNL) block, which is, to the best of our knowledge, the first work to apply non-local techniques to mobile networks. We achieve this with two critical design choices: 1) lighten the transformation operators (e.g., convolutions) and 2) utilize compact features. As a result, the proposed LightNL blocks are usually 400x computationally cheaper than conventional NL blocks, which makes them well suited to mobile deep learning systems. Second, we propose a novel neural architecture search algorithm. Specifically, we relax the structure of LightNL blocks to be differentiable so that our search algorithm can simultaneously determine the compactness of the features and the locations for LightNL blocks during end-to-end training. We also reuse intermediate search results by acquiring various affinity matrices in one shot to reduce redundant computation, which speeds up the search process.
Our proposed search algorithm is fast and delivers high-performance lightweight models. As shown in Figure 1, our searched small AutoNL model is faster than MobileNetV3 at comparable ImageNet top-1 accuracy, while our searched large AutoNL model has a computation cost similar to MobileNetV3 and a higher top-1 accuracy.
To summarize, our contributions are three-fold: (1) We design a lightweight and search-compatible NL block for visual recognition models on mobile devices and resource-constrained platforms; (2) We propose an efficient neural architecture search algorithm to automatically learn an optimal configuration of the proposed LightNL blocks; (3) Our model achieves state-of-the-art performance on the ImageNet classification task under mobile settings.
bridge attention mechanism and non-local operator, and use it to model long-range relationships in computer vision applications. Attention mechanisms can be applied along two orthogonal directions: channel attention and spatial attention. Channel attention [18, 37, 27] aims to model the relationships between different channels with different semantic concepts. By focusing on a part of the channels of the input feature and deactivating non-related concepts, the models can focus on the concepts of interest. Due to its simplicity and effectiveness, it is widely used in neural architecture search [33, 34, 17, 9].
Our work explores both directions, spatial and channel attention. Although existing works [8, 44, 4, 2, 38] exploit various techniques to improve efficiency, they are still too computationally heavy under mobile settings. To alleviate this problem, we design a lightweight spatial attention module that has a low computational cost and can be easily integrated into mobile neural networks.
Efficient mobile architectures. There are many handcrafted neural network architectures [19, 42, 16, 30, 45, 25] for mobile applications. Among them, the MobileNet family [16, 30] and the ShuffleNet family [45, 25] stand out due to their superior efficiency and performance. MobileNetV2 proposes the inverted residual block to improve both efficiency and performance over MobileNetV1. ShuffleNet proposes to use efficient shuffle operations along with group convolutions to design efficient networks. The design of these models usually relies on trial and error by experts.
A critical part of NAS is to design proper search spaces. Guided by a meta-controller, early NAS methods use either reinforcement learning or evolutionary algorithms to discover better architectures. These methods are computationally inefficient, requiring thousands of GPU days to search. ENAS shares parameters across sampled architectures to reduce the search cost. DARTS proposes a continuous relaxation of the architecture parameters and conducts one-shot search and evaluation. These methods all adopt a NASNet-like search space. Recently, more expert knowledge from handcrafted network architectures has been introduced into NAS. Using MobileNetV2 basic blocks in the search space [3, 40, 33, 34, 31, 14, 26] significantly improves the performance of searched architectures. [3, 14] reduce the GPU memory consumption by executing only part of the super-net in each forward pass during training. Another work proposes an ensemble perspective of the basic block and simultaneously searches and trains the target architecture in a fine-grained search space. A further work proposes a super-kernel representation to incorporate all architectural hyper-parameters (e.g., kernel sizes and expansion ratios in MobileNetV2 blocks) in a unified search framework to reuse model parameters and computations. In our proposed search algorithm, we focus on seeking an optimal configuration of LightNL blocks in low-cost neural networks, which brings significant performance gains.
In this section, we present AutoNL: we first elaborate on how to design a Lightweight Non-Local (LightNL) block in Section 3.1; then we introduce a novel neural architecture search algorithm in Section 3.2 to automatically search for an optimal configuration of LightNL blocks.
In this section, we first revisit the NL blocks, then we introduce our proposed Lightweight Non-Local (LightNL) block in detail.
Revisit NL blocks. The core component in the NL blocks is the non-local operation. Following prior work, a generic non-local operation can be formulated as
$$ y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \tag{1} $$
where $i$ indexes the position of the input feature $x$ whose response is to be computed, $j$ enumerates all possible positions in $x$, $f(x_i, x_j)$ outputs the affinity between $x_i$ and its context feature $x_j$, $g(x_j)$ computes an embedding of the input feature at position $j$, and $\mathcal{C}(x)$ is the normalization term. The non-local operation in Eqn. (1) is then wrapped into an NL block with a residual connection from the input feature $x$. The mathematical formulation is given as
$$ z_i = W_z y_i + x_i, \tag{2} $$
where $W_z$ denotes a learnable feature transformation.
Instantiation. Dot product is used as the function form of $f(x_i, x_j)$ due to its simplicity in computing the correlation between features. Eqn. (1) thus becomes
$$ y = \frac{1}{\mathcal{C}(x)}\, \theta(x)\, \theta(x)^{T} g(x). \tag{3} $$
Here the shape of $x$ is denoted as $(H, W, C)$, where $H$, $W$ and $C$ are the height, width and number of channels, respectively. $\theta(\cdot)$ and $g(\cdot)$ are convolutional layers with $C$ filters. Before the matrix multiplications, the outputs of the convolutions are reshaped to $(H \times W, C)$.
Levi et al. discover that for NL blocks instantiated in the form of Eqn. (3), employing the associative law of matrix multiplication can largely reduce the computation overhead. Based on the associative law, Eqn. (3) can be written in two equivalent forms:
$$ y = \frac{1}{\mathcal{C}(x)} \left( \theta(x)\, \theta(x)^{T} \right) g(x) = \frac{1}{\mathcal{C}(x)}\, \theta(x) \left( \theta(x)^{T} g(x) \right). \tag{4} $$
Although the two forms produce the same numerical results, they have different computational complexity. Therefore, in computing Eqn. (3), one can always choose the form with the smaller computation cost for better efficiency.
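The cost difference between the two groupings can be illustrated with a small NumPy sketch. It is not the authors' implementation: the matrices `a`, `b` and `g` stand in for reshaped feature maps of shape (N, C) with N = H x W, the sizes are hypothetical, and cost is counted as multiply-accumulates of naive matrix products.

```python
import numpy as np

def matmul_flops(m, k, n):
    """Multiply-accumulate count of a naive (m x k) @ (k x n) product."""
    return m * k * n

def nl_costs(n, c):
    """Cost of the two groupings of A @ B.T @ G for (n x c) matrices."""
    left_first = matmul_flops(n, c, n) + matmul_flops(n, n, c)   # (A @ B.T) @ G
    right_first = matmul_flops(c, n, c) + matmul_flops(n, c, c)  # A @ (B.T @ G)
    return left_first, right_first

rng = np.random.default_rng(0)
n, c = 14 * 14, 32  # hypothetical sizes: 14x14 positions, 32 channels
a = rng.standard_normal((n, c))
b = rng.standard_normal((n, c))
g = rng.standard_normal((n, c))

y_left = (a @ b.T) @ g    # builds an (n x n) affinity matrix first
y_right = a @ (b.T @ g)   # builds a (c x c) matrix first

left_cost, right_cost = nl_costs(n, c)
print(left_cost, right_cost, np.allclose(y_left, y_right))
```

With these sizes the second grouping is cheaper because C is smaller than N; when C exceeds N the first grouping wins, which is why the cheaper form is chosen per layer.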
Design principles. The following part introduces two key principles to reduce the computation cost of Eqn. (3).
Design principle 1: Share and lighten the feature transformations. Instead of using two different transformations ($\theta$ and $g$) on the same input feature $x$ in Eqn. (3), we use a shared transformation in the non-local operation. In this way, the computation cost of Eqn. (3) is significantly reduced by reusing the result of $g(x)$ in computing the affinity matrix. The simplified non-local operation is
$$ y = \frac{1}{\mathcal{C}(x)}\, g(x)\, g(x)^{T} g(x). \tag{5} $$
The input feature $x$ (the output of a hidden layer) can be seen as the transformation of the input data $x_0$ through a feature transformer $F(\cdot)$. Therefore Eqn. (5) can be written as
$$ y = \frac{1}{\mathcal{C}(x)}\, g(F(x_0))\, g(F(x_0))^{T} g(F(x_0)). \tag{6} $$
In the scenario of using NL blocks in neural networks, $F(\cdot)$ is represented by a parameterized deep neural network. In contrast, $g(\cdot)$ is a single convolution operation. To further simplify Eqn. (6), we integrate the learning process of $g(\cdot)$ into that of $F(\cdot)$. Taking advantage of the strong capability of deep neural networks in approximating functions, we remove $g(\cdot)$ and Eqn. (6) is simplified as
$$ y = \frac{1}{\mathcal{C}(x)}\, x\, x^{T} x. \tag{7} $$
Finally, we introduce our method to simplify $W_z$, another heavy transformation function in Eqn. (2). Recent works instantiate it as a convolutional layer. To further reduce the computation cost of NL blocks, we propose to replace the convolution with a depthwise convolution since the latter is more efficient. Eqn. (2) is then modified to be
$$ z = \mathrm{DepthwiseConv}(y, W_d) + x, \tag{8} $$
where $W_d$ denotes the depthwise convolution kernel.
Design principle 2: Use compact features for computing affinity matrices. Since $x$ is a high-dimensional feature, directly performing matrix multiplication using the full-sized $x$ per Eqn. (7) leads to large computation overhead. To solve this problem, we propose to downsample $x$ first to obtain a more compact feature which replaces $x$ in Eqn. (7). Since $x$ is a three-dimensional feature with depth (channels), width and height, we propose to downsample $x$ along either the channel dimension, the spatial dimension or both dimensions to obtain compact features $x_c$, $x_s$ and $x_{sc}$, respectively. Consequently, the computation cost of Eqn. (7) is reduced.
Therefore, based on Eqn. (7), we can simply apply the compact features in the NL block to compute $y$ and $z$ as
$$ y = \frac{1}{\mathcal{C}(x)}\, x_c\, x_{sc}^{T}\, x_s, \qquad z = \mathrm{DepthwiseConv}(y, W_d) + x. \tag{9} $$
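To make the effect of compact features concrete, the following NumPy sketch combines a channel-compact feature, a spatially compact feature and a jointly compact feature in the only shape-consistent order. Several choices here are assumptions for illustration, not taken from the paper: channels are downsampled by keeping the first `channel_keep` fraction, positions are downsampled by strided sampling, the normalization term is replaced by division by the number of key positions, and the depthwise convolution and residual connection are omitted.

```python
import numpy as np

def compact_nonlocal(x, channel_keep=0.25, stride=2):
    """Non-local response of an (H, W, C) feature using compact features."""
    h, w, c = x.shape
    k = max(1, int(c * channel_keep))             # channels kept (assumed scheme)
    x_c = x[:, :, :k].reshape(h * w, k)           # channel-compact queries
    x_sub = x[::stride, ::stride, :]              # spatially compact feature
    hs, ws = x_sub.shape[:2]
    x_sc = x_sub[:, :, :k].reshape(hs * ws, k)    # compact in both dimensions
    x_s = x_sub.reshape(hs * ws, c)               # spatially compact values
    affinity = x_c @ x_sc.T                       # shape (H*W, hs*ws)
    y = affinity @ x_s / (hs * ws)                # stand-in normalization
    return y.reshape(h, w, c)

def matmul_cost(h, w, c, channel_keep=1.0, stride=1):
    """Multiply-accumulates of the two matrix products above."""
    k = max(1, int(c * channel_keep))
    n = h * w
    m = len(range(0, h, stride)) * len(range(0, w, stride))
    return n * k * m + n * m * c

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
y = compact_nonlocal(x)
full_cost = matmul_cost(8, 8, 16)
compact_cost = matmul_cost(8, 8, 16, channel_keep=0.25, stride=2)
print(y.shape, full_cost, compact_cost)
```

Setting `channel_keep=1.0` and `stride=1` recovers the uncompressed product of the feature with itself, so the two settings can be compared directly.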
Note that there is a trade-off between the computation cost and the representation capacity of the output (i.e., the response $y$) of the non-local operation: using more compact features (with a lower downsampling ratio) reduces the computation cost, but the output fails to capture the informative context in the discarded features; on the other hand, using denser features (with a higher downsampling ratio) helps the output capture richer contexts, but is more computationally demanding. Manually setting the downsampling ratios requires trial and error. To solve this issue, we propose a novel neural architecture search (NAS) method in Section 3.2 to efficiently search for the configuration of NL blocks that achieves decent performance under specific resource constraints.
Before introducing our NAS method, we briefly summarize the advantages of the proposed LightNL blocks. Thanks to the two design principles above, the proposed LightNL block is empirically demonstrated to be much more efficient (see Section 4) than the conventional NL block, making it well suited to deployment on mobile devices with limited computational budgets. In addition, since the computational complexity of the blocks can be easily adjusted through the downsampling ratios, the proposed LightNL blocks can better support deep learning models at different scales. We illustrate the structure of the conventional NL block and that of the proposed block in Figure 2.
To validate the efficacy and generalization of the proposed LightNL block for deep networks, we perform a proof-of-concept test by applying it to every MobileNetV2 block. As shown in Figure 3, such a simple way of using the proposed LightNL blocks can already significantly boost the performance on both image classification and semantic segmentation. This observation motivates us to search for a better configuration of the proposed LightNL blocks in neural networks to fully utilize their representation learning capacity. As can be seen in Section 3.1, besides the insert locations in a neural network, the downsampling scheme that controls the complexity of the LightNL blocks is another important factor to be determined. Both the insert locations and the downsampling scheme of LightNL blocks are critical to the performance and computational cost of models. To automate the process of model design and find an optimal configuration of the proposed LightNL blocks, we propose an efficient Neural Architecture Search (NAS) method. Concretely, we propose to jointly search the configurations of LightNL blocks and the basic neural network architectural parameters (e.g., kernel size, number of channels) using a cost-aware loss function.
Insert location. Motivated by prior work, we select several candidate locations for inserting LightNL blocks throughout the network and decide whether a LightNL block should be used by comparing the norm of the depthwise convolution kernel $W_d$ to a trainable latent variable $t$:
$$ \hat{W}_d = \mathbb{1}\left( \lVert W_d \rVert > t \right) \cdot W_d, \tag{10} $$
where $\hat{W}_d$ replaces $W_d$ in Eqn. (8), and $\mathbb{1}(\cdot)$ is an indicator function. $\mathbb{1}(\cdot) = 1$ indicates that a LightNL block will be used with $W_d$ being the depthwise convolution kernel. Otherwise, $\hat{W}_d = 0$ when $\mathbb{1}(\cdot) = 0$, and thus Eqn. (8) degenerates to $z = x$, meaning that no lightweight non-local block will be inserted.
Instead of manually selecting the value of the threshold $t$, we set it to be a trainable parameter, which is jointly optimized with the other parameters via gradient descent. To compute the gradient of $t$, we relax the indicator function to a differentiable sigmoid function during the back-propagation process.
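The thresholded gate and its sigmoid relaxation can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the gate compares the L2 norm of the kernel with the threshold, the surrogate gradient is the derivative of `sigmoid(norm - t)` with respect to `t`, and the function names are hypothetical rather than taken from any released code.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gate_forward(kernel, t):
    """Forward pass: keep the kernel only if its L2 norm exceeds t."""
    keep = 1.0 if np.linalg.norm(kernel) > t else 0.0
    return keep * kernel, keep

def gate_threshold_grad(kernel, t):
    """Backward surrogate: d/dt of sigmoid(norm - t), used in place of the
    indicator, whose true derivative is zero almost everywhere."""
    s = sigmoid(np.linalg.norm(kernel) - t)
    return -s * (1.0 - s)

kernel = np.full((3, 3, 4), 0.1)   # hypothetical depthwise kernel, norm about 0.6
kept, on = gate_forward(kernel, t=0.5)
dropped, off = gate_forward(kernel, t=0.7)
grad = gate_threshold_grad(kernel, t=0.6)
print(on, off, grad)
```

The surrogate gradient is always negative, so raising the threshold can only push a block toward being removed, and it is largest in magnitude when the norm is close to the threshold.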
Module compactness. As can be seen from Eqn. (7), the computational cost of LightNL block when performing the matrix multiplication is determined by the compactness of downsampled features. Given a search space which contains candidate downsampling ratios, i.e., where , our goal is to search for an optimal downsampling ratio for each LightNL block. For the sake of clarity, here we use the case of searching downsampling ratios along the channel dimension to illustrate our method. Note that searching downsampling ratios along other dimensions can be performed in the same manner.
Different from searching for the insert locations through Eqn. (10), we encode the choice of downsampling ratios in the process of computing affinity matrix:
where denotes the computed affinity matrix, denotes the downsampled feature with downsampling ratio , and is an indicator which holds true when is selected. By setting the constraint that only one downsampling ratio is used, Eqn. (11) can be simplified as when is selected as the downsampling ratio.
A critical step is how to formulate the condition of for deciding which downsampling ratio to use. A reasonable intuition is that the criterion should be able to determine whether the downsampled feature can be used to compute an accurate affinity matrix. Thus, our goal is to define a "similarity" signal that models whether the affinity matrix from the downsampled feature is close to the "ground-truth" affinity matrix, denoted as . Specifically, we write the indicator as
where denotes the logical operator AND. An intuitive explanation of the rationale behind Eqn. (12) is that the algorithm always selects the smallest with which the Euclidean distance between and is lower than threshold . To ensure when all other indicators are zeros, we set so that . Meanwhile, we relax the indicator function to sigmoid when computing gradients and update the threshold via gradient descent. Since the output of the indicator changes with different input feature , for better training convergence, we take inspiration from batch normalization and use the exponential moving average of affinity matrices in computing Eqn. (12). After the searching stage, the downsampling ratio is determined by evaluating the following indicators:
where denotes the exponential moving averaged value of .
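The selection rule described above (pick the smallest ratio whose affinity matrix is close enough to the reference) can be sketched in NumPy. The details are assumptions made for illustration: a ratio is read as the fraction of channels kept, the reference affinity is computed from all channels, closeness is the Frobenius norm of the difference, and the exponential moving average over batches is left out.

```python
import numpy as np

def select_ratio(x, ratios, threshold):
    """Return the smallest ratio whose affinity is within `threshold`
    of the reference affinity; x has shape (N, C)."""
    n, c = x.shape
    reference = x @ x.T                          # affinity from all channels
    for r in sorted(ratios):
        k = max(1, int(round(c * r)))            # channels kept for ratio r
        approx = x[:, :k] @ x[:, :k].T
        if np.linalg.norm(approx - reference) < threshold:
            return r
    return max(ratios)                           # largest ratio always qualifies

# Hypothetical feature: two strong channels, six weak ones.
x = np.full((4, 8), 1e-3)
x[:, :2] = [[1, 2], [3, 4], [5, 6], [7, 8]]
ratios = [0.25, 0.5, 1.0]
loose = select_ratio(x, ratios, threshold=1e-3)
medium = select_ratio(x, ratios, threshold=2e-5)
strict = select_ratio(x, ratios, threshold=1e-5)
print(loose, medium, strict)
```

A looser threshold admits a more compact feature, while a strict one falls through to the largest ratio, mirroring the role of the trainable threshold in the search.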
From Eqn. (12), one can observe that the output of an indicator depends on the indicators with smaller downsampling ratios. Based on this finding, we propose to reuse the affinity matrix computed with low-dimensional features (generated with lower downsampling ratios) when computing the affinity matrix with high-dimensional features (generated with higher downsampling ratios). Concretely, writing $x_r$ for the feature downsampled with ratio $r$ and taking two ratios $r_1 < r_2$, $x_{r_2}$ can be partitioned into $[x_{r_1}, x_{r_2 \setminus r_1}]$. The calculation of the affinity matrix using $x_{r_2}$ can be decomposed as
$$ x_{r_2} x_{r_2}^{T} = [x_{r_1}, x_{r_2 \setminus r_1}]\, [x_{r_1}, x_{r_2 \setminus r_1}]^{T} = x_{r_1} x_{r_1}^{T} + x_{r_2 \setminus r_1} x_{r_2 \setminus r_1}^{T}, \tag{14} $$
where $x_{r_1} x_{r_1}^{T}$ is the reusable affinity matrix computed with a smaller downsampling ratio (recall that $r_1 < r_2$). This feature reusing paradigm can largely reduce the search overhead, since computing affinity matrices with more choices of downsampling ratios does not incur any additional computation cost. The process of feature reusing is illustrated in Figure 4.
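The reuse of affinity matrices rests on a block identity: if a feature matrix is split column-wise into two parts, its Gram matrix is the sum of the Gram matrices of the parts. A minimal NumPy sketch, assuming that each candidate ratio keeps a prefix of the channels (the function name and sizes are hypothetical):

```python
import numpy as np

def incremental_affinities(x, channel_counts):
    """Return {k: x[:, :k] @ x[:, :k].T} for each k, computing every
    channel block only once and accumulating the partial sums."""
    affinities = {}
    running = np.zeros((x.shape[0], x.shape[0]))
    start = 0
    for k in sorted(channel_counts):
        block = x[:, start:k]                 # channels not yet accounted for
        running = running + block @ block.T   # reuse the previous affinity
        affinities[k] = running
        start = k
    return affinities

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))             # 16 positions, 32 channels
counts = [8, 16, 32]
reused = incremental_affinities(x, counts)
direct = {k: x[:, :k] @ x[:, :k].T for k in counts}
print(all(np.allclose(reused[k], direct[k]) for k in counts))
```

Each channel enters exactly one block product, so evaluating all candidates costs the same as evaluating the largest one alone.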
Searching process. We integrate our proposed search algorithm with Single-path NAS  and jointly search basic architectural parameters (following MNasNet ) along with the insert locations and downsampling schemes of LightNL blocks. We search downsampling ratios along both spatial and channel dimensions to achieve better compactness. To learn efficient deep learning models, the overall objective function is to minimize both standard classification loss and the model’s computation complexity which is related to both the insert locations and the compactness of LightNL blocks:
where w denotes model weights and t denotes architectural parameters which can be grouped in two categories: one is from LightNL block including the insert positions and downsampling ratios while the other follows MNasNet  including kernel size, number of channels, etc. is the cross-entropy loss and is the computation (i.e., FLOPs) cost. We use gradient descent to optimize the above objective function in an end-to-end manner.
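A cost-aware objective of this kind can be sketched in plain Python. The exact form of the objective is not recoverable from the text above, so everything specific here is an assumption: the cost term is weighted by a hypothetical coefficient `lam`, enters through a logarithm, and the model cost is a base cost plus the cost of each LightNL block multiplied by its (relaxed) gate value.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]

def model_flops(base_flops, block_flops, gates):
    """Base network cost plus the cost of every block that is switched on."""
    return base_flops + sum(g * f for g, f in zip(gates, block_flops))

def cost_aware_loss(logits, label, base_flops, block_flops, gates, lam=0.1):
    """Classification loss plus a weighted log-cost penalty (assumed form)."""
    cost = model_flops(base_flops, block_flops, gates)
    return cross_entropy(logits, label) + lam * math.log(cost)

logits = [2.0, 0.5, -1.0]
loss_all_on = cost_aware_loss(logits, 0, 300.0, [10.0, 5.0], [1.0, 1.0])
loss_one_on = cost_aware_loss(logits, 0, 300.0, [10.0, 5.0], [1.0, 0.0])
print(loss_all_on, loss_one_on)
```

With identical predictions, the configuration that switches off a block has the lower loss, which is the pressure that trades accuracy against computation during the search.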
We first demonstrate the efficacy and efficiency of LightNL by manually inserting it into lightweight models in Section 4.1. Then we apply the proposed search algorithm to the LightNL blocks in Section 4.2. The evaluation and comparison with state-of-the-art methods are done on ImageNet classification .
Models. Our experiments are based on MobileNetV2 1.0. We insert LightNL blocks after the second point-wise convolution layer in every MobileNetV2 block. We use channels to compute the affinity matrix for the sake of low computation cost. Also, if the feature map is larger than , we downsample it along the spatial axis with a stride of . We call the transformed model MobileNetV2-LightNL for short. We compare the two models with different depth multipliers, including , , and .
Training setup. Following the training schedule in MNasNet , we train the models using the synchronous training setup on Tesla-V100-SXM2-16GB GPUs. We use an initial learning rate of , and a batch size of (128 images per GPU). The learning rate linearly increases to in the first epochs and then is decayed by every epochs. We use a dropout of , a weight decay of and Inception image preprocessing  of size . Finally, we use exponential moving average on model weights with a momentum of . All batch normalization layers use a momentum of .
ImageNet classification results. We compare the results between the original MobileNetV2 and MobileNetV2-LightNL in Figure 5. We observe consistent performance gain even without tuning the hyper-parameters of LightNL blocks for models with different depth multipliers. For example, when the depth multiplier is , the original MobileNetV2 model achieves a top-1 accuracy of with M FLOPs, while our MobileNetV2-LightNL achieves with M FLOPs. According to Figure 5, it is unlikely to boost the performance of the MobileNetV2 model to the comparable performance by simply increasing the width to get a M FLOPs model. When the depth multiplier is , LightNL blocks bring a performance gain of with a marginal increase in FLOPs (M).
|Non-local Module||FLOPs /||Acc ()|
|Wang et al. ||Wang et al. ||+6.2G||75.2|
|Levi et al. ||+146M||75.2|
|Zhu et al. ||+107M||-|
|Eqn. (3)||Wang et al. ||+119M||75.2|
|Eqn. (9)||Eqn. (8)||+15M||75.0|
Ablation study. To diagnose the proposed LightNL block, we present a step-by-step ablation study in Table 1. As shown in the table, every modification preserves the model performance while reducing the computation cost. Compared with the baseline model, the proposed LightNL block improves ImageNet top-1 accuracy by 1.6% (from 73.4% to 75.0%) while adding only 15M FLOPs, about 5% of the total FLOPs of MobileNetV2. Compared with the standard NL block, the proposed LightNL block is about 400x computationally cheaper (6.2G vs. 15M) with comparable performance (75.2% vs. 75.0%). Compared with Levi et al., which optimizes the matrix multiplication with the associative law, the proposed LightNL block is still about 10x computationally cheaper (146M vs. 15M). Compared with a very recent work by Zhu et al., which leverages pyramid pooling to reduce the complexity, LightNL is around 7x computationally cheaper (107M vs. 15M).
CAM visualization. To illustrate the efficacy of LightNL, Figure 6 compares the class activation maps of the original MobileNetV2 and MobileNetV2-LightNL. We see that LightNL helps the model focus on more relevant regions while being computationally much cheaper than its conventional counterparts, as analyzed above. For example, at the middle top of Figure 6, the model without LightNL blocks focuses on only a part of the sewing machine. When LightNL is applied, the model can "see" the whole machine, leading to more accurate and robust predictions.
PASCAL VOC segmentation results. To demonstrate the generalization ability of our method, we compare the performance of MobileNetV2 and MobileNetV2-LightNL on the PASCAL VOC 2012 semantic segmentation dataset. Following Chen et al., we use the classification model as a drop-in replacement for the backbone feature extractor in DeepLabv3. It is followed by an Atrous Spatial Pyramid Pooling (ASPP) module with three convolutions with different atrous rates. The modified architectures have nearly the same computation cost as the backbone models because LightNL blocks are cheap. All models are initialized with ImageNet pre-trained weights and then fine-tuned with the same training protocol. The focus of this part is to assess the efficacy of the proposed LightNL while keeping other factors fixed, so we do not adopt more complex training techniques such as multi-scale and left-right flipped inputs, which may lead to better performance. The results are shown in Table 2: LightNL blocks improve mIoU with only a minor increase in FLOPs. The results indicate that the proposed LightNL blocks are well suited to other tasks such as semantic segmentation.
|Model||#Params||FLOPs||Top-1 Acc (%)||Top-5 Acc (%)|
|MBV2 (our impl.)||3.4M||301M||73.4||91.4|
|Proxyless (GPU)||-||465M||75.1||92.5|
|SinglePath (our impl.)||4.4M||334M||74.7||92.2|
|MBV3-L (1.25x)||7.5M||356M||76.6||-|
We apply the proposed neural architecture search algorithm to search for an optimal configuration of LightNL blocks. Specifically, we have five LightNL candidates for each potential insert location, i.e., sampling or channels to compute affinity matrix, sampling along spatial dimensions with stride or , inserting a LightNL block at the current position or not. Note that it is easy to enlarge the search space by including other LightNL blocks with more hyper-parameters. In addition, similar to recent work [33, 40, 3, 31], we also search for optimal kernel sizes, optimal expansion ratios and optimal SE ratios with MobileNetV2 block  as the building block.
We directly search on the ImageNet training set and use a computation cost loss and the cross-entropy loss as guidance, both of which are differentiable thanks to the relaxations of the indicator functions during the back-propagation process. It takes epochs (about GPU hours) for the search process to converge.
Performance on classification. We obtain two models using the proposed neural architecture search algorithm; we denote the large one as AutoNL-L and the small one as AutoNL-S in Table 3. The architecture of AutoNL-L is presented in Figure 7.
Table 3 shows that AutoNL outperforms all the latest mobile CNNs. Comparing to the handcrafted models, AutoNL-S improves the top-1 accuracy by over MobileNetV2  and over ShuffleNetV2  while saving about FLOPs. Besides, AutoNL achieves better results than the latest models from NAS approaches. For example, compared to EfficientNet-B0, AutoNL-L improves the top-1 accuracy by while saving about FLOPs. Our models also achieve better performance than the latest MobileNetV3 , which is developed with several manual optimizations in addition to architecture search.
AutoNL-L also surpasses the state-of-the-art NL method (i.e., AA-MnasNet-A1) by with comparable FLOPs. Even AutoNL-S improves accuracy by while saving FLOPs. We also compare with MixNet, which is a very recent state-of-the-art model under mobile settings, both AutoNL-L and AutoNL-S achieve improvement with comparable FLOPs but with much less search time ( GPU hours vs. GPU hours , faster).
We also search for models under different combinations of input resolutions and channel sizes under extremely low FLOPs. The results are summarized in Figure 8. AutoNL achieves consistent improvement over MobileNetV2, FBNet, and MNasNet. For example, when the input resolution is and the depth multiplier is , our model achieves accuracy, outperforming MobileNetV2 by and FBNet by .
As an important building block for various vision applications, NL blocks under mobile settings remain underexplored due to their heavy computation overhead. To the best of our knowledge, AutoNL is the first method to explore the usage of NL blocks for general mobile networks. Specifically, we design a LightNL block to enable highly efficient context modeling in mobile settings. We then propose a neural architecture search algorithm to optimize the configuration of LightNL blocks. Our method significantly outperforms prior art with 77.7% top-1 accuracy on ImageNet under a typical mobile setting (350M FLOPs).
Acknowledgements. This work was partially supported by ONR N00014-15-1-2356.
Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Regularized evolution for image classifier architecture search. In AAAI 2019, pp. 4780–4789.
Learning deep features for discriminative localization. In CVPR, pp. 2921–2929.
International Workshop on Machine Learning in Medical Imaging, pp. 461–469.