Rethinking Bottleneck Structure for Efficient Mobile Network Design

07/05/2020 ∙ by Zhou Daquan, et al. ∙ National University of Singapore

The inverted residual block has recently dominated architecture design for mobile networks. It changes the classic residual bottleneck by introducing two design rules: learning inverted residuals and using linear bottlenecks. In this paper, we rethink the necessity of such design changes and find that they may bring risks of information loss and gradient confusion. We thus propose to flip the structure and present a novel bottleneck design, called the sandglass block, that performs identity mapping and spatial transformation at higher dimensions and thus alleviates information loss and gradient confusion effectively. Extensive experiments demonstrate that, contrary to common belief, such a bottleneck structure is more beneficial than the inverted one for mobile networks. In ImageNet classification, by simply replacing the inverted residual block with our sandglass block without increasing parameters and computation, the classification accuracy can be improved by more than 1.7% over MobileNetV2. On the Pascal VOC 2007 test set, we also observe a 0.9% mAP improvement in object detection. We further verify the effectiveness of the sandglass block by adding it into the search space of the neural architecture search method DARTS. With 25% fewer parameters, the classification accuracy is improved by 0.13% over models searched from the vanilla space.




1 Introduction

Method #Param. M-adds Acc.
MobileNetV2-1.4 6.9M 690M 74.9%
MobileNeXt-1.4 6.1M 590M 76.1%
MobileNetV2-1.0 3.5M 300M 72.3%
MobileNeXt-1.0 3.4M 300M 74.0%
MobileNetV2-0.75 2.6M 150M 69.9%
MobileNeXt-0.75 2.5M 210M 72.0%
MobileNetV2-0.5 1.7M 97M 65.4%
MobileNeXt-0.5 1.8M 110M 67.7%
Figure 1: Top-1 classification accuracy comparisons between the proposed MobileNeXt and MobileNetV2 [sandler2018mobilenetv2]. We use different width multipliers to trade off model complexity against accuracy. Here, four widely used multipliers are chosen: 0.5, 0.75, 1.0, and 1.4. As can be seen, under each width multiplier, our MobileNeXt surpasses the MobileNetV2 baseline by a large margin, especially for the models with fewer learnable parameters.

A common belief behind the design principles of most popular light-weight models (either manually designed or automatically searched) [sandler2018mobilenetv2, ma2018shufflenet, tan2019mnasnet, tan2019efficientnet] is to adopt the inverted residual block [sandler2018mobilenetv2]. Compared to the classic residual bottleneck block [he2016deep, he2016identity], this block shifts the identity mapping from high-dimensional representations to low-dimensional ones (i.e., the bottlenecks). However, connecting identity mappings between thin bottlenecks inevitably leads to information loss, since the residual representations are compressed as shown in Figure 2(b). Moreover, it also weakens the propagation of gradients across layers, due to gradient confusion arising from the narrowed feature dimensions, and hence affects training convergence and model performance [sankararaman2019impact]. Therefore, despite the wide use of the inverted residual block, how to design residual blocks for mobile devices remains well worth studying.

In this paper, in view of the above concerns, we rethink the rationality of shifting from the classic bottleneck structure (Figure 2(a)) to the popular inverted residual block (Figure 2(b)) in developing mobile networks. In particular, we consider the following three fundamental questions. (i) What are the effects if we position the identity mapping (i.e., shortcuts) at the high-dimensional representations, as done in the classic bottleneck structure? (ii) While the linear activation can reduce information loss, should it only be applied to the bottlenecks? (iii) The previous questions remind us of the classic bottleneck structure, which suffers from high computational complexity. This cost can be reduced by replacing the dense spatial convolutions with depthwise ones, but should the depthwise convolution still be placed in the low-dimensional bottleneck, as is conventional?

Motivated by the above questions, we present and evaluate a new bottleneck design, termed the sandglass block. Unlike the inverted residual block that builds shortcuts between linear bottlenecks, our sandglass block puts shortcut connections between linear high-dimensional representations, as shown in Figure 2(c). Such a structure preserves more information delivered between blocks compared to the inverted residual block and propagates more gradients backward to better optimize network training because of the high-dimensional residuals [sankararaman2019impact]. Furthermore, to learn more expressive spatial representations, instead of putting the spatial convolutions in the bottleneck with compressed channels, we propose to apply them in the expanded high-dimensional feature space, which we find is an effective way of improving model performance. In addition, we maintain the channel reduction and expansion process with pointwise convolutions to reduce computational cost. This makes our block quite different from the inverted residual block but more similar to the classic residual bottleneck.

We stack the sandglass blocks in a modularized way to build the proposed MobileNeXt. Our network achieves more than 1.7% top-1 classification accuracy improvement over MobileNetV2 on ImageNet with slightly less computation and a comparable number of parameters, as shown in Figure 1. When applying the sandglass block to the EfficientNet topology in place of the inverted residual block, the resulting model surpasses the previous state-of-the-art by 0.5% with a comparable amount of computation but a 20% parameter reduction. In object detection, when taking SSDLite [liu2016ssd, sandler2018mobilenetv2] as the object detector, using our MobileNeXt as the backbone gains 0.9% in mAP on the Pascal VOC 2007 test set over MobileNetV2. More interestingly, we also experimentally find that the proposed sandglass block can be used to enrich the search space of neural architecture search algorithms [liu2018darts]. By adding the sandglass block into the search space as a ‘super’ operator, without changing the search algorithm, the resultant model improves classification accuracy by 0.13% with 25% fewer parameters compared to models searched from the vanilla space.

Figure 2: Conceptual diagram of different residual bottleneck blocks. (a) Classic residual block with bottleneck structure [He2016]. (b) Inverted residual block [sandler2018mobilenetv2]. (c) Our proposed sandglass block. We use thickness of each block to represent the corresponding relative number of channels. As can be seen, compared to the inverted residual block, the proposed residual block reverses the thought of building shortcuts between bottlenecks and adds depthwise convolutions (detached blocks) at both ends of the residual path, both of which are found crucial for performance improvement.

In summary, we make the following contributions in this paper:

  • Our results advocate a rethinking of the bottleneck structure for mobile network design. It seems that the inverted residuals are not so advantageous over the bottleneck structure as commonly believed.

  • Our study reveals that building shortcut connections along the higher-dimensional feature space can improve model performance. Moreover, depthwise convolutions should be conducted in the high-dimensional space to learn more expressive features, and learning linear residuals is also crucial for the bottleneck structure.

  • Based on our study, we propose a novel sandglass block, which substantially extends the classic bottleneck structure. We experimentally demonstrate that this structure is more suitable for mobile applications in terms of both accuracy and efficiency and can be used as ‘super’ operators in architecture search algorithms for better architecture generation.

2 Related Work

Modern deep neural networks are mostly built by stacking building blocks, which are designed based on either the classic residual block with bottleneck structure [he2016deep] or the inverted residual block [sandler2018mobilenetv2]. In this section, we categorize all related networks based on the above two types of building blocks and briefly describe them below.

Classic residual bottleneck blocks The bottleneck structure was first introduced in ResNet [he2016deep]. A typical bottleneck structure consists of three convolutional layers: a $1\times 1$ convolution for channel reduction, a $3\times 3$ convolution for spatial feature extraction, and another $1\times 1$ convolution for channel expansion. A residual network is often constructed by stacking a sequence of such residual blocks. The bottleneck structure was further developed in later works by widening the channels in each convolutional layer [zagoruyko2016wide], applying group convolutions to the middle bottleneck convolution for aggregating richer feature representations [xie2017aggregated], or introducing attention-based modules to explicitly model inter-dependencies between channels [hu2018squeeze, li2019selective]. There are also other works [chen2017dual, touvron2019fixing] combining residual blocks with dense connections to boost the performance. However, in spite of its success in heavy-weight network design, the classic bottleneck is rarely used in light-weight networks due to its model complexity. Our work demonstrates that by reasonably adjusting the residual block, this kind of classic bottleneck structure is also suitable for light-weight networks and can yield state-of-the-art results.

Inverted residual blocks The inverted residual block, which was first introduced in MobileNetV2 [sandler2018mobilenetv2], reverses the idea of the classic bottleneck structure and connects shortcuts between linear bottlenecks. It largely improves performance and optimizes the model complexity compared to the classic MobileNet [howard2017mobilenets], which is composed of a sequence of depthwise separable convolutions. Because of its high efficiency, the inverted residual block has been widely adopted in later mobile network architectures. ShuffleNetV2 [ma2018shufflenet] inserts a channel split module before the inverted residual block and adds another channel shuffle module after it. In HBONet [li2019hbonet], down-sampling operations are introduced into inverted residual blocks for modeling richer spatial information. MobileNetV3 [howard2019searching] proposes to search for optimal activation functions and the expansion rate of inverted residual blocks at each stage. More recently, MixNet [tan2019mixconv] proposes to search for optimal kernel sizes of the depthwise separable convolutions in the inverted residual block. EfficientNet [tan2019efficientnet] is also based on the inverted residual block, but it uses a scaling method to control the network weight in terms of input resolution, network depth, and network width. Different from all the above approaches, our work advances the standard bottleneck structure and demonstrates the superiority of our building block over the inverted residual block in mobile settings.

Model compression and neural architecture search Model compression algorithms are effective for removing redundant parameters from neural networks; examples include network pruning [liu2017learning, han2015deep, radu2019performance, caron2020pruning], quantization [hubara2017quantized, choukroun2019low], factorization [zhou2019tensor, jaderberg2014speeding], and knowledge distillation [hinton2015distilling]. Although these methods produce efficient networks, the performance of the compressed networks is still closely related to the original networks’ architectures. Thus, designing more efficient network architectures is essential for yielding efficient models. Neural architecture search achieves this by automatically searching for efficient network architectures [tan2019mnasnet, cai2018proxylessnas, guo2019single]. However, the search space requires human expertise, and the performance of the searched networks is largely dependent upon the designed search space, as pointed out in [ying2019bench, dong2020bench]. In this paper, we show that our proposed building block is complementary to existing search space design principles and can further improve the performance of searched networks if added to existing search spaces.

3 Method

In this section, we first review some preliminaries about the bottleneck structure widely used in previous residual networks and then describe our proposed sandglass block and network architecture.

3.1 Preliminaries

Residual block with bottleneck structure The classic residual block with bottleneck structure [he2016deep], as shown in Figure 2(a), consists of two $1\times 1$ convolution layers for channel reduction and expansion respectively and one $3\times 3$ convolution layer between them for spatial information encoding. In spite of its success in heavy-weight network design [he2016deep], this conventional bottleneck structure is not suitable for building light-weight neural networks because of the large number of parameters and the computation cost of the standard $3\times 3$ convolutional layer.

Depthwise separable convolutions To reduce computational cost and make the network more efficient, depthwise separable convolutions [chollet2017xception, howard2017mobilenets] were developed to replace the standard one. As demonstrated in [chollet2017xception], a convolution with a $k \times k \times M \times N$ weight tensor, where $k \times k$ is the kernel size and $M$ and $N$ are the numbers of input and output channels respectively, can be factorized into two convolutions. The first is an $M$-channel $k \times k$ depthwise (a.k.a. channel-wise) convolution that learns the spatial correlations among locations within each channel separately. The second is a pointwise convolution that learns to linearly combine channels to produce new features. As the combination of a pointwise convolution and a depthwise convolution has significantly fewer parameters and computations, using depthwise separable convolutions in basic building blocks can remarkably reduce the parameters and computational cost. Our proposed architecture also adopts such separable convolutions.
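To make the saving concrete, the following minimal sketch counts weights for a standard convolution and for its depthwise separable factorization, with kernel size k, m input channels and n output channels. Biases and normalization parameters are ignored, and the function names and the example sizes are illustrative choices, not taken from the paper.

```python
def standard_conv_params(k: int, m: int, n: int) -> int:
    """Weights in a standard k x k convolution with m input and n output channels."""
    return k * k * m * n


def separable_conv_params(k: int, m: int, n: int) -> int:
    """Weights in an m-channel k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * m  # one k x k filter per input channel
    pointwise = m * n      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    k, m, n = 3, 64, 128  # illustrative sizes only
    std = standard_conv_params(k, m, n)
    sep = separable_conv_params(k, m, n)
    print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.3f}")
```

The ratio of the two counts is 1/n + 1/k², so for a 3 x 3 kernel and a reasonably wide layer the separable form needs roughly one ninth of the weights.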

Inverted residual block The inverted residual block is specifically tailored for mobile devices, especially those with limited computational resource budget. More specifically, unlike the classic bottleneck structure as shown in Figure 2(a), to save computations, it takes as input a low-dimensional compressed tensor and expands it to a higher dimensional one by a pointwise convolution. Then it applies depthwise convolution for spatial context encoding, followed by another pointwise convolution to generate a low-dimensional feature tensor as input to the next block. The inverted residual block presents two distinct architecture designs for gaining efficiency without suffering too much performance drop: the shortcut connection is put between the low-dimensional bottlenecks if necessary (as shown in Figure 2(b)); and linear bottleneck is adopted.

Despite good performance [sandler2018mobilenetv2], in inverted residual blocks, feature maps encoded by the intermediate expansion layer should be first projected to low-dimensional ones, which may not preserve enough useful information due to channel compression. Moreover, recent studies have unveiled that wider architecture is more favorable for alleviating gradient confusion [sankararaman2019impact] and hence can improve network performance. Putting shortcut connections between bottlenecks may prevent the gradients from top layers from being successfully propagated to bottom layers during model training because of the low-dimensionality of representations between adjacent inverted residual blocks.

3.2 Sandglass Block

In view of the aforementioned limitations of the inverted residual block, we rethink its design rules and present a sandglass block that can tackle the above issues by flipping the thought of inverted residuals.

Our design principle is mainly based on the following insights: (i) To preserve more information from the bottom layers when transiting to the top layers and to facilitate gradient propagation across layers, the shortcuts should be positioned to connect high-dimensional representations. (ii) Depthwise convolutions with small kernel size (e.g., $3\times 3$) are light-weight, so we can appropriately apply a couple of depthwise convolutions onto the higher-dimensional features such that richer spatial information can be encoded to generate more expressive representations. We elaborate on these design considerations in the following.

Rethinking the positions of expansion and reduction layers Originally, the inverted residual block performs expansion first and then reduction. Based on the aforementioned design principle, to make sure the shortcuts connect high-dimensional representations, we propose to first reverse the order of the two pointwise convolutions. Let $F \in \mathbb{R}^{D_f \times D_f \times M}$ be the input tensor and $G \in \mathbb{R}^{D_f \times D_f \times M}$ the output tensor of a building block (for simplicity, we assume that the input and output of the building block share the same number of channels and resolution). We do not consider the depthwise convolution and activation layers at this moment. The formulation of our building block can be written as follows:

$$G = \phi_e(\phi_r(F)) + F,$$

where $\phi_e$ and $\phi_r$ denote the two pointwise convolutions for channel expansion and reduction, respectively. In this way, we can keep the bottleneck in the middle of the residual path for saving parameters and computation cost. More importantly, this allows us to use the shortcut connection to connect representations with a large number of channels instead of the bottleneck ones.

High-dimensional shortcuts Instead of putting the shortcut between bottlenecks, we put the shortcuts between higher-dimensional representations as shown in Figure 3(b). The ‘wider’ shortcut delivers more information from the input to the output compared to the inverted residual block and allows more gradients to propagate across multiple layers.

Figure 3: Different types of residual blocks. (a) Classic bottleneck structure with depthwise spatial convolutions. (b) Our proposed sandglass block with bottleneck structure. To encode more expressive spatial information, instead of adding depthwise convolutions in the bottleneck, we propose to move them to the ends of the residual path, which have high-dimensional representations.

Learning expressive spatial features Pointwise convolutions can be used to encode the inter-channel information but fail to capture spatial information. In our building block, we follow previous mobile networks and adopt depthwise spatial convolutions to encode spatial information. The inverted residual block adds depthwise convolutions between pointwise convolutions to learn expressive spatial context information. However, in our case, the position between two pointwise convolutions is the bottleneck. Directly adding depthwise convolutions in the bottleneck as shown in Figure 3(a) makes them have fewer filters and thus, less spatial information can be encoded. We experimentally find that this structure largely degrades the performance compared to MobileNetV2 by more than 1%.

Regarding the positions of the pointwise convolutions, instead of directly putting the depthwise convolution between the two pointwise convolutions, we propose to add depthwise convolutions at the ends of the residual path as shown in Figure 3(b). Mathematically, with $F$ and $G$ denoting the input and output tensors of the block, our building block can be formulated as follows:

$$\hat{G} = \phi_{1,p}(\phi_{1,d}(F)), \qquad G = \phi_{2,d}(\phi_{2,p}(\hat{G})) + F,$$

where $\phi_{i,p}$ and $\phi_{i,d}$ are the $i$-th pointwise convolution and depthwise convolution, respectively. In this way, since both depthwise convolutions are conducted in high-dimensional spaces, richer feature representations can be extracted compared to the inverted residual block. We will give more explanations on the advantages of such a design.

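Input dimension | Operator type | Output dimension
$D_f \times D_f \times M$ | $3\times 3$ Dwise conv, ReLU6 | $D_f \times D_f \times M$
$D_f \times D_f \times M$ | $1\times 1$ conv, linear | $D_f \times D_f \times \frac{M}{t}$
$D_f \times D_f \times \frac{M}{t}$ | $1\times 1$ conv, ReLU6 | $D_f \times D_f \times N$
$D_f \times D_f \times N$ | $3\times 3$ Dwise conv, linear, stride = $s$ | $\frac{D_f}{s} \times \frac{D_f}{s} \times N$

Table 1: Basic operator description of the proposed sandglass block. Here, ‘t’ and ‘s’ denote the channel reduction ratio and the stride, respectively; $D_f$ is the spatial size of the input, and $M$ and $N$ are the numbers of input and output channels.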

Activation layers It has been demonstrated in [sandler2018mobilenetv2] that using linear bottlenecks can help prevent the feature values from being zeroed and hence reduce information loss. Following this suggestion, we do not add any activation layer after the reduction layer (the first pointwise convolutional layer). It should also be noted that though the output of our building block is high-dimensional, we empirically find adding an activation layer after the last convolution can negatively influence the classification performance. Therefore, activation layers are only added after the first depthwise convolutional layer and the last pointwise convolutional layer. We will give more explanations in our experiments on this.

Block structure Taking the above considerations into account, we design a novel residual bottleneck block. The structure details are given in Table 1, and the diagram can also be found in Figure 3(b). Note that when the input and output have different channel numbers, we do not add the shortcut connection. For depthwise convolutions, we always use kernel size $3\times 3$ as done in other works [he2016deep, sandler2018mobilenetv2]. We also utilize batch normalization and ReLU6 activation if necessary during training.
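As a rough illustration of the block structure, the sketch below traces the output shape and weight count of the four operators listed in Table 1 (depthwise, pointwise reduction, pointwise expansion, depthwise). It is a bookkeeping aid under stated assumptions rather than the authors' implementation: the helper names are hypothetical, depthwise convolutions are taken to be 3 x 3 with padding 1, batch-norm parameters and biases are ignored, and the shortcut is assumed to require stride 1 in addition to matching channel numbers so that the shapes agree.

```python
from typing import List, Tuple

# (operator description, output shape as (height, width, channels), weight count)
Layer = Tuple[str, Tuple[int, int, int], int]


def sandglass_layers(size: int, m: int, n: int, t: int, s: int) -> List[Layer]:
    """Trace the four operators of a sandglass block.

    size: input height/width, m: input channels, n: output channels,
    t: channel reduction ratio, s: stride of the last depthwise convolution.
    """
    mid = m // t               # bottleneck width after channel reduction
    out = (size + s - 1) // s  # spatial size after a stride-s 3x3 conv with padding 1
    return [
        ("3x3 dwise conv, ReLU6", (size, size, m), 9 * m),
        ("1x1 conv, linear", (size, size, mid), m * mid),
        ("1x1 conv, ReLU6", (size, size, n), mid * n),
        ("3x3 dwise conv, linear", (out, out, n), 9 * n),
    ]


def has_shortcut(m: int, n: int, s: int) -> bool:
    """Shortcut only when input and output tensors have the same shape (assumption: s == 1)."""
    return m == n and s == 1


if __name__ == "__main__":
    # Illustrative sizes only, not taken from Table 2.
    for name, shape, weights in sandglass_layers(size=56, m=144, n=144, t=6, s=1):
        print(f"{name:24s} -> {shape}, weights: {weights}")
    print("shortcut:", has_shortcut(144, 144, 1))
```

Note how both depthwise convolutions operate on the wide tensors, while the two pointwise convolutions carry most of the weights through the narrow middle.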

Relation to the inverted and classic residual blocks Albeit both architectures exploit the bottlenecks, the design intuition and the internal structure are quite different. Our goal is to demonstrate that the idea of building shortcut connections between high-dimensional representations as in the classic bottleneck structure [he2016deep] is suitable for light-weight networks as well. To the best of our knowledge, this is the first work that attempts to investigate the advantages of the classic bottleneck structure over the inverted residual block for efficient network design. On the other hand, we also attempt to demonstrate that adding depthwise convolutions to the ends of the residual path in our structure can encourage the network to learn more expressive spatial information and hence yield better performance. In our experiment section, we will show more numerical results and provide detailed analysis.

3.3 MobileNeXt Architecture

Based on our sandglass block, we develop a modularized architecture, MobileNeXt. At the beginning of our network, there is a convolutional layer with 32 output channels. After that, our sandglass blocks are stacked together. Detailed information about the network architecture can be found in Table 2. Following [sandler2018mobilenetv2], the expansion ratio used in our network is set to 6 by default. The output of the last building block is followed by a global average pooling layer to transform 2D feature maps to 1D feature vectors. A fully-connected layer is finally added to predict the final score for each category.

Identity tensor multiplier The shortcut connections in residual blocks have been shown to be essential for improving the capability of propagating gradients across layers [he2016deep, sandler2018mobilenetv2]. According to our experiments, we find that there is no need to keep the whole identity tensor to combine with the residual path. To make our network more friendly to mobile devices, we introduce a new hyper-parameter, the identity tensor multiplier, denoted by $\alpha \in [0, 1]$, which controls what portion of the channels in the identity tensor is preserved. For convenience, let $\phi$ be the transformation function of the residual path in our block, $F$ and $G$ the input and output tensors of the block, and $M$ the number of channels. Originally, the formulation of our block can be written as $G = \phi(F) + F$. After applying the multiplier, our building block can be rewritten as

$$G_{1:\alpha M} = \phi(F)_{1:\alpha M} + F_{1:\alpha M}, \qquad G_{\alpha M:M} = \phi(F)_{\alpha M:M},$$

where the subscripts index the channel dimension.

The advantages of using the identity tensor multiplier are mainly two-fold. First, after reducing the multiplier, the number of element-wise additions in each building block can be reduced. As pointed out in [ma2018shufflenet], the element-wise addition is time consuming. Users can choose a lower identity tensor multiplier to yield better latency with nearly no performance drop. Second, the number of memory accesses can be reduced. One of the main factors that affect the model latency is the memory access cost (MAC). As the shortcut identity tensor is the output of the last building block, its recurrent nature hints at an opportunity to cache it on the chip in order to avoid excessive off-chip memory access. Therefore, reducing the channel dimension of the identity tensor can effectively encourage the processors to store it in the cache or other faster memory near the processors and hence improve the latency. We will give more details on how this multiplier affects the performance and model latency in the experiment section.
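The partial shortcut can be sketched on a single spatial position, where each tensor is simply a list with one value per channel. The function name and the argument name `alpha` (standing for the identity tensor multiplier) are illustrative, and truncating the number of kept channels to an integer is an assumption of this sketch.

```python
from typing import List


def partial_shortcut(residual: List[float], identity: List[float], alpha: float) -> List[float]:
    """Add the identity to the first alpha * M channels only; pass the rest through.

    residual: output of the residual path, one value per channel.
    identity: block input, one value per channel.
    alpha: identity tensor multiplier in [0, 1].
    """
    if len(residual) != len(identity):
        raise ValueError("residual and identity must have the same number of channels")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    keep = int(alpha * len(residual))  # channels that receive the shortcut (assumed rounding)
    return [r + x if c < keep else r
            for c, (r, x) in enumerate(zip(residual, identity))]


if __name__ == "__main__":
    out = partial_shortcut([1.0, 1.0, 1.0, 1.0], [10.0, 20.0, 30.0, 40.0], alpha=0.5)
    print(out)  # only the first half of the channels is summed with the identity
```

With a multiplier of 0.5 only half of the element-wise additions are performed, and only half of the identity tensor has to be kept available until the end of the block.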

No. t s b Operator
1 - 2 1 conv2d 3x3
2 2 2 1 sandglass block
3 6 1 1 sandglass block
4 6 2 3 sandglass block
5 6 2 3 sandglass block
6 6 1 4 sandglass block
7 6 2 4 sandglass block
8 6 1 2 sandglass block
9 6 1 1 sandglass block
10 - - 1 avgpool 7x7
11 - - 1 conv2d 1x1
Table 2: Architecture details of the proposed MobileNeXt. Each row denotes a sequence of building blocks, which is repeated ‘b’ times. The reduction ratio used in each building block is denoted by ‘t’. The stride of the first building block in each stage is given by ‘s’ and all the others use stride 1. Each convolutional layer is followed by a batch normalization layer and the kernel size for all spatial convolutions is set to $3\times 3$. We do not add identity mappings for blocks that have different input and output channels. We suppose there are $k$ categories in total.

4 Experiments

4.1 Experiment Setup

We adopt the PyTorch toolbox [paszke2019pytorch] to implement all our experiments. We use the standard SGD optimizer to train our models with both decay and momentum of 0.9, together with weight decay. We use the cosine learning schedule with an initial learning rate of 0.05. The batch size is set to 256 and four GPUs are used for training. Unless otherwise stated, we train all the models for 200 epochs and report results on ImageNet [krizhevsky2012imagenet] for classification and the Pascal VOC dataset [pascal-voc-2012] for object detection. We use distributed training with three epochs of warmup.
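One plausible reading of this schedule is sketched below: a per-epoch learning rate with a short warmup followed by cosine decay. The initial rate (0.05), the 200 epochs and the three warmup epochs come from the text; the linear shape of the warmup, the per-epoch (rather than per-iteration) update and the decay to zero are assumptions, and the function name is hypothetical.

```python
import math


def learning_rate(epoch: int, base_lr: float = 0.05,
                  total_epochs: int = 200, warmup_epochs: int = 3) -> float:
    """Learning rate for a zero-based epoch index under warmup plus cosine decay."""
    if epoch < warmup_epochs:
        # Assumed linear warmup: ramp up to base_lr over the warmup epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr towards zero over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 1, 2, 3, 50, 100, 199):
        print(epoch, round(learning_rate(epoch), 5))
```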

4.2 Comparisons with MobileNetV2

In this subsection, we extensively study the advantages of our MobileNeXt over MobileNetV2 under various settings. Besides comparing the performance of their full models (i.e., with a width multiplier of 1) for classification, we also compare their performance with other width multipliers and under quantization. This helps unveil the performance advantage of our model across the full spectrum of model architecture configurations.

Comparison under different width multipliers We use the width multiplier as a scaling factor to trade off model complexity against accuracy, as used in [howard2017mobilenets, sandler2018mobilenetv2, howard2019searching]. Here, we adopt five different multipliers, including 1.4, 1.0, 0.75, 0.5, and 0.35, to show the superiority of our network over MobileNetV2. As can be seen in Table 3, our networks with different multipliers all outperform MobileNetV2 with comparable numbers of learnable parameters and computational cost. The performance gain of our model over MobileNetV2 is especially high when the multiplier is small. This demonstrates that our model is more efficient since our model performance is much better at small sizes. (We also conduct latency measurements with TF-Lite on Pixel 4XL; the measured latencies for MobileNeXt and MobileNetV2 are 66 ms and 68 ms, respectively.)
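For readers unfamiliar with width multipliers, the sketch below shows how a layer's channel count could be scaled. The rounding to a multiple of 8 and the 10% guard follow a convention common in public MobileNetV2 implementations; whether MobileNeXt uses exactly this rule is an assumption, and the function name is hypothetical.

```python
def scale_channels(channels: int, width_mult: float, divisor: int = 8) -> int:
    """Scale a channel count by the width multiplier and round to a multiple of `divisor`."""
    scaled = channels * width_mult
    rounded = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    # Guard: do not let rounding shrink the layer by more than 10%.
    if rounded < 0.9 * scaled:
        rounded += divisor
    return rounded


if __name__ == "__main__":
    for mult in (1.4, 1.0, 0.75, 0.5, 0.35):
        print(mult, scale_channels(32, mult))
```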

No. Models Param. (M) MAdd(M) Top-1 Acc. (%)
1 MobileNetV2-1.40 6.9 690 74.9
2 MobileNetV2-1.00 3.5 300 72.3
3 MobileNetV2-0.75 2.6 150 69.9
4 MobileNetV2-0.50 2.0 97 65.4
5 MobileNetV2-0.35 1.7 59 60.3
6 MobileNeXt-1.40 6.1 590 76.1
7 MobileNeXt-1.00 3.4 300 74.0
8 MobileNeXt-0.75 2.5 210 72.0
9 MobileNeXt-0.50 2.1 110 67.7
10 MobileNeXt-0.35 1.8 80 64.7
Table 3: Comparisons with MobileNetV2 using different width multipliers with input resolution $224\times 224$. As can be seen, the smaller the multiplier, the larger the performance gain we achieve over MobileNetV2, with comparable latency (e.g., 210ms for both models with width multiplier 1.0) tested on Google Pixel 4XL under the PyTorch environment setting.
Model Precision (W/A) Method Top-1 Acc. (%)
MobileNetV2 INT8/INT8 Post Training Quant. 65.07
MobileNeXt INT8/INT8 Post Training Quant. 68.62
MobileNetV2 FP32/FP32 - 72.25
MobileNeXt FP32/FP32 - 74.02
Table 4: Performance of our proposed MobileNeXt and MobileNetV2 after post-training quantization. In the precision configurations, ‘W’ denotes the number of bits used to represent the weights of the model and ‘A’ denotes the number of bits used to represent the activations.

Comparison under post-training quantization Quantization algorithms are often used in real-world applications as a kind of effective compression tool with subtle performance loss. However, the performance of the quantized model is significantly affected by the original base model. We experimentally show that the MobileNeXt can achieve better performance than the MobileNetV2 when combined with the quantization algorithm. Here, we use a widely-used post-training linear quantization method introduced in [migacz2017nvidia]. We apply 8-bit quantization on both weights and activations as 8-bit is the most common scheme used on hardware platforms. The results are shown in Table 4. Without quantization, our network improves MobileNetV2 by more than 1.7% in terms of top-1 accuracy. When the parameters and activations are quantized to 8 bits, our network outperforms MobileNetV2 by 3.55% under the same quantization settings. The reasons for this large improvement are two-fold. First, compared to MobileNetV2, we move the shortcut in each building block from low-dimensional representations to high-dimensional ones. After quantization, more informative feature representations can be preserved. Second, using more depthwise spatial convolutions can help preserve more spatial information, which we believe is beneficial to the classification performance.
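To illustrate what 8-bit linear quantization does to a tensor, here is a minimal sketch of symmetric quantization with a max-abs scale. It is deliberately simpler than the calibrated method of [migacz2017nvidia] used in the experiments: no calibration data is involved, and the function names are illustrative.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single max-abs scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.25, 1.0]
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    print(codes, [round(r, 4) for r in restored])
```

The rounding error per value is bounded by half the scale, so tensors with a few large outliers lose more precision on their small entries.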

Method #Dwise convs Param. (M) M-Adds (M) Top-1 Acc. (%)
MobileNetV2 2 (middle) 3.6 340 73.02
MobileNeXt 2 (top, bottom) 3.5 300 74.02
Table 5: Performance of our proposed network and MobileNetV2 when increasing the number of spatial convolutions (Dwise convs) in each building block. Our MobileNeXt performs much better than the improved MobileNetV2 with fewer learnable parameters and lower computational cost.

Comparison with MobileNetV2 on structure As shown in Figure 3(b), our sandglass block contains two depthwise convolutions for encoding rich spatial context information. To demonstrate that the benefit of our model comes from our novel architecture rather than from one more depthwise convolution or a larger receptive field, in this experiment we compare with an improved version of MobileNetV2 with one more depthwise convolution inserted in the middle of each inverted residual block. The results are shown in Table 5. After adding one more depthwise convolution, the performance of MobileNetV2 increases to 73%, which is still far worse than ours (74%) even though it uses more learnable parameters and computation. This indicates that structurally our network does have an edge over MobileNetV2.

4.3 Comparison with State-of-the-Art Mobile Networks

To further verify the superiority of our proposed sandglass block over the inverted residual blocks, we add squeeze and excite modules into our MobileNeXt as done in [howard2019searching, tan2019efficientnet]. We do not apply any searching algorithms on the architecture design and data augmentation policy. We directly take the EfficientNet-b0 architecture [tan2019efficientnet] and replace the inverted residual block with the sandglass block, using the basic augmentation policy. As shown in Table 6, with a comparable amount of computation and a 20% parameter reduction, replacing the inverted residual block with the sandglass block results in a 0.4% top-1 classification accuracy improvement on the ImageNet-1k dataset.

4.4 Ablation Studies

In Sec. 4.2, we have shown the importance of connecting high-dimensional representations with shortcuts. In this subsection, we study how other model design choices contribute to the model performance and efficiency, including the effect of using wider transformation, the importance of learning linear residuals, and the role of identity tensor multiplier.

Importance of using wider transformation As described in Sec. 3, we apply spatial transformation and shortcut connections to high-dimensional representations. To demonstrate the importance of these operations, we follow the inverted residual block and instead use the shortcuts to connect the bottleneck representations. This change leads to an accuracy decrease of 1%, which indicates that applying shortcuts at wider dimensions is more beneficial.

Importance of linear residuals According to MobileNetV2 [sandler2018mobilenetv2], classification performance degrades when the linear bottleneck is replaced with a non-linear one because of information loss. From our experiment, we obtain a more general conclusion. We find that although the shortcuts connect high-dimensional representations in our model, adding a non-linear activation (ReLU6) to the last convolutional layer decreases the performance by nearly 1% compared to the setting using linear activations (no ReLU6). This indicates that learning linear residuals (i.e., adding no non-linear activation layer on top of the residual path) is essential for light-weight networks with shortcuts connecting either expansion layers or reduction layers.
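The two settings compared above can be sketched in a few lines. This is only an illustration of what changes between them, not the authors' analysis: the array `residual` is a random stand-in for the output of the last convolutional layer, and the helper names are hypothetical.

```python
import numpy as np

def relu6(x):
    """ReLU6 activation: clip values to the range [0, 6]."""
    return np.clip(x, 0.0, 6.0)

def block_output(x, residual, linear=True):
    """Shortcut plus residual branch.

    linear=True  -> residual is added as-is (linear residual).
    linear=False -> ReLU6 is applied to the residual before the addition.
    """
    return x + (residual if linear else relu6(residual))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # shortcut (identity) signal
residual = rng.normal(size=(4, 8))   # stand-in for the last conv layer's output

linear_out = block_output(x, residual, linear=True)
nonlinear_out = block_output(x, residual, linear=False)
```

With ReLU6 the branch can only contribute values in [0, 6], so it can never subtract from the shortcut signal; the linear setting has no such restriction.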

Models Param. (M) MAdd (M) Top-1 Acc. (%)
MobilenetV1-1.0[howard2017mobilenets] 4.2 575 70.6
ShuffleNetV2-1.5[ma2018shufflenet] 3.5 299 72.6
MobilenetV2-1.0[sandler2018mobilenetv2] 3.5 300 72.3
MnasNet-A1[tan2019mnasnet] 3.9 312 75.2
MobilenetV3-L-0.75[howard2019searching] 4.0 155 73.3
ProxylessNAS[cai2018proxylessnas] 4.1 320 74.6
FBNet-B[wu2019fbnet] 4.5 295 74.1
IGCV3-D[sun2018igcv3] 7.2 610 74.6
GhostNet-1.3[han2019ghostnet] 7.3 226 75.7
EfficientNet-b0[tan2019efficientnet] 5.3 390 76.3
MobileNeXt-1.0 3.4 300 74.02
MobileNeXt†-1.0 3.94 330 76.05
MobileNeXt†-1.1 - 420 -
Table 6: Comparisons with other state-of-the-art models. MobileNeXt denotes the model based on our proposed sandglass block, and MobileNeXt† denotes the models with the sandglass block and the SE module [hu2018squeeze] added for a fair comparison with other state-of-the-art models such as EfficientNet. We do not apply any searching algorithms to either the architecture design or the data augmentation policy.

Effect of identity tensor multiplier Here, we investigate how the identity tensor multiplier (Sec. 3.3) trades off model accuracy against latency. We use PyTorch to generate the model and run it on a Google Pixel 4XL. For each model, we measure the average inference time over 10 images as the final inference latency. As shown in Table 7, reducing the multiplier has only a slight impact on classification accuracy. When half of the identity representations are removed, the accuracy does not drop (74.09% vs. 74.02%) while the latency improves from 211 ms to 196 ms. With the multiplier used in the last row of Table 7, the accuracy decreases by 0.34%, from 74.02% to 73.68%, but the latency improves further to 188 ms. This indicates that such a hyper-parameter is useful for balancing model performance and latency.

No. Models Tensor multiplier Param. (M) Top-1 Acc. (%) Latency (ms)
1 MobileNeXt 1.0 3.4 74.02 211
2 MobileNeXt 3.4 74.09 196
3 MobileNeXt 3.4 73.91 195
4 MobileNeXt 3.4 73.68 188
Table 7: Model performance and latency comparisons with different identity tensor multipliers. As can be seen, the latency can be improved by using lower identity tensor multipliers with only negligible sacrifice on the classification accuracy.
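A minimal sketch of a shortcut controlled by an identity tensor multiplier is given below. It assumes that the identity is added only on the first fraction of channels and that the remaining channels carry the branch output alone; which channels are kept is an assumption made here for illustration, and `partial_shortcut` is a hypothetical helper name.

```python
import numpy as np

def partial_shortcut(x, branch_out, multiplier):
    """Add the identity only on the first `multiplier` fraction of channels.

    x, branch_out: arrays shaped (channels, height, width).
    multiplier:    fraction of channels that receive the shortcut, in (0, 1].
    """
    channels = x.shape[0]
    kept = int(round(channels * multiplier))
    out = branch_out.copy()
    out[:kept] += x[:kept]   # channels beyond `kept` get no identity term
    return out

x = np.ones((8, 2, 2))            # toy input feature map
branch_out = np.zeros((8, 2, 2))  # toy residual-branch output

full = partial_shortcut(x, branch_out, multiplier=1.0)
half = partial_shortcut(x, branch_out, multiplier=0.5)
```

With a multiplier of 0.5, only half of the channels take part in the element-wise addition, which is the operation the latency measurements in Table 7 are probing.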

4.5 Application for Object Detection

To compare the transfer capability of the proposed approach with that of MobileNetV2, in this subsection we use our classification models as pretrained backbones for object detection. We use both the proposed network and MobileNetV2 as feature extractors and report results on the Pascal VOC 2007 test set [everingham2015pascal] following [liu2016ssd], using SSDLite [sandler2018mobilenetv2, liu2016ssd]. Similar to [sandler2018mobilenetv2], the first and second layers of SSDLite are connected to the last pointwise convolution layers with output strides of 16 and 32, respectively. The rest of the SSDLite layers are attached on top of the last convolutional layer with an output stride of 32. During training, we use a batch size of 24 and all models are trained for 240,000 iterations. For more detailed settings, readers can refer to [sandler2018mobilenetv2, liu2016ssd].
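As a quick check of what the two output strides mean spatially, the sketch below computes the feature map side length for each. It assumes a 320×320 input, as the name SSDLite320 in Table 8 suggests, and assumes the stride divides the input exactly; actual sizes also depend on the padding used by the backbone.

```python
def feature_map_size(input_size, output_stride):
    """Side length of a feature map at a given output stride (exact division assumed)."""
    return input_size // output_stride

input_size = 320  # assumed from the detector name SSDLite320
sizes = {stride: feature_map_size(input_size, stride) for stride in (16, 32)}
# sizes == {16: 20, 32: 10}
```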

In Table 8, we show the results when different backbone networks are used. With nearly the same number of parameters and computation, SSDLite with our backbone improves on the one with MobileNetV2 by 0.9 mAP (72.6% vs. 71.7%). This demonstrates that the proposed network transfers better than MobileNetV2.

No. Method Backbone Param. (M) M-Adds (B) mAP (%)
1 SSD300 VGG [simonyan2014very] 36.1 35.2 77.2
2 SSDLite320 MobileNetV2 [sandler2018mobilenetv2] 4.3 0.8 71.7
3 SSDLite320 MobileNeXt 4.3 0.8 72.6
Table 8: Detection results on the Pascal VOC 2007 test set. As can be seen, using the same SSDLite320 detector, replacing the MobileNetV2 backbone with our network achieves better results in terms of mAP. Note that the multipliers of both MobileNetV2 and our network are set to 1.0.

4.6 Improving Architecture Search as Super-operators

It has been verified in previous subsections that our proposed sandglass block is more effective than the inverted residual block in both the classification task and the object detection task. From a holistic perspective, we can also regard a residual block as a ‘super’ operator with greater transformation power than a regular convolutional operator. To further investigate the superiority of the proposed sandglass block over the inverted residual block, we separately add each of them to the search space of the differentiable searching algorithm (DARTS) [liu2018darts], examine the network performance after architecture search, and report the corresponding results on the CIFAR-10 dataset. As shown in Table 9, by adding our sandglass block as a new operator to the DARTS search space without changing the cell structure, the resulting model achieves higher accuracy than the model found in the original DARTS search space, with about 25% fewer parameters. In contrast, the model searched with the inverted residual block added to the search space performs worse than the original. This demonstrates that our proposed sandglass block can generate more expressive representations than the inverted residual block and can also be used in architecture search algorithms as a kind of ‘super’ operator. For more details on the searched cell structure, please refer to our supplementary materials.

No. Search Space Test Error (%) Param. (M) Search Method #Operators
1 DARTS original 3.11 3.25 gradient based 7
2 DARTS + IR Block 3.26 3.29 gradient based 8
3 DARTS + sandglass block 2.98 2.45 gradient based 8
Table 9: Results produced by different network architectures searched by DARTS [liu2018darts]. For Lines 2 and 3, we separately add the inverted residual (IR) block and our sandglass block into the original search space of DARTS. We report results on CIFAR-10 dataset as in [liu2018darts].
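DARTS relaxes the discrete choice among candidate operators on each edge into a softmax-weighted mixture with learnable architecture parameters, so adding a block to the search space amounts to adding one more candidate to that mixture. The sketch below illustrates this with toy functions; the candidates and `sandglass_stand_in` are hypothetical placeholders, not the actual operators or the sandglass block.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def mixed_op(x, ops, alphas):
    """Softmax-weighted sum of all candidate operators applied to x."""
    weights = softmax(np.asarray(alphas, dtype=float))
    return sum(w * op(x) for w, op in zip(weights, ops))

# Toy stand-ins for the original candidates (real ones are convolutions, pooling, etc.).
base_ops = [
    lambda x: x,                 # skip connection
    lambda x: np.zeros_like(x),  # zero (None) operation
    lambda x: 2.0 * x,           # placeholder for a parametric operator
]
sandglass_stand_in = lambda x: x + 1.0  # placeholder, not the actual block

ops = base_ops + [sandglass_stand_in]   # search space with one more operator
alphas = np.zeros(len(ops))             # uniform architecture weights at the start
x = np.ones(3)
y = mixed_op(x, ops, alphas)            # each candidate contributes with weight 0.25
```

After the search, the operator with the largest architecture weight on each edge is kept, which is how a newly added candidate can end up in the final cell.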

5 Conclusions

In this paper, we analyze in depth the design rules and shortcomings of the inverted residual block. Based on this analysis, we propose to reverse the idea of adding shortcut connections between low-dimensional representations and present a novel building block, called the sandglass block, that connects high-dimensional representations instead. Furthermore, we break with the tradition of using a single spatial convolution in each residual block and emphasize the importance of using one more such convolution. Experiments on classification, object detection, and neural architecture search demonstrate the effectiveness of the proposed sandglass block and its potential to be used in more contexts.


Jiashi Feng was partially supported by MOE Tier 2 MOE2017-T2-2-151, NUS_ECRA_FY17_P08, AISG-100E-2019-035.


Appendix 0.A Variants of the Proposed Sandglass Block

In this section, we introduce and compare the variants of our proposed sandglass block, which are shown in Figures 4 (a-c). The corresponding results are listed in Table 10.

  1. The first variant (Figure 4(a)) is built by directly modifying the classic bottleneck structure [he2016deep], replacing the standard convolution with a depthwise convolution. From the result, we observe a performance drop of about 5% compared to our sandglass block. We argue this is mostly because the depthwise convolution is conducted in the bottleneck, in a low-dimensional feature space, and hence cannot capture enough spatial information, leading to much worse performance than our proposed sandglass block (Figure 4(d)).

  2. The second variant (Figure 4(b)) is derived from the first variant, but with another depthwise convolution added in the bottleneck. As can be seen, the top-1 accuracy improves by more than 1% compared to the structure shown in Figure 4(a). This indicates that encoding more spatial information indeed helps. The same effect can be observed by comparing Figure 4(b) and Figure 4(d) (70.11 vs. 74.02).

  3. The third variant (Figure 4(c)) is based on the original inverted residual block [sandler2018mobilenetv2]. We move the depthwise convolution from the high-dimensional feature space to the bottleneck positions with fewer feature channels. Compared with Figure 4(b), the variant in Figure 4(c) has a comparable number of learnable parameters and a higher computational cost but worse performance (69.26 vs. 70.11). This also indicates that building shortcuts between high-dimensional representations is more beneficial to network performance.

As shown in Table 10, our proposed sandglass block achieves much better results than all the three variants. The performance improvements can be explained by the two rules that we have presented in the main paper: (1) adding shortcut connections between high-dimensional representations, and (2) performing the depthwise convolution in high-dimensional feature space. Our experiments also indicate that bottleneck structure is suitable for mobile networks and it can work better than the inverted residual block.

Block structure #Dwise convs Param. (M) M-Adds (M) Top-1 Acc. (%)
MobileNetV2 1 3.5 300 72.3
Figure 4(a) 1 3.4 240 68.90
Figure 4(b) 2 3.4 250 70.11
Figure 4(c) 2 3.5 300 69.26
Figure 4(d) 2 3.5 300 74.02
Table 10: Performance of different variants of our proposed sandglass block shown in Figure 4.

No. Search Space Test Error (%) Param. (M) Search Method #Operators
1 DARTS original 3.11 3.25 gradient based 7
2 DARTS + IR Block 3.26 3.29 gradient based 7
3 DARTS + sandglass block 2.98 2.45 gradient based 7
Table 11: Results produced by different network architectures searched by DARTS [liu2018darts]. For Lines 2 and 3, we separately add the inverted residual (IR) block and our sandglass block into the original search space of DARTS. We report results on CIFAR-10 dataset as in [liu2018darts].

Appendix 0.B Searched Architectures

Following [liu2018darts, zoph2018learning, cai2018proxylessnas], we also search for a computation cell and use it as the basic building block for the final architecture. The search space and algorithm are described in detail below.

Search space In our experiments, we use the search space from [liu2018darts] as our baseline (denoted as original), which includes the following operators:

  • Convolutional operations (ConvOp): regular convolution, dilated convolution, depthwise convolution;

  • Convolution kernel size: , , followed by ; (Footnote 3: for the inverted residual block and our sandglass block, we only use a kernel size of for depthwise convolutions.)

  • Non-parametric operations: average pooling, max pooling, skip connection, None.

The results are reported in Line 1 of Table 11. To compare with the inverted residual block, we conduct architecture search within the following two new search spaces.

  • Original + IR block: the original search space plus the inverted residual block as the depthwise separable convolution operation candidate.

  • Original + sandglass block: the original search space plus the sandglass block.

The corresponding results are reported in Lines 2-3 of Table 11, respectively. The zero (None) operation is also included to indicate the absence of a connection, as in [liu2018darts].

Searching algorithm As mentioned in the main paper, we adopt the DARTS searching algorithm [liu2018darts] to search for the cell structure and set the number of nodes to 7 in each directed acyclic graph (DAG) of the cell. During the searching process, we strictly follow the training policy and use the same hyper-parameters as in [liu2018darts] for a fair comparison.


Figure 5: Cell structures searched on CIFAR-10 with DARTS [liu2018darts]. (a) Searched normal cell structure. (b) Searched reduction cell structure. ‘SGBlock’ denotes our proposed sandglass block. We use the same search space as used in [liu2018darts] with only one more operator included, i.e. our proposed sandglass block.

Architecture and results The searched cell structures (including both the normal cell and the reduction cell) can be found in Figure 5. As can be seen in Table 11, adding the proposed sandglass block to the original search space substantially reduces the number of learnable parameters in the searched architecture (from 3.25M to 2.45M) while improving classification performance on the CIFAR-10 dataset. This again shows that, when combined with the searching algorithm, our proposed sandglass block can be used to replace the original block to improve performance.

Conclusion and discussion From the above results, we can observe that introducing appropriate super operators (e.g., our sandglass block) into the search space can bring better performance compared to using the original basic operators. We hope this experiment could benefit the development of architecture searching algorithms in the future.