Mobile-Former: Bridging MobileNet and Transformer

08/12/2021
by Yinpeng Chen, et al. (Microsoft, USTC)

We present Mobile-Former, a parallel design of MobileNet and Transformer with a two-way bridge in between. This structure leverages the advantages of MobileNet at local processing and Transformer at global interaction, and the bridge enables bidirectional fusion of local and global features. Different from recent works on vision transformers, the transformer in Mobile-Former contains very few tokens (e.g. fewer than 6) that are randomly initialized, resulting in low computational cost. Combined with the proposed light-weight cross attention that models the bridge, Mobile-Former is not only computationally efficient but also has stronger representation power, outperforming MobileNetV3 in the low FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, it achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 while saving 17% of the computation. When transferred to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP.

1 Introduction

Figure 1: Overview of Mobile-Former, which parallelizes MobileNet [sandler2018mobilenetv2] (left side) and Transformer [NIPS2017_transformer] (right side). Different from vision transformer [dosovitskiy2021vit], which uses image patches to form tokens, the transformer in Mobile-Former takes very few learnable tokens as input that are randomly initialized. Mobile (referring to MobileNet) and Former (referring to the transformer) communicate through a bidirectional bridge, which is modeled by the proposed light-weight cross attention. Best viewed in color.

Recently, Vision Transformer (ViT) [dosovitskiy2021vit, touvron2020deit] demonstrates the advantage of global processing and achieves significant performance boosts over CNNs. However, when the computational budget is constrained to within 1G FLOPs, the gain of ViT diminishes. If we further lower the computational budget, MobileNet [howard2017mobilenets, sandler2018mobilenetv2, Howard_2019_ICCV_mbnetv3] and its extensions [Han_2020_CVPR_ghostnet, li2021micronet] still dominate their backyard (e.g. less than 300M FLOPs for ImageNet classification) due to their efficient local processing filters based on the decomposition into depthwise and pointwise convolution. This in turn naturally raises a question:

How to design efficient networks to effectively encode both local processing and global interaction?

A straightforward idea is to combine convolution and vision transformer. Recent works [wu2021cvt, graham2021levit, Xiao-2021-early-cnns-help-transformers] show the benefit of combining convolution and vision transformer in series, either using convolution at the beginning or intertwining convolution into each transformer block.

In this paper, we shift the design paradigm from series to parallel, and propose a new network that parallelizes MobileNet and transformer with a two-way bridge in between (see Figure 1). We name it Mobile-Former, where Mobile refers to MobileNet and Former stands for transformer. Mobile takes the image as input and stacks mobile (or inverted bottleneck) blocks [sandler2018mobilenetv2]. It leverages the efficient depthwise and pointwise convolution to extract local features at pixel level. Former takes a few learnable tokens as input and stacks multi-head attention and feed-forward networks (FFN). These tokens are used to encode global features of the image.

Mobile and Former communicate over a two-way bridge to fuse local and global features. This is crucial since it feeds local features to Former’s tokens and introduces global views to every pixel of the feature map in Mobile. We propose a light-weight cross attention to model this bidirectional bridge by (a) performing the cross attention at the bottleneck of Mobile where the number of channels is low, and (b) removing the projections on query, key and value (W^Q, W^K, W^V) from the Mobile side.

This parallel structure with a bidirectional bridge leverages the advantages of both MobileNet and transformer. The decoupling of local and global features in parallel exploits MobileNet’s efficiency in extracting local features as well as the transformer’s power in modeling global interaction. More importantly, this is achieved in an efficient way via a thin transformer (with very few tokens) and a light-weight bridge to exchange local and global features between Mobile and Former. The bridge and Former consume less than 20% of the total computational cost, but significantly improve the representation capability. This showcases an efficient and effective implementation of the part-whole hierarchy proposed in [Hinton2021-part-whole].

Figure 2: Comparison among Mobile-Former, efficient CNNs and vision transformers in terms of the accuracy-FLOPs tradeoff on ImageNet classification. Mobile-Former consistently outperforms both efficient CNNs and vision transformers in the low FLOP regime (from 25M to 500M MAdds). Note that we implement Swin [liu2021Swin] and DeiT [touvron2020deit] at low computational budgets from 100M to 2G FLOPs. Best viewed in color.

Mobile-Former achieves solid performance on both image classification and object detection. For example, it achieves 77.9% top-1 ImageNet accuracy at 294M FLOPs, outperforming MobileNetV3 [Howard_2019_ICCV_mbnetv3] and LeViT [graham2021levit] by a clear margin (see Figure 2). More importantly, Mobile-Former consistently outperforms both efficient CNNs and vision transformers from 25M to 500M FLOPs (shown in Figure 2), showcasing the use of transformers in the low FLOP regime where efficient CNNs dominate. Furthermore, when transferring from image classification to object detection, Mobile-Former significantly outperforms MobileNetV3, e.g. gaining 8.6 AP (35.8 vs. 27.2) with even less computational cost.

Finally, we note that exploring the optimal network parameters (e.g. width, depth) in Mobile-Former is not a goal of this work; rather, we demonstrate that the parallel design of Mobile-Former provides an efficient and effective network architecture.

2 Related Work

Light-weight Convolutional Neural Networks (CNNs):

MobileNets [howard2017mobilenets, sandler2018mobilenetv2, Howard_2019_ICCV_mbnetv3] model local filter processing efficiently by using depthwise and pointwise convolution in an inverted bottleneck structure. ShuffleNet [Zhang_2018_CVPR, ma_2018_ECCV] uses group convolution and channel shuffle to simplify pointwise convolution. Furthermore, MicroNet [li2021micronet] presents micro-factorized convolution that optimizes the combination of inverted bottleneck and group convolution and achieves solid performance at extremely low FLOPs. Dynamic operators [Hu_2018_CVPR, Yang2019CondConvCP, Chen2019DynamicCA, Chen2020DynamicReLU] have been studied to boost the performance of MobileNet with negligible computational cost. Other efficient operators include the butterfly transform [vahid_2020_CVPR], cheap linear transformations in GhostNet [Han_2020_CVPR_ghostnet], and cheap additions that replace massive multiplications in AdderNet [Chen_2020_CVPR_addernet]. In addition, different architectures and compound scaling methods have been studied. MixConv [Tan-bmvc2019-mixconv] explores mixing multiple kernel sizes, and Sandglass [Daquan_2020_ECCV_RethinkingBS] flips the structure of the inverted residual block. EfficientNet [tan-ICML19-efficientnet, Tan_2020_CVPR] and TinyNet [NEURIPS2020_e069ea4c] study the compound scaling of depth, width and resolution.

Vision Transformers (ViT): Recently, ViT [dosovitskiy2021vit] and its follow-ups [touvron2020deit, yuan2021tokens, liu2021Swin, dong2021cswin, Vaswani_2021_CVPR_halo] achieve impressive performance on multiple vision tasks. The original ViT requires training on large datasets such as JFT-300M to perform well. Later, DeiT [touvron2020deit] demonstrates that good performance can be achieved on the smaller ImageNet-1K dataset by introducing several important training strategies. To enable ViT for high resolution images, several hierarchical transformers have been proposed. For example, Swin [liu2021Swin] presents a shifted window approach for computing self-attention within a local window, and CSWin [dong2021cswin] further improves it by introducing cross-shaped window self-attention. T2T-ViT [yuan2021tokens] progressively converts the image to tokens by recursively aggregating neighboring tokens, such that the local structure can be well modeled. HaloNet [Vaswani_2021_CVPR_halo] develops two attention extensions (blocked local attention and attention downsampling) that improve speed, memory usage and accuracy.

Combination of CNNs and ViT: Recent works [Srinivas_2021_CVPR_bot, wu2021cvt, d2021convit, Xiao-2021-early-cnns-help-transformers, graham2021levit] show that combining convolution and transformer improves prediction accuracy as well as training stability. BoTNet [Srinivas_2021_CVPR_bot] shows significant improvement in instance segmentation and object detection by simply replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet [he2016deep]. ConViT [d2021convit] improves ViT with soft convolutional inductive biases by introducing gated positional self-attention (GPSA). CvT [wu2021cvt] introduces depthwise/pointwise convolution before each multi-head attention. LeViT [graham2021levit] and ViT_C [Xiao-2021-early-cnns-help-transformers] use a convolutional stem (a stack of 3×3 convolutions) to replace the patchify stem, and show clear improvement in the low FLOP regime. In this paper, we propose a different design that parallelizes MobileNet and transformer with bidirectional cross attention in between. Our approach is both efficient and effective, outperforming both efficient CNNs and ViT variants in the low FLOP regime.

3 Our Method: Mobile-Former

In this section, we describe the design process of Mobile-Former and its building blocks. The architecture is summarized in Figure 1 and Figure 3.

3.1 Overview

Mobile-Former parallelizes MobileNet and transformer, and connects them by bidirectional cross attention (see Figure 1). In Mobile-Former, Mobile (referring to MobileNet) takes an image as input (X ∈ R^{HW×3}) and applies inverted bottleneck blocks [sandler2018mobilenetv2] to extract local features. Former (referring to the transformer) takes learnable parameters (or tokens) as input, denoted as Z ∈ R^{M×d}, where M and d are the number and dimension of tokens, respectively. These tokens are randomly initialized and each represents a global prior of the image. This is different from Vision Transformer (ViT) [dosovitskiy2021vit], where tokens are linear projections of local image patches. This difference is important as it allows a significantly smaller number of tokens (M ≤ 6 in this paper), resulting in an efficient Former.
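As a concrete illustration of this difference, the minimal PyTorch sketch below (class and argument names are ours, not from the paper) shows global tokens implemented as randomly initialized learnable parameters that are shared across images and broadcast to the batch, rather than computed from image patches.

```python
import torch
import torch.nn as nn


class GlobalTokens(nn.Module):
    """Learnable global tokens for Former (illustrative sketch).

    Unlike ViT, the tokens are not projected image patches: they are randomly
    initialized parameters, shared across all images, and refined by the
    two-way bridge at every Mobile-Former block.
    """

    def __init__(self, num_tokens: int = 6, dim: int = 192):
        super().__init__()
        # M x d learnable prior; M <= 6 in the paper's models
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, batch_size: int) -> torch.Tensor:
        # Broadcast the shared tokens to the batch: (B, M, d)
        return self.tokens.unsqueeze(0).expand(batch_size, -1, -1)
```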

Mobile and Former are connected by a two-way bridge where local and global features are fused bidirectionally. We denote the two directions of the bridge as Mobile→Former and Mobile←Former, respectively. We propose a light-weight cross attention to model this bidirectional bridge, which will be discussed next.

Figure 3: Mobile-Former block, which includes four modules. The Mobile sub-block modifies the inverted bottleneck block in [sandler2018mobilenetv2] by replacing ReLU with dynamic ReLU [Chen2020DynamicReLU]. Mobile→Former uses light-weight cross attention to fuse local features into global features. The Former sub-block is a standard transformer block including multi-head attention and FFN. Note that the output of Former is used to generate parameters for dynamic ReLU in the Mobile sub-block. Mobile←Former bridges global features back to local features.

3.2 Low Cost Two-Way Bridge

We leverage the advantage of cross attention to fuse the local features (from Mobile) and global tokens (from Former). Here, two changes are introduced to the standard cross attention for the sake of low computational cost: (a) computing the cross attention at the bottleneck of Mobile where the number of channels is low, and (b) removing the projections (W^Q, W^K, W^V) from the Mobile side, where the number of positions is large, but keeping them on the Former side.

Let us denote the local feature map as X ∈ R^{HW×C} and the global tokens as Z ∈ R^{M×d}. For multi-head attention with h heads, they are split as X = [x_1, ..., x_h] and Z = [z_1, ..., z_h]. The light-weight cross attention from local to global is defined as follows:

A_{X→Z} = [Attn(z_i W_i^Q, x_i, x_i)]_{i=1:h} W^O,    (1)

where W_i^Q is the query projection matrix of the i-th head, W^O is used to combine the multiple heads, and Attn(Q, K, V) is the standard attention function (in [NIPS2017_transformer]) over query Q, key K, and value V:

Attn(Q, K, V) = softmax(Q K^T / √d_k) V.    (2)

Note that the global tokens Z are the query and the local features X are the key and value. W_i^Q and W^O are applied on the global tokens Z only. The diagram of this cross attention is shown in Figure 3 (Mobile→Former).

In a similar manner, the cross attention from global to local is computed as:

A_{Z→X} = [Attn(x_i, z_i W_i^K, z_i W_i^V)]_{i=1:h},    (3)

where W_i^K and W_i^V are the projection matrices for key and value, applied on the global tokens. Here, the local features X are the query and the global tokens Z are the key and value. The diagram of this cross attention is shown in Figure 3 (Mobile←Former).
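The PyTorch sketch below illustrates Equations 1–3 as we read them; the module names, head count, and residual updates are our own illustrative choices, not the authors' reference implementation. The key point is which projections exist: Mobile→Former keeps only W^Q and W^O (on the tokens), while Mobile←Former keeps only W^K and W^V (also on the tokens), so the feature map itself is never projected.

```python
import math
import torch.nn as nn


def attention(q, k, v):
    # Standard scaled dot-product attention (Eq. 2): softmax(QK^T / sqrt(d_k)) V
    scale = 1.0 / math.sqrt(q.shape[-1])
    return (q @ k.transpose(-2, -1) * scale).softmax(dim=-1) @ v


class Mobile2Former(nn.Module):
    """Light-weight cross attention, local -> global (Eq. 1), a sketch.

    Keys/values are the raw local features (no W^K, W^V on the Mobile side);
    only the query projection W^Q (on the tokens) and the output projection
    W^O are kept. Shapes and names here are illustrative.
    """

    def __init__(self, local_dim: int, token_dim: int, heads: int = 2):
        super().__init__()
        assert local_dim % heads == 0
        self.heads = heads
        self.q_proj = nn.Linear(token_dim, local_dim)    # W^Q: d -> C
        self.out_proj = nn.Linear(local_dim, token_dim)  # W^O: C -> d

    def forward(self, x, z):
        # x: (B, HW, C) local features, z: (B, M, d) global tokens
        B, N, C = x.shape
        M = z.shape[1]
        q = self.q_proj(z).view(B, M, self.heads, C // self.heads).transpose(1, 2)
        kv = x.view(B, N, self.heads, C // self.heads).transpose(1, 2)
        out = attention(q, kv, kv)                       # (B, h, M, C/h)
        out = out.transpose(1, 2).reshape(B, M, C)
        return z + self.out_proj(out)                    # residual update of the tokens


class Former2Mobile(nn.Module):
    """Light-weight cross attention, global -> local (Eq. 3), a sketch.

    Queries are the raw local features (no W^Q on the Mobile side);
    W^K and W^V are kept on the token side.
    """

    def __init__(self, local_dim: int, token_dim: int, heads: int = 2):
        super().__init__()
        assert local_dim % heads == 0
        self.heads = heads
        self.k_proj = nn.Linear(token_dim, local_dim)    # W^K: d -> C
        self.v_proj = nn.Linear(token_dim, local_dim)    # W^V: d -> C

    def forward(self, x, z):
        B, N, C = x.shape
        M = z.shape[1]
        q = x.view(B, N, self.heads, C // self.heads).transpose(1, 2)
        k = self.k_proj(z).view(B, M, self.heads, C // self.heads).transpose(1, 2)
        v = self.v_proj(z).view(B, M, self.heads, C // self.heads).transpose(1, 2)
        out = attention(q, k, v).transpose(1, 2).reshape(B, N, C)
        return x + out                                   # residual update of the local features
```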

3.3 Mobile-Former Block

Mobile-Former can be decoupled into a stack of Mobile-Former blocks (see Figure 1). Each block includes a Mobile sub-block, a Former sub-block, and a bidirectional bridge (Mobile→Former and Mobile←Former). The details of the Mobile-Former block are shown in Figure 3.

Input and Output: A Mobile-Former block has two inputs: (a) a local feature map X ∈ R^{HW×C}, which has C channels and HW spatial positions (where H and W are the height and width of the feature map), and (b) global tokens Z ∈ R^{M×d}, where M and d are the number and dimension of tokens, respectively. The Mobile-Former block outputs the updated local feature map X' and global tokens Z', which are used as input for the next block. Note that the number and dimension of global tokens are identical across all blocks.

Mobile sub-block: The Mobile sub-block takes the feature map X as input. It differs slightly from the inverted bottleneck block in [sandler2018mobilenetv2] by replacing ReLU with dynamic ReLU [Chen2020DynamicReLU] as the activation function after the first pointwise convolution and the 3×3 depthwise convolution. Different from the original dynamic ReLU, in which the parameters are generated by applying two MLP layers on the average-pooled feature, we save the average pooling by applying the two MLP layers (θ in Figure 3) on the first global token output by Former. Note that the kernel size of the depthwise convolution is 3×3 for all blocks. The output of the Mobile sub-block is taken as the input of Mobile←Former (see Figure 3). Its computational complexity is O(HWγC²), where HW is the number of spatial positions, γ is the channel expansion ratio, and C is the number of channels before the expansion.
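The following PyTorch sketch shows one plausible way to drive dynamic ReLU from the first global token; the MLP sizes, the reduction ratio, and the coefficient parameterization are our assumptions (following the DY-ReLU convention of two linear branches), not the paper's reference code.

```python
import torch
import torch.nn as nn


class DyReLUFromToken(nn.Module):
    """Dynamic ReLU whose coefficients come from the first global token (a sketch).

    y = max(a1 * x + b1, a2 * x + b2), with per-channel (a, b) predicted by two
    fully connected layers applied to the first Former token instead of the
    average-pooled feature. Initialization follows DY-ReLU's convention
    (a1 = 1, a2 = b1 = b2 = 0); the exact parameterization is an assumption.
    """

    def __init__(self, channels: int, token_dim: int, reduction: int = 4):
        super().__init__()
        self.channels = channels
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, token_dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(token_dim // reduction, 4 * channels),  # [a1, a2, b1, b2] per channel
        )

    def forward(self, x: torch.Tensor, first_token: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map, first_token: (B, d)
        B = x.shape[0]
        coefs = self.mlp(first_token).view(B, 4, self.channels, 1, 1)
        a1 = 1.0 + coefs[:, 0]            # slope centered around the static ReLU
        a2 = coefs[:, 1]
        b1, b2 = coefs[:, 2], coefs[:, 3]
        return torch.maximum(a1 * x + b1, a2 * x + b2)
```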

Former sub-block: The Former sub-block is a standard transformer block with multi-head attention (MHA) and a feed-forward network (FFN). Here, we follow [NIPS2017_transformer] and use post layer normalization. To save computation, we use an expansion ratio of 2 instead of 4 in the FFN. Note that the Former sub-block is processed between the two-way cross attention, i.e. after Mobile→Former and before Mobile←Former (see Figure 3). Its complexity is O(M²d + Md²). The first term relates to computing the dot product between query and key and aggregating values based on the attention weights, while the second term covers the linear projections and the FFN. Since Former only has a few tokens (M ≤ 6), the first term is negligible.
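A minimal sketch of such a Former sub-block is given below; the number of heads and the FFN activation (GELU) are our assumptions, while the post layer normalization and the FFN expansion ratio of 2 follow the description above.

```python
import torch.nn as nn


class FormerBlock(nn.Module):
    """Former sub-block: standard MHA + FFN with post layer normalization (a sketch).

    The FFN expansion ratio is 2 (not 4) to save computation; with only a
    handful of tokens, the quadratic attention term over tokens is negligible.
    """

    def __init__(self, dim: int = 192, heads: int = 4, ffn_ratio: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_ratio * dim),
            nn.GELU(),
            nn.Linear(ffn_ratio * dim, dim),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, z):
        # Post-LN: normalize after each residual addition
        z = self.norm1(z + self.attn(z, z, z, need_weights=False)[0])
        z = self.norm2(z + self.ffn(z))
        return z
```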

Mobile→Former: The proposed light-weight cross attention (Equation 1) is used to fuse the local features X into the global tokens Z. Compared to standard cross attention, the projection matrices for key and value (on the local features) are removed to save computation (shown in Figure 3). Its computational complexity is O(MHWC + MdC), where the first term relates to computing the cross attention between local and global features and aggregating local features for each global token, and the second term is the cost of projecting the global tokens to the dimension C of the local features and back to dimension d after aggregation.

Mobile←Former: Here, the cross attention (Equation 3) is in the opposite direction to Mobile→Former. It fuses the global tokens Z into the local features X. The local features are the query and the global tokens are the key and value. Therefore, we keep the projection matrices for key and value (W^K, W^V), but remove the projection matrix for the query to save computation, as shown in Figure 3. Its computational complexity is O(MHWC + MdC).

Computational Complexity: The four pillars of a Mobile-Former block have different computational costs. The Mobile sub-block consumes the most computation (O(HWγC²)), as its cost grows linearly with the number of spatial positions HW and quadratically with the number of channels C of the local features. The Former sub-block and the two-way bridge are computationally efficient, consuming less than 20% of the total computation for all Mobile-Former models.
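Putting the four pillars together, the sketch below shows one reading of the data flow in a Mobile-Former block: Mobile→Former, then Former, then the Mobile sub-block whose DY-ReLU is driven by the first updated token, then Mobile←Former. The sub-modules are injected so that the illustrative classes sketched earlier can be plugged in; this exact ordering and the residual wiring are our interpretation of Figure 3.

```python
import torch.nn as nn


class MobileFormerBlock(nn.Module):
    """Wiring of the four pillars of a Mobile-Former block (data-flow sketch only).

    Any modules with the matching call signatures can be injected, e.g. the
    illustrative Mobile2Former / FormerBlock / Former2Mobile sketches above
    plus an inverted-bottleneck Mobile sub-block taking (features, first token).
    """

    def __init__(self, mobile2former: nn.Module, former: nn.Module,
                 mobile: nn.Module, former2mobile: nn.Module):
        super().__init__()
        self.mobile2former = mobile2former
        self.former = former
        self.mobile = mobile
        self.former2mobile = former2mobile

    def forward(self, x, z):
        # x: local features, z: (B, M, d) global tokens
        z = self.mobile2former(x, z)   # local -> global (Eq. 1)
        z = self.former(z)             # MHA + FFN over a handful of tokens
        x = self.mobile(x, z[:, 0])    # inverted bottleneck; DY-ReLU driven by the first token
        x = self.former2mobile(x, z)   # global -> local (Eq. 3)
        return x, z
```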

3.4 Network Specification

Architecture: Table 1 shows a Mobile-Former architecture at 294M FLOPs, which stacks 11 Mobile-Former blocks at different input resolutions. All Mobile-Former blocks have 6 global tokens with dimension 192. The network starts with a 3×3 convolution as the stem, followed by a lite bottleneck block at stage 1. The lite bottleneck block is proposed in [li2021micronet]; it uses a 3×3 depthwise convolution to expand the number of channels and a pointwise convolution to squeeze the number of channels. The classification head applies average pooling to the local features, concatenates them with the first global token, and then passes the result through two fully connected layers with h-swish [Howard_2019_ICCV_mbnetv3] in between.
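A sketch of such a classification head is shown below; the channel sizes default to the Mobile-Former-294M values from Table 1 (1152-d pooled features concatenated with a 192-d token, giving 1344 inputs to a 1920-d hidden layer), and the module name is ours.

```python
import torch
import torch.nn as nn


class MobileFormerHead(nn.Module):
    """Classification head (sketch): pool local features, concatenate the first
    global token, then two FC layers with h-swish in between."""

    def __init__(self, local_dim: int = 1152, token_dim: int = 192,
                 hidden_dim: int = 1920, num_classes: int = 1000):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(local_dim + token_dim, hidden_dim),
            nn.Hardswish(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor, first_token: torch.Tensor) -> torch.Tensor:
        # x: (B, C, 7, 7) final feature map, first_token: (B, d)
        pooled = self.pool(x).flatten(1)                 # (B, C)
        return self.fc(torch.cat([pooled, first_token], dim=1))
```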

Downsample Mobile-Former Block: Note that stages 2–5 each have a downsample variant of the Mobile-Former block (denoted as Mobile-Former↓) to handle spatial downsampling. In Mobile-Former↓, only the convolution layers in the Mobile sub-block change, from three layers (pointwise→depthwise→pointwise) to four layers (depthwise→pointwise→depthwise→pointwise), where the first depthwise convolution layer has stride 2. The number of channels expands in each depthwise convolution and squeezes in the following pointwise convolution. This saves computation, as the two costly pointwise convolutions are performed at the lower resolution after downsampling. A sketch of this layer ordering is given below.
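The sketch keeps only the convolution layout (normalization, DY-ReLU and the bridge are omitted), with channel expansion in the depthwise layers implemented through a depth multiplier; the function name and channel arguments are illustrative assumptions.

```python
import torch.nn as nn


def downsample_mobile_layers(in_ch: int, exp_ch: int, out_ch: int) -> nn.Sequential:
    """Convolution order in the downsample Mobile sub-block (a sketch):
    depthwise (stride 2) -> pointwise -> depthwise -> pointwise, so both
    costly pointwise convolutions run at the reduced resolution."""
    assert exp_ch % in_ch == 0 and exp_ch % out_ch == 0
    return nn.Sequential(
        # depthwise, stride 2: expand in_ch -> exp_ch while downsampling
        nn.Conv2d(in_ch, exp_ch, 3, stride=2, padding=1, groups=in_ch),
        # pointwise: squeeze exp_ch -> out_ch
        nn.Conv2d(exp_ch, out_ch, 1),
        # depthwise: expand out_ch -> exp_ch at the lower resolution
        nn.Conv2d(out_ch, exp_ch, 3, stride=1, padding=1, groups=out_ch),
        # pointwise: squeeze exp_ch -> out_ch
        nn.Conv2d(exp_ch, out_ch, 1),
    )


# Example with the stage-2 numbers of Mobile-Former-294M (Table 1): 16 -> 96 -> 24
layers = downsample_mobile_layers(in_ch=16, exp_ch=96, out_ch=24)
```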

Mobile-Former Variants: Mobile-Former has 7 models with different computational costs, from 26M to 508M FLOPs. They share a similar architecture but differ in width and depth. Following [Xiao-2021-early-cnns-help-transformers], we refer to our models by their FLOPs, e.g. Mobile-Former-294M, Mobile-Former-96M. The detailed network architectures of these Mobile-Former models are listed in the appendix (see Table 10).

Stage | Input    | Operator            | exp size | #out | Stride
------+----------+---------------------+----------+------+-------
token | 6×192    |                     |          |      |
stem  | 224²×3   | conv2d, 3×3         | –        | 16   | 2
1     | 112²×16  | bneck-lite          | 32       | 16   | 1
2     | 112²×16  | Mobile-Former↓      | 96       | 24   | 2
      | 56²×24   | Mobile-Former       | 96       | 24   | 1
3     | 56²×24   | Mobile-Former↓      | 144      | 48   | 2
      | 28²×48   | Mobile-Former       | 192      | 48   | 1
4     | 28²×48   | Mobile-Former↓      | 288      | 96   | 2
      | 14²×96   | Mobile-Former       | 384      | 96   | 1
      | 14²×96   | Mobile-Former       | 576      | 128  | 1
      | 14²×128  | Mobile-Former       | 768      | 128  | 1
5     | 14²×128  | Mobile-Former↓      | 768      | 192  | 2
      | 7²×192   | Mobile-Former       | 1152     | 192  | 1
      | 7²×192   | Mobile-Former       | 1152     | 192  | 1
      | 7²×192   | conv2d, 1×1         | –        | 1152 | 1
head  | 7²×1152  | pool, 7×7           | –        | –    | 1
      | 1²×1152  | concat w/ cls token | –        | 1344 | 1
      | 1²×1344  | FC                  | –        | 1920 | 1
      | 1²×1920  | FC                  | –        | 1000 | 1

Table 1: Specification for Mobile-Former-294M. “bneck-lite” denotes the lite bottleneck block. “Mobile-Former↓” denotes the Mobile-Former block for downsampling.

4 Experimental Results

We conduct experiments on ImageNet classification [deng2009imagenet] and COCO object detection [lin2014microsoft] to evaluate the proposed Mobile-Former.

4.1 ImageNet Classification

We now evaluate our Mobile-Former models on ImageNet [deng2009imagenet] classification. ImageNet has 1,000 classes, with 1,281,167 images for training and 50,000 images for validation.

Training Setup: The image resolution is 224×224. All models are trained from scratch using the AdamW [loshchilov2018decoupled] optimizer for 450 epochs with cosine learning rate decay and a batch size of 1024. Data augmentation includes Mixup [zhang2018mixup], AutoAugment [Cubuk_2019_CVPR], and random erasing [zhong2020random]. Different combinations of initial learning rate, weight decay and dropout are used for models of different complexity; these are listed in the appendix (see Table 11).
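As a rough, non-authoritative sketch of this recipe, the snippet below builds the optimizer and schedule for Mobile-Former-294M using the reported values (learning rate 1e-3, weight decay 0.20, 450 epochs, batch size 1024); the steps-per-epoch figure is our own estimate, and warm-up, Mixup, AutoAugment and random erasing belong to the data/loss pipeline and are omitted.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def build_optimizer(model: torch.nn.Module,
                    lr: float = 1e-3, weight_decay: float = 0.20,
                    epochs: int = 450, steps_per_epoch: int = 1251):
    """Optimizer and cosine schedule sketch for Mobile-Former-294M.

    steps_per_epoch ~= 1,281,167 images / batch size 1024 (our assumption).
    Call scheduler.step() once per training iteration.
    """
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs * steps_per_epoch)
    return optimizer, scheduler
```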

Main Results: Table 2 shows the comparison between Mobile-Former and classic efficient CNNs: (a) MobileNetV3 [Howard_2019_ICCV_mbnetv3], (b) EfficientNet [tan-ICML19-efficientnet], and (c) ShuffleNetV2 [ma_2018_ECCV] and its extension WeightNet [Ma_2020_eccv_WeightNetRT]. The comparison covers the FLOP range from 26M to 508M, organized into seven groups of similar FLOPs. Mobile-Former consistently outperforms the efficient CNNs with even less computational cost, except in the group around 150M FLOPs, where Mobile-Former costs slightly more FLOPs than ShuffleNet/WeightNet (151M vs. 138M/141M) but achieves significantly higher top-1 accuracy (75.2% vs. 69.1%/72.4%). This demonstrates that our parallel design improves the representation capability efficiently.

Model                                                       | Input | #Param | MAdds | Top-1
------------------------------------------------------------+-------+--------+-------+------
MobileNetV3 Small ×1.0 [Howard_2019_ICCV_mbnetv3]           | 160   | 2.5M   | 30M   | 62.8
Mobile-Former-26M                                           | 224   | 3.2M   | 26M   | 64.0
MobileNetV3 Small ×1.0 [Howard_2019_ICCV_mbnetv3]           | 224   | 2.5M   | 57M   | 67.5
Mobile-Former-52M                                           | 224   | 3.5M   | 52M   | 68.7
MobileNetV3 ×1.0 [Howard_2019_ICCV_mbnetv3]                 | 160   | 5.4M   | 112M  | 71.7
Mobile-Former-96M                                           | 224   | 4.6M   | 96M   | 72.8
ShuffleNetV2 ×1.0 [ma_2018_ECCV]                            | 224   | 2.2M   | 138M  | 69.1
ShuffleNetV2 ×1.0 + WeightNet ×4 [Ma_2020_eccv_WeightNetRT] | 224   | 5.1M   | 141M  | 72.4
MobileNetV3 ×0.75 [Howard_2019_ICCV_mbnetv3]                | 224   | 4.0M   | 155M  | 73.3
Mobile-Former-151M                                          | 224   | 7.6M   | 151M  | 75.2
MobileNetV3 ×1.0 [Howard_2019_ICCV_mbnetv3]                 | 224   | 5.4M   | 217M  | 75.2
Mobile-Former-214M                                          | 224   | 9.4M   | 214M  | 76.7
ShuffleNetV2 ×1.5 [ma_2018_ECCV]                            | 224   | 3.5M   | 299M  | 72.6
ShuffleNetV2 ×1.5 + WeightNet ×4 [Ma_2020_eccv_WeightNetRT] | 224   | 9.6M   | 307M  | 75.0
MobileNetV3 ×1.25 [Howard_2019_ICCV_mbnetv3]                | 224   | 7.5M   | 356M  | 76.6
EfficientNet-B0 [tan-ICML19-efficientnet]                   | 224   | 5.3M   | 390M  | 77.1
Mobile-Former-294M                                          | 224   | 11.4M  | 294M  | 77.9
ShuffleNetV2 ×2.0 [ma_2018_ECCV]                            | 224   | 5.5M   | 557M  | 74.5
ShuffleNetV2 ×2.0 + WeightNet ×4 [Ma_2020_eccv_WeightNetRT] | 224   | 18.1M  | 573M  | 76.5
Mobile-Former-508M                                          | 224   | 14.0M  | 508M  | 79.3

Table 2: Comparison of Mobile-Former with efficient CNNs on ImageNet [deng2009imagenet] classification.
Model                                          | Input | #Param | MAdds | Top-1
-----------------------------------------------+-------+--------+-------+------
T2T-ViT-7 [yuan2021tokens]                     | 224   | 4.3M   | 1.2G  | 71.7
DeiT-Tiny [touvron2020deit]                    | 224   | 5.7M   | 1.2G  | 72.2
ConViT-Tiny [d2021convit]                      | 224   | 6.0M   | 1.0G  | 73.1
ConT-Ti [yan2104contnet]                       | 224   | 5.8M   | 0.8G  | 74.9
ViT_C [Xiao-2021-early-cnns-help-transformers] | 224   | 4.6M   | 1.1G  | 75.3
ConT-S [yan2104contnet]                        | 224   | 10.1M  | 1.5G  | 76.5
Swin-1G† [liu2021Swin]                         | 224   | 7.3M   | 1.0G  | 77.3
Mobile-Former-294M                             | 224   | 11.4M  | 294M  | 77.9
-----------------------------------------------+-------+--------+-------+------
PVT-Tiny [wang2021pvtv1]                       | 224   | 13.2M  | 1.9G  | 75.1
T2T-ViT-12 [yuan2021tokens]                    | 224   | 6.9M   | 2.2G  | 76.5
CoaT-Lite Tiny [xu2021coscale]                 | 224   | 5.7M   | 1.6G  | 76.6
ConViT-Tiny+ [d2021convit]                     | 224   | 10.0M  | 2.0G  | 76.7
DeiT-2G† [touvron2020deit]                     | 224   | 9.5M   | 2.0G  | 77.6
CoaT-Lite Mini [xu2021coscale]                 | 224   | 11.0M  | 2.0G  | 78.9
BoT-S1-50 [Srinivas_2021_CVPR_bot]             | 224   | 20.8M  | 4.3G  | 79.1
Swin-2G† [liu2021Swin]                         | 224   | 12.8M  | 2.0G  | 79.2
Mobile-Former-508M                             | 224   | 14.0M  | 508M  | 79.3

Table 3: Comparison of Mobile-Former with vision transformer variants on ImageNet [deng2009imagenet] classification. We choose ViT variants that use image resolution 224×224 and are trained without distillation from a teacher network. The ViT models are split into two groups based on FLOPs (using 1.5G as the threshold) and ranked by top-1 accuracy. † indicates our implementation.

In Table 3, we compare Mobile-Former with multiple variants of vision transformer (DeiT [touvron2020deit], T2T-ViT [yuan2021tokens], PVT [wang2021pvtv1], ConViT [d2021convit], CoaT [xu2021coscale], ViT_C [Xiao-2021-early-cnns-help-transformers], Swin [liu2021Swin]). All variants use image resolution 224×224 and are trained without distillation from a teacher network. Mobile-Former achieves higher accuracy while using 3–4 times less computation. This is because Mobile-Former uses significantly fewer tokens to model the global interaction and leverages MobileNet to extract local features efficiently. Note that our Mobile-Former (trained for 450 epochs without distillation) even outperforms LeViT [graham2021levit], which leverages distillation from a teacher network and much longer training (1000 epochs). Our method achieves higher top-1 accuracy (77.9% vs. 76.6%) with less computation (294M vs. 305M FLOPs) than LeViT.

Figure 2 compares Mobile-Former with more efficient CNNs (e.g. GhostNet [Han_2020_CVPR_ghostnet]) and with vision transformer variants at lower FLOPs (e.g. Swin [liu2021Swin] and DeiT [touvron2020deit] from 100M to 2G FLOPs). Note that we implement Swin and DeiT for the low computational budget from 100M to 2G FLOPs by carefully reducing the network width and depth. Mobile-Former clearly outperforms both the CNNs and the ViT variants, demonstrating the advantage of the parallel design that integrates MobileNet and transformer. Although these vision transformers are inferior to efficient CNNs by a large margin in this regime, our work showcases that the transformer can also contribute to the low FLOP regime with a proper architecture design.

4.2 Object Detection

Model                                  | AP   | AP50 | AP75 | APS  | APM  | APL  | MAdds (G) backbone | MAdds (G) all | #Params (M) backbone | #Params (M) all
---------------------------------------+------+------+------+------+------+------+--------------------+---------------+----------------------+----------------
ShuffleNetV2 [ma_2018_ECCV]            | 25.9 | 41.9 | 26.9 | 12.4 | 28.0 | 36.4 | 2.6                | 161           | 0.8                  | 10.4
Mobile-Former-151M                     | 34.2 | 53.4 | 36.0 | 19.9 | 36.8 | 45.3 | 2.4                | 161           | 4.9                  | 14.4
MobileNetV3 [Howard_2019_ICCV_mbnetv3] | 27.2 | 43.9 | 28.3 | 13.5 | 30.2 | 37.2 | 4.7                | 162           | 2.8                  | 12.3
Mobile-Former-214M                     | 35.8 | 55.4 | 38.0 | 21.8 | 38.5 | 46.8 | 3.6                | 162           | 5.7                  | 15.2
ResNet18 [he2016deep]                  | 31.8 | 49.6 | 33.6 | 16.3 | 34.3 | 43.2 | 29                 | 181           | 11.2                 | 21.3
Mobile-Former-294M                     | 36.6 | 56.6 | 38.6 | 21.9 | 39.5 | 47.9 | 5.2                | 164           | 6.5                  | 16.1
ResNet50 [he2016deep]                  | 36.5 | 55.4 | 39.1 | 20.4 | 40.3 | 48.1 | 84                 | 239           | 23.3                 | 37.7
PVT-Tiny [wang2021pvtv1]               | 36.7 | 56.9 | 38.9 | 22.6 | 38.8 | 50.0 | 70                 | 221           | 12.3                 | 23.0
ConT-M [yan2104contnet]                | 37.9 | 58.1 | 40.2 | 23.0 | 40.6 | 50.4 | 65                 | 217           | 16.8                 | 27.0
Mobile-Former-508M                     | 38.0 | 58.3 | 40.3 | 22.9 | 41.2 | 49.7 | 9.4                | 168           | 8.4                  | 17.9

Table 4: COCO object detection results. All models are trained on train2017 and tested on val2017, for 12 epochs (1× schedule) from ImageNet pretrained weights.

Object detection experiments are conducted on COCO 2017 [lin2014microsoft], which contains 118K training and 5K validation images. We use RetinaNet [Lin_2017_ICCV_retinanet_focal] (one-stage) as the detection framework and follow the standard settings, using Mobile-Former as the backbone to generate feature maps at multiple scales. All models are trained for 12 epochs (1× schedule) from ImageNet pretrained weights.

In Table 4, we compare Mobile-Former with both CNNs (ResNet [he2016deep], MobileNetV3 [Howard_2019_ICCV_mbnetv3], ShuffleNetV2 [ma_2018_ECCV]) and vision transformers (PVT [wang2021pvtv1] and ConT [yan2104contnet]). Mobile-Former significantly outperforms MobileNetV3 and ShuffleNet by 8.3+ AP under similar computational cost. Compared to ResNet and the vision transformer variants, Mobile-Former achieves higher AP with significantly fewer FLOPs in the backbone. Specifically, Mobile-Former-508M takes only 9.4G backbone FLOPs but achieves 38.0 AP, outperforming ResNet-50, PVT-Tiny, and ConT-M, which consume about 7 times more computation (65G to 84G FLOPs) in the backbone. This demonstrates that Mobile-Former is also effective and efficient for object detection.

5 Ablations and Discussion

In this section, we show Mobile-Former is effective and efficient via several ablations performed on ImageNet classification. Here, Mobile-Former-294M is used and all models are trained for 300 epochs. Moreover, we visualize the two-way cross attention to understand the communication between Mobile and Former. Finally, the limitations of Mobile-Former are discussed.

5.1 Mobile-Former is Effective

Model                | #Param | MAdds | Top-1 | Top-5
---------------------+--------+-------+-------+------
Mobile (using ReLU)  | 6.1M   | 259M  | 74.2  | 91.8
+ Former and Bridge  | 10.1M  | 290M  | 76.8  | 93.2
+ DY-ReLU in Mobile  | 11.4M  | 294M  | 77.8  | 93.7

Table 5: Ablation of Former, the bridge and DY-ReLU on ImageNet classification. Mobile-Former-294M is used.

Mobile-Former is more effective than MobileNet as it encodes global interaction via Former, resulting in more accurate predictions. As shown in Table 5, adding Former and the bridge (Mobile→Former and Mobile←Former) only costs 10.6% of the computation, but gains 2.6% top-1 accuracy over the baseline that uses Mobile alone. In addition, using the first global token to generate the parameters of dynamic ReLU [Chen2020DynamicReLU] in the Mobile sub-block (see Figure 3) achieves an additional 1.0% gain in top-1 accuracy. This validates the parallel design of Mobile-Former.

Kernel Size in Mobile | #Param | MAdds | Top-1 | Top-5
----------------------+--------+-------+-------+------
3×3                   | 11.4M  | 294M  | 77.8  | 93.7
5×5                   | 11.5M  | 332M  | 77.9  | 93.9

Table 6: Ablation of the kernel size of the depthwise convolution in Mobile on ImageNet classification. Mobile-Former-294M is used.

We also perform an ablation on the kernel size of the depthwise convolution in Mobile to validate the contribution of Former and the bridge to global interaction. Table 6 shows that the gain of increasing the kernel size (from 3×3 to 5×5) is negligible. We believe this is because Former and the bridge enlarge the receptive field of Mobile by fusing in global features. Therefore, a larger kernel size is not necessary in Mobile-Former.

5.2 Mobile-Former is Efficient

Mobile-Former is not only effective in encoding both local processing and global interaction, but also achieves this efficiently. The key finding is that Former only requires a small number of global tokens. Here, we first show that Mobile-Former is efficient in terms of both the number and the dimension of tokens. Then, we show that the parallel design remains stable when removing the FFN in Former and when replacing multi-head attention (MHA) with an MLP.

Number of tokens in Former: Table 7 shows the ImageNet classification results for different numbers of global tokens in Former. The token dimension is 192. Interestingly, even a single global token achieves good performance (77.1% top-1 accuracy). Additional improvements (0.5% and 0.7% top-1 accuracy) are achieved when using 3 and 6 tokens, but the improvement stops when more than 6 tokens are used. This ablation shows the compactness of the global tokens, which is important for the efficiency of Mobile-Former.

#tokens | #Param | MAdds | Top-1 | Top-5
--------+--------+-------+-------+------
1       | 11.4M  | 269M  | 77.1  | 93.2
3       | 11.4M  | 279M  | 77.6  | 93.6
6       | 11.4M  | 294M  | 77.8  | 93.7
9       | 11.4M  | 309M  | 77.7  | 93.8

Table 7: Ablation of the number of tokens on ImageNet classification. Mobile-Former-294M is used.

Token Dimension: Table 8 shows the results for different token dimensions. Here, six global tokens are used in Former. The performance keeps improving from 76.8% to 77.8% as the token dimension increases from 64 to 192, but saturates when a higher dimension is used. This further supports the efficiency of Former. When using six tokens with dimension 192, the total computational cost of Former and the bridge consumes only 12% of the overall budget (35M of 294M MAdds).

Token Dimension | #Param | MAdds | Top-1 | Top-5
----------------+--------+-------+-------+------
64              | 7.3M   | 277M  | 76.8  | 93.1
128             | 9.1M   | 284M  | 77.3  | 93.5
192             | 11.4M  | 294M  | 77.8  | 93.7
256             | 14.3M  | 308M  | 77.8  | 93.7
320             | 17.9M  | 325M  | 77.6  | 93.6

Table 8: Ablation of the token dimension on ImageNet classification. Mobile-Former-294M is used.

FFN in Former: As shown in Table 9, removing the FFN introduces a small drop in top-1 accuracy (-0.3%). Compared to the important role of the FFN in the original vision transformer, the FFN has a limited contribution in Mobile-Former. We believe this is because the FFN is not the only module for channel fusion in Mobile-Former: the 1×1 convolutions in Mobile help the channel fusion of local features, while the output projection W^O in Mobile→Former (see Equation 1) contributes to the fusion between local and global features.

Attention | FFN | #Param | MAdds | Top-1 | Top-5
----------+-----+--------+-------+-------+------
MHA       | yes | 11.4M  | 294M  | 77.8  | 93.7
MHA       | no  | 9.8M   | 284M  | 77.5  | 93.6
MLP       | no  | 10.5M  | 284M  | 77.3  | 93.5

Table 9: Ablation of multi-head attention (MHA) and FFN on ImageNet classification. Mobile-Former-294M is used.

Multi-head Attention (MHA) vs. MLP: Table 9 shows the result of replacing multi-head attention (MHA) with an MLP in both Former and the bridge (Mobile→Former and Mobile←Former). The top-1 accuracy drops from 77.8% to 77.3%. The MLP is more efficient to implement, as a single matrix multiplication, but it is static (i.e. not adaptive to the input image).

5.3 Mobile-Former is Explainable

Figure 4: Visualization of the cross attention on the two-way bridge (Mobile→Former and Mobile←Former). Mobile-Former-294M is used, which includes 6 tokens (each corresponding to a column). Four blocks with different input resolutions are selected, and each has two attention heads that are visualized in two rows. Attention in Mobile→Former (left half) is normalized over pixels, showing the focused region per token. Attention in Mobile←Former (right half) is normalized over tokens, showing the contribution of each token at each pixel. The cross attention has more diversity across tokens at lower levels than at higher levels. At the last block, tokens 2–5 have very similar cross attention.
Figure 5: Cross attention over the feature map for the first token in Mobile→Former across all Mobile-Former blocks. Attention is normalized over pixels, showing the focused regions. The focused regions change from low to high levels: the token pays more attention to edges and corners at blocks 2–4, then focuses on larger connected regions rather than scattered small pieces at blocks 5–12. The focused region shifts between the foreground (person and horse) and the background (grass). Finally, it locks onto the most discriminative part (horse body and head) for classification.
Figure 6: Cross attention in Mobile←Former separates foreground and background at middle layers. Attention is normalized over tokens, showing the contribution of different tokens at each pixel. Block 8 is chosen, where the background pixels pay more attention to the first token and the foreground pixels pay more attention to the last token.

To understand the collaboration between Mobile and Former, we visualize the cross attention on the two-way bridge (Mobile→Former and Mobile←Former) in Figure 4, Figure 5 and Figure 6. The ImageNet pretrained Mobile-Former-294M is used, which includes 6 global tokens and 11 Mobile-Former blocks. We observe three interesting patterns.

First, the attention has more diversity across tokens at lower levels than at higher levels. As shown in Figure 4, each column corresponds to a token, and each row corresponds to a head of the corresponding multi-head cross attention. Note that the attention is normalized over pixels in Mobile→Former (left half), showing the focused region per token. In contrast, the attention in Mobile←Former is normalized over tokens, comparing the contribution of different tokens at each pixel. Clearly, the six tokens at blocks 3 and 5 have different cross attention patterns in both Mobile→Former and Mobile←Former. Similar attention patterns among tokens are observed at block 8, and at block 12 the last five tokens share very similar attention patterns. Note that the first token is the class token fed into the classifier head. Similar observations have been made in recent studies of ViT [zhou2021refiner, zhou2021deepvit, touvron2021going].

Second, the focused regions of the global tokens change progressively from low to high levels. Figure 5 shows the cross attention over pixels for the first token in Mobile→Former. This token begins by focusing on local features, e.g. edges and corners (blocks 2–4). Then it pays more attention to regions of connected pixels. Interestingly, the focused region shifts between the foreground (person and horse) and the background (grass) across blocks. Finally, it locates the most discriminative region (horse body and head) for classification.

Third, the separation between foreground and background is, surprisingly, found at middle layers (e.g. block 8) of Mobile←Former. Figure 6 shows the cross attention over the 6 tokens at each pixel of the feature map. Clearly, the foreground and background are separated by the first and last tokens. This shows that some global tokens learn meaningful prototypes that cluster similar pixels.

5.4 Limitations

The major limitation of Mobile-Former is its model size, for two reasons. First, the parallel design is not efficient in terms of parameter sharing, as Mobile, Former and the bridge each have their own parameters. Although Former is computationally efficient due to the small number of tokens, it does not reduce the number of parameters. Second, Mobile-Former spends many parameters on the classification head (two fully connected layers) for ImageNet classification. For instance, Mobile-Former-294M spends 40% of its parameters (4.6M of 11.4M) on the classification head. The model size issue is mitigated when switching from image classification to object detection, as the classification head is removed. We will explore parameter efficiency in future work.

6 Conclusion

This paper presents Mobile-Former, a new parallel design of MobileNet and Transformer with a two-way bridge in between for communication. It leverages the efficiency of MobileNet in local processing and the strength of the Transformer in encoding global interaction. This design is not only effective in boosting accuracy, but also efficient in saving computational cost. It outperforms both efficient CNNs and vision transformer variants on image classification and object detection in the low FLOP regime by a clear margin. We hope Mobile-Former encourages new designs of efficient CNNs and transformers.

References

Appendix A Mobile-Former Architectures

Stage | Mobile-Former-508M | Mobile-Former-294M | Mobile-Former-214M | Mobile-Former-151M | Mobile-Former-96M | Mobile-Former-52M
------+--------------------+--------------------+--------------------+--------------------+-------------------+------------------
token | 6×192              | 6×192              | 6×192              | 6×192              | 4×128             | 3×128
stem  | conv 3×3, 24       | conv 3×3, 16       | conv 3×3, 12       | conv 3×3, 12       | conv 3×3, 12      | conv 3×3, 8
1     | bneck-lite 48/24   | bneck-lite 32/16   | bneck-lite 24/12   | bneck-lite 24/12   | bneck-lite 24/12  | –
2     | M-F↓ 144/40        | M-F↓ 96/24         | M-F↓ 72/20         | M-F↓ 72/16         | M-F↓ 72/16        | bneck-lite 24/12
      | M-F 120/40         | M-F 96/24          | M-F 60/20          | M-F 48/16          | –                 | M-F 36/12
3     | M-F↓ 240/72        | M-F↓ 144/48        | M-F↓ 120/40        | M-F↓ 96/32         | M-F↓ 96/32        | M-F↓ 72/24
      | M-F 216/72         | M-F 192/48         | M-F 160/40         | M-F 96/32          | M-F 96/32         | M-F 72/24
4     | M-F↓ 432/128       | M-F↓ 288/96        | M-F↓ 240/80        | M-F↓ 192/64        | M-F↓ 192/64       | M-F↓ 144/48
      | M-F 512/128        | M-F 384/96         | M-F 320/80         | M-F 256/64         | M-F 256/64        | M-F 192/48
      | M-F 768/176        | M-F 576/128        | M-F 480/112        | M-F 384/88         | M-F 384/88        | M-F 288/64
      | M-F 1056/176       | M-F 768/128        | M-F 672/112        | M-F 528/88         | –                 | –
5     | M-F↓ 1056/240      | M-F↓ 768/192       | M-F↓ 672/160       | M-F↓ 528/128       | M-F↓ 528/128      | M-F↓ 384/96
      | M-F 1440/240       | M-F 1152/192       | M-F 960/160        | M-F 768/128        | M-F 768/128       | M-F 576/96
      | M-F 1440/240       | M-F 1152/192       | M-F 960/160        | M-F 768/128        | conv 1×1, 768     | conv 1×1, 576
      | conv 1×1, 1440     | conv 1×1, 1152     | conv 1×1, 960      | conv 1×1, 768      | –                 | –
pool + concat | 1632        | 1344               | 1152               | 960                | 896               | 704
FC1   | 1920               | 1920               | 1600               | 1280               | 1280              | 1024
FC2   | 1000               | 1000               | 1000               | 1000               | 1000              | 1000

Table 10: Specification of Mobile-Former models. Each cell shows the block type with its expansion size and output channels (exp/out). “bneck-lite” denotes the lite bottleneck block [li2021micronet]. “M-F” denotes the Mobile-Former block and “M-F↓” the Mobile-Former block with downsampling. Mobile-Former-26M has a similar architecture to Mobile-Former-52M, except that all 1×1 convolutions are replaced with group convolution (groups=4).

Table 10 shows six Mobile-Former models (508M to 52M). These models are manually designed without searching for the optimal architecture parameters (e.g. width or depth). We follow the well-known rules used in MobileNet: (a) the number of channels increases across stages, and (b) the channel expansion rate starts at three at low levels and increases to six at high levels. The four bigger models (508M to 151M) use six global tokens with dimension 192 and 11 Mobile-Former blocks, but have different widths. Mobile-Former-96M and Mobile-Former-52M are shallower (with only 8 Mobile-Former blocks) to meet the low computational budget. Mobile-Former-26M has a similar architecture to Mobile-Former-52M, except that all 1×1 convolutions are replaced with group convolution (groups=4).

Appendix B Training Hyper-Parameters

Model              | Learning Rate | Weight Decay | Dropout
-------------------+---------------+--------------+--------
Mobile-Former-26M  | 8e-4          | 0.08         | 0.1
Mobile-Former-52M  | 8e-4          | 0.10         | 0.2
Mobile-Former-96M  | 8e-4          | 0.10         | 0.2
Mobile-Former-151M | 9e-4          | 0.10         | 0.2
Mobile-Former-214M | 9e-4          | 0.15         | 0.2
Mobile-Former-294M | 1e-3          | 0.20         | 0.3
Mobile-Former-508M | 1e-3          | 0.20         | 0.3

Table 11: Hyper-parameters of Mobile-Former models for ImageNet classification.

Table 11 shows three hyper-parameters (learning rate, weight decay and dropout rate) on ImageNet classification for all Mobile-Former models. Their values increase as the model becomes bigger to prevent overfitting.

Appendix C Visualization

We visualize the cross attention on the two-way bridge (Mobile→Former and Mobile←Former) for all blocks in Figure 7. We use the ImageNet pretrained Mobile-Former-294M, which includes 6 global tokens and 11 Mobile-Former blocks. In Figure 7, each column corresponds to a token, and each row corresponds to a head of the corresponding multi-head cross attention. Note that the attention is normalized over pixels in Mobile→Former (left half), showing the focused region per token. In contrast, the attention in Mobile←Former is normalized over tokens. For instance, at the second head of block 5 in Mobile←Former, pixels on the person and horse attend more to the second token, while pixels on the background attend more to the last token.

Figure 7: Visualization of the cross attention on the two-way bridge (Mobile→Former and Mobile←Former). Mobile-Former-294M is used, which includes 6 tokens (each corresponding to a column) and 11 Mobile-Former blocks (blocks 2–12) across 4 stages. Each block has two attention heads, visualized in two rows. Attention in Mobile→Former (left half) is normalized over pixels, showing the focused region per token. Attention in Mobile←Former (right half) is normalized over tokens, showing the contribution of each token at each pixel.