We present Mobile-Former, a parallel design of MobileNet and Transformer with a two-way bridge in between. This structure leverages the advantages of MobileNet at local processing and of transformers at global interaction, while the bridge enables bidirectional fusion of local and global features. Different from recent works on vision transformers, the transformer in Mobile-Former contains very few tokens (e.g., six or fewer) that are randomly initialized, resulting in low computational cost. Combined with the proposed light-weight cross attention that models the bridge, Mobile-Former is not only computationally efficient but also has more representation power, outperforming MobileNetV3 in the low FLOP regime from 25M to 500M FLOPs on ImageNet classification. For instance, it achieves 77.9% top-1 accuracy at 294M FLOPs, gaining 1.3% over MobileNetV3 while saving 17% of computations. When transferred to object detection, Mobile-Former outperforms MobileNetV3 by 8.6 AP.
Recently, Vision Transformer (ViT) [dosovitskiy2021vit, touvron2020deit] has demonstrated the advantage of global processing and achieved significant performance gains over CNNs. However, when the computational budget is constrained to within 1G FLOPs, the gain of ViT diminishes. If we further reduce the computational budget, MobileNet [howard2017mobilenets, sandler2018mobilenetv2, Howard_2019_ICCV_mbnetv3] and its extensions [Han_2020_CVPR_ghostnet, li2021micronet] still dominate this regime (e.g., less than 300M FLOPs for ImageNet classification) due to their efficient local processing filters built on the decomposition into depthwise and pointwise convolution. This naturally raises a question:
How to design efficient networks to effectively encode both local processing and global interaction?
A straightforward idea is to combine convolution and vision transformer. Recent works [wu2021cvt, graham2021levit, Xiao-2021-early-cnns-help-transformers] show the benefit of combining convolution and vision transformer in series, either using convolution at the beginning or intertwining convolution into each transformer block.
In this paper, we shift the design paradigm from series to parallel, and propose a new network that parallelizes MobileNet and transformer with a two-way bridge in between (see Figure 1). We name it Mobile-Former, where Mobile refers to MobileNet and Former stands for transformer. Mobile takes the image as input and stacks mobile (or inverted bottleneck) blocks [sandler2018mobilenetv2]. It leverages the efficient depthwise and pointwise convolution to extract local features at pixel level. Former takes a few learnable tokens as input and stacks multi-head attention and feed-forward networks (FFN). These tokens are used to encode global features of the image.
Mobile and Former communicate over a two-way bridge to fuse local and global features. This is crucial, since it feeds local features to Former's tokens and introduces global views to every pixel of the feature map in Mobile. We propose a light-weight cross attention to model this bidirectional bridge by (a) performing the cross attention at the bottleneck of Mobile, where the number of channels is low, and (b) removing the projections for query, key and value ($W^Q$, $W^K$, $W^V$) on the Mobile side.
This parallel structure with a bidirectional bridge leverages the advantages of both MobileNet and transformers. Decoupling local and global features in parallel exploits MobileNet's efficiency in extracting local features as well as the transformer's power in modeling global interaction. More importantly, this is achieved efficiently via a thin transformer (with very few tokens) and a light-weight bridge that exchanges local and global features between Mobile and Former. The bridge and Former consume less than 20% of the total computational cost but significantly improve the representation capability. This showcases an efficient and effective implementation of the part-whole hierarchy proposed in [Hinton2021-part-whole].
Mobile-Former achieves solid performance on both image classification and object detection. For example, it achieves 77.9% top-1 ImageNet accuracy at 294M FLOPs, outperforming MobileNetV3 [Howard_2019_ICCV_mbnetv3] and LeViT [graham2021levit] by a clear margin (see Figure 1). More importantly, Mobile-Former consistently outperforms both efficient CNNs and vision transformers from 25M to 500M FLOPs (shown in Figure 2), showcasing the usefulness of transformers in the low FLOP regime where efficient CNNs dominate. Furthermore, when transferring from image classification to object detection, Mobile-Former significantly outperforms MobileNetV3, e.g., gaining 8.6 AP (35.8 vs. 27.2) with even less computational cost.
Finally, we note that exploring the optimal network parameters (e.g., width, height) in Mobile-Former is not a goal of this work; rather, we demonstrate that the parallel design of Mobile-Former provides an efficient and effective network architecture.
Light-weight Convolutional Neural Networks (CNNs): Many efficient operations have been proposed to reduce the cost of CNNs, such as cheap linear transformations in GhostNet [Han_2020_CVPR_ghostnet] and trading massive multiplications for cheap additions in AdderNet [Chen_2020_CVPR_addernet]. In addition, different architectures and compound scaling methods have been studied. MixConv [Tan-bmvc2019-mixconv] explores mixing multiple kernel sizes, and Sandglass [Daquan_2020_ECCV_RethinkingBS] flips the structure of the inverted residual block. EfficientNet [tan-ICML19-efficientnet, Tan_2020_CVPR] and TinyNet [NEURIPS2020_e069ea4c] study the compound scaling of depth, width and resolution.

Vision Transformers (ViT): Recently, ViT [dosovitskiy2021vit] and its follow-ups [touvron2020deit, yuan2021tokens, liu2021Swin, dong2021cswin, Vaswani_2021_CVPR_halo] have achieved impressive performance on multiple vision tasks. The original ViT requires training on a large dataset such as JFT-300M to perform well. DeiT [touvron2020deit] later demonstrates that good performance can be achieved on the smaller ImageNet-1K dataset by introducing several important training strategies. To enable ViT for high-resolution images, several hierarchical transformers have been proposed. For example, Swin [liu2021Swin] presents a shifted-window approach for computing self-attention within a local window, and CSWin [dong2021cswin] further improves it by introducing cross-shaped window self-attention. T2T-ViT [yuan2021tokens] progressively converts the image to tokens by recursively aggregating neighboring tokens, so that the local structure can be well modeled. HaloNet [Vaswani_2021_CVPR_halo] develops two attention extensions (blocked local attention and attention downsampling) that improve speed, memory usage and accuracy.
Combination of CNNs and ViT: Recent works [Srinivas_2021_CVPR_bot, wu2021cvt, d2021convit, Xiao-2021-early-cnns-help-transformers, graham2021levit] show that combining convolution and transformer improves prediction accuracy as well as training stability. BoTNet [Srinivas_2021_CVPR_bot] shows significant improvement in instance segmentation and object detection by simply replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet [he2016deep]. ConViT [d2021convit] improves ViT with soft convolutional inductive biases by introducing gated positional self-attention (GPSA). CvT [wu2021cvt] introduces depthwise/pointwise convolution before each multi-head attention. LeViT [graham2021levit] and ViT [Xiao-2021-early-cnns-help-transformers] use a convolutional stem (stacked 3×3 convolutions) to replace the patchify stem [Xiao-2021-early-cnns-help-transformers], and show clear improvement in the low FLOP regime. In this paper, we propose a different design that parallelizes MobileNet and transformer with bidirectional cross attention in between. Our approach is both efficient and effective, outperforming both efficient CNNs and ViT variants in the low FLOP regime.
In this section, we describe the design process of Mobile-Former and its building blocks. The architecture is summarized in Figure 1 and Figure 3.
Mobile-Former parallelizes MobileNet and transformer, and connects them by bidirectional cross attention (see Figure 1). In Mobile-Former, Mobile (which refers to MobileNet) takes an image as input and applies inverted bottleneck blocks [sandler2018mobilenetv2] to extract local features. Former (which refers to the transformer) takes learnable parameters (or tokens) as input, denoted as $Z \in \mathbb{R}^{M \times d}$, where $d$ and $M$ are the dimension and number of tokens, respectively. These tokens are randomly initialized and each represents a global prior of the image. This is different from Vision Transformer (ViT) [dosovitskiy2021vit], where tokens are linear projections of local image patches. This difference is important as it allows a significantly smaller number of tokens ($M \leq 6$ in this paper), resulting in an efficient Former.
Mobile and Former are connected by a two-way bridge where local and global features are fused bidirectionally. We denote the two directions of the bridge as Mobile→Former and Mobile←Former, respectively. We propose a light-weight cross attention to model this bidirectional bridge, which is discussed next.
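Before detailing the bridge, the following minimal PyTorch sketch shows how such a parallel model can be organized: a set of randomly initialized learnable tokens runs alongside the convolutional feature map, and every block updates both. The class name, the per-block `(x, z)` interface and the initialization scale are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MobileFormerSkeleton(nn.Module):
    """Illustrative skeleton: a stack of blocks that jointly update a local
    feature map (Mobile path) and a few global tokens (Former path)."""

    def __init__(self, blocks, num_tokens=6, token_dim=192):
        super().__init__()
        # Global tokens are learnable parameters, randomly initialized
        # (they are NOT linear projections of image patches as in ViT).
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, token_dim) * 0.02)
        self.blocks = nn.ModuleList(blocks)  # each block maps (x, z) -> (x, z)

    def forward(self, x):
        # x: local feature map from the stem, shape (B, C, H, W)
        z = self.tokens.expand(x.shape[0], -1, -1)  # (B, M, d)
        for block in self.blocks:
            x, z = block(x, z)  # two-way bridge lives inside each block
        return x, z
```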
Figure 3 illustrates the Mobile-Former block: the Mobile sub-block differs from the inverted bottleneck block by replacing ReLU with dynamic ReLU [Chen2020DynamicReLU]; Mobile→Former uses light-weight cross attention to fuse local features into global tokens; the Former sub-block is a standard transformer block including multi-head attention and FFN, and its output is used to generate parameters for the dynamic ReLU in the Mobile sub-block; Mobile←Former bridges from global back to local features.

We leverage the advantage of cross attention to fuse the local features (from Mobile) and global tokens (from Former). Two changes are introduced to the standard cross attention for the sake of low computational cost: (a) computing the cross attention at the bottleneck of Mobile, where the number of channels is low, and (b) removing the projections ($W^Q$, $W^K$, $W^V$) from the Mobile side, where the number of positions is large, while keeping them on the Former side.
Let us denote the local feature map as $X \in \mathbb{R}^{N \times C}$ and the global tokens as $Z \in \mathbb{R}^{M \times d}$. They are split as $X = [\tilde{x}_1, \dots, \tilde{x}_h]$ and $Z = [\tilde{z}_1, \dots, \tilde{z}_h]$ ($\tilde{z}_i \in \mathbb{R}^{M \times \frac{d}{h}}$) for multi-head attention with $h$ heads. The light-weight cross attention from local to global is defined as follows:
$$A_{X \rightarrow Z} = \left[ \mathrm{Attn}\left(\tilde{z}_i W_i^Q,\; \tilde{x}_i,\; \tilde{x}_i\right) \right]_{i=1:h} W^O \qquad (1)$$
where $W_i^Q$ is the query projection matrix of the $i$-th head, $W^O$ is used to combine the multiple heads together, and $\mathrm{Attn}(Q, K, V)$ is the standard attention function (in [NIPS2017_transformer]) over query $Q$, key $K$, and value $V$:
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \qquad (2)$$
Note that the global tokens $Z$ are the query and the local features $X$ are the key and value. $W_i^Q$ and $W^O$ are applied on the global tokens $Z$. The diagram of this cross attention is shown in Figure 3 (Mobile→Former).
In a similar manner, the cross attention from global to local is computed as:
$$A_{Z \rightarrow X} = \left[ \mathrm{Attn}\left(\tilde{x}_i,\; \tilde{z}_i W_i^K,\; \tilde{z}_i W_i^V\right) \right]_{i=1:h} \qquad (3)$$
where $W_i^K$ and $W_i^V$ are the projection matrices for key and value, respectively. Here, the local features $X$ are the query and the global tokens $Z$ are the key and value. The diagram of this cross attention is shown in Figure 3 (Mobile←Former).
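The following is a minimal PyTorch sketch of the two-way light-weight cross attention described by Equations 1-3 as reconstructed above. The module name, the residual updates, and the default head count are assumptions for illustration, not the authors' code.

```python
import math
import torch
import torch.nn as nn

class LightCrossAttention(nn.Module):
    """Sketch of the light-weight two-way cross attention (Eq. 1-3).
    Projections are kept only on the Former (token) side; the local feature
    map is used directly as key/value (Mobile->Former) or query (Mobile<-Former)."""

    def __init__(self, channels, token_dim, heads=2):
        super().__init__()
        assert channels % heads == 0 and token_dim % heads == 0
        self.heads = heads
        self.c_head = channels // heads
        # Mobile->Former: project tokens to queries in feature space (W_Q),
        # then project the aggregated features back to token space (W_O).
        self.to_q = nn.Linear(token_dim, channels)
        self.out_z = nn.Linear(channels, token_dim)
        # Mobile<-Former: project tokens to keys and values (W_K, W_V).
        self.to_k = nn.Linear(token_dim, channels)
        self.to_v = nn.Linear(token_dim, channels)

    def _split(self, t):
        # (B, L, heads * c_head) -> (B, heads, L, c_head)
        B, L, _ = t.shape
        return t.reshape(B, L, self.heads, self.c_head).transpose(1, 2)

    def mobile_to_former(self, x, z):
        # x: (B, N, C) local features, z: (B, M, d) global tokens
        q = self._split(self.to_q(z))              # (B, h, M, C/h)
        k = v = self._split(x)                     # (B, h, N, C/h), no projection
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.c_head)
        out = attn.softmax(dim=-1) @ v             # aggregate local features per token
        out = out.transpose(1, 2).reshape(z.shape[0], z.shape[1], -1)
        return z + self.out_z(out)                 # residual update of tokens (assumed)

    def former_to_mobile(self, x, z):
        q = self._split(x)                         # (B, h, N, C/h), no projection
        k = self._split(self.to_k(z))              # (B, h, M, C/h)
        v = self._split(self.to_v(z))
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.c_head)
        out = attn.softmax(dim=-1) @ v             # (B, h, N, C/h)
        out = out.transpose(1, 2).reshape(x.shape[0], x.shape[1], -1)
        return x + out                             # residual update of local features (assumed)
```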
Mobile-Former can be decoupled into a stack of Mobile-Former blocks (see Figure 1). Each block includes a Mobile sub-block, a Former sub-block, and a bidirectional bridge (Mobile→Former and Mobile←Former). The details of the Mobile-Former block are shown in Figure 3.
Input and Output: A Mobile-Former block has two inputs: (a) a local feature map $X \in \mathbb{R}^{N \times C}$, which has $C$ channels and $N$ spatial positions ($N = H \times W$, where $H$ and $W$ are the height and width of the feature map), and (b) global tokens $Z \in \mathbb{R}^{M \times d}$, where $M$ and $d$ are the number and dimension of tokens, respectively. The Mobile-Former block outputs the updated local feature map $X'$ and global tokens $Z'$, which are used as input for the next block. Note that the number and dimension of global tokens are identical across all blocks.
Mobile sub-block: The Mobile sub-block takes the feature map $X$ as input. It differs slightly from the inverted bottleneck block in [sandler2018mobilenetv2] by replacing ReLU with dynamic ReLU [Chen2020DynamicReLU] as the activation function after the first pointwise convolution and the 3×3 depthwise convolution. Different from the original dynamic ReLU, in which the parameters are generated by applying two MLP layers on the average-pooled feature, we save the average pooling by applying the two MLP layers (see Figure 3) on the first global token output from Former. Note that the kernel size of the depthwise convolution is 3×3 for all blocks. The output of the Mobile sub-block is taken as the input for Mobile←Former (see Figure 3). Its computational complexity is $O(NtC^2)$, where $N$ is the number of spatial positions, $t$ is the channel expansion ratio, and $C$ is the number of channels before the expansion.

Former sub-block: The Former sub-block is a standard transformer block with multi-head attention (MHA) and a feed-forward network (FFN). Here, we follow [NIPS2017_transformer] to use post layer normalization. To save computations, we use an expansion ratio of 2 instead of 4 in the FFN. Note that the Former sub-block is processed between the two-way cross attention, i.e., after Mobile→Former and before Mobile←Former (see Figure 3). Its complexity is $O(M^2d + Md^2)$. The first term relates to computing the dot products between query and key and aggregating values based on the attention scores, while the second term covers the linear projections and FFN. Since Former only has a few tokens ($M \leq 6$), the first term is negligible.
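The token-conditioned dynamic ReLU described above can be sketched in PyTorch as follows. The coefficient parameterization, the hidden width of the two MLP layers, the BatchNorm placement and the residual connection are assumptions for illustration, not values taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenDyReLU(nn.Module):
    """Sketch of a dynamic ReLU whose per-channel coefficients come from the
    first global token (instead of average-pooled features)."""

    def __init__(self, channels, token_dim, reduction=4):
        super().__init__()
        hidden = token_dim // reduction
        # Two MLP (fully connected) layers applied to the first token.
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 4 * channels),  # (a1, a2, b1, b2) per channel
        )

    def forward(self, x, first_token):
        # x: (B, C, H, W), first_token: (B, token_dim)
        B, C, _, _ = x.shape
        theta = self.mlp(first_token).reshape(B, 4, C, 1, 1)
        a1 = 1.0 + theta[:, 0]   # slope of the first branch, centered at 1
        a2 = theta[:, 1]         # slope of the second branch, centered at 0
        b1, b2 = theta[:, 2], theta[:, 3]
        return torch.maximum(a1 * x + b1, a2 * x + b2)


class MobileSubBlock(nn.Module):
    """Sketch of the Mobile sub-block: an inverted bottleneck whose
    activations are dynamic ReLUs conditioned on the first global token."""

    def __init__(self, channels, expand, token_dim):
        super().__init__()
        hidden = channels * expand
        self.pw1 = nn.Conv2d(channels, hidden, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(hidden)
        self.dw = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False)
        self.bn2 = nn.BatchNorm2d(hidden)
        self.pw2 = nn.Conv2d(hidden, channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.act1 = TokenDyReLU(hidden, token_dim)
        self.act2 = TokenDyReLU(hidden, token_dim)

    def forward(self, x, z):
        t0 = z[:, 0]                        # first global token, (B, token_dim)
        h = self.act1(self.bn1(self.pw1(x)), t0)
        h = self.act2(self.bn2(self.dw(h)), t0)
        return x + self.bn3(self.pw2(h))    # residual, assuming matching shapes
```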
Mobile→Former: The proposed light-weight cross attention (Equation 1) is used to fuse the local features $X$ into the global tokens $Z$. Compared to the standard attention, the projection matrices for key and value (on the local features) are removed to save computations (shown in Figure 3). Its computational complexity is $O(MNC + MdC)$, where the first term relates to computing the cross attention between local and global features and aggregating local features for each global token, and the second term is the cost of projecting the global tokens to the dimension of the local features and back to dimension $d$ after aggregation.
Mobile←Former: Here, the cross attention (Equation 3) works in the opposite direction to Mobile→Former. It fuses the global tokens $Z$ into the local features $X$. The local features are the query and the global tokens are the key and value. Therefore, we keep the projection matrices for key ($W^K$) and value ($W^V$), but remove the projection matrix for the query to save computations, as shown in Figure 3. The computational complexity is $O(MNC + MdC)$.
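Putting the four pillars together, a sketch of one Mobile-Former block wired in the order described above (Mobile→Former, Former, Mobile, Mobile←Former) could look as follows. It reuses the illustrative `LightCrossAttention` and `MobileSubBlock` classes sketched earlier; using `nn.TransformerEncoderLayer` as a stand-in Former sub-block (post layer normalization, FFN expansion 2) and the head count are assumptions.

```python
import torch.nn as nn

class MobileFormerBlock(nn.Module):
    """Sketch of one Mobile-Former block wiring the four pillars:
    Mobile->Former, Former, Mobile (with token-driven DY-ReLU), Mobile<-Former."""

    def __init__(self, channels, expand, token_dim, heads=2):
        super().__init__()
        self.bridge = LightCrossAttention(channels, token_dim, heads)
        self.former = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=heads,
            dim_feedforward=2 * token_dim,       # FFN expansion ratio 2
            batch_first=True, norm_first=False)  # post layer normalization
        self.mobile = MobileSubBlock(channels, expand, token_dim)

    def forward(self, x, z):
        B, C, H, W = x.shape
        x_flat = x.flatten(2).transpose(1, 2)            # (B, N, C), N = H*W
        z = self.bridge.mobile_to_former(x_flat, z)      # local -> global
        z = self.former(z)                               # MHA + FFN on tokens
        x = self.mobile(x, z)                            # DY-ReLU uses the first token
        x_flat = x.flatten(2).transpose(1, 2)
        x_flat = self.bridge.former_to_mobile(x_flat, z) # global -> local
        x = x_flat.transpose(1, 2).reshape(B, C, H, W)
        return x, z
```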
Computational Complexity: The four pillars of the Mobile-Former block have different computational costs. The Mobile sub-block consumes the most computation ($O(NtC^2)$), as its cost grows linearly with the number of spatial positions and quadratically with the number of channels of the local features. The Former sub-block and the two-way bridge are computationally efficient, consuming less than 20% of the total computation for all Mobile-Former models.
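As a rough illustration of this cost breakdown, the snippet below counts dominant multiply-adds for one block using the complexity terms as reconstructed above. The constant factors, the made-up block dimensions, and the omission of normalization, activation and bias costs are all assumptions; this is not the paper's exact accounting.

```python
def rough_block_madds(N, C, t, M, d):
    """Rough multiply-add counts for the four pillars of one Mobile-Former block."""
    mobile = 2 * N * t * C * C + 9 * N * t * C        # two pointwise convs + 3x3 depthwise conv
    former = 2 * M * M * d + 8 * M * d * d            # attention itself + MHA/FFN projections
    mobile_to_former = 2 * M * N * C + 2 * M * d * C  # cross attention + query/output projections
    former_to_mobile = 2 * M * N * C + 2 * M * d * C  # cross attention + key/value projections
    return mobile, former, mobile_to_former, former_to_mobile

# Made-up example dimensions: a 14x14 feature map with 96 channels,
# expansion ratio 6, and 6 global tokens of dimension 192.
print(rough_block_madds(N=14 * 14, C=96, t=6, M=6, d=192))
```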
Architecture: Table 1 shows a Mobile-Former architecture at 294M FLOPs, which stacks 11 Mobile-Former blocks at different input resolutions. All Mobile-Former blocks have 6 global tokens with dimension 192. The network starts with a 3×3 convolution as stem, followed by a lite bottleneck block at stage 1. The lite bottleneck block is proposed in [li2021micronet]; it uses a 3×3 depthwise convolution to expand the number of channels and a pointwise convolution to squeeze the number of channels. The classification head applies average pooling on the local features, concatenates them with the first global token, and then passes the result through two fully connected layers with h-swish [Howard_2019_ICCV_mbnetv3] in between.
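A minimal sketch of this classification head (average pooling, concatenation with the first global token, two fully connected layers with h-swish in between) is shown below, using the 294M-model dimensions from Table 1. The class name and the omission of dropout are assumptions.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch of the Mobile-Former classification head described above."""

    def __init__(self, feat_dim=1152, token_dim=192, hidden=1920, num_classes=1000):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim + token_dim, hidden)
        self.act = nn.Hardswish()
        self.fc2 = nn.Linear(hidden, num_classes)

    def forward(self, x, z):
        # x: (B, C, H, W) final local features; z: (B, M, d) global tokens
        pooled = x.mean(dim=(2, 3))                  # global average pooling
        fused = torch.cat([pooled, z[:, 0]], dim=1)  # concat with the first token
        return self.fc2(self.act(self.fc1(fused)))
```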
Downsample Mobile-Former Block: Note that stages 2-5 each have a downsample variant of the Mobile-Former block (denoted as Mobile-Former↓) to handle the spatial downsampling. In Mobile-Former↓, only the convolution layers in the Mobile sub-block are changed, from three layers (pointwise→depthwise→pointwise) to four layers (depthwise→pointwise→depthwise→pointwise), where the first depthwise convolution has stride 2. The number of channels expands in each depthwise convolution and squeezes in the following pointwise convolution. This saves computation, as the two costly pointwise convolutions are performed at the lower resolution after downsampling.
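A sketch of the Mobile sub-block in this downsample variant is given below. The four-layer order and the stride-2 first depthwise convolution follow the description above, while the exact per-layer channel counts and the plain ReLU/BatchNorm placement are assumptions (in Mobile-Former the activations would be dynamic ReLUs driven by the first token).

```python
import torch.nn as nn

class DownsampleMobileSubBlock(nn.Module):
    """Sketch of the Mobile sub-block inside Mobile-Former↓:
    depthwise (stride 2) -> pointwise -> depthwise -> pointwise."""

    def __init__(self, in_ch, out_ch, expand, stride=2):
        super().__init__()

        def dw(c_in, c_out, s):
            # Depthwise conv with a depth multiplier, so it can expand channels.
            return nn.Conv2d(c_in, c_out, 3, stride=s, padding=1, groups=c_in, bias=False)

        def pw(c_in, c_out):
            return nn.Conv2d(c_in, c_out, 1, bias=False)

        self.layers = nn.Sequential(
            dw(in_ch, in_ch * expand, stride), nn.BatchNorm2d(in_ch * expand), nn.ReLU(),
            pw(in_ch * expand, out_ch),        nn.BatchNorm2d(out_ch),
            dw(out_ch, out_ch * expand, 1),    nn.BatchNorm2d(out_ch * expand), nn.ReLU(),
            pw(out_ch * expand, out_ch),       nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.layers(x)
```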
Mobile-Former Variants: Mobile-Former has 7 models with different computational costs from 26M to 508M FLOPs. They share a similar architecture but have different widths and heights. We follow [Xiao-2021-early-cnns-help-transformers] and refer to our models by their FLOPs, e.g., Mobile-Former-294M, Mobile-Former-96M. The details of the network architecture for these Mobile-Former models are listed in the appendix (see Table 10).
| Stage | Input | Operator | exp size | #out | Stride |
|---|---|---|---|---|---|
| tokens | 6×192 | – | – | – | – |
| stem | 224²×3 | conv2d, 3×3 | – | 16 | 2 |
| 1 | 112²×16 | bneck-lite | 32 | 16 | 1 |
| 2 | 112²×16 | Mobile-Former↓ | 96 | 24 | 2 |
| | 56²×24 | Mobile-Former | 96 | 24 | 1 |
| 3 | 56²×24 | Mobile-Former↓ | 144 | 48 | 2 |
| | 28²×48 | Mobile-Former | 192 | 48 | 1 |
| 4 | 28²×48 | Mobile-Former↓ | 288 | 96 | 2 |
| | 14²×96 | Mobile-Former | 384 | 96 | 1 |
| | 14²×96 | Mobile-Former | 576 | 128 | 1 |
| | 14²×128 | Mobile-Former | 768 | 128 | 1 |
| 5 | 14²×128 | Mobile-Former↓ | 768 | 192 | 2 |
| | 7²×192 | Mobile-Former | 1152 | 192 | 1 |
| | 7²×192 | Mobile-Former | 1152 | 192 | 1 |
| | 7²×192 | conv2d, 1×1 | – | 1152 | 1 |
| head | 7²×1152 | pool, 7×7 | – | – | 1 |
| | 1²×1152 | concat w/ cls token | – | 1344 | 1 |
| | 1²×1344 | FC | – | 1920 | 1 |
| | 1²×1920 | FC | – | 1000 | 1 |
We conduct experiments on ImageNet classification [deng2009imagenet] and COCO object detection [lin2014microsoft] to evaluate the proposed Mobile-Former.

We first evaluate our Mobile-Former models on ImageNet [deng2009imagenet] classification. ImageNet has 1000 classes, with 1,281,167 images for training and 50,000 images for validation.
Training Setup: The image resolution is 224×224. All models are trained from scratch using the AdamW [loshchilov2018decoupled] optimizer for 450 epochs with cosine learning rate decay. A batch size of 1024 is used. Data augmentation includes Mixup [zhang2018mixup], AutoAugment [Cubuk_2019_CVPR], and random erasing [zhong2020random]. Different combinations of initial learning rate, weight decay and dropout are used for models of different complexities; they are listed in the appendix (see Table 11).

Main Results: Table 2 compares Mobile-Former with classic efficient CNNs: (a) MobileNetV3 [Howard_2019_ICCV_mbnetv3], (b) EfficientNet [tan-ICML19-efficientnet], and (c) ShuffleNetV2 [ma_2018_ECCV] and its extension WeightNet [Ma_2020_eccv_WeightNetRT]. The comparison covers the FLOP range from 26M to 508M, organized into seven groups of similar FLOPs. Mobile-Former consistently outperforms the efficient CNNs at even lower computational cost, except in the group around 150M FLOPs, where Mobile-Former costs slightly more FLOPs than ShuffleNet/WeightNet (151M vs. 138M/141M) but achieves significantly higher top-1 accuracy (75.2% vs. 69.1%/72.4%). This demonstrates that our parallel design improves the representation capability efficiently.
| Model | Input | #Param | MAdds | Top-1 |
|---|---|---|---|---|
| MobileNetV3 Small 1.0 [Howard_2019_ICCV_mbnetv3] | 160 | 2.5M | 30M | 62.8 |
| Mobile-Former-26M | 224 | 3.2M | 26M | 64.0 |
| MobileNetV3 Small 1.0 [Howard_2019_ICCV_mbnetv3] | 224 | 2.5M | 57M | 67.5 |
| Mobile-Former-52M | 224 | 3.5M | 52M | 68.7 |
| MobileNetV3 1.0 [Howard_2019_ICCV_mbnetv3] | 160 | 5.4M | 112M | 71.7 |
| Mobile-Former-96M | 224 | 4.6M | 96M | 72.8 |
| ShuffleNetV2 1.0 [ma_2018_ECCV] | 224 | 2.2M | 138M | 69.1 |
| ShuffleNetV2 1.0 + WeightNet 4× [Ma_2020_eccv_WeightNetRT] | 224 | 5.1M | 141M | 72.4 |
| MobileNetV3 0.75 [Howard_2019_ICCV_mbnetv3] | 224 | 4.0M | 155M | 73.3 |
| Mobile-Former-151M | 224 | 7.6M | 151M | 75.2 |
| MobileNetV3 1.0 [Howard_2019_ICCV_mbnetv3] | 224 | 5.4M | 217M | 75.2 |
| Mobile-Former-214M | 224 | 9.4M | 214M | 76.7 |
| ShuffleNetV2 1.5 [ma_2018_ECCV] | 224 | 3.5M | 299M | 72.6 |
| ShuffleNetV2 1.5 + WeightNet 4× [Ma_2020_eccv_WeightNetRT] | 224 | 9.6M | 307M | 75.0 |
| MobileNetV3 1.25 [Howard_2019_ICCV_mbnetv3] | 224 | 7.5M | 356M | 76.6 |
| EfficientNet-B0 [tan-ICML19-efficientnet] | 224 | 5.3M | 390M | 77.1 |
| Mobile-Former-294M | 224 | 11.4M | 294M | 77.9 |
| ShuffleNetV2 2.0 [ma_2018_ECCV] | 224 | 5.5M | 557M | 74.5 |
| ShuffleNetV2 2.0 + WeightNet 4× [Ma_2020_eccv_WeightNetRT] | 224 | 18.1M | 573M | 76.5 |
| Mobile-Former-508M | 224 | 14.0M | 508M | 79.3 |
| Model | Input | #Param | MAdds | Top-1 |
|---|---|---|---|---|
| T2T-ViT-7 [yuan2021tokens] | 224 | 4.3M | 1.2G | 71.7 |
| DeiT-Tiny [touvron2020deit] | 224 | 5.7M | 1.2G | 72.2 |
| ConViT-Tiny [d2021convit] | 224 | 6.0M | 1.0G | 73.1 |
| ConT-Ti [yan2104contnet] | 224 | 5.8M | 0.8G | 74.9 |
| ViT [Xiao-2021-early-cnns-help-transformers] | 224 | 4.6M | 1.1G | 75.3 |
| ConT-S [yan2104contnet] | 224 | 10.1M | 1.5G | 76.5 |
| Swin-1G [liu2021Swin] | 224 | 7.3M | 1.0G | 77.3 |
| Mobile-Former-294M | 224 | 11.4M | 294M | 77.9 |
| PVT-Tiny [wang2021pvtv1] | 224 | 13.2M | 1.9G | 75.1 |
| T2T-ViT-12 [yuan2021tokens] | 224 | 6.9M | 2.2G | 76.5 |
| CoaT-Lite Tiny [xu2021coscale] | 224 | 5.7M | 1.6G | 76.6 |
| ConViT-Tiny+ [d2021convit] | 224 | 10.0M | 2.0G | 76.7 |
| DeiT-2G [touvron2020deit] | 224 | 9.5M | 2.0G | 77.6 |
| CoaT-Lite Mini [xu2021coscale] | 224 | 11.0M | 2.0G | 78.9 |
| BoT-S1-50 [Srinivas_2021_CVPR_bot] | 224 | 20.8M | 4.3G | 79.1 |
| Swin-2G [liu2021Swin] | 224 | 12.8M | 2.0G | 79.2 |
| Mobile-Former-508M | 224 | 14.0M | 508M | 79.3 |
In Table 3, we compare Mobile-Former with multiple variants of vision transformers (DeiT [touvron2020deit], T2T-ViT [yuan2021tokens], PVT [wang2021pvtv1], ConViT [d2021convit], CoaT [xu2021coscale], ViT [Xiao-2021-early-cnns-help-transformers], Swin [liu2021Swin]). All variants use image resolution 224×224 and are trained without distillation from a teacher network. Mobile-Former achieves higher accuracy while using 3-4 times less computation. This is because Mobile-Former uses significantly fewer tokens to model global interaction and leverages MobileNet to extract local features efficiently. Note that our Mobile-Former (trained for 450 epochs without distillation) even outperforms LeViT [graham2021levit], which leverages distillation from a teacher network and much longer training (1000 epochs). Our method achieves higher top-1 accuracy (77.9% vs. 76.6%) while using less computation (294M vs. 305M FLOPs) than LeViT.
Figure 2 compares Mobile-Former with more efficient CNNs (e.g., GhostNet [Han_2020_CVPR_ghostnet]) and with vision transformer variants at lower FLOPs (e.g., Swin [liu2021Swin] and DeiT [touvron2020deit] from 100M to 2G FLOPs). Note that we implement Swin and DeiT for the low computational budgets from 100M to 2G FLOPs by carefully reducing the network width and height. Mobile-Former clearly outperforms both the CNNs and the ViT variants, demonstrating the advantage of the parallel design that integrates MobileNet and transformer. Although these vision transformer variants are inferior to efficient CNNs by a large margin in this regime, our work showcases that the transformer can also contribute in the low FLOP regime with a proper architecture design.
| Model | AP | AP50 | AP75 | APS | APM | APL | MAdds backbone (G) | MAdds all (G) | #Params backbone (M) | #Params all (M) |
|---|---|---|---|---|---|---|---|---|---|---|
| ShuffleNetV2 [ma_2018_ECCV] | 25.9 | 41.9 | 26.9 | 12.4 | 28.0 | 36.4 | 2.6 | 161 | 0.8 | 10.4 |
| Mobile-Former-151M | 34.2 | 53.4 | 36.0 | 19.9 | 36.8 | 45.3 | 2.4 | 161 | 4.9 | 14.4 |
| MobileNetV3 [Howard_2019_ICCV_mbnetv3] | 27.2 | 43.9 | 28.3 | 13.5 | 30.2 | 37.2 | 4.7 | 162 | 2.8 | 12.3 |
| Mobile-Former-214M | 35.8 | 55.4 | 38.0 | 21.8 | 38.5 | 46.8 | 3.6 | 162 | 5.7 | 15.2 |
| ResNet18 [he2016deep] | 31.8 | 49.6 | 33.6 | 16.3 | 34.3 | 43.2 | 29 | 181 | 11.2 | 21.3 |
| Mobile-Former-294M | 36.6 | 56.6 | 38.6 | 21.9 | 39.5 | 47.9 | 5.2 | 164 | 6.5 | 16.1 |
| ResNet50 [he2016deep] | 36.5 | 55.4 | 39.1 | 20.4 | 40.3 | 48.1 | 84 | 239 | 23.3 | 37.7 |
| PVT-Tiny [wang2021pvtv1] | 36.7 | 56.9 | 38.9 | 22.6 | 38.8 | 50.0 | 70 | 221 | 12.3 | 23.0 |
| ConT-M [yan2104contnet] | 37.9 | 58.1 | 40.2 | 23.0 | 40.6 | 50.4 | 65 | 217 | 16.8 | 27.0 |
| Mobile-Former-508M | 38.0 | 58.3 | 40.3 | 22.9 | 41.2 | 49.7 | 9.4 | 168 | 8.4 | 17.9 |
Object detection experiments are conducted on COCO 2017 [lin2014microsoft], which contains 118K training and 5K validation images. We use RetinaNet [Lin_2017_ICCV_retinanet_focal] (one-stage) as the detection framework and follow the standard settings, using our Mobile-Former as the backbone to generate feature maps at multiple scales. All models are trained for 12 epochs (1×) from ImageNet-pretrained weights.
In Table 4, we compare Mobile-Former with both CNNs (ResNet [he2016deep], MobileNetV3 [Howard_2019_ICCV_mbnetv3], ShuffleNetV2 [ma_2018_ECCV]) and vision transformers (PVT [wang2021pvtv1] and ConT [yan2104contnet]). Mobile-Former significantly outperforms MobileNetV3 and ShuffleNetV2, by at least 8.3 AP, at similar computational cost. Compared to ResNet and the vision transformer variants, our Mobile-Former achieves higher AP with significantly fewer FLOPs in the backbone. Specifically, Mobile-Former-508M takes only 9.4G FLOPs in the backbone but achieves 38.0 AP, outperforming ResNet-50, PVT-Tiny, and ConT-M, which consume about 7 times more computation (65G to 84G FLOPs) in the backbone. This demonstrates that Mobile-Former is also effective and efficient for object detection.
In this section, we show Mobile-Former is effective and efficient via several ablations performed on ImageNet classification. Here, Mobile-Former-294M is used and all models are trained for 300 epochs. Moreover, we visualize the two-way cross attention to understand the communication between Mobile and Former. Finally, the limitations of Mobile-Former are discussed.
| Model | #Param | MAdds | Top-1 | Top-5 |
|---|---|---|---|---|
| Mobile (using ReLU) | 6.1M | 259M | 74.2 | 91.8 |
| + Former and Bridge | 10.1M | 290M | 76.8 | 93.2 |
| + DY-ReLU in Mobile | 11.4M | 294M | 77.8 | 93.7 |
Mobile-Former is more effective than MobileNet because it encodes global interaction via Former, resulting in more accurate predictions. As shown in Table 5, adding Former and the bridge (Mobile→Former and Mobile←Former) costs only 10.6% of the total computation but gains 2.6% top-1 accuracy over the baseline that uses Mobile alone. In addition, using the first global token to generate the parameters of dynamic ReLU [Chen2020DynamicReLU] in the Mobile sub-block (see Figure 3) achieves an additional 1.0% top-1 accuracy. This validates our parallel design of Mobile-Former.
| Kernel Size in Mobile | #Param | MAdds | Top-1 | Top-5 |
|---|---|---|---|---|
| 3×3 | 11.4M | 294M | 77.8 | 93.7 |
| 5×5 | 11.5M | 332M | 77.9 | 93.9 |
We also perform another ablation on the kernel size of the depthwise convolution in Mobile to validate the contribution of Former and the bridge to global interaction. Table 6 shows that the gain from increasing the kernel size (from 3×3 to 5×5) is negligible. We believe this is because Former and the bridge enlarge the receptive field of Mobile by fusing in global features. Therefore, a larger kernel size is not necessary in Mobile-Former.
Mobile-Former is not only effective in encoding both local processing and global interaction, but also achieves this efficiently. The key finding is that Former only requires a small number of global tokens. Here, we first show that Mobile-Former is efficient in terms of both the number and the dimension of tokens. Then, we show that the efficient parallel design of Mobile-Former remains stable when removing the FFN in Former or replacing multi-head attention (MHA) with an MLP.
Number of tokens in Former: Table 7 shows the ImageNet classification results for different numbers of global tokens in Former. The token dimension is 192. Interestingly, even a single global token achieves good performance (77.1% top-1 accuracy). Additional improvements (0.5% and 0.7% top-1 accuracy) are achieved when using 3 and 6 tokens, but the improvement stops when more than 6 tokens are used. This ablation shows the compactness of the global tokens, which is important for the efficiency of Mobile-Former.
| #tokens | #Param | MAdds | Top-1 | Top-5 |
|---|---|---|---|---|
| 1 | 11.4M | 269M | 77.1 | 93.2 |
| 3 | 11.4M | 279M | 77.6 | 93.6 |
| 6 | 11.4M | 294M | 77.8 | 93.7 |
| 9 | 11.4M | 309M | 77.7 | 93.8 |
Token Dimension: Table 8 shows the results for different token dimensions. Here, six global tokens are used in Former. The performance keeps improving, from 76.8% to 77.8%, as the token dimension increases from 64 to 192, but saturates when a higher dimension is used. This further supports the efficiency of Former. When using six tokens with dimension 192, the total computational cost of Former and the bridge consumes only 12% of the overall budget (35M/294M).
| Token Dimension | #Param | MAdds | Top-1 | Top-5 |
|---|---|---|---|---|
| 64 | 7.3M | 277M | 76.8 | 93.1 |
| 128 | 9.1M | 284M | 77.3 | 93.5 |
| 192 | 11.4M | 294M | 77.8 | 93.7 |
| 256 | 14.3M | 308M | 77.8 | 93.7 |
| 320 | 17.9M | 325M | 77.6 | 93.6 |
FFN in Former: As shown in Table 9, removing the FFN introduces a small drop in top-1 accuracy (-0.3%). Compared to the important role of the FFN in the original vision transformer, the FFN has a limited contribution in Mobile-Former. We believe this is because the FFN is not the only module for channel fusion in Mobile-Former: the 1×1 convolutions in Mobile help the channel fusion of local features, while the projection matrices in Mobile→Former (see Equation 1) contribute to the fusion between local and global features.
| Attention | FFN | #Param | MAdds | Top-1 | Top-5 |
|---|---|---|---|---|---|
| MHA | ✓ | 11.4M | 294M | 77.8 | 93.7 |
| MHA | ✗ | 9.8M | 284M | 77.5 | 93.6 |
| MLP | ✓ | 10.5M | 284M | 77.3 | 93.5 |
Multi-head Attention (MHA) vs. MLP: Table 9 also shows the result of replacing multi-head attention (MHA) with an MLP in both Former and the bridge (Mobile→Former and Mobile←Former). The top-1 accuracy drops from 77.8% to 77.3%. The MLP implementation is more efficient (a single matrix multiplication), but it is static, i.e., not adaptive to different input images.
To understand the collaboration between Mobile and Former, we visualize the cross attention on the two-way bridge (Mobile→Former and Mobile←Former) in Figure 4, Figure 5 and Figure 6. The ImageNet-pretrained Mobile-Former-294M is used, which includes 6 global tokens and 11 Mobile-Former blocks. We observe three interesting patterns.
First, the attention is more diverse across tokens at lower levels than at higher levels. As shown in Figure 4, each column corresponds to a token and each row corresponds to a head of the corresponding multi-head cross attention. Note that the attention is normalized over pixels in Mobile→Former (left half), showing the focused region per token. In contrast, the attention in Mobile←Former is normalized over tokens, comparing the contribution of different tokens at each pixel. Clearly, the six tokens at blocks 3 and 5 have different cross attention patterns in both Mobile→Former and Mobile←Former. Similar attention patterns among tokens are clearly observed at block 8. At block 12, the last five tokens share very similar attention patterns. Note that the first token is the class token fed into the classification head. Similar observations have been made in recent studies on ViT [zhou2021refiner, zhou2021deepvit, touvron2021going].

Second, the focused regions of the global tokens change progressively from low to high levels. Figure 5 shows the cross attention over pixels for the first token in Mobile→Former. This token begins by focusing on local features, e.g., edges and corners (at blocks 2-4). Then it pays more attention to regions of connected pixels. Interestingly, the focused region shifts between foreground (person and horse) and background (grass) across blocks. Finally, it locates the most discriminative region (horse body and head) for classification.
Third, the separation between foreground and background is, surprisingly, found at the middle layers (e.g., block 8) of Mobile←Former. Figure 6 shows the cross attention over the 6 tokens for each pixel of the feature map. Clearly, the foreground and background are separated by the first and last tokens. This suggests that some global tokens learn meaningful prototypes that cluster similar pixels.
The major limitation of Mobile-Former is its model size. There are two reasons for this. First, the parallel design is not efficient in terms of parameter sharing, as Mobile, Former and the bridge each have their own parameters. Although Former is computationally efficient due to the small number of tokens, it does not reduce the number of parameters. Second, Mobile-Former spends many parameters on the classification head (two fully connected layers) for ImageNet classification. For instance, Mobile-Former-294M spends 40% of its parameters (4.6M of 11.4M) on the classification head. The model size problem is mitigated when switching from image classification to object detection, as the classification head is removed. We will explore parameter efficiency in future work.
This paper presents Mobile-Former, a new parallel design of MobileNet and Transformer with a two-way bridge in between for communication. It leverages the efficiency of MobileNet in local processing and the advantage of the Transformer in encoding global interaction. This design is not only effective in boosting accuracy but also efficient in saving computational cost. It outperforms both efficient CNNs and vision transformer variants on image classification and object detection in the low FLOP regime by a clear margin. We hope Mobile-Former encourages new designs of efficient CNNs and transformers.
| Stage | Mobile-Former-508M | | | Mobile-Former-294M | | | Mobile-Former-214M | | | Mobile-Former-151M | | | Mobile-Former-96M | | | Mobile-Former-52M | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Block | #exp | #out | Block | #exp | #out | Block | #exp | #out | Block | #exp | #out | Block | #exp | #out | Block | #exp | #out |
| token | 6×192 | | | 6×192 | | | 6×192 | | | 6×192 | | | 4×128 | | | 3×128 | | |
| stem | conv 3×3 | – | 24 | conv 3×3 | – | 16 | conv 3×3 | – | 12 | conv 3×3 | – | 12 | conv 3×3 | – | 12 | conv 3×3 | – | 8 |
| 1 | bneck-lite | 48 | 24 | bneck-lite | 32 | 16 | bneck-lite | 24 | 12 | bneck-lite | 24 | 12 | bneck-lite | 24 | 12 | | | |
| 2 | M-F | 144 | 40 | M-F | 96 | 24 | M-F | 72 | 20 | M-F | 72 | 16 | M-F | 72 | 16 | bneck-lite | 24 | 12 |
| | M-F | 120 | 40 | M-F | 96 | 24 | M-F | 60 | 20 | M-F | 48 | 16 | M-F | 36 | 12 | | | |
| 3 | M-F | 240 | 72 | M-F | 144 | 48 | M-F | 120 | 40 | M-F | 96 | 32 | M-F | 96 | 32 | M-F | 72 | 24 |
| | M-F | 216 | 72 | M-F | 192 | 48 | M-F | 160 | 40 | M-F | 96 | 32 | M-F | 96 | 32 | M-F | 72 | 24 |
| 4 | M-F | 432 | 128 | M-F | 288 | 96 | M-F | 240 | 80 | M-F | 192 | 64 | M-F | 192 | 64 | M-F | 144 | 48 |
| | M-F | 512 | 128 | M-F | 384 | 96 | M-F | 320 | 80 | M-F | 256 | 64 | M-F | 256 | 64 | M-F | 192 | 48 |
| | M-F | 768 | 176 | M-F | 576 | 128 | M-F | 480 | 112 | M-F | 384 | 88 | M-F | 384 | 88 | M-F | 288 | 64 |
| | M-F | 1056 | 176 | M-F | 768 | 128 | M-F | 672 | 112 | M-F | 528 | 88 | | | | | | |
| 5 | M-F | 1056 | 240 | M-F | 768 | 192 | M-F | 672 | 160 | M-F | 528 | 128 | M-F | 528 | 128 | M-F | 384 | 96 |
| | M-F | 1440 | 240 | M-F | 1152 | 192 | M-F | 960 | 160 | M-F | 768 | 128 | M-F | 768 | 128 | M-F | 576 | 96 |
| | M-F | 1440 | 240 | M-F | 1152 | 192 | M-F | 960 | 160 | M-F | 768 | 128 | conv 1×1 | – | 768 | conv 1×1 | – | 576 |
| | conv 1×1 | – | 1440 | conv 1×1 | – | 1152 | conv 1×1 | – | 960 | conv 1×1 | – | 768 | | | | | | |
| pool & concat | – | – | 1632 | – | – | 1344 | – | – | 1152 | – | – | 960 | – | – | 896 | – | – | 704 |
| FC1 | – | – | 1920 | – | – | 1920 | – | – | 1600 | – | – | 1280 | – | – | 1280 | – | – | 1024 |
| FC2 | – | – | 1000 | – | – | 1000 | – | – | 1000 | – | – | 1000 | – | – | 1000 | – | – | 1000 |
Table 10 shows six Mobile-Former models (508M to 52M). These models are manually designed without searching for the optimal architecture parameters (e.g., width or depth). We follow the well-known rules used in MobileNet: (a) the number of channels increases across stages, and (b) the channel expansion rate starts at three in the low levels and increases to six at the high levels. The four bigger models (508M to 151M) use six global tokens with dimension 192 and 11 Mobile-Former blocks, but have different widths. Mobile-Former-96M and Mobile-Former-52M are shallower (with only 8 Mobile-Former blocks) to meet the low computational budget. Mobile-Former-26M has a similar architecture to Mobile-Former-52M, except that all 1×1 convolutions are replaced with group convolutions (group=4).
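For reference, such a pointwise group convolution can be expressed in PyTorch as below; the 64-to-128 channel sizes are placeholders, not layers taken from the paper.

```python
import torch.nn as nn

# A 1x1 group convolution (group=4) replacing a plain pointwise convolution,
# as used in Mobile-Former-26M. Channel sizes here are illustrative only.
pw_group = nn.Conv2d(64, 128, kernel_size=1, groups=4, bias=False)
```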
| Model | Learning Rate | Weight Decay | Dropout |
|---|---|---|---|
| Mobile-Former-26M | 8e-4 | 0.08 | 0.1 |
| Mobile-Former-52M | 8e-4 | 0.10 | 0.2 |
| Mobile-Former-96M | 8e-4 | 0.10 | 0.2 |
| Mobile-Former-151M | 9e-4 | 0.10 | 0.2 |
| Mobile-Former-214M | 9e-4 | 0.15 | 0.2 |
| Mobile-Former-294M | 1e-3 | 0.20 | 0.3 |
| Mobile-Former-508M | 1e-3 | 0.20 | 0.3 |
Table 11 shows three hyper-parameters (learning rate, weight decay and dropout rate) on ImageNet classification for all Mobile-Former models. Their values increase as the model becomes bigger to prevent overfitting.
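For illustration, a minimal PyTorch sketch of the training recipe (AdamW with cosine decay over 450 epochs, using the Mobile-Former-294M values from Table 11) might look as follows. The stand-in model and the omitted data/augmentation pipeline (Mixup, AutoAugment, random erasing, batch size 1024) are placeholders, not the authors' training code.

```python
import torch
import torch.nn as nn

# Stand-in model so the snippet runs; in practice this would be a
# Mobile-Former-294M instance (dropout 0.3 would be set inside the model).
model = nn.Linear(10, 10)

# AdamW + cosine learning rate decay, with the Table 11 values for the
# 294M model: learning rate 1e-3 and weight decay 0.20.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.20)
epochs = 450
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one training pass over ImageNet with batch size 1024 goes here ...
    scheduler.step()
```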
We visualize the cross attention on the two-way bridge (Mobile→Former and Mobile←Former) for all blocks in Figure 7. We use the ImageNet-pretrained Mobile-Former-294M, which includes 6 global tokens and 11 Mobile-Former blocks. In Figure 7, each column corresponds to a token and each row corresponds to a head of the corresponding multi-head cross attention. Note that the attention is normalized over pixels in Mobile→Former (left half), showing the focused region per token. In contrast, the attention in Mobile←Former is normalized over tokens. For instance, at the second head of block 5 in Mobile←Former, pixels on the person and horse attend more to the second token, while pixels on the background attend more to the last token.