
EfficientFormer: Vision Transformers at MobileNet Speed

06/02/2022
by Yanyu Li, et al.

Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., the attention mechanism, ViT-based models are generally several times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which is even a bit faster than MobileNetV2 (1.7 ms, 71.8% top-1), while our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.



1 Introduction

Figure 1: Inference Speed vs. Accuracy.

All models are trained on ImageNet-1K and measured on iPhone 12 with CoreMLTools to get latency. Compared to CNNs, EfficientFormer-L1 runs about 40% faster than EfficientNet-B0 while achieving 2.1% higher accuracy. Compared to the latest MobileViT-XS, EfficientFormer-L7 runs 0.2 ms faster with 8.5% higher accuracy.

The transformer architecture Vaswani et al. (2017), initially designed for Natural Language Processing (NLP) tasks, introduces the Multi-Head Self Attention (MHSA) mechanism that allows the network to model long-term dependencies and is easy to parallelize. In this context, Dosovitskiy et al. Dosovitskiy et al. (2021) adapt the attention mechanism to 2D images and propose Vision Transformer (ViT): the input image is divided into non-overlapping patches, and the inter-patch representations are learned through MHSA without inductive bias. ViTs demonstrate promising results compared to convolutional neural networks (CNNs) on computer vision tasks. Following this success, several efforts explore the potential of ViT by improving training strategies Touvron et al. (2021b, 2022b, 2022a), introducing architecture changes Yu et al. (2021); Meng et al. (2021), redesigning attention mechanisms Liu et al. (2021b); Jaszczur et al. (2021), and elevating the performance of various vision tasks such as classification Liu et al. (2021a, c); Caron et al. (2021), segmentation Xie et al. (2021); Cheng et al. (2021), and detection Carion et al. (2020); Li et al. (2021a).

On the downside, transformer models are usually several times slower than competitive CNNs Wang et al. (2022); Mehta and Rastegari (2021). Many factors limit the inference speed of ViT, including the massive number of parameters, computation complexity that grows quadratically with token length, non-foldable normalization layers, and the lack of compiler-level optimizations (e.g., Winograd for CNNs Liu et al. (2018b)). The high latency makes transformers impractical for real-world applications on resource-constrained hardware, such as augmented or virtual reality applications on mobile devices and wearables. As a result, lightweight CNNs Howard et al. (2017); Sandler et al. (2018); Howard et al. (2019) remain the default choice for real-time inference.

To alleviate the latency bottleneck of transformers, many approaches have been proposed. For instance, some efforts design new architectures or operations by replacing linear layers with convolutional layers (CONV) Graham et al. (2021), combining self-attention with MobileNet blocks Chen et al. (2021e), or introducing sparse attention Wu et al. (2021a); Roh et al. (2021); Zhu et al. (2020) to reduce the computational cost, while other efforts leverage network search algorithms Gong et al. (2022) or pruning Chavan et al. (2022) to improve efficiency. Although the computation-performance trade-off has been improved by existing works, the fundamental question that relates to the applicability of transformer models remains unanswered: can powerful vision transformers run at MobileNet speed and become a default option for edge applications? This work provides a study towards the answer through the following contributions:


  • First, we revisit the design principles of ViT and its variants through latency analysis (Sec. 3). Following existing work Mehta and Rastegari (2021), we utilize the iPhone 12 as the testbed and the publicly available CoreMLTools (2021) as the compiler, since this mobile device is widely used and the results can be easily reproduced.

  • Second, based on our analysis, we identify inefficient designs and operators in ViT and propose a new dimension-consistent design paradigm for vision transformers (Sec. 4.1).

  • Third, starting from a supernet with the new design paradigm, we propose a simple yet effective latency-driven slimming method to obtain a new family of models, namely, EfficientFormers (Sec. 4.2). We directly optimize for inference speed instead of MACs or number of parameters Ma et al. (2018); Tan et al. (2019); Wang et al. (2020).

Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on the ImageNet-1K Deng et al. (2009) classification task with only 1.6 ms inference time (averaged over multiple runs), which has lower latency and 7.4% higher top-1 accuracy compared to MobileNetV2 (more results in Fig. 1 and Tab. 1). The promising results demonstrate that latency is no longer an obstacle for the widespread adoption of vision transformers. Our largest model, EfficientFormer-L7, achieves 83.3% accuracy with only 7.0 ms latency, outperforming ViT-MobileNet hybrid designs (MobileViT-XS, 74.8%, 7.2 ms) by a large margin. Additionally, we observe superior performance by employing EfficientFormer as the backbone in image detection and segmentation benchmarks (Tab. 2). We provide a preliminary answer to the aforementioned question: ViTs can achieve ultra-fast inference speed and wield powerful performance at the same time. We hope EfficientFormer can serve as a strong baseline and inspire follow-up works on the edge deployment of vision transformers.

2 Related Work

Transformers are initially proposed to handle the learning of long sequences in NLP tasks Vaswani et al. (2017). Dosovitskiy et al. Dosovitskiy et al. (2021) and Carion et al. Carion et al. (2020) adapt the transformer architecture to classification and detection, respectively, and achieve competitive performance against CNN counterparts with stronger training techniques and larger-scale datasets. DeiT Touvron et al. (2021b) further improves the training pipeline with the aid of distillation, eliminating the need for large-scale pretraining Yuan et al. (2021b). Inspired by the competitive performance and global receptive field of transformer models, follow-up works are proposed to refine the architecture Wang et al. (2021a); Touvron et al. (2021c), explore the relationship between CONV nets and ViT Guo et al. (2021); Dai et al. (2021); Han et al. (2021), and adapt ViT to different computer vision tasks Xie et al. (2021); Zhang et al. (2022b, a); Lee et al. (2021b, a); Esser et al. (2021); Zeng et al. (2021). Other research efforts explore the essence of attention mechanism and propose insightful variants of token mixer, e.g., local attention Liu et al. (2021b), spatial MLP Touvron et al. (2021a); Tolstikhin et al. (2021a), and pooling-mixer Yu et al. (2021).

Despite the success in most vision tasks, ViT-based models cannot compete with well-studied lightweight CNNs Sandler et al. (2018); Tan and Le (2019) when the inference speed is the major concern Tolstikhin et al. (2021b); Chen et al. (2021d); Zhou et al. (2021), especially on resource-constrained edge devices Wang et al. (2022). To accelerate ViT, many approaches have been introduced with different methodologies, such as proposing new architectures or modules Kitaev et al. (2020); Chen et al. (2021b); Hassani et al. (2021); Fayyaz et al. (2021); Li et al. (2022); Renggli et al. (2022), re-thinking self-attention and sparse-attention mechanisms Wang et al. (2021b); Heo et al. (2021); Chen et al. (2021a); Li et al. (2021b); Chu et al. (2021); Rao et al. (2021); Tu et al. (2022), and utilizing search algorithms that are widely explored for CNNs to find smaller and faster ViTs Chen et al. (2021c); Gong et al. (2022); Chavan et al. (2022); Zhou et al. (2022). Recently, LeViT Graham et al. (2021) proposes a CONV-clothing design to accelerate vision transformers. However, in order to perform MHSA, the 4D features need to be frequently reshaped into flat 3D patches, which is still expensive to compute on edge resources (Fig. 2). Likewise, MobileViT Mehta and Rastegari (2021) introduces a hybrid architecture that combines lightweight MobileNet blocks (with point-wise and depth-wise CONV) and MHSA blocks; the former is placed at early stages in the network pipeline to extract low-level features, while the latter is placed in late stages to enjoy the global receptive field. A similar approach has been explored by several works Chen et al. (2021e); Gong et al. (2022) as a straightforward strategy to reduce computation.

Different from existing works, we aim at pushing the latency-performance boundary of pure vision transformers instead of relying on hybrid designs, and directly optimize for mobile latency. Through our detailed analysis (Sec. 3), we propose a new design paradigm (Sec. 4.1), which can be further elevated through architecture search (Sec. 4.2).

3 On-Device Latency Analysis of Vision Transformers

Most existing approaches optimize the inference speed of transformers through computation complexity (MACs) or throughput (images/sec) obtained on server GPUs Graham et al. (2021); Gong et al. (2022). However, such metrics do not reflect the real on-device latency. To have a clear understanding of which operations and design choices slow down the inference of ViTs on edge devices, we perform a comprehensive latency analysis over a number of models and operations, as shown in Fig. 2, whereby the following observations are drawn.

Observation 1:

Patch embedding with large kernel and stride is a speed bottleneck on mobile devices.

Patch embedding is often implemented with a non-overlapping convolution layer that has a large kernel size and stride Touvron et al. (2021b); Hassani et al. (2021). A common belief is that the computation cost of the patch embedding layer in a transformer network is unremarkable or negligible Dosovitskiy et al. (2021); Yu et al. (2021). However, our comparison in Fig. 2 between models with large kernel and stride for patch embedding, i.e., DeiT-S Touvron et al. (2021b) and PoolFormer-s24 Yu et al. (2021), and models without it, i.e., LeViT-256 Graham et al. (2021) and EfficientFormer, shows that patch embedding is instead a speed bottleneck on mobile devices.

Large-kernel convolutions are not well supported by most compilers and cannot be accelerated through existing algorithms like Winograd Liu et al. (2018b). Alternatively, the non-overlapping patch embedding can be replaced by a convolution stem with fast downsampling Wu et al. (2021b); Yuan et al. (2021a); Graham et al. (2021) that consists of several hardware-efficient convolutions (Fig. 3).
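To make the alternative concrete, the following PyTorch sketch contrasts a single large-kernel, non-overlapping patch embedding with a stride-2 convolution stem; the channel sizes here are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Non-overlapping patch embedding, as in DeiT/PoolFormer:
# one large-kernel, large-stride convolution.
patch_embed = nn.Conv2d(3, 192, kernel_size=16, stride=16)

# Convolution stem with fast downsampling (hypothetical channel sizes):
# small stride-2 convolutions, each followed by BN and an activation,
# which are better supported by mobile compilers.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 24, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(24),
    nn.GELU(),
    nn.Conv2d(24, 48, kernel_size=3, stride=2, padding=1),
    nn.BatchNorm2d(48),
    nn.GELU(),
)

x = torch.randn(1, 3, 224, 224)
print(patch_embed(x).shape)  # torch.Size([1, 192, 14, 14])
print(conv_stem(x).shape)    # torch.Size([1, 48, 56, 56])
```

Both produce a downsampled feature map, but the stem decomposes the downsampling into small, compiler-friendly convolutions.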

Figure 2: Latency profiling. Results are obtained on iPhone 12 with CoreML. The on-device speed for CNNs (MobileNetV2, ResNet50, and EfficientNet), ViT-based models (DeiT-S, LeViT-256, PoolFormer-s24, and EfficientFormer), and various operators is reported. The latencies of models and operations are denoted with different colors. Numbers in parentheses are the top-1 accuracy on ImageNet-1K. †LeViT uses HardSwish, which is not well supported by CoreML; we replace it with GeLU for a fair comparison.

Observation 2: Consistent feature dimension is important for the choice of token mixer. MHSA is not necessarily a speed bottleneck.

Recent work extends ViT-based models to the MetaFormer architecture Yu et al. (2021) consisting of MLP blocks and unspecified token mixers. Selecting a token mixer is an essential design choice when building ViT-based models. The options are many—the conventional MHSA mixer with a global receptive field, more sophisticated shifted window attention Liu et al. (2021b), or a non-parametric operator like pooling Yu et al. (2021).

We narrow the comparison to the two token mixers, pooling and MHSA, where we choose the former for its simplicity and efficiency, while the latter for better performance. More complicated token mixers like shifted window Liu et al. (2021b) are currently not supported by most public mobile compilers and we leave them outside our scope. Furthermore, we do not use depth-wise convolution to replace pooling Trockman and Kolter (2022) as we focus on building architecture without the aid of lightweight convolutions.

To understand the latency of the two token mixers, we perform the following two comparisons:


  • First, by comparing PoolFormer-s24 Yu et al. (2021) and LeViT-256 Graham et al. (2021), we observe that the Reshape operation is a bottleneck for LeViT-256. The majority of LeViT-256 is implemented with CONV on 4D tensors, requiring frequent reshaping operations when forwarding features into MHSA, since the attention has to be performed on patchified 3D tensors (discarding the extra dimension of attention heads). The extensive usage of Reshape limits the speed of LeViT on mobile devices (Fig. 2). On the other hand, pooling naturally suits the 4D tensor when the network primarily consists of CONV-based implementations, e.g., CONV as the MLP implementation and a CONV stem for downsampling. As a result, PoolFormer exhibits faster inference speed.

  • Second, by comparing DeiT-S Touvron et al. (2021b) and LeViT-256 Graham et al. (2021), we find that MHSA does not bring significant overhead on mobile devices if the feature dimensions are consistent and Reshape is not required. Though much more computationally intensive, DeiT-S with a consistent 3D feature can achieve speed comparable to the newer ViT variant, i.e., LeViT-256.

In this work, we propose a dimension-consistent network (Sec. 4.1) with both a 4D feature implementation and 3D MHSA, where the inefficient frequent Reshape operations are eliminated.

Observation 3: CONV-BN is more latency-favorable than LN-Linear and the accuracy drawback is generally acceptable.

Choosing the MLP implementation is another essential design choice. Usually, one of two options is selected: layer normalization (LN) with 3D linear projection (proj.), or CONV (1x1) with batch normalization (BN). CONV-BN is more latency-favorable because BN can be folded into the preceding convolution for inference speedup, while LN still computes statistics at the inference phase, thus contributing to latency. Based on our experimental results and previous work Wang et al. (2022), the latency introduced by LN constitutes a non-negligible portion of the overall network latency.

Based on our ablation study in Appendix Tab. 3, CONV-BN only slightly downgrades performance compared to LN. In this work, we apply CONV-BN as much as possible (in all latent 4D features) for the latency gain with a negligible performance drop, while using LN for the 3D features, which aligns with the original MHSA design in ViT and yields better accuracy.
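The latency argument for CONV-BN rests on folding BN into the preceding convolution at inference time. Below is a minimal sketch of this standard fusion (not the authors' code); it assumes a plain convolution with groups=1.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution (assumes groups=1)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sqrt(var + eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32)
conv.eval(); bn.eval()                       # use BN running statistics
x = torch.randn(1, 16, 8, 8)
fused = fuse_conv_bn(conv, bn)
print(torch.allclose(bn(conv(x)), fused(x), atol=1e-5))  # expected: True
```

LN has no such fold, since its statistics depend on each input sample at inference time.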

Observation 4: The latency of nonlinearity is hardware and compiler dependent.

Lastly, we study nonlinearities, including GeLU, ReLU, and HardSwish. Previous work Wang et al. (2022) suggests GeLU is not efficient on hardware and slows down inference. However, we observe GeLU is well supported by iPhone 12 and hardly slower than its counterpart, ReLU. On the contrary, HardSwish is surprisingly slow in our experiments and may not be well supported by the compiler (LeViT-256 runs at 11.9 ms with GeLU but markedly slower with HardSwish). We conclude that the nonlinearity should be determined on a case-by-case basis given the specific hardware and compiler at hand. We believe that most of the activations will be supported in the future. In this work, we employ GeLU activations.
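A simple way to sanity-check such operator costs is a micro-benchmark like the sketch below. It times activations on the host CPU with PyTorch, so the absolute numbers will not match the iPhone 12/CoreML measurements in the paper; the tensor shape is a hypothetical late-stage feature map.

```python
import time
import torch
import torch.nn as nn

def time_op(op: nn.Module, x: torch.Tensor, runs: int = 100) -> float:
    """Average forward latency in milliseconds (host CPU; on-device numbers differ)."""
    with torch.inference_mode():
        for _ in range(10):                      # warm-up iterations
            op(x)
        start = time.perf_counter()
        for _ in range(runs):
            op(x)
        elapsed = time.perf_counter() - start
    return elapsed / runs * 1e3

x = torch.randn(1, 448, 7, 7)  # hypothetical late-stage feature map
for name, act in [("ReLU", nn.ReLU()), ("GELU", nn.GELU()), ("Hardswish", nn.Hardswish())]:
    print(f"{name}: {time_op(act, x):.4f} ms")
```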

Figure 3: Overview of EfficientFormer. The network starts with a convolution stem as patch embedding, followed by MetaBlocks (MB). The MB^{4D} and MB^{3D} blocks contain different layer configurations with different token mixers, i.e., pooling and multi-head self-attention, arranged in a dimension-consistent manner.

4 Design of EfficientFormer

Based on the latency analysis, we propose the design of EfficientFormer, demonstrated in Fig. 3. The network consists of a patch embedding (PatchEmbed) and a stack of meta transformer blocks, denoted as MB:

\mathcal{Y} = \prod_{i}^{m} \mathrm{MB}_{i}\left(\mathrm{PatchEmbed}\left(X_{0}^{B,3,H,W}\right)\right),    (1)

where X_0 is the input image with batch size B and spatial size [H, W], \mathcal{Y} is the desired output, and m is the total number of blocks (depth). MB consists of an unspecified token mixer (TokenMixer) followed by an MLP block and can be expressed as follows:

X_{i+1} = \mathrm{MB}_{i}(X_{i}) = \mathrm{MLP}\left(\mathrm{TokenMixer}(X_{i})\right),    (2)

where X_i is the intermediate feature forwarded into the i-th MB. We further define a Stage (or S) as a stack of several MetaBlocks that process features with the same spatial size, such as N_1 x in Fig. 3 denoting that S_1 has N_1 MetaBlocks. The network includes 4 Stages. Between the Stages, there is an embedding operation to project the embedding dimension and downsample the token length, denoted as Embedding in Fig. 3. With the above architecture, EfficientFormer is a fully transformer-based model without integrating MobileNet structures. Next, we dive into the details of the network design, specifically, the architecture details and the search algorithm.
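The abstraction of Eqn. 1 and 2 can be summarized in a few lines of PyTorch; the sketch below shows the structure only, with the concrete token mixers, MLPs, and their residual connections defined in Sec. 4.1.

```python
import torch
import torch.nn as nn

class MetaBlock(nn.Module):
    """MB_i (Eqn. 2): an unspecified token mixer followed by an MLP block.
    Residual connections appear inside the concrete blocks of Eqn. 4 and 5."""
    def __init__(self, token_mixer: nn.Module, mlp: nn.Module):
        super().__init__()
        self.token_mixer, self.mlp = token_mixer, mlp

    def forward(self, x):
        return self.mlp(self.token_mixer(x))

class MetaNet(nn.Module):
    """Eqn. 1: PatchEmbed followed by a stack of m MetaBlocks."""
    def __init__(self, patch_embed: nn.Module, blocks):
        super().__init__()
        self.patch_embed = patch_embed
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x):
        return self.blocks(self.patch_embed(x))

# Toy instantiation: identity mixers/MLPs just to show the plumbing.
net = MetaNet(nn.Conv2d(3, 48, 3, stride=4, padding=1),
              [MetaBlock(nn.Identity(), nn.Identity()) for _ in range(4)])
print(net(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 48, 56, 56])
```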

4.1 Dimension-consistent Design

With the observations in Sec. 3, we propose a dimension-consistent design which splits the network into a 4D partition where operators are implemented in CONV-net style (MB^{4D}), and a 3D partition where linear projections and attentions are performed over 3D tensors to enjoy the global modeling power of MHSA without sacrificing efficiency (MB^{3D}), as shown in Fig. 3. Specifically, the network starts with the 4D partition, while the 3D partition is applied in the last stages. Note that Fig. 3 is just an instance; the actual lengths of the 4D and 3D partitions are specified later through architecture search.

First, input images are processed by a CONV stem with two 3x3 convolutions with stride 2 as patch embedding,

X_{1}^{B,\,C_{1},\,\frac{H}{4},\,\frac{W}{4}} = \mathrm{PatchEmbed}\left(X_{0}^{B,3,H,W}\right),    (3)

where C_j is the channel number (width) of the j-th stage. Then the network starts with MB^{4D} blocks with a simple Pool mixer to extract low-level features,

I_{i} = \mathrm{Pool}(X_{i}) + X_{i},
X_{i+1} = \mathrm{Conv}_{B}\left(\mathrm{Conv}_{B,G}(I_{i})\right) + I_{i},    (4)

where \mathrm{Conv}_{B,G} indicates whether the convolution is followed by BN and GeLU, respectively. Note that here we do not employ Group or Layer Normalization (LN) before the Pool mixer as in Yu et al. (2021), since the 4D partition is a CONV-BN based design, so there already exists a BN in front of each Pool mixer.
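A minimal sketch of one such 4D block is shown below, assuming the reconstructed form of Eqn. 4 (Pool mixer plus a 1x1 CONV-BN-GeLU MLP); the pool size and expansion ratio are illustrative defaults.

```python
import torch
import torch.nn as nn

class MB4D(nn.Module):
    """Sketch of a 4D MetaBlock (cf. Eqn. 4): Pool mixer and a CONV-BN MLP, both residual."""
    def __init__(self, dim: int, mlp_ratio: int = 4, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.GELU(),  # Conv_{B,G}
            nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim),                # Conv_B
        )

    def forward(self, x):        # x: (B, C_j, H', W'), kept 4D throughout
        x = x + self.pool(x)     # I_i = Pool(X_i) + X_i
        return x + self.mlp(x)   # X_{i+1} = Conv_B(Conv_{B,G}(I_i)) + I_i

block = MB4D(dim=48)
print(block(torch.randn(1, 48, 56, 56)).shape)  # torch.Size([1, 48, 56, 56])
```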

After processing all the MB^{4D} blocks, we perform a one-time reshaping to transform the feature size and enter the 3D partition. MB^{3D} follows the conventional ViT structure, as in Fig. 3. Formally,

I_{i} = \mathrm{Linear}\left(\mathrm{MHSA}\left(\mathrm{Linear}(\mathrm{LN}(X_{i}))\right)\right) + X_{i},
X_{i+1} = \mathrm{Linear}_{G}\left(\mathrm{Linear}_{G}(\mathrm{LN}(I_{i}))\right) + I_{i},    (5)

where \mathrm{Linear}_{G} denotes a Linear layer followed by GeLU, and

\mathrm{MHSA}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q \cdot K^{T}}{\sqrt{d}} + b\right) \cdot V,    (6)

where Q, K, V represent the queries, keys, and values learned by the linear projections, d is the dimension of the queries and keys, and b is a parameterized attention bias serving as position encodings.
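For illustration, a sketch of a 3D block with a learned additive attention bias follows. The token count, head count, and the placement of GeLU (after both Linear layers, following the reconstructed Eqn. 5) are assumptions for this example; common implementations apply GeLU only after the first Linear.

```python
import torch
import torch.nn as nn

class MHSAWithBias(nn.Module):
    """MHSA with a learned additive attention bias as position encoding (cf. Eqn. 6)."""
    def __init__(self, dim: int, num_heads: int = 8, num_tokens: int = 49):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # b: one bias per head and token pair, added before the softmax.
        self.attn_bias = nn.Parameter(torch.zeros(num_heads, num_tokens, num_tokens))

    def forward(self, x):                       # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)    # each: (B, heads, N, head_dim)
        # Per-head scaled dot-product, the usual implementation of the sqrt(d) scaling.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5 + self.attn_bias
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class MB3D(nn.Module):
    """Sketch of a 3D MetaBlock (cf. Eqn. 5): LN + MHSA and LN + Linear-GeLU MLP, both residual."""
    def __init__(self, dim: int, num_tokens: int = 49, mlp_ratio: int = 4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = MHSAWithBias(dim, num_tokens=num_tokens)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim), nn.GELU())

    def forward(self, x):                       # x: (B, N, C_j), kept 3D throughout
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))

print(MB3D(448)(torch.randn(1, 49, 448)).shape)  # torch.Size([1, 49, 448])
```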

4.2 Latency Driven Slimming

Design of Supernet. Based on the dimension-consistent design, we build a supernet for searching efficient models within the network architecture shown in Fig. 3 (Fig. 3 shows an example of a searched final network). To represent such a supernet, we define the MetaPath (MP), i.e., the collection of possible blocks:

\mathrm{MP}_{i,\,j=1,2} = \{\mathrm{MB}_{i}^{4D},\ I_{i}\},
\mathrm{MP}_{i,\,j=3,4} = \{\mathrm{MB}_{i}^{4D},\ \mathrm{MB}_{i}^{3D},\ I_{i}\},    (7)

where I represents the identity path, j denotes the j-th Stage, and i denotes the i-th block. The supernet can be illustrated by replacing MB in Fig. 3 with MP.

As in Eqn. 7, in S_1 and S_2 of the supernet, each block can select from MB^{4D} or I, while in S_3 and S_4, the block can be MB^{4D}, MB^{3D}, or I. We only enable MB^{3D} in the last two Stages for two reasons. First, since the computation of MHSA grows quadratically with respect to token length, integrating it in early Stages would largely increase the computation cost. Second, applying global MHSA to the last Stages aligns with the intuition that early stages in the network capture low-level features, while late layers learn long-term dependencies.

Searching Space. Our searching space includes C_j (the width of each Stage), N_j (the number of blocks in each Stage, i.e., depth), and the number of last blocks to apply MB^{3D}.

Searching Algorithm. Previous hardware-aware network searching methods generally rely on hardware deployment of each candidate in the search space to obtain its latency, which is time-consuming Yang et al. (2018). In this work, we propose a simple, fast, yet effective gradient-based search algorithm to obtain a candidate network that only needs to train the supernet once. The algorithm has three major steps.

First, we train the supernet with Gumbel Softmax sampling Liu et al. (2018a) to get the importance score for the blocks within each MP, which can be expressed as

X_{i+1} = \sum_{n} \frac{\exp\left((\alpha_{i}^{n} + \epsilon_{i}^{n}) / \tau\right)}{\sum_{n} \exp\left((\alpha_{i}^{n} + \epsilon_{i}^{n}) / \tau\right)} \cdot \mathrm{MB}_{i}^{n}(X_{i}),    (8)

where α evaluates the importance of each block in MP, as it represents the probability of selecting a block, e.g., MB^{4D} or MB^{3D} for the i-th block. ε ensures exploration, τ is the temperature, and n represents the type of blocks in MP, i.e., n ∈ {4D, I} for S_1 and S_2, and n ∈ {4D, 3D, I} for S_3 and S_4. By using Eqn. 8, the derivatives with respect to the network weights and α can be computed easily. The training follows the standard recipe (see Sec. 5.1) to obtain the trained weights and architecture parameter α.
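The sampling in Eqn. 8 amounts to a soft, differentiable selection over the candidate blocks of a MetaPath. The snippet below is a simplified illustration; the class names are our own, and the exploration noise is drawn with the standard Gumbel transform of uniform samples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaPath(nn.Module):
    """Soft selection over candidate blocks via Gumbel Softmax weighting (cf. Eqn. 8)."""
    def __init__(self, blocks: nn.ModuleList, tau: float = 1.0):
        super().__init__()
        self.blocks = blocks
        self.alpha = nn.Parameter(torch.zeros(len(blocks)))  # architecture parameters
        self.tau = tau

    def forward(self, x):
        # Gumbel noise for exploration: eps = -log(-log(U)), U ~ Uniform(0, 1).
        eps = -torch.log(-torch.log(torch.rand_like(self.alpha)))
        weights = F.softmax((self.alpha + eps) / self.tau, dim=0)
        # Weighted sum over candidates, so gradients reach both block weights and alpha.
        return sum(w * blk(x) for w, blk in zip(weights, self.blocks))

# Toy usage: a path that chooses between a CONV block and the identity.
mp = MetaPath(nn.ModuleList([nn.Conv2d(48, 48, 3, padding=1), nn.Identity()]))
print(mp(torch.randn(1, 48, 56, 56)).shape)  # torch.Size([1, 48, 56, 56])
```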

Second, we build a latency lookup table by collecting the on-device latency of MB^{4D} and MB^{3D} with different widths (multiples of 16).

Finally, we perform network slimming on the supernet obtained from the first step through latency evaluation using the lookup table. Note that a typical gradient-based searching algorithm simply selects the block with the largest α Liu et al. (2018a), which does not fit our scope as it cannot search the width C_j. In fact, constructing a multiple-width supernet is memory-consuming and even unrealistic given that each MP has several branches in our design. Instead of directly searching over the complex searching space, we perform gradual slimming on the single-width supernet as follows.

We first define the importance score of each MP as the ratio of its block scores to the identity score, i.e., α^{4D}/α^{I} for S_1 and S_2, and (α^{4D} + α^{3D})/α^{I} for S_3 and S_4, respectively. Similarly, the importance score for each Stage can be obtained by summing up the scores of all MP within the Stage. With the importance score, we define an action space that includes three options: 1) select I for the least important MP, 2) remove the first MB^{3D}, and 3) reduce the width of the least important Stage (by multiples of 16). Then, we calculate the resulting latency of each action through the lookup table and evaluate the accuracy drop of each action. Lastly, we choose the action with the smallest accuracy drop per unit of latency saved. This process is performed iteratively until the target latency is achieved. We show more details of the algorithm in the Appendix.
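The slimming loop itself is a simple greedy procedure. The sketch below illustrates the selection rule with hypothetical action names and made-up latency/accuracy numbers; in the actual method, the latency gains come from the lookup table and the accuracy drops from the importance scores.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str                # e.g. "drop_MP_7", "remove_first_MB3D", "shrink_stage_3"
    latency_gain_ms: float   # estimated from the latency lookup table
    accuracy_drop: float     # estimated from importance scores / quick evaluation

def latency_driven_slimming(actions_fn, current_latency_ms: float, target_ms: float):
    """Greedy slimming: repeatedly apply the action with the smallest accuracy drop
    per millisecond of latency saved, until the latency target is met.
    `actions_fn` is a hypothetical callable returning the currently available actions."""
    chosen = []
    while current_latency_ms > target_ms:
        candidates = actions_fn()
        if not candidates:
            break
        best = min(candidates, key=lambda a: a.accuracy_drop / max(a.latency_gain_ms, 1e-6))
        chosen.append(best.name)
        current_latency_ms -= best.latency_gain_ms
    return chosen, current_latency_ms

# Toy usage with made-up numbers.
pool = [Action("shrink_stage_4", 0.4, 0.2), Action("remove_first_MB3D", 0.8, 0.6),
        Action("drop_MP_7", 0.3, 0.1)]
print(latency_driven_slimming(lambda: list(pool), current_latency_ms=4.0, target_ms=3.0))
```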

5 Experiments and Discussion

Model Type Params (M) GMACs Train. Epoch Top-1 (%) Latency (ms)
MobileNetV2 CONV 3.5 0.3 300 71.8 1.7
ResNet50 CONV 25.5 4.1 300 78.5 3.0
EfficientNet-B0 CONV 5.3 0.4 350 77.1 2.7
EfficientNet-B3 CONV 12.0 1.8 350 81.6 6.6
EfficientNet-B5 CONV 30.0 9.9 350 83.6 23.0
DeiT-T Attention 5.9 1.2 300/1000 74.5/76.6 9.2
DeiT-S Attention 22.5 4.5 300/1000 81.2/82.6 11.8
PVT-Small Attention 24.5 3.8 300 79.8 24.4
T2T-ViT-14 Attention 21.5 4.8 310 81.5 -
Swin-Tiny Attention 29 4.5 300 81.3 -
PoolFormer-s12 Pool 12 2.0 300 77.2 6.1
PoolFormer-s24 Pool 21 3.6 300 80.3 6.2
PoolFormer-s36 Pool 31 5.2 300 81.4 6.7
ResMLP-S24 SMLP 30 6.0 300 79.4 7.6
Convmixer-768 Hybrid 21.1 20.7 300 80.2 11.6
LeViT-256 Hybrid 18.9 1.1 1000 81.6 11.9†
NASViT-A5 Hybrid - 0.76 360 81.8 -
MobileViT-XS Hybrid 2.3 0.7 300 74.8 7.2
EfficientFormer-L1 MetaBlock 12.2 1.2 300 79.2 1.6
EfficientFormer-L3 MetaBlock 31.3 3.4 300 82.4 3.0
EfficientFormer-L7 MetaBlock 82.0 7.9 300 83.3 7.0
Table 1: Comparison results on ImageNet-1K. Hybrid refers to a mixture of MobileNet blocks and ViT blocks. (-) refers to unrevealed data or an unsupported model in CoreML. †Latency measured with GeLU activation; the original LeViT-256 model with HardSwish activations runs notably slower. Different training seeds lead to only minor fluctuations in accuracy, and the error of the latency benchmark is negligible.

We implement EfficientFormer with PyTorch 1.11 Paszke et al. (2019) and the Timm library Wightman (2019), which is common practice in recent work Mehta and Rastegari (2021); Yu et al. (2021). Our models are trained on a cluster with NVIDIA A100 and V100 GPUs. The mobile speed is averaged over multiple runs on an iPhone 12 equipped with the A14 Bionic chip, using all available computing resources (NPU). CoreMLTools is used to deploy the run-time model. We provide the detailed network architecture and more ablations in the Appendix.
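For reference, a typical PyTorch-to-CoreML export path with CoreMLTools looks like the sketch below (with a torchvision model standing in for EfficientFormer); on-device latency is then benchmarked on the phone itself.

```python
import torch
import torchvision
import coremltools as ct

# Any traceable PyTorch model would do; MobileNetV2 stands in for EfficientFormer here.
model = torchvision.models.mobilenet_v2().eval()
example = torch.randn(1, 3, 224, 224)

traced = torch.jit.trace(model, example)
mlmodel = ct.convert(traced,
                     inputs=[ct.TensorType(name="input", shape=example.shape)],
                     convert_to="neuralnetwork")
mlmodel.save("model.mlmodel")  # deploy to the phone and benchmark with Xcode or a test app
```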

5.1 Image Classification

All EfficientFormer models are trained from scratch on the ImageNet-1K dataset Deng et al. (2009) for the image classification task. We employ the standard image size (224x224) for both training and testing. We follow the training recipe from DeiT Touvron et al. (2021b) but mainly report results with 300 training epochs to allow comparison with other ViT-based models. We use the AdamW optimizer Kingma and Ba (2014); Loshchilov and Hutter (2017), warm-up training in the first few epochs, and a cosine annealing learning rate schedule that decays the initial learning rate to a small minimum value. The teacher model for distillation is RegNetY-16GF Radosavovic et al. (2020) pretrained on ImageNet. Results are shown in Tab. 1 and Fig. 1.
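The optimizer and schedule described above can be set up in a few lines; the sketch below uses placeholder values (model, learning rate, weight decay, step counts) rather than the paper's exact hyper-parameters.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder model and schedule length; the paper follows the DeiT recipe with
# AdamW, a short warm-up, and cosine annealing over 300 epochs.
model = torch.nn.Linear(3 * 224 * 224, 1000)
epochs = 300

optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)  # hypothetical values
scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-6)

for epoch in range(epochs):
    # ... one epoch of training: forward, cross-entropy + distillation loss,
    # backward, optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()
print(f"final learning rate: {scheduler.get_last_lr()[0]:.2e}")
```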

Comparison to CNNs. Compared with widely used CNN-based models, EfficientFormer achieves a better trade-off between accuracy and latency. For example, EfficientFormer-L1 runs at MobileNetV2 speed (1.6 ms vs. 1.7 ms) while achieving 7.4% higher top-1 accuracy. In addition, EfficientFormer-L3 runs at a speed similar to EfficientNet-B0 while achieving 5.3% higher top-1 accuracy. For the models with high performance (above 83% top-1), EfficientFormer-L7 runs more than 3x faster than EfficientNet-B5, demonstrating the advantageous performance of our models. These results allow us to answer the central question raised earlier: ViTs do not need to sacrifice latency to achieve good performance, and an accurate ViT can still have ultra-fast inference speed, as lightweight CNNs do.

Comparison to ViTs. Conventional ViTs still underperform CNNs in terms of latency. For instance, DeiT-Tiny achieves accuracy similar to EfficientNet-B0 while running 3.4x slower (9.2 ms vs. 2.7 ms). In contrast, EfficientFormer performs like other transformer models while running several times faster. For instance, EfficientFormer-L3 achieves higher accuracy than DeiT-Small (82.4% vs. 81.2% at 300 epochs) while being nearly 4x faster (3.0 ms vs. 11.8 ms). It is notable that though the recent transformer variant PoolFormer Yu et al. (2021) naturally has a consistent 4D architecture and runs faster than typical ViTs, the absence of global MHSA greatly limits its performance upper bound. With more than 2x higher inference latency, PoolFormer-S36 still underperforms EfficientFormer-L3 by 1% top-1 accuracy.

Comparison to Hybrid Designs. Existing hybrid designs, e.g., LeViT-256 and MobileViT, still struggle with the latency bottleneck of ViTs and can hardly outperform lightweight CNNs. For example, LeViT-256 runs slower than DeiT-Small (11.9 ms vs. 11.8 ms) while having 1% lower top-1 accuracy (81.6% vs. 82.6% with 1,000 training epochs). For MobileViT, which is a hybrid model with both MHSA and MobileNet blocks, we observe that it is significantly slower than CNN counterparts, e.g., MobileNetV2 and EfficientNet-B0, while the accuracy is not satisfactory either (2.3% lower than EfficientNet-B0). Thus, simply trading off MHSA with MobileNet blocks can hardly push forward the Pareto curve, as shown in Fig. 1. In contrast, EfficientFormer, as a pure transformer-based model, can maintain high performance while achieving ultra-fast inference speed. At a similar inference time, EfficientFormer-L7 outperforms MobileViT-XS by 8.5% top-1 accuracy on ImageNet, demonstrating the superiority of our design.

5.2 EfficientFormer as Backbone

Object Detection and Instance Segmentation.

We follow the implementation of Mask R-CNN He et al. (2017) to integrate EfficientFormer as the backbone and verify performance. We experiment on COCO-2017 Lin et al. (2014), which contains training and validation sets of 118K and 5K images, respectively. The EfficientFormer backbone is initialized with ImageNet-1K pretrained weights. Similar to prior work Yu et al. (2021), we use the AdamW optimizer Kingma and Ba (2014); Loshchilov and Hutter (2017) and train the model for 12 epochs. We set the input size to 1333x800.

The results for detection and instance segmentation are shown in Tab. 2. EfficientFormers consistently outperform CNN (ResNet) and transformer (PoolFormer) backbones. With similar computation cost, EfficientFormer-L3 outperforms the ResNet50 backbone by 3.4 box AP and 3.7 mask AP, and outperforms the PoolFormer-S24 backbone by 1.3 box AP and 1.1 mask AP, proving that EfficientFormer generalizes well as a strong backbone in vision tasks.

Semantic Segmentation.

Backbone AP^box AP^box_50 AP^box_75 AP^mask AP^mask_50 AP^mask_75 mIoU (%)
ResNet18 34.0 54.0 36.7 31.2 51.0 32.7 32.9
PoolFormer-S12 37.3 59.0 40.1 34.6 55.8 36.9 37.2
EfficientFormer-L1 37.9 60.3 41.0 35.4 57.3 37.3 38.9
ResNet50 38.0 58.6 41.4 34.4 55.1 36.7 36.7
PoolFormer-S24 40.1 62.2 43.4 37.0 59.1 39.6 40.3
EfficientFormer-L3 41.4 63.9 44.7 38.1 61.0 40.4 43.5
ResNet101 40.4 61.1 44.2 36.4 57.7 38.8 38.8
PoolFormer-S36 41.0 63.1 44.8 37.7 60.1 40.0 42.0
EfficientFormer-L7 42.6 65.1 46.1 39.0 62.2 41.7 45.1
Table 2: Comparison results using EfficientFormer as backbone. Results on object detection and instance segmentation are obtained on COCO 2017. Results on semantic segmentation are obtained on ADE20K.

We further validate the performance of EfficientFormer on the semantic segmentation task. We use the challenging scene parsing dataset ADE20K Zhou et al. (2017, 2019), which contains 20K training images and 2K validation images covering 150 class categories. Similar to existing work Yu et al. (2021), we build EfficientFormer as the backbone along with Semantic FPN Kirillov et al. (2019) as the segmentation decoder for a fair comparison. The backbone is initialized with weights pretrained on ImageNet-1K, and the model is trained for 40K iterations. We follow the common practice in segmentation Yu et al. (2021); Xie et al. (2021), use the AdamW optimizer Kingma and Ba (2014); Loshchilov and Hutter (2017), and apply a poly learning rate schedule with power 0.9. We resize and crop input images to 512x512 for training and use a shorter side of 512 for testing (on the validation set).

As shown in Tab. 2, EfficientFormer consistently outperforms CNN- and transformer-based backbones by a large margin under a similar computation budget. For example, EfficientFormer-L3 outperforms PoolFormer-S24 by 3.2 mIoU. We show that with global attention, EfficientFormer learns better long-term dependencies, which is beneficial for high-resolution dense prediction tasks.

5.3 Discussion

Relationship to MetaFormer. The design of EfficientFormer is partly inspired by the MetaFormer concept Yu et al. (2021). Compared to PoolFormer, EfficientFormer addresses the dimension mismatch problem, which is a root cause of inefficient edge inference, and is thus capable of utilizing global MHSA without sacrificing speed. Consequently, EfficientFormer exhibits an accuracy advantage over PoolFormer. Despite its fully 4D design, PoolFormer employs inefficient patch embedding and group normalization (Fig. 2), leading to increased latency. Instead, the redesigned 4D partition of EfficientFormer (Fig. 3) is more hardware-friendly and exhibits better performance across several tasks.

Limitations. (i) Though most designs in EfficientFormer are general-purpose, e.g., the dimension-consistent design and the 4D block with CONV-BN fusion, the actual speed of EfficientFormer may vary on other platforms. For instance, if GeLU is not well supported while HardSwish is efficiently implemented on a specific hardware and compiler, the operator may need to be modified accordingly. (ii) The proposed latency-driven slimming is simple and fast. However, better results may be achieved if search cost is not a concern and an enumeration-based brute-force search is performed.

6 Conclusion

In this work, we show that Vision Transformers can operate at MobileNet speed on mobile devices. Starting from a comprehensive latency analysis, we identify inefficient operators in a series of ViT-based architectures, whereby we draw important observations that guide our new design paradigm. The proposed EfficientFormer complies with a dimension-consistent design that smoothly leverages hardware-friendly 4D MetaBlocks and powerful 3D MHSA blocks. We further propose a fast latency-driven slimming method to derive optimized configurations from our design space. Extensive experiments on image classification, object detection, and segmentation tasks show that EfficientFormer models outperform existing transformer models while being faster than most competitive CNNs. The latency-driven analysis of ViT architectures and the experimental results validate our claim: powerful vision transformers can achieve ultra-fast inference speed on the edge. Future research will further explore the potential of EfficientFormer on other resource-constrained devices.

References

  • [1] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European conference on computer vision, pp. 213–229. Cited by: §1, §2.
  • [2] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin (2021) Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660. Cited by: §1.
  • [3] A. Chavan, Z. Shen, Z. Liu, Z. Liu, K. Cheng, and E. Xing (2022) Vision transformer slimming: multi-dimension searching in continuous optimization space. Cited by: §1, §2.
  • [4] C. Chen, R. Panda, and Q. Fan (2021) Regionvit: regional-to-local attention for vision transformers. arXiv preprint arXiv:2106.02689. Cited by: §2.
  • [5] C. R. Chen, Q. Fan, and R. Panda (2021) Crossvit: cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 357–366. Cited by: §2.
  • [6] M. Chen, H. Peng, J. Fu, and H. Ling (2021) Autoformer: searching transformers for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 12270–12280. Cited by: §2.
  • [7] S. Chen, E. Xie, C. Ge, D. Liang, and P. Luo (2021) Cyclemlp: a mlp-like architecture for dense prediction. arXiv preprint arXiv:2107.10224. Cited by: §2.
  • [8] Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan, and Z. Liu (2021) Mobile-former: bridging mobilenet and transformer. arXiv preprint arXiv:2108.05895. Cited by: §1, §2.
  • [9] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar (2021) Masked-attention mask transformer for universal image segmentation. arXiv preprint arXiv:2112.01527. Cited by: §1.
  • [10] X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen (2021) Twins: revisiting spatial attention design in vision transformers. arXiv e-prints, pp. arXiv–2104. Cited by: §2.
  • [11] CoreMLTools. (2021) Use coremltools to convert models from third-party libraries to core ml.. External Links: Link Cited by: 1st item.
  • [12] Z. Dai, H. Liu, Q. V. Le, and M. Tan (2021) Coatnet: marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems 34, pp. 3965–3977. Cited by: §2.
  • [13] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §5.1.
  • [14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. ICLR. Cited by: §1, §2, §3.
  • [15] P. Esser, R. Rombach, and B. Ommer (2021) Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12873–12883. Cited by: §2.
  • [16] M. Fayyaz, S. A. Kouhpayegani, F. R. Jafari, E. Sommerlade, H. R. V. Joze, H. Pirsiavash, and J. Gall (2021) Ats: adaptive token sampling for efficient vision transformers. arXiv preprint arXiv:2111.15667. Cited by: §2.
  • [17] C. Gong, D. Wang, M. Li, X. Chen, Z. Yan, Y. Tian, qiang liu, and V. Chandra (2022) NASVit: neural architecture search for efficient vision transformers with gradient conflict aware supernet training. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2, §3.
  • [18] B. Graham, A. El-Nouby, H. Touvron, P. Stock, A. Joulin, H. Jegou, and M. Douze (2021-10) LeViT: a vision transformer in convnet’s clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12259–12269. Cited by: Appendix B, §1, §2, 1st item, 2nd item, §3, §3, §3.
  • [19] J. Guo, K. Han, H. Wu, C. Xu, Y. Tang, C. Xu, and Y. Wang (2021) Cmt: convolutional neural networks meet vision transformers. arXiv preprint arXiv:2107.06263. Cited by: §2.
  • [20] Q. Han, Z. Fan, Q. Dai, L. Sun, M. Cheng, J. Liu, and J. Wang (2021) On the connection between local attention and dynamic depth-wise convolution. In International Conference on Learning Representations, Cited by: §2.
  • [21] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi (2021) Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704. Cited by: §2, §3.
  • [22] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §5.2.
  • [23] B. Heo, S. Yun, D. Han, S. Chun, J. Choe, and S. J. Oh (2021) Rethinking spatial dimensions of vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11936–11945. Cited by: §2.
  • [24] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
  • [25] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al. (2019) Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314–1324. Cited by: §1.
  • [26] S. Jaszczur, A. Chowdhery, A. Mohiuddin, L. Kaiser, W. Gajewski, H. Michalewski, and J. Kanerva (2021) Sparse is enough in scaling transformers. Advances in Neural Information Processing Systems 34, pp. 9895–9907. Cited by: §1.
  • [27] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1, §5.2, §5.2.
  • [28] A. Kirillov, R. Girshick, K. He, and P. Dollár (2019) Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6399–6408. Cited by: §5.2.
  • [29] N. Kitaev, L. Kaiser, and A. Levskaya (2020) Reformer: the efficient transformer. In ICLR, Cited by: §2.
  • [30] K. Lee, H. Chang, L. Jiang, H. Zhang, Z. Tu, and C. Liu (2021) Vitgan: training gans with vision transformers. arXiv preprint arXiv:2107.04589. Cited by: §2.
  • [31] S. H. Lee, S. Lee, and B. C. Song (2021) Vision transformer for small-size datasets. arXiv preprint arXiv:2112.13492. Cited by: §2.
  • [32] W. Li, X. Wang, X. Xia, J. Wu, X. Xiao, M. Zheng, and S. Wen (2022) SepViT: separable vision transformer. CoRR abs/2203.15380. Cited by: §2.
  • [33] Y. Li, C. Wu, H. Fan, K. Mangalam, B. Xiong, J. Malik, and C. Feichtenhofer (2021) Improved multiscale vision transformers for classification and detection. arXiv preprint arXiv:2112.01526. Cited by: §1.
  • [34] Y. Li, K. Zhang, J. Cao, R. Timofte, and L. Van Gool (2021) Localvit: bringing locality to vision transformers. arXiv preprint arXiv:2104.05707. Cited by: §2.
  • [35] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §5.2.
  • [36] H. Liu, K. Simonyan, and Y. Yang (2018) DARTS: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §4.2, §4.2.
  • [37] X. Liu, J. Pool, S. Han, and W. J. Dally (2018) Efficient sparse-winograd convolutional neural networks. arXiv preprint arXiv:1802.06367. Cited by: §1, §3.
  • [38] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong, et al. (2021) Swin transformer v2: scaling up capacity and resolution. arXiv preprint arXiv:2111.09883. Cited by: §1.
  • [39] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022. Cited by: §1, §2, §3, §3.
  • [40] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu (2021) Video swin transformer. arXiv preprint arXiv:2106.13230. Cited by: §1.
  • [41] I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §5.1, §5.2, §5.2.
  • [42] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pp. 116–131. Cited by: 3rd item.
  • [43] S. Mehta and M. Rastegari (2021) MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178. Cited by: 1st item, §1, §2, §5.
  • [44] L. Meng, H. Li, B. Chen, S. Lan, Z. Wu, Y. Jiang, and S. Lim (2021) AdaViT: adaptive vision transformers for efficient image recognition. arXiv preprint arXiv:2111.15668. Cited by: §1.
  • [45] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32. Cited by: §5.
  • [46] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár (2020) Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10428–10436. Cited by: §5.1.
  • [47] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C. Hsieh (2021) DynamicViT: efficient vision transformers with dynamic token sparsification. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2.
  • [48] C. Renggli, A. S. Pinto, N. Houlsby, B. Mustafa, J. Puigcerver, and C. Riquelme (2022) Learning to merge tokens in vision transformers. CoRR abs/2202.12015. Cited by: §2.
  • [49] B. Roh, J. Shin, W. Shin, and S. Kim (2021) Sparse detr: efficient end-to-end object detection with learnable sparsity. arXiv preprint arXiv:2111.14330. Cited by: §1.
  • [50] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §1, §2.
  • [51] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: 3rd item.
  • [52] M. Tan and Q. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In International conference on machine learning, pp. 6105–6114. Cited by: §2.
  • [53] I. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, M. Lucic, and A. Dosovitskiy (2021) MLP-mixer: an all-mlp architecture for vision. arXiv preprint arXiv:2105.01601. Cited by: §2.
  • [54] I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit, et al. (2021) Mlp-mixer: an all-mlp architecture for vision. Advances in Neural Information Processing Systems 34. Cited by: §2.
  • [55] H. Touvron, P. Bojanowski, M. Caron, M. Cord, A. El-Nouby, E. Grave, G. Izacard, A. Joulin, G. Synnaeve, J. Verbeek, et al. (2021) Resmlp: feedforward networks for image classification with data-efficient training. arXiv preprint arXiv:2105.03404. Cited by: §2.
  • [56] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou (2021) Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. Cited by: Appendix D, §1, §2, 2nd item, §3, §5.1.
  • [57] H. Touvron, M. Cord, A. El-Nouby, J. Verbeek, and H. Jégou (2022) Three things everyone should know about vision transformers. arXiv preprint arXiv:2203.09795. Cited by: §1.
  • [58] H. Touvron, M. Cord, and H. Jégou (2022) DeiT iii: revenge of the vit. arXiv preprint arXiv:2204.07118. Cited by: §1.
  • [59] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou (2021) Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 32–42. Cited by: §2.
  • [60] A. Trockman and J. Z. Kolter (2022) Patches are all you need?. arXiv preprint arXiv:2201.09792. Cited by: §3.
  • [61] Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, and Y. Li (2022) MaxViT: multi-axis vision transformer. CoRR abs/2204.01697. Cited by: §2.
  • [62] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §1, §2.
  • [63] H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, and S. Han (2020) Hat: hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187. Cited by: 3rd item.
  • [64] W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578. Cited by: §2.
  • [65] W. Wang, L. Yao, L. Chen, B. Lin, D. Cai, X. He, and W. Liu (2021) CrossFormer: a versatile vision transformer hinging on cross-scale attention. arXiv preprint arXiv:2108.00154. Cited by: §2.
  • [66] X. Wang, L. L. Zhang, Y. Wang, and M. Yang (2022) Towards efficient vision transformer inference: a first study of transformers on mobile devices. In Proceedings of the 23rd Annual International Workshop on Mobile Computing Systems and Applications, pp. 1–7. Cited by: §1, §2, §3, §3.
  • [67] R. Wightman (2019) PyTorch image models. GitHub. Note: https://github.com/rwightman/pytorch-image-models External Links: Document Cited by: §5.
  • [68] C. Wu, F. Wu, T. Qi, B. Jiao, D. Jiang, Y. Huang, and X. Xie (2021) Smart bird: learnable sparse attention for efficient and effective transformer. arXiv preprint arXiv:2108.09193. Cited by: §1.
  • [69] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang (2021) Cvt: introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22–31. Cited by: §3.
  • [70] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo (2021) SegFormer: simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203. Cited by: §1, §2, §5.2.
  • [71] T. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam (2018) Netadapt: platform-aware neural network adaptation for mobile applications. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 285–300. Cited by: §4.2.
  • [72] W. Yu, M. Luo, P. Zhou, C. Si, Y. Zhou, X. Wang, J. Feng, and S. Yan (2021) Metaformer is actually what you need for vision. arXiv preprint arXiv:2111.11418. Cited by: Appendix B, Appendix B, §1, §2, 1st item, §3, §3, §4.1, §5.1, §5.2, §5.2, §5.3, §5.
  • [73] K. Yuan, S. Guo, Z. Liu, A. Zhou, F. Yu, and W. Wu (2021) Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 579–588. Cited by: §3.
  • [74] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. Tay, J. Feng, and S. Yan (2021) Tokens-to-token vit: training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 558–567. Cited by: §2.
  • [75] Y. Zeng, H. Yang, H. Chao, J. Wang, and J. Fu (2021) Improving visual quality of image synthesis by a token-based generator with transformers. Advances in Neural Information Processing Systems 34. Cited by: §2.
  • [76] W. Zhang, Z. Huang, G. Luo, T. Chen, X. Wang, W. Liu, G. Yu, and C. Shen (2022) TopFormer: token pyramid transformer for mobile semantic segmentation. External Links: 2204.05525 Cited by: §2.
  • [77] Z. Zhang, H. Zhang, L. Zhao, T. Chen, S. Arik, and T. Pfister (2022) Nested hierarchical transformer: towards accurate, data-efficient and interpretable visual understanding. Cited by: §2.
  • [78] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §5.2.
  • [79] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba (2019) Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision 127 (3), pp. 302–321. Cited by: §5.2.
  • [80] D. Zhou, B. Kang, X. Jin, L. Yang, X. Lian, Z. Jiang, Q. Hou, and J. Feng (2021) Deepvit: towards deeper vision transformer. arXiv preprint arXiv:2103.11886. Cited by: §2.
  • [81] Q. Zhou, K. Sheng, X. Zheng, K. Li, X. Sun, Y. Tian, J. Chen, and R. Ji (2022) Training-free transformer architecture search. arXiv preprint arXiv:2203.12217. Cited by: §2.
  • [82] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2020) Deformable detr: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159. Cited by: §1.

Appendix

Appendix A Latency-Driven Slimming Algorithm

We provide the details of the proposed latency-driven fast slimming in Alg. 1. Formulations of the algorithm can be found in Sec. 4.2. The proposed latency-driven slimming is speed-oriented and does not require retraining for each sub-network. The importance score for each design choice is estimated based on the trainable architecture parameter α.

Require: latency lookup table, supernet, target latency budget
Ensure: sub-network whose final latency satisfies the budget
Super-net Pretraining:
for each epoch do
     for each training iteration do
          for each MetaPath MP do
               sample ε and compute the Gumbel Softmax weights over candidate blocks (Eqn. 8)
          end for
          forward the supernet and compute the loss,
          backpropagate, update the network weights and architecture parameters α
     end for
end for ▷ get trained super-net
Latency-driven slimming:
Initialize action space A = {Depth Reduction (DR), Width Reduction (WR), MB^{3D} Reduction (MR)}
Compute the importance of each MP and Stage from α
while current latency > target latency do
     evaluate the latency gain (lookup table) and accuracy drop of each action in A
     execute the action with the smallest accuracy drop per unit of latency saved
end while ▷ get sub-net with target latency
Train the searched architecture from scratch:
Similar to super-net training. ▷ get final model
Algorithm 1 Fast Latency-Driven Slimming based on Importance Estimations

Appendix B Ablation Analysis

Our major conclusions and speed analysis can be found in Sec. 3 and Fig. 2. Here we include more ablation studies for different design choices, provided in Tab. 3, taking the EfficientFormer-L3 as an example. The latency is measured on iPhone 12 with CoreML, and the top-1 accuracy is obtained from the ImageNet-1K dataset.

Patch Embedding.

Compared to non-overlapping large-kernel patch embedding (V5 in Tab. 3), the proposed convolution stem in EfficientFormer (V1 in Tab. 3) greatly reduces inference latency by 3.3 ms (from 5.8 ms to 2.5 ms) while providing 0.7% higher accuracy. We demonstrate that the convolution stem [18] is not only beneficial to model convergence and accuracy but also boosts inference speed on mobile devices by a large margin, and thus can serve as a good alternative to non-overlapping patch embedding implementations.

MHSA and Latency-Driven Search.

Without the proposed 3D MHSA and latency-driven search, EfficientFormer downgrades to a pure 4D design with the pool mixer, which is similar to PoolFormer [72] (the patch embeddings and normalizations are different). By comparing EfficientFormer with V1 in Tab. 3, we can observe that the integration of 3D MHSA and latency-driven search greatly boosts top-1 accuracy by 2.1% with minimal impact on inference speed (0.5 ms). The results prove that MHSA with a global receptive field is an essential contribution to model performance. As a result, though enjoying faster inference speed, simply removing MHSA [72] greatly limits the performance upper bound. In EfficientFormer, we smoothly integrate MHSA in a dimension-consistent manner, obtaining better performance while simultaneously achieving ultra-fast inference speed.

Normalization.

Apart from the CONV-BN structure in the 4D partition of EfficientFormer, we explore Group Normalization (GN) in the 4D partition as employed in prior work [72]. Note that the channel-wise GN proposed in [72] has an effect equivalent to LN. By comparing V1 and V2 in Tab. 3, we observe that GN only slightly improves accuracy (0.3% top-1) but incurs a 0.5 ms latency overhead, as it cannot be folded at the inference stage. As a result, we apply the CONV-BN structure in the entire 4D partition of EfficientFormer.

Activation Functions.

We explore ReLU and HardSwish (V3 and V4 in Tab. 3) in addition to the GeLU employed in this work (V1 in Tab. 3). It is widely agreed that ReLU is the simplest and fastest activation function, while GeLU and HardSwish wield better performance. We observe that ReLU can hardly provide any speedup over GeLU on iPhone 12 with CoreMLTools, while HardSwish is significantly slower than ReLU and GeLU. We conclude that the activation function can be selected on a case-by-case basis depending on the specific hardware and compiler. In this work, we use GeLU to provide better performance than ReLU while executing comparably fast. For a fair comparison, we modify inefficient operators in other works according to the support from iPhone 12 and CoreMLTools, e.g., reporting LeViT latency with HardSwish changed to GeLU.

Model CONV stem Norm. Activation MHSA Search Top-1 (%) Latency (ms)
EfficientFormer ✓ BN GeLU ✓ ✓ 82.4 3.0
V1 ✓ BN GeLU ✗ ✗ 80.3 2.5
V2 ✓ GN GeLU ✗ ✗ 80.6 3.0
V3 ✓ BN ReLU ✗ ✗ 79.3 2.5
V4 ✓ BN HardSwish ✗ ✗ 80.3 32.4
V5 ✗ BN GeLU ✗ ✗ 79.6 5.8
Table 3: Ablation analysis of design choices on EfficientFormer-L3. V1-V5 refer to variants with different operator selections; ✓/✗ marks whether the convolution stem, MHSA, and latency-driven search are used.

Appendix C Improving Training Recipe

We provide ImageNet-1K results with 1,000 training epochs in Tab. 4 using EfficientFormer-L1. We observe that, compared to the standard 300-epoch training recipe, we can further boost the performance of EfficientFormer-L1 by 1.0% top-1 accuracy, which lets our 1.6 ms model achieve over 80% top-1 accuracy, outperforming lightweight CNNs, e.g., MobileNet and EfficientNet, by a large margin. We demonstrate that EfficientFormer still wields the potential to achieve even better performance with a stronger training recipe.

Model Params (M) MACs (G) Train Epoch Top-1 (%) Latency (ms)
DeiT-T 5.9 1.2 1,000 76.6 9.2
DeiT-S 22.5 4.5 1,000 82.6 11.8
LeViT-256 18.9 1.1 1,000 81.6 11.9
EfficientFormer-L1 12.2 1.2 1,000 80.2 (+1.0) 1.6
Table 4: Performance of EfficientFormer with 1,000 training epochs. EfficientFormer-L1 achieves 1.0% higher accuracy than the model trained with 300 epochs.

Appendix D Architecture of EfficientFormers

The detailed network architecture for EfficientFormer-L1, EfficientFormer-L3, and EfficientFormer-L7 is provided in Tab. 5. We report the resolution and number of blocks for each stage. In addition, the width of EfficientFormer is specified by the embedding dimension (Embed. Dim.). For the MHSA block, the dimension of Queries and Keys is provided, and we employ eight heads for all EfficientFormer variants. The MLP expansion ratio is set to the default (4), as in most ViT works [56].

Stage Resolution Type Config L1 L3 L7
stem H/2 x W/2 Patch Embed. Embed. Dim. 24 32 48
stem H/4 x W/4 Patch Embed. Embed. Dim. 48 64 96
1 H/4 x W/4 Token Mixer Pool
2 H/8 x W/8 Patch Embed. Embed. Dim. 96 128 192
2 H/8 x W/8 Token Mixer Pool
3 H/16 x W/16 Patch Embed. Embed. Dim. 224 320 384
3 H/16 x W/16 Token Mixer Pool
4 H/32 x W/32 Patch Embed. Embed. Dim. 448 512 768
4 H/32 x W/32 Token Mixer Pool
4 H/32 x W/32 Token Mixer MHSA
Table 5: Architecture details of EfficientFormer. Each Patch Embed. is a stride-2 convolution; the Resolution column lists feature sizes relative to the H x W input. The per-stage numbers of Pool and MHSA blocks are determined by the latency-driven search (Sec. 4.2); the MLP expansion ratio and Query/Key dimension are as described in Appendix D.