[NeurIPS 2022] EfficientFormer: Vision Transformers at MobileNet Speed
Vision Transformers (ViT) have shown rapid progress in computer vision tasks, achieving promising results on various benchmarks. However, due to the massive number of parameters and model design, e.g., the attention mechanism, ViT-based models are generally several times slower than lightweight convolutional networks. Therefore, the deployment of ViT for real-time applications is particularly challenging, especially on resource-constrained hardware such as mobile devices. Recent efforts try to reduce the computation complexity of ViT through network architecture search or hybrid design with MobileNet blocks, yet the inference speed is still unsatisfactory. This leads to an important question: can transformers run as fast as MobileNet while obtaining high performance? To answer this, we first revisit the network architecture and operators used in ViT-based models and identify inefficient designs. Then we introduce a dimension-consistent pure transformer (without MobileNet blocks) as a design paradigm. Finally, we perform latency-driven slimming to get a series of final models dubbed EfficientFormer. Extensive experiments show the superiority of EfficientFormer in performance and speed on mobile devices. Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on ImageNet-1K with only 1.6 ms inference latency on iPhone 12 (compiled with CoreML), which is even a bit faster than MobileNetV2 (1.7 ms, 71.8% top-1), and our largest model, EfficientFormer-L7, obtains 83.3% accuracy with only 7.0 ms latency. Our work proves that properly designed transformers can reach extremely low latency on mobile devices while maintaining high performance.
The transformer architecture Vaswani et al. (2017), initially designed for Natural Language Processing (NLP) tasks, introduces the Multi-Head Self Attention (MHSA) mechanism that allows the network to model long-term dependencies and is easy to parallelize. In this context, Dosovitskiy et al. (2021) adapt the attention mechanism to 2D images and propose Vision Transformer (ViT): the input image is divided into non-overlapping patches, and the inter-patch representations are learned through MHSA without inductive bias. ViTs demonstrate promising results compared to convolutional neural networks (CNNs) on computer vision tasks. Following this success, several efforts explore the potential of ViT by improving training strategies Touvron et al. (2021b, 2022b, 2022a), introducing architecture changes Yu et al. (2021); Meng et al. (2021), redesigning attention mechanisms Liu et al. (2021b); Jaszczur et al. (2021), and elevating the performance of various vision tasks such as classification Liu et al. (2021a, c); Caron et al. (2021), segmentation Xie et al. (2021); Cheng et al. (2021), and detection Carion et al. (2020); Li et al. (2021a).
On the downside, transformer models are usually several times slower than competitive CNNs Wang et al. (2022); Mehta and Rastegari (2021). Many factors limit the inference speed of ViT, including the massive number of parameters, computation complexity that grows quadratically with token length, non-foldable normalization layers, and the lack of compiler-level optimizations (e.g., Winograd for CNN Liu et al. (2018b)). The high latency makes transformers impractical for real-world applications on resource-constrained hardware, such as augmented or virtual reality applications on mobile devices and wearables. As a result, lightweight CNNs Howard et al. (2017); Sandler et al. (2018); Howard et al. (2019) remain the default choice for real-time inference.
To alleviate the latency bottleneck of transformers, many approaches have been proposed. For instance, some efforts consider designing new architectures or operations by replacing the linear layers with convolutional layers (CONV) Graham et al. (2021), combining self-attention with MobileNet blocks Chen et al. (2021e), or introducing sparse attention Wu et al. (2021a); Roh et al. (2021); Zhu et al. (2020) to reduce the computational cost, while other efforts leverage network search algorithms Gong et al. (2022) or pruning Chavan et al. (2022) to improve efficiency. Although the computation-performance trade-off has been improved by existing works, the fundamental question that relates to the applicability of transformer models remains unanswered: Can powerful vision transformers run at MobileNet speed and become a default option for edge applications? This work provides a study towards the answer through the following contributions:
First, we revisit the design principles of ViT and its variants through latency analysis (Sec. 3). Following existing work Mehta and Rastegari (2021), we utilize iPhone 12 as the testbed and the publicly available CoreML CoreMLTools (2021) as the compiler, since this mobile device is widely used and the results can be easily reproduced.
Second, based on our analysis, we identify inefficient designs and operators in ViT and propose a new dimension-consistent design paradigm for vision transformers (Sec. 4.1).
Third, starting from a supernet with the new design paradigm, we propose a simple yet effective latency-driven slimming method to obtain a new family of models, namely, EfficientFormers (Sec. 4.2). We directly optimize for inference speed instead of MACs or number of parameters Ma et al. (2018); Tan et al. (2019); Wang et al. (2020).
Our fastest model, EfficientFormer-L1, achieves 79.2% top-1 accuracy on the ImageNet-1K Deng et al. (2009) classification task with only 1.6 ms inference time (averaged over 1,000 runs), running at lower latency with higher top-1 accuracy compared to MobileNetV2 (more results in Fig. 1 and Tab. 1). The promising results demonstrate that latency is no longer an obstacle for the widespread adoption of vision transformers. Our largest model, EfficientFormer-L7, achieves 83.3% accuracy with only 7.0 ms latency, outperforming ViT-MobileNet hybrid designs (MobileViT-XS, 74.8%, 7.2 ms) by a large margin. Additionally, we observe superior performance by employing EfficientFormer as the backbone in image detection and segmentation benchmarks (Tab. 2). We provide a preliminary answer to the aforementioned question: ViTs can achieve ultra-fast inference speed and wield powerful performance at the same time. We hope EfficientFormer can serve as a strong baseline and inspire follow-up works on the edge deployment of vision transformers.
Transformers are initially proposed to handle the learning of long sequences in NLP tasks Vaswani et al. (2017). Dosovitskiy et al. (2021) and Carion et al. (2020) adapt the transformer architecture to classification and detection, respectively, and achieve competitive performance against CNN counterparts with stronger training techniques and larger-scale datasets. DeiT Touvron et al. (2021b) further improves the training pipeline with the aid of distillation, eliminating the need for large-scale pretraining Yuan et al. (2021b). Inspired by the competitive performance and global receptive field of transformer models, follow-up works are proposed to refine the architecture Wang et al. (2021a); Touvron et al. (2021c), explore the relationship between CONV nets and ViT Guo et al. (2021); Dai et al. (2021); Han et al. (2021), and adapt ViT to different computer vision tasks Xie et al. (2021); Zhang et al. (2022b, a); Lee et al. (2021b, a); Esser et al. (2021); Zeng et al. (2021). Other research efforts explore the essence of the attention mechanism and propose insightful variants of the token mixer, e.g., local attention Liu et al. (2021b), spatial MLP Touvron et al. (2021a); Tolstikhin et al. (2021a), and pooling-mixer Yu et al. (2021).
Despite the success in most vision tasks, ViT-based models cannot compete with the well-studied lightweight CNNs Sandler et al. (2018); Tan and Le (2019) when the inference speed is the major concern Tolstikhin et al. (2021b); Chen et al. (2021d); Zhou et al. (2021), especially on resource-constrained edge devices Wang et al. (2022). To accelerate ViT, many approaches have been introduced with different methodologies, such as proposing new architectures or modules Kitaev et al. (2020); Chen et al. (2021b); Hassani et al. (2021); Fayyaz et al. (2021); Li et al. (2022); Renggli et al. (2022), re-thinking self-attention and sparse-attention mechanisms Wang et al. (2021b); Heo et al. (2021); Chen et al. (2021a); Li et al. (2021b); Chu et al. (2021); Rao et al. (2021); Tu et al. (2022), and utilizing search algorithms that are widely explored in CNNs to find smaller and faster ViTs Chen et al. (2021c); Gong et al. (2022); Chavan et al. (2022); Zhou et al. (2022). Recently, LeViT Graham et al. (2021) proposes a CONV-clothing design to accelerate the vision transformer. However, in order to perform MHSA, the 4D features need to be frequently reshaped into flat patches, which is still expensive to compute on edge resources (Fig. 2). Likewise, MobileViT Mehta and Rastegari (2021) introduces a hybrid architecture that combines lightweight MobileNet blocks (with point-wise and depth-wise CONV) and MHSA blocks; the former are placed at early stages in the network pipeline to extract low-level features, while the latter are placed in late stages to enjoy the global receptive field. A similar approach has been explored by several works Chen et al. (2021e); Gong et al. (2022) as a straightforward strategy to reduce computation.
Different from existing works, we aim at pushing the latency-performance boundary of pure vision transformers instead of relying on hybrid designs, and directly optimize for mobile latency. Through our detailed analysis (Sec. 3), we propose a new design paradigm (Sec. 4.1), which can be further elevated through architecture search (Sec. 4.2).
Most existing approaches optimize the inference speed of transformers through computation complexity (MACs) or throughput (images/sec) obtained on server GPUs Graham et al. (2021); Gong et al. (2022). However, such metrics do not reflect the real on-device latency. To have a clear understanding of which operations and design choices slow down the inference of ViTs on edge devices, we perform a comprehensive latency analysis over a number of models and operations, as shown in Fig. 2, whereby the following observations are drawn.
Observation 1: Patch embedding with large kernel and stride is a speed bottleneck on mobile devices.
A common belief is that the computation cost of the patch embedding layer in a transformer network is unremarkable or negligible Dosovitskiy et al. (2021); Yu et al. (2021). However, our comparison in Fig. 2 between models with large kernel and stride for patch embedding, i.e., DeiT-S Touvron et al. (2021b) and PoolFormer-s24 Yu et al. (2021), and the models without it, i.e., LeViT-256 Graham et al. (2021) and EfficientFormer, shows that patch embedding is instead a speed bottleneck on mobile devices.
Large-kernel convolutions are not well supported by most compilers and cannot be accelerated through existing algorithms like Winograd Liu et al. (2018b). Alternatively, the non-overlapping patch embedding can be replaced by a convolution stem with fast downsampling Wu et al. (2021b); Yuan et al. (2021a); Graham et al. (2021) that consists of several hardware-efficient convolutions (Fig. 3).
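For concreteness, here is a minimal PyTorch sketch of such a convolution stem, assuming an illustrative output width of 48 (not a searched configuration): two 3×3, stride-2 CONV-BN-GeLU layers yielding an overall stride of 4.

```python
import torch
import torch.nn as nn

def conv_stem(out_channels: int) -> nn.Sequential:
    """Two 3x3/stride-2 CONV-BN-GeLU layers; overall downsampling of 4x."""
    mid = out_channels // 2
    return nn.Sequential(
        nn.Conv2d(3, mid, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(mid),
        nn.GELU(),
        nn.Conv2d(mid, out_channels, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_channels),
        nn.GELU(),
    )

x = torch.randn(1, 3, 224, 224)
print(conv_stem(48)(x).shape)  # torch.Size([1, 48, 56, 56])
```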
Observation 2: Consistent feature dimension is important for the choice of token mixer. MHSA is not necessarily a speed bottleneck.
Recent work extends ViT-based models to the MetaFormer architecture Yu et al. (2021) consisting of MLP blocks and unspecified token mixers. Selecting a token mixer is an essential design choice when building ViT-based models. The options are many—the conventional MHSA mixer with a global receptive field, more sophisticated shifted window attention Liu et al. (2021b), or a non-parametric operator like pooling Yu et al. (2021).
We narrow the comparison to the two token mixers, pooling and MHSA, where we choose the former for its simplicity and efficiency, while the latter for better performance. More complicated token mixers like shifted window Liu et al. (2021b) are currently not supported by most public mobile compilers and we leave them outside our scope. Furthermore, we do not use depth-wise convolution to replace pooling Trockman and Kolter (2022) as we focus on building architecture without the aid of lightweight convolutions.
To understand the latency of the two token mixers, we perform the following two comparisons:
First, by comparing PoolFormer-s24 Yu et al. (2021) and LeViT-256 Graham et al. (2021), we observe that the Reshape operation is a bottleneck for LeViT-256. The majority of LeViT-256 is implemented with CONV on 4D tensors, requiring frequent reshaping operations when forwarding features into MHSA, since the attention has to be performed on patchified 3D tensors (discarding the extra dimension of attention heads). The extensive usage of Reshape limits the speed of LeViT on mobile devices (Fig. 2). On the other hand, pooling naturally suits the 4D tensor when the network primarily consists of CONV-based implementations, e.g., CONV as the MLP implementation and a CONV stem for downsampling. As a result, PoolFormer exhibits faster inference speed.
Second, by comparing DeiT-S Touvron et al. (2021b) and LeViT-256 Graham et al. (2021), we find that MHSA does not bring significant overhead on mobiles if the feature dimensions are consistent and Reshape is not required. Though much more computation intensive, DeiT-S with a consistent 3D feature can achieve comparable speed to the new ViT variant, i.e., LeViT-256.
In this work, we propose a dimension-consistent network (Sec. 4.1) with both 4D feature implementation and 3D MHSA, where the inefficient frequent Reshape operations are eliminated.
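The following sketch (shapes are illustrative) shows the round trip that a dimension-inconsistent design incurs at every CONV-to-attention boundary; EfficientFormer instead performs the flattening only once, at the 4D-to-3D transition.

```python
import torch
import torch.nn.functional as F

B, C, H, W = 1, 224, 14, 14
x4d = torch.randn(B, C, H, W)

# 4D token mixing (pooling) stays in the CONV layout -- no Reshape needed.
mixed = F.avg_pool2d(x4d, kernel_size=3, stride=1, padding=1)

# MHSA needs patchified 3D tensors, so interleaving it with CONV blocks
# forces a flatten/transpose before attention ...
x3d = x4d.flatten(2).transpose(1, 2)              # (B, H*W, C) tokens
# ... and a reshape back afterwards -- the costly round trip on mobile.
x4d_back = x3d.transpose(1, 2).reshape(B, C, H, W)
```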
Observation 3: CONV-BN is more latency-favorable than LN-Linear and the accuracy drawback is generally acceptable.
Choosing the MLP implementation is another essential design choice. Usually, one of two options is selected: layer normalization (LN) with 3D linear projection (proj.), or CONV 1×1 with batch normalization (BN). CONV-BN is more latency-favorable because BN can be folded into the preceding convolution for inference speedup, while LN still computes input statistics at the inference phase, thus contributing to latency. Based on our experimental results and previous work Wang et al. (2022), the latency introduced by LN constitutes a considerable portion of the overall network latency.
Based on our ablation study in Appendix Tab. 3, CONV-BN only slightly downgrades performance compared to LN. In this work, we apply CONV-BN as much as possible (in all latent 4D features) for the latency gain with a negligible performance drop, while using LN for the 3D features, which aligns with the original MHSA design in ViT and yields better accuracy.
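To make the folding argument concrete, below is a minimal sketch of the standard CONV-BN fusion algebra (our own illustration, not the paper's code): at inference, the BN affine transform is absorbed into the convolution weights and bias, so the BN layer disappears entirely.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fuse an eval-mode BN into the preceding convolution."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

conv, bn = nn.Conv2d(16, 32, 1), nn.BatchNorm2d(32).eval()
x = torch.randn(1, 16, 8, 8)
assert torch.allclose(bn(conv(x)), fold_bn(conv, bn)(x), atol=1e-5)
```

No such offline fusion exists for LN, since its statistics depend on each input.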
Observation 4: The latency of nonlinearity is hardware and compiler dependent.
Lastly, we study nonlinearity, including GeLU, ReLU, and HardSwish. Previous work Wang et al. (2022) suggests GeLU is not efficient on hardware and slows down inference. However, we observe that GeLU is well supported on iPhone 12 and hardly slower than its counterpart, ReLU. On the contrary, HardSwish is surprisingly slow in our experiments and may not be well supported by the compiler (LeViT-256 runs several times slower with HardSwish than with GeLU). We conclude that nonlinearity should be determined on a case-by-case basis given the specific hardware and compiler at hand. We believe that most activations will be well supported in the future. In this work, we employ GeLU activations.
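A rough host-side timing sketch of how such an activation comparison can be run (illustrative only; the paper's numbers come from compiling with CoreMLTools and profiling on the phone):

```python
import time
import torch

x = torch.randn(64, 448, 14, 14)
for act in (torch.nn.GELU(), torch.nn.ReLU(), torch.nn.Hardswish()):
    start = time.perf_counter()
    for _ in range(100):
        act(x)
    elapsed_ms = (time.perf_counter() - start) * 1000 / 100
    print(f"{act.__class__.__name__}: {elapsed_ms:.2f} ms/iter")
```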
Based on the latency analysis, we propose the design of EfficientFormer, demonstrated in Fig. 3. The network consists of a patch embedding (PatchEmbed) and a stack of meta transformer blocks, denoted as MB:

$$\mathcal{Y} = \prod_i^m \mathrm{MB}_i\left(\mathrm{PatchEmbed}\left(\mathcal{X}_0^{B,3,H,W}\right)\right), \tag{1}$$

where $\mathcal{X}_0$ is the input image with batch size $B$ and spatial size $[H, W]$, $\mathcal{Y}$ is the desired output, and $m$ is the total number of blocks (depth). MB consists of an unspecified token mixer (TokenMixer) followed by an MLP block and can be expressed as follows:

$$\mathcal{X}_{i+1} = \mathrm{MB}_i(\mathcal{X}_i) = \mathrm{MLP}\left(\mathrm{TokenMixer}(\mathcal{X}_i)\right), \tag{2}$$

where $\mathcal{X}_i$ is the intermediate feature forwarded into the $i$-th MB. We further define a Stage (or S) as the stack of several MetaBlocks that process features with the same spatial size, e.g., $N_1\times$ in Fig. 3 denotes that $S_1$ has $N_1$ MetaBlocks. The network includes 4 Stages. Between consecutive Stages, there is an embedding operation to project the embedding dimension and downsample the token length, denoted as Embedding in Fig. 3. With the above architecture, EfficientFormer is a fully transformer-based model without integrating MobileNet structures. Next, we dive into the details of the network design, specifically, the architecture details and the search algorithm.
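A minimal sketch of the MB abstraction, with the residual connections of Eqns. 4-5 made explicit (the mixer and MLP modules are left unspecified, as in the MetaFormer formulation):

```python
import torch.nn as nn

class MetaBlock(nn.Module):
    """MB: an unspecified token mixer followed by an MLP block."""
    def __init__(self, token_mixer: nn.Module, mlp: nn.Module):
        super().__init__()
        self.token_mixer = token_mixer
        self.mlp = mlp

    def forward(self, x):
        x = x + self.token_mixer(x)  # X_i -> I_i
        x = x + self.mlp(x)          # I_i -> X_{i+1}
        return x
```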
With the observations in Sec. 3, we propose a dimension-consistent design which splits the network into a 4D partition where operators are implemented in a CONV-net style ($\mathrm{MB}^{4D}$), and a 3D partition where linear projections and attentions are performed over 3D tensors to enjoy the global modeling power of MHSA without sacrificing efficiency ($\mathrm{MB}^{3D}$), as shown in Fig. 3. Specifically, the network starts with the 4D partition, while the 3D partition is applied in the last stages. Note that Fig. 3 is just an instance; the actual lengths of the 4D and 3D partitions are specified later through architecture search.
First, input images are processed by a CONV stem with two 3×3 convolutions with stride 2 as patch embedding,

$$\mathcal{X}_1^{B,C_{j|j=1},\frac{H}{4},\frac{W}{4}} = \mathrm{PatchEmbed}\left(\mathcal{X}_0^{B,3,H,W}\right), \tag{3}$$

where $C_j$ is the channel number (width) of the $j$-th stage. Then the network starts with $\mathrm{MB}^{4D}$ with a simple Pool mixer to extract low-level features,

$$\mathcal{I}_i = \mathrm{Pool}\left(\mathcal{X}_i\right) + \mathcal{X}_i, \qquad \mathcal{X}_{i+1} = \mathrm{Conv}_{B}\left(\mathrm{Conv}_{B,G}(\mathcal{I}_i)\right) + \mathcal{I}_i, \tag{4}$$

where $\mathrm{Conv}_{B}$ and $\mathrm{Conv}_{B,G}$ refer to whether the convolution is followed by BN and GeLU, respectively. Note that here we do not employ Group or Layer Normalization (LN) before the Pool mixer as in Yu et al. (2021): since the 4D partition is a CONV-BN based design, there already exists a BN in front of each Pool mixer.
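A PyTorch sketch of this 4D block following Eqn. 4, assuming a 3×3 average-pool mixer and a default expansion ratio of 4 (a minimal illustration, not the released code):

```python
import torch.nn as nn

class MB4D(nn.Module):
    """Sketch of Eqn. 4: I = Pool(X) + X; X' = Conv_B(Conv_BG(I)) + I."""
    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.pool = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.GELU(),  # Conv_{B,G}
            nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim),                # Conv_B
        )

    def forward(self, x):            # x: (B, C, H, W); no reshape needed
        x = self.pool(x) + x         # I_i
        return self.mlp(x) + x       # X_{i+1}
```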
After processing all the $\mathrm{MB}^{4D}$ blocks, we perform a one-time reshaping to transform the feature size and enter the 3D partition. $\mathrm{MB}^{3D}$ follows the conventional ViT structure, as in Fig. 3. Formally,

$$\mathcal{I}_i = \mathrm{Linear}\left(\mathrm{MHSA}\left(\mathrm{Linear}\left(\mathrm{LN}(\mathcal{X}_i)\right)\right)\right) + \mathcal{X}_i, \qquad \mathcal{X}_{i+1} = \mathrm{Linear}\left(\mathrm{Linear}_G\left(\mathrm{LN}(\mathcal{I}_i)\right)\right) + \mathcal{I}_i, \tag{5}$$

where $\mathrm{Linear}_G$ denotes the Linear followed by GeLU, and

$$\mathrm{MHSA}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q \cdot K^{T}}{\sqrt{C_j}} + b\right) \cdot V, \tag{6}$$

where $Q, K, V$ represent the query, key, and values learned by the linear projections, and $b$ is a parameterized attention bias serving as position encoding.
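A sketch of MHSA with the parameterized attention bias $b$ added to the attention logits (head count, token count, and per-head scaling are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AttentionWithBias(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, num_tokens: int = 49):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learned attention bias b, acting as the position encoding.
        self.bias = nn.Parameter(torch.zeros(num_heads, num_tokens, num_tokens))

    def forward(self, x):            # x: (B, N, C) with N == num_tokens
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, heads, N, head_dim)
        attn = q @ k.transpose(-2, -1) / self.head_dim ** 0.5 + self.bias
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```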
Design of Supernet. Based on the dimension-consistent design, we build a supernet for searching efficient models of the network architecture shown in Fig. 3 (Fig. 3 shows an example of a searched final network). In order to represent such a supernet, we define the MetaPath (MP), which is the collection of possible blocks:

$$\mathrm{MP}_{i,\,j} \in \begin{cases} \{\mathrm{MB}_i^{4D}, \mathbb{I}_i\}, & j = 1, 2, \\ \{\mathrm{MB}_i^{4D}, \mathrm{MB}_i^{3D}, \mathbb{I}_i\}, & j = 3, 4, \end{cases} \tag{7}$$

where $\mathbb{I}$ represents the identity path, $j$ denotes the $j$-th Stage, and $i$ denotes the $i$-th block. The supernet can be illustrated by replacing MB in Fig. 3 with MP.
As in Eqn. 7, in $S_1$ and $S_2$ of the supernet, each block can select from $\mathrm{MB}^{4D}$ or $\mathbb{I}$, while in $S_3$ and $S_4$, the block can be $\mathrm{MB}^{3D}$, $\mathrm{MB}^{4D}$, or $\mathbb{I}$. We only enable $\mathrm{MB}^{3D}$ in the last two Stages for two reasons. First, since the computation of MHSA grows quadratically with respect to token length, integrating it in early Stages would largely increase the computation cost. Second, applying the global MHSA to the last Stages aligns with the intuition that early stages in the networks capture low-level features, while late layers learn long-term dependencies.
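A minimal sketch of how a MetaPath can be represented, pairing the candidate blocks of Eqn. 7 with a learnable architecture parameter $\alpha$ (module names are our own illustration):

```python
import torch
import torch.nn as nn

class MetaPath(nn.Module):
    """One searchable block position; all candidates share I/O shapes."""
    def __init__(self, candidates: list):
        super().__init__()
        # e.g. [MB4D, Identity] in S1/S2; [MB4D, MB3D, Identity] in S3/S4.
        self.candidates = nn.ModuleList(candidates)
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))  # block scores
```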
Searching Space. Our search space includes $C_j$ (the width of each Stage), $N_j$ (the number of blocks in each Stage, i.e., depth), and the last $N$ blocks to apply $\mathrm{MB}^{3D}$ to.
Searching Algorithm. Previous hardware-aware network search methods generally rely on deploying each candidate in the search space on hardware to obtain its latency, which is time-consuming Yang et al. (2018). In this work, we propose a simple, fast yet effective gradient-based search algorithm that obtains a candidate network by training the supernet only once. The algorithm has three major steps.
First, we train the supernet with Gumbel Softmax sampling Liu et al. (2018a) to get the importance score for the blocks within each MP, which can be expressed as

$$\mathcal{X}_{i+1} = \sum_n \frac{\exp\left((\alpha_i^n + \epsilon_i^n)/\tau\right)}{\sum_n \exp\left((\alpha_i^n + \epsilon_i^n)/\tau\right)} \cdot \mathrm{MP}_{i,n}(\mathcal{X}_i), \tag{8}$$

where $\alpha_i^n$ evaluates the importance of each block in MP as it represents the probability of selecting a block, e.g., $\mathrm{MB}^{4D}$ or $\mathrm{MB}^{3D}$ for the $i$-th block. $\epsilon$ is the Gumbel noise that ensures exploration, $\tau$ is the temperature, and $n$ represents the type of block in MP, i.e., $n \in \{4D, \mathbb{I}\}$ for $S_1$ and $S_2$, and $n \in \{4D, 3D, \mathbb{I}\}$ for $S_3$ and $S_4$. By using Eqn. 8, the derivatives with respect to the network weights and the architecture parameters $\alpha$ can be computed easily. The training follows the standard recipe (see Sec. 5.1) to obtain the trained weights and architecture parameters $\alpha$.
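Continuing the MetaPath sketch above, the forward pass of Eqn. 8 can be written compactly with PyTorch's built-in Gumbel-Softmax, which adds the noise $\epsilon$ and applies the temperature $\tau$ internally (a simplified illustration):

```python
import torch.nn.functional as F

def metapath_forward(mp, x, tau: float = 1.0):
    # Differentiable sampling: softmax((alpha + gumbel_noise) / tau).
    probs = F.gumbel_softmax(mp.alpha, tau=tau)
    # Weighted sum of candidate outputs (Eqn. 8); gradients flow into alpha.
    return sum(p * cand(x) for p, cand in zip(probs, mp.candidates))
```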
Second, we build a latency lookup table by collecting the on-device latency of $\mathrm{MB}^{4D}$ and $\mathrm{MB}^{3D}$ at different widths (multiples of 16).
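A sketch of how such a lookup table can be built; here blocks are timed on the host CPU with a placeholder CONV block, whereas the paper's entries are measured on-device through CoreMLTools:

```python
import time
import torch
import torch.nn as nn

@torch.no_grad()
def measure_ms(block: nn.Module, x: torch.Tensor, runs: int = 50) -> float:
    for _ in range(5):                    # warm-up iterations
        block(x)
    start = time.perf_counter()
    for _ in range(runs):
        block(x)
    return (time.perf_counter() - start) * 1000 / runs

latency_lut = {}
for width in range(16, 257, 16):          # candidate widths, multiples of 16
    x = torch.randn(1, width, 14, 14)
    block = nn.Sequential(
        nn.Conv2d(width, width, 1), nn.BatchNorm2d(width), nn.GELU()
    ).eval()
    latency_lut[("conv1x1_bn_gelu", width)] = measure_ms(block, x)
```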
Finally, we perform network slimming on the supernet obtained from the first step through latency evaluation using the lookup table. Note that a typical gradient-based search algorithm simply selects the block with the largest $\alpha$ Liu et al. (2018a), which does not fit our scope as it cannot search the width $C_j$. In fact, constructing a multiple-width supernet is memory-consuming and even unrealistic given that each MP has several branches in our design. Instead of directly searching over the complex search space, we perform a gradual slimming on the single-width supernet as follows.
We first define the importance score for each MP$_i$ as $\alpha_i^{4D}/\alpha_i^{\mathbb{I}}$ for $S_1$ and $S_2$, and $(\alpha_i^{4D} + \alpha_i^{3D})/\alpha_i^{\mathbb{I}}$ for $S_3$ and $S_4$, respectively. Similarly, the importance score for each Stage can be obtained by summing up the scores of all MP within the Stage. With the importance score, we define the action space with three options: 1) select $\mathbb{I}$ for the least important MP, 2) remove the first $\mathrm{MB}^{3D}$, and 3) reduce the width of the least important Stage (by multiples of 16). Then, we calculate the resulting latency of each action through the lookup table and evaluate the accuracy drop of each action. Lastly, we choose the action with the smallest accuracy drop per millisecond of latency saved. This process is performed iteratively until the target latency is achieved. We show more details of the algorithm in the Appendix.
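The slimming loop can be summarized as follows; every helper here (estimate_latency, accuracy_drop, and the rest) is a hypothetical stand-in for the paper's procedure, shown only to make the greedy structure explicit:

```python
def slim(supernet, lut, target_ms: float):
    """Greedy latency-driven slimming (pseudocode-level sketch)."""
    while estimate_latency(supernet, lut) > target_ms:          # hypothetical helper
        actions = [
            ("use_identity", least_important_mp(supernet)),     # action 1: I for weakest MP
            ("drop_3d", first_mb3d(supernet)),                  # action 2: remove first MB3D
            ("shrink_width", least_important_stage(supernet)),  # action 3: width -16
        ]
        # Greedily pick the smallest accuracy drop per millisecond saved.
        best = min(actions, key=lambda a: accuracy_drop(supernet, a)
                                          / latency_saving(supernet, a, lut))
        apply_action(supernet, best)
```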
We implement EfficientFormer through PyTorch 1.11 Paszke et al. (2019) and the Timm library Wightman (2019), which is the common practice in recent works Mehta and Rastegari (2021); Yu et al. (2021). Our models are trained on a cluster with NVIDIA A100 and V100 GPUs. The mobile speed is averaged over 1,000 runs on an iPhone 12 equipped with an A14 bionic chip, using all available computing resources (NPU). CoreMLTools is used to deploy the run-time model. We provide detailed network architectures and more ablations in the Appendix.
All EfficientFormer models are trained from scratch on the ImageNet-1K dataset Deng et al. (2009) to perform the image classification task. We employ the standard image size (224×224) for both training and testing. We follow the training recipe from DeiT Touvron et al. (2021b) but mainly report results with 300 training epochs to allow comparison with other ViT-based models. We use the AdamW optimizer Kingma and Ba (2014); Loshchilov and Hutter (2017), warm-up training with 5 epochs, and a cosine annealing learning rate schedule. The initial learning rate is set as $10^{-3} \times (\text{batch size})/1024$ and the minimum learning rate is $10^{-5}$. The teacher model for distillation is RegNetY-16GF Radosavovic et al. (2020) pretrained on ImageNet with 82.9% top-1 accuracy. Results are demonstrated in Tab. 1 and Fig. 1.
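A sketch of the optimizer and schedule described above (the model is a placeholder and the weight decay value is an assumption; the recipe otherwise follows DeiT as stated):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(10, 10)      # placeholder for an EfficientFormer model
epochs, warmup_epochs = 300, 5
# lr = 1e-3 * batch_size / 1024 per the recipe; weight decay is assumed.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=1e-2, total_iters=warmup_epochs),
        CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs, eta_min=1e-5),
    ],
    milestones=[warmup_epochs],
)
```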
Comparison to CNNs. Compared with the widely used CNN-based models, EfficientFormer achieves a better trade-off between accuracy and latency. For example, EfficientFormer-L1 runs at MobileNetV2 speed while achieving substantially higher top-1 accuracy, and EfficientFormer-L3 runs at a similar speed to EfficientNet-B0 while achieving clearly higher top-1 accuracy. Among the models with high performance, EfficientFormer-L7 runs several times faster than EfficientNet-B5, demonstrating the advantageous performance of our models. These results allow us to answer the central question raised earlier: ViTs do not need to sacrifice latency to achieve good performance, and an accurate ViT can still have ultra-fast inference speed as lightweight CNNs do.
Comparison to ViTs. Conventional ViTs still underperform CNNs in terms of latency. For instance, DeiT-Tiny achieves similar accuracy to EfficientNet-B0 while running several times slower. EfficientFormer, in contrast, performs on par with other transformer models while running many times faster. For instance, EfficientFormer-L3 achieves higher accuracy than DeiT-Small while being significantly faster. It is notable that though the recent transformer variant PoolFormer Yu et al. (2021) naturally has a consistent 4D architecture and runs faster compared to typical ViTs, the absence of global MHSA greatly limits its performance upper bound. Even with higher inference latency, PoolFormer-S36 still underperforms EfficientFormer-L3 in top-1 accuracy.
Comparison to Hybrid Designs. Existing hybrid designs, e.g., LeViT-256 and MobileViT, still struggle with the latency bottleneck of ViTs and can hardly outperform lightweight CNNs. For example, LeViT-256 runs slower than DeiT-Small on mobile while having lower top-1 accuracy. For MobileViT, which is a hybrid model with both MHSA and MobileNet blocks, we observe that it is significantly slower than CNN counterparts, e.g., MobileNetV2 and EfficientNet-B0, while its accuracy is not satisfactory either (lower than EfficientNet-B0). Thus, simply trading off MHSA with MobileNet blocks can hardly push forward the Pareto curve, as in Fig. 1. In contrast, EfficientFormer, as a pure transformer-based model, can maintain high performance while achieving ultra-fast inference speed. At a similar inference time, EfficientFormer-L7 outperforms MobileViT-XS by 8.5% top-1 accuracy on ImageNet, demonstrating the superiority of our design.
We follow the implementation of Mask-RCNN He et al. (2017) to integrate EfficientFormer as the backbone and verify performance. We experiment over COCO-2017 Lin et al. (2014), which contains training and validation sets of 118K and 5K images, respectively. The EfficientFormer backbone is initialized with ImageNet-1K pretrained weights. Similar to prior work Yu et al. (2021), we use the AdamW optimizer Kingma and Ba (2014); Loshchilov and Hutter (2017) with an initial learning rate of $2\times10^{-4}$, and train the model for 12 epochs. We set the input size as 1333×800.
The results for detection and instance segmentation are shown in Tab. 2. EfficientFormers consistently outperform CNN (ResNet) and transformer (PoolFormer) backbones. With similar computation cost, EfficientFormer-L3 outperforms the ResNet50 backbone in both box AP and mask AP, and outperforms the PoolFormer-S24 backbone as well, proving that EfficientFormer generalizes well as a strong backbone in vision tasks.
Table 2: Backbone comparison. Results on detection and instance segmentation are obtained from COCO 2017; results on semantic segmentation are obtained from ADE20K.
We further validate the performance of EfficientFormer on the semantic segmentation task. We use the challenging scene parsing dataset ADE20K Zhou et al. (2017, 2019), which contains 20K training images and 2K validation images covering 150 class categories. Similar to existing work Yu et al. (2021), we build EfficientFormer as the backbone along with Semantic FPN Kirillov et al. (2019) as the segmentation decoder for fair comparison. The backbone is initialized with weights pretrained on ImageNet-1K, and the model is trained for 40K iterations with a total batch size of 32 over 8 GPUs. We follow the common practice in segmentation Yu et al. (2021); Xie et al. (2021): we use the AdamW optimizer Kingma and Ba (2014); Loshchilov and Hutter (2017) and apply a poly learning rate schedule with power 0.9, starting from an initial learning rate of $2\times10^{-4}$. We resize and crop input images to 512×512 for training and keep the shorter side as 512 for testing (on the validation set).
As shown in Tab. 2, EfficientFormer consistently outperforms CNN- and transformer-based backbones by a large margin under a similar computation budget. For example, EfficientFormer-L3 outperforms PoolFormer-S24 by a clear mIoU margin. We show that with global attention, EfficientFormer learns better long-term dependencies, which is beneficial in high-resolution dense prediction tasks.
Relationship to MetaFormer. The design of EfficientFormer is partly inspired by the MetaFormer concept Yu et al. (2021). Compared to PoolFormer, EfficientFormer addresses the dimension mismatch problem, which is a root cause of inefficient edge inference, and is thus capable of utilizing global MHSA without sacrificing speed. Consequently, EfficientFormer exhibits advantageous accuracy over PoolFormer. In spite of its fully 4D design, PoolFormer employs inefficient patch embedding and group normalization (Fig. 2), leading to increased latency. Instead, our redesigned 4D partition of EfficientFormer (Fig. 3) is more hardware friendly and exhibits better performance across several tasks.
Limitations. (i) Though most designs in EfficientFormer are general-purpose, e.g., the dimension-consistent design and 4D blocks with CONV-BN fusion, the actual speed of EfficientFormer may vary on other platforms. For instance, if GeLU is not well supported while HardSwish is efficiently implemented on a specific hardware and compiler, the operator may need to be modified accordingly. (ii) The proposed latency-driven slimming is simple and fast. However, better results may be achieved if search cost is not a concern and an enumeration-based brute-force search is performed.
In this work, we show that Vision Transformers can operate at MobileNet speed on mobile devices. Starting from a comprehensive latency analysis, we identify inefficient operators in a series of ViT-based architectures, whereby we draw important observations that guide our new design paradigm. The proposed EfficientFormer complies with a dimension-consistent design that smoothly leverages hardware-friendly 4D MetaBlocks and powerful 3D MHSA blocks. We further propose a fast latency-driven slimming method to derive optimized configurations from our design space. Extensive experiments on image classification, object detection, and segmentation tasks show that EfficientFormer models outperform existing transformer models while being faster than most competitive CNNs. The latency-driven analysis of ViT architecture and the experimental results validate our claim: powerful vision transformers can achieve ultra-fast inference speed on the edge. Future research will further explore the potential of EfficientFormer on several resource-constrained devices.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
J. Guo et al. CMT: convolutional neural networks meet vision transformers. arXiv preprint arXiv:2107.06263.
A. Paszke et al. PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32.
M. Tan and Q. Le. EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114.
The proposed latency-driven slimming is speed-oriented, as it does not require retraining each sub-network. The importance score for each design choice is estimated based on the trainable architecture parameter.
Our major conclusions and speed analysis can be found in Sec. 3 and Fig. 2. Here we include more ablation studies for different design choices, provided in Tab. 3, taking EfficientFormer-L3 as an example. The latency is measured on iPhone 12 with CoreML, and the top-1 accuracy is obtained on the ImageNet-1K dataset.
Compared to non-overlapping large-kernel patch embedding (V5 in Tab. 3), the proposed convolution stem in EfficientFormer (V1 in Tab. 3) greatly reduces inference latency while providing higher accuracy. We demonstrate that the convolution stem is not only beneficial to model convergence and accuracy but also boosts inference speed on the mobile device by a large margin, and thus can serve as a good alternative to non-overlapping patch embedding implementations.
Without the proposed 3D MHSA and latency-driven search, EfficientFormer downgrades to a pure 4D design with the pool mixer, which is similar to PoolFormer (the patch embeddings and normalizations are different). By comparing EfficientFormer with V1 in Tab. 3, we can observe that the integration of 3D MHSA and latency-driven search greatly boosts top-1 accuracy with minimal impact on inference speed. The results prove that MHSA with the global receptive field is an essential contribution to model performance. As a result, though enjoying faster inference speed, simply removing MHSA greatly limits the performance upper bound. In EfficientFormer, we smoothly integrate MHSA in a dimension-consistent manner, obtaining better performance while simultaneously achieving ultra-fast inference speed.
Apart from the CONV-BN structure in the 4D partition of EfficientFormer, we explore Group Normalization (GN) in the 4D partition as employed in prior work; note that channel-wise GN has an equivalent effect to LN in this setting. By comparing V1 and V2 in Tab. 3, we can observe that GN only slightly improves top-1 accuracy but incurs latency overhead, as it cannot be folded at the inference stage. As a result, we apply the CONV-BN structure in the entire 4D partition of EfficientFormer.
It is widely agreed that ReLU is the simplest and fastest activation function, while GeLU and HardSwish wield better performance. We observe that ReLU can hardly provide any speedup over GeLU on iPhone 12 with CoreMLTools, while HardSwish is significantly slower than ReLU and GeLU. We conclude that the activation function can be selected on a case-by-case basis depending on the specific hardware and compiler. In this work, we use GeLU, which provides better performance than ReLU while executing comparably fast. For a fair comparison, we modify inefficient operators in other works according to the support from iPhone 12 and CoreMLTools, e.g., we report LeViT latency after changing HardSwish to GeLU.
Table 3: Ablation studies on EfficientFormer-L3. Columns: Model | CONV stem | Norm. | Activation | MHSA | Search | Top-1 | Latency (ms).
We provide ImageNet-1K results with longer training (1000 epochs) in Tab. 4 using EfficientFormer-L1. We can observe that, compared to the standard 300-epoch training recipe, the longer schedule further boosts the top-1 accuracy of EfficientFormer-L1, making our 1.6 ms model achieve over 80% top-1 accuracy and outperform lightweight CNNs, e.g., MobileNet and EfficientNet, by a large margin. We demonstrate that EfficientFormer still wields the potential to achieve even better performance with a stronger training recipe.
Table 4: ImageNet-1K results with longer training. Columns: Model | Params (M) | MACs (G) | Train Epochs | Top-1 (%) | Latency (ms).
The detailed network architectures for EfficientFormer-L1, EfficientFormer-L3, and EfficientFormer-L7 are provided in Tab. 5. We report the resolution and number of blocks for each stage. In addition, the width of EfficientFormer is specified as the embedding dimension (Embed. Dim.). As for the MHSA block, the dimension of Query and Key is provided, and we employ eight heads for all EfficientFormer variants. The MLP expansion ratio is set to the default (4), as in most ViT works.
Table 5: Network architecture details for EfficientFormer-L1/L3/L7: the stem and the per-stage patch embeddings (patch sizes), resolutions, block counts, and embedding dimensions.