FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization

The recent amalgamation of transformer and convolutional designs has led to steady improvements in both the accuracy and the efficiency of models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains a state-of-the-art latency-accuracy trade-off. To this end, we introduce RepMixer, a novel token mixing operator and the building block of FastViT, which uses structural reparameterization to lower memory access cost by removing skip connections from the network. We further apply train-time overparameterization and large kernel convolutions to boost accuracy, and empirically show that these choices have minimal effect on latency. We show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device at the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks: image classification, detection, segmentation, and 3D mesh regression, with significant improvement in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models.
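The core trick behind removing skip connections via structural reparameterization can be illustrated with a minimal numpy sketch: a residual branch plus a convolution (y = x + conv(x)) is mathematically equal to a single convolution whose kernel has an identity kernel added to it (y = conv'(x) with conv' = conv + I). This sketch is an illustration of the general idea only, not the paper's implementation; it uses a single-channel 3x3 depthwise-style convolution and ignores BatchNorm folding and multi-channel details.

```python
import numpy as np

def dwconv3x3(x, k):
    # Naive single-channel 3x3 convolution (cross-correlation) with zero padding.
    h, w = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy feature map
k = rng.standard_normal((3, 3))   # toy mixing kernel

# Train-time form: skip connection plus convolution branch.
train_out = x + dwconv3x3(x, k)

# Inference-time form: fold the skip connection into the kernel by
# adding an identity kernel (a 1 at the center passes x through unchanged).
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
k_rep = k + identity
infer_out = dwconv3x3(x, k_rep)

# Both forms produce the same output, but the reparameterized one
# needs no extra memory access for the residual branch.
assert np.allclose(train_out, infer_out)
```

The same equivalence is what lets train-time overparameterized branches be merged into a single operator before deployment, trading nothing in accuracy for lower memory traffic at inference.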


research · 06/08/2022
An Improved One millisecond Mobile Backbone
Efficient neural network backbones for mobile devices are often optimize...

research · 10/10/2021
NViT: Vision Transformer Compression and Parameter Redistribution
Transformers yield state-of-the-art results across many tasks. However, ...

research · 04/12/2022
TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation
Although vision transformers (ViTs) have achieved great success in compu...

research · 01/30/2023
SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation
Since the introduction of Vision Transformers, the landscape of many com...

research · 08/22/2023
TurboViT: Generating Fast Vision Transformers via Generative Architecture Search
Vision transformers have shown unprecedented levels of performance in ta...

research · 12/15/2022
Rethinking Vision Transformers for MobileNet Size and Speed
With the success of Vision Transformers (ViTs) in computer vision tasks,...

research · 05/06/2022
EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers
Self-attention based models such as vision transformers (ViTs) have emer...
