EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications

06/21/2022
by Muhammad Maaz et al.

In the pursuit of ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient general-purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture, EdgeNeXt. Specifically, in EdgeNeXt we introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and utilizes depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection, and segmentation tasks reveal the merits of the proposed approach, outperforming state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT by an absolute gain of 2.2% with a 28% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K. The code and models are publicly available at https://t.ly/_Vu9.
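The SDTA idea described above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name, split count, and head count are illustrative assumptions. It shows the two ingredients the abstract names: a hierarchical channel split with depth-wise convolutions (each group's output feeds the next, implicitly enlarging the receptive field), followed by attention applied across the channel dimension rather than the spatial one, so the attention map is C×C instead of HW×HW and cost stays linear in image size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SDTASketch(nn.Module):
    """Hypothetical sketch of a split depth-wise transpose attention block.

    Assumptions (not from the paper text): 2 channel splits, 2 attention
    heads, a 3x3 depth-wise kernel, and a residual connection at the end.
    """

    def __init__(self, dim: int, num_splits: int = 2, num_heads: int = 2):
        super().__init__()
        assert dim % num_splits == 0 and dim % num_heads == 0
        self.num_splits = num_splits
        self.num_heads = num_heads
        split_dim = dim // num_splits
        # One depth-wise conv per split after the first; each split is fused
        # with the previous split's output before convolving, so later splits
        # see a progressively larger effective receptive field.
        self.dw_convs = nn.ModuleList(
            nn.Conv2d(split_dim, split_dim, 3, padding=1, groups=split_dim)
            for _ in range(num_splits - 1)
        )
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        # --- split + hierarchical depth-wise convolutions ---
        splits = torch.chunk(x, self.num_splits, dim=1)
        outs, prev = [splits[0]], splits[0]
        for conv, s in zip(self.dw_convs, splits[1:]):
            prev = conv(prev + s)
            outs.append(prev)
        x = torch.cat(outs, dim=1)

        # --- transposed attention: attend over channels, not pixels ---
        t = x.flatten(2).transpose(1, 2)              # (B, HW, C)
        q, k, v = self.qkv(t).chunk(3, dim=-1)

        def to_heads(z: torch.Tensor) -> torch.Tensor:
            # (B, HW, C) -> (B, heads, C/heads, HW): channels become "tokens"
            return z.transpose(1, 2).reshape(B, self.num_heads,
                                             C // self.num_heads, H * W)

        q, k, v = map(to_heads, (q, k, v))
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)  # (B, h, C/h, C/h)
        y = (attn @ v).reshape(B, C, H * W)               # back to (B, C, HW)
        y = self.proj(y.transpose(1, 2)).transpose(1, 2).reshape(B, C, H, W)
        return x + y
```

Note the attention matrix is (C/heads)×(C/heads), independent of the spatial resolution; this is what makes channel-wise attention attractive for mobile-resolution inputs compared with standard spatial self-attention.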

