MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models

10/04/2022
by Chenglin Yang, et al.

This paper presents MOAT, a family of neural networks that build on top of MObile convolution (i.e., inverted residual blocks) and ATtention. Unlike current works that stack separate mobile convolution and transformer blocks, we effectively merge them into a MOAT block. Starting with a standard Transformer block, we replace its multi-layer perceptron with a mobile convolution block, and further reorder it before the self-attention operation. The mobile convolution block not only enhances the network representation capacity, but also produces better downsampled features. Our conceptually simple MOAT networks are surprisingly effective, achieving 89.1% top-1 accuracy on ImageNet-1K with ImageNet-22K pretraining. Additionally, MOAT can be seamlessly applied to downstream tasks that require large-resolution inputs by simply converting the global attention to window attention. Thanks to the mobile convolution that effectively exchanges local information between pixels (and thus across windows), MOAT does not need the extra window-shifting mechanism. As a result, on COCO object detection, MOAT achieves 59.2% box AP with 227M model parameters (single-scale inference and hard NMS), and on ADE20K semantic segmentation, MOAT attains 57.6% mIoU with 496M model parameters (single-scale inference). Finally, the tiny-MOAT family, obtained by simply reducing the channel sizes, also surprisingly outperforms several mobile-specific transformer-based models on ImageNet. We hope our simple yet effective MOAT will inspire more seamless integration of convolution and self-attention. Code is made publicly available.
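Reading the abstract's recipe literally, the sketch below shows one MOAT-style block in PyTorch: an inverted-residual (mobile convolution) sub-block placed before a global self-attention sub-block, each with a residual connection. This is a minimal sketch assuming common defaults; the layer choices (norm placement, expansion ratio of 4, head count) and the names MBConv/MOATBlock are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a MOAT-style block, following only the abstract's
# description: a Transformer block whose MLP is replaced by a mobile
# convolution (inverted residual) block, reordered BEFORE self-attention.
# Hyperparameters and norm choices are assumptions, not the paper's exact ones.
import torch
import torch.nn as nn


class MBConv(nn.Module):
    """Inverted residual block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""

    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            # Depthwise 3x3 conv: exchanges local information between pixels
            # (and thus across windows, if window attention is used later).
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection


class MOATBlock(nn.Module):
    """Mobile convolution first (local mixing), then global self-attention."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.mbconv = MBConv(channels)
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        x = self.mbconv(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        q = self.norm(tokens)
        attn_out, _ = self.attn(q, q, q)       # global self-attention
        tokens = tokens + attn_out             # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = MOATBlock(channels=64)
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])
```

For large-resolution downstream inputs, the abstract's window-attention variant would apply the same attention within non-overlapping windows instead of globally; because the depthwise convolution already mixes information across window borders, no shifted-window mechanism is needed.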

Related research

12/20/2021 · Lite Vision Transformer with Enhanced Self-Attention
Despite the impressive representation capacity of vision transformer mod...

01/24/2022 · UniFormer: Unifying Convolution and Self-attention for Visual Recognition
It is a challenging task to learn discriminative representation from ima...

06/21/2022 · EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications
In the pursuit of achieving ever-increasing accuracy, large and complex ...

04/13/2023 · Dynamic Mobile-Former: Strengthening Dynamic Convolution with Attention and Residual Connection in Kernel Space
We introduce Dynamic Mobile-Former (DMF), which maximizes the capabilities of d...

11/28/2018 · Partial Convolution based Padding
In this paper, we present a simple yet effective padding scheme that can...

05/28/2021 · ResT: An Efficient Transformer for Visual Recognition
This paper presents an efficient multi-scale vision Transformer, called ...

07/17/2023 · Scale-Aware Modulation Meet Transformer
This paper presents a new vision Transformer, Scale-Aware Modulation Tra...
