Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window

06/23/2023
by   Jinkyu Koo, et al.

Transformer models have shown great potential in computer vision, following their success in language tasks. Swin Transformer is one such model: it outperforms convolution-based architectures in accuracy while being more efficient than Vision Transformer (ViT) and its variants, which have quadratic complexity with respect to the input size. Swin Transformer features shifting windows, which allow cross-window connections while limiting self-attention computation to non-overlapping local windows. However, shifting windows introduces memory copy operations that account for a significant portion of its runtime. To mitigate this issue, we propose Swin-Free, which applies size-varying windows across stages, instead of shifting windows, to achieve cross-connections among local windows. With this simple design change, Swin-Free runs faster than Swin Transformer at inference while achieving better accuracy. Furthermore, we also propose several Swin-Free variants that are faster than their Swin Transformer counterparts.
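The idea of replacing shifted windows with size-varying windows can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the window sizes, the single-head attention, and all tensor shapes below are hypothetical, chosen only to show how changing the window size between stages regroups tokens (giving cross-window connections) without any shift/roll memory copies.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping (win, win, C) windows."""
    H, W, C = x.shape
    assert H % win == 0 and W % win == 0, "feature map must divide evenly into windows"
    x = x.reshape(H // win, win, W // win, win, C)
    # -> (H//win, W//win, win, win, C) -> (num_windows, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

def window_self_attention(windows):
    """Plain single-head self-attention computed independently inside each window."""
    n, win, _, C = windows.shape
    tokens = windows.reshape(n, win * win, C)
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(C)
    scores -= scores.max(axis=-1, keepdims=True)          # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ tokens
    return out.reshape(n, win, win, C)

# Hypothetical stage schedule: because the window size changes between
# stages, tokens grouped together in one stage fall into different windows
# in the next, so information mixes across windows with no shifting step.
x = np.random.rand(16, 16, 8)
for win in (4, 8):  # e.g., window size doubles at the next stage
    windows = window_partition(x, win)
    _ = window_self_attention(windows)
```

In contrast, Swin Transformer achieves the same cross-window mixing by cyclically shifting the feature map (e.g., `np.roll`) before partitioning, which is the memory-copy cost the abstract refers to.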

