Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

03/29/2021
by Pengchuan Zhang, et al.

This paper presents a new Vision Transformer (ViT) architecture, Multi-Scale Vision Longformer, which significantly enhances the ViT of <cit.> for encoding high-resolution images using two techniques. The first is a multi-scale model structure, which provides image encodings at multiple scales with manageable computational cost. The second is the attention mechanism of Vision Longformer, a variant of Longformer <cit.> originally developed for natural language processing, which achieves linear complexity w.r.t. the number of input tokens. A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including existing ViT models, their ResNet counterparts, and the Pyramid Vision Transformer from a concurrent work <cit.>, on a range of vision tasks, including image classification, object detection, and segmentation. The models and source code used in this study will be released to the public soon.
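To make the linear-complexity attention pattern concrete, below is a minimal, illustrative sketch of a Longformer-style attention mask for vision: a few global tokens attend to (and are attended by) everything, while each patch token attends only to patches in a small 2-D window around it. This is an assumption-laden demo, not the authors' implementation: the function names (`vision_longformer_mask`, `masked_attention`) are hypothetical, and the explicit dense mask used here costs quadratic memory, whereas the paper's efficient implementation avoids materializing it.

```python
import torch
import torch.nn.functional as F

def vision_longformer_mask(h, w, window, num_global=1):
    """Boolean attention mask illustrating the Vision-Longformer pattern:
    `num_global` global tokens interact with all tokens, and each of the
    h*w patch tokens attends only to patches within `window` rows/columns.
    (Illustrative only; the paper avoids building a dense mask.)"""
    n = num_global + h * w
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_global, :] = True          # global tokens see everything
    mask[:, :num_global] = True          # everything sees global tokens
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ys, xs = ys.flatten(), xs.flatten()
    # patch i attends to patch j iff they are within `window` rows and columns
    local = (ys[:, None] - ys[None, :]).abs().le(window) & \
            (xs[:, None] - xs[None, :]).abs().le(window)
    mask[num_global:, num_global:] = local
    return mask

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted by `mask`
    (quadratic memory; fine for a small demo)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# toy usage: one global (CLS-like) token + a 14x14 patch grid, window radius 2
grid = 14
mask = vision_longformer_mask(grid, grid, window=2, num_global=1)
x = torch.randn(1, 1 + grid * grid, 64)      # (batch, tokens, dim)
out = masked_attention(x, x, x, mask)        # q = k = v = x for simplicity
print(out.shape)                             # torch.Size([1, 197, 64])
```

Because each patch row of the mask has only O(window^2) active entries plus the global columns, an efficient implementation that gathers just those neighbors scales linearly with the number of patch tokens, which is what makes high-resolution inputs tractable.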

