ResFormer: Scaling ViTs with Multi-Resolution Training

12/01/2022
by   Rui Tian, et al.

Vision Transformers (ViTs) have achieved overwhelming success, yet they suffer from poor resolution scalability, i.e., their performance drops drastically when presented with input resolutions that are unseen during training. We introduce ResFormer, a framework built upon the seminal idea of multi-resolution training for improved performance on a wide spectrum of mostly unseen testing resolutions. In particular, ResFormer operates on replicated images of different resolutions and enforces a scale consistency loss to encourage the exchange of information across scales. More importantly, to alternate among varying resolutions, we propose a global-local positional embedding strategy that changes smoothly conditioned on input sizes. This allows ResFormer to cope with novel resolutions effectively. We conduct extensive experiments for image classification on ImageNet. The results provide strong quantitative evidence that ResFormer scales well across a wide range of resolutions. For instance, ResFormer-B-MR achieves a Top-1 accuracy of 75.86% and [...] when evaluated at relatively low and high resolutions respectively (i.e., 96 and 640), which are 48% and [...] better than DeiT-B. We also demonstrate, among other things, that ResFormer is flexible and can be easily extended to semantic segmentation and video action recognition.
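
The abstract names two training ingredients: the same image replicated at several resolutions, and a scale consistency loss tying the predictions together. The following is a minimal, dependency-free sketch of how such an objective could be assembled; it is not the authors' implementation. The abstract does not give the form of the consistency term, so the mean squared gap between class probabilities used here is an assumption, as are the function names, the `weight` parameter, and the example resolutions 128 and 224.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[label])


def consistency(logits_a, logits_b):
    """ASSUMED consistency term: mean squared gap between two predictions."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return sum((a - b) ** 2 for a, b in zip(pa, pb)) / len(pa)


def multi_resolution_loss(logits_by_res, label, weight=1.0):
    """Combine per-resolution classification losses with a consistency term.

    logits_by_res maps an input resolution to the class logits the model
    produced for the same image resized to that resolution.
    """
    resolutions = sorted(logits_by_res)
    # Supervised term: every replica is classified against the same label.
    ce = sum(cross_entropy(logits_by_res[r], label) for r in resolutions)
    # Consistency term: compare each resolution with the next larger one.
    cons = sum(
        consistency(logits_by_res[low], logits_by_res[high])
        for low, high in zip(resolutions, resolutions[1:])
    )
    return ce + weight * cons


if __name__ == "__main__":
    # Hypothetical logits for one image seen at two resolutions.
    example = {128: [1.5, 0.2, -0.3], 224: [2.1, 0.1, -0.5]}
    print(multi_resolution_loss(example, label=0))
```

When the predictions at all resolutions agree, the consistency term vanishes and the objective reduces to the sum of the per-resolution cross-entropy losses; disagreement between scales adds a penalty scaled by `weight`.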

research
07/13/2020

Learning to Learn Parameterized Classification Networks for Scalable Input Images

Convolutional Neural Networks (CNNs) do not have a predictable recogniti...
research
01/23/2023

Improving Performance of Object Detection using the Mechanisms of Visual Recognition in Humans

Object recognition systems are usually trained and evaluated on high res...
research
09/08/2023

On the Efficacy of Multi-scale Data Samplers for Vision Applications

Multi-scale resolution training has seen an increased adoption across mu...
research
07/09/2016

Combining multiple resolutions into hierarchical representations for kernel-based image classification

Geographic object-based image analysis (GEOBIA) framework has gained inc...
research
10/27/2022

How To Overcome Richness Axiom Fallacy

The paper points at the grieving problems implied by the richness axiom ...
research
07/12/2023

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

The ubiquitous and demonstrably suboptimal choice of resizing images to ...
research
07/19/2020

Resolution Switchable Networks for Runtime Efficient Image Recognition

We propose a general method to train a single convolutional neural netwo...
