Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

07/12/2023
by Mostafa Dehghani, et al.

The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution before processing them with computer vision models has not yet been successfully challenged. However, models such as the Vision Transformer (ViT) offer flexible sequence-based modeling, and hence varying input sequence lengths. We take advantage of this with NaViT (Native Resolution ViT) which uses sequence packing during training to process inputs of arbitrary resolutions and aspect ratios. Alongside flexible model usage, we demonstrate improved training efficiency for large-scale supervised and contrastive image-text pretraining. NaViT can be efficiently transferred to standard tasks such as image and video classification, object detection, and semantic segmentation and leads to improved results on robustness and fairness benchmarks. At inference time, the input resolution flexibility can be used to smoothly navigate the test-time cost-performance trade-off. We believe that NaViT marks a departure from the standard, CNN-designed, input and modelling pipeline used by most computer vision models, and represents a promising direction for ViTs.
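
The sequence packing described above can be illustrated with a small sketch. It assumes non-overlapping 16-pixel patches, a greedy first-fit packing policy, and a per-image attention mask; the abstract specifies none of these details, and the helper names (`num_patches`, `pack_segments`, `attention_mask`) are hypothetical, not taken from the NaViT codebase. The sketch tracks only which image each token position belongs to, not the patch pixels themselves.

```python
def num_patches(height, width, patch=16):
    """Token count for an image patchified at its native resolution.

    Assumption: non-overlapping square patches; any remainder is cropped.
    """
    return (height // patch) * (width // patch)


def pack_segments(lengths, seq_len, pad=-1):
    """Greedy first-fit packing of variable-length token runs into fixed rows.

    Returns one list of length `seq_len` per packed row. Each entry is the
    index of the image that owns that position, or `pad` for padding.
    """
    rows = []
    for idx, n in enumerate(lengths):
        if n > seq_len:
            raise ValueError(f"image {idx} has {n} tokens, more than seq_len={seq_len}")
        for row in rows:
            if len(row) + n <= seq_len:  # first row with enough free space
                row.extend([idx] * n)
                break
        else:
            rows.append([idx] * n)  # nothing fits: open a new row
    return [row + [pad] * (seq_len - len(row)) for row in rows]


def attention_mask(segment_ids, pad=-1):
    """Boolean mask: position i may attend to j only within the same image."""
    return [[a == b and a != pad for b in segment_ids] for a in segment_ids]


# Three images with different resolutions and aspect ratios (height, width).
sizes = [(64, 32), (32, 96), (48, 48)]
lengths = [num_patches(h, w) for h, w in sizes]
rows = pack_segments(lengths, seq_len=20)
mask = attention_mask(rows[0])
```

Here the first two images fill one row of 20 tokens exactly and the third starts a second row that is padded to length. The mask keeps tokens of different images in the same row from attending to each other, which is one common way to make packed examples behave as if they were processed separately.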


