SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage

03/20/2023
by Song Park, et al.

We need billion-scale images to achieve more generalizable and ground-breaking vision models, as well as massive dataset storage to ship the images (e.g., the LAION-4B dataset needs 240TB of storage space). However, it has become challenging to handle ever-growing dataset storage with limited storage infrastructure. A number of storage-efficient training methods have been proposed to tackle this problem, but they are rarely scalable or suffer severe performance degradation. In this paper, we propose a storage-efficient training strategy for vision classifiers on large-scale datasets (e.g., ImageNet) that uses only 1024 tokens per instance without the raw-level pixels; our token storage requires less than 1% of the space needed for the original JPEG-compressed pixels. We also propose token augmentations and a Stem-adaptor module that let our approach use the same architectures as pixel-based approaches, requiring only minimal modifications to the stem layer and carefully tuned optimization settings. Our experimental results on ImageNet-1k show that our method outperforms other storage-efficient training methods by a large margin. We further show the effectiveness of our method in two other practical scenarios: storage-efficient pre-training and continual learning. Code is available at https://github.com/naver-ai/seit
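As a rough illustration of the idea (a minimal sketch, not the official SeiT implementation), the code below shows how a ViT-style classifier can consume 1024 discrete tokens per image instead of raw pixels: an embedding-lookup stem stands in for the pixel patch-embedding stem, and a CutMix-style splice over token spans serves as one possible token-level augmentation. All names and hyperparameters here (TokenStem, TokenClassifier, token_cutmix, vocab_size=8192, dim=384) are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of token-based classifier training; not the official SeiT code.
import torch
import torch.nn as nn

class TokenStem(nn.Module):
    """Replaces the pixel patch-embedding stem with an embedding lookup over
    discrete token indices, so the rest of the transformer stays unchanged."""
    def __init__(self, vocab_size=8192, dim=384, num_tokens=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))

    def forward(self, token_ids):                 # (B, 1024) int64 indices
        return self.embed(token_ids) + self.pos   # (B, 1024, dim)

def token_cutmix(tokens, labels, alpha=1.0):
    """Token-level augmentation: splice a random contiguous span of tokens
    from a shuffled copy of the batch. lam is the fraction kept from the
    original sample; labels are mixed accordingly by the caller."""
    B, N = tokens.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    cut = int(N * (1 - lam))                       # span length to replace
    start = torch.randint(0, N - cut + 1, (1,)).item()
    perm = torch.randperm(B)
    mixed = tokens.clone()
    mixed[:, start:start + cut] = tokens[perm, start:start + cut]
    return mixed, labels, labels[perm], lam

class TokenClassifier(nn.Module):
    """Plain transformer encoder over token embeddings with a linear head."""
    def __init__(self, num_classes=1000, dim=384, depth=12, heads=6):
        super().__init__()
        self.stem = TokenStem(dim=dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, token_ids):
        x = self.encoder(self.stem(token_ids))
        return self.head(x.mean(dim=1))            # mean-pool, then classify

model = TokenClassifier()
ids = torch.randint(0, 8192, (2, 1024))            # dummy batch of token indices
logits = model(ids)                                # (2, 1000)
```

With a batch mixed by token_cutmix, the training loss would interpolate the two label targets in the usual CutMix fashion, e.g. lam * CE(logits, y_a) + (1 - lam) * CE(logits, y_b).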


research
01/28/2021

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Transformers, which are popular for language modeling, have been explore...
research
03/08/2023

Centroid-centered Modeling for Efficient Vision Transformer Pre-training

Masked Image Modeling (MIM) is a new self-supervised vision pre-training...
research
04/22/2021

So-ViT: Mind Visual Tokens for Vision Transformer

Recently the vision transformer (ViT) architecture, where the backbone p...
research
02/16/2022

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Vision Transformers (ViTs) take all the image patches as tokens and cons...
research
05/29/2023

DiffRate: Differentiable Compression Rate for Efficient Vision Transformers

Token compression aims to speed up large-scale vision transformers (e.g....
research
07/27/2017

A Downsampled Variant of ImageNet as an Alternative to the CIFAR datasets

The original ImageNet dataset is a popular large-scale benchmark for tra...
research
04/19/2016

Improving Raw Image Storage Efficiency by Exploiting Similarity

To improve the temporal and spatial storage efficiency, researchers have...
