Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training

12/07/2021
by   Haofei Zhang, et al.

Recently, vision Transformers (ViTs) have been developing rapidly and have begun to challenge the dominance of convolutional neural networks (CNNs) in computer vision (CV). With a general-purpose Transformer architecture replacing the hard-coded inductive biases of convolution, ViTs have surpassed CNNs, especially in data-sufficient settings. However, ViTs are prone to over-fitting on small datasets and therefore rely on large-scale pre-training, which consumes enormous time. In this paper, we strive to liberate ViTs from pre-training by reintroducing CNNs' inductive biases into ViTs while preserving their network architectures, which retain a higher upper bound, and by setting up more suitable optimization objectives. To begin with, an agent CNN is designed based on the given ViT and equipped with inductive biases. A bootstrapping training algorithm is then proposed to jointly optimize the agent and the ViT with weight sharing, during which the ViT learns inductive biases from the agent's intermediate features. Extensive experiments on CIFAR-10/100 and ImageNet-1k with limited training data show encouraging results: the inductive biases help ViTs converge significantly faster and outperform conventional CNNs with even fewer parameters.
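The abstract describes a joint objective in which the ViT is trained on the task while also learning inductive biases from the agent CNN's intermediate features. The exact loss is not given in the abstract; the sketch below is a minimal, hypothetical form of such an objective, assuming the feature-alignment term is a mean-squared error between matched intermediate features and that `alpha` is a hypothetical weighting coefficient.

```python
import numpy as np

def feature_alignment_loss(vit_feats, agent_feats):
    # Hypothetical form of the "learn inductive biases from the agent's
    # intermediate features" step: MSE between matched feature maps.
    return float(np.mean((vit_feats - agent_feats) ** 2))

def bootstrapping_loss(task_loss, vit_feats, agent_feats, alpha=0.5):
    # Joint objective: supervised task loss plus alignment to the agent CNN.
    # alpha is an assumed trade-off weight, not a value from the paper.
    return task_loss + alpha * feature_alignment_loss(vit_feats, agent_feats)

# Example: task loss 1.0, ViT features all ones, agent features all zeros.
loss = bootstrapping_loss(1.0, np.ones((2, 4)), np.zeros((2, 4)), alpha=0.5)
print(loss)  # 1.0 + 0.5 * 1.0 = 1.5
```

In practice both networks would be optimized jointly with shared weights, as the abstract states; this sketch only illustrates how a feature-alignment term could enter the ViT's loss.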


Related research

10/13/2022 · How to Train Vision Transformer on Small-scale Datasets?
Vision Transformer (ViT), a radically different architecture than convol...

01/15/2021 · LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning
While designing inductive bias in neural architectures has been widely s...

05/15/2023 · Theoretical Analysis of Inductive Biases in Deep Convolutional Networks
In this paper, we study the inductive biases in convolutional neural net...

10/04/2022 · Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling
There are two de facto standard architectures in recent computer vision:...

06/15/2022 · SP-ViT: Learning 2D Spatial Priors for Vision Transformers
Recently, transformers have shown great potential in image classificatio...

05/15/2021 · Are Convolutional Neural Networks or Transformers more like human vision?
Modern machine learning models for computer vision exceed humans in accu...

06/17/2020 · Neural Anisotropy Directions
In this work, we analyze the role of the network architecture in shaping...
