Early Convolutions Help Transformers See Better

06/28/2021
by Tete Xiao, et al.

Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are far easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p pxp convolution (p=16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3x3 convolutions. While the vast majority of computation in the two ViT designs is identical, we find that this small change in early visual processing results in markedly different training behavior in terms of the sensitivity to optimization settings as well as the final model accuracy. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance (by ~1-2% top-1 accuracy on ImageNet-1k), while maintaining flops and runtime. The improvement can be observed across the wide spectrum of model complexities (from 1G to 36G flops) and dataset scales (from ImageNet-1k to ImageNet-21k). These findings lead us to recommend using a standard, lightweight convolutional stem for ViT models as a more robust architectural choice compared to the original ViT model design.
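For concreteness, here is a minimal PyTorch sketch contrasting the two stem designs described in the abstract. The channel widths (64 up to 512) and the BatchNorm+ReLU placement after each 3x3 convolution are illustrative assumptions; the paper additionally tunes the conv stem and removes one transformer block so that total flops match, which this sketch does not reproduce. Both stems map a 224x224 image to the same 14x14 grid of tokens.

```python
# Minimal sketch of the two ViT stems (PyTorch). Channel widths and
# norm/activation choices are illustrative assumptions, not the paper's
# exact flop-matched configuration.
import torch
import torch.nn as nn

embed_dim = 768  # token dimension, e.g. ViT-B

# Original ViT "patchify" stem: a single stride-16 16x16 convolution.
patchify_stem = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

# Convolutional stem: stacked stride-two 3x3 convolutions. Four halvings
# give the same 16x downsampling as the patchify stem; a final 1x1
# convolution projects to the token dimension.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
    nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.BatchNorm2d(512), nn.ReLU(),
    nn.Conv2d(512, embed_dim, kernel_size=1),
)

x = torch.randn(1, 3, 224, 224)
for stem in (patchify_stem, conv_stem):
    tokens = stem(x).flatten(2).transpose(1, 2)  # (B, 14*14, embed_dim)
    print(tokens.shape)  # torch.Size([1, 196, 768]) for both stems
```

Because both stems produce an identical token grid, the transformer body downstream is unchanged; only the first layer of visual processing differs, which is why the vast majority of computation in the two designs is the same.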


Related research

- Going deeper with Image Transformers (03/31/2021): Transformers have been recently adapted for large scale image classifica...
- Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition (11/22/2022): This paper does not attempt to design a state-of-the-art method for visu...
- More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity (07/07/2022): Transformers have quickly shined in the computer vision world since the ...
- A Deeper Look into Convolutions via Pruning (02/04/2021): Convolutional neural networks (CNNs) are able to attain better visual re...
- Efficient Training of Visual Transformers with Small-Size Datasets (06/07/2021): Visual Transformers (VTs) are emerging as an architectural paradigm alte...
- Three things everyone should know about Vision Transformers (03/18/2022): After their initial success in natural language processing, transformer ...
- What can linear interpolation of neural network loss landscapes tell us? (06/30/2021): Studying neural network loss landscapes provides insights into the natur...
