Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block

10/11/2021
by Durvesh Malpure, et al.

Recent developments in Computer Vision have seen a rise in the use of transformer-based architectures. They surpass the state-of-the-art accuracy set by CNN architectures, but they are computationally very expensive to train from scratch. As these models are quite recent in the Computer Vision field, there is a need to study their transfer learning capabilities and compare them with CNNs, so that we can understand which architecture is better when applied to real-world problems with small data. In this work, we follow a simple yet restrictive method for fine-tuning both CNN and Transformer models pretrained on ImageNet1K, evaluating them on CIFAR-10 and comparing them with each other. We unfreeze only the last transformer encoder block or the last convolutional block of a model, freeze all the layers before it, and add a simple MLP at the end for classification. This simple modification lets us use the raw learned weights of both these neural networks. From our experiments, we find that transformer-based architectures not only achieve higher accuracy than CNNs, but some transformers achieve this feat with around 4 times fewer parameters.
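The fine-tuning recipe described above (freeze the pretrained backbone, unfreeze only its last block, and train a small MLP head) can be sketched in a few lines. The following is a minimal sketch, not the authors' released code: it assumes PyTorch/torchvision, uses ImageNet1K-pretrained ViT-B/16 and ResNet-50 backbones as stand-ins for the models actually compared, and picks a hypothetical MLP hidden size of 512.

# Minimal sketch of the single-trainable-block fine-tuning setup
# (assumptions: PyTorch/torchvision backbones, hypothetical MLP hidden size 512).
import torch.nn as nn
from torchvision import models


def prepare_vit(num_classes: int = 10) -> nn.Module:
    """ImageNet1K-pretrained ViT-B/16 with only the last encoder block trainable."""
    model = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
    for p in model.parameters():                      # freeze all pretrained weights
        p.requires_grad = False
    for p in model.encoder.layers[-1].parameters():   # unfreeze the last encoder block
        p.requires_grad = True
    # Replace the classifier with a simple trainable MLP head.
    model.heads = nn.Sequential(
        nn.Linear(model.hidden_dim, 512),
        nn.ReLU(),
        nn.Linear(512, num_classes),
    )
    return model


def prepare_resnet(num_classes: int = 10) -> nn.Module:
    """ImageNet1K-pretrained ResNet-50 with only the last convolutional block trainable."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    for p in model.parameters():                      # freeze all pretrained weights
        p.requires_grad = False
    for p in model.layer4.parameters():               # unfreeze the last residual stage
        p.requires_grad = True
    # Replace the classifier with a simple trainable MLP head.
    model.fc = nn.Sequential(
        nn.Linear(model.fc.in_features, 512),
        nn.ReLU(),
        nn.Linear(512, num_classes),
    )
    return model

In practice, CIFAR-10 images would be resized to the backbone's expected input resolution (224x224 for these weights), and only the parameters left with requires_grad=True are passed to the optimizer, e.g. torch.optim.Adam(p for p in model.parameters() if p.requires_grad). The exact models, head size, and training hyperparameters used in the paper are not given in this abstract.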


research
05/17/2023

A survey of the Vision Transformers and its CNN-Transformer based Variants

Vision transformers have recently become popular as a possible alternati...
research
09/28/2021

Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models

Transformers are state-of-the-art deep learning models that are composed...
research
03/18/2022

Three things everyone should know about Vision Transformers

After their initial success in natural language processing, transformer ...
research
07/22/2022

Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers

...result of increased distraction and drowsiness. Drowsy and distract...
research
12/24/2021

Spoiler in a Textstack: How Much Can Transformers Help?

This paper presents our research regarding spoiler detection in reviews....
research
06/15/2022

Efficient Adaptive Ensembling for Image Classification

In recent times, except for sporadic cases, the trend in Computer Vision...
research
09/15/2022

Number of Attention Heads vs Number of Transformer-Encoders in Computer Vision

Determining an appropriate number of attention heads on one hand and the...
