Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers

08/18/2023
by Tobias Christian Nauen, et al.

The growing popularity of Vision Transformers as the go-to models for image classification has led to an explosion of architectural modifications claiming to be more efficient than the original ViT. However, the wide diversity of experimental conditions prevents a fair comparison between them based solely on their reported results. To address this gap in comparability, we conduct a comprehensive analysis of more than 30 models to evaluate the efficiency of vision transformers and related architectures across multiple performance metrics. Our benchmark provides a comparable baseline across the landscape of efficiency-oriented transformers and unveils a plethora of surprising insights. For example, we discover that ViT is still Pareto optimal across multiple efficiency metrics, despite the existence of several alternative approaches claiming to be more efficient. Results also indicate that hybrid attention-CNN models fare particularly well in terms of inference memory and parameter count, and that it is better to scale the model size than the image size. Furthermore, we uncover a strong positive correlation between the number of FLOPs and the training memory, which enables the estimation of required VRAM from theoretical measurements alone. Thanks to our holistic evaluation, this study offers valuable insights for practitioners and researchers, facilitating informed decisions when selecting models for specific applications. We publicly release our code and data at https://github.com/tobna/WhatTransformerToFavor
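
To illustrate how the reported FLOPs-to-training-memory correlation could be used in practice, below is a minimal Python sketch that fits a linear model to measured (FLOPs, peak training memory) pairs and extrapolates the VRAM needed for an unseen model. The fit coefficients and the example measurements are placeholders, not values from the paper; real inputs would come from the released benchmark data.

import numpy as np

def fit_flops_to_memory(flops, train_mem_gb):
    # Fit a simple linear model: train_mem_gb ~ a * flops + b.
    # flops:        theoretical forward-pass FLOPs per image for each benchmarked model
    # train_mem_gb: measured peak training memory (GB) for the same models
    a, b = np.polyfit(np.asarray(flops, dtype=float),
                      np.asarray(train_mem_gb, dtype=float), deg=1)
    return a, b

def estimate_vram(flops_new, a, b):
    # Estimate required training VRAM (GB) for a new model from its FLOPs alone.
    return a * flops_new + b

if __name__ == "__main__":
    # Hypothetical example measurements (placeholders, not results from the study):
    flops = [4.6e9, 17.6e9, 55.4e9]      # e.g. small, base, large ViT variants at 224px
    train_mem_gb = [3.1, 7.8, 21.5]      # made-up peak memory readings per model
    a, b = fit_flops_to_memory(flops, train_mem_gb)
    print(f"Estimated VRAM for a 30 GFLOP model: {estimate_vram(30e9, a, b):.1f} GB")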
