The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy

03/12/2022
by Tianlong Chen, et al.

Vision transformers (ViTs) have gained increasing popularity as they are commonly believed to offer higher modeling capacity and representation flexibility than traditional convolutional networks. However, it is questionable whether such potential has been fully unleashed in practice, as the learned ViTs often suffer from over-smoothing, yielding highly redundant models. Recent works made preliminary attempts to identify and alleviate such redundancy, e.g., by regularizing embedding similarity or re-injecting convolution-like structures. However, a "head-to-toe assessment" of the extent of redundancy in ViTs, and of how much we could gain by thoroughly mitigating it, has been absent from this field. This paper, for the first time, systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space. In view of these findings, we advocate a principle of diversity for training ViTs, presenting corresponding regularizers that encourage representation diversity and coverage at each of these levels, thereby enabling the model to capture more discriminative information. Extensive experiments on ImageNet with a number of ViT backbones validate the effectiveness of our proposals, largely eliminating the observed ViT redundancy and significantly boosting model generalization. For example, our diversified DeiT obtains a 0.70% accuracy improvement on ImageNet with highly reduced similarity. Our code is available at https://github.com/VITA-Group/Diverse-ViT.
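To make the idea more concrete, here is a minimal PyTorch sketch of one way a diversity regularizer at the patch-embedding level could look: it penalizes the pairwise cosine similarity among token embeddings of a block so that tokens do not collapse onto a single direction. The function name `embedding_diversity_loss`, the tensor layout, and the 0.1 weighting are illustrative assumptions; the paper's actual regularizers (including those for the attention-map and weight-space levels) may differ, so see the linked repository for the authors' implementation.

```python
import torch
import torch.nn.functional as F

def embedding_diversity_loss(tokens: torch.Tensor) -> torch.Tensor:
    """Illustrative diversity penalty on patch embeddings.

    tokens: (batch, num_patches, dim) token embeddings from one ViT block.
    Returns the mean absolute pairwise cosine similarity between distinct
    tokens; adding it (scaled by a small coefficient) to the task loss
    discourages the embeddings from becoming overly similar.
    """
    x = F.normalize(tokens, dim=-1)                    # unit-normalize each token
    sim = torch.bmm(x, x.transpose(1, 2))              # (B, N, N) cosine similarities
    eye = torch.eye(sim.size(-1), device=sim.device)   # mask out self-similarity
    return (sim * (1.0 - eye)).abs().mean()

# Usage sketch (hypothetical training step):
# loss = criterion(logits, labels) + 0.1 * embedding_diversity_loss(block_tokens)
```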


Related research

04/26/2021 · Improve Vision Transformers Training by Suppressing Over-smoothing
09/15/2021 · PnP-DETR: Towards Efficient Visual Analysis with Transformers
01/12/2022 · UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning
10/21/2022 · Boosting Vision Transformers for Image Retrieval
06/16/2023 · Group Orthogonalization Regularization For Vision Models Adaptation and Robustness
09/19/2020 · Redundancy of Hidden Layers in Deep Learning: An Information Perspective
10/11/2020 · Complexity-based Speciation and Genotype Representation for Neuroevolution
