Group Generalized Mean Pooling for Vision Transformer

12/08/2022
by ByungSoo Ko, et al.

Vision Transformer (ViT) extracts the final representation from either the class token or an average of all patch tokens, following the Transformer architecture in Natural Language Processing (NLP) or Convolutional Neural Networks (CNNs) in computer vision. However, studies of how best to aggregate the patch tokens remain limited to average pooling, even though widely used pooling strategies, such as max and GeM pooling, can be considered. Despite their effectiveness, the existing pooling strategies do not account for the architecture of ViT or the channel-wise differences in the activation maps, aggregating crucial and trivial channels with the same importance. In this paper, we present Group Generalized Mean (GGeM) pooling as a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. Because ViT already groups the channels via its multi-head attention mechanism, grouping the channels in GGeM lowers head-wise dependence while amplifying important channels in the activation maps. Exploiting GGeM yields performance boosts of 0.1 percentage points or more over the baselines and achieves state-of-the-art performance for ViT-Base and ViT-Large models on the ImageNet-1K classification task. Moreover, GGeM outperforms the existing pooling strategies on image retrieval and multi-modal representation learning tasks, demonstrating its superiority across a variety of tasks. GGeM is simple in that only a few lines of code are necessary for its implementation.
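Since the abstract notes the method boils down to a few lines of code, here is a minimal sketch of GGeM in PyTorch, assuming patch tokens of shape (batch, tokens, dim) with the class token excluded. GeM pooling of N values x_i with exponent p is ((1/N) * sum_i x_i^p)^(1/p), recovering average pooling at p = 1 and approaching max pooling as p grows; GGeM learns one such p per channel group. The class name, p initialization, and epsilon clamp below are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class GGeM(nn.Module):
    """Group Generalized Mean pooling over ViT patch tokens (illustrative sketch).

    Channels are split into `groups` groups; each group shares a single
    learnable exponent p, so each group learns where it sits between
    average pooling (p = 1) and max pooling (p -> inf).
    """

    def __init__(self, dim: int, groups: int, p_init: float = 3.0, eps: float = 1e-6):
        super().__init__()
        assert dim % groups == 0, "embedding dim must be divisible by groups"
        self.groups = groups
        self.eps = eps
        self.p = nn.Parameter(torch.full((groups,), p_init))  # one shared p per group

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) patch tokens, class token excluded
        b, n, d = x.shape
        x = x.view(b, n, self.groups, d // self.groups)
        p = self.p.view(1, 1, self.groups, 1)
        # clamp keeps the power well-defined for non-positive activations;
        # then take the generalized mean over the token axis, per channel group
        pooled = x.clamp(min=self.eps).pow(p).mean(dim=1).pow(1.0 / p.squeeze(1))
        return pooled.reshape(b, d)  # (batch, dim) image representation


# Usage: ViT-Base has dim = 768 and 12 attention heads; setting groups to the
# head count mirrors the head-wise grouping motivation described above.
pool = GGeM(dim=768, groups=12)
tokens = torch.randn(2, 196, 768)  # 14 x 14 patch tokens from a 224 x 224 image
features = pool(tokens)            # shape: (2, 768)
```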


Related research

09/16/2022  Self-Attentive Pooling for Efficient Deep Learning
Efficient custom pooling techniques that can aggressively trim the dimen...

06/13/2022  MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing
Convolutional Neural Networks (CNNs) have been regarded as the go-to mod...

03/11/2017  Viraliency: Pooling Local Virality
In our overly-connected world, the automatic recognition of virality - t...

08/12/2019  LIP: Local Importance-based Pooling
Spatial downsampling layers are favored in convolutional neural networks...

09/04/2023  ExMobileViT: Lightweight Classifier Extension for Mobile Vision Transformer
The paper proposes an efficient structure for enhancing the performance ...

03/19/2021  Scalable Visual Transformers with Hierarchical Pooling
The recently proposed Visual image Transformers (ViT) with pure attentio...

02/11/2016  Attentive Pooling Networks
In this work, we propose Attentive Pooling (AP), a two-way attention mec...
