Learning Patch-to-Cluster Attention in Vision Transformer

03/22/2022
by   Ryan Grainger, et al.

The Vision Transformer (ViT) model treats image patches as "visual tokens" and learns patch-to-patch attention. The patch-embedding-based tokenizer is a practical workaround and leaves a semantic gap with respect to its counterpart, the textual tokenizer. Patch-to-patch attention also suffers from quadratic complexity and makes it non-trivial to explain trained ViT models. To address these issues, this paper proposes learning patch-to-cluster attention (PaCa) in ViT models. Queries in the proposed PaCa-ViT are based on patches, while keys and values are based on clustering with a predefined, small number of clusters. The clusters are learned end-to-end, leading to better tokenizers and realizing joint clustering-for-attention and attention-for-clustering when deployed in ViT models; the quadratic complexity is thereby relaxed to linear complexity. Moreover, directly visualizing the learned clusters reveals how a trained ViT model learns to perform a task (e.g., object detection). In experiments, PaCa-ViT is tested on CIFAR-100 and ImageNet-1k image classification and on MS-COCO object detection and instance segmentation. Compared with prior art, it obtains better classification performance and comparable detection and segmentation performance, while being significantly more efficient on COCO thanks to the linear complexity. The learned clusters are also semantically meaningful, shedding light on designing more discriminative yet interpretable ViT models.
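The core idea can be illustrated with a minimal NumPy sketch (not the authors' implementation; the clustering head `w_c` and single-head, unbatched shapes are simplifying assumptions): queries come from the N patch tokens, while keys and values come from M << N cluster tokens formed as soft, learned groupings of patches, so attention costs O(N*M) instead of O(N^2).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def paca_attention(x, w_c, w_q, w_k, w_v):
    """Patch-to-cluster attention sketch (single head, no batch).

    x   : (N, D) patch tokens
    w_c : (D, M) clustering head -- M is the predefined number of clusters
    w_q, w_k, w_v : (D, D) query/key/value projections
    """
    # Soft cluster assignments: normalize over patches so each cluster
    # token is a convex combination of patch tokens.
    a = softmax(x @ w_c, axis=0)          # (N, M)
    z = a.T @ x                           # (M, D) cluster tokens ("better tokens")

    q = x @ w_q                           # (N, D) queries from patches
    k = z @ w_k                           # (M, D) keys from clusters
    v = z @ w_v                           # (M, D) values from clusters

    # (N, M) attention map: linear in the number of patches N.
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=-1)
    return attn @ v                       # (N, D) updated patch tokens

# Tiny usage example with random weights (illustrative only).
rng = np.random.default_rng(0)
N, D, M = 16, 8, 4
x = rng.standard_normal((N, D))
out = paca_attention(
    x,
    rng.standard_normal((D, M)),
    rng.standard_normal((D, D)),
    rng.standard_normal((D, D)),
    rng.standard_normal((D, D)),
)
```

Because `attn` has shape (N, M) with M fixed, each of its rows can also be read off directly as a patch-to-cluster assignment map, which is what makes the learned clusters easy to visualize.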


