Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

04/19/2022
by Wang Zeng, et al.

Vision transformers have achieved great success in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular, fixed grid and treating each cell as a token. However, not all regions are equally important in human-centric vision tasks: for example, the human body needs a fine representation with many tokens, while the image background can be modeled by a few tokens. To address this problem, we propose a novel vision transformer, called Token Clustering Transformer (TCFormer), which merges tokens by progressive clustering, so that tokens can be merged from different locations and take flexible shapes and sizes. The tokens in TCFormer can not only focus on important areas but also adjust their shapes to fit the semantic concept and adopt a fine resolution for regions containing critical details, which helps capture detailed information. Extensive experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets, including whole-body pose estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW. Code is available at https://github.com/zengwang430521/TCFormer.git
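
To make the idea of merging tokens by clustering concrete, here is a minimal sketch in plain Python. It is not the TCFormer implementation: the abstract does not say which clustering algorithm or merging rule is used, so this sketch assumes plain k-means on token features and a simple average per cluster. The function name `merge_tokens_by_clustering` and its parameters are hypothetical, chosen only for illustration.

```python
def merge_tokens_by_clustering(tokens, num_clusters, iters=10):
    """Group token feature vectors with plain k-means (an assumed stand-in,
    not the paper's clustering) and return one averaged token per cluster."""
    n = len(tokens)
    # Deterministic initialisation: pick evenly spaced tokens as centers.
    centers = [list(tokens[i * n // num_clusters]) for i in range(num_clusters)]
    assign = [0] * n

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(iters):
        # Assignment step: each token joins its nearest center.
        for i, tok in enumerate(tokens):
            assign[i] = min(range(num_clusters),
                            key=lambda c: sq_dist(tok, centers[c]))
        # Update step: each center becomes the mean of its member tokens.
        for c in range(num_clusters):
            members = [tokens[i] for i in range(n) if assign[i] == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]

    # The final centers act as the merged tokens; `assign` records which
    # original token went into which merged token.
    return centers, assign


# Toy example: three near-identical "background" tokens and one distinct token.
toy_tokens = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
merged, assignment = merge_tokens_by_clustering(toy_tokens, num_clusters=2)
print(len(merged), assignment)
```

In this toy input the three similar tokens collapse into a single merged token while the distinct one keeps its own, which mirrors the intuition that background regions can be covered by few tokens. Because membership is decided by feature similarity rather than grid position, a merged token can gather members from anywhere in the image.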


