Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers

06/15/2023
by Dominick Reilly, et al.

Human perception of surroundings is often guided by the various poses present within the environment. Many computer vision tasks, such as human action recognition and robot imitation learning, rely on pose-based entities like human skeletons or robotic arms. However, conventional Vision Transformer (ViT) models uniformly process all patches, neglecting valuable pose priors in input videos. We argue that incorporating poses alongside RGB data is advantageous for learning fine-grained and viewpoint-agnostic representations. Consequently, we introduce two strategies for learning pose-aware representations in ViTs. The first method, called Pose-aware Attention Block (PAAB), is a plug-and-play ViT block that performs localized attention on pose regions within videos. The second method, dubbed Pose-Aware Auxiliary Task (PAAT), presents an auxiliary pose prediction task optimized jointly with the primary ViT task. Although their functionalities differ, both methods succeed in learning pose-aware representations, enhancing performance in multiple diverse downstream tasks. Our experiments, conducted across seven datasets, reveal the efficacy of both pose-aware methods on three video analysis tasks, with PAAT holding a slight edge over PAAB. Both PAAT and PAAB surpass their respective backbone Transformers by up to 9.8% on multi-view robotic video alignment. Code is available at https://github.com/dominickrei/PoseAwareVT.
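The authors' implementation is at the GitHub link above. As a rough illustration of the idea behind PAAB's localized attention, a ViT block would need to know which patches contain pose keypoints; the minimal sketch below maps 2D keypoint coordinates to row-major patch indices (the function name, grid layout, and parameters are assumptions for illustration, not the paper's code):

```python
# Sketch (assumption, not the authors' code): map 2D pose keypoints to the
# ViT patches containing them, yielding indices a PAAB-style block could use
# to restrict attention to pose regions.

def pose_patch_mask(keypoints, image_size=224, patch_size=16):
    """Return the set of row-major patch indices covering the keypoints.

    keypoints: iterable of (x, y) pixel coordinates in the frame.
    """
    patches_per_side = image_size // patch_size  # e.g. 224 // 16 = 14
    mask = set()
    for x, y in keypoints:
        # Clamp to the grid so keypoints on the image border stay valid.
        col = min(int(x) // patch_size, patches_per_side - 1)
        row = min(int(y) // patch_size, patches_per_side - 1)
        mask.add(row * patches_per_side + col)
    return mask

# Example: keypoint (40, 100) falls in column 2, row 6 of a 14x14 grid,
# i.e. patch index 6 * 14 + 2 = 86.
print(pose_patch_mask([(0, 0), (40, 100)]))
```

A mask like this could gate the attention computation so that queries attend only to pose-region patches, which is the general mechanism the abstract describes for PAAB.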


