EgoViT: Pyramid Video Transformer for Egocentric Action Recognition

03/15/2023
by Chenbin Pan et al.

Capturing the interaction between hands and objects is important for autonomously detecting human actions in egocentric videos. In this work, we present a pyramid video transformer with a dynamic class token generator for egocentric action recognition. Unlike previous video transformers, which use the same static embedding as the class token for all inputs, we propose a dynamic class token generator that produces a class token for each input video by analyzing the hand-object interaction and the related motion information. The dynamic class token can diffuse this information through the entire model by communicating with the other informative tokens in the subsequent transformer layers. With the dynamic class token, dissimilarities between videos become more prominent, which helps the model distinguish diverse inputs. In addition, traditional video transformers model temporal features globally, which requires a large amount of computation. Egocentric videos, however, often contain frequent background scene transitions, which cause discontinuities across distant frames; blindly reducing the temporal sampling rate therefore risks losing crucial information. Hence, we also propose a pyramid architecture that processes the video hierarchically, from short-term features at a high frame rate to long-term features at a low rate. With the proposed architecture, we significantly reduce the computational cost and memory requirement without sacrificing model performance. We compare against different baseline video transformers on the EPIC-KITCHENS-100 and EGTEA Gaze+ datasets. Both quantitative and qualitative results show that the proposed model efficiently improves egocentric action recognition performance.
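To make the first idea concrete, below is a minimal sketch (in PyTorch, which the abstract does not specify) of how a per-video class token could be generated. The module name, layer sizes, and the use of simple average pooling in place of the paper's hand-object and motion analysis are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DynamicClassTokenGenerator(nn.Module):
    """Sketch of a per-video class token.

    Instead of one static learnable [CLS] embedding shared by all inputs,
    the token is produced from the input clip itself, so videos with
    different hand-object interactions start from different class tokens.
    The pooled-projection design here is an assumption; the paper derives
    the token from hand-object interaction and motion cues.
    """

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Project pooled patch features into a class token.
        self.proj = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_tokens, embed_dim) from the video stem.
        pooled = patch_tokens.mean(dim=1)        # global average pool
        cls = self.proj(pooled).unsqueeze(1)     # (batch, 1, embed_dim)
        # Prepend the dynamic token; it then exchanges information with
        # the other tokens in the subsequent transformer layers.
        return torch.cat([cls, patch_tokens], dim=1)
```

Because the token is a function of the input rather than a shared constant, two clips with different interactions enter the transformer with already-distinct class representations, which is what makes inter-video dissimilarity more prominent.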
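The temporal pyramid can likewise be sketched as a stack of transformer stages with downsampling between them: early stages attend over densely sampled frames, and each later stage halves the temporal rate before attending over a longer horizon. Since self-attention cost grows quadratically with token count, this halving is what reduces computation and memory. The stage count, pooling choice, and layer configuration below are again assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class TemporalPyramid(nn.Module):
    """Sketch of hierarchical short-term-to-long-term temporal processing."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 8, stages: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
             for _ in range(stages)]
        )
        # Strided average pooling halves the temporal rate between stages.
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, embed_dim) frame-level tokens at a high rate.
        for i, block in enumerate(self.blocks):
            x = block(x)  # attend at the current temporal rate
            if i < len(self.blocks) - 1:
                # Halve the rate before the next, longer-term stage.
                x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        return x  # (batch, frames // 2**(stages - 1), embed_dim)
```

The key contrast with a uniformly subsampled transformer is that fine temporal detail is still seen at the bottom of the pyramid, so lowering the rate at the top does not discard the short-term cues that scene transitions would otherwise break.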


Related research

- 12/14/2021: Co-training Transformer with Videos and Images Improves Action Recognition. In learning action recognition, models are typically pre-trained on obje...
- 11/23/2021: Efficient Video Transformers with Spatial-Temporal Token Selection. Video transformers have achieved impressive results on major video recog...
- 07/01/2021: VideoLightFormer: Lightweight Action Recognition using Transformers. Efficient video action recognition remains a challenging problem. One la...
- 08/25/2023: Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers. Vision Transformers achieve impressive accuracy across a range of visual...
- 08/25/2022: Visualizing the Passage of Time with Video Temporal Pyramids. What can we learn about a scene by watching it for months or years? A vi...
- 03/25/2023: Selective Structured State-Spaces for Long-Form Video Understanding. Effective modeling of complex spatiotemporal dependencies in long-form v...
- 12/07/2022: Multimodal Vision Transformers with Forced Attention for Behavior Analysis. Human behavior understanding requires looking at minute details in the l...
