Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition

06/09/2021
by Ziyuan Huang, et al.

With the recent surge of research on vision transformers, these models have demonstrated remarkable potential for challenging computer vision applications such as image recognition, point cloud classification, and video understanding. In this paper, we present empirical results for training a stronger video vision transformer on the EPIC-KITCHENS-100 Action Recognition dataset. Specifically, we explore training techniques for video vision transformers, including augmentations, input resolutions, and initialization. With our training recipe, a single ViViT model achieves 47.4% on the validation set of the EPIC-KITCHENS-100 dataset, outperforming the result reported in the original paper by 3.4%. We found that video transformers are especially good at predicting the noun in the verb-noun action prediction task. This makes the overall action prediction accuracy of video transformers notably higher than that of convolutional networks. Surprisingly, even the best video transformers underperform convolutional networks on verb prediction. Therefore, we combine the video vision transformers with some of the convolutional video networks and present our solution to the EPIC-KITCHENS-100 Action Recognition competition.
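To make the verb-noun setup concrete, here is a minimal sketch of how scores from several models might be combined and how action accuracy can be counted, where an action is correct only when both its verb and its noun are correct. The abstract does not describe the authors' ensembling method, so the equal-weight score averaging is an assumption, and the helper names (`average_scores`, `argmax`, `action_accuracy`) and the toy scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def average_scores(per_model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average class scores across models.

    Equal weights are an assumption made for illustration only.
    """
    n_models = len(per_model_scores)
    n_classes = len(per_model_scores[0])
    return [
        sum(model[c] for model in per_model_scores) / n_models
        for c in range(n_classes)
    ]


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the highest score (first index on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def action_accuracy(
    preds: Sequence[Tuple[int, int]], labels: Sequence[Tuple[int, int]]
) -> float:
    """Fraction of clips whose (verb, noun) pair is predicted exactly."""
    correct = sum(1 for p, t in zip(preds, labels) if p == t)
    return correct / len(labels)


# Toy example for one clip: two hypothetical models, 2 verb and 3 noun classes.
verb_scores = [[0.2, 0.8], [0.6, 0.4]]            # one row per model
noun_scores = [[0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]  # one row per model

verb_pred = argmax(average_scores(verb_scores))
noun_pred = argmax(average_scores(noun_scores))
clip_prediction = (verb_pred, noun_pred)
```

Under this counting rule, a model that is strong on nouns but weaker on verbs can still lose action accuracy on every clip where the verb is wrong, which is one reason combining models with complementary strengths can help.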


