Co-training Transformer with Videos and Images Improves Action Recognition

12/14/2021
by Bowen Zhang, et al.

In action recognition, models are typically pre-trained on object recognition with images (e.g., ImageNet) and later fine-tuned on the target action recognition task with videos. This approach has achieved good empirical performance, especially with recent transformer-based video architectures. While many recent works aim to design more advanced transformer architectures for action recognition, less effort has been devoted to how to train video transformers. In this work, we explore several training paradigms and present two findings. First, video transformers benefit from joint training on diverse video datasets and label spaces (e.g., Kinetics is appearance-focused while SomethingSomething is motion-focused). Second, by further co-training with images (as single-frame videos), the video transformers learn even better video representations. We term this approach Co-training Videos and Images for Action Recognition (CoVeR). In particular, when pretrained on ImageNet-21K based on the TimeSFormer architecture, CoVeR improves Kinetics-400 Top-1 accuracy by 2.4% [...]. When pretrained on larger-scale image datasets following the previous state of the art, CoVeR achieves the best results on Kinetics-400 (87.2%), [...] Kinetics-700 (79.8%), [...] (46.1%).
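The two ideas in the abstract, joint training over several datasets with separate label spaces and treating an image as a single-frame video, can be illustrated with a small sketch. This is not the authors' implementation: the dataset names, sizes, and class counts are hypothetical toy values, and sampling datasets in proportion to their size is an assumption made only for illustration.

```python
import random
from typing import Dict, List, Sequence

Frame = List[List[float]]  # one H x W frame (channels omitted for brevity)
Clip = List[Frame]         # a video is a list of T frames


def as_clip(sample: Sequence) -> Clip:
    """Return a video clip; a bare image becomes a 1-frame clip."""
    # Assumption: an image is a 2-D list of numbers, a video a 3-D list.
    is_image = isinstance(sample[0][0], (int, float))
    if is_image:
        return [[list(row) for row in sample]]
    return [[list(row) for row in frame] for frame in sample]


def cotraining_schedule(sizes: Dict[str, int], steps: int, seed: int = 0) -> List[str]:
    """Pick which dataset feeds each training step.

    Assumption: datasets are sampled in proportion to their size.
    """
    rng = random.Random(seed)
    names = list(sizes)
    weights = [sizes[name] for name in names]
    return rng.choices(names, weights=weights, k=steps)


def head_size(dataset: str, num_classes: Dict[str, int]) -> int:
    """Each dataset keeps its own label space, hence its own output head."""
    return num_classes[dataset]


# Hypothetical toy setup: two video datasets and one image dataset.
SIZES = {"video_a": 300, "video_b": 200, "images": 500}
NUM_CLASSES = {"video_a": 10, "video_b": 7, "images": 50}

if __name__ == "__main__":
    image = [[0.1, 0.2], [0.3, 0.4]]
    print(len(as_clip(image)))  # number of frames in the wrapped image
    for name in cotraining_schedule(SIZES, steps=5):
        print(name, head_size(name, NUM_CLASSES))
```

In this sketch a shared backbone would consume every clip regardless of source, and only the classification head would change with the dataset chosen for the step.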


