Joint learning of images and videos with a single Vision Transformer

08/21/2023
by   Shuki Shimizu, et al.

In this study, we propose a method for jointly learning images and videos with a single model. In general, images and videos are trained by separate models. In this paper, we propose a Vision Transformer, IV-ViT, that takes a batch of images as input and also a set of video frames, which are aggregated temporally by late fusion. Experimental results on two image datasets and two action recognition datasets are presented.
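A minimal PyTorch sketch of the late-fusion idea described above is given below. It assumes a torchvision ViT-B/16 backbone shared between images and video frames; the class name LateFusionViT, the linear classifier head, and the mean-over-frames aggregation are illustrative assumptions, not the paper's exact IV-ViT architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16  # assumed backbone; the paper's IV-ViT may differ


class LateFusionViT(nn.Module):
    """Single ViT handling both images and videos (sketch).

    Images: (B, 3, H, W)     -> per-image logits
    Videos: (B, T, 3, H, W)  -> frame-wise logits averaged over time (late fusion)
    """

    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = vit_b_16(weights=None)
        self.backbone.heads = nn.Identity()      # keep the 768-d [CLS] feature
        self.classifier = nn.Linear(768, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.dim() == 4:                          # image batch (B, 3, H, W)
            return self.classifier(self.backbone(x))
        b, t, c, h, w = x.shape                   # video batch (B, T, 3, H, W)
        feats = self.backbone(x.flatten(0, 1))    # shared weights applied per frame
        logits = self.classifier(feats).view(b, t, -1)
        return logits.mean(dim=1)                 # temporal aggregation by late fusion


# Usage: one model instance can be trained jointly on image and video batches.
model = LateFusionViT(num_classes=400)
image_logits = model(torch.randn(2, 3, 224, 224))
video_logits = model(torch.randn(2, 8, 3, 224, 224))
```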


Related research

12/14/2021 - Co-training Transformer with Videos and Images Improves Action Recognition
  In learning action recognition, models are typically pre-trained on obje...

07/05/2023 - Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition
  In the research field of few-shot learning, the main difference between ...

04/19/2022 - On the Performance Evaluation of Action Recognition Models on Transcoded Low Quality Videos
  In the design of action recognition models, the quality of videos in the...

11/23/2022 - SVFormer: Semi-supervised Video Transformer for Action Recognition
  Semi-supervised action recognition is a challenging but critical task du...

06/16/2022 - OmniMAE: Single Model Masked Pretraining on Images and Videos
  Transformer-based architectures have become competitive across a variety...

04/23/2022 - VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame Filtration for Automatic Retail Checkout
  Multi-class product counting and recognition identifies product items fr...

05/10/2012 - Hajj and Umrah Event Recognition Datasets
  In this note, new Hajj and Umrah Event Recognition datasets (HUER) are p...
