Video Transformer Network

02/01/2021
by Daniel Neimark, et al.

This paper presents VTN, a transformer-based framework for video recognition. Inspired by recent developments in vision transformers, we ditch the standard approach in video action recognition that relies on 3D ConvNets and introduce a method that classifies actions by attending to the entire video sequence. Our approach is generic and builds on top of any given 2D spatial network. In terms of wall-clock runtime, it trains 16.1× faster and runs 5.1× faster during inference while maintaining competitive accuracy compared to other state-of-the-art methods. It enables whole-video analysis via a single end-to-end pass, while requiring 1.5× fewer GFLOPs. We report competitive results on Kinetics-400 and present an ablation study of VTN properties and the trade-off between accuracy and inference speed. We hope our approach will serve as a new baseline and start a fresh line of research in the video recognition domain. Code and models will be available soon.
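The idea in the abstract (per-frame 2D features, then attention across the whole frame sequence, then classification) can be illustrated with a toy sketch. This is not the authors' implementation: `frame_features` is a hypothetical stand-in for a real 2D spatial network, the attention is plain single-head softmax attention with identity projections, and mean pooling plus a linear layer is an assumed classification head. All names and sizes here are illustrative only.

```python
import math
import random


def frame_features(frame, dim):
    """Hypothetical stand-in for a 2D spatial backbone: averages chunks
    of a flat pixel list into a `dim`-sized feature vector."""
    chunk = max(1, len(frame) // dim)
    return [sum(frame[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Single-head attention with identity projections (an assumption):
    each frame token attends to every frame token in the sequence."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / scale for k in tokens])
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(len(q))])
    return out


def classify_video(frames, class_weights, dim=4):
    """Spatial features per frame, temporal attention over all frames,
    mean pooling, then a linear layer and softmax over classes."""
    feats = [frame_features(f, dim) for f in frames]
    attended = self_attention(feats)
    pooled = [sum(tok[j] for tok in attended) / len(attended)
              for j in range(dim)]
    logits = [dot(w, pooled) for w in class_weights]
    return softmax(logits)


# Toy input: 8 "frames" of 16 pixel values each, 3 hypothetical classes.
random.seed(0)
frames = [[random.random() for _ in range(16)] for _ in range(8)]
class_weights = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]

probs = classify_video(frames, class_weights)
print(probs)
```

Because the attention step operates on a list of frame tokens of any length, the same function handles short and long clips in a single pass, which is the property the abstract refers to as whole-video analysis. Swapping `frame_features` for a different per-frame extractor leaves the rest unchanged, mirroring the claim that the approach builds on any 2D spatial network.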


