VideoLightFormer: Lightweight Action Recognition using Transformers

07/01/2021
by   Raivo Koot, et al.
0

Efficient video action recognition remains a challenging problem. One large model after another takes the place of the state-of-the-art on the Kinetics dataset, but real-world efficiency evaluations are often lacking. In this work, we fill this gap and investigate the use of transformers for efficient action recognition. We propose a novel, lightweight action recognition architecture, VideoLightFormer. In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers, while maintaining spatial and temporal video structure throughout the entire model. Existing methods often resort to one of the two extremes, where they either apply huge transformers to video features, or minimal transformers on highly pooled video features. Our method differs from them by keeping the transformer models small, but leveraging full spatiotemporal feature structure. We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 (SSV2) datasets and find that it achieves a better mix of efficiency and accuracy than existing state-of-the-art models, apart from the Temporal Shift Module on SSV2.

READ FULL TEXT
research
11/18/2021

Evaluating Transformers for Lightweight Action Recognition

In video action recognition, transformers consistently reach state-of-th...
research
03/17/2023

Dual-path Adaptation from Image to Video Transformers

In this paper, we efficiently transfer the surpassing representation pow...
research
08/25/2023

Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers

Vision Transformers achieve impressive accuracy across a range of visual...
research
05/21/2020

SpotFast Networks with Memory Augmented Lateral Transformers for Lipreading

This paper presents a novel deep learning architecture for word-level li...
research
07/29/2022

Forensic License Plate Recognition with Compression-Informed Transformers

Forensic license plate recognition (FLPR) remains an open challenge in l...
research
03/15/2023

EgoViT: Pyramid Video Transformer for Egocentric Action Recognition

Capturing interaction of hands with objects is important to autonomously...
research
06/09/2021

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

In video transformers, the time dimension is often treated in the same w...

Please sign up or login with your details

Forgot password? Click here to reset