Selective Volume Mixup for Video Action Recognition

09/18/2023
by Yi Tan, et al.

Recent advances in Convolutional Neural Networks (CNNs) and Vision Transformers have demonstrated high learning capability for video action recognition on large datasets. Nevertheless, deep models often overfit on small-scale datasets with a limited number of training videos. A common solution is to apply existing image augmentation strategies, such as Mixup, CutMix, and RandAugment, to each frame individually, but these strategies are not specifically designed for video data. In this paper, we propose a novel video augmentation strategy named Selective Volume Mixup (SV-Mix) to improve the generalization ability of deep models trained on limited videos. SV-Mix uses a learnable selective module to choose the most informative volumes from two videos and mixes those volumes to produce a new training video. Technically, we propose two new modules: a spatial selective module that selects local patches at each spatial position, and a temporal selective module that mixes entire frames at each timestamp and so preserves the spatial pattern. Each time, we randomly choose one of the two modules to expand the diversity of training samples. The selective modules are jointly optimized with the video action recognition framework to find the optimal augmentation strategy. We empirically demonstrate the merits of SV-Mix on a wide range of video action recognition benchmarks, where it consistently boosts the performance of both CNN-based and transformer-based models.
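The two mixing modes described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the learnable selective modules are replaced here by random binary masks, and the function names (`temporal_mix`, `spatial_mix`, `volume_mix`), the `patch` size, the 50/50 choice between modes, and the CutMix-style label weighting by mask fraction are all assumptions made for illustration. Videos are assumed to be arrays of shape (T, H, W, C).

```python
import numpy as np

def temporal_mix(video_a, video_b, frame_mask):
    """Mix whole frames: frame_mask[t] == 1 takes frame t from video_a, else video_b."""
    m = frame_mask.reshape(-1, 1, 1, 1).astype(video_a.dtype)
    return m * video_a + (1 - m) * video_b

def spatial_mix(video_a, video_b, patch_mask, patch):
    """Mix local patches: patch_mask has shape (T, H // patch, W // patch)."""
    # Upsample the patch-level mask to pixel resolution.
    m = np.repeat(np.repeat(patch_mask, patch, axis=1), patch, axis=2)
    m = m[..., None].astype(video_a.dtype)
    return m * video_a + (1 - m) * video_b

def volume_mix(video_a, label_a, video_b, label_b, rng, patch=4):
    """Randomly pick one mixing mode and return the mixed video and soft label.

    Assumption: masks are sampled at random here; in SV-Mix they would come
    from selective modules trained jointly with the recognition model.
    """
    T, H, W, _ = video_a.shape
    if rng.random() < 0.5:
        mask = rng.integers(0, 2, size=T)
        mixed = temporal_mix(video_a, video_b, mask)
    else:
        mask = rng.integers(0, 2, size=(T, H // patch, W // patch))
        mixed = spatial_mix(video_a, video_b, mask, patch)
    lam = mask.mean()  # fraction of the volume taken from video_a (assumed label weight)
    return mixed, lam * label_a + (1 - lam) * label_b
```

A call such as `volume_mix(a, one_hot_a, b, one_hot_b, np.random.default_rng(0))` returns a mixed clip with the same shape as the inputs and a soft label whose entries sum to one. In this sketch H and W are assumed to be divisible by `patch`.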

research
11/23/2022

SVFormer: Semi-supervised Video Transformer for Action Recognition

Semi-supervised action recognition is a challenging but critical task du...
research
05/08/2017

Temporal Segment Networks for Action Recognition in Videos

Deep convolutional networks have achieved great success for image recogn...
research
11/09/2022

Extending Temporal Data Augmentation for Video Action Recognition

Pixel space augmentation has grown in popularity in many Deep Learning a...
research
09/25/2020

Online Learnable Keyframe Extraction in Videos and its Application with Semantic Word Vector in Action Recognition

Video processing has become a popular research direction in computer vis...
research
04/02/2020

Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention

Attentive video modeling is essential for action recognition in unconstr...
research
12/07/2020

VideoMix: Rethinking Data Augmentation for Video Classification

State-of-the-art video action classifiers often suffer from overfitting....
