Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

01/19/2023
by   Jiazheng Xing, et al.

Spatial and temporal modeling is one of the core aspects of few-shot action recognition. Most previous works focus on long-term temporal relation modeling based on high-level spatial representations, neglecting the crucial low-level spatial features and short-term temporal relations. In fact, the former carries rich local semantic information, while the latter captures the motion characteristics of adjacent frames. In this paper, we propose SloshNet, a new framework that revisits spatial and temporal modeling for few-shot action recognition at a finer granularity. First, to exploit low-level spatial features, we design a feature fusion architecture search module that automatically searches for the best combination of low-level and high-level spatial features. Next, inspired by recent transformer architectures, we introduce a long-term temporal modeling module that models global temporal relations based on the extracted spatial appearance features. In parallel, we design a short-term temporal modeling module that encodes the motion characteristics between adjacent frame representations. The final predictions are then obtained by feeding the resulting rich spatio-temporal features to a standard frame-level class prototype matcher. We extensively validate the proposed SloshNet on four few-shot action recognition datasets: Something-Something V2, Kinetics, UCF101, and HMDB51. It achieves favorable results against state-of-the-art methods on all four datasets.
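The three components described above — global temporal mixing over frame features, short-term motion encoding from adjacent-frame differences, and frame-level prototype matching — can be sketched roughly as follows. This is a minimal NumPy illustration under simplifying assumptions (single-head attention, random projections, cosine-similarity matching); the function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def long_term_attention(frames, rng):
    """Global temporal relation modeling: self-attention across all T frames.
    frames: (T, D) per-frame spatial appearance features."""
    T, D = frames.shape
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    q, k, v = frames @ Wq, frames @ Wk, frames @ Wv
    attn = softmax(q @ k.T / np.sqrt(D))   # (T, T) frame-to-frame weights
    return frames + attn @ v               # residual connection

def short_term_motion(frames):
    """Short-term temporal modeling: motion as adjacent-frame differences,
    padded at the end to preserve the temporal length T."""
    diff = frames[1:] - frames[:-1]
    diff = np.concatenate([diff, diff[-1:]], axis=0)
    return frames + diff

def prototype_match(query, support_prototype):
    """Frame-level class prototype matching via mean cosine similarity."""
    qn = query / np.linalg.norm(query, axis=-1, keepdims=True)
    sn = support_prototype / np.linalg.norm(support_prototype, axis=-1, keepdims=True)
    return float((qn * sn).sum(axis=-1).mean())
```

In the paper's framework these modules operate on features produced by the searched low-/high-level fusion backbone; here random projections stand in for learned weights purely to show the data flow.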


research · 06/30/2021
Long-Short Temporal Modeling for Efficient Action Recognition
Efficient long-short temporal modeling is key for enhancing the performa...

research · 10/12/2021
Video Is Graph: Structured Graph Module for Video Action Recognition
In the field of action recognition, video clips are always treated as or...

research · 09/22/2022
FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial Video Classification
Unmanned aerial vehicles (UAVs) are now widely applied to data acquisiti...

research · 03/14/2023
You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos
Given an untrimmed video, temporal sentence grounding (TSG) aims to loca...

research · 08/30/2020
Finding Action Tubes with a Sparse-to-Dense Framework
The task of spatial-temporal action detection has attracted increasing a...

research · 06/02/2018
Squeeze-and-Excitation on Spatial and Temporal Deep Feature Space for Action Recognition
Spatial and temporal features are two key and complementary information ...

research · 08/18/2023
Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching
Class prototype construction and matching are core aspects of few-shot a...
