Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition

07/05/2023
by   Fei Guo, et al.

In few-shot learning research, the main difference between image-based and video-based tasks is the additional temporal dimension of videos. In recent years, many approaches to few-shot action recognition have followed metric-based methods; in particular, some works use a Transformer to obtain cross-attention features of the videos or enhanced prototypes, with competitive results. However, they do not mine enough information from the Transformer because they focus only on features at a single level. In this paper, we address this problem. We propose an end-to-end method named the Task-Specific Alignment and Multiple-Level Transformer Network (TSA-MLT). In our model, the Multiple-Level Transformer attends to multiple-level features of the support and query videos. Before the Multiple-Level Transformer, a task-specific alignment (TSA) module filters out unimportant or misleading frames as a pre-processing step. Furthermore, we adopt a fusion loss based on two kinds of distance: the first is an L2 sequence distance, which focuses on temporal-order alignment; the second is an optimal transport distance, which measures the gap between the appearance and semantics of the videos. Using a simple fusion network, we fuse the two distances element-wise and then apply a cross-entropy loss as our fusion loss. Extensive experiments show that our method achieves state-of-the-art results on the HMDB51 and UCF101 datasets and competitive results on the Kinetics and Something-Something V2 benchmarks. Our code will be available at https://github.com/cofly2014/tsa-mlt.git
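The two distances in the fusion loss can be illustrated with a minimal sketch. This is not the paper's implementation: the function names are hypothetical, the learned fusion network is replaced by a fixed convex combination (`alpha`), and entropy-regularized Sinkhorn iteration stands in for the optimal transport computation. It only shows the contrast between an order-sensitive distance (frame t vs. frame t) and an order-agnostic one (best soft matching of frame sets).

```python
import numpy as np

def l2_sequence_distance(q, s):
    # Temporal-order-aware distance: frame t of the query is compared
    # with frame t of the support, then averaged over the sequence.
    # q, s: (num_frames, feature_dim) arrays of frame features.
    return float(np.mean(np.linalg.norm(q - s, axis=1)))

def sinkhorn_ot_distance(q, s, reg=0.1, n_iters=50):
    # Entropy-regularized optimal transport (Sinkhorn) between the two
    # frame sets; ignores temporal order and measures the overall
    # appearance/semantic gap between the videos.
    C = np.linalg.norm(q[:, None, :] - s[None, :, :], axis=2)  # pairwise cost
    K = np.exp(-C / reg)
    a = np.full(len(q), 1.0 / len(q))  # uniform mass on query frames
    b = np.full(len(s), 1.0 / len(s))  # uniform mass on support frames
    v = np.ones(len(s))
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = np.diag(u) @ K @ np.diag(v)    # transport plan
    return float(np.sum(P * C))

def fused_distance(q, s, alpha=0.5):
    # Element-wise fusion of the two distances. The paper learns this
    # fusion with a small network; here it is a fixed weighted sum.
    return alpha * l2_sequence_distance(q, s) + (1 - alpha) * sinkhorn_ot_distance(q, s)
```

In the full method, the fused distances to each class prototype would be negated to form logits and trained with cross-entropy; the sketch stops at the distance itself.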


