Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition

08/03/2022
by Mei Chee Leong, et al.

Fine-grained action recognition is a challenging task in computer vision. Because fine-grained datasets have small inter-class variations in both space and time, a fine-grained action recognition model requires good temporal reasoning and the ability to discriminate between attribute-level action semantics. Leveraging the CNN's ability to capture high-level spatio-temporal feature representations and the Transformer's efficiency in modeling latent semantics and global dependencies, we investigate two frameworks that combine a CNN vision backbone with a Transformer encoder to enhance fine-grained action recognition: 1) a vision-based encoder that learns latent temporal semantics, and 2) a multi-modal video-text cross encoder that exploits additional text input and learns cross-associations between visual and text semantics. Our experimental results show that both Transformer encoder frameworks effectively learn latent temporal semantics and cross-modality associations, and both improve recognition performance over the CNN vision model. Both proposed architectures achieve new state-of-the-art performance on the FineGym benchmark dataset.
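To make the first framework more concrete, the following is a minimal sketch of the general idea only, not the authors' implementation. It assumes the CNN backbone has already produced one feature vector per temporal segment, and it applies a single-head self-attention layer, a residual connection, mean pooling over time, and a linear classifier. The names `classify_clip`, `NUM_SEGMENTS`, `FEATURE_DIM`, and `NUM_CLASSES`, the random weights, and the random stand-in features are all hypothetical; the abstract does not specify layer counts, dimensions, or pooling.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols, scale=0.1):
    """Random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    dim = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, v), weights


def classify_clip(segment_features, params):
    """Attend across temporal segments, pool over time, and classify."""
    attended, weights = self_attention(
        segment_features, params["w_q"], params["w_k"], params["w_v"]
    )
    # Residual connection: add the attended values back onto the CNN features.
    tokens = [[a + f for a, f in zip(a_row, f_row)]
              for a_row, f_row in zip(attended, segment_features)]
    # Mean-pool over the temporal axis to get one clip-level vector.
    pooled = [sum(col) / len(tokens) for col in zip(*tokens)]
    logits = matmul([pooled], params["w_out"])[0]
    return softmax(logits), weights


# Hypothetical sizes, chosen only for illustration.
NUM_SEGMENTS, FEATURE_DIM, NUM_CLASSES = 8, 16, 5

rng = random.Random(0)
# Stand-in for per-segment features that a CNN backbone would produce.
segment_features = random_matrix(rng, NUM_SEGMENTS, FEATURE_DIM, scale=1.0)
params = {
    "w_q": random_matrix(rng, FEATURE_DIM, FEATURE_DIM),
    "w_k": random_matrix(rng, FEATURE_DIM, FEATURE_DIM),
    "w_v": random_matrix(rng, FEATURE_DIM, FEATURE_DIM),
    "w_out": random_matrix(rng, FEATURE_DIM, NUM_CLASSES),
}

probs, attention = classify_clip(segment_features, params)
print(len(probs), len(attention), len(attention[0]))
```

Each row of `attention` shows how strongly one temporal segment attends to every other segment, which is one way a model could pick up the temporal dependencies the abstract refers to. A cross-modal variant might, under the same assumptions, concatenate text-token embeddings with the visual tokens before attention; how the authors combine the two modalities is not stated here.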


