DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition

03/19/2022
by Thanh-Dat Truong, et al.

Human action recognition has recently become one of the most popular research topics in the computer vision community. Various 3D-CNN based methods have been presented to tackle both the spatial and temporal dimensions of video action recognition with competitive results. However, these methods suffer from fundamental limitations in robustness and generalization, e.g., how does the temporal ordering of video frames affect the recognition results? This work presents DirecFormer, a novel end-to-end Transformer-based Directed Attention framework for robust action recognition. The method takes a simple but novel Transformer-based perspective to understand the right order of action sequences. The contributions of this work are therefore three-fold. Firstly, we introduce the issue of ordered temporal learning to the action recognition problem. Secondly, a new Directed Attention mechanism is introduced to understand and attend to human actions in the right order. Thirdly, we introduce conditional dependency into action sequence modeling, covering both orders and classes. The proposed approach consistently achieves state-of-the-art (SOTA) results compared with recent action recognition methods on three standard large-scale benchmarks, i.e., Jester, Kinetics-400, and Something-Something-V2.
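To make the idea of order-aware attention concrete, below is a minimal sketch of how attention scores could be modulated by temporal direction, so that attending forward in time is weighted differently from attending backward. This is an illustrative toy parameterization (the function name, the per-direction gate `alpha`, and the sign-based direction matrix are assumptions for exposition), not the paper's exact Directed Attention formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def directed_attention(x, w_q, w_k, w_v, alpha):
    """Toy order-aware self-attention over a frame sequence.

    x:     (T, d) frame embeddings
    w_q/k/v: (d, d) projection matrices
    alpha: (2,) learned weights for [backward, forward] attention
           (hypothetical parameterization, for illustration only)
    """
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d)                 # (T, T) similarity
    # Direction matrix: +1 where the key is a future frame (j > i),
    # -1 where it is a past frame (j < i), 0 on the diagonal.
    idx = np.arange(T)
    direction = np.sign(idx[None, :] - idx[:, None])
    # Gate scores by temporal direction before normalizing.
    gate = np.where(direction > 0, alpha[1],
           np.where(direction < 0, alpha[0], 1.0))
    return softmax(scores * gate) @ v
```

With `alpha = [1, 1]` the gate is the all-ones matrix and this reduces to standard scaled dot-product self-attention; unequal weights let the model penalize or emphasize attention against the temporal order.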


