Symbiotic Attention with Privileged Information for Egocentric Action Recognition

02/08/2020
by Xiaohan Wang, et al.

Egocentric video recognition is a natural testbed for diverse interaction reasoning. Because egocentric video datasets have large action vocabularies, recent studies usually adopt a two-branch structure for action recognition, i.e., one branch for verb classification and the other for noun classification. However, the correlation between the verb and noun branches has been largely ignored. Moreover, the two branches fail to exploit local features because they lack a position-aware attention mechanism. In this paper, we propose a novel Symbiotic Attention framework leveraging Privileged information (SAP) for egocentric video recognition. Fine-grained, position-aware object detection features can facilitate understanding of the actor's interaction with objects; we introduce these features into action recognition and regard them as privileged information. Our framework enables mutual communication among the verb branch, the noun branch, and the privileged information. This communication not only injects local details into global features but also exploits implicit guidance about the spatio-temporal position of the ongoing action. To enable effective communication, we introduce a novel symbiotic attention (SA) mechanism. It first normalizes the detection-guided features of one branch to underline the action-relevant information from the other branch, and then adaptively enhances the interactions among the three sources. To further catalyze this communication, spatial relations are uncovered to select the most action-relevant information, identifying the most valuable and discriminative features for classification. We validate the effectiveness of SAP quantitatively and qualitatively. Notably, it achieves state-of-the-art performance on two large-scale egocentric video datasets.
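The cross-source communication described above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module name, the scaled dot-product scoring, and all tensor shapes are assumptions chosen only to make the idea concrete (one branch's global feature is enhanced by detection features, with the other branch acting as guidance).

```python
import torch
import torch.nn as nn


class SymbioticAttentionSketch(nn.Module):
    """Illustrative sketch of attention among a branch's global feature,
    the other branch's feature, and detection (privileged) features.
    Not the paper's code; shapes and scoring are assumptions."""

    def __init__(self, dim):
        super().__init__()
        self.proj_det = nn.Linear(dim, dim)  # project detection features
        self.norm = nn.LayerNorm(dim)        # normalize the detection-guided feature

    def forward(self, branch_feat, other_feat, det_feats):
        # branch_feat: (B, D) global feature of this branch (e.g. verb)
        # other_feat:  (B, D) global feature of the other branch (e.g. noun)
        # det_feats:   (B, N, D) position-aware object detection features
        det = self.proj_det(det_feats)                             # (B, N, D)
        # Score each detection against the other branch's feature so
        # action-relevant local details are emphasized.
        scores = torch.einsum('bd,bnd->bn', other_feat, det)
        attn = torch.softmax(scores / det.size(-1) ** 0.5, dim=1)  # (B, N)
        local = torch.einsum('bn,bnd->bd', attn, det)              # (B, D)
        # Inject the normalized, detection-guided local feature into
        # the branch's global representation.
        return branch_feat + self.norm(local)


# Toy usage: enhance the verb branch with noun-branch guidance and detections.
sa = SymbioticAttentionSketch(dim=512)
verb = torch.randn(2, 512)
noun = torch.randn(2, 512)
dets = torch.randn(2, 10, 512)
out = sa(verb, noun, dets)  # (2, 512), would feed the verb classifier
```

The residual connection plus normalization is one plausible way to "inject local details into global features" without overwhelming the branch's own representation; the paper's actual fusion may differ.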


