Free-Form Composition Networks for Egocentric Action Recognition

07/13/2023
by   Haoran Wang, et al.
Egocentric action recognition is gaining significant attention in the field of human action recognition. In this paper, we address the data scarcity issue in egocentric action recognition from a compositional generalization perspective. To tackle this problem, we propose a free-form composition network (FFCN) that simultaneously learns disentangled verb, preposition, and noun representations, and then uses them to compose new samples in the feature space for rare classes of action videos. First, we use a graph to capture the spatial-temporal relations among different hand/object instances in each action video. We then decompose each action into a set of verb and preposition spatial-temporal representations using the edge features in the graph. The temporal decomposition extracts verb and preposition representations from different video frames, while the spatial decomposition adaptively learns verb and preposition representations from action-related instances in each frame. With these spatial-temporal representations of verbs and prepositions, we can compose new samples for rare classes in a free-form manner that is not restricted to the rigid verb-plus-noun form. The proposed FFCN can directly generate new training samples for rare classes, and hence significantly improves action recognition performance. We evaluated our method on three popular egocentric action recognition datasets, Something-Something V2, H2O, and EPIC-KITCHENS-100, and the experimental results demonstrate the effectiveness of the proposed method for handling data scarcity problems, including long-tailed and few-shot egocentric action recognition.
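The core idea of composing synthetic features for rare classes from disentangled components can be illustrated with a minimal sketch. All names, feature dimensions, and the concatenation-based composition below are illustrative assumptions, not the paper's actual architecture or API:

```python
import random

# Hypothetical sketch of free-form composition in the spirit of FFCN.
# Real disentangled features would come from the learned graph encoder;
# here we stand in random vectors (an assumption for illustration).
random.seed(0)
FEAT_DIM = 8  # assumed per-component feature size

def rand_feat():
    return [random.gauss(0.0, 1.0) for _ in range(FEAT_DIM)]

# Disentangled component banks: verb, preposition, and noun features.
verb_bank = {"put": rand_feat(), "take": rand_feat()}
prep_bank = {"onto": rand_feat(), "from": rand_feat()}
noun_bank = {"cup": rand_feat(), "plate": rand_feat()}

def compose(parts):
    """Compose a synthetic action feature from any free-form sequence of
    (bank, label) components -- e.g. verb + preposition + noun for a rare
    class, or just verb + noun -- rather than a fixed verb-noun template."""
    feat = []
    for bank, label in parts:
        feat.extend(bank[label])
    return feat

# Synthesize a training feature for a rare composition "take from plate":
synthetic = compose([(verb_bank, "take"), (prep_bank, "from"), (noun_bank, "plate")])

# Free-form: a two-component composition works just as well:
short = compose([(verb_bank, "put"), (noun_bank, "cup")])
```

The synthesized features would then be fed to the classifier as extra training samples for the corresponding rare classes; concatenation here is only a placeholder for whatever fusion the learned model uses.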
