Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection

07/28/2021
by   Michail Tsiaousis, et al.

The dominant paradigm in spatiotemporal action detection is to classify actions using spatiotemporal features learned by 2D or 3D Convolutional Networks. We argue that several actions are characterized by their context, such as relevant objects and actors present in the video. To this end, we introduce an architecture based on self-attention and Graph Convolutional Networks that models contextual cues, such as actor-actor and actor-object interactions, to improve human action detection in video. We aim to achieve this in a weakly-supervised setting, i.e. using as few action bounding-box annotations as possible. Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training. We evaluate how well our model highlights the relevant context by introducing a quantitative metric based on the recall of objects retrieved by attention maps. Our model relies on a 3D convolutional RGB stream and does not require expensive optical flow computation. We evaluate our models on the DALY dataset, which consists of human-object interaction actions. Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP. Code is available at <https://github.com/micts/acgcn>
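
To make the idea of attention-weighted graph convolution over actor and context features more concrete, here is a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). The function name `attention_graph_conv`, the scaled dot-product scoring, the feature dimensions, and the 7x7 context grid are all illustrative assumptions. Each actor node attends over context nodes, the row-normalised attention weights act as graph edges, and the same weights can be reshaped into a per-actor attention map.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_graph_conv(actors, context, weight):
    """One attention-weighted graph convolution step (illustrative only).

    actors:  (num_actors, d) actor feature vectors
    context: (num_context, d) context feature vectors, e.g. grid cells
    weight:  (d, d_out) projection matrix (random here, learned in practice)
    """
    d = actors.shape[1]
    # Scaled dot-product scores between every actor and every context node.
    scores = actors @ context.T / np.sqrt(d)
    # Row-normalise so each actor's edge weights sum to 1.
    adjacency = softmax(scores, axis=1)
    # Aggregate context features along the weighted edges, project, apply ReLU.
    updated = np.maximum(adjacency @ context @ weight, 0.0)
    return updated, adjacency


# Hypothetical sizes: 2 actors, a 7x7 grid of context cells, 8-dim features.
rng = np.random.default_rng(0)
actors = rng.normal(size=(2, 8))
context = rng.normal(size=(7 * 7, 8))
weight = rng.normal(size=(8, 16))

updated, adjacency = attention_graph_conv(actors, context, weight)
# Each actor's edge weights can be viewed as a spatial attention map.
attention_maps = adjacency.reshape(2, 7, 7)
```

In this sketch the adjacency matrix doubles as the visualisation: reshaping one actor's row back onto the spatial grid shows which context cells contributed most to that actor's updated representation.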

research
04/05/2018

Guess Where? Actor-Supervision for Spatiotemporal Action Localization

This paper addresses the problem of spatiotemporal localization of actio...
research
02/14/2019

Predicting Ergonomic Risks During Indoor Object Manipulation Using Spatiotemporal Convolutional Networks

Automated real-time prediction of the ergonomic risks of manipulating ob...
research
09/16/2019

Where are the Keys? -- Learning Object-Centric Navigation Policies on Semantic Maps with Graph Convolutional Networks

Emerging object-based SLAM algorithms can build a graph representation o...
research
12/30/2018

Actor Conditioned Attention Maps for Video Action Detection

Interactions with surrounding objects and people contain important infor...
research
10/07/2021

Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

We introduce the task of weakly supervised learning for detecting human ...
research
08/27/2022

Actor-identified Spatiotemporal Action Detection – Detecting Who Is Doing What in Videos

The success of deep learning on video Action Recognition (AR) has motiva...
research
09/02/2021

SlowFast Rolling-Unrolling LSTMs for Action Anticipation in Egocentric Videos

Action anticipation in egocentric videos is a difficult task due to the ...
