Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition

07/31/2018
by Swathikiran Sudhakaran, et al.

In this paper, we propose an end-to-end trainable deep neural network model for egocentric activity recognition. Our model is built on the observation that egocentric activities are largely characterized by the objects in the video and their locations. Based on this, we develop a spatial attention mechanism that enables the network to attend to regions containing objects that are correlated with the activity under consideration. We learn highly specialized attention maps for each frame using class-specific activations from a CNN pre-trained for generic image recognition, and use them for spatio-temporal encoding of the video with a convolutional LSTM. Our model is trained in a weakly supervised setting using only raw video-level activity-class labels. Nonetheless, on standard egocentric activity benchmarks, our model surpasses, by up to 6 percentage points in recognition accuracy, the best-performing prior method, which leverages strong supervision from hand segmentation and object location for training. We visually analyze the attention maps generated by the network, revealing that it successfully identifies the relevant objects present in the video frames, which may explain the strong recognition performance. We also present an extensive ablation analysis of the design choices.
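The attention step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a CNN whose classifier is a single linear layer applied after global average pooling, so that a class activation map can be formed as a weighted sum of the last convolutional feature channels. The function name `cam_attention`, the argument names, and the use of a spatial softmax to normalize the map are illustrative assumptions.

```python
import numpy as np


def cam_attention(features, fc_weights):
    """Hypothetical sketch of class-activation-based spatial attention.

    features:   (C, H, W) feature maps from the last conv layer of a
                pre-trained CNN (assumed input).
    fc_weights: (num_classes, C) weights of a linear classifier applied
                after global average pooling (assumed architecture).
    Returns the attention map (H, W), the attended features (C, H, W),
    and the index of the top-scoring class.
    """
    C, H, W = features.shape

    # Class scores from global average pooling followed by the linear layer.
    pooled = features.mean(axis=(1, 2))               # (C,)
    scores = fc_weights @ pooled                      # (num_classes,)
    top_class = int(np.argmax(scores))

    # Class activation map: channel-wise weighted sum for the top class.
    cam = np.tensordot(fc_weights[top_class], features, axes=([0], [0]))  # (H, W)

    # Spatial softmax turns the map into weights that sum to 1.
    flat = cam.reshape(-1)
    flat = flat - flat.max()                          # numerical stability
    attn = (np.exp(flat) / np.exp(flat).sum()).reshape(H, W)

    # Weight every channel by the same spatial attention map.
    attended = features * attn[None, :, :]
    return attn, attended, top_class


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.random((8, 7, 7))                     # toy feature maps
    weights = rng.standard_normal((5, 8))             # toy classifier weights
    attn, attended, cls = cam_attention(feats, weights)
    print(attn.shape, attended.shape, cls)
```

In a full pipeline, the attended features of each frame would be passed to a recurrent encoder such as a convolutional LSTM. That stage is omitted here.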

research
11/26/2018

LSTA: Long Short-Term Attention for Egocentric Action Recognition

Egocentric activity recognition is one of the most challenging tasks in ...
research
07/12/2021

Human-like Relational Models for Activity Recognition in Video

Video activity recognition by deep neural networks is impressive for man...
research
05/04/2018

Object and Text-guided Semantics for CNN-based Activity Recognition

Many previous methods have demonstrated the importance of considering se...
research
04/05/2022

Detector-Free Weakly Supervised Group Activity Recognition

Group activity recognition is the task of understanding the activity con...
research
02/21/2021

Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM

Automatically detecting violence from surveillance footage is a subset o...
research
02/21/2023

Weakly Supervised Temporal Convolutional Networks for Fine-grained Surgical Activity Recognition

Automatic recognition of fine-grained surgical activities, called steps,...
research
03/22/2017

R-C3D: Region Convolutional 3D Network for Temporal Activity Detection

We address the problem of activity detection in continuous, untrimmed vi...
