Recurrent Mixture Density Network for Spatiotemporal Visual Attention

03/27/2016
by Loris Bazzani, et al.

In many computer vision tasks, the information relevant to solving the problem at hand is mixed with irrelevant, distracting information. This has motivated researchers to design attentional models that can dynamically focus on the parts of images or videos that are salient, e.g., by down-weighting irrelevant pixels. In this work, we propose a spatiotemporal attentional model that learns where to look in a video directly from human fixation data. We model visual attention with a mixture of Gaussians at each frame. This distribution is used to express the probability of saliency for each pixel. Temporal consistency in videos is modeled hierarchically by: 1) deep 3D convolutional features that represent spatial and short-term temporal relations, and 2) a long short-term memory network on top that aggregates the clip-level representations of sequential clips and therefore expands the temporal domain from a few frames to seconds. The parameters of the proposed model are optimized via maximum likelihood estimation, using human fixations as training data and without knowledge of the action in each video. Our experiments on Hollywood2 show state-of-the-art performance on saliency prediction for video. We also show that our attentional model trained on Hollywood2 generalizes well to UCF101 and can be leveraged to improve action classification accuracy on both datasets.
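The training objective described above can be sketched as follows: the network predicts, per frame, the parameters of a 2-D Gaussian mixture over pixel coordinates, and the loss is the negative log-likelihood of the recorded human fixations under that mixture. This is a minimal NumPy sketch of that likelihood computation, assuming diagonal covariances; the function name `mdn_nll` and the array shapes are illustrative, not from the paper.

```python
import numpy as np

def mdn_nll(weights, means, sigmas, fixations):
    """Average negative log-likelihood of fixation points under a 2-D
    Gaussian mixture with diagonal covariance (illustrative sketch).

    weights:   (K,)   mixture coefficients, non-negative, summing to 1
    means:     (K, 2) component means in (x, y) pixel coordinates
    sigmas:    (K, 2) per-axis standard deviations
    fixations: (N, 2) observed human fixation coordinates
    """
    diff = fixations[:, None, :] - means[None, :, :]             # (N, K, 2)
    z = (diff / sigmas[None, :, :]) ** 2                          # squared standardized offsets
    # log of the 2-D diagonal-Gaussian normalization constant, per component
    log_norm = -np.log(2 * np.pi) - np.log(sigmas).sum(axis=1)    # (K,)
    log_comp = log_norm[None, :] - 0.5 * z.sum(axis=2)            # (N, K) component log-densities
    # log-sum-exp over components, weighted by the mixture coefficients
    log_mix = np.logaddexp.reduce(np.log(weights)[None, :] + log_comp, axis=1)
    return -log_mix.mean()
```

Minimizing this quantity with respect to the predicted `weights`, `means`, and `sigmas` pulls probability mass toward the fixated locations; in the paper's setting these parameters would be emitted per frame by the recurrent network rather than supplied directly.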

