Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM

09/19/2017
by Lai Jiang, et al.

Over the past few years, deep neural networks (DNNs) have exhibited great success in predicting the saliency of images. However, few works have applied DNNs to predicting the saliency of generic videos. In this paper, we propose a novel DNN-based video saliency prediction method. Specifically, we establish a large-scale eye-tracking database of videos (LEDOV), which provides sufficient data to train DNN models for predicting video saliency. Through statistical analysis of our LEDOV database, we find that human attention is normally attracted by objects, particularly moving objects or the moving parts of objects. Accordingly, we propose an object-to-motion convolutional neural network (OM-CNN) to learn spatio-temporal features for predicting intra-frame saliency by exploring the information of both objectness and object motion. We further find from our database that human attention is temporally correlated, with a smooth saliency transition across video frames. Therefore, we develop a two-layer convolutional long short-term memory (2C-LSTM) network in our DNN-based method, using the extracted features of OM-CNN as its input. Consequently, inter-frame saliency maps of videos can be generated, which account for the transition of attention across video frames. Finally, experimental results show that our method advances the state-of-the-art in video saliency prediction.
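The core of the 2C-LSTM component is the convolutional LSTM cell: an LSTM whose gates are computed by convolutions rather than dense products, so the hidden and cell states keep the spatial layout of the OM-CNN feature maps. The sketch below is a minimal, hypothetical single-channel convolutional LSTM cell in NumPy; the kernel size, initialization, and the two-cell stacking in the usage note are illustrative assumptions, not the authors' actual configuration.

```python
import numpy as np

def conv2d(x, k):
    # 'same'-padded single-channel 2-D convolution via explicit loops
    # (kept deliberately simple for the sketch; not optimized)
    H, W = x.shape
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvLSTMCell:
    """Hypothetical single-channel convolutional LSTM cell.

    Each gate (input i, forget f, output o, candidate g) is driven by
    one convolution over the current input and one over the previous
    hidden state, so spatial structure is preserved across time steps.
    """
    def __init__(self, ksize=3, rng=None):
        rng = rng or np.random.default_rng(0)
        # one input kernel and one hidden-state kernel per gate
        self.Wx = rng.normal(0.0, 0.1, (4, ksize, ksize))
        self.Wh = rng.normal(0.0, 0.1, (4, ksize, ksize))
        self.b = np.zeros(4)

    def step(self, x, h, c):
        pre = [conv2d(x, self.Wx[k]) + conv2d(h, self.Wh[k]) + self.b[k]
               for k in range(4)]
        i, f, o = sigmoid(pre[0]), sigmoid(pre[1]), sigmoid(pre[2])
        g = np.tanh(pre[3])
        c = f * c + i * g          # cell state: gated memory update
        h = o * np.tanh(c)         # hidden state keeps spatial layout
        return h, c
```

A two-layer stack in the spirit of 2C-LSTM would feed each frame's feature map through a first cell and pass its hidden state to a second cell, reading the final saliency map off the second cell's hidden state (here untrained, random-weight cells purely for shape illustration):

```python
cell1, cell2 = ConvLSTMCell(), ConvLSTMCell(rng=np.random.default_rng(1))
h1 = c1 = h2 = c2 = np.zeros((8, 8))
frames = np.random.default_rng(2).normal(size=(5, 8, 8))
for x in frames:                      # one step per video frame
    h1, c1 = cell1.step(x, h1, c1)
    h2, c2 = cell2.step(h1, h2, c2)   # second layer consumes first layer's h
```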


