Deep Learning for Saliency Prediction in Natural Video

The purpose of this paper is the detection of salient areas in natural video using deep learning techniques. Salient patches in video frames are predicted first, and the predicted visual fixation maps are then built upon them. We design the deep architecture on the basis of CaffeNet, implemented with the Caffe toolkit. We show that by changing the way data are selected for the optimisation of network parameters, we can reduce the computation cost by up to 12 times. We extend deep learning approaches for saliency prediction in still images with RGB values to the specificity of video by using the sensitivity of the human visual system to residual motion. Furthermore, we complement primary colour pixel values with contrast features proposed in classical visual attention prediction models. The experiments are conducted on two publicly available datasets. The first is the IRCCYN video database, containing 31 videos with an overall amount of 7300 frames and eye fixations of 37 subjects. The second is HOLLYWOOD2, which provides 2517 movie clips with the eye fixations of 19 subjects. On the IRCCYN dataset, the accuracy obtained is 89.51%. On the HOLLYWOOD2 dataset, the prediction of saliency of patches shows an improvement of up to 2%, with a resulting accuracy of 76.6%. The comparison of predicted saliency maps with visual fixation maps shows an increase of up to 16% on a sample of video clips from this dataset.




1 Introduction

Deep learning has emerged as a new field of research in machine learning, providing learning at multiple levels of abstraction for mining data such as images, sound and text. Although it is usually built hierarchically on the basis of neural networks, deep learning presents a philosophy for modelling the complex relationships between data New2, New18. Recently, deep learning has become a very active field that attracts many researchers. A first line of work aims to understand deep networks themselves (New13, New33, New34, New35, New37, New38, New39); an important question in building a deep convolutional network is, for instance, the optimization of the pooling layer New14. A second line of work uses deep networks in their original domains, such as object recognition New40, New9, New36, New42 and multi-task learning New40. As a definition, neural networks are generally multilayer generative networks formed to maximize the probability of input data with regard to target classes.

The predictive power of deep Convolutional Neural Networks (CNN) is of interest for the problem of predicting visual attention in visual content, i.e. the saliency of the latter. Indeed, several saliency models have been proposed in various fields of research, such as psychology and neurobiology, which are based on the feature integration theory (New19, New23, New24, New25, New26, New27, New28, New29, New30, New31, New32, New8, ...). This research models the so-called "bottom-up" saliency, with the theory suggesting that low-level visual characteristics such as luminance, color, orientation and movement provoke human gaze attraction New48. The "bottom-up" models have been extensively studied in the literature New48. They suffer from the insufficiency of low-level features in the feature integration theory framework, especially when the scene contains significant content and semantic objects. In this case, the so-called "top-down" attention New49 becomes prevalent: the human subject observes visual content progressively as the time spent looking at the visual sequence increases. Supervised machine learning techniques help in the detection of salient regions in images by predicting attractors on the basis of seen data New6. Various recent research is directed towards the creation of a basic deep learning model ensuring the detection of salient areas; we can cite here New4, New5 and New6. While a significant effort has already been made to build such models from still images, very few models have been built for saliency prediction in video content with supervised learning New52. Video has a supplementary dimension: the temporality expressed by apparent motion in the image plane.

In this paper, we present a new approach with Deep CNN that ensures the learning of salient areas in order to predict the saliency maps in videos. The paper is organized as follows. Section 2 describes the related work of different deep learning models used to detect salient areas in images or to classify images by content. Section 3 presents our proposed method for detection of salient regions with a deep learning approach. Pixel-wise computation of predicted visual attention/saliency maps is then introduced. In section 4 we present results and comparison with reference methods of the state-of-the-art. Section 5 concludes the paper and outlines the perspectives of this research.

2 Related work

Deep learning architectures recently proposed for the prediction of salient areas in images differ essentially by the number of convolution and pooling layers, by the input data, by the pooling strategies, by the nature of the final classifiers and the loss functions to optimize, but also by the formulation of the problem. The attempt to predict visual attention leads to a binary classification problem in which image areas are labelled as "salient" or "non-salient". It corresponds to a visual experiment with free instructions, in which the subjects are simply asked to look at the content. Shen proposed a deep learning model to extract salient areas in images. It allows, first, learning the relevant characteristics of the saliency of natural images and, second, predicting the eye fixations on objects with semantic content. The proposed model is formed by three layer sequences of "filtering" and "pooling", followed by a linear SVM classifier layer that ranks regions of the input image as "salient" or "non-salient". With filtering by sparse coding and max pooling, this model approximates human gaze fixations.

In Simonyan's work New5, the saliency of image pixels is defined with regard to a given class in an image taxonomy, as the relevance of the image for the class. The classification problem is therefore multi-class and can be expressed as a "task-dependent" visual experiment, in which the subjects are asked to look for an object of a given taxonomy in the images. The challenge of this research is the creation of the saliency map for each class using a deep CNN whose parameters are optimised by stochastic gradient descent New5. After a step of generating the map that maximizes the score of the specific class, the saliency map of each class is defined by the amplitude of the weight calculated from the convolution network with a single layer.

The learning model of salient areas proposed by Vig New6 tackles the prediction of the saliency of pixels for the human visual system (HVS) and corresponds to a free-viewing visual experiment. It comprises two phases. First, a random bank of uniform filters is used to generate multiple localized representations of input images. The second phase combines the different localized representations. The training step consists in randomly drawing, from the combined representation of each image, regions composed of ten pixels, and assigning a saliency class to each region by reference to the fixation density map. Feeding this set to an SVM classifier yields the learning model. The learning model of salient areas is thus composed of the SVM trained on the combination of feature maps obtained with different deep network architectures.

In our work we also seek to predict the saliency of image regions for the HVS. While in New5 only primary RGB pixel values are used for class-based saliency prediction, we use several combinations of primary (input) features such as residual motion and primary spatial features, inspired by the feature integration theory as in New15, New8, New16, New32. The sensitivity of the HVS to residual motion in dynamic visual scenes is used for saliency prediction in video New32. To train the deep CNN in our two-class classification problem, we use human fixation maps as in New6 to select positive and negative samples.

3 Prediction of visual saliency with deep CNN

We therefore design a deep CNN to classify regions in video frames into two classes: salient and non-salient. On the basis of these classifications, a visual fixation map is then predicted. Before describing the architecture of our proposed deep CNN, we introduce the definition of the saliency of regions and explain how we extract positive and negative examples for training the CNN.

3.1 Extraction of salient and non-salient patches.

We define a salient patch in a video frame on the basis of the interest expressed by the subjects. The latter is measured by the magnitude of a visual attention map built upon gaze fixations recorded during a psycho-visual experiment in free-viewing conditions. The maps are built by the method of Wooding New12. Such a map represents a multi-Gaussian surface normalized by its global maximum. To train the network, it is necessary to extract salient and non-salient patches from training video frames with available Wooding maps. A square patch of parametrized size $s$ is considered "salient" if the visual attention map value at its center is above a threshold. A patch $P$ is a vector in $\mathbb{R}^{s \times s \times k}$, where $k$ stands for the quantity of primary feature maps serving as input to the CNN. When the RGB planes of a colour video sequence are used, $k = 3$. The choice of the parameter $s$ obviously depends on the resolution of the video, but is also constrained by the computational capacity to process a huge amount of data. In this work we considered a single value of $s$ for SD video. More formally, a binary label is associated with the pixels of each patch using equation (1):

$$ l(P) = \begin{cases} 1 & \text{if } W(x_0, y_0) \geq \tau \\ 0 & \text{otherwise} \end{cases} \qquad (1) $$

with $W$ the normalized visual attention map, $\tau$ the threshold and $(x_0, y_0)$ the coordinates of the center of the patch. We select a set of thresholds, starting with the global maximum value of the normalized attention map and then relaxing the threshold as in equation (2). The relaxation is controlled by a relaxation parameter and by a bound that limits the relaxation of saliency; both were chosen experimentally.

In such a manner, salient patches are progressively selected up to non-salient areas, where non-salient patches are extracted randomly. The process of extraction of salient patches in the frames of training videos is illustrated in figure 1.
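The selection of salient patch centers can be sketched as follows. This is only a minimal illustration under stated assumptions, not the exact procedure of the paper: the geometric form of the threshold relaxation, the function names (`relaxed_thresholds`, `is_salient`, `salient_centres`) and all numeric values (`eps`, `n_steps`, `stride`, the toy attention map) are hypothetical.

```python
import numpy as np

def relaxed_thresholds(start, eps, n_steps):
    """Assumed relaxation: start, start*(1-eps), start*(1-eps)**2, ..."""
    return [start * (1.0 - eps) ** j for j in range(n_steps + 1)]

def is_salient(att_map, x0, y0, threshold):
    """Binary label of the patch centred at column x0, row y0."""
    return int(att_map[y0, x0] >= threshold)

def salient_centres(att_map, thresholds, stride):
    """Grid centres whose attention value passes the most relaxed threshold."""
    tau = min(thresholds)
    h, w = att_map.shape
    return [(x, y)
            for y in range(0, h, stride)
            for x in range(0, w, stride)
            if is_salient(att_map, x, y, tau)]

# Toy attention map: one Gaussian blob, normalized by its global maximum.
yy, xx = np.mgrid[0:64, 0:64]
att = np.exp(-((xx - 32) ** 2 + (yy - 32) ** 2) / (2 * 8.0 ** 2))
att /= att.max()

taus = relaxed_thresholds(att.max(), eps=0.1, n_steps=5)  # hypothetical values
centres = salient_centres(att, taus, stride=8)
```

Non-salient patches would then be drawn at random among the remaining positions, as described above.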

Tables 1 and 2 present groups of salient patches on the left and non-salient patches on the right; each row shows examples of patches taken from frames of the video sequences denoted "SRC" in the IRCCYN dataset and "actioncliptrain" in the HOLLYWOOD dataset.

Figure 1: Extraction of salient patches for training
Table 1: Training data from the IRCCYN database (columns: salient patch, non-salient patch)
Table 2: Training data from the HOLLYWOOD database (columns: salient patch, non-salient patch)

3.2 Primary feature maps for saliency prediction in video

In contrast to still natural images, where saliency is "spatial", based on color contrasts, saturation contrasts and intensity contrasts, the saliency of video is also based on the motion of objects with regard to the background. Therefore, in the following we present the primary motion features we consider and then briefly describe the spatial primary features (colours, contrasts) we use.

3.2.1 Motion feature maps

Visual attention is not attracted by motion in general, but by the difference between the global motion in the scene, expressing the camera work, and the "local" motion, that of a moving object. This difference is called the "residual motion" New28. To create the feature map of residual motion in videos, we used the model developed in New10, New28, New46. This model allows the calculation of the residual motion in three steps: the estimation of the optical flow $\vec{M}_c(x, y)$, the estimation of the global motion $\vec{M}_\theta(x, y)$ from the optical flow according to the first-order complete affine model and, finally, the computation of the residual motion according to equation (3):

$$ \vec{M}_r(x, y) = \vec{M}_c(x, y) - \vec{M}_\theta(x, y) \qquad (3) $$

The sensitivity of the HVS to motion is selective. Daly New17 proposes a non-linear model of sensitivity according to the speed of motion. In our work, we use a simplified version: the primary motion feature is the magnitude of the residual motion (3) at a given pixel, and we leave the decision on the saliency of the patch to the CNN classifier. For spatial primary features we resort to the work in New10, which yields coherent results according to our studies in New32.
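The three steps above can be illustrated with a short sketch. It assumes that a dense optical flow field is already available (its estimation is not shown) and fits the six-parameter affine model by ordinary least squares, which is a simplification: the estimator used in the cited works may be different, for instance a robust one. The function names are hypothetical.

```python
import numpy as np

def fit_affine_motion(flow):
    """Least-squares fit of a first-order (6-parameter) affine model.

    flow has shape (H, W, 2), with flow[..., 0] = dx and flow[..., 1] = dy.
    Returns the global motion field predicted by the fitted model.
    """
    h, w, _ = flow.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Design matrix [1, x, y]; one column of parameters per flow component.
    A = np.stack([np.ones(h * w), xx.ravel(), yy.ravel()], axis=1).astype(float)
    params, *_ = np.linalg.lstsq(A, flow.reshape(-1, 2), rcond=None)
    return (A @ params).reshape(h, w, 2)

def residual_motion_magnitude(flow):
    """Magnitude of (optical flow - global motion) at every pixel."""
    residual = flow - fit_affine_motion(flow)
    return np.linalg.norm(residual, axis=2)

# Toy example: camera pan plus zoom, with a small independently moving object.
yy, xx = np.mgrid[0:32, 0:32]
camera = np.stack([2.0 + 0.01 * xx, 0.01 * yy], axis=2)
scene = camera.copy()
scene[10:14, 10:14] += np.array([3.0, -1.0])
magnitude = residual_motion_magnitude(scene)
```

In this toy scene the residual magnitude is close to zero on the background and large on the moving object, which is the behaviour the primary motion feature relies on.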

3.2.2 Primary spatial features

The choice of features from New10 is conditioned by their relatively low computational cost and the good performance we reported in New32. The authors propose seven color contrast descriptors. As the 'Hue Saturation Intensity' (HSI) color space is more appropriate to describe the perception and interpretation of color by humans, the descriptors of spatial saliency are built in this color space. Five of these seven local descriptors depend on the value of the hue, saturation and/or intensity of the pixel. These values are determined for each frame of a video sequence from a saturation factor and an intensity factor, calculated using equations (4) and (5). These factors involve the saturation of the pixel under consideration and the saturation of each pixel adjacent to it. A constant sets the minimum value that protects the interaction of pixels when the saturation approaches zero New10. Contrast descriptors are calculated by equations (6)-(13):

1. color contrast: the first input to the saliency of a pixel is obtained from the two factors of saturation and intensity. This descriptor is calculated for each pixel and its eight connected neighbors in the frame, as in equation (6).

2. hue contrast: a difference of hue angle on the color wheel can produce a contrast. In other words, this descriptor is related to pixels whose hue value is far from that of their neighbors, up to the largest possible angle difference, see equation (7). The difference in color between the pixel and its neighbor is calculated according to equations (8) and (9).

3. contrast of opponents: colors located on opposite sides of the hue wheel create a very high contrast. An important difference in tone level makes the contrast between active and passive colors more salient. This contribution to the saliency of the pixel is defined by equation (10).

4. contrast of saturation: it occurs when regions of low and high color saturation are close. Highly saturated colors tend to attract visual attention, unless a low-saturation region is surrounded by a very saturated area. It is defined by equation (11), which uses the saturation difference between the pixel and its neighbor, see equation (12).

5. contrast of intensity: a contrast is visible when dark and bright colors coexist. Bright colors attract visual attention unless a dark region is completely surrounded by highly bright regions. The contrast of intensity is defined by equation (13), which uses the difference of intensity between the pixel and its neighbor.

6. dominance of warm colors: the warm colors (red, orange and yellow) are visually attractive. These colors remain visually appealing even when no contrast between warm and cold colors is observed in the surroundings. This feature is defined by equation (15).

7. dominance of brightness and saturation: highly bright, saturated colors are considered attractive regardless of their hue value. The feature is defined by equation (16).

The first five descriptors are normalized by the number of neighboring pixels. In New32, New17 it is reported that mixing a large quantity of different features increases the performance of prediction. This is why it is attractive to mix the primary features (1-7) with those used in previous work on saliency prediction New5, that is, the simple RGB planes of a video frame.
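Since all seven descriptors are built in the HSI color space, a conversion from RGB is the first step of any implementation. The sketch below uses the common textbook RGB-to-HSI formulas; it is an assumption that this matches the conversion used in New10, and the function name `rgb_to_hsi` is hypothetical.

```python
import numpy as np

def rgb_to_hsi(rgb):
    """Convert RGB values in [0, 1] (array of shape (..., 3)) to HSI.

    Returns hue in radians in [0, 2*pi), saturation and intensity.
    Hue is set to 0 for grey pixels, where it is undefined.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    intensity = (r + g + b) / 3.0
    minimum = np.minimum(np.minimum(r, g), b)
    saturation = np.where(intensity > 0,
                          1.0 - minimum / np.maximum(intensity, 1e-12),
                          0.0)
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt(np.maximum((r - g) ** 2 + (r - b) * (g - b), 0.0))
    theta = np.arccos(np.clip(num / np.maximum(den, 1e-12), -1.0, 1.0))
    hue = np.where(b > g, 2.0 * np.pi - theta, theta)
    hue = np.where(den > 0, hue, 0.0)
    return hue, saturation, intensity

# Pure red, green, blue and a mid grey.
pixels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [0.5, 0.5, 0.5]])
hue, sat, inten = rgb_to_hsi(pixels)
```

The contrast descriptors would then be computed from these three planes over the eight connected neighbors of each pixel.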

3.3 The network design

In this section we present the architecture of the deep CNN we designed for our two-class classification problem: prediction of the saliency of a patch in a given video frame. It includes five convolution layers, three pooling layers, five Rectified Linear Unit (RELU) layers, two normalisation layers, and one inner product layer followed by a loss layer, as illustrated in Figure 2. The final classification is ensured by a soft-max classifier, see equation (17). This function is a generalization of the logistic function that compresses a vector $z$ of arbitrary real values of dimension $d$ to a vector of the same size whose values lie in the range $(0, 1)$:

$$ \sigma(z)_j = \frac{e^{z_j}}{\sum_{i=1}^{d} e^{z_i}}, \quad j = 1, \dots, d \qquad (17) $$

Figure 3 shows the order of layers in our proposed network. The CNN architecture was implemented using the Caffe software New13.

Figure 2: Architecture and design of the deep saliency framework.

We built our network architecture from three patterns (see figure 3), with a normalisation step between consecutive patterns. Each pattern contains a cascade of linear/nonlinear operations (convolution, pooling, RELU). For the first pattern we chose a cascade different from that of the two following patterns: a convolution layer, then a pooling layer, followed by a RELU layer. In fact, applying the pooling operation before the RELU layer does not change the final result, because both layers compute a maximum function; however, it decreases the execution time of the prediction, as the pooling step reduces the number of nodes. In the following patterns, the two convolution layers stacked before the pooling layer ensure the development of more complex features that are more "expressive" before the destructive pooling operation.
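The claim that max-pooling and RELU can be swapped without changing the result is easy to check numerically. The following sketch uses a non-overlapping pooling window for simplicity; the helper names and the random feature map are purely illustrative.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(x, 0) element-wise."""
    return np.maximum(x, 0.0)

def max_pool(x, size):
    """Non-overlapping max-pooling of a 2-D map whose sides are multiples of size."""
    h, w = x.shape
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 8))

pool_then_relu = relu(max_pool(feature_map, 2))   # order used in the first pattern
relu_then_pool = max_pool(relu(feature_map), 2)   # the more usual order
```

Both orders give the same 4 x 4 output, but in the first one the RELU is applied to four times fewer values.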

Figure 3: Architecture of video saliency convolution network

In the following, we will describe the most crucial layers which are convolution, pooling and local response normalisation.

3.3.1 Convolution layers

In order to extract the most important information for further analysis or exploitation of image patches, convolution with a fixed number of filters, based on the natural functioning of the HVS, is needed. It is necessary to determine the size of the convolution kernel to be applied to each pixel of the input image to highlight areas of the image. Gaussian filters were used to create all of the feature maps of the convolution layer. The number of filters, in other words the number of kernels convolved with the input image, is the number of feature maps obtained. Three stages are conceptually necessary to create the convolution layer. The first is the convolution of the input image by linear filters. The second is the addition of a bias term. The third is the application of a nonlinear function (here we have used the rectified linear function $f(x) = \max(0, x)$). Generally, the equation of convolution can be written as (18):

$$ x_j^{l} = f\Big( \sum_{i \in M_j} x_i^{l-1} * w_{ij}^{l} + b_j^{l} \Big) \qquad (18) $$

with $x_j^{l}$ the activity of the unit $j$ in the layer $l$, $M_j$ a selection of the input feature maps, $b_j^{l}$ the additive bias of the unit $j$ in the feature maps of the layer $l$, and $w_{ij}^{l}$ the synaptic weights between the unit $j$ of the layer $l$ and the unit $i$ of the layer $l-1$.
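The three conceptual stages (linear filtering, bias, rectification) can be written as a naive forward pass. This is an illustrative sketch, not the Caffe implementation: it assumes that every output map uses all input maps, uses no padding and a stride of one, and computes a cross-correlation (no kernel flip), as is common in deep learning toolkits. The function name `conv_layer` is hypothetical.

```python
import numpy as np

def conv_layer(inputs, weights, biases):
    """Naive convolution layer followed by a rectified linear unit.

    inputs:  (C_in, H, W) input feature maps
    weights: (C_out, C_in, kh, kw) filters
    biases:  (C_out,) one additive bias per output map
    returns: (C_out, H - kh + 1, W - kw + 1) output feature maps
    """
    c_out, _, kh, kw = weights.shape
    _, h, w = inputs.shape
    out = np.zeros((c_out, h - kh + 1, w - kw + 1))
    for j in range(c_out):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                window = inputs[:, y:y + kh, x:x + kw]
                # Stage 1: linear filtering summed over input maps; stage 2: bias.
                out[j, y, x] = np.sum(window * weights[j]) + biases[j]
    # Stage 3: rectified linear function.
    return np.maximum(out, 0.0)

# Toy input: a 3-channel 5 x 5 patch and two 3 x 3 filters.
patch = np.ones((3, 5, 5))
filters = np.ones((2, 3, 3, 3))
maps = conv_layer(patch, filters, np.array([0.0, -30.0]))
```

The number of output maps equals the number of filters, as stated above.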

3.3.2 Pooling layers

To reduce the computational complexity for the upper layers and to provide a form of translation invariance, pooling summarizes the outputs of neighbouring groups of neurons in the same kernel map. The pooling operation reduces the size of each input feature map by keeping a single value for each pooling region. We use max-pooling, see equation (19):

$$ h(x, y) = \max_{(x', y') \in \mathcal{N}(x, y)} z(x', y') \qquad (19) $$

Here $z$ is the input feature map, $h$ the pooled output and $\mathcal{N}(x, y)$ denotes the neighbourhood of $(x, y)$.

3.3.3 Local response normalization layers

The LRN layer normalizes the values of feature maps computed by neurons with unbounded activations, in order to detect high-frequency characteristics with a high neuron response and to dampen responses that are uniformly large in a local area. The output computation is presented in equation (20). In it, the value of the feature map at given coordinates is normalized using sums taken over a neighbourhood of these coordinates of a given size, and two parameters regulate the normalisation strength.
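A possible form of such a normalisation is sketched below. It is an assumption, not the equation of the paper: each value is divided by $(1 + (\alpha / n) \sum z^2)^{\beta}$, where the sum runs over a square spatial window containing $n$ values (zero-padded at the borders). The function name and the default parameter values are illustrative.

```python
import numpy as np

def local_response_norm(fmap, size=3, alpha=1e-4, beta=0.75):
    """Within-map local response normalisation (assumed form).

    Each value is divided by (1 + alpha / size**2 * sum of squares)**beta,
    the sum being taken over a size x size window with zero padding.
    """
    h, w = fmap.shape
    pad = size // 2
    squares = np.pad(fmap ** 2, pad, mode="constant")
    out = np.empty((h, w), dtype=float)
    for y in range(h):
        for x in range(w):
            local_sum = squares[y:y + size, x:x + size].sum()
            out[y, x] = fmap[y, x] / (1.0 + alpha / (size * size) * local_sum) ** beta
    return out

# Toy check on a constant map with exaggerated parameters.
constant_map = np.full((5, 5), 2.0)
normalized = local_response_norm(constant_map, size=3, alpha=9.0, beta=1.0)
```

With these toy parameters, border values are damped less than central ones because the zero padding lowers the local sum of squares.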

3.4 Training and validation of the model

To solve the learning problem and to validate the network, with the purpose of generating a robust model of salient area recognition, the Caffe solver New13 iteratively optimizes the network parameters in a forward-backward loop. The optimisation method used is stochastic gradient descent. The parameterization of the solver requires setting the learning rate and the number of iterations for the training and testing steps.

The numbers of training and testing iterations are defined according to the "batch size" parameter of Caffe New13. The batch size is the number of images, that is, salient and non-salient patches in our case, processed in one iteration. This number depends on two parameters:

  • The power of the GPU/RAM of the used machine,

  • The number of patches available for each database.

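The number of iterations is computed according to equation (21):

$$ \text{iterations} = \frac{\text{epochs} \times \text{total number of images}}{\text{batch size}} \qquad (21) $$

Here the batch size is the number of images for each pass through the network, and the number of epochs is how many times the whole dataset is passed through the network.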
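As a small worked example of this relation, the helper below computes the number of solver iterations from the data set size, the batch size and the number of passes over the data. The function name, the rounding up to a whole iteration and the numbers used are assumptions made for illustration only.

```python
import math

def iterations_needed(num_images, batch_size, epochs):
    """Iterations required to pass the whole data set `epochs` times."""
    return math.ceil(epochs * num_images / batch_size)

# Hypothetical figures: 1000 patches, 250 patches per iteration, 3 passes.
example = iterations_needed(num_images=1000, batch_size=250, epochs=3)
```

Here 3 passes over 1000 patches with 250 patches per iteration require 12 iterations.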

It is interesting to visualize the purely spatial features computed by the designed CNN when the network is configured to predict saliency with primary RGB values only, as this is the configuration used by deep learning approaches to saliency prediction in general. As the feature integration theory states, the HVS is sensitive to orientations and contrasts. This is what we observe in the features going through the layers of the network. The outputs of the convolution layers (see figures 4, 5 and 6) yield more and more contrasted and structured patterns. In these figures, 'Conv2' and 'Conv22' (respectively 'Conv3' and 'Conv33') stand for consecutive convolution layers without pooling layers in between.

Figure 4: (a) Input patch, (b) the output of the first convolution layer and (c) the output of the first pooling layer.

Figure 5: The output of the second convolution layers 'Conv2' and 'Conv22'.

Figure 6: The output of the third convolution layers 'Conv3' and 'Conv33'.

3.5 Generation of a pixel-wise saliency map

The designed and trained deep CNN predicts whether a patch in a video frame is salient for a human observer. Although this problem is of interest for selecting important areas in images for further pattern recognition tasks, finer, pixel-wise saliency prediction in video requires transforming the sparse classifier responses into a dense predicted saliency map. The response for each patch is given by the soft-max classifier, see figure 2 and equation (17) in section 3.3. The value of the classifier, which is interpreted as the probability of belonging to the salient class, can be considered as the predicted saliency of a patch. A Gaussian is then centred on the patch center, with a peak value equal to this predicted saliency and a spread parameter chosen as half the size of the patch. Hence a sparse saliency map is predicted. In order to densify the map, we classify densely sampled patches with a half-patch overlap and then interpolate the obtained values. Examples of predicted saliency maps using RGB features only (3K model) and RGB and residual motion features (4K model), together with Wooding gaze-fixation maps and the popular saliency prediction models of Itti New8 (named "GBVS") and Harel New15 (named "SignatureSal"), are depicted in table 3. Visual evaluation of the maps shows that the proposed method yields maps more similar to the Wooding maps built on gaze fixations. Indeed, GBVS and SignatureSal are pixel-wise maps, while our maps are built upon salient patches. Further evaluation is presented in section 4.
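The construction of a dense map from patch-level responses can be sketched as follows. The sketch places one Gaussian per patch, with a peak equal to the classifier probability and a spread of half the patch size, on centers sampled with a half-patch overlap. Combining the Gaussians by a per-pixel maximum is an assumption standing in for the interpolation step, and the function names and toy values are hypothetical.

```python
import numpy as np

def half_overlap_centres(shape, patch_size):
    """Centres (x, y) of patches sampled with a half-patch overlap."""
    h, w = shape
    step = patch_size // 2
    return [(x, y)
            for y in range(step, h - step + 1, step)
            for x in range(step, w - step + 1, step)]

def saliency_map_from_patches(shape, centres, probabilities, patch_size):
    """Dense map: per-pixel maximum over Gaussians centred on the patches."""
    h, w = shape
    sigma = patch_size / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    sal = np.zeros(shape, dtype=float)
    for (cx, cy), p in zip(centres, probabilities):
        gauss = p * np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2.0 * sigma ** 2))
        sal = np.maximum(sal, gauss)
    return sal

# Toy example: two patches with classifier probabilities 0.9 and 0.2.
toy_map = saliency_map_from_patches((40, 40), [(10, 10), (30, 30)], [0.9, 0.2], 8)
grid = half_overlap_centres((16, 16), 8)
```

In practice the probabilities would be the soft-max outputs of the trained network for the patches centered on `grid`.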

Table 3: Different saliency maps of testing frames from videos of the IRCCYN database (columns: frame, Wooding, Deep3k, Deep4k, GBVS, SignatureSal).

4 Experiments and results

4.1 Datasets

To learn the model, we have used two different datasets: IRCCYN New43 and HOLLYWOOD New44, New45.

The IRCCYN database contains 31 SD videos and the gaze fixations of 37 subjects. From the overall set of 7300 frames, we have extracted salient and non-salient patches. Salient and non-salient patches were used at the training step and, likewise, at the testing step.

The HOLLYWOOD database contains training videos and videos for the validation step. The number of subjects with recorded gaze fixations varies from video to video, up to 19 subjects. The spatial resolution of the videos varies as well; the distribution of resolutions is presented in figures 7 and 8. In other words, the HOLLYWOOD dataset provides separate sets of frames for training and for testing. From the frames of the training set we have extracted salient and non-salient patches. During the testing phase, we have likewise used salient and non-salient patches.

Figure 7: Histogram of video resolutions of ”HOLLYWOOD” database in training step.
Figure 8: Histogram of video resolutions of ”HOLLYWOOD” database in testing step.

4.2 Evaluation of patches prediction with deep CNN

The network was trained on a powerful Tesla K40m graphics card and a multi-core processor. A sufficiently large number of patches could therefore be used per iteration, see the batch size parameter in equation (21). After a fixed number of training iterations, a model validation step is performed. At this stage the accuracy of the model at the current iteration is computed.

In a first experiment, to evaluate our deep network and to demonstrate the importance of adding the residual motion map, we have created two models with the same parameter settings and network architecture. The first one takes the R, G and B primary pixel values of patches as input; we denote it as the 3k model. The 4k model uses RGB and the normalized magnitude of residual motion as input data. Figures 9 and 10 illustrate the variations of accuracy along iterations for both models, 3k and 4k, on each database, "IRCCYN" and "HOLLYWOOD".

Figure 9: Accuracy vs iterations for both models, 3k and 4k, on the "IRCCYN" database.
Figure 10: Accuracy vs iterations for both models, 3k and 4k, on the "HOLLYWOOD" database.
Table 4: Accuracy results on the IRCCYN and HOLLYWOOD datasets in the first experiment (columns: 3k model, 4k model)

On the IRCCYN database, we found a higher accuracy with both models. The maximum accuracy values obtained on the IRCCYN dataset with the 3k and 4k models, and the iterations at which they are reached, are reported in table 4. The absence of improvement with the 4k model can be explained by the low number of videos in the IRCCYN dataset New50.

For the HOLLYWOOD database, adding the residual motion map improves the accuracy of the 4k model compared to the 3k model. The accuracy of our proposed network along a fixed number of iterations shows the interest of adding residual motion as a new feature together with the spatial feature maps R, G and B. Nevertheless, most of the accuracy is obtained with purely spatial features (RGB). This is why, in the second experiment below, we add spatial contrast features which have been proposed in the classical visual saliency prediction framework New10.

The second experiment for saliency prediction is conducted with a limit on the maximal number of iterations, to avoid overfitting. Instead of increasing the number of training iterations with a limited number of data samples before each validation iteration, as is the case in the work of New33, we pass the whole training set before each validation of the parameters and limit the maximal number of iterations in the whole training process. This drastically decreases the training complexity (approximately 12 times) without loss of accuracy (see tables 4 and 5 for the 3k and 4k models). In order to evaluate the performance of contrast features in a deep learning spatio-temporal model, we test the "8K" model first. Its input layers are composed of the contrast features described in section 3.2.2 and of the residual motion map. The results are presented in table 5 and illustrated in figure 11. It can be seen that contrast features combined only with motion yield poorer performance than the 3K and 4K models. Therefore, we keep primary colour information in the further HSV8K and RGB8K models.

Figure 11: Second experiment: learning of contrast features. Accuracy vs iterations of the 3k, 4k, 8k, RGB8k and HSV8k models on the "HOLLYWOOD" database.
Table 5: Accuracy results on the HOLLYWOOD dataset during the second experiment (columns: 3k model, 4k model, 8k model, RGB8k model, HSV8k model)

4.3 Evaluation of predicted visual saliency maps

In the literature, various evaluation criteria have been used to determine the level of similarity between visual attention maps and the gaze fixations of subjects, such as the normalized scanpath saliency (NSS), the Pearson Correlation Coefficient (PCC), and the area under the ROC curve (AUC) New20, New21. The area under the ROC curve measures the precision and accuracy of a system whose goal is to categorize entities into two distinct groups based on their features. The image pixels may belong either to the category of pixels fixated by subjects or to the category of pixels that are not fixated by any subject. The larger the area, the more the curve deviates from the line of the random classifier (area 0.5) and approaches the ideal bend of the perfect classifier (area 1). A value of AUC close to 1 indicates a correspondence between the predicted saliency and the eye positions, while a value close to 0.5 indicates a random generation of the salient areas by the model computing the saliency maps; in that case the objective and subjective saliency differ strongly. In our work, visual saliency being predicted by a deep CNN classifier, we have computed the hybrid AUC metric between predicted saliency maps and gaze fixations as in New53. The results of the experiments are presented in tables 6 and 7 below on an arbitrarily chosen subset of 12 videos from the HOLLYWOOD dataset. The figures in the tables correspond to the maximum value obtained during training and validation (as presented in tables 4 and 5). For the first experiment the maximal number of iterations was set to 174000, and for the second experiment this number was fixed 10 times lower. From table 6 it can be stated that: i) adding primary motion features, such as residual motion, improves the quality of predicted visual attention maps whatever the training of the network; the improvement is systematic and is largest for clipTest105 (in the first experiment); ii) the way of training the network that we propose, with a lower number of iterations and all training data used, does not strongly affect the performance. Indeed, with the 4k model the results are better for almost all clips, see the highlighted figures in table 6. In table 7 we compare all our predicted saliency models with gaze fixations. It turns out that more complex models yield better results, with the largest improvement in clipTest250. In terms of the quality of patch prediction (see tables 4 and 5 and figure 11), DeepRGB8K outperforms DeepHSV8k. Therefore, for comparison with reference models from the state of the art, GBVS, SignatureSal and the spatio-temporal model by Seo New31, named "Seo", we use the DeepRGB8K model, see table 8 below.
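To make the metric concrete, the sketch below computes a plain AUC between a saliency map and a binary map of fixated pixels, using the rank-based (Mann-Whitney) formulation with ties counted as one half. It is not the hybrid AUC of New53, whose sampling of non-fixated pixels is not reproduced here; the function name and toy data are hypothetical.

```python
import numpy as np

def auc_score(saliency, fixated):
    """Probability that a fixated pixel has a higher saliency than a non-fixated one."""
    sal = np.asarray(saliency, dtype=float).ravel()
    fix = np.asarray(fixated, dtype=bool).ravel()
    pos = sal[fix]
    neg = np.sort(sal[~fix])
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one fixated and one non-fixated pixel")
    below = np.searchsorted(neg, pos, side="left")       # negatives strictly lower
    not_above = np.searchsorted(neg, pos, side="right")  # negatives lower or equal
    ties = not_above - below
    return float((below + 0.5 * ties).sum() / (pos.size * neg.size))

# Toy example: the two bottom pixels are fixated.
fixations = np.array([[0, 0], [1, 1]])
good_map = np.array([[0.1, 0.2], [0.8, 0.9]])
perfect = auc_score(good_map, fixations)
inverted = auc_score(1.0 - good_map, fixations)
chance = auc_score(np.full((2, 2), 0.5), fixations)
```

A map that ranks every fixated pixel above every non-fixated one gives 1, the inverted map gives 0, and a constant map gives 0.5, the value of the random classifier.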

VideoName | First experiment: Gaze-fix vs Deep3k | First experiment: Gaze-fix vs Deep4k | Second experiment: Gaze-fix vs Deep3k | Second experiment: Gaze-fix vs Deep4k
Table 6: Comparison, with the AUC metric, of the two experiments for the 3k and 4k saliency models vs gaze fixations ’Gaze-fix’ on a subset of the HOLLYWOOD dataset
VideoName | Gaze-fix vs Deep3k | Gaze-fix vs Deep4k | Gaze-fix vs Deep8k | Gaze-fix vs DeepRGB8k | Gaze-fix vs DeepHSV8k
Table 7: Comparison, with the AUC metric, of gaze fixations ’Gaze-fix’ vs the Deep saliency models (’3k’, ’4k’, ’8k’, ’RGB8k’ and ’HSV8k’) for the videos from the HOLLYWOOD dataset
VideoName | Gaze-fix vs GBVS | Gaze-fix vs SignatureSal | Gaze-fix vs Seo | Gaze-fix vs DeepRGB8k
Table 8: Comparison, with the AUC metric, of gaze fixations ’Gaze-fix’ vs predicted saliency (’GBVS’, ’SignatureSal’ and ’Seo’) and our DeepRGB8k for the videos from the HOLLYWOOD dataset

The proposed DeepRGB8k saliency model wins more systematically (6 of 12 clips) than any single reference model.

4.4 Discussion

Visual saliency prediction with deep CNNs is still a recent but intensive research topic. Its major bottleneck is the computational power and memory required. We have shown that a very large number of iterations, in the hundreds of thousands, is not needed for the prediction of interesting patches in video frames. Indeed, a smaller number of iterations suffices to reach a better maximal accuracy, and the maximal number of iterations can be limited (17400 in our case) when combined with another data selection strategy: all data from the training set are passed before each validation iteration of the learning, see tables 4 and 5. Next, we have shown that, given a sufficient training set, adding primary motion features improves the average prediction accuracy on a very large data set (HOLLYWOOD test). Hence the deep CNN captures the sensitivity of the Human Visual System to motion.
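The data selection strategy described above, a full pass over the training set before each validation, can be illustrated with a small scheduling sketch. It is only an assumption-laden illustration: the function name `validation_schedule`, the training-set size, the batch size and the iteration cap below are hypothetical and are not the values used in the experiments.

```python
import math


def validation_schedule(n_train, batch_size, max_iterations):
    """Iterations at which validation runs when a full training pass precedes each one."""
    if n_train <= 0 or batch_size <= 0 or max_iterations <= 0:
        raise ValueError("all arguments must be positive")
    # Number of mini-batch iterations needed to see every training sample once.
    iters_per_pass = math.ceil(n_train / batch_size)
    # Validate after each complete pass, up to the iteration cap.
    return list(range(iters_per_pass, max_iterations + 1, iters_per_pass))


# Hypothetical sizes: 1000 training patches, batches of 256, at most 10 iterations.
print(validation_schedule(1000, 256, 10))
```

In a Caffe solver definition, such a schedule would roughly correspond to setting `test_interval` to the number of iterations per pass and `max_iter` to the cap; whether the experiments were configured this way is not stated in the text.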

When applying a supervised learning approach to visual saliency prediction in video, one has to keep in mind that gaze-fixation maps, which serve for the selection of training ”salient” regions in video frames, do not only express ”bottom-up” attention. Humans are attracted by stimuli, but in the case of video, as they come to understand a visual scene over time, they focus on the objects of interest, thus reinforcing the ”top-down” mechanisms of visual attention New52 . Hence, when predicting patches of interest by supervised learning, we mix all mechanisms: bottom-up and top-down.

In order to reinforce the bottom-up sensitivity of the HVS to contrasts, we completed the input data layers with specific contrast features well studied in classical saliency prediction models. As we could not observe an improvement, on average, in the prediction of the saliency of patches in video frames (see table 5), a more detailed clip-by-clip experiment was performed on a sample of clips from the HOLLYWOOD dataset, comparing the resulting predicted saliency maps. This series of experiments, summarised in table 9, shows that adding features expressing local colour contrast indeed slightly improves performance with regard to the reference bottom-up spatial (GBVS, SignatureSal) and spatio-temporal (Seo) models. The mean improvements of the complete model, with motion, contrast features and primary HSV colour pixel values, with regard to the Itti, Harel and Seo models are reported in table 9.

(DeepRGB8k - GBVS) (DeepRGB8k - SignatureSal) (DeepRGB8k - Seo)
Table 9: The mean improvement of the complete model.
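The two summary statistics used in this comparison, the mean per-clip AUC difference of table 9 and the number of clips on which a model scores best, can be sketched as follows. This is a minimal illustration: the helper names `mean_improvement` and `count_wins` are hypothetical, and the AUC values are made up for the example and are not results from the paper.

```python
import numpy as np


def mean_improvement(ours, reference):
    """Mean per-clip difference (ours - reference) of AUC scores."""
    ours = np.asarray(ours, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if ours.shape != reference.shape:
        raise ValueError("both models must be scored on the same clips")
    return float(np.mean(ours - reference))


def count_wins(scores_by_model):
    """Number of clips on which each model has the highest AUC.

    Ties are credited to the model listed first.
    """
    names = list(scores_by_model)
    table = np.array([scores_by_model[name] for name in names], dtype=float)
    winners = table.argmax(axis=0)  # index of the best model for each clip
    return {name: int(np.sum(winners == i)) for i, name in enumerate(names)}


# Made-up AUC values for three clips (illustrative only, not from the paper).
scores = {
    "DeepRGB8k": [0.60, 0.55, 0.70],
    "GBVS": [0.58, 0.57, 0.65],
    "Seo": [0.50, 0.52, 0.71],
}
print(mean_improvement(scores["DeepRGB8k"], scores["GBVS"]))
print(count_wins(scores))
```

A positive mean difference indicates that the first model is better on average, although it can still lose on individual clips, which is why the per-clip win count is reported alongside it.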

5 Conclusion

In this paper, we proposed a deep convolutional network to predict salient areas (patches) in video content and built dense predicted visual saliency maps upon them. We built an adequate architecture on the basis of the Caffe CNN. While the community has aimed to use only primary features, such as RGB planes, for visual attention prediction in images, we have shown that for video it is important to add features expressing the sensitivity of the human visual system to residual motion. Furthermore, we completed the RGB pixel values with low-level contrast and colour features, which are easy to compute and have proven efficient in earlier spatio-temporal predictors of visual attention. The results are better; nevertheless, the gain is not strong. It is therefore clear that further research should better explore the link between known physiological mechanisms of human vision and the design of a CNN. In particular, the central bias hypothesis needs to be explored.



  • (1) L. Deng, D. Yu, DEEP LEARNING: Methods and Applications, Tech. Rep. MSR-TR-2014-21 (May 2014).
  • (2) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
  • (3) J. Bruna, S. Mallat, Invariant Scattering Convolution Networks, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1872–1886.
  • (4) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional Architecture for Fast Feature Embedding, arXiv preprint arXiv:1408.5093.
  • (5) A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, in: F. Pereira, C. Burges, L. Bottou, K. Weinberger (Eds.), Advances in Neural Information Processing Systems 25.
  • (6) Y. Bengio, A. Courville, P. Vincent, Representation Learning: A Review and New Perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8) (2013) 1798–1828.
  • (7) X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Vol. 9, 2010.
  • (8) H. Schulz, A. Muller, S. Behnke, Exploiting Local Structure in Boltzmann Machines, Neurocomputing, special issue on ESANN, Elsevier.
  • (9) D. Scherer, S. Behnke, Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors, Large Scale Machine Learning: Parallelism and Massive Datasets.
  • (10) D. Scherer, A. Muller, S. Behnke, Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition, 2010.
  • (11) S. Avila, N. Thome, M. Cord, E. Valle, A. de A. Araújo, Pooling in image representation: The visual codeword point of view, Computer Vision and Image Understanding 117 (5) (2013) 453–465.
  • (12) R. Uetz, S. Behnke, Large-scale Object Recognition with CUDA-accelerated Hierarchical Neural Networks, 2009.
  • (13) N. Pinto, D. D. Cox, Beyond Simple Features: A Large-Scale Feature Search Approach to Unconstrained Face Recognition, in: IEEE Automatic Face and Gesture Recognition, 2011.
  • (14) K. Simonyan, A. Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos, in: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014, pp. 568–576.
  • (15) J. Cheng, H. Liu, F. Wang, H. Li, C. Zhu, Silhouette Analysis for Human Action Recognition Based on Supervised Temporal t-SNE and Incremental Learning, Image Processing, IEEE Transactions on 24 (10) (2015) 3203–3217.
  • (16) A. M. Treisman, G. Gelade, A Feature-Integration Theory of Attention, Cognitive Psychology 12 (1) (1980) 97–136.
  • (17) N. Bruce, J. Tsotsos, Saliency based on information maximization., In Advances in Neural Information Processing Systems (18) (2006) 155–162.
  • (18) D. Gao, V. Mahadevan, N. Vasconcelos, On the plausibility of the discriminant center-surround hypothesis for visual saliency., Journal of Vision (8(7):13) (2008) 1–18.
  • (19) D. Gao, N. Vasconcelos, Integrated learning of saliency, complex features, and object detectors from cluttered scenes., IEEE Conference on Computer Vision and Pattern Recognition. (17).
  • (20) X. Hou, L. Zhang, Dynamic visual attention: Searching for coding length increments., In Advances in Neural Information Processing Systems, (21) (2008) 681–688.
  • (21) C. Kanan, M. Tong, L. Zhang, G. Cottrell, SUN: Top-down saliency using natural statistics. , Visual Cognition (17) (2009) 979–1003.
  • (22) S. Marat, T. Phuoc, L. Granjon, N. Guyader, D. Pellerin, A. Guerin-Dugue, Modelling spatiotemporal saliency to predict gaze direction for short videos., International Journal of Computer Vision (82) (2009) 231–243.
  • (23) O. Le Meur, P. Le Callet, D. Barba, Predicting visual fixations on video based on low-level visual features., Vision Research (47) (2007) 2483–2498.
  • (24) A. Torralba, R. Fergus, W. Freeman, 80 million tiny images: A large dataset for non-parametric object and scene recognition., IEEE Transactions on Pattern Analysis and Machine Intelligence (30) (2008) 1958–1970.
  • (25) H. J. Seo, P. Milanfar, Static and space-time visual saliency detection by self-resemblance, Journal of Vision (9(12):15) (2009) 1–27.
  • (26) H. Boujut, R. Mégret, J. Benois-Pineau, Fusion of Multiple Visual Cues for Visual Saliency Extraction from Wearable Camera Settings with Strong Motion, in: Computer Vision - ECCV 2012. Workshops and Demonstrations - Florence, Italy, October 7-13, 2012, Proceedings, Part III, 2012, pp. 436–445.
  • (27) L. Itti, C. Koch, E. Niebur, A Model of Saliency-Based Visual Attention for Rapid Scene Analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11) (1998) 1254–1259.
  • (28) A. Borji, L. Itti, State-of-the-art in Visual Attention Modeling, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1) (2013) 185–207.
  • (29) Y. Pinto, A. R. van der Leij, I. G. Sligte, V. F. Lamme, H. S. Scholte, Bottom-up and top-down attention are independent, Journal of Vision 13 (3) (2013) 16.
  • (30) E. Vig, M. Dorr, D. Cox, Large-Scale Optimization of Hierarchical Features for Saliency Prediction in Natural Images , IEEE Computer Vision and Pattern Recognition (CVPR).
  • (31) C. Shen, Q. Zhao, Learning to Predict Eye Fixations for Semantic Contents Using Multi-layer Sparse Network, Neurocomputing 138 (2014) 61–68.
  • (32) K. Simonyan, A. Vedaldi, A. Zisserman, Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, CoRR abs/1312.6034.
  • (33) J. Han, L. Sun, X. Hu, J. Han, L. Shao, Spatial and temporal visual attention prediction in videos using eye movement data, Neurocomputing 145 (2014) 140–153.
  • (34) J. Harel, C. Koch, P. Perona, Graph-Based Visual Saliency, in: Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, 2006, pp. 545–552.
  • (35) A. M. Treisman, G. Gelade, A feature-integration theory of attention, Cognitive Psychology 12 (1) (1980) 97–136.
  • (36) D. S. Wooding, Eye movements of large populations: II. Deriving regions of interest, coverage, and similarity using fixation maps, Behavior Research Methods, Instruments, & Computers 34 (4) (2002) 518–528.
  • (37) O. Brouard, V. Ricordel, D. Barba, Cartes de saillance spatio-temporelle basées contrastes de couleur et mouvement relatif, in: Compression et représentation des signaux audiovisuels, CORESA 2009, Toulouse, France, 2009.
  • (38) I. González-Díaz, J. Benois-Pineau, V. Buso, H. Boujut, Fusion of Multiple Visual Cues for Object Recognition in Videos (2014) 79–107.
  • (39) S. J. Daly, Engineering observations from spatiovelocity and spatiotemporal visual models (1998).
  • (40) F. Boulos, W. Chen, B. Parrein, P. Le Callet, Region-of-Interest Intra Prediction for H.264/AVC Error Resilience, in: IEEE International Conference on Image Processing, Cairo, Egypt, 2009, pp. 3109–3112.
  • (41) M. Marszałek, I. Laptev, C. Schmid, Actions in Context, in: IEEE Conference on Computer Vision & Pattern Recognition, 2009.
  • (42) S. Mathe, C. Sminchisescu, Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 37.
  • (43) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929–1958.
  • (44) S. Marat, Modèles de saillance visuelle par fusion d’informations sur la luminance, le mouvement et les visages pour la prédiction de mouvements oculaires lors de l’exploration de vidéos, Ph.D. thesis, Université de Grenoble (Feb. 2010).
  • (45) U. Engelke, H. Liu, J. Wang, P. Le Callet, I. Heynderickx, H.-J. Zepernick, A. Maeder, Comparative study of fixation density maps, IEEE Trans. Image Processing 22 (3).
  • (46) O. Le Meur, T. Baccino, Methods for comparing scanpaths and saliency maps: strengths and weaknesses, Behavior Research Methods 45 (1) 251–266.