The model of "Predicting Video Saliency using Object-to-Motion CNN and Two-layer Convolutional LSTM"
Over the past few years, deep neural networks (DNNs) have exhibited great success in predicting the saliency of images. However, there are few works that apply DNNs to predict the saliency of generic videos. In this paper, we propose a novel DNN-based video saliency prediction method. Specifically, we establish a large-scale eye-tracking database of videos (LEDOV), which provides sufficient data to train the DNN models for predicting video saliency. Through the statistical analysis of our LEDOV database, we find that human attention is normally attracted by objects, particularly moving objects or the moving parts of objects. Accordingly, we propose an object-to-motion convolutional neural network (OM-CNN) to learn spatio-temporal features for predicting the intra-frame saliency via exploring the information of both objectness and object motion. We further find from our database that there exists a temporal correlation of human attention with a smooth saliency transition across video frames. Therefore, we develop a two-layer convolutional long short-term memory (2C-LSTM) network in our DNN-based method, using the extracted features of OM-CNN as the input. Consequently, the inter-frame saliency maps of videos can be generated, which consider the transition of attention across video frames. Finally, the experimental results show that our method advances the state-of-the-art in video saliency prediction.
The large-scale eye-tracking database called LEDOV for video saliency
The foveation mechanism in the Human Visual System (HVS) indicates that only a small fovea region captures most visual attention at high resolution, while the peripheral regions receive little attention at low resolution. To predict human attention, saliency detection has been widely studied in recent years, with multiple applications in object recognition, object segmentation, action recognition, image captioning, image/video compression, etc. In this paper, we focus on predicting video saliency at the pixel level, which models attention on each video frame.
In the early years, the traditional methods of video saliency prediction mainly followed the integration theory [3, 4, 5, 6, 7], i.e., the saliency of video frames is detected in two steps: (1) extract spatial and temporal features from videos to obtain conspicuity maps; (2) apply a fusion strategy to combine the conspicuity maps of different feature channels into saliency maps. Benefitting from the state-of-the-art image saliency prediction, a great number of spatial features have been incorporated to predict video saliency [8, 9, 10]. Additionally, some works focused on designing temporal features for video saliency prediction, mainly in three aspects: motion based features [11, 6, 12], temporal contrast features [3, 13, 4] and compressed domain features [14, 15]. For the fusion step, the existing strategies include linear or non-linear combination [8, 7, 15], probabilistic models [16, 13, 17] and phase spectrum analysis [18, 19].
Different from the integration theory, deep neural network (DNN) based methods have recently been proposed to learn human attention in an end-to-end manner, significantly boosting the accuracy of image saliency prediction [20, 21, 22, 23, 24, 25, 26]. However, only a few works managed to apply DNNs in video saliency prediction [27, 28, 29]. Specifically, Cagdas et al. applied a two-stream CNN structure taking both RGB frames and motion maps as the inputs for video saliency prediction. Bazzani et al. leveraged a deep Convolutional 3D (C3D) network to learn the representations of human attention on 16 consecutive frames, and then a Long Short-Term Memory (LSTM) network connected with a mixture density network was learned to generate saliency maps in the form of Gaussian mixture distributions. However, the above DNN based methods for video saliency prediction (none of which makes its code available online; in contrast, our code is accessible at https://github.com/remega/OMCNN_2CLSTM) are still in their infancy, due to the following drawbacks: (1) insufficient eye-tracking data for training DNNs; (2) lack of a sophisticated network architecture that simultaneously learns to combine the information of object and motion; (3) neglect of the dynamic pixel-wise transition of video saliency across video frames.
To avoid the above drawbacks, this paper proposes a new DNN based method to predict video saliency with spatio-temporal representation and dynamic saliency modeling, benefitting from the analysis of our eye-tracking database. Specifically, we first establish a large-scale video database, which contains the eye-tracking data of 32 subjects viewing 538 diverse-content videos. Through analysis of our database, we find that human attention is normally attracted by objects in a video, especially by moving objects or the moving parts of objects. In light of this finding, a novel Object-to-Motion Convolutional Neural Network (OM-CNN) is constructed to learn spatio-temporal features for video saliency prediction, considering both objectness and object motion information. For dynamic saliency modeling, a Two-layer Convolutional LSTM (2C-LSTM) network is developed to predict the pixel-wise transition of video saliency across frames, with the spatio-temporal features from OM-CNN as input. Different from the conventional LSTM network, this structure is capable of preserving spatial information through its convolutional connections.
To summarize, the main contributions of our work are listed in the following:
We establish an eye-tracking database consisting of 538 videos with diverse content, together with a thorough analysis and findings on our database.
We propose the novel OM-CNN structure to predict intra-frame saliency, which integrates both objectness and object motion in a unified deep structure.
We develop the 2C-LSTM network with Bayesian dropout to learn the inter-frame saliency transition at the pixel-wise level.
The rest of this paper is organized as follows. In Section II, we briefly review the related works and eye-tracking databases for video saliency prediction. In Section III, we establish and analyze our large-scale eye-tracking database. According to the findings on our database, we propose a DNN for video saliency prediction in Section IV, including both OM-CNN and 2C-LSTM. Section V presents the experimental results that validate the performance of our method. Section VI concludes this paper.
In this section, we briefly review the recent works and eye-tracking databases for video saliency prediction.
The traditional methods of video saliency prediction rely on the integration theory, consisting of two main steps: feature extraction and feature fusion. In the task of image saliency prediction, many effective spatial features succeed in predicting human attention with either a top-down [8, 30] or bottom-up [31, 9] strategy. However, video saliency prediction is more challenging, because temporal features also play an important role in drawing human attention. To this end, motion based features [11, 6, 12], temporal difference [3, 13, 4] and compressed domain methods [32, 15] are widely used in the existing works of video saliency prediction. Taking motion as an additional temporal feature, Zhong et al. proposed to predict video saliency using modified optical flow with a restriction of dynamic consistency. Similarly, Zhou et al. extended the motion features by computing center motion, foreground motion, velocity motion and acceleration motion in their saliency prediction method. In addition to motion, other methods [3, 13, 4] make use of the temporal changes in videos for saliency prediction, by computing the contrast between successive frames. For example, Ren et al. proposed to estimate the temporal difference of each patch by finding the minimal reconstruction error of sparse representation over the co-located patches of neighboring frames. Similarly, in other work, the temporal difference is obtained by applying pre-designed exponential filters to the spatial features of successive frames. Taking advantage of sophisticated video coding standards, compressed domain features are also explored as spatio-temporal features for video saliency prediction [32, 15].
In addition to feature extraction, many works focus on the fusion strategy to generate video saliency maps. Specifically, a set of probability models were constructed to calculate the posterior/prior beliefs, the joint probability distribution of features, or candidate transition probabilities in predicting video saliency. Similarly, Li et al. developed a probabilistic multi-task learning method to incorporate the task-related prior in video saliency prediction. Besides, other machine learning algorithms, such as SVM and neural networks, were also applied for linearly or non-linearly combining the saliency related features. Other advanced methods [18, 19] apply phase spectrum analysis in the fusion model to bridge the gap between features and video saliency. For instance, Guo et al. applied the Phase spectrum of Quaternion Fourier Transform (PQFT) on four feature channels (two color channels, one intensity channel, and one motion channel) to predict video saliency.
Most recently, DNNs have succeeded in many computer vision tasks, such as image classification, action recognition and object detection. In the field of saliency prediction, DNNs have also been successfully incorporated to automatically learn spatial features for predicting the saliency of images [21, 22, 23, 24, 25, 26, 37]. Specifically, as one of the pioneering works, DeepFix proposed a DNN based structure built on VGG-16 and inception modules to learn multi-scale semantic representations for saliency prediction. In DeepFix, a dilated convolutional structure was developed to extend the receptive field, and then a location biased convolutional layer was proposed to learn the center-bias pattern for saliency prediction. Similarly, SALICON was also proposed to fine-tune existing object recognition DNNs, and developed an efficient loss function for training the DNN model in saliency prediction. Later, some advanced DNN methods [23, 24, 25, 37] were proposed to further improve the performance of image saliency prediction.
However, only a few works managed to apply DNNs in video saliency prediction [39, 27, 28, 29, 40]. In these DNNs, the dynamic characteristics were explored in two ways: adding temporal information in CNN structures [39, 27, 29] or developing dynamic structures with LSTM [28, 40]. For adding temporal information, a four-layer CNN and a two-stream CNN were trained, respectively, with both RGB frames and motion maps as the inputs. Similarly, in another work, the pair of video frames concatenated with a static saliency map (generated by a static CNN) is input to a dynamic CNN for video saliency prediction, allowing the CNN to generalize more temporal features through the representation learning of the DNN. Instead, we find that human attention is more likely to be attracted by moving objects or the moving parts of objects. As such, to explore the semantic temporal features for video saliency prediction, the motion subnet in our OM-CNN is trained under the guidance of the objectness subnet.
For developing dynamic structures, Bazzani et al. and Liu et al. applied LSTM networks to predict human attention, relying on both short- and long-term memory. However, the fully connected layers in LSTM limit the dimensions of both input and output, so an end-to-end saliency map cannot be obtained. As such, strong prior knowledge needs to be assumed for the distribution of saliency. To be more specific, in the former work, human attention is assumed to be distributed as a Gaussian Mixture Model (GMM), and the LSTM is constructed to learn the parameters of the GMM. Similarly, the latter work focuses on predicting the saliency of conference videos and assumes that the saliency in each face follows a Gaussian distribution. There, the face saliency transition across video frames is learned by LSTM, and the final saliency map is generated by combining the saliency of all faces in the video. In our work, we first explore 2C-LSTM with Bayesian dropout to directly predict saliency maps in an end-to-end manner. This allows learning a more complex distribution of human attention, rather than a pre-assumed distribution of saliency.
The eye-tracking databases of videos collect the fixations of subjects on each video frame, which can be used as the ground truth for video saliency prediction. The existing eye-tracking databases benefit from the mature eye-tracking technology. In particular, an eye tracker is used to obtain the fixations of subjects on videos, by tracking the pupil and corneal reflections. The pupil locations are then mapped to the real-world stimuli, i.e., video frames, through a pre-defined calibration matrix. As such, fixations can be located in each video frame, indicating where people pay attention.
Now, we review the existing video eye-tracking databases. Table I summarizes the basic properties of these databases. To the best of our knowledge, CRCNS, SFU, DIEM and Hollywood are the most popular databases, widely used in most of the recent video saliency prediction works [46, 32, 14, 17, 5, 6, 19, 39, 47]. In the following, they are reviewed in more detail.
CRCNS is one of the earliest video eye-tracking databases, established by Itti et al. in 2004. It is still used as a benchmark in recent video saliency prediction works. CRCNS contains 50 videos, mainly including outdoor scenes, TV shows and video games. The length of each video ranges from 5.5 to 93.9 seconds, and the frame rate of all videos is 30 frames per second (fps). For each video, 4 to 6 subjects were asked to look at the main actors or actions. Afterward, they were required to describe the main content of the video. Thus, CRCNS is a task-driven eye-tracking database for videos. Later, a new database was established by manually cutting all 50 videos of CRCNS into 523 “clippets” with 1-3 second durations, according to the abrupt cinematic cuts. Another 8 subjects were recruited to view these video clippets, with their eye-tracking data recorded.
SFU is a public video database containing the eye-tracking data of 12 uncompressed YUV videos, which are frequently used as the standard test set for video compression and processing algorithms. Each video is in CIF resolution and lasts 3-10 seconds at a frame rate of 30 fps. All eye-tracking data were collected while 15 non-expert subjects were free-viewing all 12 videos twice.
DIEM is another widely used database, designed to evaluate the contributions of different visual features to gaze clustering. DIEM comprises 84 videos sourced from publicly accessible videos, including advertisements, game trailers, movie trailers and news clips. Most of these videos have frequent cinematic cuts. Each video lasts 27-217 seconds at 30 fps. The free-viewing fixations of around 50 subjects were tracked for each video.
Hollywood is a large-scale eye-tracking database for video saliency prediction, which contains all videos from two action recognition databases: Hollywood-2 and UCF sports. All 1707 videos in Hollywood-2 were selected from 69 movies, according to 12 action classes, such as answering a phone, eating and shaking hands. UCF sports is another action database, including 150 videos with 9 sport action classes. The human fixations of 19 subjects were captured under 3 conditions: free viewing (3 subjects), an action recognition task (12 subjects), and a context recognition task (4 subjects). Although the number of videos in Hollywood is large, its video content is not diverse, being constrained to human actions. Besides, it mainly focuses on the task-driven viewing mode, rather than free viewing.
As discussed in Section II-B, video saliency prediction may benefit from the recent development of deep learning. Unfortunately, as seen in Table I, the existing databases for video saliency prediction lack sufficient eye-tracking data to train DNNs. Although Hollywood has 1857 videos, it mainly focuses on task-driven visual saliency. Besides, the video content of Hollywood is limited, only involving human actions in movies. In fact, a large-scale eye-tracking database for videos should satisfy 3 criteria: 1) a large number of videos, 2) sufficient subjects, and 3) various video content. In this paper, we establish a large-scale eye-tracking database of videos satisfying the above three criteria. The details of our large-scale database are discussed in Section III.
|Class||Daily action||Sports||Social activity||Art performance||Animal||Man-made object||All|
|Number of sub-classes*||23||17||21||19||51||27||148|
|Number of videos||74||58||69||59||156||122||538|
*Here, we also report the number of sub-classes in each class of videos. For example, in animal videos, the sub-classes include penguin, rabbit, elephant, etc.
In this section, a new Large-scale Eye-tracking Database of Videos (LEDOV) is established, which is available online to facilitate future research. More details and analysis of our LEDOV database are discussed in the following.
We present our LEDOV database from the aspects of stimuli, apparatus, participant and procedure.
Stimuli. 538 videos, with 179,336 frames and 6,431 seconds in total, were collected according to the following criteria.
Including at least one object. Only videos with at least one object were qualified for our database. Table III reports the numbers of videos containing different numbers of objects in our database.
High quality. We ensured high quality of videos in our database by choosing those with at least 720p resolution and 24 Hz frame rate. To avoid quality degradation, the bit rates were maintained when converting videos to the uniform MP4 format.
Stable shot. Videos with unsteady camera motions and frequent cinematic cuts were not included in LEDOV. Specifically, there are 212 videos with stable camera motion. The other videos are without any camera motion.
Apparatus & Participants. To monitor binocular eye movements, an eye tracker, Tobii TX300, was used in our experiment. TX300 is an integrated eye tracker with a 23″ TFT monitor. During the experiment, TX300 captured gaze data at 300 Hz. The gaze accuracy can reach 0.4 degrees of visual angle (around 15 pixels in the stimuli) under the ideal working condition, i.e., the illumination in the working environment is constant at 300 lux and the distance between subjects and the monitor is fixed at 65 cm; this condition was satisfied in our eye-tracking experiment. Moreover, 32 participants (18 males and 14 females), aged from 20 to 56 (32 on average), were recruited to take part in the eye-tracking experiment. All participants were non-experts in eye-tracking experiments, with normal/corrected-to-normal vision. It is worth pointing out that only those who passed the calibration of the eye tracker and had a low fixation dropping rate were qualified for our eye-tracking experiment. As a result, 32 among 60 subjects were selected for our experiment.
Procedure. Since visual fatigue may arise after viewing videos for a long time, the 538 videos in LEDOV were divided into 6 non-overlapping groups with similar numbers of videos and similar content distributions (i.e., human, animal and man-made object). During the experiment, each subject was seated on an adjustable chair around 65 cm from the screen, followed by a 9-point calibration. Then, the subject was asked to free-view the 6 groups of videos in a random order. In each group, the videos were also displayed at random. Between two successive videos, we inserted a 3-second rest period with a black screen and a 2-second guidance image with a red circle in the screen center. As such, the eyes could be relaxed, and the initial gaze location could be reset to the center. After viewing a group of videos, the subject was asked to take a break until he/she was ready for viewing the next group of videos. Finally, 5,058,178 fixations (saccades and other eye movements having been removed) were recorded from the 32 subjects on the 538 videos for our LEDOV database.
In this section, we mine our database to analyze human attention on videos. More details are introduced as follows.
It is interesting to explore the temporal correlation of attention across consecutive frames. In Figure 2, we show human fixation maps along with some consecutive frames for 3 selected videos. As we can see from Figure 2, there exists a high temporal correlation of attention across consecutive frames of videos. To quantify such correlation, we further measure the linear correlation coefficient (CC) of fixation maps between two consecutive frames. Assume that M_c and M_p are the fixation maps of the current frame c and a previous frame p. Then, the CC value of fixation maps averaged over a video can be calculated as follows,

CC = (1 / |C|) Σ_{c∈C} (1 / |P_c|) Σ_{p∈P_c} Cov(M_c, M_p) / (σ(M_c) · σ(M_p)).   (1)

In (1), C is the set of all frames in the video, while P_c is the set of consecutive frames before frame c. Additionally, Cov(·,·), σ(·) and μ(·) are the covariance, standard deviation and mean operators, where μ(·) is used inside the covariance. For P_c, we choose 4 sets of previous frames, i.e., 0-0.5s before, 0.5-1s before, 1-1.5s before and 1.5-2s before. Then, in Figure 3, we plot the CC results of these 4 sets of P_c, averaged over all videos in our LEDOV database. We also show in Figure 3 the one-vs-all results, i.e., the baseline of averaged CC between the fixation maps of one subject and those of the rest (indicating the attention correlation between humans). We can see from this figure that the CC value of temporal consistency is much higher than that of the one-vs-all baseline. This implies high temporal correlation across consecutive frames of a video. We can further find that the temporal correlation of attention decreases when increasing the distance between the current and previous frames. Consequently, the long- and short-term dependency of attention across video frames can be verified.
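To make the metric concrete, the averaged CC above can be sketched in NumPy as follows (the function names are ours, and the set P_c is simplified to fixed frame lags rather than time windows):

```python
import numpy as np

def cc(map_a, map_b):
    """Linear correlation coefficient (CC) between two fixation maps."""
    a = map_a.astype(np.float64).ravel()
    b = map_b.astype(np.float64).ravel()
    a = (a - a.mean()) / (a.std() + 1e-12)  # standardize: zero mean, unit std
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float(np.mean(a * b))

def avg_temporal_cc(fixation_maps, lags):
    """Average CC between frame c and frames c - lag, for lag in lags."""
    vals = []
    for c in range(max(lags), len(fixation_maps)):
        for lag in lags:
            vals.append(cc(fixation_maps[c], fixation_maps[c - lag]))
    return float(np.mean(vals))
```

Identical (or affinely shifted) fixation maps yield a CC of 1, while unrelated maps yield values near 0, matching the one-vs-all baseline interpretation above.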
It is intuitive that people may be attracted by objects rather than the background when watching videos. Therefore, we investigate how much attention is related to object regions. First, we apply a CNN-based object detection method, YOLO, to detect the main objects in each video frame. Here, we generate different numbers of candidate objects with YOLO, by setting the thresholds of confidence probability and non-maximum suppression. Figure 4-(b) shows examples with one detected object, while Figure 4-(c) shows the results for more than one object. We can observe from Figure 4-(b) that attention normally falls on object regions. We can also see from Figure 4-(c) that more human fixations can be included as the number of detected candidate objects increases. To quantify the correlation between human attention and objectness, we measure the proportion of fixations falling into object regions among those of all regions. In Figure 5-(a), we show the fixation proportion at increasing numbers of candidate objects, averaged over all videos in LEDOV. We can observe from this figure that the proportion of fixations hitting object regions is much higher than that hitting random regions. This implies that there exists a high correlation between objectness and human attention when viewing videos. Figure 5-(a) also shows that the fixation proportion increases along with more candidate objects, which indicates that human attention may be attracted by more than one object.
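The fixation-to-object correlation can be sketched as below, assuming fixations are given as (x, y) points and the detected candidate objects as bounding boxes (a simplification for illustration; the names are ours):

```python
import numpy as np

def fixation_hit_proportion(fixations, boxes):
    """Proportion of (x, y) fixations falling inside any bounding box.

    fixations: array-like of shape (N, 2) with (x, y) coordinates.
    boxes: list of (x_min, y_min, x_max, y_max) candidate object boxes.
    """
    fixations = np.asarray(fixations, dtype=np.float64)
    hit = np.zeros(len(fixations), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        inside = ((fixations[:, 0] >= x0) & (fixations[:, 0] <= x1) &
                  (fixations[:, 1] >= y0) & (fixations[:, 1] <= y1))
        hit |= inside  # a fixation counts if it lies in at least one box
    return float(hit.mean())
```

Adding more candidate boxes can only increase this proportion, which is consistent with the monotone trend reported in Figure 5-(a).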
In addition, one may find from Figure 4 that human attention falls on only small parts of the object regions. Therefore, we measure the proportion of the fixation area inside an object to the whole object area, where the fixation area is computed by applying a pre-set threshold to the fixation map, and the fixation map is generated by convolving the fixation points with a Gaussian filter. Figure 5-(b) shows the results of this proportion at increasing numbers of detected candidate objects. We can see from this figure that the proportion of the fixation area decreases as more candidate objects are detected.
From our LEDOV database, we find that human attention tends to focus on moving objects or the moving parts of objects. Specifically, as shown in the first row of Figure 6, human attention transfers to the big penguin when it suddenly falls with a rapid motion. Besides, the second row of Figure 6 shows that, in a scene with a single salient object, the intensively moving parts of the player may attract considerably more fixations than other parts. It is interesting to further explore the correlation between motion and human attention inside object regions. Here, we apply FlowNet, a CNN based optical flow method, to measure the motion intensity in all frames (some results are shown in Figure 4-(d)). At each frame, pixels are ranked in descending order of motion intensity. Subsequently, we cluster the ranked pixels into 10 groups with equal numbers of pixels, over all video frames in the LEDOV database. For example, the first group includes the pixels with top-ranked motion intensity. The numbers of fixations falling into each group are shown in Figure 7. We can see from Figure 7 that most fixations belong to the groups with top-ranked motion intensity. This implies a high correlation between motion and human attention within object regions.
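The grouping procedure above can be sketched as follows for a single frame (names are ours; the actual analysis pools pixels over all frames of the database):

```python
import numpy as np

def fixations_per_motion_group(motion, fixation_mask, n_groups=10):
    """Count fixations per motion-intensity group.

    Pixels are ranked in descending order of motion intensity and split
    into n_groups equal-sized groups; group 0 holds the fastest pixels.
    motion: 2-D array of per-pixel motion intensity.
    fixation_mask: boolean 2-D array, True where a fixation landed.
    """
    order = np.argsort(-motion.ravel(), kind="stable")  # descending rank
    group_size = motion.size // n_groups
    fix = fixation_mask.ravel()
    counts = []
    for g in range(n_groups):
        idx = order[g * group_size:(g + 1) * group_size]
        counts.append(int(fix[idx].sum()))
    return counts
```

If fixations concentrate on high-motion pixels, the counts are skewed toward the first groups, as observed in Figure 7.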
For video saliency prediction, we develop a new DNN architecture that combines OM-CNN and 2C-LSTM. Inspired by the second and third findings of Section III-B, OM-CNN integrates both the regions and motions of objects to predict video saliency through two subnets, i.e., the subnets of objectness and motion. In OM-CNN, the objectness subnet yields a coarse objectness map, which is used to mask the features output from the convolutional layers of the motion subnet. Then, the spatial features from the objectness subnet and the temporal features from the motion subnet are concatenated to generate the spatio-temporal features of OM-CNN. The architecture of OM-CNN is shown in Figure 8. According to the first finding of Section III-B, 2C-LSTM with Bayesian dropout is developed to learn the dynamic saliency of video clips, in which the spatio-temporal features of OM-CNN serve as the input. Finally, the saliency map of each frame is generated from the 2 deconvolutional layers of 2C-LSTM. The architecture of 2C-LSTM is shown in Figure 9.
In the following, we present the details of OM-CNN by introducing the objectness subnet in Section IV-B and the motion subnet in Section IV-C. In addition, the details of 2C-LSTM are discussed in Section IV-D. The training process of our DNN method is presented in Section IV-E.
The second finding of Section III-B has shown that objects draw extensive attention in videos. Therefore, OM-CNN includes an objectness subnet for extracting multi-scale spatial features related to objectness information. The basic structure of the objectness subnet is based on a pre-trained YOLO. Note that YOLO is a state-of-the-art CNN architecture, capable of detecting video objects with high accuracy. In OM-CNN, the YOLO structure is utilized to learn the spatial features of the input frame for saliency prediction. To avoid over-fitting, the fast version of YOLO is applied in the objectness subnet, including 9 convolutional layers, 5 pooling layers and 2 fully connected (FC) layers. To further avoid over-fitting, an additional batch-normalization layer is added to each convolutional layer. Assuming that P_max(·) and Conv(·) are the max pooling and convolution operations, the output of the i-th convolutional layer in the objectness subnet can be computed as

O_i = P_max( f( Conv(O_{i-1}; W_i, B_i) ) ),   (2)

where W_i and B_i indicate the kernel parameters of weight and bias at the i-th convolutional layer. Additionally, f(·) is a leaky ReLU activation with a small leakage coefficient.
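As a rough sketch of such a layer, the following NumPy code chains a valid convolution, a leaky ReLU and 2×2 max pooling (a toy, single-image, loop-based version; the actual subnet uses batched GPU convolutions, and the leakage coefficient 0.1 is our placeholder):

```python
import numpy as np

def conv2d(x, w, b):
    """Valid cross-correlation of x (H, W, Cin) with kernels w (k, k, Cin, Cout)."""
    k = w.shape[0]
    H, W, _ = x.shape
    out = np.zeros((H - k + 1, W - k + 1, w.shape[3]))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2])) + b
    return out

def leaky_relu(x, alpha=0.1):
    """f(.): leaky ReLU with leakage coefficient alpha (placeholder value)."""
    return np.where(x > 0, x, alpha * x)

def max_pool(x, s=2):
    """P_max(.): non-overlapping s x s max pooling."""
    H, W, C = x.shape
    x = x[:H - H % s, :W - W % s, :]
    return x.reshape(x.shape[0] // s, s, x.shape[1] // s, s, C).max(axis=(1, 3))

def conv_layer(x, w, b, alpha=0.1, pool=2):
    """One layer of (2): O_i = P_max(f(Conv(O_{i-1}; W_i, B_i)))."""
    return max_pool(leaky_relu(conv2d(x, w, b), alpha), pool)
```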
To leverage multi-scale information with various receptive fields, a feature normalization (FN) operation is introduced in OM-CNN to normalize and concatenate certain convolutional layers of the objectness subnet. As shown in Figure 8-(b), FN includes a convolutional layer and a bilinear layer to normalize the input features into 128 channels with a fixed spatial size. Specifically, in the objectness subnet, the outputs of four convolutional layers at different depths (including the last one) are normalized by FN to obtain 4 sets of spatial features, referring to multiple scales. Besides, the output of the last FC layer in YOLO indicates the sizes, class probabilities and confidences of the candidate objects in each grid. The output of the last FC layer is therefore reshaped into a spatial feature, which is then bilinearly interpolated to the same spatial size, yielding a high-level spatial feature. At last, the final spatial features F_s are generated by concatenating these five sets of features.
Given the spatial features F_s, an inference module I_c is designed to generate a coarse objectness map O_c:

O_c = I_c(F_s).   (3)

The inference module I_c is a CNN structure consisting of 4 convolutional layers and 2 deconvolutional layers. The architecture of I_c is shown in Figure 8-(b). Consequently, the coarse objectness map O_c can be obtained to encode the objectness information, roughly related to salient regions.
Next, a motion subnet, also shown in Figure 8-(a), is incorporated in OM-CNN to extract multi-scale temporal features from a pair of neighboring frames. According to the third finding of Section III-B, attention is more likely to be attracted by moving objects or the moving parts of objects. Therefore, following the objectness subnet, the motion subnet is developed to extract motion features within object regions. The motion subnet is based on FlowNet, a CNN structure for estimating optical flow. In the motion subnet, only the first 10 convolutional layers of FlowNet are applied, in order to reduce the number of parameters. To model motion in object regions, the coarse objectness map O_c in (3) is used to mask the outputs of the first 6 convolutional layers of the motion subnet. As such, the output of the i-th convolutional layer can be computed as

M_i = f( Conv(M_{i-1}; W'_i, B'_i) ) ⊙ ( θ + (1 − θ) · O_c ),   (4)

where W'_i and B'_i indicate the kernel parameters of weight and bias at the i-th convolutional layer in the motion subnet, ⊙ is the element-wise product, and θ (0 ≤ θ ≤ 1) is an adjustable parameter to control the mask degree, mapping the range of O_c from [0, 1] to [θ, 1]. Note that the last 4 convolutional layers are not masked with the coarse objectness map, in order to consider the motion of non-object regions in saliency prediction. Afterwards, similar to the objectness subnet, the outputs of four convolutional layers at different depths are processed by FN, such that 4 sets of temporal features are obtained.
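A minimal sketch of this masking step, assuming the rescaling θ + (1 − θ)·O_c described above (the value of θ and the function names are our placeholders):

```python
import numpy as np

def mask_with_objectness(features, obj_map, theta=0.5):
    """Mask convolutional features with a coarse objectness map.

    obj_map in [0, 1] is rescaled to [theta, 1] so that non-object
    regions are attenuated rather than zeroed out entirely.
    features: (H, W, C) feature tensor; obj_map: (H, W) objectness map.
    """
    mask = theta + (1.0 - theta) * obj_map
    return features * mask[..., None]  # broadcast mask over channels
```

With θ = 0 the mask suppresses non-object regions completely, while θ = 1 disables masking; intermediate values keep some response in non-object regions, matching the note above about retaining non-object motion.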
Then, given the extracted features F_s and F_t from the two subnets of OM-CNN, another inference module I_s is constructed to generate a fine saliency map S_f, modeling the intra-frame saliency of a video frame. Mathematically, S_f can be computed as

S_f = I_s(F_s, F_t).   (5)

Here, S_f is also used to train the OM-CNN model, to be discussed in Section IV-E. Besides, the architecture of I_s is the same as that of I_c. As shown in Figure 8-(b), in I_s, the output of one convolutional layer is viewed as the final spatio-temporal features of OM-CNN for predicting intra-frame saliency. Next, these features are fed into 2C-LSTM, to be presented in the following.
In this section, we develop the 2C-LSTM network for learning to predict the dynamic saliency of a video clip, since the first finding of Section III-B illustrates that there exists a dynamic transition of attention across video frames. At frame t, taking the OM-CNN features as the input, 2C-LSTM leverages both the long- and short-term correlations of the input features, through the memory cells and hidden states of the 1st and 2nd LSTM layers at the previous frame. Then, the hidden states of the 2nd LSTM layer are fed into 2 deconvolutional layers to generate the final saliency map at frame t. The architecture of 2C-LSTM is shown in Figure 9.
The same as , we extend the conventional LSTM via replacing Hadamard product (denoted as ) by the convolutional operator (denoted as ), in order to consider spatial correlation of input OM-CNN features in the dynamic model. Taking the first LSTM layer as example, a single LSTM cell at frame can be written as
In (IV-D), the sigmoid and hyperbolic tangent serve as the activation functions, and the kernel parameters of weight and bias belong to the corresponding convolutional layers. The gates of input, forget and output, together with the input modulation, memory cells and hidden states of each frame, are all represented by 3-D tensors in 2C-LSTM.
In our method, we adopt two-layer LSTM cells to learn the temporal correlation of high-dimensional features. On the other hand, the two-layer LSTM cells decrease the generalization ability. Thus, to improve the generalization ability, we apply Bayesian-inference-based dropout in each convolutional LSTM cell, rewriting the LSTM cell of (IV-D) with random dropout masks.
In (IV-D), two sets of random masks are applied to the hidden states and input features before the convolution operation. These masks are generated by Monte Carlo integration, with a hidden dropout rate and a feature dropout rate, respectively.
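A minimal single-channel NumPy sketch of one convolutional LSTM step with optional Bayesian-dropout masks is given below. The kernel names, the 3x3 kernel size, and the single-channel simplification are assumptions; the paper's cells operate on multi-channel feature tensors.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 2-D cross-correlation with zero 'same' padding (single channel),
    following the deep-learning convention for 'convolution'."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, W, b, mask_x=None, mask_h=None):
    """One ConvLSTM step: input/state transforms use convolution, while the
    gate updates keep Hadamard products. mask_x / mask_h are optional
    Bayesian-dropout masks applied before the convolutions."""
    if mask_x is not None:
        x = x * mask_x
    if mask_h is not None:
        h = h * mask_h
    i = sigmoid(conv2d_same(x, W['xi']) + conv2d_same(h, W['hi']) + b['i'])
    f = sigmoid(conv2d_same(x, W['xf']) + conv2d_same(h, W['hf']) + b['f'])
    o = sigmoid(conv2d_same(x, W['xo']) + conv2d_same(h, W['ho']) + b['o'])
    g = np.tanh(conv2d_same(x, W['xg']) + conv2d_same(h, W['hg']) + b['g'])
    c_new = f * c + i * g        # memory cell update (Hadamard products)
    h_new = o * np.tanh(c_new)   # hidden state
    return h_new, c_new

# Toy usage with random kernels and dropout rate 0.25 (keep probability 0.75).
rng = np.random.default_rng(0)
shape = (6, 6)
x = rng.standard_normal(shape)
h0, c0 = np.zeros(shape), np.zeros(shape)
W = {k: 0.1 * rng.standard_normal((3, 3))
     for k in ['xi', 'hi', 'xf', 'hf', 'xo', 'ho', 'xg', 'hg']}
b = {k: 0.0 for k in ['i', 'f', 'o', 'g']}
mask_x = (rng.random(shape) > 0.25).astype(float)
mask_h = (rng.random(shape) > 0.25).astype(float)
h1, c1 = convlstm_step(x, h0, c0, W, b, mask_x, mask_h)
```

Stacking two such cells, with the hidden states of the first layer feeding the second, mirrors the two-layer design of 2C-LSTM.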
For training OM-CNN, we utilize a Kullback-Leibler (KL) divergence based loss function to update the parameters, because the KL divergence has been shown to be more effective than other metrics in training DNNs to predict saliency. Regarding the saliency map as a probability distribution of attention, we measure the KL divergence between the fine saliency map of OM-CNN and the ground-truth distribution of human fixations, as defined in (8).
In (8), a smaller KL divergence indicates higher accuracy in saliency prediction. Furthermore, the KL divergence between the coarse objectness map of OM-CNN and the ground truth is used as an auxiliary term to train OM-CNN, based on the assumption that object regions are correlated with salient regions. Then, the OM-CNN model is trained by minimizing the loss function of (9).
In (9), a hyper-parameter controls the relative weights of the two KL divergences. Note that OM-CNN is pre-trained on YOLO and FlowNet, and the remaining parameters of OM-CNN are initialized by the Xavier initializer.
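The loss of (9) can be sketched as follows. The direction of the KL divergence (ground truth against prediction), the epsilon smoothing, and the default weight `lam=0.5` (the tuned value from Table IV) are assumptions of this sketch.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL divergence between two saliency maps treated as probability
    distributions of attention (normalized to sum to 1)."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def om_cnn_loss(fine_map, coarse_map, ground_truth, lam=0.5):
    """Loss of (9): KL of the fine saliency map plus a lambda-weighted
    auxiliary KL of the coarse objectness map, both against the
    ground-truth fixation distribution."""
    return (kl_divergence(ground_truth, fine_map)
            + lam * kl_divergence(ground_truth, coarse_map))
```

When the predicted maps match the ground truth exactly, both terms vanish; any mismatch makes the loss strictly positive, so minimizing it drives both maps toward the fixation distribution.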
To train 2C-LSTM, the training videos are cut into clips of the same length. Besides, when training 2C-LSTM, the parameters of OM-CNN are fixed to extract the spatio-temporal features of each video clip. Then, the loss function of 2C-LSTM is defined as the KL divergence averaged over the frames of a clip, as in (10).
In (10), the final saliency maps generated by 2C-LSTM are compared with the ground-truth attention maps. For each LSTM cell, the kernel parameters are initialized by the Xavier initializer, while the memory cells and hidden states are initialized with zeros.
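The per-clip loss of (10) can be sketched as below; the `(T, H, W)` tensor layout and the epsilon smoothing are assumptions of this sketch.

```python
import numpy as np

def clip_loss(pred_maps, gt_maps, eps=1e-8):
    """Loss of (10): KL divergence between predicted and ground-truth
    attention maps, averaged over the T frames of one clip.
    pred_maps, gt_maps: arrays of shape (T, H, W)."""
    total = 0.0
    for s, g in zip(pred_maps, gt_maps):
        s = s / (s.sum() + eps)   # normalize prediction to a distribution
        g = g / (g.sum() + eps)   # normalize ground truth likewise
        total += np.sum(g * np.log((g + eps) / (s + eps)))
    return float(total / len(pred_maps))
```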
In this section, experimental results are presented to validate the performance of our method in video saliency prediction. Section V-A introduces the settings of our experiments. Sections V-B and V-C compare the accuracy of saliency prediction on LEDOV and on two other public databases, respectively. Furthermore, the results of the ablation experiments are discussed in Section V-D, to analyze the effectiveness of each individual component of our method.
In our experiments, the 538 videos in the LEDOV database are randomly divided into training (456 videos), validation (41 videos) and test (41 videos) sets. Specifically, to learn 2C-LSTM, we temporally segment the 456 training videos into 24,685 clips of equal length. An overlap of 10 frames is allowed in cutting the video clips, for the purpose of data augmentation. Before being input to OM-CNN, the RGB channels of each frame are resized to the input resolution of OM-CNN, with their mean values removed. In training OM-CNN and 2C-LSTM, we learn the parameters using the stochastic gradient descent algorithm with the Adam optimizer. Here, the hyper-parameters of OM-CNN and 2C-LSTM are tuned to minimize the KL divergence of saliency prediction over the validation set. The tuned values of some key hyper-parameters are listed in Table IV. Given the trained models of OM-CNN and 2C-LSTM, all 41 test videos in LEDOV are used to evaluate the performance of our method against 8 other state-of-the-art methods. All experiments are conducted on a computer with an Intel(R) Core(TM) i7-4770 CPU @ 3.4 GHz, 16 GB RAM and a single Nvidia GeForce GTX 1080 GPU. Benefiting from GPU acceleration, our method makes real-time predictions of video saliency at a speed of 30 fps.
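The clip segmentation with overlap described above can be sketched as follows. The clip length is left as a parameter, since its exact value is set per experiment; the function name and return format are hypothetical.

```python
def segment_clips(num_frames, clip_len, overlap=10):
    """Cut a video of num_frames frames into fixed-length clips whose
    start indices advance by clip_len - overlap frames, so consecutive
    clips share `overlap` frames (data augmentation).
    Returns a list of (start, end) frame-index pairs."""
    stride = clip_len - overlap
    clips = []
    start = 0
    while start + clip_len <= num_frames:
        clips.append((start, start + clip_len))
        start += stride
    return clips

# Toy usage: a 40-frame video cut into 16-frame clips with 10-frame overlap.
clips = segment_clips(40, clip_len=16, overlap=10)
```

Allowing overlapping clips multiplies the number of training sequences without collecting new videos, at the cost of correlated samples.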
Table IV. Tuned values of key hyper-parameters.
Hyper-parameters in OM-CNN:
- Objectness mask parameter in (IV-C): 0.5
- Weight of the two KL divergences in (9): 0.5
- Initial learning rate
- Training epochs (iterations)
Hyper-parameters in 2C-LSTM:
- Bayesian dropout rates (hidden & feature) in 2C-LSTM: 0.25 & 0.25
- Times of Monte Carlo integration in 2C-LSTM: 100
- Initial learning rate
- Training epochs (iterations)
In this section, we compare the accuracy of video saliency predicted by our method and by 8 other state-of-the-art methods: GBVS, PQFT, Rudoy, OBDL, SALICON, Xu, BMS and SalGAN. Among them, 5 are state-of-the-art saliency prediction methods designed for videos, and two, SALICON and SalGAN, are DNN-based methods. In our experiments, we apply four metrics to measure the accuracy of saliency prediction: the area under the receiver operating characteristic curve (AUC), the normalized scanpath saliency (NSS), the linear correlation coefficient (CC), and the KL divergence. Note that larger values of AUC, NSS or CC indicate more accurate saliency prediction, while a smaller KL divergence means better prediction. Table V tabulates the results of AUC, NSS, CC and KL divergence for our method and the 8 other methods, averaged over the 41 test videos of the LEDOV database. We can see from this table that our method performs much better than all other methods in all 4 metrics. More specifically, our method achieves at least 0.03, 0.61, 0.13 and 0.40 improvements in AUC, NSS, CC and KL, respectively. Besides, the two DNN-based methods, SALICON and SalGAN, outperform the conventional methods. This verifies that the saliency-related features automatically learned by DNNs are more effective than hand-crafted features. On the other hand, our method is significantly superior to SALICON and SalGAN. The main reasons are as follows: (1) our method embeds the objectness subnet to make use of objectness information in saliency prediction; (2) the object motion is explored in the motion subnet to predict video saliency; (3) the 2C-LSTM network is leveraged to model saliency transition across video frames. Section V-D analyzes these three reasons in more detail.
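Two of the four metrics, NSS and CC, have compact standard definitions that can be sketched as below. These are the textbook formulations, not necessarily the exact evaluation code used in the paper; the epsilon terms guard against division by zero.

```python
import numpy as np

def nss(saliency, fixations):
    """Normalized scanpath saliency: mean of the standardized saliency
    map at fixated pixels (fixations is a binary fixation map)."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(s[fixations > 0].mean())

def cc(saliency, density):
    """Linear correlation coefficient between a predicted saliency map
    and a ground-truth fixation density map."""
    a = saliency - saliency.mean()
    b = density - density.mean()
    return float((a * b).sum()
                 / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-8))

# Toy usage: a map that peaks exactly at the single fixated pixel.
sal = np.zeros((5, 5)); sal[2, 2] = 1.0
fix = np.zeros((5, 5)); fix[2, 2] = 1.0
```

A prediction that concentrates mass on the fixated pixels yields a large positive NSS, while CC reaches 1 only when the two maps are linearly identical.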
Next, we move to the comparison of subjective results in video saliency prediction. Figure 10 shows the saliency maps of 8 randomly selected test videos, detected by our method and the 8 other methods; one frame is selected for each video. One may observe from Figure 10 that our method is capable of accurately locating the salient regions, much closer to the ground-truth maps of human fixations. In contrast, most of the other methods fail to accurately predict the regions attracting human attention. In addition, Figure 11 shows the saliency maps of several frames selected from one test video. As seen in this figure, our method models human fixations with a smoother transition than the other 8 methods. In summary, our method is superior to the other state-of-the-art methods in both objective and subjective results on our LEDOV database.
To evaluate the generalization capability of our method, we further compare the performance of our method and the 8 other methods on two widely used databases, SFU and DIEM, which are available online. Following previous work, 20 videos of DIEM and all videos of SFU are tested to assess saliency prediction performance. In our experiments, the models of OM-CNN and 2C-LSTM, learned from the training set of LEDOV, are directly used to predict the saliency of test videos from the DIEM and SFU databases. Table VI presents the averaged results of AUC, NSS, CC and KL for our method and the 8 other methods over SFU and DIEM, respectively. We can see from this table that our method again outperforms the other 8 methods, especially on the DIEM database, where there are at least 0.04, 0.44, 0.09 and 0.25 improvements in AUC, NSS, CC and KL, respectively. Such improvements are comparable to those on our LEDOV database, which implies the generalization capability of our method in video saliency prediction.
Since the OM-CNN architecture of our method is composed of the objectness and motion subnets, we evaluate the contribution of each subnet. We further analyze the contribution of 2C-LSTM by comparing the trained models with and without it. Specifically, the objectness subnet, the motion subnet and OM-CNN are trained independently, with the same settings as introduced above. Then, they are compared with our full method, i.e., the combination of OM-CNN and 2C-LSTM. The comparison results are shown in Figure 12. We can see from this figure that OM-CNN performs better than the objectness subnet with a 0.05 reduction in KL divergence, and outperforms the motion subnet with a 0.09 reduction. This indicates the effectiveness of integrating the objectness and motion subnets. Besides, the combination of OM-CNN and 2C-LSTM reduces the KL divergence by 0.09 over the single OM-CNN architecture. Hence, we conclude that 2C-LSTM further improves the performance of OM-CNN by exploring the temporal correlation of saliency across video frames.
Furthermore, we analyze the performance of Bayesian dropout in 2C-LSTM, which aims to avoid the over-fitting caused by the high dimensionality of 2C-LSTM. Through the experimental results, we find that an improperly chosen dropout rate may incur under-fitting, resulting in reduced accuracy of saliency prediction. To analyze the impact of the dropout rate, we train the 2C-LSTM models at different values of the hidden dropout rate and the feature dropout rate. The trained models are tested over the validation set of LEDOV, and the averaged results of KL divergence are shown in Figure 13. We can observe from the figure that Bayesian dropout brings around 0.03 KL reduction when both rates are set to 0.25. However, the KL divergence rises sharply once the rates are increased from 0.25 to 1. Therefore, we set both dropout rates to 0.25 in our method; they may be adjusted according to the amount of training data.
In this paper, we have proposed a DNN-based video saliency prediction method, in which two DNN architectures were developed: OM-CNN and 2C-LSTM. The two architectures were driven by the LEDOV database established in this paper, which is composed of 32 subjects' fixations on 538 videos. Interestingly, we found from the LEDOV database that human fixations tend to fall on objects, especially moving objects or the moving parts of objects. Additionally, we found that the correlation of attention across consecutive frames is high. In light of these findings, the OM-CNN architecture was proposed to extract spatio-temporal features of objectness and motion for predicting the intra-frame saliency of videos, and the 2C-LSTM architecture was developed to model the inter-frame saliency of videos. Finally, the experimental results verified that our DNN-based method significantly outperforms 8 other state-of-the-art methods over both our database and two other public video eye-tracking databases, in terms of the AUC, CC, NSS and KL metrics.
There are two promising directions for future work. First, our method mainly focuses on videos with objects, yet some videos depict natural scenes without any salient object; saliency prediction for such natural-scene videos is an interesting future research direction. Second, our method may be applied to perceptual video coding. In particular, since our method is able to locate salient and non-salient regions in videos, the coding efficiency of videos can be improved by removing the perceptual redundancy in the non-salient regions. Consequently, fewer bits are needed to encode and deliver videos, greatly relieving the bandwidth-hungry issue in video transmission.
A. Borji, L. Itti, State-of-the-art in visual attention modeling, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1) (2013) 185–207.
Y. Du, W. Wang, L. Wang, Hierarchical recurrent neural network for skeleton based action recognition, in: CVPR, 2015, pp. 1110–1118.
X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.