1 Introduction
Video saliency detection [1, 2, 3, 4] is an important problem in video analysis, with applications such as object tracking, video retrieval and abnormal event detection. Its aim is to automatically discover the visually interesting regions in a video sequence. A challenging issue in this area is how to extract, in a unified framework, the atomic video structures (defined as the basic operating units that account for spatial smoothness and temporal consistency), which reflect the intrinsic properties of the video and the interactions among atomic units across both spatial and temporal dimensions. The focus of video saliency detection is therefore on effectively modeling the spatiotemporal relationships within the video.
In recent years, video saliency detection has typically been characterized by exploring spatial and temporal properties. On the one hand, the spatial saliency properties of a video sequence can be represented in multiple ways. For example, some approaches measure saliency by local or global center-surround contrast, which employs visual features such as color, intensity and orientation [5, 6, 7, 8, 9, 10, 11, 12]. With the development of convolutional neural networks, deep learning has become widely used in saliency detection tasks [13, 14, 15, 16, 17]. These data-driven saliency models aim to directly capture features that represent the semantic properties of salient regions by means of supervised learning from training data with saliency annotations. Furthermore, the approaches in [18, 19] formulate saliency detection as a labeling task on a graph, representing the coherence and consistency among salient regions and the difference between salient regions and the background. On the other hand, video sequence understanding also needs to take temporal properties into consideration [20]. Some approaches [21, 22] extend existing spatial saliency detection methods by adding the temporal dimension to obtain the final video saliency maps. In essence, the task of video saliency detection depends on spatial and temporal properties simultaneously. Therefore, how to effectively model these factors in a unified framework is the key to better exploring the intrinsic properties of atomic video structures and their associated contextual interactions.
Motivated by these observations, we propose a graph-theoretic video saliency detection approach based on adaptive video structure discovery, which is carried out within a spatiotemporal atomic graph. Considering spatial smoothness and temporal consistency, we discover atomic units to better model the intrinsic properties of a video sequence. To better represent the rich information carried by atomic units (e.g., color, context and semantics), we combine low-level color features with highly abstract deep features. Then, we utilize graph-based manifold propagation to model the semantically contextual interactions within atomic video structures for saliency detection while preserving spatial smoothness and temporal consistency.
The main contributions of our paper are summarized as follows:
1) We propose a graph-based video saliency detection approach based on adaptive video structure discovery, which is carried out in a spatiotemporal atomic graph. As the representation units of a video, the atomic units are able to model both the spatial and the temporal layout of the video in a unified structure.
2) In the spatiotemporal context model, we utilize graph propagation for video saliency detection. Through graph-based manifold propagation, the proposed approach models the semantically contextual interactions within atomic video structures while preserving spatial smoothness and temporal consistency. Our work thus addresses the intrinsic problem of how to connect spatiotemporal context modeling to video saliency, leading to promising saliency detection results.
2 Our Approach
2.1 Problem Formulation
In this work, we aim to build a framework that automatically solves the video saliency detection problem. We regard a t-frame video as a space-time lattice spanned by the 2D frame plane and the time axis, so that the video can be viewed as a 3D matrix whose dimensions are the frame height, the frame width and the number of frames t. Our goal is to discover the salient region of the video, i.e., a labeling of the lattice in which a location is marked if and only if the pixel at that location is salient. Finally, the framework outputs saliency maps for the video sequence, defined as a set of per-frame saliency maps. The video saliency maps not only segment the visually interesting regions of each frame, but also capture the moving regions, which exhibit temporal consistency, and the abrupt changes across consecutive frames.
A video is essentially a sequence of temporally adjacent frames that encodes both static and dynamic information about a scene. The pixels within a spatiotemporal local region possess the property of spatial smoothness. Meanwhile, the motion trends of its contents preserve temporal smoothness across consecutive frames.
Based on these observations, we discover the spatiotemporal atomic units of a video to exploit consistency in time and smoothness in space. The video becomes a structural layout of atomic units, which contain rich information about its appearance, motion trends, semantic interactions and so on. We exploit this information by representing the units with combined pixel-wise features.
To define the contextual interactions of the video, a compact 3D adjacency graph is constructed with nodes (the atomic units) and edges (the interactions among units) along the spatial and temporal dimensions. Then, we exploit the saliency distribution of the video by means of graph-based manifold propagation. The salient region of the video is thus defined as the subgraph activated by high saliency scores during the propagation procedure. Figure 1 shows the pipeline of our method.
2.2 Graph-based Modeling
2.2.1 Spatiotemporal Atomic Unit Discovery
The core idea of our framework is to exploit the spatial smoothness and temporal consistency of a video simultaneously. These properties are encoded in the interactions among pixels within a spatiotemporal atomic unit. The discovery of atomic units is therefore cast as a pixel clustering problem driven by coherence in appearance and motion trend. This problem can be solved by spatiotemporal segmentation [23].
For a given video, our goal is to discover its atomic structure. To obtain atomic units of the most suitable granularity, the segmentation is conducted hierarchically, resulting in multiple layers of individual segments, where each layer is a set of atomic units that together cover the whole video and are disjoint for all pairs of atomic units.
The spatiotemporal atomic units contain rich information about the video, such as color, context and semantics. We therefore represent them with LAB color features [24] and FCN features [25] in a pixel-wise manner. As the atomic units are of arbitrary shapes and sizes, we aggregate the pixel-wise features by regional pooling.
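As an illustration of regional pooling over units of arbitrary shape, the following minimal sketch aggregates pixel-wise features into one vector per atomic unit. It assumes the pooling scheme reported in the implementation details (intra-frame max-pooling followed by inter-frame average-pooling). The function name `pool_atomic_units` and the array layouts are hypothetical choices made for this example, not the authors' code.

```python
import numpy as np

def pool_atomic_units(features, labels):
    """Aggregate pixel-wise features into one vector per atomic unit.

    features: float array of shape (T, H, W, C), one C-dim vector per pixel
              (assumed layout).
    labels:   int array of shape (T, H, W) giving the atomic-unit id of
              every pixel (assumed output of the segmentation step).
    Returns (unit_ids, pooled), where pooled has shape (N, C).
    """
    unit_ids = np.unique(labels)
    pooled = np.zeros((unit_ids.size, features.shape[-1]))
    for k, uid in enumerate(unit_ids):
        per_frame = []
        for t in range(labels.shape[0]):
            mask = labels[t] == uid
            if mask.any():
                # Intra-frame max-pooling over the unit's pixels in frame t.
                per_frame.append(features[t][mask].max(axis=0))
        # Inter-frame average-pooling over the frames the unit spans.
        pooled[k] = np.mean(per_frame, axis=0)
    return unit_ids, pooled
```

The same routine would be run once on the LAB feature map and once on the FCN feature map, giving two descriptors per atomic unit.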
Therefore, two feature maps are extracted from the video. For the i-th atomic unit, regional pooling yields two aggregated feature vectors, one in the LAB feature space and one in the FCN feature space. The video is thus transformed into a set of N atomic-unit representations, where N is the total number of atomic units generated after discovery.
2.2.2 Graph-based Context Model
The spatial smoothness and temporal consistency of a video also lie in the contextual interactions among the atomic units.
To address this problem, we construct a spatiotemporal adjacency graph to model the contextual interactions among atomic units. Its vertex set is the set of atomic units, and its edge set consists of undirected edges that connect neighboring atomic units. Here, spatiotemporal adjacency is defined as the relationship that atomic units are neighboring in both spatial and temporal dimensions. Therefore, the atomic-unit-level graph with its affinity matrix is constructed as follows:
(1)
where the weight between two neighboring units is given by the RBF kernel, which evaluates their feature similarity and is governed by a scaling factor. The weights are computed on the basis of the distance in the feature space, and in our case two different 3D graphs are constructed, in the LAB feature space and the FCN feature space respectively.
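The graph construction described above can be sketched as follows. Since the exact kernel expression is not recoverable here, the sketch assumes the common Gaussian form exp(-||x_i - x_j||^2 / (2 sigma^2)) for adjacent units and zero weight otherwise. The helper names, the parameter `sigma`, and the convention that unit ids run from 0 to N-1 are assumptions made for illustration.

```python
import numpy as np

def adjacent_pairs(labels):
    """Return id pairs (i, j), i < j, of units that touch along t, y or x.

    labels: int array of shape (T, H, W) with unit ids in 0..N-1 (assumed).
    """
    pairs = set()
    for axis in range(labels.ndim):
        moved = np.moveaxis(labels, axis, 0)
        a = moved[:-1].ravel()
        b = moved[1:].ravel()
        differ = a != b
        for i, j in zip(a[differ], b[differ]):
            pairs.add((int(min(i, j)), int(max(i, j))))
    return pairs

def affinity_matrix(feats, pairs, sigma):
    """Symmetric affinity matrix with RBF weights on adjacent units only.

    feats: array of shape (N, C), one pooled feature vector per unit.
    sigma: scaling factor of the kernel (assumed Gaussian form).
    """
    n = feats.shape[0]
    W = np.zeros((n, n))
    for i, j in pairs:
        d2 = np.sum((feats[i] - feats[j]) ** 2)
        W[i, j] = W[j, i] = np.exp(-d2 / (2.0 * sigma ** 2))
    return W
```

Building one matrix from the LAB descriptors and another from the FCN descriptors gives the two 3D graphs mentioned in the text.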
2.3 Saliency Propagation
We exploit the long-range contextual correlation of the graph model, which defines the saliency distribution of the video, by graph-based propagation. The salient region of the video is thus a subgraph of its 3D graph model activated by saliency propagation.
In this framework, propagation is conducted in the manner of manifold ranking [18]. A ranking function assigns a saliency score to each atomic unit. The initial saliency vector is obtained through coarse-grained foreground segmentation using standard computer vision techniques (e.g., background subtraction or object detection). The graph is associated with an edge-weighted affinity matrix and its degree matrix, whose diagonal entries are the row sums of the affinity matrix. Hence, our task is to learn the optimal ranking by minimizing an energy function within the following optimization framework:
(2) 
Eq. (2) is controlled by a smoothing factor. The closed-form solution of Eq. (2) is:
(3) 
Eq. (3) involves an identity matrix and the normalized Laplacian matrix of the affinity matrix.
For the LAB and FCN feature spaces, we propagate a separate saliency map in each space. The final saliency map of the video is then computed as:
(4) 
Eq. (4) uses a trade-off control factor and the element-wise multiplication operator. The whole video saliency detection algorithm is summarized in Algorithm 1.
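For readers who want to experiment with the propagation step, the sketch below implements the standard closed-form manifold ranking of [18], f* = (I - alpha S)^{-1} y with S = D^{-1/2} W D^{-1/2}. It is offered as an assumption about the form of Eq. (3), not as the authors' exact formulation. The symbol `alpha` (related to the smoothing factor mu of the standard energy by alpha = 1 / (1 + mu)) and the function names are introduced here for illustration. The fusion of Eq. (4) is not reproduced because its exact form is not given.

```python
import numpy as np

def manifold_ranking(W, y, alpha):
    """Closed-form manifold ranking: solve (I - alpha * S) f = y.

    W:     symmetric, non-negative affinity matrix of shape (N, N).
    y:     initial saliency vector of shape (N,).
    alpha: propagation weight in (0, 1); assumed parameterisation.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d, dtype=float)
    nonzero = d > 0
    d_inv_sqrt[nonzero] = 1.0 / np.sqrt(d[nonzero])
    # Symmetrically normalised affinity S = D^{-1/2} W D^{-1/2}.
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Solving the linear system avoids forming an explicit inverse.
    return np.linalg.solve(np.eye(W.shape[0]) - alpha * S, y)

def normalise(f):
    """Rescale ranking scores to [0, 1] for display as a saliency map."""
    span = f.max() - f.min()
    return (f - f.min()) / span if span > 0 else np.zeros_like(f)
```

Under these assumptions the routine would be called once per feature space, each time with the affinity matrix of that space and the same initial saliency vector.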
3 Experiments
3.1 Datasets and Evaluation Measures
In this section, we evaluate the performance of the proposed algorithm on 10 test video sequences captured by a static camera, taken from the MPEG, NTT and MCL datasets [26, 20]. These three datasets (details are given in the supplementary material) include outdoor and indoor scenes containing several challenging conditions such as low resolution, lighting changes, abrupt changes, occlusion, multiple objects and high scene complexity.
We utilize the AUC score, i.e., the area under the receiver operating characteristic (ROC) curve, which captures the relationship between the true positive rate (TPR) and the false positive rate (FPR). Another evaluation measure is the normalized scanpath saliency (NSS) score [27].
3.2 Implementation Details
During the atomic unit discovery procedure, the segmentation is carried out by the GBH method [28, 23] with the streaming level set to 10. We represent every pixel with a LAB feature vector and an FCN feature vector. The FCN features are extracted with the FCN-32s network, implemented on the basis of the Caffe toolbox [29, 14]. For the feature representation, we conduct intra-frame max-pooling and inter-frame average-pooling within atomic units. Saliency propagation involves three parameters: the scaling factor, which controls the strength of the weights in the RBF kernel, the smoothing factor in Eq. (2), and the fusion weight in Eq. (4). All of these parameters are fixed throughout the experiments.
3.3 Analysis of Proposed Approach
We quantitatively and qualitatively compare the proposed approach with several methods, including GBVS [30], SRA [31], SSR [2], RWR [20], FSR [24], HBC [32], RBC [32] and GMR [18]. More qualitative and quantitative results can be found in the supplementary material. Figure 2 shows the ROC curves of all the compared approaches, averaged over all test video sequences. From Figure 2, we observe that the proposed approach achieves better performance than the other methods. Table 1 reports the quantitative saliency detection performance in terms of AUC: our approach attains the highest AUC on all ten test sequences. Figure 3 shows an example of the experimental results for graph propagation. From Figure 3 (b) and (c), we observe that our approach is able to effectively capture the whole object structure information by graph-based context modeling.
Table 1: AUC scores of the compared methods on each test sequence.
Sequence  Ours    FSR     GBVS    GMR     RWR     HBC     RBC     SRA     SSR
Hall1     0.8368  0.8136  0.8175  0.8143  0.8295  0.8154  0.8194  0.7803  0.8105
Bird1     0.7981  0.7504  0.7446  0.7498  0.7686  0.7413  0.7529  0.7291  0.7501
Bird2     0.7741  0.7459  0.7302  0.7389  0.7691  0.7409  0.7518  0.7320  0.7530
Horse     0.7243  0.6801  0.6821  0.6917  0.7069  0.6832  0.6957  0.6810  0.6921
Car       0.7191  0.7023  0.7003  0.7008  0.7114  0.7001  0.7117  0.6923  0.7023
Campus    0.7238  0.7059  0.7055  0.7059  0.7183  0.7007  0.7058  0.6961  0.7061
Crowd     0.7713  0.7485  0.7504  0.7488  0.7673  0.7401  0.7499  0.7380  0.7498
Road      0.7234  0.7065  0.7062  0.7069  0.7176  0.7039  0.7071  0.7009  0.7097
Square    0.7217  0.7143  0.7148  0.7165  0.7157  0.7139  0.7154  0.7041  0.7035
Stair     0.7208  0.7104  0.7128  0.7106  0.7154  0.7102  0.7121  0.7017  0.7033
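AUC values of the kind reported in Table 1 can be computed along the following lines. This is a minimal sketch that assumes a binary ground-truth mask per frame and uses the rank-sum (Mann-Whitney) identity for the area under the ROC curve. The evaluation protocol actually used for the table (e.g., the type of ground truth or AUC variant) is not specified here, and the function name `roc_auc` is hypothetical.

```python
import numpy as np

def roc_auc(saliency, ground_truth):
    """Area under the ROC curve via the rank-sum statistic.

    saliency:     array of real-valued saliency scores.
    ground_truth: array of the same shape; non-zero marks salient pixels
                  (assumed binary annotation).
    """
    s = np.asarray(saliency, dtype=float).ravel()
    g = np.asarray(ground_truth).ravel().astype(bool)
    n_pos, n_neg = int(g.sum()), int((~g).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both salient and non-salient pixels")
    order = np.argsort(s, kind="mergesort")
    sorted_s = s[order]
    ranks = np.empty(s.size)
    i = 0
    while i < s.size:
        j = i
        while j + 1 < s.size and sorted_s[j + 1] == sorted_s[i]:
            j += 1
        # Tied scores share the average of their 1-based ranks.
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    u = ranks[g].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

The explicit tie loop keeps the sketch dependency-free; for full-resolution frames a vectorised ranking routine would be faster.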
4 Conclusion
In this paper, we propose a graph-theoretic video saliency detection approach based on adaptive video structure discovery, which is carried out within a spatiotemporal atomic graph. We adopt atomic units to better capture the intrinsic properties of a video. Furthermore, the proposed approach utilizes graph-based manifold propagation to model the semantically contextual interactions within atomic video structures and generate video saliency maps. We evaluate the proposed approach on video sequences and achieve promising results in comparison with state-of-the-art methods. The experimental results demonstrate that the proposed approach performs well under different evaluation metrics on complex video sequences.
References
 [1] Lingyun Zhang, Matthew H Tong, and Garrison W Cottrell, “Sunday: Saliency using natural statistics for dynamic analysis of scenes,” in Proc. 31st Annu. Cognit. Sci. Conf., 2009, pp. 2944–2949.
 [2] Yawen Xue, Xiaojie Guo, and Xiaochun Cao, “Motion saliency detection using low-rank and sparse decomposition,” in Proc. IEEE ICASSP, 2012, pp. 1485–1488.
 [3] Xiao Xian, Changsheng Xu, and Yong Rui, “Video based 3d reconstruction using spatiotemporal attention analysis,” in Proc. IEEE ICME, 2010, pp. 1091–1096.
 [4] Jia Li, Changqun Xia, and Xiaowu Chen, “A benchmark dataset and saliency-guided stacked autoencoder for video-based salient object detection,” arXiv preprint arXiv:1611.00135, 2016.
 [5] Laurent Itti, Christof Koch, and Ernst Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, 1998.
 [6] Dominik A Klein and Simone Frintrop, “Center-surround divergence of feature statistics for salient object detection,” in Proc. IEEE Conf. ICCV, 2011, pp. 2214–2219.
 [7] Yichen Wei, Fang Wen, Wangjiang Zhu, and Jian Sun, “Geodesic saliency using background priors,” in Proc. ECCV, 2012, pp. 29–42.
 [8] Nianyi Li, Bilin Sun, and Jingyi Yu, “A weighted sparse coding framework for saliency detection,” in Proc. IEEE Conf. CVPR, 2015, pp. 5216–5223.
 [9] Yao Qin, Huchuan Lu, Yiqun Xu, and He Wang, “Saliency detection via cellular automata,” in Proc. IEEE Conf. CVPR, 2015, pp. 110–119.
 [10] Jinqing Qi, Shijing Dong, Fang Huang, and Huchuan Lu, “Saliency detection via joint modeling global shape and local consistency,” Neurocomputing, vol. 222, pp. 81–90, 2017.
 [11] Dingwen Zhang, Deyu Meng, Long Zhao, and Junwei Han, “Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning,” in Proc. of Int. Joint Conf. on Artif. Intel., 2016, pp. 825–841.
 [12] Junwei Han, Ling Shao, Nuno Vasconcelos, Jungong Han, and Dong Xu, “Guest editorial special section on visual saliency computing and learning,” IEEE Trans. Neural Networks and Learning Systems, vol. 27, pp. 1118–1121, 2016.

 [13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. NIPS, 2012, pp. 1097–1105.
 [14] Guanbin Li and Yizhou Yu, “Visual saliency based on multiscale deep features,” in Proc. IEEE Conf. CVPR, 2015, pp. 5455–5463.
 [15] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR, 2015, pp. 1–14.
 [16] Huaizu Jiang, Jingdong Wang, Zejian Yuan, Yang Wu, Nanning Zheng, and Shipeng Li, “Salient object detection: A discriminative regional feature integration approach,” in Proc. IEEE Conf. CVPR, 2013, pp. 2083–2090.
 [17] Linzhao Wang, Lijun Wang, Huchuan Lu, Pingping Zhang, and Xiang Ruan, “Saliency detection with recurrent fully convolutional networks,” in Proc. ECCV, 2016, pp. 825–841.
 [18] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang, “Saliency detection via graph-based manifold ranking,” in Proc. IEEE Conf. CVPR, 2013, pp. 3166–3173.
 [19] Xi Li, Yao Li, Chunhua Shen, A. Dick, and A. van den Hengel, “Contextual hypergraph modeling for salient object detection,” in Proc. IEEE Conf. ICCV, 2013, pp. 3328–3335.
 [20] Hansang Kim, Youngbae Kim, Jae-Young Sim, and Chang-Su Kim, “Spatiotemporal saliency detection for video sequences based on random walk with restart,” IEEE Trans. Image Process., vol. 24, no. 8, pp. 2552–2564, 2015.
 [21] Yun Zhai and Mubarak Shah, “Visual attention detection in video sequences using spatiotemporal cues,” in Proc. ACM Multimedia, 2006, pp. 815–824.
 [22] Yin Li, Yue Zhou, Junchi Yan, and Jie Yang, “Visual saliency based on conditional entropy,” in Proc. ACCV, 2009, pp. 246–257.
 [23] Chenliang Xu, Spencer Whitt, and Jason J. Corso, “Flattening supervoxel hierarchies by the uniform entropy slice,” in Proc. IEEE Conf. ICCV, 2013, pp. 2240–2247.
 [24] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Susstrunk, “Frequency-tuned salient region detection,” in Proc. IEEE Conf. CVPR, 2009, pp. 1597–1604.
 [25] Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. CVPR, 2015, pp. 3431–3440.
 [26] Kazuma Akamine, Ken Fukuchi, Akisato Kimura, and Shigeru Takagi, “Fully automatic extraction of salient objects from videos in near real time,” Comput. J., vol. 55, no. 1, pp. 3–14, 2012.
 [27] Robert J. Peters, Asha Iyer, Laurent Itti, and Christof Koch, “Components of bottom-up gaze allocation in natural images,” Vision Research, vol. 45, no. 18, pp. 2397–2416, 2005.
 [28] Chenliang Xu and Jason J. Corso, “Evaluation of supervoxel methods for early video processing,” in Proc. IEEE Conf. CVPR, 2012, pp. 1202–1209.
 [29] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM Multimedia, 2014, pp. 675–678.
 [30] Jonathan Harel, Christof Koch, and Pietro Perona, “Graph-based visual saliency,” in Proc. Adv. NIPS, 2006, pp. 545–552.
 [31] Xiaodi Hou and Liqing Zhang, “Saliency detection: A spectral residual approach,” in Proc. IEEE Conf. CVPR, 2007, pp. 1–8.
 [32] Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Philip H. S. Torr, and Shi-Min Hu, “Global contrast based salient region detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, 2015.