Graph-Theoretic Spatiotemporal Context Modeling for Video Saliency Detection

07/25/2017, by Lina Wei et al.

As an important and challenging problem in computer vision, video saliency detection is typically cast as a spatiotemporal context modeling problem over consecutive frames. As a result, a key issue in video saliency detection is how to effectively capture the intrinsic properties of atomic video structures as well as their associated contextual interactions along the spatial and temporal dimensions. Motivated by this observation, we propose a graph-theoretic video saliency detection approach based on adaptive video structure discovery, which is carried out within a spatiotemporal atomic graph. Through graph-based manifold propagation, the proposed approach is capable of effectively modeling the semantically contextual interactions among atomic video structures for saliency detection while preserving spatial smoothness and temporal consistency. Experiments on several benchmark datasets demonstrate the effectiveness of the proposed approach.


1 Introduction

Video saliency detection [1, 2, 3, 4] is an important problem in the field of video analysis and has a wide range of applications such as object tracking, video retrieval, abnormal event detection and so on. It aims to automatically discover the visually interesting regions of a video sequence. In this area, a challenging issue is how to extract the atomic video structures (defined as the basic operating units that account for spatial smoothness and temporal consistency), which reflect the intrinsic properties of the video, and how to model the interactions among atomic units along both the spatial and temporal dimensions in a unified framework. Therefore, the focus of video saliency detection is on effectively modeling the spatiotemporal relationships within the video.

In recent years, video saliency detection has typically been characterized by exploring spatial and temporal properties. On one hand, the spatial saliency properties of a video sequence can be represented in multiple ways. For example, some approaches measure saliency by local or global center-surround contrast using visual features such as color, intensity and orientation [5, 6, 7, 8, 9, 10, 11, 12]. With the development of convolutional neural networks, deep learning has been widely used in saliency detection tasks [13, 14, 15, 16, 17]. These data-driven saliency models aim to directly capture features that represent the semantic properties of salient regions by means of supervised learning from a collection of training data with saliency annotations. Furthermore, some approaches [18, 19] formulate the saliency detection problem as a labeling task on a graph, representing the coherence and consistency among salient regions and the difference between salient regions and the background. On the other hand, video sequence understanding also needs to take temporal properties into consideration [20]. Some approaches [21, 22] extend existing spatial saliency detection methods by adding the temporal dimension to obtain the final video saliency maps. In essence, the task of video saliency detection involves both spatial and temporal properties simultaneously. Therefore, how to effectively model these factors in a unified framework is the key to better exploring the intrinsic properties of atomic video structures and their associated contextual interactions.

Motivated by these observations, we propose a graph-theoretic video saliency detection approach based on adaptive video structure discovery, which is carried out within a spatiotemporal atomic graph. Considering spatial smoothness and temporal consistency, we discover atomic units to better model the intrinsic properties of a video sequence. In order to better represent the rich information carried by the atomic units (e.g., color, context and semantics), we combine low-level color features with highly abstract deep features. We then utilize graph-based manifold propagation to model the semantically contextual interactions among atomic video structures for saliency detection while preserving spatial smoothness and temporal consistency.

The main contributions of our paper are summarized as follows:

Figure 1: Illustration of the proposed approach for video saliency detection. First, an adaptive graph-based atomic structure is discovered via spatiotemporal segmentation. Second, a spatiotemporal graph is constructed over the atomic units. Third, graph-based manifold propagation is used to model the semantically contextual interactions among atomic video structures.

1) We propose a graph-based video saliency detection approach based on adaptive video structure discovery, which is carried out in a spatiotemporal atomic graph. As the representation units of a video, the atomic units are able to model both the spatial and temporal layouts of the video in a unified structure.

2) In the spatiotemporal context model, we utilize graph propagation for video saliency detection. Through graph-based manifold propagation, the proposed approach is capable of effectively modeling the semantically contextual interactions among atomic video structures for saliency detection while preserving spatial smoothness and temporal consistency. Therefore, our work effectively explores the intrinsic problem of how to connect spatiotemporal context modeling to video saliency, leading to promising saliency detection results.

2 Our Approach

2.1 Problem Formulation

In this work, we aim to build a framework that automatically solves the video saliency detection problem. We regard a $t$-frame video as a space-time lattice in which each 2D frame is stacked along the time axis, so that the video can be viewed as a 3D volume. Our goal is to discover the salient regions of the video, i.e., to decide for every spatiotemporal location whether the corresponding pixel is salient. Finally, the framework outputs the video saliency maps, defined as a set of per-frame saliency maps. The video saliency maps not only segment the visually interesting regions of each frame, but also capture the moving regions, preserving temporal consistency while responding to abrupt changes across consecutive frames.

A video is essentially a temporally ordered sequence that encodes both static and dynamic information about a scene. The pixels within a spatiotemporal local region possess the property of spatial smoothness. Meanwhile, the motion of its contents preserves temporal smoothness across consecutive frames.

Based on these observations, we discover the spatiotemporal atomic units of a video to exploit consistency in time and smoothness in space. The video then becomes a structural layout of atomic units that carry rich information about its appearance, motion trends, semantic interactions and so on. We exploit this information by representing the units with combined pixel-wise features.

To define the contextual interactions of the video, a compact 3D adjacency graph is constructed with nodes (the atomic units) and edges (the interactions among units) along the spatial and temporal dimensions. Then, we obtain the saliency distribution of the video by means of graph-based manifold propagation. The salient region of the video is thus defined as the subgraph activated by high saliency scores during the propagation procedure. Figure 1 shows the pipeline of our method.

2.2 Graph-based Modeling

2.2.1 Spatiotemporal Atomic Unit Discovery

The core idea of our framework is to exploit the spatial smoothness and temporal consistency of a video simultaneously. These properties are embodied in the interactions among pixels within a spatiotemporal atomic unit, so the discovery of atomic units reduces to a pixel clustering problem based on coherence in appearance and motion. This problem can be solved by spatiotemporal segmentation [23].

For a given video, our goal is to discover its atomic structure. In order to obtain atomic units of the most suitable granularity, the segmentation is conducted in a hierarchical way, resulting in multiple layers of individual segments, where each layer is a set of atomic units that partitions the video, i.e., the units jointly cover all pixels and are pairwise disjoint.

The spatiotemporal atomic units contain extensive information about the video such as color, context, semantics and so on. So we represent them with LAB color features [24] and FCN features [25] in a pixel-wise manner. As the atomic units are of arbitrary shapes and sizes, we aggregate the pixel-wise features by regional pooling.

Therefore, two feature maps are extracted from the video. For the $i$-th atomic unit, its feature vectors aggregated by regional pooling are denoted as $\mathbf{x}_i^{lab}$ and $\mathbf{x}_i^{fcn}$. The video can thus be transformed into the set of unit descriptors $\{(\mathbf{x}_i^{lab}, \mathbf{x}_i^{fcn})\}_{i=1}^{N}$, where $N$ is the total number of atomic units generated after discovery.
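To make the regional pooling step concrete, the sketch below (a minimal NumPy example; the array layout, function name and use of plain mean pooling are our own assumptions rather than the original implementation, which pools max within each frame and averages across frames, cf. Section 3.2) aggregates dense per-pixel feature maps into one descriptor per atomic unit.

```python
import numpy as np

def pool_unit_features(features, labels, num_units):
    """Aggregate pixel-wise features into one descriptor per atomic unit.

    features  : (T, H, W, D) array of per-pixel descriptors (e.g. LAB or FCN).
    labels    : (T, H, W) integer array assigning every pixel to one of the
                num_units spatiotemporal atomic units.
    num_units : total number of atomic units N.

    Returns an (N, D) matrix of pooled descriptors (simple mean pooling).
    """
    D = features.shape[-1]
    flat_feat = features.reshape(-1, D)
    flat_lab = labels.reshape(-1)
    counts = np.bincount(flat_lab, minlength=num_units).astype(float)
    pooled = np.zeros((num_units, D))
    for d in range(D):
        pooled[:, d] = np.bincount(flat_lab, weights=flat_feat[:, d],
                                   minlength=num_units)
    return pooled / np.maximum(counts, 1.0)[:, None]
```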

2.2.2 Graph-based Context Model

The spatial smoothness and temporal consistency of a video also lie in the contextual interactions among the atomic units.

To model these interactions, we construct a spatiotemporal adjacency graph $G = (V, E)$ over the atomic units, where $V = \{v_1, \dots, v_N\}$ is the set of atomic units and $E$ is a set of undirected edges connecting neighboring atomic units. Here, spatiotemporal adjacency is defined as the relationship between atomic units that neighbor each other along the spatial and temporal dimensions. Therefore, the atomic-unit-level graph with affinity matrix $W = (w_{ij})_{N \times N}$ is constructed as follows:

$$ w_{ij} = \begin{cases} \kappa(\mathbf{x}_i, \mathbf{x}_j), & (v_i, v_j) \in E, \\ 0, & \text{otherwise}, \end{cases} \qquad (1) $$

where $\kappa(\cdot,\cdot)$ stands for the RBF kernel evaluating the feature similarity (such that $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^{2} / \sigma^{2})$, with $\sigma$ being a scaling factor). The weights are computed from distances in the feature space; in our case, two different 3D graphs are constructed, one in the LAB feature space and one in the FCN feature space.
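For illustration, a minimal sketch of the graph construction is given below, assuming the segmentation is available as an integer label volume; the helper names and the way adjacencies are extracted are our own choices, and only the RBF weighting of Eq. (1) follows the text.

```python
import numpy as np

def spatiotemporal_edges(labels):
    """Undirected edges between atomic units whose pixels touch along
    the t, y, or x axis of the (T, H, W) label volume."""
    edges = set()
    for axis in range(labels.ndim):
        a = np.moveaxis(labels, axis, 0)
        changed = a[1:] != a[:-1]          # adjacent pixels with different units
        for u, v in zip(a[1:][changed].ravel(), a[:-1][changed].ravel()):
            edges.add((min(u, v), max(u, v)))
    return edges

def build_affinity(unit_feats, edges, sigma):
    """Affinity matrix of Eq. (1): RBF similarity on adjacent units, zero elsewhere."""
    n = unit_feats.shape[0]
    W = np.zeros((n, n))
    for i, j in edges:
        d2 = np.sum((unit_feats[i] - unit_feats[j]) ** 2)
        W[i, j] = W[j, i] = np.exp(-d2 / sigma ** 2)
    return W
```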

Input: Video sequence $V$
Output: Video saliency map $\mathbf{S}$
  1. Given the video $V$, discover its atomic structure;
  2. Represent the atomic units with pixel-wise features aggregated by regional pooling over the LAB and FCN feature spaces;
  3. Construct the spatiotemporal adjacency graph $G$ with affinity matrix $W$ according to Eq. (1);
  4. Compute the final saliency scores according to Eq. (3) and generate the two saliency maps $\mathbf{S}_{lab}$ and $\mathbf{S}_{fcn}$, respectively;
  5. Combine $\mathbf{S}_{lab}$ and $\mathbf{S}_{fcn}$ to generate the final saliency map $\mathbf{S}$ according to Eq. (4).
  Return: Video saliency map $\mathbf{S}$
Algorithm 1: Video saliency detection

2.3 Saliency Propagation

We exploit the long-range contextual correlations encoded in the graph model, which define the saliency distribution of the video, by means of graph-based propagation. The salient region of the video is thus a subgraph of its 3D graph model activated by saliency propagation.

In this framework, propagation is conducted in the manner of manifold ranking [18]. Let $\mathbf{f} = [f_1, \dots, f_N]^{\top}$ denote a ranking function, where $f_i$ is the saliency score of atomic unit $v_i$. Let $\mathbf{y} = [y_1, \dots, y_N]^{\top}$ be an initial saliency vector obtained through coarse-grained foreground segmentation using standard computer vision techniques (e.g., background subtraction or object detection). The graph is associated with the edge-weighted affinity matrix $W$ and its degree matrix $D = \mathrm{diag}(d_{11}, \dots, d_{NN})$, where $d_{ii} = \sum_{j} w_{ij}$. Hence, our task is to learn the optimal ranking within the following optimization framework:

$$ \mathbf{f}^{*} = \arg\min_{\mathbf{f}} \; \frac{1}{2}\Big( \sum_{i,j=1}^{N} w_{ij} \Big\| \frac{f_i}{\sqrt{d_{ii}}} - \frac{f_j}{\sqrt{d_{jj}}} \Big\|^{2} + \mu \sum_{i=1}^{N} \| f_i - y_i \|^{2} \Big), \qquad (2) $$

where $\mu$ is the smoothing factor that balances the smoothness and fitting constraints. The closed-form solution of Eq. (2) is:

$$ \mathbf{f}^{*} = (\mathbf{I} - \alpha \mathbf{S})^{-1} \mathbf{y}, \qquad (3) $$

where $\mathbf{I}$ is an identity matrix, $\mathbf{S} = D^{-1/2} W D^{-1/2}$ is the normalized Laplacian matrix of the affinity matrix $W$, and $\alpha = 1/(1 + \mu)$.
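The closed-form solve of Eq. (3) takes only a few lines of linear algebra; the sketch below follows the standard manifold-ranking formulation of [18], with the default value of $\alpha$ and the final rescaling to [0, 1] being our own choices rather than settings taken from the paper.

```python
import numpy as np

def manifold_ranking(W, y, alpha=0.99):
    """Closed-form graph propagation of Eq. (3).

    W     : (N, N) symmetric affinity matrix from Eq. (1).
    y     : (N,)   initial saliency scores of the atomic units.
    alpha : 1 / (1 + mu); the default is a common choice, not the paper's value.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]    # D^-1/2 W D^-1/2
    f = np.linalg.solve(np.eye(W.shape[0]) - alpha * S, y)
    return (f - f.min()) / (f.max() - f.min() + 1e-12)     # rescale to [0, 1]
```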

For the LAB and FCN feature spaces, we propagate saliency separately, obtaining the maps $\mathbf{S}_{lab}$ and $\mathbf{S}_{fcn}$, respectively. The final saliency map of the video is then computed as:

$$ \mathbf{S} = \lambda \, (\mathbf{S}_{lab} \odot \mathbf{S}_{fcn}) + (1 - \lambda) \, (\mathbf{S}_{lab} + \mathbf{S}_{fcn}), \qquad (4) $$

where $\lambda$ is a trade-off control factor and $\odot$ is the element-wise multiplication operator. The whole video saliency detection algorithm is summarized in Algorithm 1.
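Putting the pieces together, Algorithm 1 can be sketched end-to-end by reusing the helper functions above. The fusion step is the part we are least certain about: the combination below is only one plausible reading of Eq. (4) (a trade-off weight together with an element-wise product), not a verified reproduction of the original formula, and all default parameter values are placeholders.

```python
def video_saliency(lab_feats, fcn_feats, labels, y0,
                   sigma_lab=0.1, sigma_fcn=0.1, alpha=0.99, lam=0.5):
    """End-to-end sketch of Algorithm 1, built on pool_unit_features,
    spatiotemporal_edges, build_affinity and manifold_ranking above.
    Default parameter values are placeholders, not the paper's settings."""
    n_units = int(labels.max()) + 1
    edges = spatiotemporal_edges(labels)

    # Steps 1-4: pool features, build the two graphs, propagate saliency.
    f_lab = manifold_ranking(
        build_affinity(pool_unit_features(lab_feats, labels, n_units),
                       edges, sigma_lab), y0, alpha)
    f_fcn = manifold_ranking(
        build_affinity(pool_unit_features(fcn_feats, labels, n_units),
                       edges, sigma_fcn), y0, alpha)

    # Step 5: fuse the unit-level score vectors (our reading of Eq. (4)).
    f = lam * (f_lab * f_fcn) + (1.0 - lam) * (f_lab + f_fcn)
    return f[labels]    # broadcast unit scores back to a per-pixel map
```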

3 Experiments

3.1 Datasets and Evaluation Measures

In this section, we evaluate the performance of the proposed algorithm on 10 test video sequences captured by static cameras from the MPEG, NTT and MCL datasets [26, 20]. These three datasets (described in detail in the supplementary material) include outdoor and indoor scenes containing several challenging conditions such as low resolution, illumination changes, abrupt changes, occlusion, multiple objects, high scene complexity and so on.

We adopt the AUC score, which calculates the area under the receiver operating characteristic (ROC) curve and thus characterizes the relationship between the true positive rate (TPR) and the false positive rate (FPR). Another evaluation measure is the normalized scan-path saliency (NSS) score [27].
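For reference, both measures can be computed per frame as sketched below, assuming a binary per-pixel ground-truth mask; the exact evaluation protocol (fixation points versus object masks, per-frame versus per-sequence averaging) is not specified here, so this is only an illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_score(sal_map, gt_mask):
    """Pixel-wise area under the ROC curve (TPR vs. FPR)."""
    return roc_auc_score(gt_mask.ravel().astype(int), sal_map.ravel())

def nss_score(sal_map, gt_mask):
    """Normalized scan-path saliency [27]: mean of the z-scored
    saliency map over the ground-truth salient locations."""
    z = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-12)
    return float(z[gt_mask.astype(bool)].mean())
```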

3.2 Implementation Details

During the atomic unit discovery procedure, the segmentation is carried out by the GBH method [28, 23] with the streaming level set to 10. We represent every pixel with a 3-dimensional LAB color feature vector and an FCN feature vector. The FCN features are extracted with the FCN-32s network, implemented on the basis of the Caffe toolbox [29, 14]. For the feature representation, we conduct intra-frame max-pooling and inter-frame average-pooling within atomic units. In saliency propagation, there are three parameters in the proposed algorithm: the scaling factor $\sigma$, which controls the bandwidth of the RBF kernel; the smoothing factor $\mu$ in Eq. (2); and the fusion weight $\lambda$ in Eq. (4). All of these parameters are fixed throughout all the experiments.

Figure 2: The quantitative ROC results of all the competing approaches, averaged over all datasets. Clearly, our approach achieves the best performance among the competing approaches.

3.3 Analysis of Proposed Approach

Figure 3: Illustration of graph propagation. (a) Input frame; (b) visualization of the initial saliency values; (c) the resulting saliency map.

We quantitatively and qualitatively compare the proposed approach with several competing methods, including GBVS [30], SRA [31], SSR [2], RWR [20], FSR [24], HBC [32], RBC [32] and GMR [18]; more qualitative and quantitative results can be found in the supplementary material. Figure 2 shows the ROC curves of all the competing approaches averaged over all test video sequences. From Figure 2, we observe that the proposed approach achieves better performance than the other methods. Table 1 reports the quantitative saliency detection performance in terms of AUC; it is clearly seen that our approach outperforms the competing methods. Figure 3 shows an example of the graph propagation results. From Figure 3 (b) and (c), we observe that our approach is able to effectively capture the whole object structure by graph-based context modeling.

Dataset Ours FSR GBVS GMR RWR HBC RBC SRA SSR
Hall1 0.8368 0.8136 0.8175 0.8143 0.8295 0.8154 0.8194 0.7803 0.8105
Bird1 0.7981 0.7504 0.7446 0.7498 0.7686 0.7413 0.7529 0.7291 0.7501
Bird2 0.7741 0.7459 0.7302 0.7389 0.7691 0.7409 0.7518 0.7320 0.7530
Horse 0.7243 0.6801 0.6821 0.6917 0.7069 0.6832 0.6957 0.6810 0.6921
Car 0.7191 0.7023 0.7003 0.7008 0.7114 0.7001 0.7117 0.6923 0.7023
Campus 0.7238 0.7059 0.7055 0.7059 0.7183 0.7007 0.7058 0.6961 0.7061
Crowd 0.7713 0.7485 0.7504 0.7488 0.7673 0.7401 0.7499 0.7380 0.7498
Road 0.7234 0.7065 0.7062 0.7069 0.7176 0.7039 0.7071 0.7009 0.7097
Square 0.7217 0.7143 0.7148 0.7165 0.7157 0.7139 0.7154 0.7041 0.7035
Stair 0.7208 0.7104 0.7128 0.7106 0.7154 0.7102 0.7121 0.7017 0.7033
Table 1: Comparison of AUC scores between the proposed algorithm and the competing methods. Our approach achieves the best performance on this metric.

4 Conclusion

In this paper, we have proposed a graph-theoretic video saliency detection approach based on adaptive video structure discovery, which is carried out within a spatiotemporal atomic graph. We adopt atomic units to better capture the intrinsic properties of the video. Furthermore, the proposed approach utilizes graph-based manifold propagation to model the semantically contextual interactions among atomic video structures and generate video saliency maps. We evaluate the proposed approach on several video sequences and achieve promising results in comparison with state-of-the-art methods. The experimental results demonstrate that the proposed approach performs well under different evaluation metrics on complex video sequences.

References

  • [1] Lingyun Zhang, Matthew H Tong, and Garrison W Cottrell, “Sunday: Saliency using natural statistics for dynamic analysis of scenes,” in Proc. 31st Annu. Cognit. Sci. Conf., 2009, pp. 2944–2949.
  • [2] Yawen Xue, Xiaojie Guo, and Xiaochun Cao, “Motion saliency detection using low-rank and sparse decomposition,” in Proc. IEEE ICASSP, 2012, pp. 1485–1488.
  • [3] Xiao Xian, Changsheng Xu, and Yong Rui, “Video based 3d reconstruction using spatio-temporal attention analysis,” in Proc. IEEE ICME, 2010, pp. 1091–1096.
  • [4] Li Jia, Changqun Xia, and Xiaowu Chen, “A benchmark dataset and saliency-guided stacked autoencoder for video-based salient object detection,” in arXiv preprint arXiv:1611.00135, 2016.
  • [5] Laurent Itti, Christof Koch, and Ernst Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, 1998.
  • [6] Dominik A Klein and Simone Frintrop, “Center-surround divergence of feature statistics for salient object detection,” in Proc. IEEE Conf. ICCV, 2011, pp. 2214–2219.
  • [7] Yichen Wei, Fang Wen, Wangjiang Zhu, and Jian Sun, “Geodesic saliency using background priors,” in Proc. ECCV, 2012, pp. 29–42.
  • [8] Nianyi Li, Bilin Sun, and Jingyi Yu, “A weighted sparse coding framework for saliency detection,” in Proc. IEEE Conf. CVPR, 2015, pp. 5216–5223.
  • [9] Yao Qin, Huchuan Lu, Yiqun Xu, and He Wang, “Saliency detection via cellular automata,” in Proc. IEEE Conf. CVPR, 2015, pp. 110–119.
  • [10] Jinqing Qi, Shijing Dong, Fang Huang, and Huchuan Lu, “Saliency detection via joint modeling global shape and local consistency,” Neurocomputing, vol. 222, pp. 81–90, 2017.
  • [11] Dingwen Zhang, Deyu Meng, Long Zhao, and Junwei Han, “Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning,” in Proc. of Int. Joint Conf. on Artif. Intel., 2016, pp. 825–841.
  • [12] Junwei Han, Ling Shao, Nuno Vasconcelos, Jungong Han, and Dong Xu, “Guest editorial special section on visual saliency computing and learning,” IEEE Trans. Neural Networks and Learning Systems., vol. 27, pp. 1118–1121, 2016.
  • [13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. NIPS, 2012, pp. 1097–1105.
  • [14] Guanbin Li and Yizhou Yu, “Visual saliency based on multiscale deep features,” in Proc. IEEE Conf. CVPR, 2015, pp. 5455–5463.
  • [15] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR, 2015, pp. 1–14.
  • [16] Huaizu Jiang, Jingdong Wang, Zejian Yuan, Yang Wu, Nanning Zheng, and Shipeng Li, “Salient object detection: A discriminative regional feature integration approach,” in Proc. IEEE Conf. CVPR, 2013, pp. 2083–2090.
  • [17] Linzhao Wang, Lijun Wang, Huchuan Lu, Pingping Zhang, and Xiang Ruan, “Saliency detection with recurrent fully convolutional networks,” in Proc. ECCV, 2016, pp. 825–841.
  • [18] Chuan Yang, Lihe Zhang, Huchuan Lu, Xiang Ruan, and Ming-Hsuan Yang, “Saliency detection via graph-based manifold ranking,” in Proc. IEEE Conf. CVPR, 2013, pp. 3166–3173.
  • [19] Xi Li, Yao Li, Chunhua Shen, A. Dick, and A. van den Hengel, “Contextual hypergraph modeling for salient object detection,” in Proc. IEEE Conf. ICCV, 2013, pp. 3328–3335.
  • [20] Hansang Kim, Youngbae Kim, Jae-Young Sim, and Chang-Su Kim, “Spatiotemporal saliency detection for video sequences based on random walk with restart,” IEEE Trans. Image Process., vol. 24, no. 8, pp. 2552–2564, 2015.
  • [21] Yun Zhai and Mubarak Shah, “Visual attention detection in video sequences using spatiotemporal cues,” in Proc. ACM Multimedia, 2006, pp. 815–824.
  • [22] Yin Li, Yue Zhou, Junchi Yan, and Jie Yang, “Visual saliency based on conditional entropy,” in Proc. ACCV, 2009, pp. 246–257.
  • [23] Chenliang Xu, Spencer Whitt, and Jason J. Corso, “Flattening supervoxel hierarchies by the uniform entropy slice,” in Proc. IEEE Conf. ICCV, 2013, pp. 2240–2247.
  • [24] Radhakrishna Achanta, Sheila Hemami, Francisco Estrada, and Sabine Süsstrunk, “Frequency-tuned salient region detection,” in Proc. IEEE Conf. CVPR, 2009, pp. 1597–1604.
  • [25] Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. CVPR, 2015, pp. 3431–3440.
  • [26] Kazuma Akamine, Ken Fukuchi, Akisato Kimura, and Shigeru Takagi, “Fully automatic extraction of salient objects from videos in near real time,” Comput. J., vol. 55, no. 1, pp. 3–14, 2012.
  • [27] Robert J. Peters, Asha Iyer, Laurent Itti, and Christof Koch, “Components of bottom-up gaze allocation in natural images,” Vision research, vol. 45, no. 18, pp. 2397–2416, 2005.
  • [28] Chenliang Xu and Jason J. Corso, “Evaluation of super-voxel methods for early video processing,” in Proc. IEEE Conf. CVPR, 2012, pp. 1202–1209.
  • [29] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proc. ACM Multimedia, 2014, pp. 675–678.
  • [30] Jonathan Harel, Christof Koch, and Pietro Perona, “Graph-based visual saliency,” in Proc. Adv. NIPS, 2006, pp. 545–552.
  • [31] Xiaodi Hou and Liqing Zhang, “Saliency detection: A spectral residual approach,” in Proc. IEEE Conf. CVPR, 2007, pp. 1–8.
  • [32] Ming-Ming Cheng, Niloy J. Mitra, Xiaolei Huang, Philip H. S. Torr, and Shi-Min Hu, “Global contrast based salient region detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, 2015.