A Benchmark Dataset and Saliency-guided Stacked Autoencoders for Video-based Salient Object Detection

11/01/2016 ∙ by Jia Li, et al. ∙ 0

Image-based salient object detection (SOD) has been extensively studied in the past decades. However, video-based SOD is much less explored because of the lack of large-scale video datasets within which salient objects are unambiguously defined and annotated. Toward this end, this paper proposes a video-based SOD dataset that consists of 200 videos (64 minutes). In constructing the dataset, we manually annotate all objects and regions over 7,650 uniformly sampled keyframes and collect the eye-tracking data of 23 subjects who free-view all videos. From the user data, we find that salient objects in video can be defined as objects that consistently pop out throughout the video, and objects with such attributes can be unambiguously annotated by combining manually annotated object/region masks with eye-tracking data of multiple subjects. To the best of our knowledge, it is currently the largest dataset for video-based salient object detection. Based on this dataset, this paper proposes an unsupervised baseline approach for video-based SOD by using saliency-guided stacked autoencoders. In the proposed approach, multiple spatiotemporal saliency cues are first extracted at pixel, superpixel and object levels. With these saliency cues, stacked autoencoders are constructed in an unsupervised manner to automatically infer a saliency score for each pixel by progressively encoding the high-dimensional saliency cues gathered from the pixel and its spatiotemporal neighbors. Experimental results show that the proposed unsupervised approach outperforms 30 state-of-the-art models on the proposed dataset, including 19 image-based & classic (unsupervised or non-deep learning), 6 image-based & deep learning, and 5 video-based & unsupervised. Moreover, benchmarking results show that the proposed dataset is very challenging and has the potential to boost the development of video-based SOD.




I Introduction

The booming of image-based salient object detection (SOD) originates from the presence of large-scale benchmark datasets [1, 2]. With these datasets, it becomes feasible to construct complex models with machine learning algorithms (e.g., random forest regressor [3], bootstrap learning [4], multi-instance learning [5] and deep learning [6]). Moreover, the presence of such datasets enables fair comparisons between state-of-the-art models [7, 8]. Actually, large-scale datasets provide a solid foundation for SOD and consistently guide the development of this area.

In the past decade, SOD datasets have kept evolving to meet the increasing demands in developing and benchmarking new models. Some researchers argue that images in early datasets like ASD [2] and MSRA-B [1] are relatively small and simple. They extend such datasets in terms of size [9, 10] or complexity [11, 12, 13]. Meanwhile, the concept of SOD has been extended to RGBD images [14], image collections [15, 16, 17] and videos [18, 19, 20, 21]. Among these extensions, video-based SOD has attracted great research interest since it re-defines the problem from a spatiotemporal perspective. However, there is still a lack of large-scale video datasets for comprehensive model comparison, which prevents the fast growth of this branch. For example, the widely used SegTrack dataset [22] consists of only 6 videos with 21 to 71 frames per video, while a recent dataset ViSal [21] contains only 17 videos with 30 to 100 frames per video. In addition, the definition of salient objects in video is still not very clear (e.g., manually annotated foreground objects [23], class-specific objects [21] or moving objects [24]). It is necessary to construct a large video dataset with unambiguously defined salient objects.

Fig. 1: Representative scenarios in VOS. The 200 videos in VOS are grouped into two subsets according to the complexity of foreground, background and motion, including VOS-E (easy subset, 97 videos) and VOS-N (normal subset, 103 videos).

To address this issue, this paper proposes VOS, a large-scale dataset with 200 indoor/outdoor videos for video-based SOD (64 minutes, 116,103 frames, see Fig. 1 for representative scenarios). In constructing VOS, we first collect two types of user data, including 1) the eye-tracking data of 23 subjects who free-view all the 200 videos, and 2) the masks of all objects and regions in uniformly sampled keyframes annotated by another 4 subjects. Based on these user data, salient objects in a video can be unambiguously annotated as the objects that consistently receive the highest density of fixations throughout the video. After discarding the pure-background keyframes as well as the keyframes in which salient objects are partially occluded or split into several disjoint parts, we obtain 7,467 keyframes with binary masks of salient objects.

Based on the large-scale dataset, we propose an unsupervised baseline model for video-based SOD by constructing saliency-guided stacked autoencoders. Different from the fixation prediction task that aims to roughly detect where humans look and the image-based SOD task that aims to segment only the most spatially salient objects, video-based SOD focuses on detecting and segmenting the objects that consistently pop out throughout a video from a spatiotemporal perspective. Inspired by this fact, the proposed approach first extracts multiple spatiotemporal saliency cues at pixel, superpixel and object levels. Stacked autoencoders are then trained in an unsupervised manner to automatically infer a saliency score for each pixel by progressively encoding the high-dimensional saliency cues gathered from the pixel and its spatiotemporal neighbors. In the comprehensive model benchmarking on VOS, the proposed approach outperforms 30 image-based and video-based models. Moreover, the benchmarking results validate that VOS is a challenging dataset that has the potential to greatly boost the development of this area.

Our main contributions are summarized as follows: 1) we propose a large and challenging dataset for video-based SOD, which we believe can be useful for the development of this area, 2) we propose saliency-guided stacked autoencoders for video-based SOD, which is an unsupervised baseline model that outperforms 30 image-based and video-based models, and 3) we provide a comprehensive benchmark of our approach and a large number of state-of-the-art models, which reveals several key challenges in video-based SOD and further validates the usefulness of the proposed dataset.

The rest of this paper is organized as follows: Section II reviews existing datasets and models. Section III presents a new dataset. In Section IV, we propose saliency-guided stacked autoencoders for video-based SOD. Section V benchmarks the proposed model and the state-of-the-art models, and the paper is concluded in Section VI.

II Related Work

Video-based SOD is correlated with image-based SOD, foreground/primary object detection and moving object segmentation. In this section, we will review the most related datasets and models from all these areas.

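| Type | Dataset | #Vid. | #Orig. Frames (Total) | #Orig. Frames (Avg.) | #Labeled Frames (Total) | #Labeled Frames (Avg.) | #Avg. Obj. Per Frame |
|---|---|---|---|---|---|---|---|
| Image-based | ASD [2] | - | 1000 | - | 1000 | - | 1.16±0.87 |
| Image-based | ECSSD [11] | - | 1000 | - | 1000 | - | 1.16±0.56 |
| Image-based | DUT-O [10] | - | 5168 | - | 5168 | - | 1.20±0.69 |
| Image-based | PASCAL-S [13] | - | 850 | - | 850 | - | 1.52±1.11 |
| Image-based | MSRA10K [9] | - | 10000 | - | 10000 | - | 1.05±0.46 |
| Image-based | HKU-IS [25] | - | 4447 | - | 4447 | - | 1.60±0.82 |
| Image-based | XPIE [26] | - | 10000 | - | 10000 | - | 1.16±0.46 |
| Video-based | SegTrack [22] | 6 | 244 | 41±18 | 244 | 41±18 | 1.00±0.00 |
| Video-based | SegTrack V2 [27] | 14 | 1065 | 76±82 | 1065 | 76±82 | 1.38±1.01 |
| Video-based | FBMS [28] | 59 | 13860 | 235±193 | 720 | 12±8 | 1.78±1.54 |
| Video-based | DAVIS [23] | 50 | 3455 | 69±19 | 3455 | 69±19 | 5.39±22.87 |
| Video-based | ViSal [21] | 17 | 963 | 57±20 | 193 | 11±4 | 1.16±0.40 |
| Video-based | VOS-E | 97 | 49206 | 507±130 | 3236 | 33±9 | 1.02±0.18 |
| Video-based | VOS-N | 103 | 66897 | 649±510 | 4231 | 41±33 | 1.25±0.54 |
| Video-based | VOS | 200 | 116103 | 581±383 | 7467 | 37±25 | 1.15±0.44 |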
  • Objects are counted as disconnected foreground regions. In DAVIS, a semantic object may be divided into hundreds of disconnected parts (e.g., a bus occluded by a tree), leading to an extremely high mean and standard deviation in the number of foreground “objects” per frame.

TABLE I: Comparison between VOS (subsets: VOS-E and VOS-N) with representative image/video object segmentation datasets

II-A Datasets

SegTrack [22] is a popular dataset for video object segmentation. It contains 6 videos of animals and humans with 244 frames in total, and the videos are intentionally collected for benchmarking models with predefined challenges. Only one foreground object is manually annotated per frame.

SegTrack V2 [27] extends SegTrack from two perspectives. First, additional annotations of foreground objects are provided for the six videos in SegTrack. Second, 8 new videos are carefully chosen to cover more challenges. In total, SegTrack V2 contains 14 videos of birds, animals, cars and humans with densely annotated frames.

Freiburg-Berkeley Motion Segmentation (FBMS) is designed for motion segmentation (i.e., segmenting regions with similar motion). It was first proposed in [24] with 26 videos, and then Ochs et al. [28] extended the dataset with another 33 videos. In total, this dataset contains 59 videos with 720 sparsely annotated frames. Although the dataset is much larger than SegTrack and SegTrack V2, the scenarios it covers are still far from sufficient [23]. Moreover, moving objects are not equivalent to salient objects, especially in a scene with complex content.

DAVIS [23] contains 50 high-quality videos of humans, animals, vehicles, objects and actions with densely annotated frames. Each video has Full HD 1080p resolution and lasts about 2 to 4 seconds. Each video clip in this dataset contains one foreground object or two spatially connected objects. Note that such objects may split into hundreds of small regions due to occlusion.

ViSal is a pioneering video-based SOD dataset proposed in [21]. It contains 17 videos of humans, animals, motorbikes, etc. Each video contains 30 to 100 frames, in which salient objects are manually annotated according to the semantic classes of videos. In other words, this dataset assumes that salient objects are equivalent to the primary objects within videos annotated by semantic tags.

To facilitate the comparison between these datasets and our VOS dataset, we show more dataset statistics in Table I. Moreover, we also list the details of 7 representative image-based SOD datasets so as to provide an intuitive comparison between image- and video-based SOD. Generally speaking, the datasets reviewed above have greatly boosted research in video object segmentation but still have several drawbacks.

First, these datasets are still a little small for modern learning algorithms like Convolutional Neural Networks (CNN). As shown in Table I, the numbers of annotated frames in most previous video datasets are much smaller than the image-based SOD datasets and VOS. Although thousands of frames in SegTrack V2 and DAVIS are densely annotated, the rich redundancy in consecutive frames may increase the over-fitting risk in model training.

Second, videos in some datasets are selected to maximally cover predefined challenges in video object segmentation (e.g., SegTrack and SegTrack V2). However, such intentionally selected videos may make the dataset not very “realistic” (i.e., different from the videos in real-world scenarios). Moreover, such datasets may favor models that are particularly designed to “over-fit” the limited scenarios. In contrast, our VOS dataset is much larger so that the over-fitting risk can be largely alleviated.

Third, foreground objects in previous datasets are often manually annotated by only one or several annotators, which may introduce strong subjective bias into these datasets. For example, in a video with both a dog and a monkey, only the monkey is annotated in SegTrack, while SegTrack V2 has the dog annotated as well. Actually, manual annotations from different subjects often conflict with each other [29] and cause ambiguity. To alleviate such ambiguity, previous works like [30, 31, 32] have tried to locate salient targets by averaging rectangles manually annotated by 23 subjects [30] or collecting human fixations via eye-tracking apparatus [31, 32]. However, these datasets cannot be directly used in video-based SOD because they lack pixel-wise annotations of salient objects. Actually, pixel-wise annotation is the most time-consuming procedure in constructing video-based SOD datasets like VOS.

To sum up, existing datasets are still insufficient to benchmark video-based SOD models due to the limited number of videos as well as the ambiguous definition and annotation processes of salient/foreground/moving objects. To further boost the development of this area, it is necessary to construct a large-scale dataset that covers a wide variety of real-world scenarios and contains salient objects that are unambiguously defined and annotated.

II-B Models

Hundreds of bottom-up and learning-based models [33, 34, 3, 35, 36, 37] have been proposed for image-based SOD in the past decade. With the booming of deep learning methodology and the presence of large-scale datasets [25, 38, 39], many deep models [40, 41, 42, 43] have been proposed for image-based SOD. For example, Han et al. [44] proposed multi-stream stacked denoising autoencoders that can detect salient regions by measuring the reconstruction residuals that reflect the distinctness between background and salient regions. He et al. [45] adopted CNNs to characterize superpixels with hierarchical features so as to detect salient objects at multiple scales, while the superpixel-based saliency computation was used by [46, 25] as well. Considering that the task of fixation prediction is tightly correlated with SOD, a unified deep network was proposed in [47] for simultaneous fixation prediction and image-based SOD.

The state-of-the-art deep SOD models often adopt recurrent frameworks that can achieve impressive performance. For example, Liu et al. [48] adopted hierarchical recurrent CNNs to progressively refine the details of salient objects. In [49], a coarse saliency map was first generated by using convolution-deconvolution networks. After that, it was refined by iteratively enhancing the results in various sub-regions. Wang et al. [50] iteratively delivered the intermediate predictions back to the recurrent CNNs to refine saliency maps. In this way, salient objects can gradually pop out, while distractors can be progressively suppressed.

Compared with image-based SOD, video-based SOD is less explored due to the lack of large video datasets. For example, Liu et al. [51] extended their image-based SOD model [1] to the spatiotemporal domain for salient object sequence detection. In [52], visual attention (i.e., the estimated fixation density) was used as prior knowledge to guide the segmentation of salient regions in video. Rahtu et al. [18] proposed to integrate local contrast features in illumination, color and motion channels with a statistical framework. A conditional random field was then adopted to recover salient objects from images and video frames. Due to the lack of large-scale benchmarking datasets, most of these early approaches only provide qualitative comparisons, and only a few works like [51] have provided quantitative comparisons on a small dataset within which salient objects are roughly annotated with rectangles.

To conduct quantitative comparisons in single video-based SOD, Bin et al. [53] manually annotated the salient objects in 10 videos with about 100 frames per video. They also proposed an approach to detect temporally coherent salient objects using regional dynamic contrast features in the spatiotemporal domain of color, texture and motion. Their approach demonstrated impressive performance in processing videos with only one salient object. In [54], Papazoglou and Ferrari proposed an approach for the fast segmentation of foreground objects from background regions. They first estimated an initial foreground map with respect to the motion information, which was then refined by building the foreground/background appearance models and encouraging the spatiotemporal smoothness of foreground objects over the whole video. The main assumption required by their approach was that foreground objects should move differently from their surrounding background in a good fraction of the video. Wang et al. [55] proposed an unsupervised approach for video-based SOD. In their approach, frame-wise saliency maps were first generated and refined with respect to the geodesic distances between regions in the current frame and subsequent frames. After that, global appearance models and dynamic location models were constructed so that the spatially and temporally coherent salient objects could be segmented by using an energy minimization framework. In their later work [21], Wang et al. proposed to utilize the inter-frame and intra-frame information in a gradient flow field. By extracting the local and global saliency measures, an energy function was then adopted to enhance the spatiotemporal consistency of the output saliency maps.

Regardless of their performance and benchmarking methodologies, these single video-based approaches have provided us with an intuitive definition of salient objects. That is, salient objects in a video should be spatiotemporally consistent and visually distinct from background regions. However, in real-world scenarios, assumptions like color/texture dissimilarity and motion irregularity may not always hold. A more general definition of salient objects in video is required to guide the annotation and detection processes.

Beyond single video-based approaches, some approaches extend the idea of image co-segmentation to the video domain. For example, Chiu and Fritz [56] proposed a generative model for multi-class video co-segmentation. A global appearance model was learned to connect the segments from the same class so as to segment the foreground targets shared by different videos. Fu et al. [20] proposed to detect multiple foreground objects shared by a set of videos. Category-independent object proposals were first extracted and a multi-state selection graph was then adopted to handle multiple foreground objects. Although video co-segmentation brings us an interesting new direction for studying video-based SOD, detecting salient objects in a single video is still the most common requirement in many real-world applications.

III A Large-scale Dataset for Video-based SOD

A good benchmark dataset should cover many real-world scenarios and the annotation process should contain little subjective bias. In this section, we will introduce the details in constructing the dataset and discuss how salient objects can be unambiguously defined and annotated in videos.

III-A Video Collection

We first collect hundreds of long videos from the Internet (e.g., video-sharing websites like YouTube) and volunteers. Note that no instruction is given on what types of videos are required since we aim to collect more “realistic” daily videos. After that, we randomly sample short clips from long videos and keep only the clips that contain objects in most frames. Finally, we obtain 200 indoor/outdoor videos that last 64 minutes in total (116,103 frames at 30fps). These videos are grouped into two subsets according to the content complexity, including:

VOS-E. This subset contains 97 easy videos (27 minutes, 49,206 frames, 83 to 962 frames per video). As shown in Fig. 1, a video in this subset usually contains obvious foreground objects with slow camera motion. This subset serves as a baseline to explore the inherent correlations between image- and video-based SOD.

VOS-N. This subset contains 103 normal videos (37 minutes, 66,897 frames, 710 to frames per video). As shown in Fig. 1, videos in this subset contain complex or highly dynamic foreground objects, dynamic or cluttered background regions, etc. This subset is very challenging and can be used to benchmark models in realistic scenarios.

III-B User Data Collection

The manual annotation of salient objects often generates ambiguity and strong subjective bias in complex scenes. Inspired by the solution used in [13], we collect two types of user data, including object masks and human fixations, to alleviate the ambiguity in defining and annotating salient objects in videos.

Object masks. Four subjects (2 males and 2 females, aged between 24 and 34) manually annotate the accurate boundaries of all objects and regions in video frames. Since it consumes too much time to annotate all frames, we uniformly sample only one keyframe out of every 15 frames and manually annotate the keyframes. In the annotation, an object maintains the same label throughout a video, and the holes in objects are filled to speed up the annotation. Since moving objects may merge or split several times in a short period and it is difficult to consistently assign different labels to them (e.g., the fighting bears and cats in the third row of Fig. 2), we assign the same label to objects if they become indistinguishable in certain frames (e.g., the bears and cats in Fig. 2) or are difficult to re-identify (e.g., the jellyfish in Fig. 2 frequently appear and disappear near screen borders). Finally, regions smaller than 16 pixels are ignored and we obtain the accurate boundaries of objects and regions.
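The keyframe sampling and small-region filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `keyframe_indices` and `drop_small_regions` are hypothetical, the first frame of each group of 15 is taken as the keyframe, and annotations are assumed to be stored as an integer label map with 0 as background, where each label is treated as one region.

```python
import numpy as np

def keyframe_indices(num_frames, step=15, offset=0):
    """Indices of uniformly sampled keyframes (one out of every `step` frames).

    Which frame inside each group is picked is an assumption (`offset`).
    """
    return list(range(offset, num_frames, step))

def drop_small_regions(label_map, min_pixels=16, background=0):
    """Reset labels covering fewer than `min_pixels` pixels to background.

    Assumes an integer label map in which every label is one region.
    """
    labels, counts = np.unique(label_map, return_counts=True)
    small = labels[(counts < min_pixels) & (labels != background)]
    cleaned = label_map.copy()
    cleaned[np.isin(cleaned, small)] = background
    return cleaned
```

For instance, a 45-frame clip would yield keyframes 0, 15 and 30 under these assumptions.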

Human fixations. Twenty-three subjects (16 males and 7 females, aged between 21 and 29) participate in the eye-tracking experiments. Note that none of them participates in annotating the object/region masks. Each subject is asked to free-view all the 200 videos displayed on a 22-inch color monitor with a resolution of . A chin rest is adopted to reduce head movements and enforce a viewing distance of cm. Considering that the non-stop watching of 200 videos (64 minutes) would be very tiring, we randomly divide videos into subgroups and adopt an interlaced schedule for different subjects who free-view the same subgroup of videos. In this manner, each subject gets sufficient time to rest after watching a small collection of videos, making the eye-tracking data more reliable. During the free-viewing process, an eye-tracking apparatus with a sample rate of 500 Hz (SMI RED 500) is used to record various types of eye movements. Finally, we keep only the fixations and denote the set of eye positions on a video $\mathcal{V}$ as $\mathcal{F}_{\mathcal{V}}$, in which a sampled eye position $f$ is represented by a triplet $(x_f, y_f, t_f)$. Note that $x_f$ and $y_f$ are the coordinates of $f$ and $t_f$ is the time stamp at which $f$ starts (an eye position sampled by the 500 Hz eye-tracker lasts about two milliseconds, see Fig. 3 for some examples).

III-C Definition and Annotation of Salient Objects in Video

In early datasets with only simple images, salient objects can be manually annotated without much ambiguity. However, in a complex video there may exist several candidate objects, and different subjects may have different biases in determining which ones are the most salient. As a result, such subjective biases prevent the direct manual annotation of salient objects in complex videos.

To alleviate the subjective bias, the fixations of multiple subjects can be used to find the most salient objects. For example, Li et al. [13] collect fixations from 8 subjects that free-view the same image for 2 seconds. After that, salient objects are defined as the objects that receive the highest number of fixations. This solution provides a less ambiguous definition of salient objects in images but may fail on videos due to four reasons:

1) Insufficient viewing time. The viewing time of a frame (e.g., 33ms) is much shorter than that of an image. As a result, the fixations received by a frame are often insufficient to fully distinguish the most salient objects, especially when there exist multiple candidates in the same video frame (e.g., the cars and bears in Fig. 3 (a)).

2) Inaccurate fixations. Human fixations may fall outside moving objects and small objects (e.g., the fast moving aircraft in Fig. 3 (b)).

3) Rapid attention shift. Human attention can be suddenly distracted by visual surprise and then return to the salient objects after a short period. In this case, the surprising background regions will be mistakenly recognized as salient if only the fixations in this short period are considered in defining salient objects (e.g., the black region in Fig. 3 (c)).

4) Background-only frames. Some frames are purely background. If salient objects are defined by fixations received only by these frames, background regions in these frames will be mistakenly annotated as salient (e.g., the girl is occluded by background regions in Fig. 3 (d)).

For these reasons, it is difficult to directly define and annotate salient objects separately on each frame. Inspired by the idea of co-saliency [56, 20], we propose to define salient objects at the scale of whole videos. That is, salient objects in videos are defined as the objects that consistently receive the highest fixation densities throughout a video. The highest density of fixations, rather than the highest number of fixations, is used in defining salient objects in video. In this manner, we can avoid mistakenly assigning high saliency values to large background regions when salient objects are very small (e.g., the aircraft in Fig. 3 (b)).

Fig. 2: Masks of objects and regions annotated by 4 subjects. Holes are filled to speed up the annotation process (e.g., the key in the first row), and multiple objects are assigned the same label throughout the video if they cannot be easily separated in certain frames (e.g., the fighting bears and cats) or are difficult to re-identify (e.g., the jellyfish which frequently appear and disappear near screen borders).
Fig. 3: Human fixations (red dots) of 23 subjects on consecutive keyframes. We can see that these fixations are insufficient to directly annotate salient objects frame by frame. (a) Insufficient fixations to distinguish multiple salient objects and distractors; (b) Fixations fall outside small moving objects; (c) Fixations distracted by visually surprising regions; (d) Salient objects occluded by background regions, leading to background-only frames.

III-D Generation of Salient Object Masks

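Based on the proposed definition, we can generate masks of salient objects for each video. We first compute the fixation density at each object in manually annotated keyframes. Considering that the fixations received by each keyframe are very sparse, we also take into consideration the fixations recorded in a short period after the keyframe is displayed. Let $I_t \in \mathcal{V}$ be a frame presented at time $t$ and $\mathcal{O} \in I_t$ be an annotated object. We measure the fixation density at $\mathcal{O}$, denoted as $S_0(\mathcal{O})$, as

$$S_0(\mathcal{O}) = \frac{1}{\|\mathcal{O}\|} \sum_{f \in \mathcal{F}_{\mathcal{V}}} \sum_{p \in \mathcal{O}} \delta(t_f > t) \cdot \exp\left(-\frac{\mathrm{dist}(f, p)^2}{2\sigma_s^2} - \frac{(t_f - t)^2}{2\sigma_t^2}\right), \quad (1)$$

where $f = (x_f, y_f, t_f)$ is a fixation in the set $\mathcal{F}_{\mathcal{V}}$ recorded on the video $\mathcal{V}$, $p$ is a pixel at $(x_p, y_p)$ and $\|\mathcal{O}\|$ is the number of pixels in $\mathcal{O}$. The indicator function $\delta(t_f > t)$ equals 1 if $t_f > t$ and 0 otherwise. $\mathrm{dist}(f, p)$ measures the spatial distance between the fixation $f$ and the pixel $p$, which can be computed as

$$\mathrm{dist}(f, p) = \sqrt{(x_f - x_p)^2 + (y_f - y_p)^2}. \quad (2)$$

From (1) and (2), we can see that the influence of a fixation $f$ on the fixation density at the object $\mathcal{O}$ gradually decreases when the spatial or temporal distances between $f$ and the pixels in $\mathcal{O}$ increase. Such influence is controlled by $\sigma_s$ and $\sigma_t$, which are empirically set to 3% of the video width (or the video height if it is larger than the width) and 0.1s, respectively.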
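To make the computation concrete, here is a minimal NumPy sketch of a fixation density of this kind. It assumes a Gaussian falloff in both space and time, which matches the described behavior (influence decreasing with spatial and temporal distance, controlled by two bandwidths) but is not confirmed as the exact functional form. The function names `spatial_sigma` and `fixation_density`, the boolean-mask object representation, and the `(x, y, t)` fixation layout are all illustrative assumptions.

```python
import numpy as np

def spatial_sigma(width, height, ratio=0.03):
    """Spatial bandwidth: 3% of the larger side of the video."""
    return ratio * max(width, height)

def fixation_density(mask, fixations, t_frame, sigma_s, sigma_t=0.1):
    """Average fixation influence over the pixels of one annotated object.

    mask      : (H, W) boolean array, True inside the object
    fixations : iterable of (x, y, t) eye positions, t in seconds
    t_frame   : time (s) at which the keyframe is displayed
    The Gaussian weighting in space and time is an assumption.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return 0.0
    fixations = np.asarray(fixations, dtype=float).reshape(-1, 3)
    # Keep only fixations recorded after the keyframe is displayed.
    later = fixations[fixations[:, 2] > t_frame]
    if later.shape[0] == 0:
        return 0.0
    fx, fy, ft = later[:, 0:1], later[:, 1:2], later[:, 2:3]
    # Squared spatial distance between every fixation and every object pixel.
    d2 = (fx - xs[None, :]) ** 2 + (fy - ys[None, :]) ** 2
    weights = np.exp(-d2 / (2 * sigma_s ** 2)
                     - (ft - t_frame) ** 2 / (2 * sigma_t ** 2))
    # Normalize by object area so that small objects are not penalized.
    return float(weights.sum() / xs.size)
```

Dividing by the number of object pixels is what turns a fixation count into a density, so a small object that attracts most fixations can still score higher than a large background region.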

Based on the fixation density $S_0(\mathcal{O})$ of an object $\mathcal{O}$ at each keyframe $I_t$ of a video $\mathcal{V}$, we can compute the saliency score of the object, denoted as $S(\mathcal{O})$, from a global perspective:

$$S(\mathcal{O}) = \frac{\sum_{I_t \in \mathcal{V}} \delta(\mathcal{O} \in I_t) \cdot S_0(\mathcal{O})}{\sum_{I_t \in \mathcal{V}} \delta(\mathcal{O} \in I_t)}. \quad (3)$$

In (3), the indicator $\delta(\mathcal{O} \in I_t)$ equals 1 if $\mathcal{O}$ is annotated in the keyframe $I_t$ and 0 otherwise, so the saliency of an object is defined as its average fixation density throughout a video. After that, we select the objects with saliency scores above an empirical threshold of 50 (or the object with the highest saliency score if it is smaller than 50). Note that such a threshold is empirically selected with respect to the (subjectively assessed) object completeness as well as the consistency between segmented salient objects and all recorded fixations. We then generate a set of salient objects for each video, represented by a sequence of binary masks at keyframes. In particular, a keyframe which contains only background, or in which a salient object splits into several disconnected parts due to occlusion by background distractors, is discarded. Finally, we obtain binary masks of 7,467 keyframes (3,236 for the 97 videos in VOS-E and 4,231 for the 103 videos in VOS-N). Representative masks of salient objects can be found in Fig. 4.
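The selection rule described above (average the per-keyframe densities, keep objects above the threshold, fall back to the single best object otherwise) can be sketched as below. The function name `select_salient_objects` and the dictionary input format are hypothetical, and the threshold of 50 is only meaningful on the authors' own density scale.

```python
def select_salient_objects(densities, threshold=50.0):
    """Pick salient object labels from per-keyframe fixation densities.

    densities : dict mapping an object label to the list of its fixation
                densities over the keyframes in which it is annotated
                (an assumed input format).
    Returns (selected_labels, scores).
    """
    # Saliency score: average fixation density throughout the video.
    scores = {label: sum(vals) / len(vals)
              for label, vals in densities.items() if vals}
    if not scores:
        return [], scores
    selected = [label for label, s in scores.items() if s > threshold]
    if not selected:
        # No object passes the threshold: keep the highest-scoring one.
        selected = [max(scores, key=scores.get)]
    return sorted(selected), scores
```

The fallback branch guarantees that every video with at least one annotated object ends up with at least one salient object.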

Fig. 4: Representative keyframes and masks of salient objects.
Fig. 5: The average annotation maps of 6 datasets.

III-E Dataset Statistics

To reveal the main characteristics of VOS, we show in Fig. 5 the average annotation maps (AAMs) of VOS-E, VOS-N, VOS and three image datasets (i.e., ASD [2], ECSSD [11] and DUT-O [10]). Similar to [8], the average annotation map (AAM) of an image-based SOD dataset is computed by 1) resizing all ground-truth masks from the dataset into the same resolution, 2) adding up the resized masks pixel by pixel, and 3) normalizing the resulting map to a maximum value of 1.0. For a video-based SOD dataset (e.g., VOS-E, VOS-N and VOS), an AAM is first computed over each video, while the AAMs from all videos are fused following the same three steps to obtain the final AAM. In this manner, we can provide a better view of the distribution of salient objects in different videos (otherwise the AAMs will be heavily influenced by long videos).
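The three AAM steps, and the two-stage fusion used for video datasets, can be sketched as follows. This is an illustrative sketch, not the authors' code: the nearest-neighbor resizing, the 64×64 target resolution, and the function names `resize_nearest`, `average_annotation_map` and `video_dataset_aam` are assumptions.

```python
import numpy as np

def resize_nearest(mask, size):
    """Nearest-neighbor resize of a 2-D array to `size` = (height, width)."""
    h, w = size
    rows = np.arange(h) * mask.shape[0] // h
    cols = np.arange(w) * mask.shape[1] // w
    return mask[np.ix_(rows, cols)].astype(float)

def average_annotation_map(masks, size=(64, 64)):
    """Resize all masks, add them up pixel by pixel, normalize the max to 1."""
    acc = np.zeros(size)
    for m in masks:
        acc += resize_nearest(np.asarray(m), size)
    peak = acc.max()
    return acc / peak if peak > 0 else acc

def video_dataset_aam(videos, size=(64, 64)):
    """Compute one AAM per video, then fuse the per-video AAMs the same way.

    videos : list of lists of ground-truth masks, one inner list per video.
    """
    per_video = [average_annotation_map(v, size) for v in videos]
    return average_annotation_map(per_video, size)
```

Because each video is normalized before fusion, a video with many keyframes contributes no more to the final map than a short one.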

From Fig. 5, we can see that the distributions of salient objects in VOS and its two subsets are all center-biased, while the degree of center-bias is a little stronger than that in ASD, ECSSD and DUT-O. This is caused by the fact that photographers often have a strong tendency to place salient targets near the center of the view when taking videos. This implies that image-based and video-based SOD are inherently correlated, and it is possible to directly transfer some useful saliency cues from the spatial domain to the spatiotemporal domain (e.g., the background prior [3, 37] obtained from the boundary pixels).

Moreover, Fig. 6 shows the histograms of the number and area of salient objects. We see that the number and area of salient objects in VOS are similar to those in DUT-O. This implies that VOS, like the DUT-O dataset, is very challenging because it reflects many realistic scenarios. In particular, almost all keyframes from VOS-E contain only one salient object, while the sizes of such salient objects are distributed almost uniformly over the Small (31.1%), Medium (30.1%), Large (20.6%) and Very Large (18.3%) categories. This finding indicates that VOS-E serves as a good baseline dataset to benchmark video-based SOD models.

Fig. 6: Histograms of the number and area of salient objects.

IV A Baseline Model for Video-based SOD with Saliency-guided Stacked Autoencoders

IV-A The Framework

To construct a baseline model for VOS, we propose an unsupervised approach that learns saliency-guided stacked autoencoders. The framework of the proposed approach is shown in Fig. 7. We first turn each frame from VOS into several color spaces and extract object proposals as well as the motion information (e.g., optical flow). After that, we extract three spatiotemporal saliency cues from each frame at pixel, superpixel and object levels, while such saliency cues reveal the presence of salient objects from different perspectives. Considering that salient objects are often spatially smooth and temporally consistent in consecutive frames, we characterize each pixel with a high-dimensional feature vector which consists of the saliency cues collected from the pixel, its spatial neighbors and the corresponding pixel in the subsequent frame.

With the guidance of saliency cues in the high-dimensional feature vector at each pixel, stacked autoencoders that contain only one hidden node in the last encoding layer can be learned in an unsupervised manner (see Fig. 7). Since the saliency cues within a pixel and its spatiotemporal neighbors can be well reconstructed from the output of this layer, we can safely assume that the degree of saliency at each pixel is strongly correlated with the output score. By computing the output scores and their linear correlation coefficients with the input saliency cues, we can derive an initial saliency map for each frame that is spatially smooth and temporally consistent. Finally, several simple post-processing operations are applied to further pop out salient objects and suppress distractors.
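The correlation check mentioned in this overview can be illustrated with a small NumPy sketch. It computes the average linear (Pearson) correlation coefficient between the one-dimensional output scores and every dimension of the input cues. How the paper combines this coefficient with the output scores is not recoverable from this text, so `orient_scores` is only one plausible choice and is an assumption, as are all function names.

```python
import numpy as np

def mean_correlation(scores, cues):
    """Average Pearson correlation between scores (N,) and each column of cues (N, D)."""
    s = scores - scores.mean()
    c = cues - cues.mean(axis=0)
    denom = np.sqrt((s ** 2).sum() * (c ** 2).sum(axis=0))
    valid = denom > 0  # skip constant dimensions, whose correlation is undefined
    corr = (s[:, None] * c).sum(axis=0)[valid] / denom[valid]
    return float(corr.mean()) if corr.size else 0.0

def orient_scores(scores, cues):
    """ASSUMPTION: flip scores in [0, 1] that are negatively correlated with the cues."""
    return scores if mean_correlation(scores, cues) >= 0 else 1.0 - scores

# Toy data: 200 pixels with 30 cue values each (random placeholders).
rng = np.random.default_rng(0)
cues = rng.random((200, 30))
scores = cues.mean(axis=1)          # a score that follows the cues
r = mean_correlation(scores, cues)  # positive for this toy score
```

With such a check, a single hidden neuron whose output happens to decrease with saliency can still be used, since only the sign of the relation needs to be corrected.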

Fig. 7: The framework of the proposed saliency-guided stacked autoencoders.

IV-B Extracting Multi-scale Saliency Cues

To extract saliency cues, we first resize a frame so that its maximum side length is 400 pixels and convert it to the Lab and HSV color spaces. After that, we estimate the optical flow [57] between the frame and its subsequent frame, and compute the inter-frame flicker as the absolute in-place difference of intensity between the two frames. For simplicity, we use a space XYT, formed by combining the optical flow and the flicker, to indicate the variations along the horizontal, vertical and temporal directions. Finally, each frame is represented by 12 feature channels from the RGB, Lab, HSV and XYT spaces. Based on these channels, we extract three types of saliency cues, including:

1) Pixel-based saliency. To efficiently extract the pixel-based saliency, we refer to the algorithm proposed in [37] that computes the minimum barrier distance from a pixel to the image boundary (one pixel width). In the computation, we discard the Hue channel since the subtraction between hue values cannot always reflect the color contrast. Moreover, we also discard the RGB channels and the Value channel in HSV, which are somewhat redundant with the other channels. For the remaining 4 spatial and 3 temporal channels, the minimum barrier distances from all pixels to the image boundary are separately computed over each channel. Such distances are then summed up across channels to initialize a pixel-based saliency map. Moreover, we also extract a backgroundness map as in [37] and multiply it with the pixel-based saliency map to further enhance salient regions and suppress probable background regions. Finally, we conduct a morphological smoothing step to smooth the pixel-based saliency map while preserving the details of significant boundaries. As shown in Fig. 8 (c), the pixel-based saliency can be efficiently computed but is sensitive to noise.

2) Superpixel-based saliency. In image-based SOD, superpixels are often used as the basic units for feature extraction and saliency computation since they contain much more structural information than pixels. In this study, we adopt the approach proposed in [58] to extract superpixel-based saliency in an unsupervised manner. This approach first divides a frame into superpixels, based on which the sparse and low-rank properties are utilized to decompose the feature matrix of superpixels so as to obtain their saliency scores. In this process, prior knowledge on location (i.e., center-bias), color and background is used to refine the superpixel-based saliency. Finally, the saliency value of a superpixel is mapped back to all pixels it contains to generate a saliency map. As shown in Fig. 8 (d), the superpixel-based saliency can detect a large salient object as a whole (e.g., the tissue in the third row of Fig. 8 (d)).

3) Object-based saliency. Inspired by the construction process of VOS, we adopt the Multiscale Combinatorial Grouping algorithm [59] to generate a set of object proposals for the frame and estimate an objectness score for each proposal. After that, we adopt the unsupervised fixation prediction model proposed in [60] to generate three fixation density maps in the Lab, HSV and XYT spaces, respectively. Given the top-ranked objects with the highest objectness scores and the three fixation density maps, the object-based saliency at a pixel can be computed as:


where is an indicator function which equals to 1 if and 0 otherwise. is the set of objects used for computing the object-based saliency maps, and we set in experiments. (or ) indicates the ratio of fixations received by over the fixation density map , which is computed as:


As shown in Fig. 8 (e), the object-based saliency cues can successfully pop out large salient objects as a whole but often contain the background regions near them.
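The pixel-level and object-level cues described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the paper: `fast_mbd` is a plain raster-scan approximation of the minimum barrier distance in the spirit of [37], and `object_saliency` follows only the verbal description (an indicator of whether a pixel lies inside a proposal, weighted by the share of fixation density that the proposal receives). All function names are hypothetical, and the backgroundness map, the morphological smoothing, and the proposal and fixation models themselves are left out.

```python
import numpy as np

def fast_mbd(channel, n_passes=3):
    """Approximate minimum barrier distance from every pixel to the image boundary."""
    img = channel.astype(np.float64)
    h, w = img.shape
    dist = np.full((h, w), np.inf)
    dist[0, :] = dist[-1, :] = 0.0   # seeds: one-pixel-wide image boundary
    dist[:, 0] = dist[:, -1] = 0.0
    hi = img.copy()                  # highest value on the best path found so far
    lo = img.copy()                  # lowest value on the best path found so far

    def relax(y, x, ny, nx):
        if not np.isfinite(dist[ny, nx]):
            return                   # neighbor has no path to the boundary yet
        new_hi = max(hi[ny, nx], img[y, x])
        new_lo = min(lo[ny, nx], img[y, x])
        if new_hi - new_lo < dist[y, x]:
            dist[y, x] = new_hi - new_lo
            hi[y, x], lo[y, x] = new_hi, new_lo

    for p in range(n_passes):        # alternate forward and backward raster scans
        forward = (p % 2 == 0)
        ys = range(h) if forward else range(h - 1, -1, -1)
        xs = range(w) if forward else range(w - 1, -1, -1)
        step = -1 if forward else 1  # offset to the already-visited neighbors
        for y in ys:
            for x in xs:
                if 0 <= y + step < h:
                    relax(y, x, y + step, x)
                if 0 <= x + step < w:
                    relax(y, x, y, x + step)
    return dist

def pixel_saliency(channels):
    """Sum the per-channel distances, as described for the pixel-based cue."""
    return sum(fast_mbd(c) for c in channels)

def fixation_ratio(mask, fixation_map):
    """Share of the total fixation density that falls inside a proposal mask."""
    total = fixation_map.sum()
    return float(fixation_map[mask].sum() / total) if total > 0 else 0.0

def object_saliency(masks, fixation_maps):
    """Accumulate, at every pixel, the fixation ratios of the proposals covering it."""
    sal = np.zeros(fixation_maps[0].shape, dtype=np.float64)
    for mask in masks:               # top-ranked proposals (boolean H x W masks)
        for fmap in fixation_maps:   # one density map per feature space
            sal[mask] += fixation_ratio(mask, fmap)
    return sal

# Toy usage: one bright pixel, and one proposal that receives 75% of the fixations.
toy = np.zeros((5, 5)); toy[2, 2] = 1.0
pix = pixel_saliency([toy])
fmap = np.zeros((4, 4)); fmap[1, 1] = 3.0; fmap[3, 3] = 1.0
box = np.zeros((4, 4), dtype=bool); box[:2, :2] = True
obj = object_saliency([box], [fmap, fmap, fmap])
```

In the toy example only the bright center pixel is separated from the boundary by an intensity barrier, and only pixels inside the proposal receive object-based saliency.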

IV-C Learning Stacked Autoencoders

Given the saliency cues, we have to estimate a non-negative saliency score for each pixel, which, statistically, has positive correlation with the saliency cues. Moreover, as stated in many previous works [54, 21, 55], the estimated saliency scores should have the following attributes:

1) Spatial smoothness. Similar pixels spatially adjacent to each other should have similar saliency scores.

2) Temporal consistency. Corresponding pixels in adjacent frames should have similar saliency scores so that salient objects can consistently pop-out throughout a video.

To develop a model with such attributes, we train stacked autoencoders that take saliency cues at a pixel and its spatiotemporal neighbors as the input so that the spatial smoothness and temporal consistency of predicted saliency scores can be guaranteed. For computational efficiency, for each pixel we adopt its eight spatial neighbors and only one temporal neighbor in the subsequent frame, defined by the optical flow. A pixel is then represented by a feature vector with 30 saliency cues (three cues at each of the ten pixels).
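The construction of the per-pixel feature vector can be sketched as follows. This is a minimal sketch, assuming three cue maps per frame and therefore 3 x (1 + 8 + 1) = 30 values per pixel; the edge padding at image borders, the nearest-pixel rounding of the flow, and the function name `pixel_features` are assumptions for illustration.

```python
import numpy as np

def pixel_features(cues_t, cues_next, flow):
    """Build one 30-D vector per pixel.

    cues_t, cues_next: H x W x 3 saliency cues (pixel, superpixel, object level)
                       of the current and the subsequent frame.
    flow: H x W x 2 optical flow (dx, dy) from the current to the subsequent frame.
    """
    h, w, _ = cues_t.shape
    padded = np.pad(cues_t, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # 3 x 3 spatial window: the pixel itself and its eight neighbors.
    spatial = [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]
    # Temporal neighbor: follow the flow into the next frame (nearest pixel, clipped).
    ys, xs = np.mgrid[0:h, 0:w]
    ty = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    tx = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    temporal = cues_next[ty, tx]
    return np.concatenate(spatial + [temporal], axis=2).reshape(h * w, -1)

# Toy usage with random cues and zero motion.
h, w = 4, 5
rng = np.random.default_rng(0)
cues_t, cues_next = rng.random((h, w, 3)), rng.random((h, w, 3))
feats = pixel_features(cues_t, cues_next, np.zeros((h, w, 2)))
```

Columns 12 to 14 hold the cues of the pixel itself (the center of the 3 x 3 window), and the last three columns hold the cues of its temporal neighbor.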

Fig. 8: Saliency cues and the estimated saliency maps. (a) Frames, (b) ground-truth, (c) pixel-based saliency, (d) superpixel-based saliency, (e) object-based saliency, (f) initial saliency maps obtained by the saliency-guided stacked autoencoders, (g) final saliency maps obtained after post-processing.

With the guidance of the high-dimensional saliency cues, we collect the feature vectors from randomly selected pixels in VOS, denoted as . With these data, we train stacked autoencoders with encoding layers and the same number of decoding layers with logistic sigmoid transfer functions. In the training process, no ground-truth data is used, while the th encoding layer , and its corresponding decoding layer is trained by minimizing


where is a -2 regularization term that can be used to penalize the -2 norm of weights in the encoding and decoding layers (we empirically set in this study).

is a sparsity regularizer that is defined as the Kullback-Leibler divergence between the average output of each neuron in

and a predefined score (we empirically set and ).
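The layer-wise objective described above combines a reconstruction term, a weight penalty and a Kullback-Leibler sparsity penalty. The following NumPy sketch evaluates such an objective for one encoding/decoding pair with logistic sigmoid units. It is a generic sparse-autoencoder loss and may differ in detail from the paper's equation; the weights `lam` and `beta`, the target activation `rho` and the hidden size of 15 are placeholders, since the values used in the paper are not recoverable from this text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_loss(x, w_enc, b_enc, w_dec, b_dec, lam=1e-3, beta=1.0, rho=0.05):
    """Reconstruction error + L2 weight penalty + KL sparsity penalty for one layer.

    lam, beta and rho are placeholder values, not the settings used in the paper.
    """
    hidden = sigmoid(x @ w_enc + b_enc)       # encoding layer
    recon = sigmoid(hidden @ w_dec + b_dec)   # matching decoding layer
    mse = np.mean(np.sum((recon - x) ** 2, axis=1))
    l2 = 0.5 * (np.sum(w_enc ** 2) + np.sum(w_dec ** 2))
    # Average activation of each hidden neuron over the sampled pixels.
    rho_hat = np.clip(hidden.mean(axis=0), 1e-8, 1 - 1e-8)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return mse + lam * l2 + beta * kl, hidden

# Toy usage: 100 sampled feature vectors with 30 cues, hypothetical 15 hidden neurons.
rng = np.random.default_rng(0)
x = rng.random((100, 30))
w_enc, b_enc = rng.normal(0, 0.1, (30, 15)), np.zeros(15)
w_dec, b_dec = rng.normal(0, 0.1, (15, 30)), np.zeros(30)
loss, hidden = layer_loss(x, w_enc, b_enc, w_dec, b_dec)
```

In greedy layer-wise training, `hidden` (after normalization) would serve as the input `x` of the next encoding layer.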

In minimizing (6), the first encoding layer takes the sampled feature vectors of saliency cues as the input data, while other encoding layers take the output of previous encoding layers as the input. That is, in training the th encoding/decoding layer, we have


where indicates the normalization operation that enforces each dimension of the input data that enters a encoding layer falls in the same dynamic range of . In this study, we use encoding layers with

neurons at each layer, and each layer is trained with 100 epochs. Note that the last encoding layer contains only one neuron, and by using its output score the input saliency cues within a pixel and its spatiotemporal neighbors can be well reconstructed by the decoding layers. As a result, we can safely assume that such output scores are tightly correlated with the input saliency cues, and the degree of correlation can be measured by averaging the linear correlation coefficients between the output scores and every dimension of the input saliency cues. Consequently, the saliency score of a pixel, given its feature vector that contains the saliency cues from the pixel and its spatiotemporal neighbors, can be computed as


After computing a saliency score with (8) for each pixel, we can initialize a saliency map for each frame in VOS with the saliency values normalized to [0, 1]. As shown in Fig. 8 (f), such a saliency map already performs impressively in highlighting salient objects and suppressing distractors. To further pop out salient objects and suppress distractors, we conduct three post-processing operations, including:

  1. Apply temporal smoothing between adjacent frames to reduce the inter-frame flicker. We adopt a Gaussian filter with a width of 3 and .

  2. Enhance the foreground/background contrast by using the sigmoid function proposed in


  3. Binarize the saliency map with the average value of the whole saliency map and suppress the connected components that are extremely small.

As shown in Fig. 8 (g), these post-processing operations can generate compact and precise salient objects. Note that operations like center-biased re-weighting and spatial smoothing are not adopted here because the autoencoders, learned without supervision over a large-scale dataset, already have the capability to accurately detect various types of salient objects regardless of their positions and sizes.
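Two of the three post-processing operations can be sketched with NumPy alone: temporal smoothing with a 3-tap Gaussian and binarization at the mean saliency followed by removal of tiny connected components. This is a minimal sketch under stated assumptions: `sigma`, `min_area`, the 4-connectivity and the repetition of the first and last frames at the sequence ends are placeholders chosen for illustration, and the sigmoid contrast enhancement is omitted because its definition is not recoverable from this text.

```python
import numpy as np
from collections import deque

def temporal_smooth(maps, sigma=1.0):
    """Smooth a T x H x W stack along time with a 3-tap Gaussian (sigma is a placeholder)."""
    k = np.exp(-0.5 * (np.array([-1.0, 0.0, 1.0]) / sigma) ** 2)
    k /= k.sum()
    padded = np.concatenate([maps[:1], maps, maps[-1:]], axis=0)  # repeat end frames
    return k[0] * padded[:-2] + k[1] * padded[1:-1] + k[2] * padded[2:]

def binarize(sal, min_area=4):
    """Threshold at the mean value, then drop 4-connected components below min_area pixels."""
    mask = sal > sal.mean()
    h, w = mask.shape
    seen = np.zeros_like(mask)
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        seen[sy, sx] = True
        comp, queue = [(sy, sx)], deque([(sy, sx)])
        while queue:  # breadth-first flood fill of one component
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    comp.append((ny, nx))
                    queue.append((ny, nx))
        if len(comp) < min_area:
            for y, x in comp:
                mask[y, x] = False
    return mask

# Toy usage: a constant stack stays constant; an isolated pixel is suppressed.
stack = np.full((3, 6, 6), 0.5)
smoothed = temporal_smooth(stack)
sal = np.zeros((6, 6)); sal[1:4, 1:4] = 1.0; sal[5, 5] = 1.0
mask = binarize(sal)
```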

V Experiments

In this section, we compare the proposed Saliency-guided Stacked Autoencoders (SSA) with the state-of-the-art models on VOS. The main objectives are two-fold: 1) validate the effectiveness of the dataset VOS and the baseline model SSA, and 2) provide a comprehensive benchmark to reveal the key challenges in video-based SOD. The rest of this section first introduces the experimental settings and then discusses the results.

    Model Pub. & Year &  Type Model Pub. & Year &  Type
SIV [18] ECCV 2010   [V+U] CB [61] BMVC 2011   [I+C]
RC [33] CVPR 2011   [I+C] ULR [34] CVPR 2012   [I+C]
LMLC [62] TIP 2013   [I+C] DRFI [3] CVPR 2013   [I+C]
GMR [10] CVPR 2013   [I+C] HS [11] CVPR 2013   [I+C]
PCA [63] CVPR 2013   [I+C] CHM [64] ICCV 2013   [I+C]
DSR [65] ICCV 2013   [I+C] MC [35] ICCV 2013   [I+C]
FST [54] ICCV 2013   [V+U] HDCT [66] CVPR 2014   [I+C]
RBD [36] CVPR 2014   [I+C] NLC [67] BMVC 2014   [V+U]
BL [68] CVPR 2015   [I+C] BSCA [69] CVPR 2015   [I+C]
LEGS [40] CVPR 2015   [I+D] MCDL [42] CVPR 2015   [I+D]
MDF [25] CVPR 2015   [I+D] SAG [55] CVPR 2015   [V+U]
GP [70] ICCV 2015   [I+C] MB [37] ICCV 2015   [I+C]
MB+ [37] ICCV 2015   [I+C] GF [21] TIP 2015   [V+U]
ELD [46] CVPR 2016   [I+D] DCL [41] CVPR 2016   [I+D]
RFCN [50] ECCV 2016   [I+D] DHSNet [48] CVPR 2016   [I+D]
SMD [58] PAMI 2017   [I+C] SSA Our approach   [V+U]
TABLE II: Models for benchmarking (Symbols: [I] for image-based, [V] for video-based; [C] for classic unsupervised or non-deep learning, [D] for deep learning, [U] for unsupervised).
Models | VOS-E: MAP, MAR, $F_\beta$, MAE | VOS-N: MAP, MAR, $F_\beta$, MAE | VOS: MAP, MAR, $F_\beta$, MAE
[I+C]  CB [61] 0.755 0.791 0.763 0.145 0.463 0.563 0.483 0.229 0.605 0.674 0.619 0.188
 RC [33] 0.738 0.677 0.723 0.171 0.465 0.561 0.484 0.221 0.597 0.617 0.602 0.197
 ULR [34] 0.693 0.737 0.703 0.158 0.390 0.675 0.432 0.168 0.537 0.705 0.568 0.163
 LMLC [62] 0.687 0.736 0.697 0.154 0.408 0.501 0.426 0.262 0.543 0.615 0.558 0.210
 GMR [10] 0.813 0.697 0.783 0.140 0.500 0.611 0.522 0.195 0.652 0.653 0.652 0.168
 HS [11] 0.755 0.615 0.717 0.141 0.497 0.521 0.502 0.262 0.622 0.567 0.608 0.203
 CHM [64] 0.756 0.765 0.758 0.124 0.409 0.611 0.443 0.186 0.578 0.685 0.599 0.156
 DRFI [3] 0.762 0.837 0.778 0.114 0.442 0.733 0.486 0.150 0.597 0.783 0.632 0.132
 PCA [63] 0.750 0.725 0.744 0.143 0.420 0.696 0.462 0.142 0.580 0.710 0.606 0.143
 DSR [65] 0.765 0.748 0.761 0.112 0.450 0.679 0.488 0.140 0.603 0.713 0.625 0.127
 MC [35] 0.819 0.737 0.799 0.140 0.499 0.665 0.530 0.192 0.655 0.700 0.664 0.167
 HDCT [66] 0.711 0.791 0.728 0.128 0.420 0.677 0.460 0.142 0.561 0.733 0.593 0.136
 RBD [36] 0.799 0.782 0.795 0.091 0.516 0.709 0.550 0.145 0.653 0.745 0.672 0.119
 GP [70] 0.743 0.788 0.753 0.141 0.405 0.704 0.449 0.227 0.569 0.745 0.602 0.185
 MB [37] 0.814 0.735 0.794 0.107 0.480 0.696 0.517 0.151 0.642 0.715 0.657 0.129
 MB+ [37] 0.803 0.792 0.801 0.096 0.492 0.754 0.535 0.162 0.643 0.772 0.669 0.130
 BL [68] 0.765 0.777 0.768 0.165 0.477 0.658 0.509 0.220 0.617 0.716 0.637 0.194
 BSCA [69] 0.766 0.758 0.764 0.133 0.457 0.663 0.493 0.195 0.607 0.709 0.628 0.165
 SMD [58] 0.811 0.789 0.806 0.096 0.528 0.688 0.558 0.148 0.665 0.737 0.681 0.123
[I+D]  LEGS [40] 0.820 0.685 0.784 0.193 0.556 0.593 0.564 0.215 0.684 0.638 0.673 0.204
 MCDL [42] 0.831 0.787 0.821 0.081 0.570 0.645 0.586 0.085 0.697 0.714 0.701 0.083
 MDF [25] 0.740 0.848 0.762 0.100 0.527 0.742 0.565 0.098 0.630 0.793 0.661 0.099
 ELD [46] 0.790 0.884 0.810 0.060 0.569 0.838 0.615 0.081 0.676 0.861 0.712 0.071
 DCL [41] 0.864 0.735 0.830 0.084 0.583 0.809 0.624 0.079 0.719 0.773 0.731 0.081
 RFCN [50] 0.834 0.820 0.831 0.075 0.614 0.783 0.646 0.080 0.721 0.801 0.738 0.078
 DHSNet [48] 0.863 0.905 0.872 0.049 0.649 0.851 0.686 0.055 0.753 0.877 0.778 0.052
[V+U]  SIV [18] 0.693 0.543 0.651 0.204 0.451 0.523 0.466 0.201 0.568 0.533 0.560 0.203
 FST [54] 0.781 0.903 0.806 0.076 0.619 0.691 0.634 0.117 0.697 0.794 0.718 0.097
 NLC [67] 0.439 0.421 0.435 0.204 0.561 0.610 0.572 0.123 0.502 0.518 0.505 0.162
 SAG [55] 0.709 0.814 0.731 0.129 0.354 0.742 0.402 0.150 0.526 0.777 0.568 0.140
 GF [21] 0.712 0.798 0.730 0.153 0.346 0.738 0.394 0.331 0.523 0.767 0.565 0.244
 SSA 0.875 0.776 0.850 0.062 0.660 0.682 0.665 0.103 0.764 0.728 0.755 0.083
  • The executable of NLC only outputs valid results on 187 videos (91 from VOS-E and 96 from VOS-N).

TABLE III: Performance benchmarking of our approach and 31 state-of-the-art models on VOS and its two subsets VOS-E and VOS-N. Top three scores in each column are marked in red, green and blue, respectively. Symbols of model categories: [I+C] for image-based & classic unsupervised or non-deep learning, [I+D] for image-based & deep learning, [V+U] for video-based & unsupervised.

V-A Settings

As shown in Table II, thirty-two state-of-the-art models, including the proposed baseline model SSA, are tested on the VOS dataset (19 image-based & classic unsupervised or non-deep learning, 7 image-based & deep learning and 6 video-based & unsupervised). Similar to many image-based SOD works, we adopt Recall, Precision, F-measure ($F_\beta$) and Mean Absolute Error (MAE) as the evaluation metrics. Let $G$ be the ground-truth binary mask of a keyframe and $S$ be the saliency map predicted by a model. The MAE score can be computed as the average absolute difference between all pixels in $S$ and $G$ to directly reflect the visual difference [71, 8]. Moreover, the Recall and Precision scores can be computed by converting $S$ into a binary mask $M$ and comparing it with $G$:

$$\text{Recall} = \frac{|M \cap G|}{|G|}, \qquad \text{Precision} = \frac{|M \cap G|}{|M|}.$$

Intuitively, the overall performance of a model on VOS can be assessed by directly computing the average Recall and Precision over all keyframes. However, this solution over-emphasizes the performance on long videos and ignores the performance on short videos (e.g., a video with 100 keyframes will overwhelm a video with only 10 keyframes). To avoid that, we first compute the average Recall, Precision and MAE separately over each video. After that, the mean values of the average Recall, Precision and MAE are computed over all videos. In this manner, the Mean Average Recall (MAR), Mean Average Precision (MAP) and MAE can well reflect the performance of a model by equally considering its performance over all videos. Correspondingly, $F_\beta$ is computed by fusing MAR and MAP to quantify the overall performance:

$$F_\beta = \frac{(1+\beta^2) \cdot \text{MAP} \cdot \text{MAR}}{\beta^2 \cdot \text{MAP} + \text{MAR}},$$

where we set $\beta^2 = 0.3$, as most existing image-based models [2, 8] did in the performance evaluation.

Another problem in assessing models with MAP, MAR and $F_\beta$ is how to turn a gray-scale saliency map $S$ into a binary mask $M$. Similar to image-based SOD, we adopt the adaptive threshold proposed in [2], which is computed as twice the average value of $S$, to generate a binary mask from each saliency map. Considering that such an adaptive threshold may sometimes exceed the maximal saliency value of $S$ if there exists a very large salient object, we set this threshold to the maximal saliency value in this case. In this manner, unique MAR, MAP and $F_\beta$ scores can be generated to measure the overall performance of a model.
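The evaluation protocol above can be summarized in a short Python sketch: adaptive thresholding, per-keyframe Precision, Recall and MAE, averaging within each video before averaging across videos, and the fused F-measure. The function names are hypothetical, the use of `>=` in the threshold comparison is an assumption, and `beta2 = 0.3` is the value commonly used in image-based SOD; with it, the MAP and MAR of SSA on VOS in Table III (0.764 and 0.728) give an F-measure of about 0.755, which matches the table.

```python
import numpy as np

def binarize_adaptive(sal):
    """Threshold at twice the mean saliency, capped at the map's maximum."""
    thr = min(2.0 * sal.mean(), sal.max())
    return sal >= thr

def frame_scores(sal, gt):
    """Return (precision, recall, mae) for one keyframe; gt is a boolean mask."""
    mask = binarize_adaptive(sal)
    hit = np.logical_and(mask, gt).sum()
    precision = hit / mask.sum() if mask.sum() else 0.0
    recall = hit / gt.sum() if gt.sum() else 0.0
    mae = np.abs(sal - gt.astype(np.float64)).mean()
    return precision, recall, mae

def f_beta(m_ap, m_ar, beta2=0.3):
    """Fuse mean average precision and recall into a single F-measure."""
    denom = beta2 * m_ap + m_ar
    return (1 + beta2) * m_ap * m_ar / denom if denom > 0 else 0.0

def benchmark(videos, beta2=0.3):
    """videos: list of videos, each a list of (saliency_map, ground_truth) keyframes."""
    # Average within each video first so that long videos do not dominate.
    per_video = [np.mean([frame_scores(s, g) for s, g in v], axis=0) for v in videos]
    m_ap, m_ar, mae = np.mean(per_video, axis=0)
    return m_ap, m_ar, f_beta(m_ap, m_ar, beta2), mae

# Toy usage: a perfect prediction on a single keyframe.
gt = np.zeros((4, 4), dtype=bool); gt[:2, :2] = True
result = benchmark([[(gt.astype(np.float64), gt)]])
ssa_vos = f_beta(0.764, 0.728)  # MAP and MAR of SSA on VOS from Table III
```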

Fig. 9: The representative results of SSA and the best model from each of the three model categories.

V-B Model Benchmarking

The performance scores of the proposed baseline model SSA and the other state-of-the-art models over VOS-E, VOS-N and VOS are listed in Table III. Some representative results from the best model of each of the three model categories are shown in Fig. 9. With Table III and Fig. 9, we conduct several comparisons and discussions, including:

1) Comparisons between SSA and the other models. From Table III, we can see that SSA outperforms 30 state-of-the-art models in terms of $F_\beta$, including 6 image-based deep models (all except DHSNet) and 5 video-based models. Note that no ground-truth data in any form has been used in SSA, while the other deep models often make use of VGGNet [72] pre-trained on massive images with semantic tags and have their SOD models fine-tuned on thousands of images with manually annotated salient objects (e.g., DHSNet starts with VGGNet and then takes 9500 images from two datasets for model fine-tuning). Even in such a challenging setting, the unsupervised shallow model SSA, which only utilizes four layers of stacked autoencoders, still outperforms all deep models in terms of MAP, and its $F_\beta$ score is higher than those of the other six deep models. This result validates the effectiveness of the saliency-guided autoencoding scheme in video-based SOD.

One more thing worth mentioning is that on VOS and its two subsets, SSA always has the best Precision (MAP = 0.764 on VOS), while its MAR scores are even lower than those of some unsupervised image-based models like MB+ and RBD. This may be caused by the fact that such models adopt bottom-up frameworks that tend to pop out almost all regions that are different from the predefined context (i.e., the image boundary in MB+ and RBD), leading to high recall rates. However, the suppression of distractors is less considered in such frameworks, making their precision much lower than that of SSA. Actually, in the SOD task, it is widely recognized that a high precision is much more difficult to obtain than a high recall [29, 39], and a frequently used trade-off is to gain a remarkable increase in precision at the cost of slightly decreasing recall. That is why the computation of $F_\beta$ in this work and in almost all the image-based models places more emphasis on precision than on recall. Although a higher recall usually leads to a better subjective impression in qualitative comparisons, the overall performance, especially the $F_\beta$ score, may not be very satisfactory due to the emphasis on precision in computing $F_\beta$. This result also poses a challenge for the proposed VOS dataset: how to further improve the Recall rate while maintaining the high Precision?

2) Comparisons between (non-deep) image-based and video-based models. Beyond analyzing the best models, another issue worth discussing is the performance of image-based and video-based models, especially the non-deep ones. Interestingly, video-based models like GF and SAG may sometimes perform even worse than the image-based models (e.g., SMD, RBD and MB+). This may be caused by two reasons. First, the impact of incorporating temporal information in visual saliency computation is not always positive. In some videos, the salient objects, as assumed by many video-based models, have specific motion patterns that are remarkably different from the distractors (e.g., the dancing bear & girl in the second row of Fig. 4). However, such an assumption may not always hold in processing the “realistic” videos from VOS. For example, in some videos with global camera motion and static salient objects/distractors (e.g., the shoes and book in the second column of Fig. 4), the temporal information acts as a kind of noise and often leads to unsatisfactory results. Second, the parameters of most video-based models are manually fine-tuned on small datasets, which may become “over-fitting” to specific scenarios. Given a new scenario contained in VOS, these parameters may lead to unsatisfactory results, either by emphasizing the wrong feature channels or by propagating the wrong results from some frames to the entire video in an energy-based optimization framework.

3) Comparisons between image-based deep and non-deep models. From Table III, we also find that image-based deep models often perform remarkably better than the image-based models with classic unsupervised or non-deep learning frameworks. This may be caused by the fact that deep models can be complex enough to make use of massive training data. Taking the seven deep models compared in Table III as examples, we can obtain a ranked list with decreasing $F_\beta$ on VOS. The ranked list, together with the training data of each model, is as follows: 1) DHSNet: 9500 from MSRA10K and DUT-O, 2) RFCN: 10000 from MSRA10K, 3) DCL: 2500 from MSRA-B, 4) ELD: 9000 from MSRA10K, 5) MCDL: 8000 from MSRA10K, 6) LEGS: 3340 from MSRA-B and PASCAL-S, and 7) MDF: 2500 from MSRA-B. Note that the scenarios in DUT-O and PASCAL-S are much more challenging than those from MSRA-B and MSRA10K (many images of MSRA-B are also contained in MSRA10K). From this ranked list, we can conclude that, except for the outlier DCL, the more training data and training sources a deep model uses, the better it performs. This finding is quite interesting and may help to explain the success of some top-ranked deep models like DHSNet and RFCN. Moreover, the top-ranked models often adopt a recurrent mechanism in detecting salient objects, and such mechanisms can help to iteratively discover salient objects and suppress probable distractors. For video-based SOD, the success of such deep models shows a feasible way to develop better spatiotemporal models by using image-based training data as well as the recurrent architecture. Furthermore, it is necessary to develop an unsupervised baseline model that utilizes no training data in any form so as to provide fair comparisons for the other unsupervised and supervised models. That is why we propose SSA, which has the potential of being widely used as the baseline model on VOS.

V-C Performance Analysis of SSA

Beyond model benchmarking, we also conducted several experiments to analyze the performance of SSA, including scalability test, influence of various components, influence of temporal window size, speed test and failure cases.

1) Scalability test. One concern about SSA may be its scalability to other datasets. To address this concern, we apply the stacked autoencoders learned on VOS to a new dataset, ViSal [21]. On ViSal, the performance of SSA and the other 9 models (i.e., the top three models on VOS from each model category) is reported in Table IV. We find that the overall performance of SSA, although not fine-tuned on ViSal, still ranks second on this dataset (only worse than the deep model DHSNet). In particular, its MAE score even gains a higher rank than that on VOS, which may be caused by the fact that VOS is a large dataset that covers a variety of scenarios (e.g., VOS-N contains many outdoor scenarios about animals and aeroplanes that are also present in ViSal). Moreover, unsupervised architectures often have better performance in the scalability test and can be generalized to new scenarios. This is further supported by the model FST, which ranks third in terms of $F_\beta$ on ViSal (higher than its rank on VOS). To sum up, VOS contains a large number of real-world scenarios that may help to alleviate the over-fitting risk. Moreover, the unsupervised framework of SSA makes it a scalable model that can be generalized to other scenarios without a remarkable performance drop.

Models  MAP  MAR  $F_\beta$  MAE
[I+C]  MB+ [37] 0.551 0.887 0.604 0.145
 RBD [36] 0.529 0.787 0.572 0.129
 SMD [58] 0.583 0.886 0.633 0.133
[I+D]  DCL [41] 0.718 0.859 0.747 0.261
 RFCN [50] 0.781 0.897 0.805 0.050
 DHSNet [48] 0.816 0.955 0.845 0.027
[V+U]  GF [21] 0.556 0.850 0.604 0.108
 SAG [55] 0.538 0.858 0.589 0.104
 FST [54] 0.803 0.815 0.806 0.052
 SSA 0.787 0.884 0.808 0.046
TABLE IV: Performance scores of our approach and the other 9 models on ViSal. The 9 models are selected as the top three models from each of the three model categories. The top 3 scores in each column are marked in red, green and blue, respectively.

2) Influence of various components. SSA involves three types of saliency cues, and we aim to explore which ones contribute the most to the performance of SSA. Toward this end, we conduct an experiment to evaluate the performance of SSA on VOS when some types of saliency cues are ignored. For fair comparisons, we adopt the same architecture of stacked autoencoders but replace some saliency cues with zeros in training and testing SSA. As shown in Table V, among the single-cue settings the pixel-based saliency has the best precision, while the object-based saliency has the best recall. Meanwhile, integrating all three types of saliency leads to the best overall performance. An interesting phenomenon is that in the superpixel-only setting SSA outperforms SMD in both recall and precision, even though SMD is exactly the model used in computing the superpixel-based saliency. This may be mainly caused by the fact that temporal cues from adjacent frames are incorporated into the auto-encoding process, which provides an opportunity to refine the results of SMD from a temporal perspective. Due to the existence of the temporal dimension in defining and annotating salient video objects, video-based SOD datasets contain something that cannot be obtained from image-based SOD datasets. For example, in the “fighting bears” scenario illustrated in the first two rows of the right column of Fig. 9, the mailbox and cars are considered to be non-salient from the perspective of the entire video, even though in some specific frames they do capture more human fixations than the fighting bears. In other words, the VOS dataset provides a new way to explore the influence of spatiotemporal cues (e.g., optical flow and features propagated from adjacent frames) in defining, annotating and detecting salient objects, while in most image-based SOD datasets only spatial cues are involved. We believe the spatiotemporal definition of salient objects in VOS may help future works to discover what is and what is not a salient object as human beings do.

3) Influence of temporal window size. In SSA, only one subsequent frame is referred to in processing a frame. To justify this choice, we conduct an experiment that gradually incorporates zero or more subsequent frames and show the variation in the performance of SSA on VOS. As shown in Fig. 10, by referring only to the subsequent frame, the $F_\beta$ score increases from 0.735 (no subsequent frame) to 0.755 (one subsequent frame). This result implies that the temporal cues can facilitate the detection of salient objects in a frame, even though consecutive frames are highly correlated. By incorporating more distant frames, the performance gains are not as high as expected. This may be caused by the fact that the temporal correspondence between consecutive frames is the most reliable, while such reliability gradually decreases when the temporal gap between two frames increases. Such an experiment, together with the scalability test, provides empirical evidence that the over-fitting risk of SSA is not very high, even though only one subsequent frame is used as the temporal context of the current frame.

Fig. 10: Performance of SSA on VOS when temporal windows with different sizes are taken into consideration.

4) Speed test. The SSA model consists of many feature extraction steps, and a speed analysis of these steps helps to find out how to further enhance the efficiency. Toward this end, we list the time costs of the key steps of SSA in processing the first video in VOS, and compare them with those of the other 5 video-based SOD models. Note that the video is down-sampled from its original resolution for fair comparisons of the various models in the speed test. All models are tested on a CPU platform (single core, 3.4GHz) with 128GB memory. As shown in Table VI, the speed of SSA is comparable to that of previous algorithms like SIV and NLC. By investigating the time cost of each component of SSA, we find that about 58.8% of the computation time is spent on extracting object proposals. Moreover, about 22.1% is spent on generating the optical flow. As a result, a probable way to speed up SSA is to replace these two components with faster models for optical flow computation and object proposal generation. In addition, parallel processing can be explored as well, especially in extracting and encoding frame-wise saliency cues.

Saliency Cues  MAP  MAR  $F_\beta$  MAE
Pixel-only 0.705 0.649 0.691 0.105
Superpixel-only 0.691 0.744 0.703 0.103
Object-only 0.647 0.811 0.678 0.155
Pixel + Superpixel 0.677 0.801 0.702 0.100
Pixel + Object 0.788 0.564 0.722 0.100
Superpixel + Object 0.720 0.774 0.732 0.091
All 0.764 0.728 0.755 0.083
TABLE V: Performance of SSA on VOS when different types of saliency cues are used.
Models or Key Steps Average Time (s/frame)
SIV [18] 10.5
FST [54] 5.80
NLC [67] 19.0
SAG [55] 5.37
GF [21] 4.67
Optical Flow 1.84
Object proposal 4.89
Pixel-based Saliency 0.06
Superpixel-based Saliency 0.83
Object-based Saliency 0.38
Auto-encoding & Post-Proc. 0.31
SSA 8.31
TABLE VI: Speed test of SSA, all its components and the other 5 video-based SOD models. All models are tested on the first video of VOS with 617 frames, which is down-sampled for fair comparisons of all models.

5) Failure cases. Although SSA outperforms most of the benchmarked models, we can see that its $F_\beta$ score is still far from perfect, which is mainly caused by the low Recall rate. On VOS-E, which contains only simple videos with nearly static salient objects and distractors as well as slow camera motion, SSA only reaches an $F_\beta$ score of 0.850, while the score drops sharply to 0.665 on VOS-N. This implies that the videos from real-world scenarios are much more challenging than the videos taken in the laboratory environment. Actually, this is also the main reason that prevents the usage of existing SOD models in other applications.

To validate this point, we illustrate in Fig. 11 two representative scenarios in which SSA fails, which actually reveal two key challenges in video-based SOD. First, salient objects in a keyframe should be defined and detected by considering the entire video rather than the keyframe itself. For example, in some early frames of Fig. 11 it is difficult to determine whether the pen or the notebook is the most salient object. Although in some later frames the pen is correctly detected, it is difficult to transfer such correct results to frames far away. This indicates that the local spatiotemporal correspondences between pixels used by SSA are still insufficient to handle more challenging scenarios, and a salient object should be detected by computing saliency from the global perspective as well.

Nevertheless, the failure cases in Fig. 11 not only suggest what should be considered in developing new video-based models but also validate the effectiveness of the VOS dataset. Actually, the indoor/outdoor scenarios from VOS are mainly taken by non-professional photographers, which are quite different from those in existing image datasets. For example, the moving crab in Fig. 11 consistently receives the highest density of fixations and becomes the most salient object in the video, even though it is very small. The existence of such scenarios in VOS increases the difficulty of transferring the knowledge learned on existing image datasets (e.g., the deep model DHSNet learned from 9500 images) to the spatiotemporal domain, making video-based SOD on VOS an extremely challenging task. With such challenging cases, it is believed that VOS can facilitate the development of new models by benchmarking their performance in processing real-world videos.
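As a side note on the speed test, the shares of computation time quoted there (about 58.8% for object proposals and 22.1% for optical flow) can be reproduced from the per-step costs in Table VI with a few lines of Python. The snippet only re-does that arithmetic; the dictionary keys simply copy the row labels of the table.

```python
# Per-frame time costs (seconds) of the SSA steps, copied from Table VI.
steps = {
    "Optical Flow": 1.84,
    "Object proposal": 4.89,
    "Pixel-based Saliency": 0.06,
    "Superpixel-based Saliency": 0.83,
    "Object-based Saliency": 0.38,
    "Auto-encoding & Post-Proc.": 0.31,
}
total = sum(steps.values())  # matches the 8.31 s/frame reported for SSA
shares = {name: 100.0 * t / total for name, t in steps.items()}

# Print the steps from most to least expensive.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {share:5.1f}%")
```

The two most expensive steps together account for roughly four fifths of the total time, which is why they are the natural targets for acceleration.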

V-D Discussion

From all the results presented above, we draw three major conclusions. First, video-based SOD is much more challenging than image-based SOD. Even the state-of-the-art image-based models are far from perfect without fully utilizing the temporal information from both local and global perspectives. Second, there exist some inherent correlations between image-based and video-based SOD, and the VOS-E subset serves as a good baseline to help extend existing image-based models to the spatiotemporal domain. Third, real-world scenarios are still very challenging for existing models. In user-generated videos, salient objects may be very small or fast moving, and may appear under poor lighting conditions or against cluttered dynamic backgrounds. By handling such challenging scenarios in VOS-N, a model can gain better capability to process real-world scenarios. In particular, fixation prediction models often have impressive performance in detecting the most salient locations even in very complex real-world scenarios [73, 74]. Therefore, developing a better fixation prediction model may be very helpful for handling the VOS-N dataset, in which salient objects are annotated with respect to human fixations.

Fig. 11: Failure cases. (a) Frames, (b) the fixations received in 30ms after a keyframe is displayed, (c) binary masks of salient objects and (d) the estimated saliency maps of SSA.

VI Conclusion

Salient object detection is a hot topic in computer vision. In the past five years, hundreds of innovative models have been proposed for detecting salient objects in images, gradually evolving from bottom-up models to deep models due to the presence of large-scale image datasets. However, the problem of video-based SOD has not been sufficiently explored since a large-scale video dataset is lacking. The most challenging part of building such a dataset is to provide a reasonable and unambiguous definition of salient objects from the spatiotemporal perspective.

To address this problem, this paper proposes VOS, a large-scale dataset with 200 videos. Different from existing datasets, salient objects in VOS are defined by combining human fixations and manually annotated objects throughout a video. As a result, the definition and annotation of salient objects in videos become less ambiguous. Moreover, we propose saliency-guided stacked autoencoders for video-based SOD, which, together with 30 state-of-the-art models, are compared over VOS to show the challenges of video-based SOD as well as its differences from and correlations with image-based SOD. We find that VOS is very challenging because it contains a large number of realistic videos, and that its subset VOS-E serves as a good baseline to extend existing image-based models to the spatiotemporal domain. Moreover, its subset VOS-N covers many real-world scenarios that can help the deployment of better algorithms. This dataset can be very helpful for the area of video-based SOD, and the unsupervised saliency-guided stacked autoencoders can be used as a good baseline model for benchmarking new video-based models.

References


  • [1] T. Liu, J. Sun, N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
  • [2] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, “Frequency-tuned salient region detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [3] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, “Salient object detection: A discriminative regional feature integration approach,” in CVPR, 2013, pp. 2083–2090.
  • [4] N. Tong, H. Lu, X. Ruan, and M.-H. Yang, “Salient object detection via bootstrap learning,” in CVPR, 2015, pp. 1884–1892.
  • [5] D. Zhang, D. Meng, and J. Han, “Co-saliency detection via a self-paced multiple-instance learning framework,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
  • [6] R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by multi-context deep learning,” in CVPR, 2015, pp. 1265–1274.
  • [7] A. Borji, D. N. Sihite, and L. Itti, “Salient object detection: A benchmark,” in European Conference on Computer Vision (ECCV), 2012, pp. 414–429.
  • [8] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, “Salient object detection: A benchmark,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5706–5722, 2015.
  • [9] MSRA10K and THUR15K, http://mmcheng.net/gsal/.
  • [10] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
  • [11] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
  • [12] A. Borji, “What is a salient object? a dataset and a baseline model for salient object detection,” IEEE Transactions on Image Processing, 2014.
  • [13] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, “The secrets of salient object segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [14] H. Peng, B. Li, W. Xiong, W. Hu, and R. Ji, “RGBD salient object detection: a benchmark and algorithms,” in ECCV, 2014, pp. 92–109.
  • [15] H. Li, F. Meng, and K. Ngan, “Co-salient object detection from multiple images,” IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1896–1909, 2013.
  • [16] H. Fu, X. Cao, and Z. Tu, “Cluster-based co-saliency detection,” IEEE Transactions on Image Processing, vol. 22, no. 10, pp. 3766–3778, 2013.
  • [17] D. Zhang, D. Meng, C. Li, L. Jiang, Q. Zhao, and J. Han, “A self-paced multiple-instance learning framework for co-saliency detection,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 594–602.
  • [18] E. Rahtu, J. Kannala, M. Salo, and J. Heikkilä, “Segmenting salient objects from images and videos,” in European Conference on Computer Vision (ECCV), 2010.
  • [19] W.-T. Li, H.-S. Chang, K.-C. Lien, H.-T. Chang, and Y.-C. F. Wang, “Exploring visual and motion saliency for automatic video object extraction,” IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2600–2610, 2013.
  • [20] H. Fu, D. Xu, B. Zhang, S. Lin, and R. K. Ward, “Object-based multiple foreground video co-segmentation via multi-state selection graph,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 3415–3424, Nov 2015.
  • [21] W. Wang, J. Shen, and L. Shao, “Consistent video saliency using local gradient flow optimization and global refinement,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4185–4196, Nov 2015.
  • [22] D. Tsai, M. Flagg, and J. M. Rehg, “Motion coherent tracking with multi-label MRF optimization,” in British Machine Vision Conference (BMVC), 2010.
  • [23] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. V. Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [24] T. Brox and J. Malik, Object Segmentation by Long Term Analysis of Point Trajectories.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 282–295.
  • [25] G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [26] C. Xia, J. Li, X. Chen, A. Zheng, and Y. Zhang, “What is and what is not a salient object? learning salient object detector by ensembling linear exemplar regressors,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [27] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg, “Video segmentation by tracking many figure-ground segments,” in IEEE International Conference on Computer Vision (ICCV), 2013.
  • [28] P. Ochs, J. Malik, and T. Brox, “Segmentation of moving objects by long term video analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1187–1200, June 2014.
  • [29] T. Liu, Z. Yuan, J. Sun, J. Wang, N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 353–367, 2011.
  • [30] J. Li, Y. Tian, T. Huang, and W. Gao, “A dataset and evaluation methodology for visual saliency in video,” in IEEE International Conference on Multimedia and Expo (ICME), 2009, pp. 442–445.
  • [31] R. Carmi and L. Itti, “The role of memory in guiding attention during natural vision,” Journal of Vision, vol. 6, no. 9, pp. 4, 898–914, 2006.
  • [32] T. Vigier, J. Rousseau, M. P. Da Silva, and P. Le Callet, “A new HD and UHD video eye tracking dataset,” in International Conference on Multimedia Systems, 2016, pp. 48:1–48:6.
  • [33] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, “Global contrast based salient region detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
  • [34] X. Shen and Y. Wu, “A unified approach to salient object detection via low rank matrix recovery,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [35] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang, “Saliency detection via absorbing Markov chain,” in IEEE International Conference on Computer Vision (ICCV), 2013.
  • [36] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [37] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, “Minimum barrier salient object detection at 80 fps,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1404–1412.
  • [38] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, “Saliency detection via graph-based manifold ranking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
  • [39] M. M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S. M. Hu, “Global contrast based salient region detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 569–582, March 2015.
  • [40] L. Wang, H. Lu, X. Ruan, and M.-H. Yang, “Deep networks for saliency detection via local estimation and global search,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [41] G. Li and Y. Yu, “Deep contrast learning for salient object detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [42] R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by multi-context deep learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [43] Q. Hou, M.-M. Cheng, X.-W. Hu, A. Borji, Z. Tu, and P. Torr, “Deeply supervised salient object detection with short connection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [44] J. Han, D. Zhang, X. Hu, L. Guo, J. Ren, and F. Wu, “Background prior-based salient object detection via deep reconstruction residual,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 8, pp. 1309–1321, 2015.
  • [45] S. He, R. Lau, W. Liu, Z. Huang, and Q. Yang, “SuperCNN: A superpixelwise convolutional neural network for salient object detection,” International Journal of Computer Vision, vol. 115, no. 3, pp. 330–344, 2015.
  • [46] G. Lee, Y.-W. Tai, and J. Kim, “Deep saliency with encoded low level distance map and high level features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [47] S. S. S. Kruthiventi, V. Gudisa, J. H. Dholakiya, and R. V. Babu, “Saliency unified: A deep architecture for simultaneous eye fixation prediction and salient object segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [48] N. Liu and J. Han, “DHSNet: Deep hierarchical saliency network for salient object detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [49] J. Kuen, Z. Wang, and G. Wang, “Recurrent attentional networks for saliency detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [50] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, “Saliency detection with recurrent fully convolutional networks,” in European Conference on Computer Vision (ECCV), 2016.
  • [51] T. Liu, N. Zheng, W. Ding, and Z. Yuan, “Video attention: Learning to detect a salient object sequence,” in International Conference on Pattern Recognition (ICPR), 2008.
  • [52] K. Fukuchi, K. Miyazato, A. Kimura, S. Takagi, and J. Yamato, “Saliency-based video segmentation with graph cuts and sequentially updated priors,” in IEEE International Conference on Multimedia and Expo (ICME), June 2009, pp. 638–641.
  • [53] S. Bin, Y. Li, L. Ma, W. Wu, and Z. Xie, “Temporally coherent video saliency using regional dynamic contrast,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 12, pp. 2067–2076, 2013.
  • [54] A. Papazoglou and V. Ferrari, “Fast object segmentation in unconstrained video,” in IEEE International Conference on Computer Vision (ICCV), Dec 2013, pp. 1777–1784.
  • [55] W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic video object segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3395–3402.
  • [56] W. C. Chiu and M. Fritz, “Multi-class video co-segmentation with a generative multi-video model,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013, pp. 321–328.
  • [57] T. Brox and J. Malik, “Large displacement optical flow: Descriptor matching in variational motion estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 500–513, March 2011.
  • [58] H. Peng, B. Li, H. Ling, W. Hu, W. Xiong, and S. J. Maybank, “Salient object detection via structured matrix decomposition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 818–832, April 2017.
  • [59] J. Pont-Tuset, P. Arbelaez, J. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping for image segmentation and object proposal generation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
  • [60] J. Li, M. Levine, X. An, X. Xu, and H. He, “Visual saliency based on scale-space analysis in the frequency domain,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 4, pp. 996–1010, 2013.
  • [61] H. Jiang, J. Wang, Z. Yuan, T. Liu, and N. Zheng, “Automatic salient object segmentation based on context and shape prior,” in British Machine Vision Conference (BMVC), 2011.
  • [62] Y. Xie, H. Lu, and M.-H. Yang, “Bayesian saliency via low and mid level cues,” IEEE Transactions on Image Processing, vol. 22, no. 5, 2013.
  • [63] R. Margolin, A. Tal, and L. Zelnik-Manor, “What makes a patch distinct?” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 1139–1146.
  • [64] X. Li, Y. Li, C. Shen, A. R. Dick, and A. van den Hengel, “Contextual hypergraph modeling for salient object detection,” in IEEE International Conference on Computer Vision (ICCV), 2013, pp. 3328–3335.
  • [65] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, “Saliency detection via dense and sparse reconstruction,” in IEEE International Conference on Computer Vision (ICCV), 2013.
  • [66] J. Kim, D. Han, Y.-W. Tai, and J. Kim, “Salient region detection via high-dimensional color transform,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [67] A. Faktor and M. Irani, “Video segmentation by non-local consensus voting,” in British Machine Vision Conference (BMVC), 2014.
  • [68] N. Tong, H. Lu, X. Ruan, and M.-H. Yang, “Salient object detection via bootstrap learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1884–1892.
  • [69] Y. Qin, H. Lu, Y. Xu, and H. Wang, “Saliency detection via cellular automata,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 110–119.
  • [70] P. Jiang, N. Vasconcelos, and J. Peng, “Generic promotion of diffusion-based salient object detection,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 217–225.
  • [71] G. Lee, Y. Tai, and J. Kim, “Deep saliency with encoded low level distance map and high level features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [72] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [73] J. Han, D. Zhang, S. Wen, L. Guo, T. Liu, and X. Li, “Two-stage learning to predict human eye fixations via SDAEs,” IEEE Transactions on Cybernetics, vol. 46, no. 2, pp. 487–498, Feb 2016.
  • [74] Y. Fang, W. Lin, Z. Chen, C.-M. Tsai, and C.-W. Lin, “A video saliency detection model in compressed domain,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 1, pp. 27–38, 2014.