Semi-Supervised Video Salient Object Detection Using Pseudo-Labels

08/12/2019 ∙ by Pengxiang Yan, et al. ∙ Megvii Technology Limited ∙ The University of Hong Kong ∙ Sun Yat-sen University

Deep learning-based video salient object detection has recently achieved great success, with performance that significantly surpasses unsupervised methods. However, existing data-driven approaches rely heavily on a large quantity of pixel-wise annotated video frames to deliver such promising results. In this paper, we address the semi-supervised video salient object detection task using pseudo-labels. Specifically, we present an effective video saliency detector that consists of a spatial refinement network and a spatiotemporal module. Based on the same refinement network and motion information in terms of optical flow, we further propose a novel method for generating pixel-level pseudo-labels from sparsely annotated frames. By utilizing the generated pseudo-labels together with part of the manual annotations, our video saliency detector learns spatial and temporal cues for both contrast inference and coherence enhancement, thus producing accurate saliency maps. Experimental results demonstrate that our proposed semi-supervised method greatly outperforms all state-of-the-art fully supervised methods on three public benchmarks: VOS, DAVIS, and FBMS.







1 Introduction

Salient object detection aims at identifying the most visually distinctive objects in an image or video that attract human attention. In contrast to the other type of saliency detection, i.e., eye fixation prediction [kruthiventi2017deepfix, wang2018revisiting] which is designed to locate the focus of human attention, salient object detection focuses on segmenting the most salient objects with precise contours. This topic has drawn widespread interest as it can be applied to a wide range of vision applications, such as object segmentation [wei2017stc], visual tracking [wu2014weighted], video compression [itti2004automatic], and video summarization [ma2002user].

Recently, video salient object detection has achieved significant progress [li2018flow, song2018pyramid, wang2018video] due to the development of deep convolutional neural networks (CNNs). However, the performance of these deep learning-based methods comes at the cost of a large quantity of densely annotated frames. Manually annotating video frames at the pixel level is arduous and time consuming, since even an experienced annotator needs several minutes to label a single frame. Moreover, a video clip usually contains hundreds of frames with similar content. To reduce the impact of label noise on model training, annotators need to spend considerable time checking label consistency across preceding and subsequent frames. Considering that visual saliency is subjective, the annotation work becomes even more difficult, and labeling quality is hard to guarantee.

Figure 1: Example ground truth masks (orange mask) vs. our generated pseudo-labels (blue mask) from the VOS [li2018benchmark] dataset.

Although there are many unsupervised video salient object detection methods [wang2015saliency, wang2015consistent, li2018benchmark] that require no training samples, these methods suffer from low prediction accuracy and efficiency. Since most of them exploit hand-crafted low-level features, e.g., color, gradient, or contrast, they work well in simple cases but fail in more challenging ones. Recent research by Li et al. [li2018weakly] noticed the weakness of unsupervised methods and the lack of annotations for deep learning-based methods. They attempted to use a combination of coarse activation maps and saliency maps, generated by learning-based classification networks and unsupervised methods respectively, as pixel-wise training annotations for image salient object detection. However, this method is not suitable for video-based salient object detection, where object motion and changes in appearance contrast are more attractive to human attention [itti1998model] than object categories. Moreover, it is also challenging to train deep learning-based video salient object detection models to generate temporally consistent saliency maps, due to the lack of temporal cues in sparsely annotated frames.

By carefully observing the training samples of existing video salient object detection benchmarks [li2018benchmark, perazzi2016benchmark, brox2010object], we found that adjacent frames in a video differ only slightly due to the high video sampling rate (e.g., 24 fps in the DAVIS [perazzi2016benchmark] dataset). Thus, we conjecture that it is not necessary to densely annotate all frames, since some of the annotations can be estimated by exploiting motion information. Moreover, recent work has shown that a well-trained CNN can also correct some manual annotation errors that exist in the training samples.


Inspired by these observations, in this paper, we address the semi-supervised video salient object detection task using unannotated frames with pseudo-labels as well as a few sparsely annotated frames. We develop a framework that exploits pixel-wise pseudo-labels generated from a few ground truth labels to train a video-based convolutional network that produces saliency maps with spatiotemporal coherence. Specifically, we first propose a refinement network with residual connections (RCRNet) to extract spatial saliency information and generate high-resolution saliency maps through a series of upsampling-based refinement operations. Then, the RCRNet equipped with a non-locally enhanced recurrent (NER) module is proposed to enhance the spatiotemporal coherence of the resulting saliency maps. For pseudo-label generation, we adopt a pretrained FlowNet 2.0 [ilg2017flownet] for motion estimation between labeled and unlabeled frames and propagate adjacent labels to unlabeled frames. Meanwhile, another RCRNet is modified to accept multiple channels as input, including the RGB channels, propagated adjacent ground truth annotations, and motion estimations, to generate consecutive pixel-wise pseudo-labels, which make up for the temporal information deficiency of sparse annotations. As shown in Fig. 1, our model can produce reasonable and consistent pseudo-labels, which can even improve the boundary details (Example a) and overcome the labeling ambiguity between frames (Example b). Learning under the supervision of generated pseudo-labels together with a few ground truth labels, our proposed RCRNet with NER module (RCRNet+NER) can generate more accurate saliency maps, even outperforming top-performing fully supervised video salient object detection methods.

In summary, this paper has the following contributions:

We introduce a refinement network equipped with a non-locally enhanced recurrent module to generate saliency maps with spatiotemporal coherence.

We further propose a flow-guided pseudo-label generator, which captures the interframe continuity of video and generates pseudo-labels in the intervals between sparse annotations.

Under the joint supervision of the generated pseudo-labels and the manually labeled sparse annotations (e.g., 20% ground truth labels), our semi-supervised model can be trained to outperform existing state-of-the-art fully supervised video salient object detection methods.

2 Related Work

2.1 Salient Object Detection

Benefiting from the development of deep convolutional networks, salient object detection has recently achieved significant progress. In particular, methods based on fully convolutional networks (FCNs) and their variants [li2017instance, hou2017deeply, li2016deep] have become dominant in this field, due to their powerful end-to-end feature learning and high computational efficiency. Nevertheless, these methods are inapplicable to video salient object detection, as they consider neither spatiotemporal information nor the contrast information within both motion and appearance in videos. Recently, attempts to apply deep CNNs to video salient object detection have attracted considerable research interest. Wang et al. [wang2018video] introduced FCNs to this problem by taking adjacent pairs of frames as input. However, this method fails to learn sufficient spatiotemporal information from a limited number of input frames. To overcome this deficiency, Li et al. [li2018flow] proposed to enhance temporal coherence at the feature level by exploiting both motion information and sequential feature evolution encoding. Fan et al. [fan2019shifting] proposed to capture video dynamics with a saliency-shift-aware module that learns human attention shift. However, all the above methods rely on densely annotated video datasets, and none of them has attempted to reduce the dependence on dense labeling.

To the best of our knowledge, we are the first to explore the video salient object detection task while reducing the dependence on dense labeling. Moreover, we verify that the generated pseudo-labels can overcome the ambiguity in the labeling process to some extent, thus helping our model achieve better performance.

2.2 Video Object Segmentation

Video object segmentation tasks can be divided into two categories: semi-supervised video object segmentation [jain2014supervoxel, chockalingam2009adaptive] and unsupervised video object segmentation [tokmakov2017learningvideo, jain2017fusionseg]. Semi-supervised video object segmentation aims at tracking, through subsequent frames, a target mask given in the first annotated frame, while unsupervised video object segmentation aims at automatically detecting the primary objects throughout the whole video sequence. It should be noted that "supervised" and "semi-supervised" here refer to the test phase; the training process of both tasks is fully supervised. In contrast, the semi-supervised video salient object detection considered in this paper aims at reducing the labeling dependence of training samples during the training process. Unsupervised video object segmentation is the task most related to ours, as both require no annotations during the inference phase. It can be achieved by graph cut [papazoglou2013fast], saliency detection [wang2015saliency], motion analysis [li2018unsupervised], or object proposal ranking [lee2011key]. Recently, unsupervised video object segmentation methods have mainly been based on deep networks, such as two-stream architectures [jain2017fusionseg], FCNs [cheng2017segflow], and recurrent networks [tokmakov2017learningvideo]. However, most of these deep learning methods rely on a large quantity of pixel-wise labels for fully supervised training.

In this paper, we address the semi-supervised video salient object detection task using pseudo-labels with a few annotated frames. Although our proposed model is trained with semi-supervision, it is still well applicable to unsupervised video object segmentation.

3 Our Approach

In this section, we elaborate on the details of the proposed framework for semi-supervised video salient object detection, which consists of three major components. First, a refinement network with residual connections is proposed to provide a spatial feature extractor and a pixel-wise classifier for salient object detection, used respectively for extracting spatial saliency features from raw input images and for decoding these features into pixel-wise saliency maps, with low-level cues connected to high-level features. Second, a non-locally enhanced recurrent module is designed to enhance the spatiotemporal coherence of the feature representation. Finally, a flow-guided pseudo-label generation (FGPLG) model, comprising a modified RCRNet and an off-the-shelf FlowNet 2.0 model [ilg2017flownet], is applied to generate in-between pseudo-labels from sparsely annotated video frames. With an appropriate number of pseudo-labels, RCRNet with the NER module can be trained to capture spatiotemporal information and generate accurate saliency maps for dense input frames.

Figure 2: The architecture of our refinement network with residual connection (RCRNet). Here, ‘⊕’ denotes element-wise addition. Output stride (OS) denotes the ratio of the input image size to the output feature map size.

Figure 3: The architecture of our proposed video salient object detection network (RCRNet+NER). We incorporate a non-locally enhanced temporal module with our proposed RCRNet for spatiotemporal coherence modeling.

3.1 Refinement Network with Residual Connection

Typical deep convolutional neural networks extract high-level features from low-level cues of images, such as colors and textures, using a stack of convolutional layers and downsampling operations. Downsampling obtains an abstract feature representation by gradually increasing the receptive field of the convolutional layers, but many spatial details are lost in the process. Without sufficient spatial details, pixel-wise prediction tasks such as salient object detection cannot make precise predictions on object boundaries or small objects. Inspired by [li2017instance], we adopt a refinement architecture to incorporate low-level spatial information in the decoding process for pixel-level saliency inference. As shown in Fig. 2, the proposed RCRNet consists of a spatial feature extractor and a pixel-wise classifier connected by three connection layers at different stages. The output saliency map of a given frame I can be computed as

S = f_cls(f_feat(I)),

where f_feat denotes the spatial feature extractor and f_cls the pixel-wise classifier.
Spatial Feature Extractor: The spatial feature extractor is based on a ResNet-50 [he2016deep] model. Specifically, we use the first five groups of layers of ResNet-50 and remove the downsampling operations in conv5_x to reduce the loss of spatial information. To maintain the same receptive field, we use dilated convolutions [yu2015multi] with a dilation rate of 2 to replace the convolutional layers in the last group. Then we attach an atrous spatial pyramid pooling (ASPP) [chen2017rethinking] module to the last layer, which captures both the image-level global context and the multiscale spatial context. Finally, the spatial feature extractor produces a feature map with 256 channels at 1/16 of the original input resolution (OS = 16).
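To make the output-stride arithmetic above concrete, the following is a minimal PyTorch sketch of an ASPP head; the dilation rates (6, 12, 18) are common DeepLab-style choices assumed for illustration, not values confirmed by the paper.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Minimal atrous spatial pyramid pooling: parallel dilated 3x3 convs for
    multiscale context plus global average pooling for image-level context."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        # Broadcast the pooled image-level context back to the feature grid.
        feats.append(self.pool(x).expand(-1, -1, x.size(2), x.size(3)))
        return self.project(torch.cat(feats, dim=1))

# With downsampling removed from conv5_x, ResNet-50 keeps OS = 16, so a
# 448x448 frame yields a 28x28, 2048-channel feature map; ASPP compresses
# it to 256 channels at the same resolution.
aspp = ASPP()
feat = aspp(torch.randn(1, 2048, 28, 28))
```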

Pixel-wise Classifier:

The pixel-wise classifier is composed of three cascaded refinement blocks, each of which is connected to a layer in the spatial feature extractor via a connection layer. It is designed to mitigate the loss of spatial details caused by downsampling. Each refinement block takes as input the previous bottom-up output feature map and its corresponding feature map connected from the top-down stream. The resolutions of these two feature maps must be consistent, so an upsampling operation via bilinear interpolation is performed when necessary. The refinement block works by first concatenating the two feature maps and then feeding them to another convolutional layer. Motivated by [he2016deep], a residual bottleneck architecture, named the residual skip connection layer, is employed as the connection layer to connect low-level features to high-level ones. It reduces the channel dimension of the low-level feature maps and brings more spatial information to the refinement block. Residual learning allows us to connect the pixel-wise classifier to the pretrained spatial feature extractor without breaking its initial state (e.g., if the weight of the residual bottleneck is initialized as zero).
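A sketch of how a refinement block with a residual skip connection might look in PyTorch; the channel widths, 3x3 kernel sizes, and zero-initialized residual branch are illustrative assumptions based on the description above, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualSkip(nn.Module):
    """Residual bottleneck connection: projects low-level features to the
    decoder width; zero-initializing the last conv leaves the pretrained
    encoder's initial behavior untouched at the start of training."""
    def __init__(self, low_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(low_ch, out_ch, 1)
        self.res = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1))
        nn.init.zeros_(self.res[-1].weight)
        nn.init.zeros_(self.res[-1].bias)

    def forward(self, low):
        x = self.proj(low)
        return x + self.res(x)

class RefineBlock(nn.Module):
    """Upsample the bottom-up feature map to match the skip feature,
    concatenate, and fuse with a 3x3 convolution."""
    def __init__(self, up_ch, skip_ch, out_ch=256):
        super().__init__()
        self.skip = ResidualSkip(skip_ch, out_ch)
        self.fuse = nn.Conv2d(up_ch + out_ch, out_ch, 3, padding=1)

    def forward(self, up, low):
        skip = self.skip(low)
        if up.shape[-2:] != skip.shape[-2:]:
            up = F.interpolate(up, size=skip.shape[-2:], mode='bilinear',
                               align_corners=False)
        return self.fuse(torch.cat([up, skip], dim=1))

block = RefineBlock(up_ch=256, skip_ch=512)
fused = block(torch.randn(1, 256, 14, 14), torch.randn(1, 512, 28, 28))
```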

3.2 Non-locally Enhanced Recurrent Module

Given a video clip {I_1, I_2, ..., I_T}, video salient object detection aims at producing the saliency maps {S_1, S_2, ..., S_T} of all frames. Although the proposed RCRNet specializes in spatial saliency learning, it still lacks spatiotemporal modeling for video frames. Thus, we further propose a non-locally enhanced recurrent (NER) module, which consists of two non-local blocks [wang2018non] and a convolutional GRU (ConvGRU) [ballas2015delving] module, to improve spatiotemporal coherence in high-level features. As shown in Fig. 3, incorporated with the NER module, RCRNet can be extended to video-based salient object detection.

Specifically, we first combine the features extracted from the input video frames as X = [x_1, x_2, ..., x_T]. Here, [·, ·, ..., ·] denotes the concatenation operation and x_t is the spatial feature extracted from frame I_t by the spatial feature extractor. Then, the combined feature X is fed into a non-local block. The non-local block computes the response at each position as a weighted sum of the features at all positions of the input feature maps, and can thus construct spatiotemporal connections among the features of the input video frames.
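The weighted-sum operation of a non-local block can be sketched as follows; this is an embedded-Gaussian formulation in PyTorch, with the output projection zero-initialized so the block starts as an identity mapping (an assumption for illustration, not necessarily the paper's exact configuration).

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block over spacetime: each position
    attends to all (frame, height, width) positions of the clip."""
    def __init__(self, ch, inner=None):
        super().__init__()
        inner = inner or ch // 2
        self.theta = nn.Conv3d(ch, inner, 1)   # query embedding
        self.phi = nn.Conv3d(ch, inner, 1)     # key embedding
        self.g = nn.Conv3d(ch, inner, 1)       # value embedding
        self.out = nn.Conv3d(inner, ch, 1)
        nn.init.zeros_(self.out.weight)        # start as identity (residual)
        nn.init.zeros_(self.out.bias)

    def forward(self, x):                      # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, N, C')
        k = self.phi(x).flatten(2)                     # (B, C', N)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, N, C')
        attn = torch.softmax(q @ k, dim=-1)            # pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, t, h, w)
        return x + self.out(y)                         # residual connection

nl = NonLocalBlock(64)
x = torch.randn(2, 64, 4, 7, 7)   # (batch, channels, frames, H, W)
y = nl(x)
```

Because the output projection is zero-initialized, inserting the block into a pretrained network initially leaves its features unchanged.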

On the other hand, as a video sequence is composed of a series of scenes captured in chronological order, it is also necessary to characterize the sequential evolution of appearance contrast in the temporal domain. Based on this, we propose to exploit ConvGRU [ballas2015delving] modules for sequential feature evolution modeling. ConvGRU is an extension of the traditional fully connected GRU [cho2014learning] that has convolutional structures in both input-to-state and state-to-state connections. Let x_t denote the input to ConvGRU and h_t stand for its hidden state. A ConvGRU module consists of a reset gate r_t and an update gate z_t. With these two gates, ConvGRU can achieve selective memorization and forgetting. Given the above definitions, the overall updating process of ConvGRU unrolled over time can be listed as follows:

z_t = σ(W_z * x_t + U_z * h_{t-1}),
r_t = σ(W_r * x_t + U_r * h_{t-1}),
h̃_t = tanh(W_h * x_t + U_h * (r_t ⊙ h_{t-1})),
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t,

where ‘*’ denotes the convolution operator and ‘⊙’ denotes the Hadamard product. σ represents the sigmoid function, and W_z, W_r, W_h, U_z, U_r, U_h represent the learnable weight matrices. For notational simplicity, the bias terms are omitted.
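The gate equations above translate directly into a small PyTorch cell; the 3x3 kernel size and the zero initial hidden state are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """ConvGRU cell: a GRU whose input-to-state and state-to-state
    transforms are convolutions instead of fully connected layers."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        # One conv produces both gates; another produces the candidate state.
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)
        self.hid_ch = hid_ch

    def forward(self, x, h):
        if h is None:  # zero initial state
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
        z, r = torch.sigmoid(self.conv_zr(torch.cat([x, h], 1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.conv_h(torch.cat([x, r * h], 1)))  # candidate
        return (1 - z) * h + z * h_tilde   # selective memorization/forgetting

cell = ConvGRUCell(16, 16)
h = None
for _ in range(4):                 # unroll over a clip of 4 frames
    h = cell(torch.randn(1, 16, 8, 8), h)
```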

Motivated by [song2018pyramid], we stack two ConvGRU modules in forward and backward directions to strengthen the spatiotemporal information exchange between the two directions. In this way, the deeper bidirectional ConvGRU (DB-ConvGRU) can memorize not only past sequences but also future ones. It can be formulated as follows:

h_t^f = ConvGRU_f(x_t, h_{t-1}^f),
h_t^b = ConvGRU_b(h_t^f, h_{t+1}^b),
h_t = h_t^f + h_t^b,

where h_t^f and h_t^b represent the hidden states of the forward and backward ConvGRU units, respectively, h_t represents the final output of DB-ConvGRU, and x_t is the output feature from the preceding non-local block.
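The forward/backward data flow of DB-ConvGRU can be sketched independently of the cell internals. Below, trivial scalar stand-in cells replace real ConvGRU modules to show the "deeper" wiring (the backward pass consumes forward hidden states rather than raw inputs); fusing the two directions by addition is an assumption.

```python
def db_recurrence(xs, cell_f, cell_b):
    """Deeper bidirectional recurrence: forward pass over the inputs,
    backward pass over the forward hidden states, outputs fused by sum."""
    hs_f, h = [], None
    for x in xs:                         # forward direction over the clip
        h = cell_f(x, h)
        hs_f.append(h)
    hs_b, h = [None] * len(xs), None
    for t in reversed(range(len(xs))):   # backward direction over forward states
        h = cell_b(hs_f[t], h)
        hs_b[t] = h
    return [f + b for f, b in zip(hs_f, hs_b)]

# Tiny stand-in "cells" (leaky accumulators) just to exercise the data flow;
# in the real model these would be ConvGRU modules operating on feature maps.
cell = lambda x, h: 0.5 * x + (0.5 * h if h is not None else 0.0)
outs = db_recurrence([1.0, 2.0, 3.0], cell, cell)
```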

As proven in [wang2018non], more non-local blocks in general lead to better results. Thus, we attach another non-local block to DB-ConvGRU to further enhance spatiotemporal coherence.

Figure 4: The architecture of our proposed flow-guided pseudo-label generation model (FGPLG).

3.3 Flow-Guided Pseudo-Label Generation Model

Although the proposed RCRNet+NER has great potential to produce saliency maps with spatiotemporal coherence, it can barely learn sufficient temporal information from only a few sparsely annotated frames, which greatly reduces the temporal coherence of the resulting saliency maps. To solve this problem, we attempt to generate denser pseudo-labels from a few sparse annotations and train our video saliency model with both types of labels.

Given a triplet of input video frames (I_{t-l}, I_t, I_{t+l}), the proposed FGPLG model aims at generating a pseudo-label for frame I_t from the ground truth annotations G_{t-l} and G_{t+l} propagated from frames I_{t-l} and I_{t+l}, respectively. First, it computes the optical flow F_{t→t-l} from frame I_t to frame I_{t-l} with the off-the-shelf FlowNet 2.0. The optical flow F_{t→t+l} is obtained in the same way. Then, the label of frame I_t is estimated by applying a warping function W to the adjacent ground truth annotations, yielding W(G_{t-l}, F_{t→t-l}) and W(G_{t+l}, F_{t→t+l}). Nevertheless, as we can see in Fig. 4, the warped ground truth maps are still too noisy to be used as supervisory information for practical training. Although the optical flow magnitudes |F_{t→t-l}| and |F_{t→t+l}| provide reasonable estimations of the motion mask of frame I_t, they cannot be employed as the estimated ground truth directly, since not all motion masks are salient. To further refine the estimated pseudo-label of frame I_t, another RCRNet is modified to accept an input with 7 channels, including the RGB channels of frame I_t, the two warped adjacent ground truth maps, and the two optical flow magnitudes. With the above settings, a more reasonable and precise pseudo-label of frame I_t can be generated as:

G̃_t = RCRNet([I_t, W(G_{t-l}, F_{t→t-l}), W(G_{t+l}, F_{t→t+l}), |F_{t→t-l}|, |F_{t→t+l}|]).

Here, the magnitude of the optical flow is calculated by first normalizing the flow and then computing its per-pixel Euclidean norm.
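A minimal NumPy sketch of the two operations used here, label warping and flow-magnitude computation; the nearest-neighbor sampling, border clipping, and normalization by the frame size are simplifying assumptions rather than the paper's exact implementation.

```python
import numpy as np

def flow_magnitude(flow):
    """Normalize a flow field (H, W, 2) by the frame size, then take the
    per-pixel Euclidean norm."""
    h, w = flow.shape[:2]
    return np.linalg.norm(flow / np.array([w, h], np.float32), axis=-1)

def warp_label(label, flow):
    """Backward-warp a label map: the warped value at pixel p is read from
    p + flow(p) in the source label (nearest-neighbor, clipped at borders)."""
    h, w = label.shape
    ys, xs = np.mgrid[0:h, 0:w]
    sx = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    sy = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return label[sy, sx]

label = np.zeros((4, 4), np.float32)
label[2, 2] = 1.0                       # a single "salient" pixel
flow = np.zeros((4, 4, 2), np.float32)
flow[..., 0] = 1.0                      # uniform 1-pixel motion along x
warped = warp_label(label, flow)        # content shifts one pixel left
mag = flow_magnitude(flow)              # constant magnitude 1/4
```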

The generation model can be trained with sparsely annotated frames to generate denser pseudo-labels. In our experiments, we use a fixed interval to select sparse annotations for training, taking one annotation every l frames, i.e., the interval between frames I_{t-l} and I_t and the interval between frames I_t and I_{t+l} are both equal to l. Experimental results show that the generation model designed in this way has a strong generalization ability: a model trained on triplets sampled at larger interframe intervals can still generate dense pseudo-labels of very high quality.

Dataset  Metric  MC     RBD    MB+    RFCN   DCL    DHS    DSS    MSR    DGRL   PiCA   SAG    GF     SSA    FCNS   FGRN   PDB    Ours
VOS      max F   0.558  0.589  0.577  0.680  0.704  0.715  0.703  0.719  0.723  0.734  0.541  0.529  0.669  0.681  0.714  0.741  0.856
VOS      S       0.612  0.652  0.638  0.721  0.728  0.783  0.760  0.764  0.776  0.796  0.597  0.560  0.710  0.727  0.734  0.797  0.872
DAVIS    max F   0.488  0.481  0.520  0.732  0.760  0.785  0.775  0.775  0.758  0.809  0.519  0.619  0.697  0.764  0.797  0.849  0.859
DAVIS    S       0.590  0.620  0.568  0.788  0.803  0.820  0.814  0.789  0.811  0.844  0.663  0.686  0.738  0.757  0.838  0.878  0.884
FBMS     max F   0.466  0.488  0.540  0.764  0.760  0.765  0.776  0.809  0.813  0.823  0.545  0.609  0.597  0.752  0.801  0.823  0.861
FBMS     S       0.567  0.591  0.586  0.765  0.772  0.793  0.793  0.835  0.832  0.847  0.632  0.642  0.634  0.747  0.818  0.839  0.870

  • Note that our model is a semi-supervised learning model using only approximately 20% of the ground truth labels for training.

Table 1: Comparison of quantitative results using maximum F-measure (larger is better) and S-measure (larger is better). Model categories: I+C, image-based classic unsupervised or non-deep learning methods (MC, RBD, MB+); I+D, image-based deep learning methods (RFCN, DCL, DHS, DSS, MSR, DGRL, PiCA); V+U, video-based unsupervised methods (SAG, GF, SSA); V+D, video-based deep learning methods (FCNS, FGRN, PDB, and ours). Refer to the supplemental document for more detailed results.

Figure 5: Comparison of precision-recall curves of 15 saliency detection methods on the VOS, DAVIS and FBMS datasets. Our proposed RCRNet+NER consistently outperforms other methods across three testing datasets using only 20% of ground truth labels.

4 Experimental Results

4.1 Datasets and Evaluation

We evaluate the performance of our method on three public datasets: VOS [li2018benchmark], DAVIS [perazzi2016benchmark], and FBMS [brox2010object]. VOS is a large-scale dataset with 200 indoor/outdoor videos for video-based salient object detection; it contains 116,103 frames, including 7,650 pixel-wise annotated keyframes. The DAVIS dataset contains 50 high-quality videos with a total of 3,455 pixel-wise annotated frames. The FBMS dataset contains 59 videos, totaling 720 sparsely annotated frames. We evaluate our trained RCRNet+NER on the test sets of VOS, DAVIS, and FBMS for the task of video salient object detection.

We adopt precision-recall curves (PR), maximum F-measure, and S-measure for evaluation. The F-measure is defined as

F_β = (1 + β²) · Precision · Recall / (β² · Precision + Recall).

Here, β² is set to 0.3, as done by most existing image-based models [borji2015salient, li2017instance, hou2017deeply]. We report the maximum F-measure computed from all precision-recall pairs. The S-measure is a measure proposed in [fan2017structure], which simultaneously evaluates both region-aware and object-aware structural similarity between a saliency map and its corresponding ground truth.
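The maximum F-measure sweep can be sketched as follows; the number of thresholds and the strict binarization rule are assumptions for illustration.

```python
import numpy as np

def max_f_measure(pred, gt, beta2=0.3, steps=256):
    """Binarize a [0, 1] saliency map at a sweep of thresholds, compute
    precision/recall at each, and keep the best F-measure (beta^2 = 0.3)."""
    best = 0.0
    for t in np.linspace(0.0, 1.0, steps, endpoint=False):
        b = pred > t                          # binarized saliency map
        tp = np.logical_and(b, gt).sum()      # true positives
        if tp == 0:
            continue
        prec, rec = tp / b.sum(), tp / gt.sum()
        best = max(best, (1 + beta2) * prec * rec / (beta2 * prec + rec))
    return best

gt = np.array([[0, 1], [1, 0]], dtype=bool)
score = max_f_measure(gt.astype(np.float32), gt)  # a perfect map scores 1.0
```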

4.2 Implementation Details

Our proposed method is implemented on PyTorch, a flexible open-source deep learning platform. First, we initialize the weights of the spatial feature extractor in RCRNet with an ImageNet [deng2009imagenet] pretrained ResNet-50 [he2016deep]. Next, we pretrain RCRNet on two image saliency datasets, i.e., MSRA-B [liu2011learning] and HKU-IS [li2015visual], for spatial saliency learning. For semi-supervised video salient object detection, we combine the training sets of VOS [li2018benchmark], DAVIS [perazzi2016benchmark], and FBMS [brox2010object] as our training set. The RCRNet pretrained on image saliency datasets is used as the backbone of the pseudo-label generator. The FGPLG is then fine-tuned with a subset of the video training set and used to generate pseudo-labels. By utilizing the pseudo-labels together with the subset, we jointly train RCRNet+NER, which takes a video clip of length T as input and generates saliency maps for all input frames. Due to the limitation of machine memory, the default value of T is set to 4 in our experiments.

During training, we adopt Adam [kingma2014adam] as the optimizer. The learning rate is initially set to 1e-4 when training RCRNet, and to 1e-5 when fine-tuning RCRNet+NER and FGPLG. The input images or video frames are resized to 448 × 448 before being fed into the network in both the training and inference phases. We use the sigmoid cross-entropy loss as the loss function and compute the loss between each input image/frame and its corresponding label, even if it is a pseudo-label. In Section 4.4, we explore the effect of different amounts of ground truth (GT) and pseudo-label usage. It shows that when we take one GT label and generate one pseudo-label every five frames (column ‘1 / 5’ in Table 2) as the new training set, RCRNet+NER can be trained to outperform the model trained with all ground truth labels on the VOS dataset. We use this setting when performing external comparisons with existing state-of-the-art methods. In this setting, it takes approximately 10 hours to finish the whole training process on a workstation with an NVIDIA GTX 1080 GPU and a 2.4 GHz Intel CPU. In the inference phase, it takes approximately 37 ms to generate a saliency map for a 448 × 448 input frame, which reaches a real-time speed of 27 fps.
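The loss treats ground truth and pseudo-labels identically; a minimal sketch with placeholder tensor shapes (not the paper's actual resolutions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
criterion = nn.BCEWithLogitsLoss()       # sigmoid cross-entropy

# A clip of T = 4 frames: predicted saliency logits vs. binary masks, where
# each mask may be a ground truth annotation or a generated pseudo-label.
logits = torch.randn(4, 1, 56, 56)
labels = torch.randint(0, 2, (4, 1, 56, 56)).float()
loss = criterion(logits, labels)         # same treatment for both label types
```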

4.3 Comparison with State-of-the-Art

We compare our video saliency model (RCRNet+NER) against 16 state-of-the-art image/video saliency methods, including MC [jiang2013saliency], RBD [zhu2014saliency], MB+ [zhang2015minimum], RFCN [wang2016saliency], DCL [li2016deep], DHS [liu2016dhsnet], DSS [hou2017deeply], MSR [li2017instance], DGRL [wang2018detect], PiCA [liu2018picanet], SAG [wang2015saliency], GF [wang2015consistent], SSA [li2018benchmark], FCNS [wang2018video], FGRN [li2018flow], and PDB [song2018pyramid]. For a fair comparison, we use the implementations provided by the authors and fine-tune all the deep learning-based methods using the same training set, as mentioned in Section  4.2.

A visual comparison is given in Fig. 6. As shown in the figure, RCRNet+NER can not only accurately detect salient objects but also generate precise and consistent saliency maps in various challenging cases. As part of the quantitative evaluation, we show a comparison of PR curves in Fig. 5. Moreover, a quantitative comparison of maximum F-measure and S-measure is listed in Table 1. As can be seen, our method outperforms all the state-of-the-art image-based and video-based saliency detection methods on VOS, DAVIS, and FBMS. Specifically, our RCRNet+NER improves the maximum F-measure achieved by the existing best-performing algorithms by 15.52%, 1.18%, and 4.62% on VOS, DAVIS, and FBMS, respectively, and improves the S-measure by 9.41%, 0.68%, and 2.72%, respectively. It is worth noting that our proposed method uses only approximately 20% of the ground truth maps during training to outperform the best-performing fully supervised video-based method (PDB), even though both models are based on the same backbone network (ResNet-50).

Figure 6: Visual comparison of saliency maps generated by state-of-the-art methods, including our RCRNet+NER. The ground truth (GT) is shown in the last column. Our model consistently produces saliency maps closest to the ground truth. Zoom in for details.

Figure 7: Sensitivity analysis on the amount of ground truth label usage.

Labels (n / m)   0 / 1   0 / 2   0 / 5   1 / 5   4 / 5   0 / 20   7 / 20   19 / 20
GT proportion    100%    50%     20%     20%     20%     5%       5%       5%
Pseudo           0%      0%      0%      20%     80%     0%       35%      95%
max F-measure    0.849   0.850   0.849   0.861   0.850   0.821    0.847    0.845
S-measure        0.873   0.869   0.867   0.874   0.873   0.832    0.861    0.860

Table 2: Representative quantitative results under different amounts of ground truth (GT) and pseudo-label usage. Here, m refers to the GT label interval and n denotes the number of pseudo-labels used in each interval. For example, ‘0 / 5’ means using one GT label every five frames with no pseudo-labels, and ‘1 / 5’ means using one GT label and generating one pseudo-label every five frames. Refer to the supplemental document for more detailed analysis.

4.4 Sensitivity to Different Amounts of Ground Truth and Pseudo-Label Usage

As described in Section 4.3, RCRNet+NER achieves state-of-the-art performance using only a few GT labels and generated pseudo-labels for training. To demonstrate the effectiveness of our proposed semi-supervised framework, we explore the sensitivity to different amounts of GT and pseudo-label usage on the VOS dataset. First, we take a subset of the training set of VOS at a fixed interval and then fine-tune RCRNet+NER with it. By repeating this experiment with different fixed intervals, we show the performance of RCRNet+NER trained with different numbers of GT labels in Fig. 7. As shown in the figure, when the number of GT labels is severely insufficient (e.g., 5% of the original training set), RCRNet+NER benefits substantially from an increase in GT label usage. An interesting phenomenon is that when the training set is large enough, denser label data does not necessarily lead to better performance. Considering that adjacent densely annotated frames differ only slightly, ambiguity is usually inevitable during the manual labeling procedure, which may lead to overfitting and hurt the generalization performance of the model.

Then, we further use the proposed FGPLG to generate different numbers of pseudo-labels from different numbers of GT labels. Some representative quantitative results are shown in Table 2, where we find that when GT labels are insufficient, adding an appropriate number of generated pseudo-labels for training can effectively improve performance. Furthermore, when we use 20% of the annotations and 20% pseudo-labels (column ‘1 / 5’ in the table) to train RCRNet+NER, it reaches the maximum F-measure and S-measure on the test set of VOS, surpassing the model trained with all GT labels. Even when trained with 5% of the annotations and 35% pseudo-labels (column ‘7 / 20’ in the table), our model produces comparable results. This interesting phenomenon demonstrates that pseudo-labels can overcome labeling ambiguity to some extent. Moreover, it also indicates that it is not necessary to densely annotate all video frames manually, considering the redundancy between adjacent frames. Under the premise of the same labeling effort, a sparse labeling strategy that covers more kinds of video content, assisted by generated pseudo-labels for training, brings a greater performance gain.

4.5 Ablation Studies

To investigate the effectiveness of the proposed modules, we conduct ablation studies on the VOS dataset.

The effectiveness of NER. As described in Section 3.2, our proposed NER module contains three cascaded submodules: a non-local block, a DB-ConvGRU module, and another non-local block. To validate the effectiveness and necessity of each submodule, we compare RCRNet equipped with the full NER module against four ablated variants on the test set of VOS. Here, we use one ground truth label and one pseudo-label every five frames as the training set, to fix the impact of the amount of GT and pseudo-label usage. As shown in Table 3, adding non-local blocks and DB-ConvGRU each brings a certain level of performance improvement. On top of the variant with the first non-local block and DB-ConvGRU, adding an extra non-local block further increases the maximum F-measure by 0.5%. Comparing the variants with ConvGRU and with DB-ConvGRU, we observe that DB-ConvGRU is indeed superior to ConvGRU, as it involves deeper bidirectional sequential modeling.

The effectiveness of FGPLG. As mentioned in Section 3.3, the FGPLG model takes multiple channels as input to generate pseudo-labels, including the image RGB channels, warped adjacent ground-truth maps, and the magnitude of the optical flow. To validate the effectiveness and necessity of each component, we train three separate RCRNet+NER models with pseudo-labels generated by our proposed FGPLG and two of its variants, each taking different channels as input. Here, we use one ground-truth label and seven pseudo-labels every 20 frames as the training set for comparison. We also include, as a baseline, the performance of the model trained without pseudo-labels. As shown in Table 4, the models trained with pseudo-labels all surpass the baseline, which further validates the effectiveness of using pseudo-labels for training. On top of the RGB-only variant, adding the adjacent ground truth as input slightly improves the performance, while our full pseudo-label generator outperforms all the other variants by a significant margin by further exploiting the adjacent ground truth through flow-guided motion estimation.
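The channel assembly for the generator input can be illustrated as follows. The nearest-neighbor backward warp and the exact channel layout (3 RGB + 2 warped GT maps + 2 flow magnitudes) are simplifying assumptions for illustration, and `warp_by_flow` / `fgplg_input` are hypothetical helper names, not the paper's code.

```python
import numpy as np

def warp_by_flow(mask, flow):
    """Backward-warp an (H, W) label map with nearest-neighbor lookup;
    flow[..., 0] and flow[..., 1] are the x- and y-displacements."""
    H, W = mask.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return mask[src_y, src_x]

def fgplg_input(rgb, gt_prev, gt_next, flow_prev, flow_next):
    """Stack the generator's input channels: image RGB, the two adjacent
    ground-truth maps warped to the current frame, and the flow magnitudes."""
    warped = [warp_by_flow(gt_prev, flow_prev),
              warp_by_flow(gt_next, flow_next)]
    mags = [np.linalg.norm(f, axis=-1) for f in (flow_prev, flow_next)]
    return np.dstack([rgb] + warped + mags)  # (H, W, 3 + 2 + 2)
```

With zero flow the warp is the identity, so the warped GT channels simply copy the adjacent annotations; with non-zero flow they are aligned to the current frame before being stacked.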

Model variant   (a)     (b)     (c)     (d)     (e)
max F-measure   0.846   0.853   0.856   0.857   0.861
S-measure       0.865   0.871   0.871   0.872   0.874
Table 3: Effectiveness of the non-locally enhanced recurrent module on the VOS test set. Variants (a)–(e) differ in whether the first non-local block, the (DB-)ConvGRU module, and the second non-local block are enabled; (e) is the full RCRNet+NER.
Generator input   no pseudo-labels   RGB only   RGB + adjacent GT   RGB + adjacent GT + flow warping
max F-measure     0.821              0.832      0.838               0.847
S-measure         0.832              0.854      0.860               0.861
Table 4: Effectiveness of the flow-guided pseudo-label generation model on the VOS test set. "No pseudo-labels" is the baseline trained without label generation; the last column is the full FGPLG with optical flow and GT warping.
Dataset   Metric   FST    SFL    LMP    FSEG   LVO    PDB    Ours
DAVIS     mean J   55.8   67.4   70.0   70.7   70.1   74.3   74.7
DAVIS     mean F   51.1   66.7   65.9   65.3   72.1   72.8   73.3
FBMS      mean J   47.7   35.7   35.7   68.4   65.1   72.3   75.9
  • Note that our model is a semi-supervised learning model trained with only a fraction of the ground-truth labels.

Table 5: Performance comparison with 6 representative unsupervised video object segmentation methods on the DAVIS and FBMS datasets. The best scores are marked in bold.

5 Performance on Unsupervised Video Object Segmentation

Unsupervised video object segmentation aims at automatically separating primary objects from input video sequences. As described above, its problem setting is quite similar to that of video salient object detection, except that it performs a binary classification instead of computing a saliency probability for each pixel. To demonstrate the advantages and generalization ability of our proposed semi-supervised model, we test the pretrained RCRNet+NER (mentioned in Section 4) on the DAVIS and FBMS datasets without any pre-/post-processing and make a fair comparison with six representative state-of-the-art unsupervised video segmentation methods, including FST [papazoglou2013fast], SFL [cheng2017segflow], LMP [tokmakov2017learningmotion], FSEG [jain2017fusionseg], LVO [tokmakov2017learningvideo], and PDB [song2018pyramid]. We adopt the mean Jaccard index J (intersection-over-union) and the mean contour accuracy F as metrics for quantitative comparison on the DAVIS dataset, following its standard evaluation protocol. For the FBMS dataset, we employ the mean Jaccard index J, as done by previous works [song2018pyramid, li2018flow]. As shown in Table 5, our proposed method outperforms the above methods on both the DAVIS and FBMS datasets, which implies that our method has a strong ability to capture spatiotemporal information from video frames and is applicable to unsupervised video segmentation.
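A minimal sketch of this evaluation step, assuming saliency maps are binarized at a fixed 0.5 threshold before computing the region similarity J; the contour accuracy F, which matches boundary pixels between prediction and ground truth, is omitted for brevity. The function names are illustrative, not the benchmark's official code.

```python
import numpy as np

def binarize(saliency, thresh=0.5):
    """Turn a saliency probability map into a binary segmentation mask,
    as required for the unsupervised video object segmentation setting."""
    return saliency >= thresh

def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def mean_jaccard(preds, gts):
    """Mean J over a sequence of per-frame binary masks."""
    return float(np.mean([jaccard(p, g) for p, g in zip(preds, gts)]))
```

In practice the official DAVIS toolkit computes J and F per sequence and averages over sequences; this sketch only captures the per-frame IoU core.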

6 Conclusion

In this paper, we propose an accurate and cost-effective framework for video salient object detection. Our proposed RCRNet equipped with a non-locally enhanced recurrent module learns to effectively capture spatiotemporal information from only a small number of ground-truth labels together with an appropriate number of pseudo-labels generated by our proposed flow-guided pseudo-label generation model. We believe this observation will bring insights to future work on manual annotation for video segmentation tasks. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on video salient object detection and is also applicable to unsupervised video segmentation. In future work, we will further explore how using keyframe selection instead of interval sampling of GT labels affects the performance of the proposed method.


This work was supported by the State Key Development Program under Grant 2016YFB1001004, the National Natural Science Foundation of China under Grants No. U1811463, No. 61702565, and No. 61876045, and the Department of Science and Technology of Guangdong Province Fund under Grant No. 2018B030338001, and was also sponsored by the SenseTime Research Fund.