A Novel Video Salient Object Detection Method via Semi-supervised Motion Quality Perception

08/07/2020 ∙ by Chenglizhao Chen, et al.

Previous video salient object detection (VSOD) approaches have mainly focused on designing fancy networks to achieve their performance improvements. However, with the recent slow-down in the development of deep learning techniques, it may become more and more difficult to anticipate another breakthrough via fancy networks alone. To this end, this paper proposes a universal learning scheme to obtain a further 3% performance improvement for all state-of-the-art (SOTA) methods. The major highlight of our method is that we resort to "motion quality", a brand new concept, to select a sub-group of video frames from the original testing set to construct a new training set. The selected frames in this new training set should all contain high-quality motions, in which the salient objects have a high probability of being successfully detected by the "target SOTA method", i.e., the one we want to improve. Consequently, we can achieve a significant performance improvement by using this new training set to start a new round of network training. During this new round of training, the VSOD results of the target SOTA method are applied as the pseudo training objectives. Our novel learning scheme is simple yet effective, and its semi-supervised methodology may have large potential to inspire the VSOD community in the future.


I Introduction and Motivation

Different from images, which comprise spatial information only, video data usually contain both spatial (appearance) and temporal (motion) information. To alleviate the computational burden, most video-related applications [53, 52, 2, 3, 16, 5, 35] have adopted video salient object detection (VSOD) approaches as a pre-processing tool to filter out less important video content while highlighting the salient objects that attract the human visual system most, aiming to strike a trade-off between efficiency and performance.

Fig. 1: The key motivation of our method is to select a sub-group of video frames from the original testing set to construct a new training set; these selected frames should have high-quality motions (determined by our MQPM), and their VSOD results provided by the target SOTA method are used as the pseudo-GT to start a new round of training, which improves the target SOTA method significantly. SOD: the salient object detection results obtained by feeding the optical flow data into a pre-trained image salient object detection model (we choose CPD [49] here, see Eq. 1); SOTA: the VSOD results of the target SOTA method (we take SSAV [15] as an example), whose performance we aim to improve, and which can be any other SOTA method; Ours: the final VSOD results after using our novel learning scheme, whose overall performance significantly outperforms the SOTA results.

After entering the deep learning era, the state-of-the-art (SOTA) VSOD approaches have achieved steady performance improvements via various fancy networks, such as ConvLSTM [51] and 3D ConvNet [44]. However, with the recent slow-down in the development of deep learning techniques, we should not anticipate another breakthrough from fancy networks alone. For example, compared with the leading SOTA method of 2019 (i.e., MGA [27]), the performance improvement made by the most recent work of 2020 (i.e., PCSA [17]) is marginal, with an average performance gap of less than 1%. This fact motivates us to ask: why not develop a universal learning scheme, rather than relying on fancy networks, to further improve the SOTA performance?

Given an off-the-shelf VSOD approach (which we name the "target SOTA method"), this paper aims to improve its performance via a novel learning scheme, and we formulate our idea as follows.
1) We select a sub-group of video frames from the original testing set to construct a new training set, and these selected frames need to be ones that have been "successfully detected" by the target SOTA method.
2) Consequently, we can achieve a significant performance improvement by using this new training set to start a new round of network training, in which the VSOD results of the corresponding SOTA method are used as the training objectives (pseudo-GT).
So, without using any saliency ground truth (GT) of the original testing set, all that remains is to determine, in advance, which frames will be successfully detected by the target SOTA method.

Our key idea is quite simple and straightforward, and it is inspired by a common phenomenon among SOTA methods; i.e., for most SOTA VSOD methods, performance usually varies from frame to frame, even though these frames belong to an identical video sequence sharing similar scenes. For example, as shown in Fig. 1, the 1st row shows 10 consecutive frames with similar scenes containing a worm as the salient object; however, as shown in the 3rd row, the VSOD results of the SOTA method (SSAV [15]) in frames #17, #18, #21 and #24 are clearly better than those in the other frames. The main reason is that the VSOD performance is determined by both spatial and temporal saliency clues. Though the spatial saliency clues are usually stable between consecutive video frames, the motion saliency clues may vary a lot due to the unpredictable nature of movements, not to mention the additional challenges induced by camera view angle changes. So, we propose a brand new concept, "motion quality", to predict which video frames have a high probability of being successfully detected by the target SOTA approach.

Fig. 2: Motion quality demonstrations, where the high-quality motions can usually separate salient objects from their non-salient surroundings nearby, while the low-quality motions cannot achieve this.

We name those clear motions (e.g., rigid movements), which can positively facilitate the VSOD task by separating salient objects from their non-salient surroundings nearby, the "high-quality motions", and accordingly we call the other cases the "low-quality motions", see Fig. 2.

In most cases, we believe that those video frames containing "high-quality motions" should be selected into our new training set. To predict motion quality in advance, we advocate a semi-supervised scheme to train our motion quality perception module (MQPM) in a frame-wise manner, see Fig. 3-C; it will be detailed in Sec. III-B. As one of the key components in our learning framework, the MQPM takes motion patterns (sensed by optical flow) as input, and then it makes a binary decision regarding whether or not the given frame contains high-quality motions. Meanwhile, in the case that a video frame has some high-quality motions, the MQPM also provides the corresponding spatial locations of these high-quality motions, and these spatial locations are used to facilitate the data filtering scheme (Sec. III-C2), another key component in our learning framework, to double-check whether these motions really belong to the high-quality cases.

In summary, the main contributions of our work are the following four aspects:

  • A semi-supervised learning scheme to conduct Motion Quality Perception (to the best of our knowledge, this is the first attempt to improve the VSOD performance from the motion quality perspective);

  • A universal scheme to improve the performance of any other SOTA methods (at least 3% performance improvement in general);

  • Extensive quantitative validations and comparisons (covering almost all SOTA methods of the recent 3 years over the 5 largest datasets);

  • Method source code and results are publicly available at https://github.com/qduOliver/MQP, which has large potential to benefit the VSOD community in the future.

II Related Work

II-A Image Salient Object Detection

The main target of image saliency ([21, 20, 4]) is to quickly locate the most eye-catching objects in a given image. In general, there are two typical categories of methods for the image salient object detection (ISOD) task, namely the fully convolutional networks (FCNs) based methods and the multi-task learning (MTL) based methods, and we will briefly introduce several of the most representative methods of these two types.

II-A1 The FCNs based methods

The key rationale of the FCNs based methods [22, 45, 49] is to utilize multi-scale/multi-level contrast computation to sense saliency clues. In fact, different network layers usually show different saliency perception abilities, i.e., the deeper layers tend to preserve localization information only, while the shallower layers are mainly abundant in tiny details. Thus, Hou et al. [22] proposed to use short connections between different layers to achieve multi-scale ISOD, in which the coarse localization information is introduced into the shallower layers, achieving a much improved performance. Similarly, Wang et al. [45] adopted a top-down and bottom-up inference network, implementing step-by-step optimization via a cooperative and iterative feed-forward and feed-back strategy. Although these two representative methods have achieved significant performance improvements, their network structures are generally too heavy. In contrast, Wu et al. [49] proposed a lightweight framework, which discards high-resolution deep features to speed up detection; the motivation is that the deep features in shallower layers usually contribute less to the overall performance yet at high computational cost.

II-A2 The MTL based methods

The key rationale of the MTL based methods is to resort to additional auxiliary information to boost the overall performance of the conventional single-stream methods, where such information frequently includes depth [56], image captions [54] and edge clues [31, 38, 55]. Zhu et al. [56] proposed to learn a switch map to adaptively fuse the RGB saliency clues with the depth saliency clues to formulate the final ISOD result. Zhang et al. [54] leveraged image captions to facilitate their newly proposed weakly supervised ISOD learning scheme, in which the key idea is to utilize the feature similarities between different caption categories to shrink the given problem domain. Qin et al. [38] proposed a novel edge-related loss function to further refine the tiny details in the final ISOD maps. Similarly, Zhao et al. [55] combined the edge loss function with multi-level features to further improve the ISOD performance, in which the edge-related saliency clues are treated as an explicit indicator to coarsely locate the salient objects.

II-B Video Salient Object Detection

II-B1 Conventional hand-crafted methods

Different from the above-mentioned ISOD methods, video salient object detection (VSOD) is more challenging due to the additionally available temporal information. Previous hand-crafted approaches [46, 18, 6, 19, 10] have widely adopted low-level saliency clues, revealed individually from either the spatial branch or the temporal branch, to formulate their VSOD. To fuse spatial and temporal saliency clues, Wang et al. [46] resorted to both spatial edges and temporal boundaries to facilitate salient object localization. Guo et al. [18] designed a primitive approach to identify the salient object by ranking and selecting salient proposals. Chen et al. [6] devised a bi-level learning strategy to model long-term spatial-temporal saliency consistency. Guo et al. [19] proposed a fast VSOD method that uses principal motion vectors to represent the corresponding motion patterns; such motion information, coupled with color clues, is fed into a multi-clue optimization framework to achieve spatiotemporal VSOD.

II-B2 Deep-Learning based methods

The development of convolutional neural networks (CNNs) has fulfilled the need for performance improvement in the VSOD field. To date, since spatial saliency can be measured via off-the-shelf ISOD deep models, considerable research attention has been paid to the measurement of temporal saliency within the deep learning framework, in which the current mainstream works can be categorized into two groups according to their network structures [13], i.e., the single-stream network based methods and the bi-stream network based methods.

We first introduce the single-stream network based methods. Le et al. [24] designed an end-to-end 3D network to directly learn spatiotemporal information. This 3D framework adds a refinement component at the end of its encoder-decoder backbone network, and its key rationale is to resort to the semantic information of the deeper layers to refine its spatiotemporal saliency maps. Li et al. [26] developed a novel FCNs based network to conduct VSOD in a stage-wise manner consisting of two main stages; i.e., the spatial saliency maps (using RGB information only) are computed in advance, and then the spatial saliency maps of consecutive video frames are simultaneously fused into spatiotemporal saliency maps. To enlarge the temporal sensing scope, Wang et al. [48] adopted optical flow based correspondences to warp long-term information into the current video frame. Similarly, Song et al. [41] presented a novel scheme to sense multi-scale spatiotemporal information, in which the key idea is to resort to a bi-LSTM network to extract long-term temporal features. Meanwhile, this work adopts pyramid dilated convolutions to extract multi-scale spatial saliency features, which are later fed into the above-mentioned bi-LSTM network to achieve long-term and multi-scale VSOD. Fan et al. [15] developed an attention-shift baseline and also released a large-scale saliency-shift-aware dataset for the VSOD problem.

Fig. 3: The overall method pipeline. Our novel learning scheme can be applied on top of the conventional learning scheme (see subfigure-A), and it mainly consists of three steps, which are marked in different colors (red, green and blue) in subfigure-B. The motion quality perception module is the most important component, and we demonstrate its details in subfigure-C, where the marks 1 to 7 show the detailed dataflow.

Different from the single-stream networks with limited motion sensing ability [30, 34], the bi-stream networks [11, 8] are usually capable of sensing motion clues explicitly, in which the RGB frames and the optical flow maps are individually treated as the inputs of the two sub-branches. Then, the spatial saliency clues and the temporal saliency clues are computed respectively and later fused into the final VSOD results. Tokmakov et al. [43] proposed to feed the concatenated spatial and temporal deep features into a ConvLSTM network, aiming to strike an optimal balance between the temporal branch and the spatial branch. Li et al. [27] exploited motion information as attention to boost the overall performance of the spatial branch. Most recently, Gu et al. [17] learned non-local motion dependencies across several frames, and then followed a pyramid structure to capture spatiotemporal saliency clues at various scales.

III Proposed Approach

III-A Method Overview

Given a pre-trained SOTA method (i.e., the target SOTA method), our key idea is to use a sub-group of testing frames with high-quality VSOD results to train a novel appearance model, and this novel model will eventually outperform the target SOTA method significantly. To achieve this, our method mainly consists of three steps, and the detailed method overview can be found in Fig. 3.
1) Firstly, we weakly train a novel deep model, i.e., the Motion Quality Perception Module (MQPM, blue box).
2) Next, we use the MQPM to select a sub-group of video frames (with high-quality motions) from the testing set to formulate a new training set (red box).
3) Finally, this new training set is used to train a novel appearance model with much improved VSOD performance (green box).

III-B Motion Quality Perception Module

We demonstrate the detailed MQPM pipeline in Fig. 3-C. The ultimate goal of the MQPM is to provide a frame-wise binary prediction regarding whether or not the given frame contains high-quality motions; if yes, it also provides the spatial locations of the high-quality motions.

To achieve this goal, we first divide the training instances (i.e., frames) of the original VSOD training set (i.e., Davis-TR [37]) into two groups, i.e., one group includes frames with high-quality motions, and the other includes frames with low-quality motions only. Thus, the MQPM can be easily trained using this partition.

The problem now is how to automatically achieve such a motion-quality-aware partition in advance.

Fig. 4: The detailed network architecture of our motion quality perception module (MQPM). For simplicity, we have omitted all up-sampling/down-sampling operations.

III-B1 Motion Quality Measurement

In the 2nd row of Fig. 1, we demonstrate the corresponding optical flow results (encoded as RGB colors) of some consecutive frames in a given video sequence (i.e., the "worm" sequence from the widely used Davis set). Notice that these optical flow results are computed by using the off-the-shelf optical flow tool [42] to sense motions between two consecutive frames, in which the RGB colors at different pixels denote the estimated motion intensities and directions. It can be easily observed in Fig. 1 that the video frames with high-quality motions (e.g., frame #18) usually share some distinct attributes in common, i.e., the optical flow values inside the salient object (i.e., the worm) are totally different from those of the non-salient surroundings nearby. Based on this, we propose a simple yet effective way to measure the motion quality score (MQS), see Eq. 1, with a quite straightforward rationale; i.e., the salient objects in those frames with high-quality motions have a high probability of being successfully detected by an off-the-shelf image salient object detection method, and these frames should be assigned large MQSs.

MQS_i = Φ( C(F_i), G_i ),    (1)

where F_i denotes the optical flow result of the i-th frame, and G_i denotes the human-annotated pixel-wise VSOD saliency ground truth; C(·) denotes a pre-trained image salient object detection deep model, for which we choose the off-the-shelf CPD [49] due to its lightweight implementation; Φ(·,·) denotes the consistency measurement between the SOD result C(F_i) and G_i. In fact, there are various consistency measurements that are widely used for quantitative evaluation, such as MAE [36], F-Measure [1] and S-Measure [14]. For simplicity, we choose the S-Measure as the consistency measurement in Eq. 1. Notice that we have also tested other measurements, but the overall performance does not change much, i.e., it fluctuates at the second decimal place mostly.
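The following is a minimal Python sketch of this measurement, given for illustration only; sod_model and s_measure are placeholder callables standing in for the pre-trained CPD network and an S-measure implementation, and are not part of our released code.

import torch

def compute_mqs(flow_rgb, gt_mask, sod_model, s_measure):
    """Motion quality score (Eq. 1): consistency (S-measure) between the SOD map
    predicted from the RGB-encoded optical flow and the VSOD ground truth.

    flow_rgb : (3, H, W) tensor, optical flow of frame i encoded as an RGB image
    gt_mask  : (H, W) tensor, binary pixel-wise VSOD ground truth G_i
    sod_model: placeholder for a pre-trained image SOD network (e.g., CPD)
    s_measure: placeholder for an S-measure implementation
    """
    with torch.no_grad():
        pred = sod_model(flow_rgb.unsqueeze(0))   # (1, 1, H, W) saliency from motion only
        pred = torch.sigmoid(pred).squeeze()      # (H, W), values in [0, 1]
    return float(s_measure(pred, gt_mask))        # larger MQS = higher motion quality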

III-B2 Training Set for MQPM

To train our MQPM, we need to weakly assign a binary label to each frame in the Davis training set regarding whether it contains high-quality motions. Therefore, we use the motion quality score (MQS, Eq. 1) as the key indicator to produce such labels (ℓ_i) as follows:

ℓ_i = 1 if MQS_i ≥ λ (i.e., X_i ∈ XH);  ℓ_i = 0 otherwise (i.e., X_i ∈ XL),    (2)

where XH denotes the group of frames with high-quality motions, XL denotes the group of frames with low-quality motions, and λ is a pre-defined decision threshold. To ensure an optimal balance between the positive (1) and negative (0) training instances, we iteratively update λ until convergence via Eq. 3 and Eq. 4.

N_H(λ) = Σ_i 1[MQS_i ≥ λ],  N_L(λ) = Σ_i 1[MQS_i < λ],    (3)

λ ← argmin_λ |N_H(λ) − N_L(λ)|,    (4)

where N_H and N_L count the positive and negative instances induced by λ, and the minimization is carried out iteratively over p(MQS), the probability distribution of MQS in the entire VSOD training set.
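As a concrete illustration of this labeling step, the following minimal sketch assigns the binary labels of Eq. 2 and sets the threshold to the median of the scores, which is one simple way to satisfy the balance condition; the exact iterative update for λ may differ in practice.

import numpy as np

def assign_motion_quality_labels(mqs_scores):
    """Assign binary motion quality labels (Eq. 2) over the whole training set.

    mqs_scores: 1-D array of MQS values, one per training frame.
    Returns labels (1 = high-quality motion, i.e., XH) and the threshold lambda.
    """
    mqs = np.asarray(mqs_scores, dtype=np.float32)
    lam = float(np.median(mqs))             # median splits p(MQS) into two equal halves
    labels = (mqs >= lam).astype(np.int64)  # positives (XH) vs. negatives (XL)
    return labels, lam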

Thus far, we can formulate the training set as {(X_i, G_i, ℓ_i)}, i = 1, …, N, where X_i denotes the i-th video frame, G_i is its original binary VSOD ground truth, and ℓ_i is its motion quality label (Eq. 2). Next, we will introduce how to train the MQPM by using this training set.

III-B3 MQPM Training

We formulate our MQPM training as a multi-task procedure following a vanilla bi-stream structure, in which one stream aims at the binary motion quality prediction (i.e., classification) and the other stream conducts pixel-wise motion saliency detection (i.e., localization).

As shown in Fig. 4, the MQPM takes the RGB-encoded optical flow data as input, and its output comprises two parts: 1) a motion saliency map; 2) a motion quality prediction. The main network structure of the MQPM comprises three components: one feature encoder (VGG-16 [40]) and two sub-branches with different loss functions.

The motion saliency branch takes the last three encoder layers as input. Next, each of these inputs is fed into the widely used multi-scale dilated attention module (with dilation factors ranging over {2, 4, 6, 8}) to filter out irrelevant features. Thus, the motion saliency map can be computed by applying the U-Net [39] decoder iteratively, for which the binary cross entropy loss (L_bce) is used. Meanwhile, the classification branch only takes the last decoder layer as input. Thus, the total loss function can be represented as Eq. 5.

L_total = L_bce + L_cls,    (5)

where the binary cross entropy loss L_bce is detailed in Eq. 6, and L_cls is a typical binary classification loss as shown in Eq. 7.

L_bce = − Σ_i Σ_j [ G_{i,j} · log(S_{i,j}) + (1 − G_{i,j}) · log(1 − S_{i,j}) ],    (6)

where S_{i,j} denotes the predicted motion saliency value at the j-th pixel of the i-th frame; G_{i,j} represents the ground truth value at the j-th pixel of the i-th frame; "·" is the conventional multiplication operation; log(·) is the typical logarithmic operation.

L_cls = − Σ_i [ ℓ_i · log(p_i) + (1 − ℓ_i) · log(1 − p_i) ],    (7)

where L_cls is a logistic regression cost function; p_i denotes the confidence regarding the category prediction (i.e., high-quality/low-quality motion) of the i-th frame; ℓ_i is the previously determined motion quality label (Eq. 2).
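To make this design more tangible, below is a minimal PyTorch sketch of a bi-stream module of this kind together with the combined loss of Eqs. 5-7. The exact encoder layers (relu3_3/relu4_3/relu5_3), channel widths, decoder wiring and the equal weighting of the two loss terms are assumptions, since Fig. 4 and the text do not fully specify them.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class DilatedAttention(nn.Module):
    """Multi-scale dilated convolutions (dilations 2, 4, 6, 8) whose fused
    response gates the input features, filtering out irrelevant activations."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, out_ch, 1)
        self.branches = nn.ModuleList(
            nn.Conv2d(out_ch, out_ch, 3, padding=d, dilation=d) for d in (2, 4, 6, 8))
        self.fuse = nn.Conv2d(4 * out_ch, out_ch, 1)

    def forward(self, x):
        x = self.reduce(x)
        gate = torch.sigmoid(self.fuse(torch.cat([b(x) for b in self.branches], 1)))
        return x * gate

class MQPM(nn.Module):
    """Bi-stream motion quality perception module: a motion saliency (localization)
    branch plus a frame-level quality classification branch on a VGG-16 encoder.
    Layer choices and channel widths here are illustrative assumptions."""
    def __init__(self, ch=64):
        super().__init__()
        backbone = vgg16(weights=None).features          # pre-trained weights optional
        self.enc3 = backbone[:16]                        # -> relu3_3, 256 channels
        self.enc4 = backbone[16:23]                      # -> relu4_3, 512 channels
        self.enc5 = backbone[23:30]                      # -> relu5_3, 512 channels
        self.att3, self.att4, self.att5 = (DilatedAttention(c, ch) for c in (256, 512, 512))
        self.dec54 = nn.Conv2d(2 * ch, ch, 3, padding=1) # U-Net style top-down fusion
        self.dec43 = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.sal_head = nn.Conv2d(ch, 1, 1)              # motion saliency logits
        self.cls_head = nn.Linear(ch, 1)                 # motion quality logit

    def forward(self, flow_rgb):                         # (B, 3, H, W) RGB-encoded flow
        f3 = self.enc3(flow_rgb); f4 = self.enc4(f3); f5 = self.enc5(f4)
        a3, a4, a5 = self.att3(f3), self.att4(f4), self.att5(f5)
        d4 = F.relu(self.dec54(torch.cat([a4, F.interpolate(a5, size=a4.shape[2:])], 1)))
        d3 = F.relu(self.dec43(torch.cat([a3, F.interpolate(d4, size=a3.shape[2:])], 1)))
        saliency = F.interpolate(self.sal_head(d3), size=flow_rgb.shape[2:],
                                 mode='bilinear', align_corners=False)
        quality = self.cls_head(d3.mean(dim=(2, 3)))     # classification from last decoder layer
        return saliency, quality.squeeze(1)

def mqpm_loss(pred_saliency, gt_saliency, pred_quality_logit, quality_label):
    """Total loss (Eq. 5): pixel-wise BCE on the motion saliency map (Eq. 6)
    plus frame-level BCE on the quality prediction (Eq. 7); equal weights assumed."""
    l_bce = F.binary_cross_entropy_with_logits(pred_saliency, gt_saliency)
    l_cls = F.binary_cross_entropy_with_logits(pred_quality_logit, quality_label.float())
    return l_bce + l_cls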

III-C New Training Set For VSOD

III-C1 Initialization

Thus far, the motion quality perception module (MQPM) has been trained, providing two vital pieces of information which can be used to improve the target SOTA method: 1) the binary motion quality prediction; 2) the motion saliency map.

Fig. 5: Qualitative comparisons with the current SOTA methods. Due to limited space, we only list the six most representative ones here, including PCSA20 [17], SSAV19 [15], MGA19 [27], COS19 [33], LSTI20 [9] and CPD19 [49].

As mentioned before, the former can be used as an explicit indicator to tell which frames in the VSOD testing set should be selected, while the latter is used as a double-check to ensure that the selected frames really contain high-quality motions which are capable of benefiting the VSOD training in practice. Here we use both of these to facilitate the construction of a new training set, which only comprises video frames containing high-quality motions. This new training set is then used to start a new round of network training and eventually improve the performance of the target SOTA method.

For the VSOD testing set, we first compute optical flow frame-by-frame, and then feed these optical flow results into the well-trained MQPM; those frames (i.e., the original frames rather than their optical flow results) which are predicted to contain "high-quality motions" are directly pooled as the initial version of the new training set.

Each training instance (X, Y) in this new training set mainly consists of two components: the original frame X and the corresponding VSOD result predicted by the target SOTA method (trained using both spatial and temporal information), which serves as its training objective Y (see the pictorial demonstration in the red box of Fig. 3).

Also, it is worth mentioning that we cannot directly use the motion saliency maps (i.e., the output of the localization branch in Fig. 4) as the training objectives. The main reason is that the motion saliency maps usually have blurry object boundaries (due to the absence of spatial information), and thus the performance improvement may be severely limited if we directly apply the motion saliency maps as the pseudo-GT during this new round of training; the corresponding quantitative evidence can be found in Table III.
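The construction of this initial training set can be sketched as follows; compute_optical_flow, mqpm and sota_vsod are placeholder callables standing in for PWC-Net [42], the trained MQPM (returning a saliency map and a frame-level quality logit, as in the sketch of Sec. III-B3), and the target SOTA method, respectively.

def build_initial_training_set(test_frames, compute_optical_flow, mqpm, sota_vsod):
    """Pool testing frames that the MQPM predicts to contain high-quality motions,
    pairing each with the target SOTA method's VSOD result as its pseudo-GT (Y)."""
    candidates = []
    for t in range(1, len(test_frames)):
        flow = compute_optical_flow(test_frames[t - 1], test_frames[t])  # e.g., PWC-Net
        motion_saliency, quality_logit = mqpm(flow)    # saliency map + frame-level logit
        if quality_logit > 0:                          # positive logit -> high-quality motion
            pseudo_gt = sota_vsod(test_frames[t])      # the SOTA VSOD, not the motion map
            candidates.append((test_frames[t], pseudo_gt, motion_saliency))
    return candidates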

Dataset Davis Segv2 Visal DAVSOD VOS
Metric maxF S-M MAE maxF S-M MAE maxF S-M MAE maxF S-M MAE maxF S-M MAE
Baseline 0.861 0.893 0.028 0.801 0.851 0.023 0.939 0.943 0.020 0.603 0.724 0.092 0.742 0.819 0.073
T=1 0.892 0.910 0.019 0.826 0.873 0.020 0.934 0.939 0.018 0.686 0.764 0.076 0.755 0.820 0.067
T=1/2 0.890 0.908 0.018 0.824 0.866 0.019 0.934 0.935 0.018 0.686 0.760 0.074 0.760 0.819 0.066
T=1/3 0.889 0.906 0.020 0.833 0.874 0.019 0.938 0.942 0.016 0.696 0.769 0.072 0.758 0.825 0.064
T=1/4 0.893 0.908 0.018 0.832 0.870 0.018 0.934 0.933 0.016 0.693 0.768 0.074 0.756 0.822 0.063
T=1/5 0.894 0.906 0.020 0.836 0.880 0.019 0.935 0.940 0.017 0.699 0.774 0.071 0.767 0.831 0.066
T=1/10 0.888 0.906 0.019 0.836 0.876 0.018 0.940 0.939 0.016 0.698 0.769 0.071 0.738 0.812 0.070
TABLE I: Ablation study regarding our data filtering strategy, where "Baseline" denotes the target SOTA method (i.e., SSAV); see more details in Sec. IV-E.
Quality Frames with High-quality Motions (HQ) Frames with Low-quality Motions (LQ)
Metric maxF meanF adpF S-M MAE maxF meanF adpF S-M MAE
Davis [37] 0.884 0.840 0.800 0.906 0.022 0.828 0.782 0.719 0.875 0.034
Segv2 [25] 0.864 0.808 0.834 0.881 0.024 0.780 0.726 0.709 0.852 0.024
DAVSOD [15] 0.653 0.621 0.626 0.753 0.080 0.642 0.611 0.611 0.738 0.086
Visal [47] 0.883 0.850 0.832 0.910 0.025 0.938 0.895 0.841 0.945 0.014
VOS [28] 0.767 0.739 0.749 0.815 0.073 0.734 0.697 0.700 0.816 0.074
Total 0.810 0.772 0.768 0.853 0.045 0.784 0.742 0.716 0.845 0.046
TABLE II: Proofs regarding the effectiveness of our motion quality perception module (MQPM). The quantitative metrics include maxF (larger is better), meanF (larger is better), adpF (larger is better), S-Measure (larger is better) and MAE (smaller is better). By using the MQPM as the indicator, those frames which are predicted to have high-quality motions outperform the other frames significantly; here we choose SSAV [15] as the target SOTA method as an example.

III-C2 Data Filtering

As we have mentioned before, our rationale is based on the assumption that the SOTA methods tend to exhibit high-quality VSOD on those frames with high-quality motions (see the quantitative proofs in Table II). In fact, this assumption holds in most cases. However, there still exist occasional exceptions.

As shown in Fig. 1, our MQPM has predicted that frame #20 has a high probability of containing some high-quality motions, and the optical flow result of frame #20 (in the 2nd row) is indeed capable of separating the salient object from its non-salient surroundings nearby, producing a high-quality motion saliency map as well (in the 3rd row). However, the VSOD predicted by the target SOTA method (it can be any SOTA method; here we simply choose SSAV [15] as an example) fails to completely detect the salient object, and it may degrade the overall performance if the new training set contains a large number of such cases.

Meanwhile, we have noticed that there exist a large number of consecutive frames in the VSOD testing set (almost 30%) that tend to be predicted as containing high-quality motions. Since these consecutive frames usually share similar spatial appearances in general, it would easily lead to an over-fitted appearance model if we used all of these frames during the upcoming training.

Dataset Davis Segv2 Visal DAVSOD VOS
Metric maxF S-M MAE maxF S-M MAE maxF S-M MAE maxF S-M MAE maxF S-M MAE
MS Baseline 0.798 0.854 0.044 0.648 0.760 0.054 0.627 0.738 0.079 0.450 0.613 0.148 0.405 0.566 0.167
MS+MQPM (random) 0.784 0.844 0.043 0.656 0.761 0.053 0.688 0.774 0.075 0.488 0.632 0.143 0.501 0.617 0.161
MS+MQPM 0.814 0.866 0.032 0.760 0.832 0.028 0.745 0.809 0.051 0.569 0.685 0.107 0.627 0.702 0.108
MS+MQPM+SOTA 0.894 0.906 0.020 0.836 0.880 0.019 0.935 0.940 0.017 0.699 0.774 0.071 0.767 0.831 0.066
TABLE III: Component quantitative evaluation results. The quantitative metrics include the maxF (larger is better), S-Measure (larger is better) and MAE (smaller is better), more details can be found in Sec. IV-D.
Dataset Metric Ours 2020 2019 2018 2017
PCSA LSTI SSAV MGA COS CPD PDBM MBNM SCOM SFLR SGSP STBP
[17] [9] [15] [27] [33] [49] [41] [29] [12] [7] [32] [50]
Davis [37] maxF 0.894 0.880 0.850 0.861 0.892 0.875 0.778 0.855 0.861 0.783 0.727 0.655 0.544
S-M 0.906 0.902 0.876 0.893 0.910 0.902 0.859 0.882 0.887 0.832 0.790 0.692 0.677
MAE 0.020 0.022 0.034 0.023 0.023 0.020 0.032 0.028 0.031 0.064 0.056 0.138 0.096
SegV2 [25] maxF 0.836 0.810 0.858 0.801 0.821 0.801 0.778 0.800 0.716 0.764 0.745 0.673 0.640
S-M 0.880 0.865 0.870 0.851 0.865 0.850 0.841 0.864 0.809 0.815 0.804 0.681 0.735
MAE 0.019 0.025 0.025 0.023 0.030 0.020 0.023 0.024 0.026 0.030 0.037 0.124 0.061
Visal [47] maxF 0.935 0.940 0.905 0.939 0.933 0.966 0.941 0.888 0.883 0.831 0.779 0.677 0.622
S-M 0.940 0.946 0.916 0.943 0.936 0.965 0.942 0.907 0.898 0.762 0.814 0.706 0.629
MAE 0.017 0.017 0.033 0.020 0.017 0.011 0.016 0.032 0.020 0.122 0.062 0.165 0.163
DAVSOD [15] maxF 0.699 0.655 0.585 0.603 0.640 0.614 0.608 0.572 0.520 0.464 0.478 0.426 0.410
S-M 0.774 0.741 0.695 0.724 0.738 0.725 0.724 0.698 0.637 0.599 0.624 0.577 0.568
MAE 0.071 0.086 0.106 0.092 0.084 0.096 0.092 0.116 0.159 0.220 0.132 0.207 0.160
VOS [28] maxF 0.767 0.747 0.649 0.742 0.735 0.724 0.735 0.742 0.670 0.690 0.546 0.426 0.526
S-M 0.831 0.827 0.695 0.819 0.792 0.798 0.818 0.818 0.742 0.712 0.624 0.557 0.576
MAE 0.066 0.065 0.115 0.073 0.075 0.065 0.068 0.078 0.099 0.162 0.145 0.236 0.163
TABLE IV: Quantitative comparisons with current SOTA methods. The top three results are marked by red, green and blue, respectively.

So, to address the above-mentioned issues, we propose a novel filtering scheme, aiming to exclude the less trustworthy or redundant training instances, see below; a code sketch of this filtering step follows the list.
1) For each frame in the new training set, we measure the consistency degree (we choose the S-Measure, but are not limited to it) between its motion saliency map and the VSOD result produced by the target SOTA method.
2) Only a fraction T of the frames is retained: for every 1/T consecutive frames in the new training set, the one frame with the largest consistency degree, which is usually positively correlated with the trustworthiness of the VSOD prediction made by the target SOTA method, is kept (see the detailed ablation study on T in Table I).
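A minimal sketch of this filtering step is given below, assuming an S-measure implementation s_measure and a window of 1/T frames (e.g., 5 when T = 1/5); it operates on the (frame, pseudo-GT, motion saliency) triples pooled in Sec. III-C1.

def filter_training_set(candidates, s_measure, window=5):
    """Data filtering (Sec. III-C2): for every `window` consecutive candidate
    frames, keep only the one whose motion saliency map agrees best (S-measure)
    with the SOTA pseudo-GT, dropping less trustworthy or redundant instances."""
    filtered = []
    for start in range(0, len(candidates), window):
        chunk = candidates[start:start + window]
        best = max(chunk, key=lambda c: s_measure(c[2], c[1]))  # consistency degree
        filtered.append((best[0], best[1]))                     # keep frame + pseudo-GT
    return filtered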

III-C3 New Round Of Network Training

Once the new training set has been constructed, we conduct a new round of network training on it. However, we cannot directly retrain the target SOTA model using this new training set, because it only consists of individual video frames without any temporal information; i.e., our new training set only preserves spatial information, while the SOTA models need to be fed with both spatial and temporal information. So, we choose to set up a completely new model with a network structure identical to the localization branch demonstrated in Fig. 4, and this new model is trained over the new training set by using the common supervised training protocol (Eq. 6); its output gives our final VSOD results with much improved performance compared with the target SOTA method.
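The new round of training then reduces to standard supervised learning of a purely spatial model on the filtered (frame, pseudo-GT) pairs. The sketch below assumes a model with the same structure as the localization branch and uses the settings of Sec. IV-B as illustrative defaults; it is not a verbatim training script.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def retrain_appearance_model(model, new_training_set, epochs=5, lr=1e-3):
    """Train a fresh, spatial-only saliency model on the (frame, pseudo-GT) pairs;
    batch size, epochs and learning rate are illustrative defaults (see Sec. IV-B)."""
    loader = DataLoader(new_training_set, batch_size=8, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for frame, pseudo_gt in loader:          # (B, 3, H, W) frames, (B, 1, H, W) maps
            pred = model(frame)                  # saliency logits from the spatial model
            loss = F.binary_cross_entropy_with_logits(pred, pseudo_gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model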

Notably, though this new round of training requires additional time cost, the performance gain can still benefit scenarios without strict speed requirements.

IV Experiments

IV-A Datasets

We have evaluated our method on five widely used, publicly available datasets, including Davis [37], Segtrack-v2 [25], Visal [47], DAVSOD [15], and VOS [28].

  • Davis dataset contains 50 video sequences with 3455 frames in total, and most of its sequences only contain moderate motions.

  • Segtrack-v2 dataset contains 13 video sequences (excluding the penguin sequence) with 1024 frames in total, containing complex backgrounds and variable motion patterns, and is generally more challenging than the Davis dataset.

  • Visal dataset contains 17 video sequences with 963 frames in total, and this dataset is relatively simpler than the others.

  • DAVSOD dataset contains 226 video sequences with 23938 frames in total, which is the most challenging dataset in the field, involving various object instances, different motion patterns, and saliency shifting between different objects.

  • VOS dataset contains 40 video sequences with 24177 frames in total, yet only 1540 frames are well annotated; the sequences are all captured in indoor scenes.

IV-B Implementation Details

We have implemented our method on a PC with an Intel(R) Xeon(R) CPU, an Nvidia GTX 2080Ti GPU (with 11 GB of memory) and 64 GB of RAM. We use DAVIS-TR [37] as the initial training set to train our motion quality perception module (MQPM). An Adam optimizer [23] is applied to update the network parameters. We set the batch size to 8, which takes almost all of the GPU memory. The initial learning rate is set to 10e-3. To alleviate over-fitting, we adopt random horizontal flips for data augmentation.

Fig. 6: Qualitative comparisons between several most representative target SOTA methods and the corresponding VSOD results after using our novel learning scheme.
Dataset Metric SSAV[15] SSAV* MGA[27] MGA* COS[33] COS* LSTI[9] LSTI* PCSA[17] PCSA*
Davis [37] maxF 0.861 0.894 0.892 0.900 0.875 0.892 0.850 0.863 0.880 0.894
S-M 0.893 0.906 0.910 0.914 0.902 0.909 0.876 0.889 0.902 0.909
MAE 0.023 0.020 0.023 0.018 0.020 0.017 0.034 0.024 0.022 0.019
SegV2[25] maxF 0.801 0.836 0.821 0.835 0.801 0.815 0.858 0.862 0.810 0.835
S-M 0.851 0.880 0.865 0.882 0.850 0.866 0.870 0.891 0.865 0.880
MAE 0.023 0.019 0.030 0.028 0.020 0.018 0.025 0.016 0.025 0.020
Visal[47] maxF 0.939 0.935 0.933 0.933 0.966 0.956 0.905 0.916 0.940 0.942
S-M 0.943 0.940 0.936 0.931 0.965 0.955 0.916 0.928 0.946 0.946
MAE 0.020 0.017 0.017 0.015 0.011 0.010 0.033 0.022 0.017 0.014
DAVSOD[15] maxF 0.603 0.699 0.640 0.672 0.614 0.643 0.585 0.627 0.655 0.680
S-M 0.724 0.774 0.738 0.755 0.725 0.736 0.695 0.718 0.741 0.751
MAE 0.092 0.071 0.084 0.075 0.096 0.086 0.106 0.093 0.086 0.077
VOS[28] maxF 0.742 0.767 0.735 0.755 0.724 0.758 0.649 0.690 0.747 0.758
S-M 0.819 0.831 0.792 0.811 0.798 0.810 0.695 0.722 0.827 0.824
MAE 0.073 0.066 0.075 0.066 0.065 0.063 0.115 0.101 0.065 0.057
TABLE V: Quantitative comparisons of several most representative SOTA methods (SSAV19, MGA19, COS19, LSTI20, and PCSA20) vs. their improved results by using our novel learning scheme.

IV-C Evaluation Metrics

In order to accurately measure the consistency between the predicted VSOD and the manually annotated ground truth, we adopt three commonly used evaluation metrics, including the maximum F-measure (maxF) [1], the mean absolute error (MAE) [36], and the structure measure (S-measure) [14].
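For reference, minimal NumPy sketches of two of these metrics, MAE and the maximum F-measure over binarization thresholds (with the conventional beta^2 = 0.3), are given below; the S-measure is more involved and is omitted here.

import numpy as np

def mae(pred, gt):
    """Mean absolute error between a [0, 1] saliency map and a binary ground truth."""
    return float(np.mean(np.abs(pred.astype(np.float64) - gt.astype(np.float64))))

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Maximum F-measure over uniformly sampled binarization thresholds."""
    gt = gt.astype(bool)
    best = 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        binarized = pred >= t
        tp = np.logical_and(binarized, gt).sum()
        precision = tp / (binarized.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1.0 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, float(f))
    return best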

IV-D Component Evaluation

We have conducted an extensive component evaluation to verify the effectiveness of our proposed motion quality perception module (MQPM), and the quantitative results can be found in Table III. Meanwhile, the corresponding qualitative demonstrations regarding this component evaluation can be found in Fig. 7.

Fig. 7: The corresponding qualitative demonstrations regarding the component evaluations in Table III, in which the "MS+MQPM+SOTA" variant achieves the best performance.

As shown in Table III, the learned motion saliency, denoted by "MS", which can be obtained via C(F_i) as mentioned in Eq. 1, exhibits the worst performance on all of the adopted metrics. Then, by using our MQPM (Sec. III-B) to formulate a new training set (the MS maps are applied as the pseudo-GTs), the overall performance can be improved significantly (denoted by "MS+MQPM"); e.g., the maxF value on the VOS dataset increases from 40.5% to 62.7%. Notice that we cannot achieve such performance improvements via randomly assembled key frames from the training set; we denote this implementation as "MS+MQPM (random)", whose overall performance is quite similar to the original MS baseline. For example, in the breakdance video sequence of the Davis testing set, the MQPM selects 16 high-quality key frames; for a fair comparison, "MS+MQPM (random)" randomly selects 16 frames as the key frames.

Since the object boundaries are usually blurry in the MS baseline, the overall performance of the above re-trained model (i.e., "MS+MQPM") is limited. Thus, we further resort to our data filtering strategy (Sec. III-C2) to introduce the target SOTA results as high-quality pseudo-GTs; the corresponding results ("MS+MQPM+SOTA") are shown in the last row of Table III with the highest scores on all metrics, showing the effectiveness of our data filtering strategy.

Also, it should be noted that we have simply chosen SSAV [15] as the target SOTA method here, because the off-the-shelf SSAV model was pre-trained using the same training set as our method, which avoids the data leakage problem.

IV-E Ablation Study

As we have mentioned in Sec. III-C2, almost 30% of the video frames in the original testing set are predicted to contain high-quality motions (we abbreviate them as high-quality frames). For the reasons given in Sec. III-C2, we believe that it is time-consuming and unnecessary to use all of these high-quality frames to start a new round of training. Thus, the main purpose of our data filtering strategy is to automatically keep a small sub-group of high-quality frames as the final training set.

We have conducted an extensive ablation study regarding the parameter T, and the detailed results can be found in Table I. We test T = 1, 1/2, 1/3, 1/4, 1/5 and 1/10 respectively, in which T = 1 means using all of the high-quality video frames as the new training set, and T = 1/5 denotes that only the one frame with the largest consistency degree is retained for every 5 consecutive high-quality frames. As shown in Table I, the overall performance of our method is moderately sensitive to the choice of T: the configuration T = 1/5 exhibits the best overall performance in general, and a clear performance degradation can be found when we assign T = 1/10. So, we set T = 1/5 as the optimal choice to strike a trade-off between performance and efficiency.

Methods Ours PCSA20 [17] LSTI20 [9] SSAV19 [15] MGA19 [27] COS19 [33] PDBM18 [41] SCOM18 [12] SFLR17 [7] SGSP17 [32]
FPS 33 110 0.7 20.0 14.0 0.4 20.0 0.03 0.3 0.1
Platform GTX 2080Ti GTX Titan Xp GTX 1080Ti GTX TitanX GTX 2080Ti GTX 2080Ti GTX TitanX GTX TitanX GTX 970 CPU
TABLE VI: Runtime comparisons, where we have excluded the training time (i.e., the FPS provided here is only the inference speed), because the training procedure only needs to be conducted once for many video saliency based subsequent applications. Also, our method takes about 80s to construct the new training set, and another 600s to conduct the fine-tuning over 5 epochs (this varies with the training set size); for a single testing frame, it takes about 0.03s to infer the SOD result.

IV-F Comparisons to the SOTA methods

We have compared our method with the 12 most representative SOTA methods, including PCSA20 [17], LSTI20 [9], SSAV19 [15], MGA19 [27], COS19 [33], CPD19 [49], PDBM18 [41], MBNM18 [29], SFLR17 [7], SGSP17 [32], STBP17 [50] and SCOM18 [12].

As shown in Table IV, the quantitative results indicate that our method (we take SSAV as the target SOTA model here) significantly outperforms the compared SOTA methods on all tested datasets except the Visal dataset, showing the performance superiority of our method. In fact, the Visal dataset is a bit different from the other datasets, i.e., it is dominated by color information, in which motion clues usually play a secondary role in determining the true saliency. As a result, COS19, which relies heavily on the spatial domain, exhibits the best performance on the Visal dataset. Also, we provide qualitative comparisons in Fig. 5, where our VSOD results are more consistent with the GT than those of the compared SOTA methods.

Moreover, our method can be applied to any other SOTA VSOD method to further improve its performance. To show this advantage, we provide direct comparisons between several of the most representative SOTA methods and their improved versions after using our learning scheme. As shown in Table V, our method achieves an average performance improvement of about 5%, and up to almost 9.6% in the best case (maxF); the corresponding qualitative comparisons can be found in Fig. 6.

Also, we have conducted runtime comparisons with the SOTA methods in Table VI, in which our method achieves a real-time speed of 33 FPS during the inference phase. Although our complete pipeline is somewhat time-consuming overall, it still holds advantages compared to the other methods.

V Conclusion

In this paper, we have proposed a universal scheme to boost the SOTA methods in a semi-supervised manner. The key components of our method include: 1) the motion quality perception module, which is used to select a sub-group of high-quality frames from the original testing set to construct a new training set; 2) the data filtering scheme, which is used as a double-check to ensure the overall quality of the newly constructed training set. We have conducted extensive quantitative evaluations to show the effectiveness of these two components respectively.

References

  • [1] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk (2009) Frequency-tuned salient region detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 1597–1604. Cited by: §III-B1, §IV-C.
  • [2] K. Belloulata, B. Amina, and S. Zhu (2014) Object-based stereo video compression using fractals and shape-adaptive dct. AEU-Int. J. Electron. Commun. 68, pp. 687–697. External Links: Document Cited by: §I.
  • [3] C. Chen, S. Li, H. Qin, and A. Hao (2015) Real-time and robust object tracking in video via low-rank coherency analysis in feature space. Pattern Recognit. (PR) 48, pp. 2885–2905. External Links: Document Cited by: §I.
  • [4] C. Chen, S. Li, H. Qin, and A. Hao (2015) Structure-sensitive saliency detection via multilevel rank analysis in intrinsic feature space. IEEE Trans. Image Process. (TIP) 24 (8), pp. 2303–2316. Cited by: §II-A.
  • [5] C. Chen, S. Li, H. Qin, and A. Hao (2016) Robust salient motion detection in non-stationary videos via novel integrated strategies of spatio-temporal coherency clues and low-rank analysis. Pattern Recognit. (PR) 52, pp. 410–432. Cited by: §I.
  • [6] C. Chen, S. Li, H. Qin, Z. Pan, and G. Yang (2018) Bilevel feature learning for video saliency detection. IEEE Trans. Multimedia. (TMM) 20 (12), pp. 3324–3336. Cited by: §II-B1.
  • [7] C. Chen, S. Li, Y. Wang, H. Qin, and A. Hao (2017) Video saliency detection via spatial-temporal fusion and low-rank coherency diffusion. IEEE Trans. Image Process. (TIP) 26 (7), pp. 3156–3170. Cited by: TABLE IV, §IV-F, TABLE VI.
  • [8] C. Chen, Y. Li, S. Li, H. Qin, and A. Hao (2017) A novel bottom-up saliency detection method for video with dynamic background. IEEE Signal Processing Letters. (SPL) 25 (2), pp. 154–158. Cited by: §II-B2.
  • [9] C. Chen, G. Wang, C. Peng, X. Zhang, and H. Qin (2019) Improved robust video saliency detection based on long-term spatial-temporal information. IEEE Trans. Image Process. (TIP) 29, pp. 1090–1100. Cited by: Fig. 5, TABLE IV, §IV-F, TABLE V, TABLE VI.
  • [10] C. Chen, G. Wang, and C. Peng (2019) Structure-aware adaptive diffusion for video saliency detection. IEEE Access. 7, pp. 79770–79782. Cited by: §II-B1.
  • [11] C. Chen, J. Wei, C. Peng, W. Zhang, and H. Qin (2020) Improved saliency detection in rgb-d images using two-phase depth estimation and selective deep fusion. IEEE Trans. Image Process. (TIP) 29, pp. 4296–4307. Cited by: §II-B2.
  • [12] Y. Chen, W. Zou, Y. Tang, X. Li, C. Xu, and N. Komodakis (2018) SCOM: spatiotemporal constrained optimization for salient object detection. IEEE Trans. Image Process. (TIP) 27 (7), pp. 3345–3357. Cited by: §IV-F, TABLE VI.
  • [13] R. Cong, J. Lei, H. Fu, M. Cheng, W. Lin, and Q. Huang (2018) Review of visual saliency detection with comprehensive information. IEEE Trans. Circuits Syst. Video Technol. (TCSVT) 29 (10), pp. 2941–2959. Cited by: §II-B2.
  • [14] D. Fan, M. Cheng, Y. Liu, T. Li, and A. Borji (2017) Structure-measure: a new way to evaluate foreground maps. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4548–4557. Cited by: §III-B1, §IV-C.
  • [15] D. Fan, W. Wang, M. Cheng, and J. Shen (2019) Shifting more attention to video salient object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 8554–8564. Cited by: Fig. 1, §I, §II-B2, Fig. 5, §III-C2, TABLE II, TABLE IV, §IV-A, §IV-D, §IV-F, TABLE V, TABLE VI.
  • [16] Q. Fan, W. Luo, Y. Xia, G. Li, and D. He (2019) Metrics and methods of video quality assessment: a brief review. Multimed. Tools Appl. (MTA) 78 (22), pp. 31019–31033. Cited by: §I.
  • [17] Y. Gu, L. Wang, Z. Wang, Y. Liu, M. Cheng, and S. Lu (2020) Pyramid constrained self-attention network for fast video salient object detection. In Proc. AAAI Conf. Artif. Intell. (AAAI), Cited by: §I, §II-B2, Fig. 5, TABLE IV, §IV-F, TABLE V, TABLE VI.
  • [18] F. Guo, W. Wang, J. Shen, L. Shao, J. Yang, D. Tao, and Y. Tang (2017) Video saliency detection using object proposals. IEEE Trans. Cybern. (TCYB) 48 (11), pp. 3159–3170. Cited by: §II-B1.
  • [19] F. Guo, W. Wang, Z. Shen, J. Shen, L. Shao, and D. Tao (2019) Motion-aware rapid video saliency detection. IEEE Trans. Circuits Syst. Video Technol. (TCSVT). Cited by: §II-B1.
  • [20] J. Han, G. Cheng, Z. Li, and D. Zhang (2017) A unified metric learning-based framework for co-saliency detection. IEEE Trans. Circuits Syst. Video Technol. (TCSVT) 28 (10), pp. 2473–2483. Cited by: §II-A.
  • [21] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu (2018) Advanced deep-learning techniques for salient and category-specific object detection: a survey. IEEE Signal Process. Mag. (ISPM) 35 (1), pp. 84–100. Cited by: §II-A.
  • [22] Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. S. Torr (2019) Deeply supervised salient object detection with short connections. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 41 (4), pp. 815–828. Cited by: §II-A1.
  • [23] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-B.
  • [24] T. N. Le and A. Sugimoto (2017) Deeply supervised 3d recurrent fcn for salient object detection in videos. In British Machine Vis. Conf. (BMVC), Vol. 1, pp. 3. Cited by: §II-B2.
  • [25] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg (2013) Video segmentation by tracking many figure-ground segments. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 2192–2199. Cited by: TABLE II, TABLE IV, §IV-A, TABLE V.
  • [26] G. Li, Y. Xie, T. Wei, K. Wang, and L. Lin (2018) Flow guided recurrent neural encoder for video salient object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3243–3252. Cited by: §II-B2.
  • [27] H. Li, G. Chen, G. Li, and Y. Yu (2019) Motion guided attention for video salient object detection. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 7274–7283. Cited by: §I, §II-B2, Fig. 5, TABLE IV, §IV-F, TABLE V, TABLE VI.
  • [28] J. Li, C. Xia, and X. Chen (2017) A benchmark dataset and saliency-guided stacked autoencoders for video-based salient object detection. IEEE Trans. Image Process. (TIP) 27 (1), pp. 349–364. Cited by: TABLE II, TABLE IV, §IV-A, TABLE V.
  • [29] S. Li, B. Seybold, A. Vorobyov, X. Lei, and C. Jay Kuo (2018) Unsupervised video object segmentation with motion-based bilateral networks. In Proc. IEEE Eur. Conf. Comput. Vis. (ECCV), pp. 207–223. Cited by: TABLE IV, §IV-F.
  • [30] Y. Li, S. Li, C. Chen, A. Hao, and H. Qin (2019) Accurate and robust video saliency detection via self-paced diffusion. IEEE Trans. Multimedia. (TMM) 22 (5), pp. 1153–1167. Cited by: §II-B2.
  • [31] J. Liu, Q. Hou, M. Cheng, J. Feng, and J. Jiang (2019) A simple pooling-based design for real-time salient object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3917–3926. Cited by: §II-A2.
  • [32] Z. Liu, J. Li, L. Ye, G. Sun, and L. Shen (2016) Saliency detection for unconstrained videos using superpixel-level graph and spatiotemporal propagation. IEEE Trans. Circuits Syst. Video Technol. (TCSVT) 27 (12), pp. 2527–2542. Cited by: TABLE IV, §IV-F, TABLE VI.
  • [33] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli (2019) See more, know more: unsupervised video object segmentation with co-attention siamese networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3623–3632. Cited by: Fig. 5, TABLE IV, §IV-F, TABLE V, TABLE VI.
  • [34] G. Ma, C. Chen, S. Li, C. Peng, A. Hao, and H. Qin (2019) Salient object detection via multiple instance joint re-learning. IEEE Trans. Multimedia. (TMM) 22 (2), pp. 324–336. Cited by: §II-B2.
  • [35] C. Peng, Y. Chen, Z. Kang, C. Chen, and Q. Cheng (2020) Robust principal component analysis: a factorization-based approach with linear complexity. Inf. Sci. 513, pp. 581–599. Cited by: §I.
  • [36] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung (2012) Saliency filters: contrast based filtering for salient region detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 733–740. Cited by: §III-B1, §IV-C.
  • [37] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 724–732. Cited by: §III-B, TABLE II, TABLE IV, §IV-A, §IV-B, TABLE V.
  • [38] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand (2019) Basnet: boundary-aware salient object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 7479–7489. Cited by: §II-A2.
  • [39] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Proc. Int. Conf. Med. Image Comput. Comput. Assist. Intervent., pp. 234–241. Cited by: §III-B3.
  • [40] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §III-B3.
  • [41] H. Song, W. Wang, S. Zhao, J. Shen, and K. Lam (2018) Pyramid dilated deeper convlstm for video salient object detection. In Proc. IEEE Eur. Conf. Comput. Vis. (ECCV), pp. 715–731. Cited by: §II-B2, TABLE IV, §IV-F, TABLE VI.
  • [42] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) Pwc-net: cnns for optical flow using pyramid, warping, and cost volume. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 8934–8943. Cited by: §III-B1.
  • [43] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning video object segmentation with visual memory. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 4481–4490. Cited by: §II-B2.
  • [44] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 4489–4497. Cited by: §I.
  • [45] W. Wang, J. Shen, M. Cheng, and L. Shao (2019) An iterative and cooperative top-down and bottom-up inference network for salient object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 5968–5977. Cited by: §II-A1.
  • [46] W. Wang, J. Shen, and F. Porikli (2015) Saliency-aware geodesic video object segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3395–3402. Cited by: §II-B1.
  • [47] W. Wang, J. Shen, and L. Shao (2015) Consistent video saliency using local gradient flow optimization and global refinement. IEEE Trans. Image Process. (TIP) 24 (11), pp. 4185–4196. Cited by: TABLE II, TABLE IV, §IV-A, TABLE V.
  • [48] W. Wang, J. Shen, and L. Shao (2017) Video salient object detection via fully convolutional networks. IEEE Trans. Image Process. (TIP) 27 (1), pp. 38–49. Cited by: §II-B2.
  • [49] Z. Wu, L. Su, and Q. Huang (2019) Cascaded partial decoder for fast and accurate salient object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 3907–3916. Cited by: Fig. 1, §II-A1, Fig. 5, §III-B1, TABLE IV, §IV-F.
  • [50] T. Xi, W. Zhao, H. Wang, and W. Lin (2016) Salient object detection with spatiotemporal background priors for video. IEEE Trans. Image Process. (TIP) 26 (7), pp. 3425–3436. Cited by: TABLE IV, §IV-F.
  • [51] S. Xingjian, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional lstm network: a machine learning approach for precipitation nowcasting. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), pp. 802–810. Cited by: §I.
  • [52] C. Yan, B. Gong, Y. Wei, and Y. Gao (2020) Deep multi-view enhancement hashing for image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI). Cited by: §I.
  • [53] C. Yan, B. Shao, H. Zhao, R. Ning, Y. Zhang, and F. Xu (2020) 3d room layout estimation from a single rgb image. IEEE Trans. Multimedia. (TMM). Cited by: §I.
  • [54] L. Zhang, J. Zhang, Z. Lin, H. Lu, and Y. He (2019) CapSal: leveraging captioning to boost semantics for salient object detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 6024–6033. Cited by: §II-A2.
  • [55] J. Zhao, J. Liu, D. Fan, Y. Cao, J. Yang, and M. Cheng (2019) EGNet: edge guidance network for salient object detection. In Proc. IEEE Int. Conf. Comput. Vis. (ICCV), pp. 8779–8788. Cited by: §II-A2.
  • [56] C. Zhu, X. Cai, K. Huang, T. H. Li, and G. Li (2019) Pdnet: prior-model guided depth-enhanced network for salient object detection. In Proc. IEEE Int. Conf. Multimedia Expo. (ICME), pp. 199–204. Cited by: §II-A2.