
Full-Duplex Strategy for Video Object Segmentation

by Ge-Peng Ji, et al.

Appearance and motion are two important sources of information in video object segmentation (VOS). Previous methods mainly focus on simplex solutions, which lowers the upper bound of feature collaboration among and across these two cues. In this paper, we study a novel framework, termed the FSNet (Full-duplex Strategy Network), which designs a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding subspaces. Furthermore, the bidirectional purification module (BPM) is introduced to update the inconsistent features between the spatial-temporal embeddings, effectively improving the model robustness. By considering the mutual restraint within the full-duplex strategy, our FSNet performs the cross-modal feature-passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage, making it robust to various challenging scenarios (e.g., motion blur, occlusion) in VOS. Extensive experiments on five popular benchmarks (i.e., DAVIS_16, FBMS, MCL, SegTrack-V2, and DAVSOD_19) show that our FSNet outperforms other state-of-the-art methods for both the VOS and video salient object detection tasks.



1 Introduction

Video object segmentation (VOS) [104, 12, 107, 35] is a fundamental topic in computer vision for intelligent video analysis, whose purpose is to delineate pixel-level moving object masks in each frame (we use ‘foreground object’ and ‘target object’ interchangeably). It has been widely applied to robotic manipulation [1], autonomous cars [61], video editing [37], medicine [39], optical flow estimation [18], interactive segmentation [9, 63, 32], URVOS [78], and video captioning [68]. There are two settings for addressing this task (i.e., semi-supervised [97] and unsupervised [62] VOS), depending on whether or not the candidate object is given manually in the first frame. In this work, we focus on the unsupervised setting, i.e., zero-shot VOS [130, 129]. For semi-supervised VOS, we refer readers to prior works [125, 76, 123, 46, 56, 79, 5, 119, 117, 8].

Figure 1: Visual comparison between the simplex (i.e., (a) appearance-refined motion and (b) motion-refined appearance) and our full-duplex strategy. In contrast, our FSNet offers a collaborative way to leverage the appearance and motion cues under the mutual restraint of full-duplex strategy, thus providing more accurate structure details and alleviating the short-term feature drifting issue [120].

Recent years have witnessed promising progress in video content understanding by exploiting appearance (e.g., color frame [122]) and motion (e.g., optical flow [36, 86] and pixel trajectory [82]) correlations between frames. However, short-term dependency estimation (i.e., one-step motion cues [36, 86]) produces unreliable results and suffers from common ordeals [33] (e.g., diffusion, noise, and deformation), while the capability of appearance-based modeling (e.g., recurrent neural networks (RNNs) [88, 62]) is severely hindered by blurred foregrounds or cluttered backgrounds [14]. These conflicts tend to accumulate inaccuracies as the spatial-temporal embeddings propagate, which causes short-term feature drifting [120].


Figure 2: Mean contour accuracy ($\mathcal{F}$) vs. mean region similarity ($\mathcal{J}$) scores on DAVIS [74]. Circles indicate UVOS methods. Four variants of our FSNet are shown in bold-italic, in which ‘$N$’ indicates the number of BPMs. Compared with the best unsupervised VOS model (MAT [130] with CRF [45] post-processing), the proposed FSNet ($N$=4, CRF) achieves the new SOTA by a large margin.

Earlier solutions address this issue using a direction-independent strategy [41, 38, 16, 88, 111], which encodes the appearance and motion features individually and fuses them directly. However, this implicit strategy causes feature conflicts, since motion and appearance are two distinct modalities extracted from separate branches. A reasonable idea is to integrate them in a guided manner, and thus several recent approaches opt for the simplex strategy [130, 91, 57, 65, 53, 33, 71], which is either appearance-based or motion-guided. Although these two strategies have achieved remarkable advances, they both fail to model the mutual restraint between the appearance and motion cues, which jointly guide human visual attention allocation during dynamic observation, according to previous studies in cognitive psychology [43, 108, 90] and computer vision [96, 38].

For the same object, we argue that appearance and motion characteristics should be homogeneous to a certain degree. Intuitively, as shown in Fig. 1, the foreground regions of the appearance (top-left) and motion (bottom-left) maps intrinsically share correlated perceptual patterns, including semantic structure and movement posture. However, misguided knowledge in an individual modality, e.g., static spectators at the bullring and a dynamic watermark on TV (blue boxes), produces inaccuracies during feature propagation and thus easily contaminates the result (red boxes).

To alleviate the above conflicts, it is important to introduce a new modality transmission scheme, instead of embedding them individually. Inspired by this, we introduce the idea of full-duplex (on the same channel, information can be transmitted and received simultaneously [4]) from the field of wireless communication. As shown in Fig. 4 (c) & Fig. 5 (c), this is a bidirectional-attention scheme across motion and appearance cues, which explicitly incorporates the appearance and motion patterns in a unified framework. As can be seen in the first row of Fig. 1, the proposed Full-duplex Strategy Network (FSNet) visually performs better than the one with the simplex strategy. To understand what enables good learning strategies, we comprehensively delve into the simplex and full-duplex strategies of our framework and present the following contributions:

  • We emphasize the importance of the full-duplex strategy for the spatial-temporal representations. Specifically, a bidirectional interaction module, termed the relational cross-attention module (RCAM), is used to extract discriminative features from the appearance and motion branches, which ensures the mutual restraint between each other.

  • To further improve the model robustness, we introduce a bidirectional purification module (BPM), which is equipped with an interlaced decremental connection (IDC) to automatically update inconsistent features between the spatial-temporal embeddings.

  • We demonstrate that our FSNet achieves superior performance on five mainstream benchmarks. In particular, FSNet ($N$=4, CRF) outperforms the SOTA UVOS model (i.e., MAT [130]) on the DAVIS [74] leaderboard by a margin of 2.4% in terms of $\mathcal{F}$ score (see Fig. 2), with less training data (i.e., Ours-13K vs. MAT-16K). This suggests that the mutual restraint within the full-duplex strategy is promising for spatial-temporal learning tasks.

2 Related Works

2.1 Unsupervised VOS

Although there are many works [40, 110, 92, 15, 7, 72] addressing the VOS task in a semi-supervised manner, i.e., by supposing an object mask annotation is given in the first frame, other researchers have attempted to address the more challenging unsupervised VOS (UVOS) problem. Early UVOS models resort to low-level handcrafted features for heuristic segmentation inference, such as long sparse point trajectories [6, 66, 26, 81, 100], object proposals [50, 51, 60, 75], saliency priors [98, 20, 95], optical flow [91], or superpixels [27, 28, 112]. As such, these traditional models have limited generalizability, and thus low accuracy in highly dynamic and complex scenarios, due to their lack of semantic information and high-level content understanding. Recently, RNN-based models [84, 102, 117, 128, 88, 3] have become popular due to their better capability of capturing long-term dependencies, as well as their use of deep learning. In this case, UVOS is formulated as a recurrent modeling issue over time, where spatial features are jointly exploited with long-term temporal context.

Figure 3: The pipeline of our FSNet. The Relational Cross-Attention Module (RCAM) abstracts more discriminative representations between the motion and appearance cues using the full-duplex strategy. Then four Bidirectional Purification Modules (BPM) are stacked to further re-calibrate inconsistencies between the motion and appearance features. Finally, we utilize the decoder to generate our prediction.

How to combine motion cues with appearance features is a long-standing problem in this field. To this end, Tokmakov et al. [87] proposed to simply use the motion patterns required from the video. However, their method is unable to accurately segment objects between two similar consecutive frames, since it relies heavily on the guidance of optical flow. To resolve this, several works [88, 16, 83] have integrated the spatial and temporal features from a parallel network, which can be viewed as plain feature fusion of independent spatial and temporal branches with an implicit modeling strategy. Li et al. [54] proposed a multi-stage processing method to tackle UVOS, which first utilizes a fixed appearance-based network to generate objectness and then feeds this into a motion-based bilateral estimator to segment the objects.

2.2 Attention-based VOS

The attention-based VOS task is closely related to UVOS, since it aims at extracting attention-aware object(s) from a video clip. Traditional methods [101, 131, 115, 34, 114] first compute the single-frame saliency based on various hand-crafted static and motion features, and then conduct spatial-temporal optimization to preserve coherency across consecutive frames. Recent works [99, 48, 64, 124] aim to learn a highly semantic representation and usually perform spatial-temporal detection in an end-to-end manner. Many schemes have been proposed to employ deep networks that consider temporal information, such as using ConvLSTM [84, 25, 52], taking optical flows or adjacent frames as input [53, 99, 113], applying 3D convolutions [48, 64], or directly exploiting temporally concatenated deep features [49]. Besides, long-term influences are often taken into account and combined with deep learning. Li et al. [55] proposed a key-frame strategy to locate representative high-quality video frames of salient objects and diffused their saliency to ill-detected non-key frames. Chen et al. [10] improved saliency detection by leveraging long-term spatial-temporal information, where high-quality “beyond-the-scope frames” are aligned with the current frames and both types of information are fed to deep neural networks for classification. Besides considering how to better leverage temporal information, other researchers have attempted to address different problems in VSOD, such as reducing the data labeling requirements [118], developing semi-supervised approaches [85], or investigating relative saliency [105]. Fan et al. [25] recently introduced a VSOD model equipped with a saliency shift-aware ConvLSTM, together with an attention-consistent VSOD dataset with high-quality annotations.

3 Methodology

Figure 4: Illustration of our Relational Cross-Attention Module (RCAM) with a simplex (a & b) and full-duplex (c) strategy.

3.1 Overview

Suppose that a video clip contains $T$ consecutive frames $\{A^t\}_{t=1}^{T}$. We first utilize an optical flow field generator $\mathcal{H}$, i.e., FlowNet 2.0 [36], to generate $T-1$ optical flow maps $\{M^t\}_{t=1}^{T-1}$, each computed from two adjacent frames ($M^t = \mathcal{H}[A^t, A^{t+1}]$). To ensure the inputs match, we discard the last frame in the pipeline. Thus, the proposed pipeline takes both the appearance images $\{A^t\}_{t=1}^{T-1}$ and their paired motion maps $\{M^t\}_{t=1}^{T-1}$ as the input. First, the $M^t$ & $A^t$ pairs at frame $t$ (in what follows, we omit the superscript “$t$” for convenience) are fed to two independent ResNet-50 [31] branches (i.e., motion and appearance blocks in Fig. 3). The appearance features $\{X_k\}_{k=1}^{K}$ and motion features $\{Y_k\}_{k=1}^{K}$ extracted from $K$ layers are then sent to the Relational Cross-Attention Modules (RCAMs), which allow the network to embed spatial-temporal cross-modal features. Next, we employ the Bidirectional Purification Modules (BPMs) with $N$ cascaded units. BPMs focus on distilling representative carriers from fused features $\{F_k^n\}_{n=1}^{N}$ and motion-based features $\{G_k^n\}_{n=1}^{N}$. Finally, the predictions (i.e., $S_M^t$ and $S_A^t$) at frame $t$ are generated from two decoder blocks.
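The input pairing described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the names `build_inputs` and `toy_flow` are hypothetical, and the toy flow function merely stands in for a real estimator such as FlowNet 2.0.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")
Flow = TypeVar("Flow")


def build_inputs(
    frames: Sequence[Frame],
    flow_fn: Callable[[Frame, Frame], Flow],
) -> List[Tuple[Frame, Flow]]:
    """Pair each frame with the flow computed towards its successor.

    A clip of T frames yields T - 1 (frame, flow) pairs; the last frame
    has no successor and is dropped so that both input streams match.
    """
    return [
        (frames[t], flow_fn(frames[t], frames[t + 1]))
        for t in range(len(frames) - 1)
    ]


def toy_flow(a: float, b: float) -> float:
    """Hypothetical stand-in for an optical flow estimator.

    Frames are plain numbers here and the "flow" is their difference.
    """
    return b - a


pairs = build_inputs([0.0, 1.0, 3.0, 6.0], toy_flow)
print(pairs)  # [(0.0, 1.0), (1.0, 2.0), (3.0, 3.0)]
```

Four frames produce three pairs, which matches the statement that the last frame is discarded.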

3.2 Relational Cross-Attention Module

As discussed in §1, a single-modality (i.e., motion or appearance) guided stimulation may cause the model to make incorrect decisions. To alleviate this, we design a relational cross-attention module (RCAM) via the channel-wise attention mechanism, which focuses on distilling effective squeezed cues from the two modalities and then letting them modulate each other. As shown in Fig. 4 (c), the two inputs of RCAM are appearance features $\{X_k\}_{k=1}^{K}$ and motion features $\{Y_k\}_{k=1}^{K}$, which are obtained from the two different branches of the standard ResNet-50 [31]. Specifically, for each level $k$, we first perform global average pooling (GAP) to generate channel-wise vectors $V_k^X$ and $V_k^Y$ from $X_k$ and $Y_k$, respectively. Next, two 1$\times$1 conv layers, i.e., $\phi(x; W_\phi)$ and $\theta(x; W_\theta)$, with learnable parameters $W_\phi$ and $W_\theta$, generate two discriminated global descriptors. The sigmoid function $\sigma[x] = e^x/(e^x+1)$ is then applied to convert the final descriptors into the interval [0, 1], i.e., into valid attention vectors for channel weighting. Then, we perform an outer product $\otimes$ between $X_k$ and $\sigma[\theta(V_k^Y; W_\theta)]$ to generate a candidate feature $Q_k^X$, and vice versa, as follows:

$$Q_k^X = X_k \otimes \sigma[\theta(V_k^Y; W_\theta)], \qquad Q_k^Y = Y_k \otimes \sigma[\phi(V_k^X; W_\phi)].$$

Then, we combine $Q_k^X$, $Q_k^Y$, and the lower-level fused feature $Z_{k-1}$ for in-depth feature extraction. With an element-wise addition operation $\oplus$, conducted in the corresponding $k$-th level block $\mathcal{B}_k[x]$ of the ResNet-50, we finally obtain the fused features $Z_k$ that contain comprehensive spatial-temporal correlations:

$$Z_k = \mathcal{B}_k[Q_k^X \oplus Q_k^Y \oplus Z_{k-1}],$$

where $k \in \{1, \dots, K\}$ denotes different feature hierarchies in the backbone. Note that $Z_0$ denotes the zero tensor. In our implementation, we use the top four feature pyramids, i.e., $K = 4$, as suggested by [106, 126].
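The cross-attention step can be illustrated with a dependency-free sketch. It is only a toy under stated assumptions: features are nested Python lists rather than tensors, the 1×1 conv on a pooled vector is written as a plain matrix product, the ResNet block applied after the sum is omitted, and all three inputs are assumed to share one spatial size. The function names (`rcam`, `attention`, `scale`) are hypothetical.

```python
import math
from typing import List

Channel = List[List[float]]  # H x W map
Feature = List[Channel]      # C x H x W feature
Matrix = List[List[float]]   # C x C weights of a 1x1 conv on a pooled vector


def global_avg_pool(feat: Feature) -> List[float]:
    """Squeeze each channel to its spatial mean."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def attention(vec: List[float], weight: Matrix) -> List[float]:
    """1x1 conv on a pooled vector (a matrix product), then sigmoid."""
    return [sigmoid(sum(w * v for w, v in zip(row, vec))) for row in weight]


def scale(feat: Feature, attn: List[float]) -> Feature:
    """Re-weight every channel of `feat` by its attention value."""
    return [[[a * px for px in row] for row in ch] for ch, a in zip(feat, attn)]


def rcam(app: Feature, mot: Feature, w_app: Matrix, w_mot: Matrix,
         prev: Feature) -> Feature:
    """Full-duplex exchange: each branch is gated by the other's descriptor."""
    # Appearance is weighted by the attention vector derived from motion ...
    app_gated = scale(app, attention(global_avg_pool(mot), w_mot))
    # ... and motion by the attention vector derived from appearance.
    mot_gated = scale(mot, attention(global_avg_pool(app), w_app))
    # Element-wise sum of both gated features and the lower-level feature.
    return [
        [[a + m + p for a, m, p in zip(ra, rm, rp)]
         for ra, rm, rp in zip(ca, cm, cp)]
        for ca, cm, cp in zip(app_gated, mot_gated, prev)
    ]


identity = [[1.0, 0.0], [0.0, 1.0]]
app = [[[1.0, 3.0]], [[2.0, 2.0]]]   # 2 channels, 1 x 2 pixels
mot = [[[0.0, 0.0]], [[4.0, 0.0]]]
zero = [[[0.0, 0.0]], [[0.0, 0.0]]]  # stands in for the zero tensor at the first level
fused = rcam(app, mot, identity, identity, zero)
print(fused[0][0])  # [0.5, 1.5]
```

In the toy input, the first motion channel is all zeros, so its attention value is sigmoid(0) = 0.5 and the first appearance channel is halved. Dropping either of the two `scale` calls would turn this into one of the simplex variants.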

3.3 Bidirectional Purification Module

In addition to the RCAM described above, which integrates common cross-modality features, we further introduce the bidirectional purification module (BPM) to improve the model robustness. Following the standard in action recognition [80] and saliency detection [109], our bidirectional purification phase is composed of $N$ BPMs connected in a cascaded manner. As shown in Fig. 3, we first employ the feature allocator $\psi_{\{F,G\}}(x; W_\psi^{\{F,G\}})$ to unify the feature representations from the previous stage:

$$F_k^n = \psi_F(Z_k; W_\psi^F), \qquad G_k^n = \psi_G(Y_k; W_\psi^G),$$

where $Z_k$ and $Y_k$ are the level-$k$ fused (RCAM output) and motion features, and $k \in \{1, \dots, K\}$ and $n \in \{1, \dots, N\}$ denote the feature hierarchy and the index of the BPM, respectively. To be specific, $\psi_{\{F,G\}}$ is composed of two 3$\times$3 conv layers, each with 32 filters to reduce the feature channels. Note that the allocator helps reduce the computational burden and facilitates various element-wise operations.

Figure 5: Illustration of our Bidirectional Purification Module (BPM) with a simplex and full-duplex strategy.

Here, we consider a full-duplex scheme (see Fig. 5 (c)) that contains two simplex strategies (see Fig. 5 (a & b)) in the BPM. On the one hand, the motion features $G_k^n$ contain temporal cues and can be used to enrich the fused features $F_k^n$ via the concatenation operation. On the other hand, the distractors in the motion features $G_k^n$ can be suppressed by multiplying them with the fused features $F_k^n$. Besides, to acquire a robust feature representation, we introduce an efficient cross-modal fusion strategy in this scheme, which broadcasts high-level, semantically strong features to low-level, semantically weak features via an interlaced decremental connection (IDC) with a top-down pathway [58]. Specifically, as the first part, the spatial-temporal feature combination branch (see Fig. 5 (b)) is formulated as:

$$F_k^{n+1} = F_k^n \oplus \bigcup_{i=k}^{K} [F_k^n, \mathcal{P}(G_i^n)],$$

where $\mathcal{P}$ is an up-sampling operation followed by a 1$\times$1 convolutional layer (conv) to reshape the candidate guidance to a consistent size with $F_k^n$. Symbols $\oplus$ and $\bigcup$ respectively denote element-wise addition and concatenation operations with an IDC strategy (for instance, $\bigcup_{i=2}^{K=4} [F_2^n, \mathcal{P}(G_i^n)] = [F_2^n, \mathcal{P}(G_2^n), \mathcal{P}(G_3^n), \mathcal{P}(G_4^n)]$ when $k=2$ and $K=4$), followed by a 1$\times$1 conv with 32 filters. For the other part, we formulate the temporal feature re-calibration branch (see Fig. 5 (a)) as:

$$G_k^{n+1} = G_k^n \oplus \bigcap_{i=k}^{K} [G_k^n, \mathcal{P}(F_i^n)],$$

where $\bigcap$ denotes element-wise multiplication with an IDC strategy, followed by a 1$\times$1 conv with 32 filters.
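The connection pattern of one BPM can be sketched with scalars in place of feature maps. Everything below is an illustrative assumption rather than the authors' implementation: level k is assumed to gather levels k through K of the other branch (the top-down, "decremental" pattern), a mean stands in for "concatenate, then 1×1 conv", a running product stands in for the multiplicative re-calibration, and up-sampling is a no-op because scalars have no spatial size. The names `idc_sources` and `bpm_step` are hypothetical.

```python
from typing import Dict, List, Tuple


def idc_sources(k: int, num_levels: int) -> List[int]:
    """Levels of the other branch that feed level k: k, k+1, ..., K."""
    return list(range(k, num_levels + 1))


def bpm_step(
    fused: Dict[int, float], motion: Dict[int, float]
) -> Tuple[Dict[int, float], Dict[int, float]]:
    """One bidirectional purification step on scalar 'features'."""
    num_levels = len(fused)
    new_fused: Dict[int, float] = {}
    new_motion: Dict[int, float] = {}
    for k in range(1, num_levels + 1):
        sources = idc_sources(k, num_levels)
        # Combination branch: gather [F_k, G_k, ..., G_K]; the mean stands
        # in for concatenation + 1x1 conv. A residual addition follows.
        gathered = [fused[k]] + [motion[i] for i in sources]
        new_fused[k] = fused[k] + sum(gathered) / len(gathered)
        # Re-calibration branch: gate G_k by F_k, ..., F_K via products,
        # again followed by a residual addition.
        gate = 1.0
        for i in sources:
            gate *= fused[i]
        new_motion[k] = motion[k] + motion[k] * gate
    return new_fused, new_motion


print(idc_sources(2, 4))  # [2, 3, 4]
f_next, g_next = bpm_step({1: 1.0, 2: 1.0}, {1: 2.0, 2: 4.0})
print(f_next[2], g_next[1])  # 3.5 4.0
```

Cascading several BPMs then amounts to calling `bpm_step` repeatedly on its own outputs; both branches are updated from the same inputs within a step, which is what makes the exchange bidirectional.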

3.4 Decoder

After feature aggregation and re-calibration with multi-pyramidal interaction, the last BPM unit produces two groups of discriminative features (i.e., $F_k^N$ & $G_k^N$) with a consistent channel number of 32. We integrate the pyramid pooling module (PPM) [127] into each skip connection of the U-Net [77] as our decoder, and only adopt the top four layers in our implementation ($K=4$). Since the features are fused from high to low level, global information is well retained at different scales of the designed decoder:

$$\hat{F}_k^N = \mathcal{C}[F_k^N \uplus \mathcal{UP}(\hat{F}_{k+1}^N)], \qquad \hat{G}_k^N = \mathcal{C}[G_k^N \uplus \mathcal{UP}(\hat{G}_{k+1}^N)].$$

Here, $\mathcal{UP}$ indicates the upsampling operation after the pyramid pooling layer, while $\uplus$ is a concatenation operation between two features. Then, a conv $\mathcal{C}$ is used for reducing the channels from 64 to 32. Lastly, we use a 1$\times$1 conv with a single filter after the upstream output (i.e., $\hat{F}_1^N$ & $\hat{G}_1^N$), followed by a sigmoid activation function to generate the predictions (i.e., $S_A^t$ & $S_M^t$) at frame $t$.

3.5 Training

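Given a group of predictions $S^t \in \{S_A^t, S_M^t\}$ and the corresponding ground-truth $G^t$ at frame $t$, we employ the standard binary cross-entropy loss $\mathcal{L}_{bce}$ to measure the dissimilarity between output and target, which computes:

$$\mathcal{L}_{bce}(S^t, G^t) = -\sum_{(x,y)} \big[ G^t(x,y) \log S^t(x,y) + (1 - G^t(x,y)) \log (1 - S^t(x,y)) \big],$$

where $(x,y)$ indicates a coordinate in the frame. The overall loss function is then formulated as:

$$\mathcal{L}_{total} = \mathcal{L}_{bce}(S_A^t, G^t) + \mathcal{L}_{bce}(S_M^t, G^t).$$

For final prediction, we use $S_A^t$ since our experiments show that it better combines appearance and motion cues.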
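The training objective can be written out in plain Python as a sanity check. This is a sketch under assumptions: predictions and ground truth are nested lists of values in [0, 1], the loss is averaged over pixels (a sum differs only by a constant factor), predictions are clamped to keep the logarithm finite, and the overall loss is taken as the sum of the two per-branch terms. The names `bce_loss` and `total_loss` are hypothetical.

```python
import math
from typing import List

Map = List[List[float]]  # H x W map with values in [0, 1]


def bce_loss(pred: Map, target: Map, eps: float = 1e-7) -> float:
    """Binary cross-entropy averaged over all pixel coordinates (x, y)."""
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred, target):
        for p, g in zip(pred_row, gt_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
            total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
            count += 1
    return total / count


def total_loss(pred_fused: Map, pred_motion: Map, target: Map) -> float:
    """Sum of the per-branch losses, one term per decoder output."""
    return bce_loss(pred_fused, target) + bce_loss(pred_motion, target)


gt = [[1.0, 0.0]]
uncertain = [[0.5, 0.5]]    # maximally unsure prediction
confident = [[0.99, 0.01]]  # close to the ground truth
print(bce_loss(uncertain, gt) > bce_loss(confident, gt))  # True
```

A prediction of 0.5 everywhere costs exactly ln 2 per pixel regardless of the ground truth, which makes a convenient reference value when checking an implementation.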

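| Setting | Method | w/ Flow | w/ CRF | Mean-$\mathcal{J}$ | Mean-$\mathcal{F}$ |
|---|---|---|---|---|---|
| Unsupervised | FSNet (Ours) | | | 83.4 | 83.1 |
| Unsupervised | FSNet (Ours) | | | 82.1 | 83.0 |
| Unsupervised | [130] | | | 82.4 | 80.7 |
| Unsupervised | [94] | | | 80.7 | 79.1 |
| Unsupervised | [120] | | | 81.7 | 80.5 |
| Unsupervised | [59] | | | 80.5 | 79.5 |
| Unsupervised | [102] | | | 79.7 | 77.4 |
| Unsupervised | [19] | | | 80.6 | 75.5 |
| Unsupervised | [83] | | | 77.2 | 77.4 |
| Unsupervised | [89] | | | 78.2 | 75.9 |
| Unsupervised | [44] | | | 76.2 | 70.6 |
| Unsupervised | [88] | | | 75.9 | 72.1 |
| Unsupervised | [87] | | | 70.0 | 65.9 |
| Unsupervised | [16] | | | 67.4 | 66.7 |
| Unsupervised | [47] | | | 61.8 | 61.2 |
| Unsupervised | [69] | | | 55.8 | 51.1 |
| Semi-supervised | [121] | | | 85.3 | 86.9 |
| Semi-supervised | [40] | | | 81.5 | 82.2 |
| Semi-supervised | [110] | | | 81.5 | 82.0 |
| Semi-supervised | [92] | | | 81.1 | 82.2 |
| Semi-supervised | [15] | | | 82.4 | 79.5 |
| Semi-supervised | [7] | | | 79.8 | 80.6 |
| Semi-supervised | [72] | | | 79.7 | 75.4 |

Table 1: Video object segmentation (VOS) performance of our FSNet, compared with 14 SOTA unsupervised and seven semi-supervised models on DAVIS [74] validation set. ‘w/ Flow’: the optical flow algorithm is used. ‘w/ CRF’: conditional random field [45] is used for post-processing. The best scores are marked in bold.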
Columns give, for each dataset, $\mathcal{S}_\alpha$ / $E_\xi^{max}$ / $F_\beta^{max}$ / $\mathcal{M}$.

| Year | Model | DAVIS [74] | MCL [42] | FBMS [67] | DAVSOD-Easy35 [25] |
|---|---|---|---|---|---|
| 2018 | MBN [54] | 0.887 / 0.966 / 0.862 / 0.031 | 0.755 / 0.858 / 0.698 / 0.119 | 0.857 / 0.892 / 0.816 / 0.047 | 0.646 / 0.694 / 0.506 / 0.109 |
| 2018 | FGRN [52] | 0.838 / 0.917 / 0.783 / 0.043 | 0.709 / 0.817 / 0.625 / 0.044 | 0.809 / 0.863 / 0.767 / 0.088 | 0.701 / 0.765 / 0.589 / 0.095 |
| 2018 | SCNN [85] | 0.761 / 0.843 / 0.679 / 0.077 | 0.730 / 0.828 / 0.628 / 0.054 | 0.794 / 0.865 / 0.762 / 0.095 | 0.680 / 0.745 / 0.541 / 0.127 |
| 2018 | DLVS [99] | 0.802 / 0.895 / 0.721 / 0.055 | 0.682 / 0.810 / 0.551 / 0.060 | 0.794 / 0.861 / 0.759 / 0.091 | 0.664 / 0.737 / 0.541 / 0.129 |
| 2018 | SCOM [13] | 0.814 / 0.874 / 0.746 / 0.055 | 0.569 / 0.704 / 0.422 / 0.204 | 0.794 / 0.873 / 0.797 / 0.079 | 0.603 / 0.669 / 0.473 / 0.219 |
| 2019-2020 | RSE [115] | 0.748 / 0.878 / 0.698 / 0.063 | 0.682 / 0.657 / 0.576 / 0.073 | 0.670 / 0.790 / 0.652 / 0.128 | 0.577 / 0.663 / 0.417 / 0.146 |
| 2019-2020 | SRP [17] | 0.662 / 0.843 / 0.660 / 0.070 | 0.689 / 0.812 / 0.646 / 0.058 | 0.648 / 0.773 / 0.671 / 0.134 | 0.575 / 0.655 / 0.453 / 0.146 |
| 2019-2020 | MESO [116] | 0.718 / 0.853 / 0.660 / 0.070 | 0.477 / 0.730 / 0.144 / 0.102 | 0.635 / 0.767 / 0.618 / 0.134 | 0.549 / 0.673 / 0.360 / 0.159 |
| 2019-2020 | LTSI [10] | 0.876 / 0.957 / 0.850 / 0.034 | 0.768 / 0.872 / 0.667 / 0.044 | 0.805 / 0.871 / 0.799 / 0.087 | 0.695 / 0.769 / 0.585 / 0.106 |
| 2019-2020 | SPD [55] | 0.783 / 0.892 / 0.763 / 0.061 | 0.685 / 0.794 / 0.601 / 0.069 | 0.691 / 0.804 / 0.686 / 0.125 | 0.626 / 0.685 / 0.500 / 0.138 |
| 2019-2020 | SSAV [25] | 0.893 / 0.948 / 0.861 / 0.028 | 0.819 / 0.889 / 0.773 / 0.026 | 0.879 / 0.926 / 0.865 / 0.040 | 0.755 / 0.806 / 0.659 / 0.084 |
| 2019-2020 | RCR [118] | 0.886 / 0.947 / 0.848 / 0.027 | 0.820 / 0.895 / 0.742 / 0.028 | 0.872 / 0.905 / 0.859 / 0.053 | 0.741 / 0.803 / 0.653 / 0.087 |
| 2019-2020 | PCSA [29] | 0.902 / 0.961 / 0.880 / 0.022 | N/A / N/A / N/A / N/A | 0.868 / 0.920 / 0.837 / 0.040 | 0.741 / 0.793 / 0.656 / 0.086 |
| | FSNet (Ours) | 0.920 / 0.970 / 0.907 / 0.020 | 0.864 / 0.924 / 0.821 / 0.023 | 0.890 / 0.935 / 0.888 / 0.041 | 0.773 / 0.825 / 0.685 / 0.072 |

Table 2: Video salient object detection (VSOD) performance of our FSNet, compared with 13 SOTA models on several VSOD datasets. For a fair comparison, our results here are non-binary saliency maps generated without CRF [45]. ‘N/A’ means the results are not available.

3.6 Implementation Details

Training Settings. We implement our model in PyTorch [70], accelerated by an NVIDIA RTX TITAN. All the inputs are uniformly resized to 352$\times$352. To enhance the stability and generalizability of the learning algorithm, we employ the multi-scale training strategy [30] in the training phase. Based on the experiments in Tab. 4, $N$=4 (the number of BPMs) achieves the best performance. We utilize the stochastic gradient descent (SGD) algorithm to optimize the entire network, configured with momentum, learning rate, and weight decay hyperparameters.

Testing Settings and Runtime. Given a frame along with its motion map, we resize them to 352$\times$352 and feed them into the corresponding branches. Similar to [130, 59, 102], we employ the conditional random field (CRF) [45] post-processing technique for a fair comparison. The inference time of our method is 0.08s per frame, excluding flow generation and CRF post-processing.

4 Experiments

4.1 UVOS and VSOD

Datasets. We evaluate the proposed model on four widely used VOS datasets. DAVIS [74] is the most popular of these, and consists of 50 (30 training and 20 validation) high-quality and densely annotated video sequences. MCL [42] contains 9 videos and is mainly used as testing data. FBMS [67] includes 59 natural videos, of which 29 sequences are used as the training set and 30 are for testing. SegTrack-V2 [51] is one of the earliest VOS datasets, and consists of 13 clips. In addition, DAVSOD [25] was specifically designed for the VSOD task. It is the most challenging visual-attention-consistent VSOD dataset, with high-quality annotations and diverse attributes.

Metrics. We adopt six standard metrics: mean region similarity ($\mathcal{J}$) [74], mean contour accuracy ($\mathcal{F}$) [74], structure-measure ($\mathcal{S}_\alpha$, $\alpha$=0.5) [21, 11], maximum enhanced-alignment measure ($E_\xi^{max}$) [23, 24], maximum F-measure ($F_\beta^{max}$, $\beta^2$=0.3) [2], and mean absolute error (MAE, $\mathcal{M}$) [73].
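Two of these metrics are simple enough to state in a few lines. The sketch below is illustrative rather than a reference implementation: region similarity is computed as the intersection-over-union of a binarized prediction and a binary ground-truth mask, and MAE as the mean per-pixel absolute difference. The 0.5 threshold follows the evaluation protocol described in this section; treating two empty masks as a perfect match and using `>=` at the threshold are assumptions of this sketch, as are the function names.

```python
from typing import List

Map = List[List[float]]   # H x W prediction with values in [0, 1]
Mask = List[List[int]]    # H x W binary mask


def binarize(pred: Map, threshold: float = 0.5) -> Mask:
    """Turn a soft prediction into a binary mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in pred]


def region_similarity(mask: Mask, gt: Mask) -> float:
    """Intersection-over-union between two binary masks."""
    inter = union = 0
    for mask_row, gt_row in zip(mask, gt):
        for m, g in zip(mask_row, gt_row):
            inter += m & g
            union += m | g
    # Assumption: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_absolute_error(pred: Map, gt: Mask) -> float:
    """Average per-pixel absolute difference (lower is better)."""
    diffs = [abs(p - g) for pr, gr in zip(pred, gt) for p, g in zip(pr, gr)]
    return sum(diffs) / len(diffs)


pred = [[0.9, 0.6], [0.2, 0.0]]
gt = [[1, 0], [1, 0]]
print(binarize(pred))  # [[1, 1], [0, 0]]
print(round(region_similarity(binarize(pred), gt), 3))  # 0.333
print(round(mean_absolute_error(pred, gt), 3))  # 0.375
```

In the toy example, one pixel is shared and three pixels are covered by either mask, giving a region similarity of 1/3. Dataset-level scores are then the mean of such per-frame values.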

Training. Following a similar multi-task training setup as [53], we divide our training procedure into three steps: (i) we first use a well-known static saliency dataset, DUTS [93], to train the spatial branch to avoid over-fitting, as in [99, 84, 25]; (ii) we then train the temporal branch on the generated optical flow maps; and (iii) we finally load the weights pretrained on the two sub-tasks into the spatial and temporal branches, and the whole network is trained end-to-end on the training sets of DAVIS (30 clips) and FBMS (29 clips). The last step takes about 4 hours and converges after 20 epochs with a batch size of 8.

Figure 6: Qualitative results on five datasets, including DAVIS [74], MCL [42], FBMS [67], SegTrack-V2 [51], and DAVSOD [25].

Testing.  We follow the standard benchmarks [74, 25] to test our model on the validation set (20 sequences) of DAVIS, the test set of FBMS (30 clips), the test set (Easy35 split) of DAVSOD (35 clips), the whole of MCL (9 clips), and the whole of SegTrack-V2 (13 clips).

Evaluation on DAVIS. As shown in Tab. 1, we compare our FSNet with 14 SOTA UVOS models on the DAVIS public leaderboard. We also compare it with seven recent semi-supervised approaches as reference. For fair comparison, we use a threshold of 0.5 to generate the final binary maps, as recommended by [120]. Our FSNet outperforms the best model (AAAI’20-MAT [130]) by a margin of 2.4% in $\mathcal{F}$ and 1.0% in $\mathcal{J}$, achieving the new SOTA performance. Notably, the proposed UVOS model also outperforms semi-supervised models (e.g., AGA [40]), even though they utilize the first GT mask as the reference of object location.

We also compare FSNet against 13 SOTA VSOD models. We obtain the non-binary saliency maps from the standard benchmark [25] (note that all compared maps in VSOD, including ours, are non-binary). As can be seen from Tab. 2, our method consistently outperforms all other models since 2018, on all metrics. In particular, for the $\mathcal{S}_\alpha$ and $F_\beta^{max}$ metrics, our method improves the performance by roughly 2.0% compared with the best AAAI’20-PCSA [29] model.

Evaluation on MCL. This dataset has fuzzy object boundaries in the low-resolution frames, due to fast object movements. Therefore, the overall performance is lower than on DAVIS. As shown in Tab. 2, our method still stands out in these extreme circumstances, with a 3.0% to 8.0% increase in all metrics compared with ICCV’19-RCR [118] and CVPR’19-SSAV [25].

Evaluation on FBMS. This is one of the most popular VOS datasets with diverse attributes, such as interacting objects and dynamic backgrounds, and no per-frame annotation. As shown in Tab. 2, our model achieves competitive performance in terms of $\mathcal{M}$. Further, compared to the previous best-performing SSAV [25], it obtains improvements in other metrics, including $\mathcal{S}_\alpha$ (0.890 vs. SSAV=0.879) and $E_\xi^{max}$ (0.935 vs. SSAV=0.926), making it more consistent with the human visual system (HVS) as mentioned in [21, 23].

Evaluation on SegTrack-V2. This is the earliest VOS dataset from the traditional era. Thus, only a limited number of deep UVOS models have been tested on it. We only compare our FSNet against the top-3 models: AAAI’20-PCSA [29] ($\mathcal{S}_\alpha$=0.866), ICCV’19-RCR [118] ($\mathcal{S}_\alpha$=0.842), and CVPR’19-SSAV [25] ($\mathcal{S}_\alpha$=0.850). Our method achieves the best performance ($\mathcal{S}_\alpha$=0.870).

Evaluation on DAVSOD. Most of the video sequences in DAVSOD are similar to those in the challenging DAVIS dataset. It also contains a large number of single (salient) objects. We find that FSNet outperforms all the reported algorithms. Compared with the current best solution (i.e., AAAI’20-PCSA), our model achieves a large improvement of 3.2% in terms of $\mathcal{S}_\alpha$.

Qualitative Results. Some qualitative results on the five datasets are shown in Fig. 6, validating that our method achieves high-quality UVOS and VSOD results. As can be seen in the 1st row, the red car in the bottom-right corner moves slowly, so it does not get noticed. However, as our full-duplex strategy model considers both appearance and motion bidirectionally, it can automatically predict the smaller car in the center of the video. Overall, for these challenging situations, e.g., dynamic background (1st & 5th rows), occlusion (2nd row), fast motion (3rd row), and deformation (4th row), our model is able to infer the real target object(s) with fine-grained details. From this point of view, we demonstrate that FSNet is a general framework for both UVOS and VSOD tasks.

4.2 Ablation Study

4.2.1 Stimulus Selection

We explore the influence of different stimuli (appearance only vs. motion only) in our framework. We use only video frames or motion maps (using [36]) to train the ResNet-50 [31] backbone together with the proposed decoder block (see §3.4). As shown in Tab. 3, the motion-only variant performs slightly better than the appearance-only variant in terms of $\mathcal{S}_\alpha$ on DAVIS (0.858 vs. 0.834), which suggests that the “optical flow” setting can learn more visual cues than “video frames”. Nevertheless, the appearance-only variant outperforms the motion-only variant in the $\mathcal{M}$ metric on MCL (0.038 vs. 0.053). This motivates us to explore how to effectively use appearance and motion cues simultaneously.

4.2.2 Effectiveness of RCAM

To validate the effectiveness of our RCAM (Rel.), we replace our fusion strategy with the vanilla fusion (Vanilla) using a concatenate operation followed by a convolutional layer to fuse two modalities. As expected (Tab. 3), the proposed Rel. performs consistently better than the vanilla fusion strategy on both DAVIS and MCL. We would like to point out that our RCAM has two important properties: (i) it enables mutual correction and attention, and (ii) it can alleviate error propagation within a network to an extent due to the mutual correction and bidirectional interaction.

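| Setting | Appearance | Motion | RCAM | BPM | DAVIS $\mathcal{S}_\alpha$ | DAVIS $\mathcal{M}$ | MCL $\mathcal{S}_\alpha$ | MCL $\mathcal{M}$ |
|---|---|---|---|---|---|---|---|---|
| Appearance only | ✓ | | | | 0.834 | 0.047 | 0.754 | 0.038 |
| Motion only | | ✓ | | | 0.858 | 0.039 | 0.763 | 0.053 |
| Vanilla | ✓ | ✓ | | | 0.871 | 0.035 | 0.776 | 0.046 |
| Rel. | ✓ | ✓ | ✓ | | 0.900 | 0.025 | 0.833 | 0.031 |
| Bi-Purf. | ✓ | ✓ | | ✓ | 0.904 | 0.026 | 0.855 | 0.023 |
| FSNet | ✓ | ✓ | ✓ | ✓ | 0.920 | 0.020 | 0.864 | 0.023 |

Table 3: Ablation studies (§4.2.1, §4.2.2, & §4.2.3) for our components on DAVIS and MCL. We set $N$=4 for BPM.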
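| $N$ | Param. (M) | FLOPs (G) | Runtime (s/frame) | DAVIS $\mathcal{S}_\alpha$ | DAVIS $\mathcal{M}$ | MCL $\mathcal{S}_\alpha$ | MCL $\mathcal{M}$ |
|---|---|---|---|---|---|---|---|
| 0 | 0.000 | 0.000 | 0.03 | 0.900 | 0.025 | 0.833 | 0.031 |
| 2 | 0.507 | 1.582 | 0.05 | 0.911 | 0.026 | 0.843 | 0.028 |
| 4 | 1.015 | 3.163 | 0.08 | 0.920 | 0.020 | 0.864 | 0.023 |
| 6 | 1.522 | 4.745 | 0.10 | 0.918 | 0.023 | 0.863 | 0.023 |
| 8 | 2.030 | 6.327 | 0.13 | 0.920 | 0.023 | 0.864 | 0.023 |

Table 4: Ablation study for the number ($N$) of BPMs on DAVIS [74] and MCL [42], with the focus on parameters and FLOPs of BPMs, and runtime of FSNet.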

4.2.3 Effectiveness of BPM

To illustrate the effectiveness of the BPM (with $N$=4), we derive two different models: Rel. and FSNet, referring to the framework without and with BPM, respectively. We observe that the model with BPM gains 2.0% to 3.0% over the one without BPM, according to the statistics in Tab. 3. We attribute this improvement to BPM’s introduction of an interlaced decremental connection, which enables it to effectively fuse the different signals. Similarly, we remove the RCAM and derive another pair of settings (Vanilla & Bi-Purf.) to test the robustness of our BPM. The results show that even the bidirectional vanilla fusion strategy (Bi-Purf.) can still enhance the stability and generalization of the model. This benefits from the purification forward process and re-calibration backward process in the whole network.

4.2.4 Number of Cascaded BPMs

Intuitively, more cascaded BPMs should lead to better performance. We investigate this, and the evaluation results are shown in Tab. 4, where $N \in \{0, 2, 4, 6, 8\}$. Note that $N=0$ means that no BPM is used. As can be seen from Fig. 2 and Tab. 4, more BPMs lead to better results, but the performance saturates after $N=4$. Further, too many BPMs (i.e., $N>4$) cause high model complexity and may increase the risk of over-fitting. As a trade-off, we use $N=4$ throughout our experiments.

4.2.5 Effectiveness of Full-Duplex Strategy

To investigate the effectiveness of the RCAM and BPM modules with the full-duplex strategy, we study two unidirectional (simplex, see Fig. 4 & Fig. 5) variants of our model. In Tab. 5, the direction settings indicate the feature transmission directions in the designed RCAM or BPM. Specifically, in the RCAM, the motion-to-appearance direction indicates that the attention vector in the optical flow branch weights the features in the appearance branch, and vice versa. In the BPM, the motion-to-fused direction indicates that motion cues are used to guide the fused features extracted from both appearance and motion. The comparison results show that our elaborately designed modules (RCAM and BPM) jointly cooperate in a full-duplex fashion and outperform all simplex (unidirectional) settings.

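| Strategy | Direction Setting | DAVIS $\mathcal{S}_\alpha$ | DAVIS $\mathcal{M}$ | MCL $\mathcal{S}_\alpha$ | MCL $\mathcal{M}$ |
|---|---|---|---|---|---|
| simplex | | 0.896 | 0.026 | 0.816 | 0.038 |
| simplex | | 0.902 | 0.025 | 0.832 | 0.031 |
| simplex | | 0.891 | 0.029 | 0.806 | 0.039 |
| simplex | | 0.897 | 0.028 | 0.840 | 0.028 |
| full-dup. | | 0.920 | 0.020 | 0.864 | 0.023 |

Table 5: Ablation study for the simplex and full-duplex strategies on DAVIS [74] and MCL [42]. We set $N$=4 for BPM.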

5 Conclusion

We explore a simple yet efficient full-duplex strategy network (FSNet) that fully leverages the complementarity of appearance and motion cues to address the video object segmentation problem. This architecture consists of a relational cross-attention module (RCAM) and an efficient bidirectional purification module (BPM). The former is used to abstract features from the two modalities, while the latter is utilized to re-calibrate inaccurate features step-by-step. In the BPM, the interlaced decremental connection is critical for broadcasting high-level coarse features to low-level fine-grained features. We thoroughly validate each module of our FSNet, providing several interesting findings. Finally, FSNet acts as a unified solution that significantly advances the SOTA of both VOS and VSOD. Learning short- and long-term dependencies with an efficient Transformer-like [103, 132] scheme in complicated or camouflaged [22] scenarios is an interesting direction for future work.

6 Acknowledgement

This work was supported by the NSFC (No. 61703077), the SCU-Luzhou Municipal People’s Government Strategic Cooperation Project (No. 2020CDLZ-10), and the China Postdoctoral Science Foundation Funded Project (No. 2020M682829).

References


  • [1] A. Abramov, K. Pauwels, J. Papon, F. Wörgötter, and B. Dellen (2012) Depth-supported real-time video segmentation with the kinect. In IEEE WACV, pp. 457–464. Cited by: §1.
  • [2] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk (2009) Frequency-tuned salient region detection. In IEEE CVPR, pp. 1597–1604. Cited by: §4.1.
  • [3] N. Ballas, L. Yao, C. Pal, and A. Courville (2016) Delving deeper into convolutional networks for learning video representations. In ICLR, Cited by: §2.1.
  • [4] D. Bharadia, E. McMilin, and S. Katti (2013) Full duplex radios. In ACM SIGCOMM, pp. 375–386. Cited by: footnote 2.
  • [5] G. Bhat, F. J. Lawin, M. Danelljan, A. Robinson, M. Felsberg, L. Van Gool, and R. Timofte (2020) Learning what to learn for video object segmentation. In ECCV, Cited by: §1.
  • [6] T. Brox and J. Malik (2010) Object segmentation by long term analysis of point trajectories. In ECCV, pp. 282–295. Cited by: §2.1.
  • [7] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool (2017) One-shot video object segmentation. In IEEE CVPR, pp. 221–230. Cited by: §2.1, Table 1.
  • [8] S. Caelles, J. Pont-Tuset, F. Perazzi, A. Montes, K. Maninis, and L. Van Gool (2019) The 2019 davis challenge on vos: unsupervised multi-object segmentation. arXiv preprint arXiv:1905.00737. Cited by: §1.
  • [9] B. Chen, H. Ling, X. Zeng, G. Jun, Z. Xu, and S. Fidler (2020) ScribbleBox: interactive annotation framework for video object segmentation. In ECCV, Cited by: §1.
  • [10] C. Chen, G. Wang, C. Peng, X. Zhang, and H. Qin (2019) Improved robust video saliency detection based on long-term spatial-temporal information. IEEE TIP 29, pp. 1090–1100. Cited by: §2.2, Table 2.
  • [11] M. Chen and D. Fan (2021) Structure-measure: a new way to evaluate foreground maps. IJCV. Cited by: §4.1.
  • [12] X. Chen, Z. Li, Y. Yuan, G. Yu, J. Shen, and D. Qi (2020) State-aware tracker for real-time video object segmentation. In IEEE CVPR, pp. 9384–9393. Cited by: §1.
  • [13] Y. Chen, W. Zou, Y. Tang, X. Li, C. Xu, and N. Komodakis (2018) SCOM: spatiotemporal constrained optimization for salient object detection. IEEE TIP 27 (7), pp. 3345–3357. Cited by: Table 2.
  • [14] Z. Chen, C. Guo, J. Lai, and X. Xie (2019) Motion-appearance interactive encoding for object segmentation in unconstrained videos. IEEE TCSVT. Cited by: §1.
  • [15] J. Cheng, Y. Tsai, W. Hung, S. Wang, and M. Yang (2018) Fast and accurate online video object segmentation via tracking parts. In IEEE CVPR, pp. 7415–7424. Cited by: §2.1, Table 1.
  • [16] J. Cheng, Y. Tsai, S. Wang, and M. Yang (2017) Segflow: joint learning for video object segmentation and optical flow. In IEEE ICCV, pp. 686–695. Cited by: §1, §2.1, Table 1.
  • [17] R. Cong, J. Lei, H. Fu, F. Porikli, Q. Huang, and C. Hou (2019) Video saliency detection via sparsity-based reconstruction and propagation. IEEE TIP 28 (10), pp. 4819–4831. Cited by: Table 2.
  • [18] M. Ding, Z. Wang, B. Zhou, J. Shi, Z. Lu, and P. Luo (2020) Every Frame Counts: Joint Learning of Video Segmentation and Optical Flow. In AAAI, pp. 10713–10720. Cited by: §1.
  • [19] M. Faisal, I. Akhter, M. Ali, and R. Hartley (2020) Exploiting geometric constraints on dense trajectories for motion saliency. In IEEE WACV, Cited by: Table 1.
  • [20] A. Faktor and M. Irani (2014) Video segmentation by non-local consensus voting. In BMVC, Vol. 2, pp. 8. Cited by: §2.1.
  • [21] D. Fan, M. Cheng, Y. Liu, T. Li, and A. Borji (2017) Structure-measure: a new way to evaluate foreground maps. In IEEE ICCV, pp. 4548–4557. Cited by: §4.1, §4.1.
  • [22] D. Fan, G. Ji, M. Cheng, and L. Shao (2021) Concealed object detection. IEEE TPAMI. Cited by: §5.
  • [23] D. Fan, G. Ji, X. Qin, and M. Cheng (2021) Cognitive vision inspired object segmentation metric and loss function. SCIENTIA SINICA Informationis. Cited by: §4.1, §4.1.
  • [24] D. Fan, G. Ji, X. Qin, and M. Cheng (2021) Cognitive vision inspired object segmentation metric and loss function. SCIENTIA SINICA Informationis 6. Cited by: §4.1.
  • [25] D. Fan, W. Wang, M. Cheng, and J. Shen (2019) Shifting more attention to video salient object detection. In IEEE CVPR, pp. 8554–8564. Cited by: §2.2, Table 2, Figure 6, §4.1, §4.1, §4.1, §4.1, §4.1, §4.1, §4.1.
  • [26] K. Fragkiadaki, G. Zhang, and J. Shi (2012) Video segmentation by tracing discontinuities in a trajectory embedding. In IEEE CVPR, pp. 1846–1853. Cited by: §2.1.
  • [27] F. Galasso, R. Cipolla, and B. Schiele (2012) Video segmentation with superpixels. In ACCV, pp. 760–774. Cited by: §2.1.
  • [28] M. Grundmann, V. Kwatra, M. Han, and I. Essa (2010) Efficient hierarchical graph-based video segmentation. In IEEE CVPR, pp. 2141–2148. Cited by: §2.1.
  • [29] Y. Gu, L. Wang, Z. Wang, Y. Liu, M. Cheng, and S. Lu (2020) Pyramid constrained self-attention network for fast video salient object detection. In AAAI, Cited by: Table 2, §4.1, §4.1.
  • [30] K. He, X. Zhang, S. Ren, and J. Sun (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE TPAMI 37 (9), pp. 1904–1916. Cited by: §3.6.
  • [31] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE CVPR, pp. 770–778. Cited by: §3.1, §3.2, §4.2.1.
  • [32] Y. Heo, Y. J. Koh, and C. Kim (2020) Interactive video object segmentation using global and local transfer modules. In ECCV, Cited by: §1.
  • [33] P. Hu, G. Wang, X. Kong, J. Kuen, and Y. Tan (2020) Motion-guided cascaded refinement network for video object segmentation. IEEE TPAMI, pp. 1400–1409. Cited by: §1, §1.
  • [34] Y. Hu, J. Huang, and A. G. Schwing (2018) Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation. In ECCV, pp. 786–802. Cited by: §2.2.
  • [35] X. Huang, J. Xu, Y. Tai, and C. Tang (2020) Fast video object segmentation with temporal aggregation network and dynamic template matching. In IEEE CVPR, pp. 8879–8889. Cited by: §1.
  • [36] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In IEEE CVPR, pp. 2462–2470. Cited by: §1, §3.1, §4.2.1.
  • [37] S. D. Jain and K. Grauman (2016) Click carving: segmenting objects in video with point clicks. In IJCV, Cited by: §1.
  • [38] S. D. Jain, B. Xiong, and K. Grauman (2017) Fusionseg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In IEEE CVPR, pp. 2117–2126. Cited by: §1.
  • [39] G. Ji, Y. Chou, D. Fan, G. Chen, D. Jha, H. Fu, and L. Shao (2021) Progressively normalized self-attention network for video polyp segmentation. In MICCAI, Cited by: §1.
  • [40] J. Johnander, M. Danelljan, E. Brissman, F. S. Khan, and M. Felsberg (2019) A generative appearance model for end-to-end video object segmentation. In IEEE CVPR, pp. 8953–8962. Cited by: §2.1, Table 1, §4.1.
  • [41] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele (2017) Lucid data dreaming for object tracking. In IEEE CVPRW, Cited by: §1.
  • [42] H. Kim, Y. Kim, J. Sim, and C. Kim (2015) Spatiotemporal saliency detection for video sequences based on random walk with restart. IEEE TIP 24 (8), pp. 2552–2564. Cited by: Table 2, Figure 6, §4.1, Table 4, Table 5.
  • [43] C. Koch and S. Ullman (1987) Shifts in selective visual attention: towards the underlying neural circuitry. In Matters of intelligence, pp. 115–141. Cited by: §1.
  • [44] Y. J. Koh and C. Kim (2017) Primary object segmentation in videos based on region augmentation and reduction. In IEEE CVPR, pp. 7417–7425. Cited by: Table 1.
  • [45] P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, pp. 109–117. Cited by: Figure 2, §3.6, Table 1, Table 2.
  • [46] M. Lan, Y. Zhang, Q. Xu, and L. Zhang (2020) E3SN: Efficient End-to-End Siamese Network for Video Object Segmentation. In IJCAI, pp. 701–707. Cited by: §1.
  • [47] D. Lao and G. Sundaramoorthi (2018) Extending layered models to 3d motion. In ECCV, pp. 435–451. Cited by: Table 1.
  • [48] T. Le and A. Sugimoto (2017) Deeply supervised 3d recurrent fcn for salient object detection in videos. In BMVC, Vol. 1, pp. 3. Cited by: §2.2.
  • [49] T. Le and A. Sugimoto (2018) Video salient object detection using spatiotemporal deep features. IEEE TIP 27 (10), pp. 5002–5015. Cited by: §2.2.
  • [50] Y. J. Lee, J. Kim, and K. Grauman (2011) Key-segments for video object segmentation. In IEEE ICCV, pp. 1995–2002. Cited by: §2.1.
  • [51] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg (2013) Video segmentation by tracking many figure-ground segments. In IEEE ICCV, pp. 2192–2199. Cited by: §2.1, Figure 6, §4.1.
  • [52] G. Li, Y. Xie, T. Wei, K. Wang, and L. Lin (2018) Flow guided recurrent neural encoder for video salient object detection. In IEEE CVPR, pp. 3243–3252. Cited by: §2.2, Table 2.
  • [53] H. Li, G. Chen, G. Li, and Y. Yu (2019) Motion guided attention for video salient object detection. In IEEE ICCV, pp. 7274–7283. Cited by: §1, §2.2, §4.1.
  • [54] S. Li, B. Seybold, A. Vorobyov, X. Lei, and C. Jay Kuo (2018) Unsupervised video object segmentation with motion-based bilateral networks. In ECCV, pp. 207–223. Cited by: §2.1, Table 2.
  • [55] Y. Li, S. Li, C. Chen, A. Hao, and H. Qin (2020) Accurate and robust video saliency detection via self-paced diffusion. IEEE TMM 22 (5), pp. 1153–1167. Cited by: §2.2, Table 2.
  • [56] Y. Li, Z. Shen, and Y. Shan (2020) Fast video object segmentation using the global context module. In ECCV, Cited by: §1.
  • [57] F. Lin, Y. Chou, and T. Martinez (2020) Flow adaptive video object segmentation. Image and Vision Computing 94, pp. 103864. Cited by: §1.
  • [58] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In IEEE CVPR, pp. 2117–2125. Cited by: §3.3.
  • [59] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli (2019) See more, know more: unsupervised video object segmentation with co-attention siamese networks. In IEEE CVPR, pp. 3623–3632. Cited by: §3.6, Table 1.
  • [60] T. Ma and L. J. Latecki (2012) Maximum weight cliques with mutex constraints for video object segmentation. In IEEE CVPR, pp. 670–677. Cited by: §2.1.
  • [61] W. Maddern, G. Pascoe, C. Linegar, and P. Newman (2017) 1 year, 1000 km: the oxford robotcar dataset. IJRR 36 (1), pp. 3–15. Cited by: §1.
  • [62] S. Mahadevan, A. Athar, A. Osep, S. Hennen, L. Leal-Taixé, and B. Leibe (2020) Making a case for 3d convolutions for object segmentation in videos. In BMVC, Cited by: §1, §1.
  • [63] J. Miao, Y. Wei, and Y. Yang (2020) Memory aggregation networks for efficient interactive video object segmentation. In IEEE CVPR, pp. 10366–10375. Cited by: §1.
  • [64] K. Min and J. J. Corso (2019) TASED-net: temporally-aggregating spatial encoder-decoder network for video saliency detection. In IEEE ICCV, pp. 2394–2403. Cited by: §2.2.
  • [65] D. Nilsson and C. Sminchisescu (2018) Semantic video segmentation by gated recurrent flow propagation. In IEEE CVPR, pp. 6819–6828. Cited by: §1.
  • [66] P. Ochs and T. Brox (2012) Higher order motion models and spectral clustering. In IEEE CVPR, pp. 614–621. Cited by: §2.1.
  • [67] P. Ochs, J. Malik, and T. Brox (2013) Segmentation of moving objects by long term video analysis. IEEE TPAMI 36 (6), pp. 1187–1200. Cited by: Table 2, Figure 6, §4.1.
  • [68] Y. Pan, T. Yao, H. Li, and T. Mei (2017) Video captioning with transferred semantic attributes. In IEEE CVPR, pp. 6504–6512. Cited by: §1.
  • [69] A. Papazoglou and V. Ferrari (2013) Fast object segmentation in unconstrained video. In IEEE ICCV, pp. 1777–1784. Cited by: Table 1.
  • [70] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In NIPS, pp. 8024–8035. Cited by: §3.6.
  • [71] Q. Peng and Y. Cheung (2019) Automatic video object segmentation based on visual and motion saliency. IEEE TMM 21 (12), pp. 3083–3094. Cited by: §1.
  • [72] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung (2017) Learning video object segmentation from static images. In IEEE CVPR, pp. 2663–2672. Cited by: §2.1, Table 1.
  • [73] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung (2012) Saliency filters: contrast based filtering for salient region detection. In IEEE CVPR, pp. 733–740. Cited by: §4.1.
  • [74] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In IEEE CVPR, pp. 724–732. Cited by: Figure 2, 3rd item, Table 1, Table 2, Figure 6, §4.1, §4.1, §4.1, Table 4, Table 5.
  • [75] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung (2015) Fully connected object proposals for video segmentation. In IEEE ICCV, pp. 3227–3234. Cited by: §2.1.
  • [76] A. Robinson, F. J. Lawin, M. Danelljan, F. S. Khan, and M. Felsberg (2020) Learning fast and robust target models for video object segmentation. In IEEE CVPR, Cited by: §1.
  • [77] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §3.4.
  • [78] S. Seo, J. Lee, and B. Han (2020) URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark. In ECCV, Cited by: §1.
  • [79] H. Seong, J. Hyun, and E. Kim (2020) Kernelized memory network for video object segmentation. In ECCV, Cited by: §1.
  • [80] L. Sevilla-Lara, Y. Liao, F. Güney, V. Jampani, A. Geiger, and M. J. Black (2018) On the integration of optical flow and action recognition. In GCPR, pp. 281–297. Cited by: §3.3.
  • [81] J. Shi and J. Malik (1998) Motion segmentation and tracking using normalized cuts. In IEEE ICCV, pp. 1154–1160. Cited by: §2.1.
  • [82] J. Shi et al. (1994) Good features to track. In IEEE CVPR, pp. 593–600. Cited by: §1.
  • [83] M. Siam, C. Jiang, S. Lu, L. Petrich, M. Gamal, M. Elhoseiny, and M. Jagersand (2019) Video object segmentation using teacher-student adaptation in a human robot interaction (hri) setting. In IEEE ICRA, pp. 50–56. Cited by: §2.1, Table 1.
  • [84] H. Song, W. Wang, S. Zhao, J. Shen, and K. Lam (2018) Pyramid dilated deeper convlstm for video salient object detection. In ECCV, pp. 715–731. Cited by: §2.1, §2.2, §4.1.
  • [85] Y. Tang, W. Zou, Z. Jin, Y. Chen, Y. Hua, and X. Li (2018) Weakly supervised salient object detection with spatiotemporal cascade neural networks. IEEE TCSVT 29 (7), pp. 1973–1984. Cited by: §2.2, Table 2.
  • [86] Z. Teed and J. Deng (2020) RAFT: recurrent all-pairs field transforms for optical flow. In ECCV, Cited by: §1.
  • [87] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning motion patterns in videos. In IEEE CVPR, pp. 3386–3394. Cited by: §2.1, Table 1.
  • [88] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning video object segmentation with visual memory. In IEEE ICCV, pp. 4481–4490. Cited by: §1, §1, §2.1, §2.1, Table 1.
  • [89] P. Tokmakov, C. Schmid, and K. Alahari (2019) Learning to segment moving objects. IJCV 127 (3), pp. 282–301. Cited by: Table 1.
  • [90] A. M. Treisman and G. Gelade (1980) A feature-integration theory of attention. Cognitive psychology 12 (1), pp. 97–136. Cited by: §1.
  • [91] Y. Tsai, M. Yang, and M. J. Black (2016) Video segmentation via object flow. In IEEE CVPR, pp. 3899–3908. Cited by: §1, §2.1.
  • [92] P. Voigtlaender, Y. Chai, F. Schroff, H. Adam, B. Leibe, and L. Chen (2019) Feelvos: fast end-to-end embedding learning for video object segmentation. In IEEE CVPR, pp. 9481–9490. Cited by: §2.1, Table 1.
  • [93] L. Wang, H. Lu, Y. Wang, M. Feng, D. Wang, B. Yin, and X. Ruan (2017) Learning to detect salient objects with image-level supervision. In IEEE CVPR, pp. 136–145. Cited by: §4.1.
  • [94] W. Wang, X. Lu, J. Shen, D. J. Crandall, and L. Shao (2019) Zero-shot video object segmentation via attentive graph neural networks. In IEEE ICCV, Cited by: Table 1.
  • [95] W. Wang, J. Shen, X. Li, and F. Porikli (2015) Robust video object cosegmentation. IEEE TIP 24 (10), pp. 3137–3148. Cited by: §2.1.
  • [96] W. Wang, J. Shen, X. Lu, S. C. Hoi, and H. Ling (2020) Paying attention to video object pattern understanding. IEEE TPAMI. Cited by: §1.
  • [97] W. Wang, J. Shen, F. Porikli, and R. Yang (2018) Semi-supervised video object segmentation with super-trajectories. IEEE TPAMI 41 (4), pp. 985–998. Cited by: §1.
  • [98] W. Wang, J. Shen, and F. Porikli (2015) Saliency-aware geodesic video object segmentation. In IEEE CVPR, pp. 3395–3402. Cited by: §2.1.
  • [99] W. Wang, J. Shen, and L. Shao (2017) Video salient object detection via fully convolutional networks. IEEE TIP 27 (1), pp. 38–49. Cited by: §2.2, Table 2, §4.1.
  • [100] W. Wang, J. Shen, J. Xie, and F. Porikli (2017) Super-trajectory for video segmentation. In IEEE ICCV, pp. 1671–1679. Cited by: §2.1.
  • [101] W. Wang, J. Shen, R. Yang, and F. Porikli (2017) Saliency-aware video object segmentation. IEEE TPAMI 40 (1), pp. 20–33. Cited by: §2.2.
  • [102] W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S. C. Hoi, and H. Ling (2019) Learning unsupervised video object segmentation through visual attention. In IEEE CVPR, pp. 3064–3074. Cited by: §2.1, §3.6, Table 1.
  • [103] W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In IEEE ICCV, Cited by: §5.
  • [104] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia (2021) End-to-end video instance segmentation with transformers. In IEEE CVPR, Cited by: §1.
  • [105] Z. Wang, X. Yan, Y. Han, and M. Sun (2019) Ranking video salient object detection. In ACM MM, pp. 873–881. Cited by: §2.2.
  • [106] J. Wei, S. Wang, and Q. Huang (2020) F3Net: fusion, feedback and focus for salient object detection. In AAAI, pp. 12321–12328. Cited by: §3.2.
  • [107] P. Wen, R. Yang, Q. Xu, C. Qian, Q. Huang, R. Cong, and J. Si (2020) DMVOS: Discriminative matching for real-time video object segmentation. In ACM MM, Cited by: §1.
  • [108] J. M. Wolfe, K. R. Cave, and S. L. Franzel (1989) Guided search: an alternative to the feature integration model for visual search. J EXP PSYCHOL HUMAN 15 (3), pp. 419. Cited by: §1.
  • [109] Z. Wu, L. Su, and Q. Huang (2019) Stacked cross refinement network for edge-aware salient object detection. In IEEE ICCV, pp. 7264–7273. Cited by: §3.3.
  • [110] S. Wug Oh, J. Lee, K. Sunkavalli, and S. Joo Kim (2018) Fast video object segmentation by reference-guided mask propagation. In IEEE CVPR, pp. 7376–7385. Cited by: §2.1, Table 1.
  • [111] H. Xiao, B. Kang, Y. Liu, M. Zhang, and J. Feng (2019) Online meta adaptation for fast video object segmentation. IEEE TPAMI 42 (5), pp. 1205–1217. Cited by: §1.
  • [112] C. Xu, C. Xiong, and J. J. Corso (2012) Streaming hierarchical video segmentation. In ECCV, pp. 626–639. Cited by: §2.1.
  • [113] M. Xu, P. Fu, B. Liu, and J. Li (2021) Multi-stream attention-aware graph convolution network for video salient object detection. IEEE TIP 30, pp. 4183–4197. Cited by: §2.2.
  • [114] M. Xu, P. Fu, B. Liu, H. Yin, and J. Li (2021) A novel dynamic graph evolution network for salient object detection. Applied Intelligence. Cited by: §2.2.
  • [115] M. Xu, B. Liu, P. Fu, J. Li, Y. H. Hu, and S. Feng (2019) Video salient object detection via robust seeds extraction and multi-graphs manifold propagation. IEEE TCSVT. Cited by: §2.2, Table 2.
  • [116] M. Xu, B. Liu, P. Fu, J. Li, and Y. H. Hu (2019) Video saliency detection via graph clustering with motion energy and spatiotemporal objectness. IEEE TMM 21 (11), pp. 2790–2805. Cited by: Table 2.
  • [117] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price, S. Cohen, and T. Huang (2018) Youtube-vos: sequence-to-sequence video object segmentation. In ECCV, pp. 585–601. Cited by: §1, §2.1.
  • [118] P. Yan, G. Li, Y. Xie, Z. Li, C. Wang, T. Chen, and L. Lin (2019) Semi-supervised video salient object detection using pseudo-labels. In IEEE ICCV, pp. 7284–7293. Cited by: §2.2, Table 2, §4.1, §4.1.
  • [119] L. Yang, Y. Fan, and N. Xu (2019) Video instance segmentation. In IEEE ICCV, pp. 5188–5197. Cited by: §1.
  • [120] Z. Yang, Q. Wang, L. Bertinetto, W. Hu, S. Bai, and P. H. Torr (2019) Anchor diffusion for unsupervised video object segmentation. In IEEE ICCV, pp. 931–940. Cited by: Figure 1, §1, Table 1, §4.1.
  • [121] Z. Yang, Y. Wei, and Y. Yang (2020) Collaborative video object segmentation by foreground-background integration. In ECCV, Cited by: Table 1.
  • [122] L. Zelnik-Manor and M. Irani (2001) Event-based analysis of video. In IEEE CVPR, Vol. 2, pp. II–II. Cited by: §1.
  • [123] K. Zhang, L. Wang, D. Liu, B. Liu, Q. Liu, and Z. Li (2020) Dual temporal memory network for efficient video object segmentation. In ACM MM, Cited by: §1.
  • [124] M. Zhang, J. Liu, Y. Wang, Y. Piao, S. Yao, W. Ji, J. Li, H. Lu, and Z. Luo (2021) Dynamic context-sensitive filtering network for video salient object detection. In IEEE ICCV, Cited by: §2.2.
  • [125] Y. Zhang, Z. Wu, H. Peng, and S. Lin (2020) A transductive approach for video object segmentation. In IEEE CVPR, pp. 6949–6958. Cited by: §1.
  • [126] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun (2018) Exfuse: enhancing feature fusion for semantic segmentation. In ECCV, pp. 269–284. Cited by: §3.2.
  • [127] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In IEEE CVPR, pp. 2881–2890. Cited by: §3.4.
  • [128] J. Zheng, W. Luo, and Z. Piao (2019) Cascaded convlstms using semantically-coherent data synthesis for video object segmentation. IEEE Access 7, pp. 132120–132129. Cited by: §2.1.
  • [129] T. Zhou, J. Li, S. Wang, R. Tao, and J. Shen (2020) MATNet: motion-attentive transition network for zero-shot video object segmentation. IEEE TIP 29, pp. 8326–8338. Cited by: §1.
  • [130] T. Zhou, S. Wang, Y. Zhou, Y. Yao, J. Li, and L. Shao (2020) Motion-attentive transition for zero-shot video object segmentation. In AAAI, Cited by: Figure 2, 3rd item, §1, §1, §3.6, Table 1, §4.1.
  • [131] X. Zhou, Z. Liu, C. Gong, and W. Liu (2018) Improving video saliency detection via localized estimation and spatiotemporal refinement. IEEE TMM 20 (11), pp. 2993–3007. Cited by: §2.2.
  • [132] M. Zhuge, D. Gao, D. Fan, L. Jin, B. Chen, H. Zhou, M. Qiu, and L. Shao (2021) Kaleido-bert: vision-language pre-training on fashion domain. In IEEE CVPR, pp. 12647–12657. Cited by: §5.