See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks

01/19/2020 ∙ by Xiankai Lu, et al. ∙ Shanghai Jiao Tong University ∙ IEEE ∙ Australian National University

We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task from a holistic view. We emphasize the importance of the inherent correlation among video frames and incorporate a global co-attention mechanism to further improve over state-of-the-art deep learning based solutions, which primarily focus on learning discriminative foreground representations over appearance and motion in short-term temporal segments. The co-attention layers in our network provide efficient and competent stages for capturing global correlations and scene context, by jointly computing and appending co-attention responses into a joint feature space. We train COSNet with pairs of video frames, which naturally augments the training data and allows increased learning capacity. During the segmentation stage, the co-attention model encodes useful information by processing multiple reference frames together, which is leveraged to better infer the frequently reappearing and salient foreground objects. We propose a unified and end-to-end trainable framework from which different co-attention variants can be derived for mining the rich context within videos. Our extensive experiments over three large benchmarks show that COSNet outperforms the current alternatives by a large margin.







1 Introduction

Unsupervised video object segmentation (UVOS) aims to automatically separate primary foreground object(s) from their background in a video. Since UVOS does not require manual interaction, it has significant value in both academic and applied fields, especially in this era of information explosion. However, due to the lack of prior knowledge about the primary object(s), in addition to the typical challenges of semi-supervised video object segmentation (e.g., object deformation, occlusion, and background clutter), UVOS suffers from a further difficulty: how to correctly distinguish the primary objects from a complex and diverse background.

Figure 1: Illustration of our intuition. Given an input frame (b), our method leverages information from multiple reference frames (d) to better determine the foreground object (a), through a co-attention mechanism. (c) An inferior result without co-attention.

We argue that the primary objects in UVOS settings should be the most (i) distinguishable in an individual frame (locally salient), and (ii) frequently appearing throughout the video sequence (globally consistent). These two properties are essential for determining the primary objects. For instance, by only glimpsing a short video clip as illustrated in Fig. 1(b), it is hard to determine the primary objects. Instead, if we view the entire video (or a sufficiently long sequence) as in Fig. 1(d), the foreground can be easily discovered. Although primary objects tend to be highly correlated at a macro level (entire video), they often exhibit different appearances at a micro level (shorter video clips) due to articulated body motions, occlusions, out-of-view movements, camera movements, and environment variations. Clearly, micro level variations are the major sources of challenges in video segmentation. Thus, it is desirable to take advantage of the global consistency property and leverage the information from other frames.

Considering UVOS from a global perspective helps to locate the primary objects and alleviate local ambiguities. This notion also motivated earlier heuristic models for UVOS [14], yet it is largely ignored by current deep learning based models.

Current deep learning based UVOS models typically focus on the intra-frame discrimination of primary objects in appearance or motion, while ignoring the valuable global-occurrence consistency across multiple frames. These methods compute optical flow across a few consecutive frames [53, 24, 9, 32, 33], which is limited to a local receptive window in the temporal domain. Although recurrent neural networks (RNNs) [49] have been introduced to memorize previous frames, this sequential processing strategy may fail to explicitly explore the rich relations between different frames, and hence does not attain a global perspective.

With these insights, we reformulate the UVOS task as a co-attention procedure and propose a novel CO-attention Siamese Network (COSNet) to model UVOS from a global perspective. Specifically, during the training phase, COSNet takes a pair of frames from the same video as input and learns to capture their rich correlations. This is achieved by a differentiable, gated co-attention mechanism, which enables the network to attend more to the correlated, informative regions, and to produce more discriminative foreground features. For a testing frame (Fig. 1(b)), COSNet is able to produce more accurate results (Fig. 1(a)) from a global view, i.e., by utilizing the correlations between the testing frame and multiple reference frames. Fig. 1(c) shows the inferior result obtained when considering only the information from the testing frame (Fig. 1(b)).

Another advantage of our COSNet is that it is remarkably efficient at augmenting training data, as it allows using a large number of arbitrary frame pairs within the same video. Additionally, as we explicitly model the relations between video frames, the proposed model does not need to compute optical flow, which is computationally expensive. Finally, COSNet offers a unified, end-to-end trainable framework that efficiently mines rich contextual information within video sequences. We implement different co-attention mechanisms, namely vanilla co-attention, symmetric co-attention, and channel-wise co-attention, which offers a more insightful view of the UVOS task. We quantitatively demonstrate that our co-attention mechanism brings a large improvement in performance, which confirms its effectiveness and the value of global information for UVOS. The proposed COSNet shows superior performance over the current state-of-the-art methods across three popular benchmarks: DAVIS16 [45], FBMS [41] and Youtube-Objects [47].

2 Related Work

We start by providing an overview of representative work on video object segmentation (§2.1), followed by a brief overview of differentiable neural attention (§2.2).

2.1 Video Object Segmentation

According to its supervision type, video object segmentation can be broadly categorized into unsupervised (UVOS) and semi-supervised video object segmentation. In this paper, we focus on the UVOS task, which extracts primary object(s) without manual annotation.

Early UVOS models typically analyzed long-term motion information (trajectories) [4, 40, 17, 42, 28, 41], leveraged object proposals [31, 37, 70, 30, 18, 27] or utilized saliency information [59, 14, 55, 21] to infer the target. Later, inspired by the success of deep learning, several methods [16, 54, 43] began to approach UVOS with deep learning features. These were typically limited by their lack of end-to-end learning ability [54] and their use of heavyweight fully-connected network architectures [16, 43]. Recently, more research efforts have focused on fully convolutional network based UVOS models. For example, Tokmakov et al. [52] proposed to separate independent object and camera motion using a learnable motion pattern network. Li et al. learned an instance embedding network [32] from static images to better locate the object(s), and later combined it with motion-based bilateral networks for identifying the background [33]. Two-stream fully convolutional networks are also a popular choice [9, 24, 53, 32] for fusing motion and appearance information for object inference. An alternative way to segment objects is through video salient object detection [49]: this method fine-tunes a pre-trained semantic segmentation network to extract spatial saliency features, then trains a ConvLSTM to capture temporal dynamics.

These deep UVOS models generally achieved promising results, which demonstrates well the advantages of applying neural networks to this task. However, they only consider the sequential nature of UVOS and short-term temporal information, lacking a global view and comprehensive use of the rich, inherent correlation information within videos.

Figure 2: Overview of COSNet in the training phase. A pair of frames {F_a, F_b} is fed into a feature embedding module to obtain the feature representations V_a, V_b. Then, the co-attention module computes the attention summaries Z_a, Z_b that encode the correlations between V_a and V_b. Finally, the summaries and the original features are concatenated and handed over to a segmentation module to produce the segmentation predictions.

For SVOS methods, the target object(s) is provided in the first frame and tracked automatically [59, 8, 5, 68, 2, 69, 64, 71] or interactively by users [1] in the subsequent frames. Numerous algorithms were proposed based on graphical models [54], object proposals [46], super-trajectories [60], etc. Recently, deep learning based methods achieved promising results. Some algorithms treated video object segmentation as a static segmentation task without using any temporal information [44], built a deep one-shot learning framework [5, 58], or used a mask-propagation network [25]. In addition, both object tracking [29, 8, 12, 36] and person re-identification [34, 66] have been fused into the SVOS task to handle deformation and occlusion issues. Hu et al. [22] proposed a Siamese network based SVOS model. Compared with our COSNet, the differences are distinct, beyond their dissimilar supervision settings. First, since [22] was built on an image matching strategy, they used a Siamese network to propagate the first-frame annotation to the subsequent frames. Our method substantially differs in that we learn the Siamese network to capture rich and global correspondences within videos, to further assist automatic primary object discovery and segmentation. Second, we provide the first approach that uses a co-attention scheme to facilitate correspondence learning for video object segmentation.

2.2 Attention Mechanisms in Neural Networks

Differentiable attention, inspired by human perception [13, 61], has been widely studied in deep neural networks [26, 56, 38, 23, 57, 62, 15]. With end-to-end training, neural attention allows networks to selectively pay attention to a subset of inputs. For example, Chu et al. [11] exploited multi-context attention for human pose estimation. In [7], spatial and channel-wise attention were proposed to dynamically select image parts for captioning.

More recently, co-attention mechanisms have been studied in vision-and-language tasks, such as visual question answering [35, 65, 63, 39] and visual dialogue [63]. In these works, co-attention mechanisms were used to mine the underlying correlations between different modalities. For example, Lu et al. [35] created a model that jointly performs question-guided visual attention and image-guided question attention. In this way, the learned model can selectively focus on image regions and segments of documents. Our co-attention model is inspired by these works, but it is used to capture the coherence across different frames, with a more elegant network architecture.

3 Proposed Algorithm

Our COSNet formulates UVOS as a co-attention procedure. A co-attention module learns to explicitly encode correlations between video frames. This enables COSNet to attend to the frequently coherent regions, thus further helping to discover the foreground object(s) and produce reasonable UVOS results. Specifically, during training, the co-attention procedure can be decomposed into correlation learning between any frame pair from the same video (see Fig. 2). During testing, COSNet infers the primary target with a global view, i.e., it takes advantage of the co-attention information between the testing frame and multiple reference frames. We elaborate on the co-attention mechanisms in COSNet in §3.1, detail the whole architecture of COSNet in §3.2, and provide implementation details in §3.3.

3.1 Co-attention Mechanisms in COSNet

Vanilla co-attention. As shown in Fig. 2, given two video frames F_a and F_b from the same video, V_a and V_b denote the corresponding feature representations from a feature embedding network. V_a and V_b are 3D tensors with width w, height h and C channels. We leverage the co-attention mechanism [65, 35] to mine the correlations between V_a and V_b in their feature embedding space. More specifically, we first compute the affinity matrix S between V_a and V_b:

S = V_b^T W V_a ∈ R^(wh×wh),    (1)

where W ∈ R^(C×C) is a weight matrix. Here V_a and V_b are flattened into matrix representations V_a, V_b ∈ R^(C×wh). Each column in V_a (or V_b) represents the feature vector at one spatial position, with C dimensions. As a result, each entry of S reflects the similarity between a column of V_b and a column of V_a. Since the weight matrix W is a square matrix, the diagonalization of W can be represented as follows:

W = P^(-1) D P,    (2)

where P is an invertible matrix and D is a diagonal matrix. Then, as shown in the gray area in Fig. 3, Eq. 1 can be re-written as:

S = V_b^T P^(-1) D P V_a.    (3)

Through the vanilla co-attention in Eq. 3, the feature representation of each frame first undergoes a linear transformation, after which similarities are computed between all pairs of locations across the two frames.
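The affinity computation above can be sketched in a few lines of NumPy. This is a minimal toy illustration of Eq. 1 with random features; the shapes and names (`V_a`, `V_b`, `W_mat`) are illustrative, not from the released code.

```python
# Sketch of the vanilla co-attention affinity (Eq. 1), assuming flattened
# embeddings V_a, V_b of shape (C, w*h) and a learnable square weight
# matrix W_mat of shape (C, C).
import numpy as np

def vanilla_affinity(V_a, V_b, W_mat):
    """Return the (w*h) x (w*h) affinity matrix S = V_b^T W V_a."""
    return V_b.T @ W_mat @ V_a

rng = np.random.default_rng(0)
C, HW = 8, 16                       # toy channel and spatial sizes
V_a = rng.standard_normal((C, HW))
V_b = rng.standard_normal((C, HW))
W_mat = rng.standard_normal((C, C))

S = vanilla_affinity(V_a, V_b, W_mat)
print(S.shape)                      # one similarity score per location pair
```

Each entry S[j, i] is the (weighted) similarity between position j of frame b and position i of frame a.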

Symmetric co-attention. If we further constrain the weight matrix W to be a symmetric matrix, the projection matrix P becomes an orthogonal matrix: P^T P = I, where I is the C×C identity matrix. A symmetric co-attention can then be derived from Eq. 3:

S = V_b^T P^T D P V_a = (P V_b)^T D (P V_a).    (4)

Eq. 4 indicates that we project the feature embeddings V_a and V_b into an orthogonal common space while preserving their norms. This property has proved valuable for eliminating the correlation between different channels (i.e., the C-dimension) [50] and improving the network's generalization ability [3, 48].
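The two claims behind Eq. 4 (that a symmetric W yields an orthogonal P, and that the orthogonal projection preserves feature norms) can be checked numerically. A toy sketch under assumed random inputs, using the eigendecomposition of a symmetric matrix:

```python
# Numerical check of the symmetric case (Eq. 4): a symmetric weight
# matrix W = P^(-1) D P has an orthogonal P, so P V preserves norms.
# Toy sizes and names are illustrative.
import numpy as np

rng = np.random.default_rng(1)
C, HW = 8, 16
A = rng.standard_normal((C, C))
W_sym = (A + A.T) / 2               # enforce symmetry

eigvals, Q = np.linalg.eigh(W_sym)  # W_sym = Q diag(eigvals) Q^T
P = Q.T                             # orthogonal projection matrix
D = np.diag(eigvals)

V_a = rng.standard_normal((C, HW))
V_b = rng.standard_normal((C, HW))

S_direct = V_b.T @ W_sym @ V_a             # Eq. 1 with symmetric W
S_proj = (P @ V_b).T @ D @ (P @ V_a)       # Eq. 4: compare in common space
print(np.allclose(S_direct, S_proj))       # the two forms agree
print(np.allclose(np.linalg.norm(P @ V_a, axis=0),
                  np.linalg.norm(V_a, axis=0)))  # norms are preserved
```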

Figure 3: Illustration of our co-attention operation.

Channel-wise co-attention. Furthermore, the projection matrix P can be simplified into an identity matrix (i.e., no space transformation), and then the weight matrix W becomes a diagonal matrix. In this case, W (i.e., W = D) can be further decomposed into two diagonal matrices D_a and D_b. Thus, Eq. 3 can be re-written as channel-wise co-attention:

S = V_b^T D_b D_a V_a.    (5)

This operation is equivalent to applying a channel-wise weight to V_a and V_b before computing the similarity. It helps to alleviate channel-wise redundancy, in a spirit similar to the Squeeze-and-Excitation mechanism [7, 20]. During ablation studies (§4.2), we perform detailed experiments to assess the effect of the different co-attention mechanisms, i.e., vanilla co-attention (Eq. 3), symmetric co-attention (Eq. 4) and channel-wise co-attention (Eq. 5).
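The equivalence stated above (a diagonal weight matrix equals per-channel rescaling) can be sketched as follows; the per-channel weights `d_a`, `d_b` are random stand-ins for what an SE-style branch would produce.

```python
# Sketch of channel-wise co-attention (Eq. 5), assuming the weight matrix
# reduces to a product of diagonals D_b D_a, i.e., a per-channel rescaling
# of each embedding before the similarity is computed.
import numpy as np

rng = np.random.default_rng(2)
C, HW = 8, 16
V_a = rng.standard_normal((C, HW))
V_b = rng.standard_normal((C, HW))
d_a = rng.uniform(0.5, 1.5, size=C)   # channel weights (SE-style branch)
d_b = rng.uniform(0.5, 1.5, size=C)

# Diagonal-matrix form of Eq. 5 ...
S_diag = V_b.T @ np.diag(d_b) @ np.diag(d_a) @ V_a
# ... equals re-weighting channels of V_a and V_b, then a plain dot product.
S_scaled = (d_b[:, None] * V_b).T @ (d_a[:, None] * V_a)
print(np.allclose(S_diag, S_scaled))
```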

Figure 4: Schematic illustration of training pipeline (a) and testing pipeline (b) of COSNet.

After obtaining the similarity matrix S, as shown in the green and red areas in Fig. 3, we normalize S row-wise and column-wise with a softmax function:

S^c = softmax(S),  S^r = softmax(S^T),    (6)

where softmax(·) normalizes each column of the input. In Eq. 6, the i-th column of S^c is a vector of length wh. This vector reflects the relevance of each feature v_b^(j) (j ∈ {1, …, wh}) in V_b to the i-th feature in V_a. Next, the attention summaries Z_a for the feature embedding V_a w.r.t. V_b can be computed as (see the blue areas in Fig. 3):

Z_a = V_b S^c,  i.e.,  z_a^(i) = V_b s_c^(i) = Σ_j S^c_(j,i) v_b^(j),    (7)

where z_a^(i) denotes the i-th column of Z_a, s_c^(i) denotes the i-th column of S^c, v_b^(j) indicates the j-th column of V_b and S^c_(j,i) is the (j, i)-th element of S^c. Similarly, for frame F_b, we compute the corresponding attention summary as Z_b = V_a S^r.
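Eqs. 6–7 amount to a column-wise softmax followed by a matrix product. A minimal sketch with random features (the learned weight matrix is omitted for brevity, so this is a plain dot-product affinity):

```python
# Sketch of the softmax normalization (Eq. 6) and attention summaries
# (Eq. 7), assuming S has shape (w*h, w*h) with S[j, i] relating position
# j of frame b to position i of frame a.
import numpy as np

def softmax_cols(M):
    """Column-wise softmax: each column sums to 1."""
    e = np.exp(M - M.max(axis=0, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(3)
C, HW = 8, 16
V_a = rng.standard_normal((C, HW))
V_b = rng.standard_normal((C, HW))
S = V_b.T @ V_a                     # affinity (weight matrix omitted)

S_c = softmax_cols(S)               # attends over V_b for each position of V_a
S_r = softmax_cols(S.T)             # attends over V_a for each position of V_b

Z_a = V_b @ S_c                     # attention summary for frame a, (C, w*h)
Z_b = V_a @ S_r                     # attention summary for frame b
print(Z_a.shape, Z_b.shape)
```

Each column of Z_a is a convex combination of the columns of V_b, weighted by their relevance to the corresponding position in frame a.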

Gated co-attention. Considering the underlying appearance variations between input pairs, occlusions, and background noise, it is better to weight the information from different input frames, instead of treating all the co-attention information equally. To this end, a self-gate mechanism is introduced to allocate a co-attention confidence to each attention summary. The gates are formulated as follows:

f_a = σ(w_f ∗ Z_a + b_f),  f_b = σ(w_f ∗ Z_b + b_f),    (8)

where σ is the logistic sigmoid activation function, and w_f and b_f are the convolution kernel and bias, respectively. The gate determines how much information from the reference frame will be preserved and can be learned automatically. After calculating the gate confidences, the attention summaries are updated by:

Z'_a = Z_a ∘ f_a,  Z'_b = Z_b ∘ f_b,    (9)

where '∘' denotes the channel-wise Hadamard product. These operations lead to a gated co-attention framework.
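A toy sketch of the gate, assuming the convolution in Eq. 8 is simplified to a per-position linear projection (a 1×1-convolution analogue); weights and shapes are illustrative.

```python
# Sketch of gated co-attention (Eqs. 8-9): a sigmoid gate assigns one
# confidence in (0, 1) to each spatial position of the attention summary,
# which is then scaled by a Hadamard product.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
C, HW = 8, 16
Z_a = rng.standard_normal((C, HW))        # attention summary from Eq. 7
w_f = rng.standard_normal(C) * 0.1        # gate weights (1x1 conv analogue)
b_f = 0.0

gate = sigmoid(w_f @ Z_a + b_f)           # one confidence per position
Z_a_gated = Z_a * gate[None, :]           # Eq. 9: scale every channel
print(gate.shape)
```

Positions with low confidence (e.g., occluded or noisy regions of the reference) contribute little to the gated summary.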

Then we concatenate the final co-attention representation Z'_a and the original feature V_a together:

X_a = [Z'_a, V_a],    (10)

where '[·, ·]' denotes the concatenation operation (X_b is obtained analogously). Finally, the co-attention enhanced feature X_a can be fed into a segmentation network to produce the final result Y_a.

3.2 Full COSNet Architecture

Fig. 4 shows the training and testing pipelines of the proposed COSNet. Basically, COSNet is a Siamese network which consists of three cascaded parts: a DeepLabv3 [6] based feature embedding module, a co-attention module (detailed in §3.1) and a segmentation module.

Network architecture during the training phase. In the training phase, the Siamese COSNet takes two streams as input, i.e., a pair of frames {F_a, F_b} randomly sampled from the same video. First, the feature embedding module is used to build their feature representations {V_a, V_b}. Next, {V_a, V_b} are refined by the co-attention module, and the co-attention enhanced features {X_a, X_b} are computed through Eq. 10. Finally, the corresponding segmentation predictions {Y_a, Y_b} are produced by the segmentation module, which consists of multiple small-kernel convolution layers. Detailed configurations of the three modules can be found in the next section.

As we discussed in §1, primary objects in videos have two essential properties: (i) intra-frame discriminability, and (ii) inter-frame consistency. To distinguish the foreground target(s) from the background (property (i)), we utilize data from existing salient object segmentation datasets [10, 67] to train our backbone feature embedding module. As primary salient object instances are annotated in each image of these datasets, the learned feature embedding can capture and discriminate the objects of most interest. Meanwhile, to ensure COSNet is able to capture the global inter-frame coherence of the primary video objects (property (ii)), we train the whole COSNet with video segmentation data, where the co-attention module plays a key role in capturing the correlations between video frames. Specifically, we take two randomly selected frames of a video sequence to build training pairs. It is worth mentioning that this operation naturally and effectively augments the training data, compared to previous recurrent neural network based UVOS models that take only consecutive frames.

In this way, COSNet is alternately trained with static image data and dynamic video data. When using image data, we only train the feature embedding module, where an extra convolution layer with sigmoid activation is added to generate an intermediate segmentation side-output. The video data is used to train the whole COSNet, including the feature embedding module, the co-attention module and the segmentation module. We employ the weighted binary cross-entropy loss to train the network:

L_CE = -Σ_x [ (1-η) ℓ_x log p_x + η (1-ℓ_x) log(1-p_x) ],    (11)

where ℓ_x ∈ {0, 1} denotes the binary ground-truth, p_x is the intermediate or final segmentation prediction at pixel x, and η is the foreground-background pixel number ratio.
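A minimal sketch of such a class-balanced cross-entropy, assuming η is the fraction of foreground pixels so that the rarer class receives the larger weight; the exact weighting scheme in the released implementation may differ.

```python
# Sketch of a weighted binary cross-entropy (in the spirit of Eq. 11):
# the foreground term is weighted by (1 - eta) and the background term
# by eta, with eta the foreground pixel ratio.
import numpy as np

def weighted_bce(pred, gt, eps=1e-7):
    """pred, gt: flat arrays of per-pixel probabilities / binary labels."""
    eta = gt.mean()                          # foreground pixel ratio
    pred = np.clip(pred, eps, 1 - eps)       # avoid log(0)
    fg = (1 - eta) * gt * np.log(pred)       # up-weight scarce foreground
    bg = eta * (1 - gt) * np.log(1 - pred)
    return -(fg + bg).mean()

gt = np.array([1.0, 0.0, 0.0, 0.0])          # one foreground pixel in four
good = np.array([0.9, 0.1, 0.1, 0.1])
bad = np.array([0.1, 0.9, 0.9, 0.9])
print(weighted_bce(good, gt) < weighted_bce(bad, gt))  # better prediction, lower loss
```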

In addition, for the symmetric co-attention in Eq. 4, we add an extra orthogonal regularization term to the loss function to maintain the symmetry of the weight matrix W:

L = L_CE + λ ||W - W^T||_F,    (12)

where λ is the regularization parameter.
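The regularizer can be sketched as a penalty on the asymmetric part of W; this is a toy illustration of the symmetry constraint, not the paper's exact regularization term.

```python
# Sketch of a symmetry penalty for the co-attention weight matrix:
# the Frobenius norm of (W - W^T) is zero iff W is symmetric, which
# in turn keeps the projection matrix P orthogonal (see Eq. 4).
import numpy as np

def symmetry_penalty(W_mat):
    """Frobenius norm of the asymmetric part of W."""
    return np.linalg.norm(W_mat - W_mat.T)

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 8))
W_sym = (A + A.T) / 2                 # symmetrized copy

print(symmetry_penalty(W_sym))        # zero for a symmetric matrix
print(symmetry_penalty(A) > 0)        # positive for a generic matrix
```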

Figure 5: Performance improvement for an increasing number of reference frames (§4.2). (a) Testing frames with ground-truths overlaid. (b)-(e) Primary object predictions when considering different numbers of reference frames (N = 0, 1, 2 and 5). (f) Binary segments obtained by applying CRF to (e). We can see that without co-attention, COSNet degrades to a frame-by-frame segmentation model ((b): N = 0). Once co-attention is added ((c): N = 1), similar foreground distractors can be suppressed efficiently. Furthermore, more reference frames contribute to better segmentation performance ((c)-(e)).

Network architecture during the testing phase. Once the network is trained, we apply COSNet to unseen videos. Intuitively, given a test video, we can feed each frame to be segmented, along with only one reference frame sampled from the same video, into COSNet successively. Performing this operation frame-by-frame, we can obtain all the segmentation results. However, with such a simple strategy, the segmentation results still contain considerable noise, since the rich and global correlation information in the video is not fully explored. Therefore, it is critical to include more references during the testing phase (see Fig. 4(b)). One intuitive solution is to feed a set of different reference frames (uniformly sampled from the same video) into the inference branches and average all predictions. A more favorable way is the following: for the query frame F_a with a reference set containing N frames, Eq. 9 is reformulated by averaging the attention summaries over all references:

Z'_a = (1/N) Σ_(n=1..N) Z_(a,n) ∘ f_(a,n),    (13)

where Z_(a,n) and f_(a,n) are the attention summary and gate computed from the n-th reference frame. In this way, during the testing phase, the co-attention based feature X_a is able to efficiently capture the foreground information from a global view by considering more reference frames. Then we feed X_a into the segmentation module to generate the final output Y_a. Following the widely used protocol [53, 52, 49], we apply CRF as a post-processing step. In §4.2, we quantitatively demonstrate the performance improvement with an increasing number of reference frames.
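The test-time aggregation of Eq. 13 can be sketched as follows, assuming (for brevity) an ungated, plain dot-product co-attention; names and shapes are illustrative.

```python
# Sketch of test-time inference with N reference frames (Eq. 13): the
# attention summaries computed against each reference are averaged, then
# concatenated with the query feature (Eq. 10) for the segmentation head.
import numpy as np

def softmax_cols(M):
    e = np.exp(M - M.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def attention_summary(V_query, V_ref):
    """Co-attention summary of V_query w.r.t. one reference embedding."""
    S = V_ref.T @ V_query                  # weight matrix omitted for brevity
    return V_ref @ softmax_cols(S)

rng = np.random.default_rng(6)
C, HW, N = 8, 16, 5
V_query = rng.standard_normal((C, HW))
refs = [rng.standard_normal((C, HW)) for _ in range(N)]

Z = np.mean([attention_summary(V_query, V_r) for V_r in refs], axis=0)
X = np.concatenate([Z, V_query], axis=0)   # Eq. 10: input to segmentation head
print(X.shape)                             # 2C channels per position
```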

3.3 Implementation Details

Detailed network architecture. The backbone network of our COSNet is DeepLabv3 [6], which consists of the first five convolution blocks of ResNet [19] and an atrous spatial pyramid pooling (ASPP) module [6]. For the vanilla co-attention module (Eq. 3), we implement the weight matrix W using a fully connected layer with C×C parameters. The channel-wise co-attention in Eq. 5 is built on a Squeeze-and-Excitation (SE)-like module [20]: the channel weights generated through a fully connected layer with C nodes in one branch are applied to the feature embedding of the other branch [20]. Eq. 8 is implemented as a convolution layer with sigmoid activation. The segmentation module consists of two convolutional layers (with 256 filters and batch normalization) and a final convolutional layer (with 1 filter and sigmoid activation) for segmentation prediction.

Training settings. The whole training procedure of COSNet consists of two alternating steps. When using static data to fine-tune the DeepLabv3 based feature embedding module, we take advantage of the image saliency datasets MSRA10K [10] and DUT [67]; in this way, pixels belonging to the foreground target tend to lie close to each other in the embedding space. Meanwhile, we train the whole model with the training videos of DAVIS16 [45]. In this step, two randomly selected frames from the same sequence are fed into COSNet as training pairs. Given the input RGB frames, the feature embeddings V_a and V_b are computed at a reduced spatial resolution. The entire network is trained using the SGD optimizer with an initial learning rate of 2.5×10⁻⁴. During training, the batch size is set to 8 and the hyper-parameter λ in Eq. 12 is kept fixed. We implement the whole algorithm in PyTorch. All experiments and analyses are conducted on an Nvidia TITAN Xp GPU and an Intel Xeon E5 CPU. The overall training time is about 20 hours, and a forward pass with one batch takes around 0.18 seconds in the testing phase.

4 Experiments

4.1 Experimental Setup

We conduct experiments on three popular UVOS datasets: DAVIS16 [45], FBMS [41] and Youtube-Objects [47].

DAVIS16 is a recent dataset which consists of 50 videos in total (30 videos for training and 20 for testing). Per-frame pixel-wise annotations are offered. For quantitative evaluation, following the standard evaluation protocol of [45], we adopt three metrics, namely region similarity J, boundary accuracy F, and temporal stability T.

FBMS is comprised of 59 video sequences. Different from the DAVIS dataset, the ground-truth of FBMS is sparsely labeled (only 720 frames are annotated). Following the common setting [53, 52, 30, 32, 33, 49, 9], we validate the proposed method on the testing split, which consists of 30 sequences. The region similarity J is used for evaluation.

Youtube-Objects contains 126 video sequences belonging to 10 object categories, with more than 20,000 frames in total. We use the region similarity J to measure the segmentation performance.

Network Variant | DAVIS mean J (Δ) | FBMS mean J (Δ) | Youtube-Objects mean J (Δ)
Co-attention Mechanism
Vanilla co-attention (Eq. 3) 80.0 -0.5 75.2 -0.4 70.3 -0.2
Symmetric co-attention (Eq. 4) 80.5 - 75.6 - 70.5 -
Channel-wise co-attention (Eq. 5) 77.2 -3.3 72.7 -2.9 67.5 -3.0
w/o. Co-attention 71.3 -9.2 70.1 -5.5 62.9 -7.6
Fusion Strategy
Attention summary fusion (Eq. 13) 80.5 - 75.6 - 70.5 -
Prediction segmentation fusion 79.5 -1.0 74.2 -1.4 69.9 -0.6
Frames Selection Strategy
Global uniform sampling 80.53 - 75.61 - 70.54 -0.01
Global random sampling 80.52 -0.01 75.54 -0.02 70.55 -
Local consecutive sampling 80.26 -0.27 75.52 -0.09 70.43 -0.12
Table 1: Ablation study (§4.2) of COSNet on DAVIS16 [45], FBMS [41] and Youtube-Objects [47] datasets with different co-attention mechanisms, fusion strategies and sampling strategies.
Dataset | N = 0 | N = 1 | N = 2 | N = 5 | N = 7
DAVIS 71.3 77.6 79.7 80.5 80.5
FBMS 70.2 74.8 75.3 75.6 75.6
Youtube-Objects 62.9 67.7 70.5 70.5 70.5
Table 2: Comparisons with different numbers of reference frames during the testing stage on DAVIS16 [45], FBMS [41] and Youtube-Objects [47] datasets (§4.2). The mean is adopted.
Method  [17]  [51]  [31]  [40]  [14]  [9]  [42]  [28]  [52]  [24]  [53]  [30]  [49] COSNet
J Mean 47.3 48.2 49.8 53.3 55.1 55.2 55.8 67.4 70.0 70.7 75.9 76.2 77.2 80.5
J Recall 49.3 54.0 59.1 61.6 55.8 57.5 64.9 81.4 85.0 83.0 89.1 91.1 90.1 94.0
J Decay 8.3 10.5 14.1 2.4 12.6 2.2 0.0 6.2 1.3 1.5 0.0 7.0 0.9 0.0
F Mean 44.1 44.7 42.7 50.8 52.3 55.2 51.1 66.7 65.9 65.3 72.1 70.6 74.5 79.4
F Recall 43.6 52.6 37.5 60.0 61.0 51.9 51.6 77.1 79.2 73.8 83.4 83.5 84.4 90.4
F Decay 12.9 11.7 10.6 5.1 11.4 3.4 2.9 5.1 2.5 1.8 1.3 7.9 -0.2 0.0
T Mean 39.1 25.0 26.9 30.2 42.5 27.7 36.6 28.2 57.2 32.8 26.5 39.3 29.1 31.9
Table 3: Quantitative results on the test set of DAVIS16 [45] (see §4.3), using the region similarity J, boundary accuracy F and temporal stability T. We also report the recall and the decay performance over time for both J and F. The best scores are marked in bold.

4.2 Diagnostic Experiments

In this section, we focus on ablation studies to assess the important setups and components of COSNet. The experiments were performed on the test sets of DAVIS16 [45] and FBMS [41] as well as the whole Youtube-Objects [47]. The evaluation criterion is the mean region similarity (mean J).

Comparison of different co-attention mechanisms. We first study the effect of different co-attention mechanisms in COSNet, i.e., vanilla co-attention (Eq. 3), symmetric co-attention (Eq. 4) and channel-wise co-attention (Eq. 5). In Table 1, both the vanilla (fully connected) and the symmetric variants achieve better performance than the channel-wise attention mechanism. This demonstrates the importance of the space transformation in co-attention. Furthermore, compared with vanilla co-attention, symmetric co-attention performs slightly better. We attribute this to the orthogonal constraint, which reduces feature redundancy while keeping the norm of the features unchanged.

Effect of co-attention mechanism. When excluding the co-attention module and only using the base feature embedding network (DeepLabv3), we observe a significant performance drop (mean J: 80.5 → 71.3 on DAVIS), clearly showing the effectiveness of our strategy, which leverages the co-attention mechanism to model UVOS from a global view.

Attention summary fusion vs prediction fusion. In Eq. 13, we fuse the information from other reference frames by averaging the corresponding co-attention summaries. To verify its effectiveness, we implement an alternative baseline, prediction fusion, which directly averages the predictions obtained with different reference frames. The results in Table 1 demonstrate the superiority of fusing in the feature embedding space.

Comparison of different frame selection strategies. To investigate the influence of the frame selection strategy during the testing phase on the final prediction, we conduct a series of experiments with different sampling methods: global random sampling, global uniform sampling, and local consecutive sampling. From Table 1, it can be observed that both global-level sampling strategies achieve comparable performance and outperform the local sampling method. Meanwhile, the local sampling results are still superior to those obtained from the backbone network alone. These comparisons further prove the importance of incorporating global context.
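The three strategies compared above can be sketched as simple index-selection routines; a video of T frames, query index q, and N references are assumed, and all names are illustrative rather than taken from the released code.

```python
# Sketch of the three reference-frame sampling strategies: global uniform,
# global random, and local consecutive sampling around the query frame.
import numpy as np

def sample_refs(T, q, N, strategy, rng=None):
    rng = rng or np.random.default_rng()
    candidates = [t for t in range(T) if t != q]   # exclude the query itself
    if strategy == "uniform":      # spread evenly over the whole video
        idx = np.linspace(0, len(candidates) - 1, N).round().astype(int)
        return [candidates[i] for i in idx]
    if strategy == "random":       # global random sampling
        return list(rng.choice(candidates, size=N, replace=False))
    if strategy == "local":        # consecutive frames just before the query
        return [t for t in range(q - N, q) if t >= 0][:N]
    raise ValueError(strategy)

refs = sample_refs(T=100, q=50, N=5, strategy="uniform")
print(refs)   # five indices spread across the whole video
```

Global strategies give every part of the video a chance to serve as a reference, which is what provides the global context discussed above.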

Method NLC [14] FST [42] FSEG [24] MSTP [21] ARP [30]
Mean 44.5 55.5 68.4 60.8 59.8
Method IET [32] OBN [33] PDB [49] SFL [9] COSNet
Mean 71.9 73.9 74.0 56.0 75.6
Table 4: Quantitative performance on the test sequences of FBMS [41] using region similarity (mean ).

Influence of the number of reference frames. It is also of interest to assess the influence of the number N of reference frames on the final performance. Table 2 shows the results. When N is equal to 0, there is no co-attention for segmentation. We observe a large performance improvement when N changes from 0 to 1, which proves the importance of co-attention. Furthermore, when N changes from 2 to 5, the quantitative results show increased performance. When we further increase N, the final performance does not change noticeably. We therefore set N to 5 in the evaluation experiments.

Fig. 5 further visualizes the qualitative segmentation results for an increasing number of reference frames. When N = 0, the feature embedding module has learned to discriminate the foreground target from the background. However, when a similar object distractor appears (e.g., the small camel in the first row, or the red car in the second row), the feature embedding module fails to capture the primary target, since no ground-truth is given. In this case, the proposed co-attention mechanism can refer to long-range frames and capture the primary object, thus effectively suppressing similar target distractions.

Figure 6: Qualitative results on three datasets (§4.3). From top to bottom: dance-twirl from the DAVIS16 dataset [45], horses05 from the FBMS dataset [41], and bird0014 from the Youtube-Objects dataset [47].

4.3 Quantitative and Qualitative Results

Evaluation on DAVIS16 [45]. Table 3 shows the overall results, with all the top performing methods taken from the DAVIS 2016 benchmark [45]. COSNet outperforms all the reported methods across most metrics. Compared with the second best method, PDB [49], our COSNet achieves gains of 2.6 and 4.9 on mean J and mean F, respectively.

In Table 3, several other deep learning based state-of-the-art UVOS methods [9, 52, 24, 53, 33] leverage both appearance and extra motion information to improve performance. Different from these methods, the proposed COSNet only utilizes appearance information yet achieves superior performance. We attribute this improvement to the consideration of more temporal information through the co-attention mechanism. Compared with methods using optical flow to capture short-term temporal information, the advantage of exploiting temporal correlation from a global view is clear when dealing with similar target distractors.

Evaluation on FBMS [41]. We also perform experiments on the FBMS dataset for completeness. Table 4 shows that our COSNet performs better (75.6% in mean J) than the state-of-the-art methods [14, 42, 24, 21, 30, 32, 33, 49, 9]. Most competing methods utilize additional optical flow information, besides the RGB input, to estimate the segmentation mask. Considering that many foreground objects in FBMS share a similar appearance with the background but have different motion patterns, optical flow information clearly benefits the prediction. By contrast, our COSNet only takes advantage of the original RGB information and still achieves better performance.

Method  [42]  [55]  [30]  [53]  [49]  [24]  [9] COSNet
Airplane (6) 70.9 69.3 73.6 86.2 78.0 81.7 65.6 81.1
Bird (6) 70.6 76.0 56.1 81.0 80.0 63.8 65.4 75.7
Boat (15) 42.5 53.5 57.8 68.5 58.9 72.3 59.9 71.3
Car (7) 65.2 70.4 33.9 69.3 76.5 74.9 64.0 77.6
Cat (16) 52.1 66.8 30.5 58.8 63.0 68.4 58.9 66.5
Cow (20) 44.5 49.0 41.8 68.5 64.1 68.0 51.1 69.8
Dog (27) 65.3 47.5 36.8 61.7 70.1 69.4 54.1 76.8
Horse (14) 53.5 55.7 44.3 53.9 67.6 60.4 64.8 67.4
Motorbike (10) 44.2 39.5 48.9 60.8 58.3 62.7 52.6 67.7
Train (5) 29.6 53.4 39.2 66.3 35.2 62.2 34.0 46.8
Mean 53.8 58.1 46.2 67.5 65.4 68.4 57.0 70.5
Table 5: Quantitative performance of each category on Youtube-Objects [47] (§4.3) with the region similarity (mean 𝒥). We show the average performance for each of the 10 categories from the dataset; the final row shows the average over all videos.

Evaluation on Youtube-Objects [47]. Table 5 presents the results of all compared methods for each category. Our approach outperforms all compared methods [42, 55, 30, 53, 49, 24, 9] by a large margin. FSEG [24] performs second best under the mean 𝒥 metric. It is worth noting that the Youtube-Objects dataset shares categories with the training samples of FSEG, which contributes to its enhanced performance [24]. In addition, the categories in Youtube-Objects can be divided into two types: rigid objects (e.g., Airplane, Train) and non-rigid objects (e.g., Bird, Cat). Although objects in the latter class often undergo shape deformation and rapid appearance variation, COSNet captures long-term dependencies and handles these scenarios better than all compared methods.

Qualitative Results. Fig. 6 shows qualitative results across the three datasets. DAVIS16 [45] contains many challenging videos with fast motion, deformation, and multiple instances of the same category. We can see that the proposed COSNet tracks the primary region or target tightly by leveraging the co-attention scheme to consider global temporal information. The co-attention mechanism helps COSNet segment primary objects out of cluttered backgrounds; this effect is also visible in the bird0014 sequence of the Youtube-Objects dataset. In addition, some videos in the FBMS dataset contain multiple moving targets (e.g., horses05), and the proposed COSNet handles such scenarios well.

5 Conclusion

By regarding UVOS as a temporal coherence capturing task, we proposed a novel model, COSNet, to estimate the primary target(s). Through an alternated network training strategy with saliency-image and video-frame pairs, the proposed network learns to discriminate primary objects from the background in each frame and to capture the temporal correlation across frames. The proposed method achieves superior performance on three representative video segmentation datasets. Extensive experimental results show that our method effectively suppresses distraction from similar targets even though no annotation is given during segmentation. COSNet is a general framework for sequential data learning, and can be readily extended to other video analysis tasks, such as video saliency detection and optical flow estimation.

Acknowledgements This work was supported in part by the National Key Research and Development Program of China (2016YFB1001003), STCSM (18DZ1112300), and the Australian Research Council’s Discovery Projects funding scheme (DP150104645).


References

  • [1] X. Bai, J. Wang, D. Simons, and G. Sapiro (2009) Video SnapCut: robust video object cutout using localized classifiers. TOG 28 (3), pp. 70.
  • [2] L. Bao, B. Wu, and W. Liu (2018) CNN in MRF: video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF. In CVPR.
  • [3] A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2017) Neural photo editing with introspective adversarial networks. In ICLR.
  • [4] T. Brox and J. Malik (2010) Object segmentation by long term analysis of point trajectories. In ECCV.
  • [5] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. V. Gool (2017) One-shot video object segmentation. In CVPR.
  • [6] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587.
  • [7] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua (2017) SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In CVPR.
  • [8] J. Cheng, Y. Tsai, W. Hung, S. Wang, and M. Yang (2018) Fast and accurate online video object segmentation via tracking parts. In CVPR.
  • [9] J. Cheng, Y. Tsai, S. Wang, and M. Yang (2017) SegFlow: joint learning for video object segmentation and optical flow. In ICCV.
  • [10] M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu (2015) Global contrast based salient region detection. IEEE TPAMI 37 (3), pp. 569–582.
  • [11] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang (2017) Multi-context attention for human pose estimation. In CVPR.
  • [12] H. Ci, C. Wang, and Y. Wang (2018) Video object segmentation by learning location-sensitive embeddings. In ECCV.
  • [13] M. Denil, L. Bazzani, H. Larochelle, and N. de Freitas (2012) Learning where to attend with deep architectures for image tracking. Neural Computation 24 (8), pp. 2151–2184.
  • [14] A. Faktor and M. Irani (2014) Video segmentation by non-local consensus voting. In BMVC.
  • [15] H. Fang, J. Cao, Y. Tai, and C. Lu (2018) Pairwise body-part attention for recognizing human-object interactions. In ECCV.
  • [16] K. Fragkiadaki, P. Arbelaez, P. Felsen, and J. Malik (2015) Learning to segment moving objects in videos. In CVPR.
  • [17] K. Fragkiadaki, G. Zhang, and J. Shi (2012) Video segmentation by tracing discontinuities in a trajectory embedding. In CVPR.
  • [18] H. Fu, D. Xu, B. Zhang, and S. Lin (2014) Object-based multiple foreground video co-segmentation. In CVPR.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • [20] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR.
  • [21] Y. Hu, J. Huang, and A. G. Schwing (2018) Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation. In ECCV.
  • [22] Y. Hu, J. Huang, and A. G. Schwing (2018) VideoMatch: matching based video object segmentation. In ECCV.
  • [23] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu (2015) Spatial transformer networks. In NIPS.
  • [24] S. D. Jain, B. Xiong, and K. Grauman (2017) FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In CVPR.
  • [25] V. Jampani, R. Gadde, and P. V. Gehler (2017) Video propagation networks. In CVPR.
  • [26] S. Jetley, N. A. Lord, N. Lee, and P. H. Torr (2018) Learn to pay attention. In ICLR.
  • [27] Y. J. Koh, Y. Lee, and C. Kim (2018) Sequential clique optimization for video object segmentation. In ECCV.
  • [28] M. Keuper, B. Andres, and T. Brox (2015) Motion trajectory segmentation via minimum cost multicuts. In ICCV.
  • [29] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele (2017) Lucid data dreaming for object tracking. In CVPR Workshops.
  • [30] Y. J. Koh and C. Kim (2017) Primary object segmentation in videos based on region augmentation and reduction. In CVPR.
  • [31] Y. J. Lee, J. Kim, and K. Grauman (2011) Key-segments for video object segmentation. In ICCV.
  • [32] S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C.-C. Jay Kuo (2018) Instance embedding transfer to unsupervised video object segmentation. In CVPR.
  • [33] S. Li, B. Seybold, A. Vorobyov, X. Lei, and C.-C. Jay Kuo (2018) Unsupervised video object segmentation with motion-based bilateral networks. In ECCV.
  • [34] X. Li and C. Change Loy (2018) Video object segmentation with joint re-identification and attention-aware mask propagation. In ECCV.
  • [35] J. Lu, J. Yang, D. Batra, and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In NIPS.
  • [36] X. Lu, C. Ma, B. Ni, X. Yang, I. Reid, and M. Yang (2018) Deep regression tracking with shrinkage loss. In ECCV.
  • [37] T. Ma and L. J. Latecki (2012) Maximum weight cliques with mutex constraints for video object segmentation. In CVPR.
  • [38] V. Mnih, N. Heess, A. Graves, et al. (2014) Recurrent models of visual attention. In NIPS.
  • [39] D. Nguyen and T. Okatani (2018) Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In CVPR.
  • [40] P. Ochs and T. Brox (2011) Object segmentation in video: a hierarchical variational approach for turning point trajectories into dense regions. In ICCV.
  • [41] P. Ochs, J. Malik, and T. Brox (2014) Segmentation of moving objects by long term video analysis. IEEE TPAMI 36 (6), pp. 1187–1200.
  • [42] A. Papazoglou and V. Ferrari (2013) Fast object segmentation in unconstrained video. In ICCV.
  • [43] D. Pathak, R. Girshick, P. Dollar, T. Darrell, and B. Hariharan (2017) Learning features by watching objects move. In CVPR.
  • [44] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung (2017) Learning video object segmentation from static images. In CVPR.
  • [45] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In CVPR.
  • [46] F. Perazzi, O. Wang, M. H. Gross, and A. Sorkine-Hornung (2015) Fully connected object proposals for video segmentation. In ICCV.
  • [47] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari (2012) Learning object class detectors from weakly annotated video. In CVPR.
  • [48] P. Rodríguez, J. Gonzalez, G. Cucurull, J. M. Gonfaus, and X. Roca (2017) Regularizing CNNs with locally constrained decorrelations. In ICLR.
  • [49] H. Song, W. Wang, S. Zhao, J. Shen, and K. Lam (2018) Pyramid dilated deeper ConvLSTM for video salient object detection. In ECCV.
  • [50] Y. Sun, L. Zheng, W. Deng, and S. Wang (2017) SVDNet for pedestrian retrieval. In ICCV.
  • [51] B. Taylor, V. Karasev, and S. Soatto (2015) Causal video object segmentation from persistence of occlusions. In CVPR.
  • [52] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning motion patterns in videos. In CVPR.
  • [53] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning video object segmentation with visual memory. In ICCV.
  • [54] Y. Tsai, M. Yang, and M. J. Black (2016) Video segmentation via object flow. In CVPR.
  • [55] Y. Tsai, G. Zhong, and M. Yang (2016) Semantic co-segmentation in videos. In ECCV.
  • [56] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS.
  • [57] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. In CVPR.
  • [58] W. Wang, J. Shen, X. Li, and F. Porikli (2015) Robust video object cosegmentation. IEEE TIP 24 (10), pp. 3137–3148.
  • [59] W. Wang, J. Shen, and F. Porikli (2015) Saliency-aware geodesic video object segmentation. In CVPR.
  • [60] W. Wang, J. Shen, J. Xie, and F. Porikli (2017) Super-trajectory for video segmentation. In ICCV.
  • [61] W. Wang and J. Shen (2018) Deep visual attention prediction. IEEE TIP 27 (5), pp. 2368–2378.
  • [62] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR.
  • [63] Q. Wu, P. Wang, C. Shen, I. Reid, and A. van den Hengel (2018) Are you talking to me? Reasoned visual dialog generation through adversarial learning. In CVPR.
  • [64] H. Xiao, J. Feng, G. Lin, Y. Liu, and M. Zhang (2018) MoNet: deep motion exploitation for video object segmentation. In CVPR.
  • [65] C. Xiong, V. Zhong, and R. Socher (2017) Dynamic coattention networks for question answering. In ICLR.
  • [66] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, and X. Yang (2016) Person re-identification via recurrent feature aggregation. In ECCV.
  • [67] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. Yang (2013) Saliency detection via graph-based manifold ranking. In CVPR.
  • [68] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos (2018) Efficient video object segmentation via network modulation. In CVPR.
  • [69] J. S. Yoon, F. Rameau, J. Kim, S. Lee, S. Shin, and I. S. Kweon (2017) Pixel-level matching for video object segmentation using convolutional neural networks. In ICCV.
  • [70] D. Zhang, O. Javed, and M. Shah (2013) Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In CVPR.
  • [71] W. Zuo, X. Wu, L. Lin, L. Zhang, and M. Yang (2018) Learning support correlation filters for visual tracking. IEEE TPAMI.