Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks

01/19/2020 ∙ by Wenguan Wang, et al.

This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS). The proposed AGNN recasts this task as a process of iterative information fusion over video graphs. Specifically, AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges. The underlying pair-wise relations are described by a differentiable attention mechanism. Through parametric message passing, AGNN is able to efficiently capture and mine much richer and higher-order relations between video frames, thus enabling a more complete understanding of video content and more accurate foreground estimation. Experimental results on three video segmentation datasets show that AGNN sets a new state of the art in each case. To further demonstrate the generalizability of our framework, we extend AGNN to an additional task: image object co-segmentation (IOCS). We perform experiments on two popular IOCS datasets and again observe the superiority of our AGNN model. The extensive experiments verify that AGNN is able to learn the underlying semantic/appearance relationships among video frames or related images, and discover the common objects.




Code Repositories

Zero-shot Video Object Segmentation via Attentive Graph Neural Networks (ICCV 2019, Oral)

1 Introduction

Automatically identifying the primary objects in videos is an important problem that could benefit a wide variety of applications, by reducing or eliminating manual effort needed to process and understand video. However, discovering the most prominent and distinct objects across video frames without having prior knowledge of what those foreground objects are is a challenging task.

Figure 1: Illustration of the proposed AGNN based ZVOS model. (a) Input video sequence, typically with object occlusion and scale variation. (b) The suggested AGNN represents video frames as nodes (blue circles), and the relations between arbitrary frame pairs as edges (black arrows), captured by an attention mechanism. After several message passing iterations, higher-order relations can be mined and more optimal foreground estimations are obtained from a global view. (c) Final video object segmentation results. Best viewed in color. Zoom in for details.

Traditional methods tend to tackle this task using handcrafted or learnable features in a local or sequential manner. For instance, handcrafted-feature based methods use objectness [74], motion boundary [43], and saliency [67] cues over a few successive video frames, or explore trajectories [41], e.g., linking optical flow over multiple frames to capture long-term motion information. These are typically non-learning methods that work in a purely unsupervised manner. Recent deep learning based methods learn more powerful video object features from large-scale training data, yielding a zero-shot solution [63] (i.e., still no annotation is used for any testing frame). Many of these [7, 57, 21, 58, 31, 55] employ two-stream networks to combine local motion and appearance information, and apply recurrent neural networks to model the dynamics in a frame-by-frame manner.

Though these methods greatly promoted the development of this field and gained promising results, they generally suffer from two limitations. First, they focus primarily on the local pair-wise or sequential relations between successive frames, while ignoring the ubiquitous, high-order relationships among the frames (since frames from the same video are usually correlated). Second, since they do not fully leverage the rich relationships, they fail to completely capture the video content and hence may easily get inferior foreground estimates. From another perspective, as video objects usually suffer from underlying object occlusions, huge scale variations and appearance changes (Fig. 1 (a)), it is difficult to correctly infer the foreground when only considering successive or local pair-wise relations in videos.

To alleviate these issues, we need an effective framework that comprehensively models the high-order relationships among video frames within modern neural networks. In this work, an attentive graph neural network (AGNN) is proposed to address zero-shot video object segmentation (ZVOS), which recasts ZVOS as an end-to-end, message passing based graph information fusion procedure (Fig. 1 (b)). Specifically, we construct a fully connected graph where video frames are represented as nodes and the pair-wise relation between two frames is described as the edge between their corresponding nodes. The correlation between two frames is efficiently captured by an attention mechanism, which avoids time-consuming optical flow estimation [7, 57, 21, 58, 31]. By using recursive message passing to iteratively propagate information over the graph, i.e., each node receives information from the other nodes, AGNN can capture higher-order relationships among video frames and obtain better results from a global view. In addition, as video object segmentation is a per-pixel prediction task, AGNN has a desirable spatial information preserving property, which significantly distinguishes it from previous fully connected graph neural networks (GNNs).

AGNN operates on multiple frames, bringing the added advantage of natural training data augmentation, as the combination candidates are numerous. In addition, since AGNN offers a powerful tool for representing and mining much richer and higher-order relationships among video frames, it brings a more complete understanding of video content. More significantly, due to its recursive property, AGNN is flexible enough to process variable numbers of nodes during inference, enabling it to consider more input information and gain better performance (Fig. 1 (c)).

We extensively evaluate AGNN on three widely-used video object segmentation datasets, namely DAVIS16 [45], Youtube-Objects [47] and DAVIS17 [46], showing its superior performance over current state-of-the-art methods.

AGNN is a fully differentiable, end-to-end trainable framework that allows rich and high-order relations among frames (or images) to be captured, and is highly applicable to spatial prediction problems. To further demonstrate its advantages and generalizability, we apply AGNN to an additional task: image object co-segmentation (IOCS), which aims to extract the common objects from a group of semantically related images. It also gains promising results on two popular IOCS benchmarks, PASCAL VOC [11] and Internet [51], compared to existing IOCS methods.

Experiments on the main ZVOS and additional IOCS tasks clearly demonstrate that AGNN is able to not only capture the relationships among correlated video frames, but also mine the semantics among semantically related static images. Notably, this work can be viewed as an early attempt to apply and extend GNNs to pixel-wise prediction tasks, which provides an effective video object segmentation solution and new insight into this task.

2 Related Work

2.1 Graph Neural Networks

GNNs were first proposed in [15] and further developed in [53] to handle the underlying relationships among structured data. In [53], recurrent neural networks were used to model the state of each node, and the underlying correlations between nodes were learned via parameterized message passing over neighbors. Li et al. [33] further adapted GNNs to sequential outputs. Gilmer et al. [14] later formulated the message passing module in GNNs as a learnable neural network. Recently, GNNs have been successfully applied in many fields, including molecular biology [14], computer vision [48, 71, 76], machine learning and natural language processing [2]. Another popular trend in GNNs is to generalize the convolutional architecture to arbitrary graph-structured data [10, 40, 26], yielding graph convolutional neural networks (GCNNs).

The proposed AGNN falls into the former category; it is a message passing based GNN, where the node, edge, and message passing functions are all parameterized by neural networks. It shares the general idea of mining relationships over graphs but has significant differences. First, our AGNN is unique in its spatial information preserving nature, as opposed to conventional fully connected GNNs, which is crucial for per-pixel prediction tasks. Second, to efficiently capture the relationship between two frames, we introduce a differentiable attention mechanism that highlights the correlated information and produces more discriminative edge features. Third, as far as we know, there is no prior attempt to explore GNNs in ZVOS.

2.2 Automatic Video Object Segmentation

To automatically separate primary objects from the background, conventional methods typically use handcrafted features (e.g., color, optical flow) [43, 12, 59, 20] and certain heuristic assumptions about the foreground (e.g., local motion differences [43], background priors [67]). Others explore more efficient object representations, such as dense point trajectories [41, 42, 66] or object proposals [74, 27, 23, 36]. Most of these methods work in a purely unsupervised manner without using any training data.

Recently, with the renaissance of deep learning, more research efforts have been devoted to tackling this task with deep learning frameworks, leading to zero-shot solutions [13, 21, 58, 7, 30, 31, 29, 37]. For instance, a multi-layer perceptron based detector was designed in [13] to detect moving objectness. Li et al. [30] integrated deep learning based instance embeddings and motion saliency to boost performance. Others turned to fully convolutional networks (FCNs) [3, 34, 77], introducing two-stream networks to fuse appearance and motion information [29, 21, 7], or exploring more efficient feature extraction models and LSTM variants [55] to better locate the foreground objects.

The differences from previous methods are multifold: our AGNN 1) provides a unified, end-to-end trainable, graph model based ZVOS solution; 2) efficiently mines diverse and high-order relations within videos, through iteratively propagating and fusing messages over the graph; and 3) utilizes a differentiable attention mechanism to capture the correlated information between frame pairs.

2.3 Image Object Co-Segmentation

IOCS [50, 39, 18] aims to jointly segment common objects belonging to the same semantic class in a given set of related images. Early methods usually formulate IOCS as an energy function defined over the whole or a part of the image set and consider intra- and inter-image cues [64, 25, 52, 65]. To capture the relationships between images, some methods applied scene matching techniques [51], global appearance models [68], discriminative clustering methodologies [22], manifold ranking [49] or saliency heuristics [16, 56]. There are only a few deep IOCS models [4, 32], mainly due to the lack of a proper, end-to-end modeling strategy for this problem. [4, 32] tackled IOCS through a pair-wise comparison protocol and employed a Siamese network to capture the similarity between two related images. Our AGNN based IOCS solution is significantly different from [4, 32]. First, [4, 32] consider IOCS as a pair-wise image matching problem, while we formulate IOCS as an information propagation and fusion process among multiple images. This means our model can capture richer relations from a global view. Second, the Siamese network based systems only handle pair-wise relations, while our message passing based iterative inference can learn higher-order relations among multiple images. Third, our method is based on a graph model, yielding a more general and elegant framework for modeling IOCS.

Figure 2: Our AGNN based ZVOS model during the training phase (see §3.2 and §3.3). Zoom in for details.

3 Our Algorithm

Before elaborating on our proposed AGNN (§3.2), we first give a brief introduction to generic formulations of GNN models (§3.1). Finally, in §3.3, we provide detailed information on our network architecture.

3.1 General Formulations of GNNs

Based on deep neural networks and graph theory, GNNs are powerful for collectively aggregating information from data represented in graph domains [53, 14]. Specifically, a GNN model is defined according to a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$. Each node $v_i\in\mathcal{V}$ is associated with an initial node representation (or node state/embedding) $\mathbf{h}_i^0$. Each edge $e_{i,j}\in\mathcal{E}$ is a pair $(v_i,v_j)$, with an edge representation $\mathbf{e}_{i,j}$. For each node $v_i$, we learn an updated node representation $\mathbf{h}_i^t$ by aggregating the representations of its neighbors; $\mathbf{h}_i^t$ is used to produce an output $\mathbf{o}_i$, e.g., a node label. More specifically, GNNs map the graph $\mathcal{G}$ to the node outputs through two phases. First, a parametric message passing phase runs for $K$ steps, which recursively propagates messages and updates node representations. At the $t$-th iteration, for each node $v_i$, we update its state according to its received message $\mathbf{m}_i^t$ (i.e., summarized information from its neighbors $\mathcal{N}(v_i)$) and its previous state $\mathbf{h}_i^{t-1}$:

$$\mathbf{m}_i^t=\sum\nolimits_{v_j\in\mathcal{N}(v_i)} M(\mathbf{h}_j^{t-1},\mathbf{e}_{i,j}),\qquad \mathbf{h}_i^t=U(\mathbf{h}_i^{t-1},\mathbf{m}_i^t), \tag{1}$$

where $M(\cdot)$ and $U(\cdot)$ are the message function and state update function, respectively. After $t$ iterations of aggregation, $\mathbf{h}_i^t$ captures the relations within the $t$-hop neighborhood of node $v_i$.

Second, a readout phase maps the node representation $\mathbf{h}_i^K$ of the final $K$-th iteration to a node output through a readout function $R(\cdot)$:

$$\mathbf{o}_i = R(\mathbf{h}_i^K). \tag{2}$$

The message function $M$, update function $U$, and readout function $R$ are all learned differentiable functions.
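The two-phase formulation above can be sketched in a few lines; the following NumPy snippet uses toy linear message/update functions and a tanh nonlinearity as illustrative stand-ins (the weights and graph here are invented for the example, not the paper's learned networks):

```python
import numpy as np

def message_passing(h0, edges, W_msg, W_upd, K=3):
    """Run K rounds of generic message passing over a graph.

    h0    : (N, C) array of initial node states.
    edges : list of (i, j) pairs; a message flows from node j to node i.
    W_msg : (C, C) toy linear message function M.
    W_upd : (C, C) toy linear update function U.
    Returns the final node states h^K; a task-specific readout R
    would then map each state to an output.
    """
    h = h0.copy()
    for _ in range(K):
        m = np.zeros_like(h)
        for i, j in edges:
            m[i] += h[j] @ W_msg       # sum messages M(h_j) over neighbors
        h = np.tanh(h @ W_upd + m)     # state update U(h_i, m_i)
    return h
```

On a fully connected graph, `edges` contains every ordered pair of distinct nodes, which is exactly the topology AGNN assumes for a video.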

Next, we present our AGNN based ZVOS solution, which essentially extends traditional fully connected GNNs to (1) preserve spatial features; and (2) capture pair-wise relations (edges) via a differentiable attention mechanism.

3.2 Attentive Graph Neural Network

Problem Definition and Notations. Given a set of training samples and an unseen testing video $\mathcal{I}=\{I_1,\ldots,I_T\}$ with $T$ frames in total, the goal of ZVOS is to generate a corresponding sequence of binary segmentation masks $\{\hat{S}_1,\ldots,\hat{S}_T\}$. To achieve this, AGNN represents the video as a directed graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where node $v_i\in\mathcal{V}$ represents the $i$-th frame $I_i$, and edge $e_{i,j}=(v_i,v_j)\in\mathcal{E}$ indicates the relation from $I_i$ to $I_j$. To comprehensively capture the underlying relationships between video frames, we assume $\mathcal{G}$ is fully connected and includes a self-connection at each node (see Fig. 2 (a)). For clarity, we refer to $e_{i,i}$, which connects a node to itself, as a loop-edge, and $e_{i,j}$, which connects two different nodes $v_i$ and $v_j$, as a line-edge.

The core idea of our AGNN is to perform $K$ message propagation iterations over $\mathcal{G}$ to efficiently mine rich and high-order relations within the video. This helps to better capture the video content from a global view and obtain more accurate foreground estimates. We then read out the segmentation predictions from the final node states $\{\mathbf{h}_i^K\}_{i=1}^T$. Next, we describe each component of our model in detail.

Figure 3: Detailed illustration of our (a) node embedding, (b) intra-attention based loop-edge embedding and corresponding loop-message generation, and (c) inter-attention based line-edge embedding and corresponding neighbor-message generation.

FCN-Based Node Embedding. We leverage DeepLabV3 [5], a classical FCN based semantic segmentation architecture, to extract effective frame features as node representations (see Fig. 2 (b) and Fig. 3 (a)). For node $v_i$, its initial embedding $\mathbf{h}_i^0$ is computed as:

$$\mathbf{h}_i^0 = \mathbf{v}_i = F_{\mathrm{DeepLab}}(I_i) \in \mathbb{R}^{W\times H\times C}, \tag{3}$$

i.e., a 3D tensor feature with $W\times H$ spatial resolution and $C$ channels, which preserves spatial information as well as high-level semantic information.

Intra-Attention Based Loop-Edge Embedding. A loop-edge $e_{i,i}$ is a special edge that connects a node to itself. The loop-edge embedding $\mathbf{e}_{i,i}^t$ is used to capture the intra relations within the node representation $\mathbf{h}_i^t$ (i.e., within a single frame's representation). We formulate $\mathbf{e}_{i,i}^t$ as an intra-attention mechanism [61, 70], which has been proven complementary to convolutions and helpful for modeling long-range, multi-level dependencies across image regions [75]. In particular, the intra-attention calculates the response at a position by attending to all the positions within the same node embedding (see Fig. 2 (c) and Fig. 3 (b)):

$$\mathbf{e}_{i,i}^t = \alpha\,\mathrm{softmax}\big((W_f * \mathbf{h}_i^t)(W_h * \mathbf{h}_i^t)^{\top}\big)(W_l * \mathbf{h}_i^t) + \mathbf{h}_i^t, \tag{4}$$

where '$*$' represents the convolution operation, the $W$s indicate learnable convolution kernels, and $\alpha$ is a learnable scale parameter. Eq. 4 makes each position of $\mathbf{e}_{i,i}^t$ encode contextual information as well as its original information, thus enhancing the representability.
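A minimal NumPy sketch of this intra-attention step follows; the $(C, C)$ matrices stand in for the learnable convolution kernels of Eq. 4 (an assumption made for brevity — the real module convolves over the spatial grid):

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def intra_attention(h, Wf, Wh, Wl, alpha):
    """Intra-attention over the P = W*H spatial positions of one node.

    h          : (P, C) node embedding flattened over positions.
    Wf, Wh, Wl : (C, C) stand-ins for the learnable convolution kernels.
    alpha      : learnable scale parameter.
    Each output position mixes contextual information from all
    positions with its original feature (the residual term).
    """
    att = softmax((h @ Wf) @ (h @ Wh).T)   # (P, P) position affinities
    return alpha * (att @ (h @ Wl)) + h
```

Note that with `alpha = 0` the mechanism degenerates to the identity, matching the residual form of Eq. 4.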

Inter-Attention Based Line-Edge Embedding. A line-edge $e_{i,j}$ connects two different nodes $v_i$ and $v_j$. The line-edge embedding is used to mine the relation between the two nodes in the node embedding space (see Fig. 2 (b)). Here we compute an inter-attention mechanism [35] to capture the bi-directional relations between nodes $v_i$ and $v_j$ (see Fig. 2 (c) and Fig. 3 (c)):

$$\mathbf{e}_{i,j}^t = \mathbf{h}_i^t\, W_c\, (\mathbf{h}_j^t)^{\top} \in \mathbb{R}^{(WH)\times(WH)}, \tag{5}$$

where $\mathbf{e}_{j,i}^t = (\mathbf{e}_{i,j}^t)^{\top}$. For node $v_i$, $\mathbf{e}_{i,j}^t$ indicates the outgoing edge feature and $\mathbf{e}_{j,i}^t$ the incoming one. $W_c \in \mathbb{R}^{C\times C}$ indicates a learnable weight matrix, and $\mathbf{h}_i^t$ and $\mathbf{h}_j^t$ are flattened into $(WH)\times C$ matrix representations. Each element in $\mathbf{e}_{i,j}^t$ reflects the similarity between a row of $\mathbf{h}_i^t$ and a column of $(\mathbf{h}_j^t)^{\top}$. As a result, $\mathbf{e}_{i,j}^t$ can be viewed as the importance of node $v_j$'s embedding to $v_i$, and vice versa. By attending to each node pair, the edge embedding explores their joint representations in the node embedding space.
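The bilinear form of Eq. 5 and its transpose symmetry can be sketched directly (function and variable names are illustrative):

```python
import numpy as np

def inter_attention(hi, hj, Wc):
    """Bilinear inter-attention between two flattened node embeddings.

    hi, hj : (P, C) embeddings of nodes v_i and v_j, P = W*H positions.
    Wc     : (C, C) learnable weight matrix.
    Returns the outgoing edge feature e_ij of shape (P, P); the
    incoming feature e_ji is simply its transpose.
    """
    e_ij = hi @ Wc @ hj.T       # position-to-position similarity
    return e_ij, e_ij.T
```

Element `e_ij[p, q]` scores how relevant position `q` of frame $j$ is to position `p` of frame $i$, which is what the message passing step below normalizes and uses as mixing weights.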

Gated Message Aggregation. In our AGNN, for the message passed in the self-loop, we view the loop-edge embedding itself as the message (see Fig. 3 (b)), since it already contains the contextual and original node information (see Eq. 4):

$$\mathbf{m}_{i,i}^t = \mathbf{e}_{i,i}^t. \tag{6}$$

For the message $\mathbf{m}_{j,i}^t$ passed from node $v_j$ to $v_i$ (see Fig. 3 (c)), we have:

$$\mathbf{m}_{j,i}^t = \mathrm{softmax}(\mathbf{e}_{i,j}^t)\,\mathbf{h}_j^t \in \mathbb{R}^{(WH)\times C}, \tag{7}$$

where softmax($\cdot$) normalizes each row of the input. Thus, each row (position) of $\mathbf{m}_{j,i}^t$ is a weighted combination of the rows (positions) of $\mathbf{h}_j^t$, where the weights come from the corresponding row of $\mathrm{softmax}(\mathbf{e}_{i,j}^t)$. In this way, the message function assigns its edge-weighted feature (i.e., message) to the neighbor nodes [62]. Then, $\mathbf{m}_{j,i}^t$ is reshaped back to a 3D tensor of size $W\times H\times C$.

In addition, because some nodes are noisy due to camera shift or out-of-view objects, their messages may be useless or even harmful. We apply a learnable gate $g_{j,i}^t$ to measure the confidence of a message $\mathbf{m}_{j,i}^t$:

$$g_{j,i}^t = \sigma\big(F_{\mathrm{GAP}}(W_g * \mathbf{m}_{j,i}^t + b_g)\big), \tag{8}$$

where $F_{\mathrm{GAP}}(\cdot)$ indicates global average pooling, used to generate channel-wise responses, $\sigma(\cdot)$ is the logistic sigmoid function, and $W_g$ and $b_g$ are a trainable convolution kernel and bias.

Following Eq. 1, we collect the messages from the neighbors and the self-loop via gated summarization (see Fig. 2 (d)):

$$\mathbf{m}_i^t = \sum\nolimits_{v_j\in\mathcal{V}\setminus v_i} g_{j,i}^t \odot \mathbf{m}_{j,i}^t \;+\; g_{i,i}^t \odot \mathbf{m}_{i,i}^t, \tag{9}$$

where '$\odot$' denotes the channel-wise Hadamard product. Here, the gate mechanism is used to filter out irrelevant information from noisy frames. See §4.3 for a quantitative study of this design.
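Eqs. 8-9 can be sketched together as follows; the gate here is a parameter-free stand-in (a sigmoid of each message's channel-wise global average), whereas the paper's gate additionally applies a learnable convolution and bias before pooling:

```python
import numpy as np

def gated_aggregate(self_msg, neighbor_msgs):
    """Gate each message by channel-wise confidence, then sum (Eq. 9).

    self_msg      : (P, C) loop message m_ii.
    neighbor_msgs : list of (P, C) messages m_ji from the other nodes.
    """
    def gate(m):
        # (C,) per-channel confidence in [0, 1] via global average pooling
        return 1.0 / (1.0 + np.exp(-m.mean(axis=0)))
    out = gate(self_msg) * self_msg            # channel-wise Hadamard product
    for m in neighbor_msgs:
        out = out + gate(m) * m
    return out
```

Because each gate lies in (0, 1), a message from a noisy frame can be attenuated channel-by-channel before entering the node update.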

ConvGRU based Node-State Update. In step $t$, after aggregating all the information from the neighbor nodes and itself (Eq. 9), node $v_i$ gets a new state $\mathbf{h}_i^t$ by taking into account its prior state $\mathbf{h}_i^{t-1}$ and its received message $\mathbf{m}_i^t$. To preserve the spatial information conveyed in $\mathbf{h}_i^{t-1}$ and $\mathbf{m}_i^t$, we leverage ConvGRU [1] to update the node state (Fig. 2 (e)):

$$\mathbf{h}_i^t = F_{\mathrm{ConvGRU}}(\mathbf{h}_i^{t-1},\, \mathbf{m}_i^t). \tag{10}$$

ConvGRU was proposed as a convolutional counterpart to the earlier fully connected GRU [9], and introduces convolution operations into the input-to-state and state-to-state transitions.
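The GRU update underlying Eq. 10 can be sketched position-wise; in the actual ConvGRU the matrix products below are convolutions, which is what preserves the $W\times H$ layout (the plain products on flattened `(P, C)` arrays are a simplification for illustration):

```python
import numpy as np

def gru_update(h_prev, m, Wz, Uz, Wr, Ur, Wc, Uc):
    """GRU-style state update applied position-wise (cf. Eq. 10).

    h_prev, m : (P, C) previous state and aggregated message.
    W*, U*    : (C, C) stand-ins for the input-to-state and
                state-to-state (Conv)GRU kernels.
    """
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    z = sig(m @ Wz + h_prev @ Uz)                   # update gate
    r = sig(m @ Wr + h_prev @ Ur)                   # reset gate
    c = np.tanh(m @ Wc + (r * h_prev) @ Uc)         # candidate state
    return (1.0 - z) * h_prev + z * c               # blend old and new state
```

The update gate `z` decides, per position and channel, how much of the incoming message overwrites the previous node state.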

Readout Function. After $K$ message passing iterations, we obtain the final state $\mathbf{h}_i^K$ for each node $v_i$. Finally, in the readout phase, we get a segmentation prediction map $\hat{S}_i$ from $\mathbf{h}_i^K$ through a readout function $F_{\mathrm{readout}}(\cdot)$ (see Fig. 2 (f)). Slightly different from Eq. 2, we concatenate the final node state $\mathbf{h}_i^K$ and the original node feature $\mathbf{v}_i$ (i.e., $\mathbf{h}_i^0$) and feed the combined feature into the readout function:

$$\hat{S}_i = F_{\mathrm{readout}}\big([\mathbf{h}_i^K,\, \mathbf{v}_i]\big). \tag{11}$$

Again, to preserve spatial information, the readout function is implemented as a small FCN, which has three convolution layers with a sigmoid function to normalize the prediction to $[0, 1]$.

The convolution operations in the intra-attention (Eq. 4) and update function (Eq. 10) are realized with convolutional layers. The readout function (Eq. 11) consists of two convolutional layers cascaded by a convolutional layer. As a message passing based GNN model, these functions share weights among all the nodes. Moreover, all the above functions are carefully designed to avoid disturbing spatial information, which is essential for ZVOS since it is a pixel-wise prediction task.

3.3 Detailed Network Architecture

Our whole model is end-to-end trainable, as all the functions in AGNN are parameterized by neural networks. We use the first five convolution blocks of DeepLabV3 [5] as our backbone for feature extraction. For an input video, each frame $I_i$ is represented as a node $v_i$ in the video graph and associated with an initial node state $\mathbf{h}_i^0$. Then, after a total of $K$ message passing iterations, for each node $v_i$, we use the readout function in Eq. 11 to obtain a corresponding segmentation prediction map $\hat{S}_i$. More details on the training and testing phases are provided as follows.


Method    KEY [28] MSG [41] NLC [12] CUT [24] FST [43] SFL [7] MP [57] FSEG [21] LVO [58] ARP [27] PDB [55] MOA [54] AGS [69] AGNN
J Mean    49.8     53.3     55.1     55.2     55.8     67.4    70.0    70.7      75.9     76.2     77.2     77.2     79.7     80.7
J Recall  59.1     61.6     55.8     57.5     64.9     81.4    85.0    83.0      89.1     91.1     90.1     87.8     91.1     94.0
J Decay ↓ 14.1     2.4      12.6     2.2      0.0      6.2     1.3     1.5       0.0      7.0      0.9      5.0      1.9      0.03
F Mean    42.7     50.8     52.3     55.2     51.1     66.7    65.9    65.3      72.1     70.6     74.5     77.4     77.4     79.1
F Recall  37.5     60.0     61.0     51.9     51.6     77.1    79.2    73.8      83.4     83.5     84.4     84.4     85.8     90.5
F Decay ↓ 10.6     5.1      11.4     3.4      2.9      5.1     2.5     1.8       1.3      7.9      -0.2     3.3      1.6      0.03
T Mean ↓  26.9     30.2     42.5     27.7     36.6     28.2    57.2    32.8      26.5     39.3     29.1     27.9     26.7     33.7


Table 1: Quantitative results on the validation set of DAVIS16 [45] (see §4.1.2). The scores are borrowed from the public leaderboard. (The best scores are marked in bold; the best two entries in each row are marked in gray. The same notation is used in the other tables.)


Airplane (6) Bird (6) Boat (15) Car (7) Cat (16) Cow (20) Dog (27) Horse (14) Motorbike (10) Train (5) Avg.
FST [43] 70.9 70.6 42.5 65.2 52.1 44.5 65.3 53.5 44.2 29.6 53.8
COSEG [60] 69.3 76.0 53.5 70.4 66.8 49.0 47.5 55.7 39.5 53.4 58.1
ARP [27] 73.6 56.1 57.8 33.9 30.5 41.8 36.8 44.3 48.9 39.2 46.2
LVO [58] 86.2 81.0 68.5 69.3 58.8 68.5 61.7 53.9 60.8 66.3 67.5
PDB [55] 78.0 80.0 58.9 76.5 63.0 64.1 70.1 67.6 58.3 35.2 65.4
FSEG [21] 81.7 63.8 72.3 74.9 68.4 68.0 69.4 60.4 62.7 62.2 68.4
SFL [7] 65.6 65.4 59.9 64.0 58.9 51.1 54.1 64.8 52.6 34.0 57.0
AGS [69] 87.7 76.7 72.2 78.6 69.2 64.6 73.3 64.4 62.1 48.2 69.7
AGNN 81.1 75.9 70.7 78.1 67.9 69.7 77.4 67.3 68.3 47.8 70.8


Table 2: Quantitative per-category performance on Youtube-Objects [47] (see §4.1.2) with mean J. We show the average performance for each of the 10 categories, and the last column ('Avg.') shows an average over all the videos.

Training Phase. As we operate on batches of size $N$ (which is allowed to vary, depending on GPU memory), we leverage a random sampling strategy to train AGNN. Specifically, we split each training video into $N$ segments and randomly select one frame from each segment. We then feed the $N$ sampled frames into a batch and train AGNN, so the relationships among all the sampled frames in each batch are represented using an $N$-node graph. Such a sampling strategy provides robustness to variations and enables the network to fully exploit all frames. The diversity among the samples enables our model to better capture the underlying relationships and improves its generalizability. Let us denote the ground-truth segmentation mask and predicted foreground map for a training frame $I_i$ as $S_i$ and $\hat{S}_i$. Our model is trained through the weighted binary cross entropy loss (see Fig. 2):

$$\mathcal{L}(S_i,\hat{S}_i) = -\sum\nolimits_{x}\big[(1-w)\,S_i^x\log\hat{S}_i^x + w\,(1-S_i^x)\log(1-\hat{S}_i^x)\big], \tag{12}$$

where $x$ ranges over pixels and $w$ indicates the foreground-background pixel number ratio in $S_i$. It is worth mentioning that, as AGNN handles multiple video frames at the same time, it leads to a remarkably efficient training data augmentation strategy, as the combination candidates are numerous. In our experiments, during training, we randomly select 2 videos from the training video set and sample 3 frames per video ($N=6$), due to computational limitations. In addition, we set the total number of message passing iterations to $K=3$. A quantitative study of these settings can be found in §4.3.
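The per-video sampling strategy can be sketched as follows (the function name is illustrative; it splits the frame range into equal segments and draws one index from each):

```python
import random

def sample_training_frames(num_frames, n_segments, rng=random):
    """Split a video of num_frames frames into n_segments equal chunks
    and randomly pick one frame index from each chunk, as in the
    training sampling strategy (one sampled frame per segment)."""
    bounds = [k * num_frames // n_segments for k in range(n_segments + 1)]
    return [rng.randrange(bounds[k], bounds[k + 1])
            for k in range(n_segments)]
```

Because each call draws a fresh combination of frames, repeated sampling over epochs yields many distinct graphs from the same video, which is the data-augmentation effect described above.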

Testing Phase. After training, we can apply the learned AGNN to perform per-pixel object prediction over unseen videos. For an input test video $\mathcal{I}$ with $T$ frames, we split it into $N$ subsets $\{\mathcal{I}_1,\ldots,\mathcal{I}_N\}$, where $N = T/N'$. Each subset contains $N'$ frames sampled with an interval of $N$ frames: $\mathcal{I}_k = \{I_k, I_{k+N}, \ldots, I_{k+(N'-1)N}\}$. We then feed each subset into AGNN to obtain the segmentation maps of all the frames in the subset. In practice, we set $N'=5$ during testing; we quantitatively study this setting in §4.3. As our AGNN does not require time-consuming optical flow computation and processes $N'$ frames in one feed-forward propagation, it runs at a fast per-frame speed. Following the widely used protocol [58, 57, 55], we apply CRF as a post-processing step, which adds a small amount of extra time per frame. More implementation details can be found in §4.1.1.
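The subset construction used at test time can be sketched as below (the helper name is illustrative); each subset takes every $N$-th frame starting from a different offset, so together the subsets cover the whole video exactly once:

```python
def split_into_subsets(num_frames, frames_per_subset):
    """Split frame indices 0..num_frames-1 into N subsets, where
    N = ceil(num_frames / frames_per_subset); subset k contains
    frames k, k+N, k+2N, ..., mirroring the testing-phase strategy."""
    n_subsets = -(-num_frames // frames_per_subset)   # ceiling division
    return [list(range(k, num_frames, n_subsets)) for k in range(n_subsets)]
```

For a 20-frame video with 5 frames per subset this yields 4 interleaved subsets, each spanning the full temporal extent of the video rather than one contiguous chunk.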

4 Experiments

We first report performance on the main task: unsupervised video object segmentation (§4.1). Then, in §4.2, to further demonstrate the advantages of our AGNN model, we test it on an additional task: image object co-segmentation. Finally, we conduct an ablation study in §4.3.

4.1 Main Task: ZVOS

4.1.1 Experimental Setup

Datasets and Metrics:  We use three well-known datasets:


  • DAVIS16 [45] is a challenging video object segmentation dataset which consists of 50 videos in total (30 for training and 20 for validation) with pixel-wise annotations for every frame. Three evaluation criteria are used for this dataset, i.e., region similarity (intersection-over-union) J, boundary accuracy F, and time stability T.

  • Youtube-Objects [47] comprises 126 video sequences which belong to 10 object categories and contain more than 20,000 frames in total. Following its protocol, we use mean J to measure the segmentation performance.

  • DAVIS17 [46] consists of 60 videos in the training set, 30 videos in the validation set and 30 videos in the test-dev set. Different from DAVIS16 and Youtube-Objects, which only focus on object-level video object segmentation, DAVIS17 provides instance-level annotations.

Implementation Details: Following [44, 55], both static data from image salient object segmentation datasets (MSRA10K [8], DUT [72]) and video data from the training set of DAVIS16 are iteratively used to train our model. In a 'static-image' iteration, we randomly sample 6 images from the static training data to train our backbone network (DeepLabV3) to extract more discriminative foreground features. To train the backbone, a convolution layer with a sigmoid function is appended as an intermediate output layer, which can access the static-image supervision signal. This is followed by a 'dynamic-video' iteration, in which we use the sampling strategy described in §3.3 to sample 6 video frames to train our whole AGNN model. The 'static-image' and 'dynamic-video' iterations are executed alternately. To apply the trained AGNN model to DAVIS17, we first use a category-agnostic Mask R-CNN [17] to generate instance-level object proposals for each frame. Then, we run AGNN on the whole video and generate a coarse mask for the primary objects in each frame. The object-level masks are used to filter out background proposals and highlight foreground proposals. By combining the instance bounding-box proposals and the coarse masks, we obtain an instance-level mask for each primary object. Finally, to associate instances across different frames, we use overlap ratio and optical flow as association metrics [38] to match instance-level masks.

Figure 4: Qualitative results on two example videos (top: soapbox, bottom: judo) from the DAVIS16 val set and the DAVIS17 test-dev set, respectively (see §4.1.3).

4.1.2 Quantitative Performance

Val-set of DAVIS16. We compare the proposed AGNN with the top ZVOS methods from the DAVIS16 benchmark [45] (public leaderboard, deadline: Mar. 2019). Table 1 shows the detailed results. Our AGNN outperforms the best reported result (i.e., AGS [69]) on the DAVIS16 benchmark by a significant margin in terms of mean J (80.7 vs 79.7) and mean F (79.1 vs 77.4). Compared to PDB [55], which uses the same training protocol and training datasets, our AGNN yields significant performance gains of 3.5 and 4.6 in terms of mean J and mean F, respectively.

Youtube-Objects. Table 2 gives the detailed per-category performance and average results on Youtube-Objects. As can be seen, our AGNN performs favorably according to the mean J criterion. Furthermore, unlike other methods whose performance fluctuates across categories, AGNN maintains stable performance, which further proves its robustness and generalizability.

Test-dev set of DAVIS17. In Table 3 we report a performance comparison with the recent instance-level ZVOS method RVOS [63] on the DAVIS17 test-dev set. AGNN significantly outperforms RVOS on most evaluation criteria.


Method     J Mean  J Recall  J Decay  F Mean  F Recall  F Decay  J&F Mean
RVOS [63]  39.0    42.8      0.50     48.3    49.6      -0.01    43.7
AGNN       58.9    65.7      11.7     63.2    67.1      14.3     61.1


Table 3: Quantitative results on the DAVIS17 test-dev set [46].

4.1.3 Qualitative Performance

Fig. 4 depicts visual results of the proposed AGNN on two challenging video sequences, soapbox and judo, from DAVIS16 and DAVIS17, respectively. In soapbox, the primary objects undergo huge scale variation, deformation and view changes, but our AGNN still generates accurate foreground segments. AGNN also handles judo well, although the different foreground instances have similar appearance and undergo rapid motion.

4.2 Additional Task: IOCS

Our AGNN model can be viewed as a general framework for capturing high-order relations among images (or frames). To demonstrate its generalizability, we extend AGNN to the IOCS task. Rather than extracting foreground objects across the relatively similar frames of a video, IOCS needs to infer the common objects from a group of semantically related images.

4.2.1 Experimental Setup

Datasets and Metrics: We perform experiments on two well-known IOCS datasets:


  • PASCAL VOC [11] has 1,464 training images and 1,449 validation images. Following [32], we split the validation set into 724 validation and 725 test images, and use mean J as the performance measure.

  • Internet [51] contains 1,306 car, 879 horse, and 561 airplane images. Following [4, 49], we measure the performance with mean J on a subset of Internet in which 100 images per class are sampled.


Method GO-FMR [49] FCNs [34] CA [4] AGNN
Mean 52.0 55.21 59.24 60.78
Method FCA [4] CSA [4] DOCS [32] AGNN
Mean 59.41 59.76 57.82 60.78


Table 4: Quantitative performance on PASCAL VOC [11] with mean J, averaged over the 20 categories and all images. See §4.2.2 for detailed analyses.


Method DC [22] Internet [51] TDK [6] GO-FMR [49] DDCRF [73] CA [4] FCA [4] CSA [4] DOCS [32] CoA [19] AGNN
Car 37.1 64.4 64.9 66.8 72.0 80.0 76.9 79.9 82.7 82.0 84.0
Horse 30.1 51.6 33.4 58.1 65.0 67.3 69.1 71.4 64.6 61.0 72.6
Airplane 15.3 57.3 46.2 60.4 67.7 72.8 70.6 73.1 70.3 67.0 76.1
Avg. 27.5 57.3 46.2 60.4 67.7 70.3 72.8 70.6 73.1 67.7 77.6


Table 5: Quantitative results on Internet [51] with mean J (see §4.2.2). We show the per-class performance and an overall average.
Figure 5: Qualitative image object co-segmentation results on PASCAL VOC [11] (top) and Internet [51] (bottom). See §4.2.3.

Implementation Details: Following [4, 32], we employ PASCAL VOC to train our model. In each iteration, we randomly sample a group of images that belong to the same semantic class, and feed two groups with randomly selected classes (6 images in total) to the network. All other experimental settings are the same as ZVOS.

After training, we evaluate the performance of our method on the test sets of PASCAL VOC and the Internet dataset. When processing an image, IOCS must leverage information from the whole image group (as the images are typically different and some are irrelevant) [49, 65]. To this end, for each image I_k to be segmented, we uniformly split the other images into several groups. We first feed one image group together with I_k into a batch and store the updated node state of I_k. We then feed the next group along with the stored node state of I_k to obtain a new state of I_k, and so on. After all groups have been processed, the final state of I_k contains its relationships to all the other images and is used to produce its final co-segmentation result.

4.2.2 Quantitative Performance

PASCAL VOC. It is very challenging to segment the common objects in this dataset, since the objects undergo drastic variations in scale, position and appearance. In addition, some images contain multiple objects belonging to different categories. On this dataset, we compare AGNN with six representative methods, including Siamese-based co-segmentation methods [4, 32], as well as deep semantic segmentation models (e.g., FCNs [34]).

Table 4 shows detailed results in terms of mean 𝒥. FCNs [34] segment each image individually (without considering other related images), and thus give poor performance. Both [4] and [32] consider pairs of images and obtain better results. Our AGNN achieves the best performance because it considers high-order information from multiple images during inference, enabling it to capture richer semantic relations within the image groups.

Internet. We evaluate our model (pre-trained on PASCAL VOC) on Internet [4, 49]. The quantitative results in Table 5 again demonstrate the superiority of AGNN (a 4.5% performance gain over the second-best method). AGNN also outperforms all compared methods on each of the three classes: Car (84.0%), Horse (72.6%), and Airplane (76.1%).

4.2.3 Qualitative Results

Fig. 5 shows some sample results. Specifically, the first four images in the top row belong to the Cat category (red circle), while the last four contain the Person category (yellow circle) with significant intra-class variation. In both cases, our AGNN successfully detects the common object instances amid background clutter. In the second row, AGNN also performs well despite remarkable intra-class appearance changes.


Aspect            Variant                            Mean 𝒥   Δ
Reference         Full model (3 iterations, N' = 5)  80.7     -
Graph structure   w/o AGNN                           72.2    -8.5
Message passing   w/o gated message (Eq. 9)          80.1    -0.6
Iterations        1 iteration                        78.7    -2.0
                  2 iterations                       79.1    -1.6
                  4 iterations                       80.7     0.0
Node number       N' = 3                             79.6    -1.1
                  N' = 6                             80.7     0.0
                  N' = 7                             80.7     0.0
Post-processing   w/o CRF                            78.9    -1.8


Table 6: Ablation study (§4.3) on the val set of DAVIS [45].

4.3 Ablation Study

We perform an ablation study on DAVIS [45] to investigate the effect of each essential component of AGNN.

Effectiveness of Our AGNN. To quantify the contribution of our AGNN, we derive a baseline, w/o. AGNN, which reports the results of our backbone model, DeepLabV3, alone. As shown in Table 6, AGNN brings a significant performance improvement (72.2 → 80.7 in terms of mean 𝒥).

Gated Message Aggregation Strategy. In Eq. 9, we equip the message passing with a channel-wise gating mechanism to reduce the negative influence of irrelevant frames. To evaluate this design, we provide a baseline, w/o. Gated Message, which aggregates messages directly. A performance degradation is observed after excluding the gates.
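The gating idea can be illustrated with a minimal NumPy sketch. This is not the authors' exact layer from Eq. 9: the gate here is a simplified per-channel affine map (`w`, `b` are assumed parameters) applied to each message's global-average-pooled descriptor, followed by a sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_aggregate(messages, w, b):
    """Channel-wise gated message aggregation (simplified sketch).

    messages: list of (C, H, W) arrays, one message per neighbour node.
    w, b:     (C,) gate parameters (stand-ins for a learned gate function).

    Each message is rescaled by a per-channel gate in [0, 1] computed from
    its pooled descriptor, so messages from irrelevant neighbours can be
    suppressed before summation.
    """
    agg = np.zeros_like(messages[0])
    for m in messages:
        desc = m.mean(axis=(1, 2))        # (C,) global average pooling
        gate = sigmoid(w * desc + b)      # (C,) channel-wise gate
        agg += gate[:, None, None] * m    # rescale channels, accumulate
    return agg

C, H, W = 4, 2, 2
msgs = [np.ones((C, H, W)), np.full((C, H, W), -1.0)]
out = gated_aggregate(msgs, w=np.ones(C), b=np.zeros(C))
```

With a steep gate (large `w`), the strongly negative message is driven toward a gate of 0 and contributes almost nothing, which is the intended behaviour for irrelevant frames.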

Message Passing Iterations. To investigate the effect of the number of message passing iterations, we report performance as a function of the iteration count. We find that more iterations yield better results, and that performance converges after three iterations.

Node Number During Inference. To evaluate the impact of the number of nodes during inference, we report performance with different numbers of input frames N'. With more input frames (N' increasing from 3 to 5), performance rises accordingly. When even more frames are considered (N' > 5), performance does not change noticeably, likely due to the redundant content in video sequences.

5 Conclusion

This paper proposes a novel AGNN-based ZVOS framework that captures the relations among video frames and infers the common foreground objects. It leverages an attention mechanism to capture the similarity between nodes and performs recursive message passing to mine the underlying high-order correlations. We further demonstrate the generalizability of AGNN by extending it to the IOCS task. Extensive experiments on three ZVOS and two IOCS datasets show that AGNN performs favorably against current state-of-the-art methods, illustrating its ability to capture diverse relations among video frames or semantically related images.

Acknowledgements This work was supported in part by ARO grant W911NF-18-1-0296, Beijing Natural Science Foundation under Grant 4182056, CCF-Tencent Open Fund, Zhijiang Lab’s International Talent Fund for Young Professionals, and the National Science Foundation (CAREER IIS-1253549).


  • [1] N. Ballas, L. Yao, C. Pal, and A. Courville (2016) Delving deeper into convolutional networks for learning video representations. In ICLR, Cited by: §3.2.
  • [2] D. Beck, G. Haffari, and T. Cohn (2018) Graph-to-sequence learning using gated graph neural networks. In ACL, Cited by: §2.1.
  • [3] J. Cao, Y. Pang, and X. Li (2019) Triply supervised decoder networks for joint detection and segmentation. In CVPR, Cited by: §2.2.
  • [4] H. Chen, Y. Huang, and H. Nakayama (2018) Semantic aware attention based deep object co-segmentation. In ACCV, Cited by: §2.3, 2nd item, §4.2.1, §4.2.2, §4.2.2, §4.2.2, Table 4, Table 5.
  • [5] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. CoRR abs/1706.05587. Cited by: §3.2, §3.3.
  • [6] X. Chen, A. Shrivastava, and A. Gupta (2014) Enriching visual knowledge bases via object discovery and segmentation. In CVPR, Cited by: Table 5.
  • [7] J. Cheng, Y. Tsai, S. Wang, and M. Yang (2017) Segflow: joint learning for video object segmentation and optical flow. In ICCV, Cited by: §1, §1, §2.2, Table 1, Table 2.
  • [8] M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu (2015) Global contrast based salient region detection. IEEE TPAMI 37 (3), pp. 569–582. Cited by: §4.1.1.
  • [9] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP, Cited by: §3.2.
  • [10] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In NIPS, Cited by: §2.1.
  • [11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Cited by: §1, Figure 5, 1st item, Table 4.
  • [12] A. Faktor and M. Irani (2014) Video segmentation by non-local consensus voting. In BMVC, Cited by: §2.2, Table 1.
  • [13] K. Fragkiadaki, P. Arbelaez, P. Felsen, and J. Malik (2015) Learning to segment moving objects in videos. In CVPR, Cited by: §2.2.
  • [14] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML, Cited by: §2.1, §3.1.
  • [15] M. Gori, G. Monfardini, and F. Scarselli (2005) A new model for learning in graph domains. In IJCNN, Cited by: §2.1.
  • [16] J. Han, R. Quan, D. Zhang, and F. Nie (2018) Robust object co-segmentation using background prior. IEEE TIP 27 (4), pp. 1639–1651. Cited by: §2.3.
  • [17] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, Cited by: §4.1.1.
  • [18] D. S. Hochbaum and V. Singh (2009) An efficient algorithm for co-segmentation. In ICCV, Cited by: §2.3.
  • [19] K. Hsu, Y. Lin, and Y. Chuang (2018) Co-attention cnns for unsupervised object co-segmentation.. In IJCAI, Cited by: Table 5.
  • [20] Y. Hu, J. Huang, and A. G. Schwing (2018) Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation. In ECCV, Cited by: §2.2.
  • [21] S. D. Jain, B. Xiong, and K. Grauman (2017) Fusionseg: learning to combine motion and appearance for fully automatic segmention of generic objects in videos. In CVPR, Cited by: §1, §1, §2.2, Table 1, Table 2.
  • [22] A. Joulin, F. Bach, and J. Ponce (2010) Discriminative clustering for image co-segmentation. In CVPR, Cited by: §2.3, Table 5.
  • [23] Y. J. Koh, Y. Lee, and C. Kim (2018) Sequential clique optimization for video object segmentation. In ECCV, Cited by: §2.2.
  • [24] M. Keuper, B. Andres, and T. Brox (2015) Motion trajectory segmentation via minimum cost multicuts. In ICCV, Cited by: Table 1.
  • [25] G. Kim, P. E. Xing, L. Fei-Fei, and T. Kanade (2011) Distributed cosegmentation via submodular optimization on anisotropic diffusion. In ICCV, Cited by: §2.3.
  • [26] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §2.1.
  • [27] Y. J. Koh and C. Kim (2017) Primary object segmentation in videos based on region augmentation and reduction. In CVPR, Cited by: §2.2, Table 1, Table 2.
  • [28] Y. J. Lee, J. Kim, and K. Grauman (2011) Key-segments for video object segmentation. In ICCV, Cited by: Table 1.
  • [29] G. Li, Y. Xie, T. Wei, K. Wang, and L. Lin (2018) Flow guided recurrent neural encoder for video salient object detection. In CVPR, Cited by: §2.2.
  • [30] S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C.-C. Jay Kuo (2018) Instance embedding transfer to unsupervised video object segmentation. In CVPR, Cited by: §2.2.
  • [31] S. Li, B. Seybold, A. Vorobyov, X. Lei, and C.-C. Jay Kuo (2018) Unsupervised video object segmentation with motion-based bilateral networks. In ECCV, Cited by: §1, §1, §2.2.
  • [32] W. Li, O. H. Jafari, and C. Rother (2018) Deep object co-segmentation. In ACCV, Cited by: §2.3, 1st item, §4.2.1, §4.2.2, §4.2.2, Table 4, Table 5.
  • [33] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2016) Gated graph sequence neural networks. In ICLR, Cited by: §2.1.
  • [34] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §2.2, §4.2.2, §4.2.2, Table 4.
  • [35] J. Lu, J. Yang, D. Batra, and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In NIPS, Cited by: §3.2.
  • [36] X. Lu, C. Ma, B. Ni, X. Yang, I. Reid, and M. Yang (2018) Deep regression tracking with shrinkage loss. In ECCV, Cited by: §2.2.
  • [37] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli (2019) See more, know more: unsupervised video object segmentation with co-attention siamese networks. In CVPR, Cited by: §2.2.
  • [38] J. Luiten, P. Voigtlaender, and B. Leibe (2018) PReMVOS: proposal-generation, refinement and merging for video object segmentation. In ACCV, Cited by: §4.1.1.
  • [39] L. Mukherjee, V. Singh, and C. R. Dyer (2009) Half-integrality based algorithms for cosegmentation of images. In CVPR, Cited by: §2.3.
  • [40] M. Niepert, M. Ahmed, and K. Kutzkov (2016) Learning convolutional neural networks for graphs. In ICML, Cited by: §2.1.
  • [41] P. Ochs and T. Brox (2011) Object segmentation in video: A hierarchical variational approach for turning point trajectories into dense regions. In ICCV, Cited by: §1, §2.2, Table 1.
  • [42] P. Ochs, J. Malik, and T. Brox (2014) Segmentation of moving objects by long term video analysis. IEEE TPAMI 36 (6), pp. 1187–1200. Cited by: §2.2.
  • [43] A. Papazoglou and V. Ferrari (2013) Fast object segmentation in unconstrained video. In ICCV, Cited by: §1, §2.2, Table 1, Table 2.
  • [44] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung (2017) Learning video object segmentation from static images. In CVPR, Cited by: §4.1.1.
  • [45] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, Cited by: §1, Table 1, 1st item, §4.1.2, §4.3, Table 6.
  • [46] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool (2017) The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675. Cited by: §1, 3rd item, Table 3.
  • [47] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari (2012) Learning object class detectors from weakly annotated video. In CVPR, Cited by: §1, Table 2, 2nd item.
  • [48] S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu (2018) Learning human-object interactions by graph parsing neural networks. In ECCV, Cited by: §2.1.
  • [49] R. Quan, J. Han, D. Zhang, and F. Nie (2016) Object co-segmentation via graph optimized-flexible manifold ranking. In CVPR, Cited by: §2.3, 2nd item, §4.2.1, §4.2.2, Table 4, Table 5.
  • [50] C. Rother, T. Minka, A. Blake, and V. Kolmogorov (2006) Cosegmentation of image pairs by histogram matching-incorporating a global constraint into MRFs. In CVPR, Cited by: §2.3.
  • [51] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu (2013) Unsupervised joint object discovery and segmentation in internet images. In CVPR, Cited by: §1, §2.3, Figure 5, 2nd item, Table 5.
  • [52] J. C. Rubio, J. Serrat, A. Lopez, and N. Paragios (2012) Unsupervised co-segmentation through region matching. In CVPR, Cited by: §2.3.
  • [53] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009) The graph neural network model. IEEE TNNLS 20 (1), pp. 61–80. Cited by: §2.1, §3.1.
  • [54] M. Siam, C. Jiang, S. Lu, L. Petrich, M. Gamal, M. Elhoseiny, and M. Jagersand (2019) Video segmentation using teacher-student adaptation in a human robot interaction (hri) setting. In ICRA, Cited by: Table 1.
  • [55] H. Song, W. Wang, S. Zhao, J. Shen, and K. Lam (2018) Pyramid dilated deeper convlstm for video salient object detection. In ECCV, Cited by: §1, §2.2, §3.3, Table 1, Table 2, §4.1.1, §4.1.2.
  • [56] Z. Tao, H. Liu, H. Fu, and Y. Fu (2017) Image cosegmentation via saliency-guided constrained clustering with cosine similarity. In AAAI, Cited by: §2.3.
  • [57] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning motion patterns in videos. In CVPR, Cited by: §1, §1, §3.3, Table 1.
  • [58] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning video object segmentation with visual memory. In ICCV, Cited by: §1, §1, §2.2, §3.3, Table 1, Table 2.
  • [59] Y. Tsai, M. Yang, and M. J. Black (2016) Video segmentation via object flow. In CVPR, Cited by: §2.2.
  • [60] Y. Tsai, G. Zhong, and M. Yang (2016) Semantic co-segmentation in videos. In ECCV, Cited by: Table 2.
  • [61] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §3.2.
  • [62] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. In ICLR, Cited by: §2.1, §3.2.
  • [63] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques, and X. Giro-i-Nieto (2019) Rvos: end-to-end recurrent network for video object segmentation. In CVPR, Cited by: §1, §4.1.2, Table 3.
  • [64] S. Vicente, V. Kolmogorov, and C. Rother (2010) Cosegmentation revisited: models and optimization. In ECCV, Cited by: §2.3.
  • [65] S. Vicente, C. Rother, and V. Kolmogorov (2011) Object cosegmentation. In CVPR, Cited by: §2.3, §4.2.1.
  • [66] W. Wang, J. Shen, F. Porikli, and R. Yang (2018) Semi-supervised video object segmentation with super-trajectories. IEEE TPAMI 41 (4), pp. 985–998. Cited by: §2.2.
  • [67] W. Wang, J. Shen, and F. Porikli (2015) Saliency-aware geodesic video object segmentation. In CVPR, Cited by: §1, §2.2.
  • [68] W. Wang and J. Shen (2016) Higher-order image co-segmentation. IEEE TMM 18 (6), pp. 1011–1021. Cited by: §2.3.
  • [69] W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S. C. Hoi, and H. Ling (2019) Learning unsupervised video object segmentation through visual attention. In CVPR, Cited by: Table 1, Table 2, §4.1.2.
  • [70] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, Cited by: §3.2.
  • [71] X. Wang and A. Gupta (2018) Videos as space-time region graphs. In ECCV, Cited by: §2.1.
  • [72] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. Yang (2013) Saliency detection via graph-based manifold ranking. In CVPR, Cited by: §4.1.1.
  • [73] Z. Yuan, T. Lu, and Y. Wu (2017) Deep-dense conditional random fields for object co-segmentation.. In IJCAI, Cited by: Table 5.
  • [74] D. Zhang, O. Javed, and M. Shah (2013) Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. In CVPR, Cited by: §1, §2.2.
  • [75] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2019) Self-attention generative adversarial networks. In ICML, Cited by: §3.2.
  • [76] Z. Zheng, W. Wang, S. Qi, and S. Zhu (2019) Reasoning visual dialogs with structural and partial observations. In CVPR, Cited by: §2.1.
  • [77] Z. Wang, J. Xu, L. Liu, F. Zhu, and L. Shao (2019) RANet: ranking attention network for fast video object segmentation. In ICCV, Cited by: §2.2.