Learning Video Object Segmentation from Unlabeled Videos

03/10/2020 ∙ by Xiankai Lu, et al. ∙ 0

We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos, unlike most existing methods which rely heavily on extensive annotated data. We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures intrinsic properties of VOS at multiple granularities. Our approach can help advance understanding of visual patterns in VOS and significantly reduce annotation burden. With a carefully-designed architecture and strong representation learning ability, our learned model can be applied to diverse VOS settings, including object-level zero-shot VOS, instance-level zero-shot VOS, and one-shot VOS. Experiments demonstrate promising performance in these settings, as well as the potential of MuG in leveraging unlabeled data to further improve the segmentation accuracy.





Learning Video Object Segmentation from Unlabeled Videos (CVPR2020)


1 Introduction

Video object segmentation (VOS) has two common settings, zero-shot and one-shot. Zero-shot VOS (Z-VOS) aims to automatically segment the primary foreground objects without any test-time human supervision, whereas one-shot VOS (O-VOS) focuses on extracting human-specified foreground objects, typically assuming that first-frame annotations are given ahead of inference. (Some conventions [36, 60] also use ‘unsupervised VOS’ and ‘semi-supervised VOS’ to name the Z-VOS and O-VOS settings [3]. In this work, for notational clarity, the terms ‘supervised’, ‘weakly supervised’ and ‘unsupervised’ are used only to refer to the different learning paradigms.) Current leading methods for both Z-VOS and O-VOS are supervised deep learning models that require large amounts of elaborately annotated data to improve performance and avoid over-fitting. However, obtaining pixel-wise segmentation labels is labor-intensive and expensive (Fig. 1 (a)).

It is thus attractive to design VOS models that can learn from unlabeled videos. With this aim in mind, we develop a unified, unsupervised/weakly supervised VOS method that mines multi-granularity cues to facilitate video object pattern learning (Fig. 1 (b)). This allows us to take advantage of nearly infinite amounts of video data. Below we give a more formal description of our problem setup and main idea.

Figure 1: (a) Current leading VOS methods are learned in a supervised manner, requiring large-scale elaborately labeled data. (b) Our model, MuG, provides an unsupervised/weakly-supervised framework that learns video object patterns from unlabeled videos. (c) Once trained, MuG can be applied to diverse VOS settings, with strong modeling ability and high generalizability.

Problem Setup and Main Idea. Let $\mathcal{X}$ and $\mathcal{Y}$ denote the input video space and output VOS space, respectively. Deep learning based VOS solutions seek to learn a differentiable, ideal video-to-segment mapping $g^*: \mathcal{X} \to \mathcal{Y}$. To approximate $g^*$, recent leading VOS models typically work in a supervised learning manner, requiring $N$ input samples $\{x_n\}_n$ and their desired outputs $y_n = g^*(x_n)$, where $\{(x_n, y_n)\}_n \subset \mathcal{X} \times \mathcal{Y}$. In contrast, we address the problem in settings with much less supervision: (1) the unsupervised case, when we only have samples drawn from $\mathcal{X}$, $\{x_n\}_n \subset \mathcal{X}$, and want to approximate $g^*$, and (2) the weakly supervised learning setting, in which we have annotations for $\mathcal{K}$, a related output domain for which obtaining annotations is easier than for $\mathcal{Y}$, and we approximate $g^*$ using samples from $\mathcal{X} \times \mathcal{K}$.

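The standard way of evaluating learning outcomes follows an empirical risk/loss minimization formulation [43]:

$$\tilde{g} \in \arg\min_{g \in \mathcal{G}} \frac{1}{N} \sum_{n} \varepsilon\big(g(x_n), z(x_n)\big), \qquad (1)$$

where $\mathcal{G}$ denotes the hypothesis (solution) space, and $\varepsilon$ is an error function that evaluates the estimate $g(x_n)$ for an input video $x_n$ against VOS-related prior knowledge $z(x_n) \in \mathcal{Z}$. To make $\tilde{g}$ a good approximation of the ideal mapping $g^*$, current supervised VOS methods directly use the desired output $y_n$, i.e., $z(x_n) = g^*(x_n)$, as the prior knowledge, at the price of vast amounts of well-annotated data.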

In our method, the prior knowledge $\mathcal{Z}$, in the unsupervised learning setting, is built upon several heuristics and intrinsic properties of VOS itself, while in the weakly supervised learning setting, it additionally considers a related, easily-annotated output domain $\mathcal{K}$. For example, part of the fore-background knowledge could come from a saliency model [70] (Fig. 1 (b)), or take the form of CAM maps [73, 76] from a pre-trained image classifier [14] (i.e., a related image classification domain $\mathcal{K}$). (Note that any unsupervised or weakly supervised object segmentation/saliency model can be used; saliency [70] and CAM [73, 76] are chosen only because of their popularity and relatively high performance.) Exploring VOS in an unsupervised or weakly supervised setting is appealing not only because it alleviates the annotation burden of the VOS output space $\mathcal{Y}$, but also because it inspires an in-depth understanding of the nature of VOS by exploring the prior knowledge $\mathcal{Z}$. Specifically, we analyze several different types of cues at multiple granularities, which are crucial for video object pattern modeling:

  • At the frame granularity, we leverage information from an unsupervised saliency method [70] or CAM [73, 76] activation maps to enhance the foreground and background discriminability of our intra-frame representation.

  • At the short-term granularity, we impose the local consistency within the representations of short video clips, to describe the continuous and coherent visual patterns within a few seconds.

  • At the long-range granularity, we address semantic correspondence among distant frames, which makes the cross-frame representations robust to local occlusions, appearance variations and shape deformations.

  • At the whole-video granularity, we encourage the video representation to capture global and compact video content, by learning to aggregate multi-frame information and be discriminative to other videos’ representations.

All these constraints are formulated under a unified, multi-granularity VOS (MuG) framework, which is fully differentiable and allows unsupervised/weakly supervised video object pattern learning, from unlabeled videos. Our extensive experiments over various VOS settings, i.e., object-level Z-VOS, instance-level Z-VOS, and O-VOS, show that MuG outperforms other unsupervised and weakly supervised methods by a large margin, and continuously improves its performance with more unlabeled data.

2 Related Work

2.1 Video Object Segmentation

Z-VOS. As there is no indication of which objects are to be segmented, conventional Z-VOS methods resorted to certain heuristics, such as saliency [60, 62, 61, 7], object proposals [19, 37, 24], and discriminative motion patterns [31, 10, 33]. Recent advances have been driven by deep learning techniques. Various effective networks have been explored, from early, relatively simple architectures, such as recurrent networks [45, 32, 63] and two-stream networks [6, 49, 77], to recent, more powerful designs, such as teacher-student adaptation [44], neural co-attention [26] and graph neural networks [58, 68].

Figure 3: Left: Main idea of short-term granularity analysis. Right: Training details for intra-clip coherence modeling.

O-VOS. As the annotations for the first frame are assumed available at the test phase, O-VOS focuses on how to accurately propagate the initial labels to subsequent frames. Traditional methods typically used optical flow based propagation strategies [29, 9, 59, 28]. Deep learning based solutions have since become mainstream and can be broadly classified into three categories: online learning, propagation, and matching based methods. Online learning based methods [3, 55, 35] fine-tune the segmentation network for each test video on the first-frame annotations. Propagation based methods [18, 67, 71] rely on the segments of the previous frames and work in a frame-by-frame manner. Matching based methods [66, 54, 27] segment each frame according to its correspondence/matching relation to the first frame.

Typically, current deep learning based VOS solutions, under either the Z-VOS or the O-VOS setting, are trained on a large amount of elaborately annotated data in a supervised manner. In contrast, the proposed method trains a VOS network from scratch using unlabeled videos. This is essential for understanding how visual recognition works in VOS and for reducing the annotation budget.

2.2 VOS with Unlabeled Training Videos

Learning VOS from unlabeled videos is an essential, yet rarely explored avenue. Among the few efforts, [34] represents an early attempt in this direction, which uses a modified, purely unsupervised version of [7] to generate proxy masks as pseudo annotations. In a similar spirit, some methods use heuristic segmentation masks [17] or weakly supervised location maps [23] as supervisory signals. Taking a broader view, some works [47, 11, 74] capitalized on untrimmed videos tagged with semantic labels. In addition to requiring more annotation effort, these methods struggle to handle a class-agnostic VOS setting. Recently, self-supervised video learning has been applied to O-VOS [56, 65], constraining the learned features to capture certain forms of local coherence, such as cross-frame color consistency [56] and temporal cycle-correspondence [65].

Our method is distinctive in two aspects. First, it explores various intrinsic properties of videos as well as class-agnostic fore-background knowledge in a unified, multi-granularity framework, bringing a more comprehensive understanding of visual patterns in VOS. Second, it shows strong video object representation learning ability and, for the first time, is applied to diverse VOS settings after being trained only once. This gives a new glimpse into the connections between the two most influential VOS settings.

3 Proposed Algorithm

3.1 Multi-Granularity VOS Network

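For a training video $\mathcal{I}$ containing $T$ frames, $\mathcal{I} = \{I_t\}_{t=1}^{T}$, its features are specified as $\{\mathbf{x}_t\}_{t=1}^{T}$, obtained from a fully convolutional feature extractor $\varphi$: $\mathbf{x}_t = \varphi(I_t) \in \mathbb{R}^{W \times H \times C}$. Four-granularity characteristics are explored to guide the learning of $\varphi$ (Fig. 2), described as follows.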

Frame Granularity Analysis: Fore-background Knowledge Understanding. As the feature extractor $\varphi$ should be VOS-aware, basic fore-background knowledge is desired to be encoded. In our method, such knowledge (Fig. 1 (b)) initially comes from a background prior based saliency model [70] (in the unsupervised learning setting), or takes the form of CAM maps [73, 76] (in the weakly supervised learning setting).

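Formally, for each frame $I_t$, let us denote its corresponding initial fore-background mask as $Q_t \in \{0,1\}^{W \times H}$ (i.e., a binarized saliency or CAM activation map). In our frame granularity analysis, the learning of the feature extractor $\varphi$ is guided by the supervision signals of $\{Q_t\}_t$, i.e., utilizing the intra-frame information $\mathbf{x}_t = \varphi(I_t)$ to regress $Q_t$:

$$\mathcal{L}_{\text{frame}} = \mathcal{L}_{\text{CE}}(P_t, Q_t). \qquad (2)$$

Here $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss, and $P_t = \rho(\mathbf{x}_t)$, where $\rho$ maps the input single-frame feature $\mathbf{x}_t$ into a fore-background prediction map $P_t \in [0,1]^{W \times H}$. $\rho$ is implemented by a convolutional layer with sigmoid activation.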
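As a minimal sketch of the frame-granularity supervision described above (not the authors' implementation), the snippet below computes a mean per-pixel binary cross-entropy between a sigmoid prediction map and a binarized saliency/CAM mask. The function name `binary_cross_entropy`, the clamping constant `eps`, and the toy 2x2 maps are assumptions made for illustration.

```python
import math

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean per-pixel binary cross-entropy between two equally sized 2-D maps."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, q in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
            total -= q * math.log(p) + (1.0 - q) * math.log(1.0 - p)
            count += 1
    return total / count

# Toy example: Q plays the role of a binarized saliency/CAM mask,
# P_good / P_bad play the role of sigmoid outputs of the prediction layer.
Q = [[1, 0], [0, 1]]
P_good = [[0.9, 0.1], [0.2, 0.8]]
P_bad = [[0.1, 0.9], [0.8, 0.2]]
loss_good = binary_cross_entropy(P_good, Q)
loss_bad = binary_cross_entropy(P_bad, Q)
```

A prediction that agrees with the mask yields the lower loss, which is the signal that pushes the features towards fore-background discriminability.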

Figure 2: Overview of our approach. Intrinsic properties over frame, short-term, long-term and whole video granularities are explored to guide the video object pattern learning.

Short-Term Granularity Analysis: Intra-Clip Coherence Modeling. Short-term coherence is an essential property in videos, as temporally close frames typically exhibit continuous visual content changes [15]. To capture this property, we apply a forward-backward patch tracking mechanism [57]. It learns by tracking a sampled patch forwards in a few successive frames and then backwards until the start frame, and penalizing the distance between the initial and final backwards tracked positions of that patch.

Formally, given two consecutive frames $I_t$ and $I_{t+1}$, we first crop a patch $p$ from $I_t$ and apply the feature extractor $\varphi$ on $p$ and $I_{t+1}$, separately. Then we get two feature embeddings: $\varphi(p)$ and $\mathbf{x}_{t+1} = \varphi(I_{t+1})$. With a design similar to the classic Siamese tracker [2], we forward track the patch $p$ on the next frame $I_{t+1}$ by conducting a cross-correlation operation ‘$\star$’ on $\varphi(p)$ and $\mathbf{x}_{t+1}$:

$$S_{t+1} = \varphi(p) \star \mathbf{x}_{t+1}, \qquad (3)$$

where $S_{t+1}$ is a sigmoid-normalized response map whose size is rescaled to $W \times H$. The new location of $p$ in $I_{t+1}$ is then inferred according to the peak value on $S_{t+1}$. After obtaining the forward tracked patch $p'$ in $I_{t+1}$, we backward track $p'$ to $I_t$ and get a backward tracking response map $S_t$:

$$S_t = \varphi(p') \star \mathbf{x}_t. \qquad (4)$$

Ideally, the peak of $S_t$ should correspond to the location of $p$ in the initial frame $I_t$. Thus we build a consistency loss that measures the alignment error between the initial and forward-backward tracked positions of $p$:

$$\mathcal{L}_{\text{short}} = \lVert S_t - G_p \rVert_2^2, \qquad (5)$$

where $G_p$ is a 2-dimensional Gaussian-shape map with the same center as $p$ and variance proportional to the size of $p$. As in [57], the above forward-backward tracking mechanism is extended to a multi-frame setting (Fig. 3). Specifically, after obtaining the forward tracked patch $p'$ in $I_{t+1}$, $p'$ is further tracked to the next frame $I_{t+2}$, and a new tracked patch $p''$ is obtained. Then $p''$ is reversely tracked to $I_{t+1}$ and further to the initial frame $I_t$, and the local consistency loss in Eq. 5 is computed. Moreover, during training, we first randomly sample a short video clip consisting of six successive frames. Then we apply the above forward-backward tracking based learning strategy over three frames randomly drawn from the six-frame video clip. With the above designs, $\varphi$ captures the spatiotemporally local correspondence and is content-discriminative (due to its cross-frame target re-identification nature).
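The forward-backward tracking idea can be illustrated with a toy two-frame sketch on single-channel maps. This is an assumption-laden illustration, not the paper's network: raw pixel values stand in for learned features, the response is normalized by its maximum instead of a sigmoid, and the names `cross_correlate`, `peak_location`, `forward_backward_loss`, and the default `sigma` are hypothetical.

```python
import math

def cross_correlate(feat, kernel):
    """Valid-mode 2-D cross-correlation of a single-channel map with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(feat[y + dy][x + dx] * kernel[dy][dx]
                 for dy in range(kh) for dx in range(kw))
             for x in range(len(feat[0]) - kw + 1)]
            for y in range(len(feat) - kh + 1)]

def peak_location(resp):
    """(row, col) of the largest response value."""
    _, y, x = max((v, y, x) for y, row in enumerate(resp) for x, v in enumerate(row))
    return y, x

def crop(feat, top, left, size):
    return [row[left:left + size] for row in feat[top:top + size]]

def forward_backward_loss(feat_a, feat_b, top, left, size, sigma=1.0):
    """Track a patch from frame A to frame B and back; compare with a Gaussian target."""
    patch = crop(feat_a, top, left, size)
    fwd_top, fwd_left = peak_location(cross_correlate(feat_b, patch))  # forward step
    patch_fwd = crop(feat_b, fwd_top, fwd_left, size)
    back = cross_correlate(feat_a, patch_fwd)                          # backward step
    top_val = max(max(row) for row in back) or 1.0
    back = [[v / top_val for v in row] for row in back]  # crude normalization (assumption)
    loss = 0.0
    for y, row in enumerate(back):
        for x, v in enumerate(row):
            # Gaussian target centred on the patch's original top-left position.
            g = math.exp(-((y - top) ** 2 + (x - left) ** 2) / (2 * sigma ** 2))
            loss += (v - g) ** 2
    return loss / (len(back) * len(back[0])), peak_location(back)

def blob_map(top, left):
    """5x5 map that is zero except for a small 2x2 'object'."""
    feat = [[0.0] * 5 for _ in range(5)]
    for dy, row in enumerate([[1.0, 2.0], [3.0, 4.0]]):
        for dx, v in enumerate(row):
            feat[top + dy][left + dx] = v
    return feat

frame_a = blob_map(1, 1)  # object at (1, 1)
frame_b = blob_map(2, 2)  # same object shifted by one pixel
loss, back_peak = forward_backward_loss(frame_a, frame_b, top=1, left=1, size=2)
```

When tracking succeeds, the backward peak returns to the start position and the loss is small; when the object is lost in the second frame, the backward response no longer matches the Gaussian target and the loss grows.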

Figure 4: Illustration of our long-term granularity analysis.

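Long-Term Granularity Analysis: Cross-Frame Semantic Matching. In addition to the local consistency among adjacent frames, there also exist strong semantic correlations among distant frames, as frames from the same video typically contain similar content [30, 69]. Capturing this property is essential for the feature extractor $\varphi$, as it makes $\varphi$ robust to many challenges, such as appearance variations, shape deformations, and object occlusions. To address this issue, we conduct a long-term granularity analysis, which casts cross-frame correspondence learning as a dual-frame semantic matching problem (Fig. 4). Specifically, given a training pair of two disordered frames $(I_i, I_j)$ randomly sampled from the video $\mathcal{I}$, we compute a similarity affinity $A_{i,j}$ between their embeddings $(\mathbf{x}_i, \mathbf{x}_j)$ by a co-attention operation [52]:

$$A_{i,j} = \text{softmax}(\mathbf{x}_i^{\top} \mathbf{x}_j) \in [0,1]^{WH \times WH}, \qquad (6)$$

where $\mathbf{x}_i \in \mathbb{R}^{C \times WH}$ and $\mathbf{x}_j \in \mathbb{R}^{C \times WH}$ are flat matrix formats of the two embeddings, and ‘softmax’ indicates column-wise softmax normalization. Given the normalized cross-correlation $A_{i,j}$, in line with [41], we use a small neural network to regress the parameters of a geometric transformation $T_{i,j}$ with six degrees of freedom (translation, rotation and scale). $T_{i,j}$ gives the relations between the spatial coordinates in $\mathbf{x}_i$ and $\mathbf{x}_j$ considering the corresponding semantic similarity:

$$m_j = T_{i,j}(m_i), \qquad (7)$$

where $m_i$ is a 2-D spatial coordinate of $\mathbf{x}_i$, and $m_j$ is the corresponding sampling coordinate in $\mathbf{x}_j$. Using $T_{i,j}$, we can warp $\mathbf{x}_i$ to $\mathbf{x}_j$. Similarly, we can also compute $T_{j,i}$, i.e., a 2-D warping from $\mathbf{x}_j$ to $\mathbf{x}_i$. Considering two sampling coordinates $m$ and $n$ in $\mathbf{x}_i$ and $\mathbf{x}_j$, respectively, we introduce a semantic matching loss [41]:

$$\mathcal{L}_{\text{long}} = -\sum_{m \in \Omega} \sum_{n \in \Omega} A_{i,j}(m, n)\, \iota(m, n), \qquad (8)$$

where $\Omega$ refers to the image lattice, $A_{i,j}(m, n)$ gives the similarity value between the positions $m$ and $n$ in $\mathbf{x}_i$ and $\mathbf{x}_j$, and $\iota(m, n) \in \{0, 1\}$ determines if the correspondence between $m$ and $n$ is geometrically consistent. If $n$ lies close to the warped position $T_{i,j}(m)$, $\iota(m, n) = 1$; otherwise $\iota(m, n) = 0$.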
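The semantic matching idea (a softmax-normalized affinity, scored only on geometrically consistent position pairs) can be sketched as follows. This is an illustrative toy under stated assumptions: the transformation is passed in as a plain function instead of being regressed by a network, and `affinity`, `matching_loss`, and the tolerance `tol` are hypothetical names and values.

```python
import math

def affinity(feats_i, feats_j):
    """A[m][n]: column-wise softmax (over m) of the dot products feats_i[m] . feats_j[n]."""
    raw = [[sum(a * b for a, b in zip(fi, fj)) for fj in feats_j] for fi in feats_i]
    A = [[0.0] * len(feats_j) for _ in feats_i]
    for n in range(len(feats_j)):
        col = [raw[m][n] for m in range(len(feats_i))]
        mx = max(col)
        exps = [math.exp(v - mx) for v in col]  # subtract max for numerical stability
        z = sum(exps)
        for m, e in enumerate(exps):
            A[m][n] = e / z
    return A

def matching_loss(A, coords, warp, tol=1.0):
    """Negative sum of affinities over pairs (m, n) where n is near warp(coords[m])."""
    loss = 0.0
    for m, cm in enumerate(coords):
        wy, wx = warp(cm)
        for n, cn in enumerate(coords):
            if math.hypot(cn[0] - wy, cn[1] - wx) <= tol:
                loss -= A[m][n]
    return loss

# Toy 1x3 lattice: frame j shows the content of frame i shifted right by one position.
coords = [(0, 0), (0, 1), (0, 2)]
feats_i = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
feats_j = [feats_i[2], feats_i[0], feats_i[1]]
A = affinity(feats_i, feats_j)
loss_shift = matching_loss(A, coords, lambda c: (c[0], c[1] + 1), tol=0.5)
loss_identity = matching_loss(A, coords, lambda c: c, tol=0.5)
```

The warp that agrees with the true shift collects high-affinity pairs and therefore gets the lower (more negative) loss, which is what drives both the features and the transformation estimate.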

Video Granularity Analysis: Global and Discriminative Representation Learning. So far, we have used pairwise cross-frame information in local and long terms to boost the learning of the feature extractor $\varphi$. $\varphi$ is also desired to learn a compact and globally discriminative video representation. To achieve this, with a global information aggregation module, we perform a video granularity analysis within an unsupervised video embedding learning framework [1], which leverages supervision signals from different videos.

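Starting with our global information aggregation module, we split the video $\mathcal{I}$ into $K$ segments of equal duration: $\mathcal{I} = \bigcup_{k=1}^{K} \mathcal{I}^k$. For each segment $\mathcal{I}^k$, we randomly sample a single frame, resulting in a $K$-frame abstract $\mathcal{I}' = \{I'_k\}_{k=1}^{K}$ of $\mathcal{I}$. $\mathcal{I}'$ reduces the redundancy among successive frames while preserving global information.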
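A small sketch of the segment-based frame sampling described above, assuming frames are addressed by integer indices; the function name `sample_abstract` and its arguments are hypothetical.

```python
import random

def sample_abstract(num_frames, num_segments, rng=random):
    """Pick one random frame index from each of `num_segments` near-equal segments."""
    if num_segments < 1 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    # Segment k covers the half-open index range [bounds[k], bounds[k + 1]).
    bounds = [round(k * num_frames / num_segments) for k in range(num_segments + 1)]
    return [rng.randrange(bounds[k], bounds[k + 1]) for k in range(num_segments)]

# Example: a 100-frame video summarized by 4 frames, one per 25-frame segment.
abstract = sample_abstract(100, 4, random.Random(0))
```

Sampling one frame per segment keeps the abstract spread over the whole video, while the randomness exposes different frame combinations across training steps.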

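In a similar spirit to key-value retrieval networks [46], for each frame $I'_k$ of the abstract, we set it as a query and the remaining frames $\{I'_{k'}\}_{k' \neq k}$ as reference. Then we compute the normalized cross-correlation between the query and reference:

$$A_k = \text{softmax}\big(\mathbf{x}_k^{\top} [\mathbf{x}_{k'}]_{k' \neq k}\big), \qquad (9)$$

where $A_k \in [0,1]^{WH \times (K-1)WH}$, and ‘[]’ denotes the concatenation operation. $\mathbf{x}_k \in \mathbb{R}^{C \times WH}$ and $[\mathbf{x}_{k'}]_{k' \neq k} \in \mathbb{R}^{C \times (K-1)WH}$ are flat feature matrices of the query and reference, respectively. Subsequently, $A_k$ is used as a weight matrix for global information summarization:

$$\hat{\mathbf{x}}_k = [\mathbf{x}_{k'}]_{k' \neq k}\, A_k^{\top} \in \mathbb{R}^{C \times WH}. \qquad (10)$$

Our global information aggregation module gathers information from the reference set by a correlation-based feature summarization procedure. For the query frame $I'_k$, we obtain its global information augmented representation by:

$$\mathbf{r}_k = [\mathbf{x}_k, \hat{\mathbf{x}}_k]. \qquad (11)$$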

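During training, the video granularity analysis essentially discriminates between a set of surrogate video classes [1]. Specifically, given $N$ training videos, we randomly sample a single frame from each video, leading to $N$ training instances: $\{I_n\}_{n=1}^{N}$. The core idea is that, for a query frame $I^q_n$ in the $n$-th video, its global feature embedding is close to the instance $I_n$ from the same $n$-th video, and far from other unrelated instances $\{I_{n'}\}_{n' \neq n}$ (from the other $N-1$ videos). We solve this as a binary classification problem via maximum likelihood estimation (MLE). In particular, the query $I^q_n$ should be classified into instance $n$, while other instances $I_{n'}$ ($n' \neq n$) should not be. The probability of $I^q_n$ being recognized as instance $n$ is:

$$P(n \mid I^q_n) = \frac{\exp\big(\text{GAP}(\mathbf{r}_n)^{\top} \text{GAP}(\mathbf{r}^q_n)\big)}{\sum_{i=1}^{N} \exp\big(\text{GAP}(\mathbf{r}_i)^{\top} \text{GAP}(\mathbf{r}^q_n)\big)}, \qquad (12)$$

where ‘GAP’ stands for global average pooling, and $\mathbf{r}_n$ and $\mathbf{r}^q_n$ are the global information augmented representations of $I_n$ and $I^q_n$. Similarly, the probability of another instance $I_{n'}$ being recognized as instance $n$ is:

$$P(n \mid I_{n'}) = \frac{\exp\big(\text{GAP}(\mathbf{r}_n)^{\top} \text{GAP}(\mathbf{r}_{n'})\big)}{\sum_{i=1}^{N} \exp\big(\text{GAP}(\mathbf{r}_i)^{\top} \text{GAP}(\mathbf{r}_{n'})\big)}. \qquad (13)$$

Correspondingly, the probability of $I_{n'}$ not being recognized as instance $n$ is $1 - P(n \mid I_{n'})$. The joint probability of $I^q_n$ being recognized as instance $n$ and $I_{n'}$ ($n' \neq n$) not being recognized as instance $n$ is $P(n \mid I^q_n) \prod_{n' \neq n} \big(1 - P(n \mid I_{n'})\big)$, under the assumption that different instances being recognized as $n$ are independent.

Then the loss function is defined as the negative log likelihood over the $N$ query frames from the $N$ videos:

$$\mathcal{L}_{\text{video}} = -\sum_{n} \log P(n \mid I^q_n) - \sum_{n} \sum_{n' \neq n} \log\big(1 - P(n \mid I_{n'})\big). \qquad (14)$$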
Next we describe the network architecture during the training and inference phases. An appealing advantage of our multi-granularity VOS network is that, after being trained in a unified mode, it can be directly applied to both Z-VOS and O-VOS settings with only slight adaptation.
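To make the video-granularity objective described above more concrete, the following sketch computes an instance-discrimination loss of the same flavor: each query should be recognized as the instance from its own video, while instances from other videos should not be. It is a simplified illustration under stated assumptions: plain vectors stand in for globally pooled embeddings, no temperature is used, and `softmax_prob`, `video_loss`, and the clamping constant are hypothetical.

```python
import math

def softmax_prob(target, given, bank):
    """P(target | given): softmax over the dot products of `given` with each bank entry."""
    logits = [sum(a * b for a, b in zip(f, given)) for f in bank]
    mx = max(logits)
    exps = [math.exp(l - mx) for l in logits]  # subtract max for numerical stability
    return exps[target] / sum(exps)

def video_loss(instances, queries):
    """Negative log likelihood: query n is instance n, other instances are not."""
    loss = 0.0
    for n, query in enumerate(queries):
        loss -= math.log(softmax_prob(n, query, instances))
        for m, other in enumerate(instances):
            if m != n:
                p = softmax_prob(n, other, instances)
                loss -= math.log(max(1.0 - p, 1e-12))  # clamp to keep log() finite
    return loss

# Two videos with well-separated instance embeddings.
instances = [[3.0, 0.0], [0.0, 3.0]]
loss_matched = video_loss(instances, [[3.0, 0.0], [0.0, 3.0]])  # queries match their video
loss_swapped = video_loss(instances, [[0.0, 3.0], [3.0, 0.0]])  # queries look like the other video
```

Queries that resemble their own video's instance give a loss close to zero, whereas swapped queries are heavily penalized.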

3.2 One Training Phase for both Z-VOS and O-VOS

Network Architecture. Our whole model is end-to-end trainable. The video representation space is learned by a fully convolutional network, whose design is inspired by ResNet-50 [13]. In particular, the first four groups of convolutional layers in ResNet are preserved, and dilated convolutional layers [72] are used to maintain enough spatial detail while ensuring a large receptive field, resulting in a 512-channel feature representation x whose spatial dimensions are reduced relative to those of the input video frame.

During training, we use mini-batches of videos and rescale all training frames to a fixed resolution. For frame granularity analysis, all frames receive the supervision signal from the loss in Eq. 2.

For short-term granularity analysis, six successive video frames are first randomly sampled from each training video, resulting in a six-frame video clip. For each video clip, we further sample three video frames in order and randomly crop a patch as the tracking target. With the feature embedding of the patch, we forward-backward track it and get its final backward tracking response map via Eq. 4. For computing the loss in Eq. 5, the Gaussian-shape map is obtained by convolving the center position of the patch with a two-dimensional Gaussian map whose kernel width is proportional (0.1) to the patch size.

For long-term granularity analysis, after randomly sampling two disordered frames from a training video, we compute the correlation map by the normalized inner product operation in Eq. 6. The geometric transformation parameter estimator is implemented with two convolutional layers and one linear layer, as in [41]. Then the semantic matching loss in Eq. 8 is computed.

For video granularity analysis, we split each training video into segments, and get the global information augmented representation for each query frame by Eq. 11. Then, we compute the soft-max embedding learning loss using Eq. 14, which leverages supervision signals from the training videos.

Iterative Training by Bootstrapping. As seen in Fig. 1 (b), the fore-background knowledge from the saliency [70] or CAM [73, 76] is ambiguous and noisy. Inspired by Bootstrapping [40], we apply an iterative training strategy: after training with the initial fore-background maps, we use our trained model to re-label the training data. With each iteration, the learner bootstraps itself by mining better fore-background knowledge, which in turn leads to a better model. Specifically, for each training frame $I_t$, given the initial fore-background mask $Q_t$ and the prediction $P^{(i-1)}_t$ of the model from the previous, $(i-1)$-th training iteration, the loss in Eq. 2 in the $i$-th iteration is formulated in a bootstrapping format:

$$\mathcal{L}^{(i)}_{\text{frame}} = \mathcal{L}_{\text{CE}}\big(P^{(i)}_t,\ \tilde{Q}^{(i)}_t\big), \quad \tilde{Q}^{(i)}_t(m) = \beta\, Q_t(m) + (1-\beta)\, P^{(i-1)}_t(m), \qquad (15)$$

where $\beta \in [0,1]$ and $Q_t(m)$ gives the value at position $m$. In such a design, the ‘confident’ fore-background knowledge is generated as a convex combination of the initial fore-background information $Q_t$ and the model prediction $P^{(i-1)}_t$.
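The convex combination described above can be sketched in a few lines. The mixing weight name `beta` and its value in the example are assumptions for illustration only; the weight actually used by the authors is not recoverable from this text.

```python
def bootstrap_target(initial_mask, prev_pred, beta):
    """Per-pixel convex combination of the initial mask and the previous model prediction."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1] for a convex combination")
    return [[beta * q + (1.0 - beta) * p for q, p in zip(q_row, p_row)]
            for q_row, p_row in zip(initial_mask, prev_pred)]

# Toy example: a noisy initial mask is softened by the model's own prediction.
initial_mask = [[1, 0]]
prev_pred = [[0.5, 0.5]]
target = bootstrap_target(initial_mask, prev_pred, beta=0.8)
```

The resulting soft target then replaces the binary mask in the frame-level cross-entropy loss for the next training iteration.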

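In the $i$-th training iteration, the overall loss to optimize the whole network parameters is the combination of the losses in Eqs. 15, 5, 8 and 14:

$$\mathcal{L}^{(i)} = \lambda_1 \mathcal{L}^{(i)}_{\text{frame}} + \lambda_2 \mathcal{L}_{\text{short}} + \lambda_3 \mathcal{L}_{\text{long}} + \lambda_4 \mathcal{L}_{\text{video}}, \qquad (16)$$

where the $\lambda$s are weighting coefficients.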

The above designs enable a unified un-/weakly supervised feature learning framework. Once the model is trained, the learned representations can be used for Z-VOS and O-VOS, with slight modifications. In practice, we find that our model can perform well after being trained with 2 iterations; please see §4.2 for related experiments.

3.3 Inference Modes for Z-VOS and O-VOS

Now we detail our inference modes for object-level Z-VOS, instance-level Z-VOS, and O-VOS settings.

Object-Level Z-VOS Setting. For each test frame, object-level Z-VOS aims to predict a binary segmentation mask in which the primary foreground objects are separated from the background, while the identities of different foreground objects are not distinguished. In this classic VOS setting, since there is no test-time human intervention, how to discover the primary video objects is the central problem. Considering that objects of interest frequently appear throughout the video sequence, we read out the segmentation results from the global information augmented feature r, instead of directly using intra-frame information to predict the fore-background mask (i.e., the per-frame prediction of Eq. 2). This is achieved by an extra segmentation readout layer, which takes the global frame embedding r as input and produces the final object-level segmentation prediction. This readout layer is also trained with the cross-entropy loss, as in Eq. 15. For notational clarity, we omit this term in the overall training loss in Eq. 16. Please note that the readout layer is only used in the Z-VOS setting; for the O-VOS setting, the segmentation masks are generated with a different strategy.

Instance-Level Z-VOS Setting. Our model can also be adapted for the instance-level Z-VOS setting, in which different object instances must be discriminated, in addition to separating the primary video objects from the background without test-time human supervision. For each test frame, we first apply Mask R-CNN [12] to produce a set of category-agnostic object proposals. Then we apply our trained model to produce a binary foreground-background mask per frame. After combining the object bounding-box proposals with the binary object-level segmentation masks, we can filter out the background proposals and obtain pixel-wise, instance-level object candidates for each frame. Finally, to link those object candidates across different frames, similar to [27], we use overlap ratio and optical flow as the cross-frame candidate-association metric. Note that Mask R-CNN can be replaced with the non-learning Edgebox [78] and GrabCut, resulting in a purely unsupervised/weakly-supervised protocol.
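One plausible way to combine box proposals with a binary foreground mask is to keep only the proposals that are mostly covered by foreground. The sketch below illustrates this idea; the filtering rule, the function name `filter_proposals`, and the threshold `min_fg_ratio` are assumptions, since the text does not specify the exact criterion.

```python
def filter_proposals(boxes, fg_mask, min_fg_ratio=0.5):
    """Keep boxes (top, left, bottom, right) whose area is mostly foreground."""
    kept = []
    for top, left, bottom, right in boxes:
        area = (bottom - top) * (right - left)
        if area <= 0:
            continue  # skip degenerate boxes
        fg = sum(fg_mask[y][x] for y in range(top, bottom) for x in range(left, right))
        if fg / area >= min_fg_ratio:
            kept.append((top, left, bottom, right))
    return kept

# 4x4 foreground mask with a single object in the top-left corner.
fg_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
boxes = [(0, 0, 2, 2), (2, 2, 4, 4), (0, 0, 4, 4)]
kept = filter_proposals(boxes, fg_mask)
```

In this toy case only the first box survives: the second covers pure background and the third is dominated by background.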

O-VOS Setting. In O-VOS, for each test video sequence, instance-level annotations of multiple general foreground objects in the first frame are given. In such a setting, our trained network works in a per-frame matching based mask propagation fashion. Concretely, assume there are a total of $L$ object instances (including the background) in the first-frame annotation; each spatial position $n$ is then associated with a one-hot class vector $\mathbf{y}_n \in \{0,1\}^{L}$, whose element $y^{l}_n$ indicates whether pixel $n$ belongs to the $l$-th object instance. Starting from the second frame, we use both the last segmented frame $I_{t-1}$ and the current unsegmented frame $I_t$ to build an input pair for our model. Then we compute their similarity affinity in the feature space: $A_{t-1,t} = \text{softmax}(\mathbf{x}_{t-1}^{\top} \mathbf{x}_t)$. After that, for each pixel $m$ in $I_t$, we compute its probability distribution $\hat{\mathbf{y}}_m$ over the $L$ object instances as:

$$\hat{y}^{l}_m = \sum_{n} A_{t-1,t}(n, m)\, y^{l}_n, \qquad (17)$$

where $A_{t-1,t}(n, m)$ is the affinity value between pixel $n$ in $I_{t-1}$ and pixel $m$ in $I_t$. Pixel $m$ is assigned to the $l^*$-th instance, where $l^* = \arg\max_{l} \hat{y}^{l}_m$. Then we get its label vector $\mathbf{y}_m$. In this way, from the segmented frame $I_t$, we move to the next input frame pair $(I_t, I_{t+1})$ and get the segmentation result for $I_{t+1}$. As our method does not use any first-frame fine-tuning [6, 35] or online learning [55] technique, it is fast at inference.
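The affinity-based label propagation described above can be sketched as follows: each pixel of the current frame accumulates affinity mass per instance from the labeled pixels of the previous frame and takes the best-scoring instance. Pixels are flattened to a 1-D index, and the name `propagate_labels` and the toy affinity values are assumptions for illustration.

```python
def propagate_labels(affinity, prev_labels, num_instances):
    """affinity[n][m]: similarity of pixel n (previous frame) to pixel m (current frame)."""
    num_current = len(affinity[0])
    labels = []
    for m in range(num_current):
        scores = [0.0] * num_instances
        for n, label in enumerate(prev_labels):
            scores[label] += affinity[n][m]  # accumulate affinity mass per instance
        labels.append(max(range(num_instances), key=scores.__getitem__))
    return labels

# Two labeled pixels in the previous frame (instances 1 and 0), three pixels in the current one.
affinity = [
    [0.1, 0.7, 0.2],
    [0.9, 0.3, 0.8],
]
prev_labels = [1, 0]
current_labels = propagate_labels(affinity, prev_labels, num_instances=2)
```

Repeating this step frame by frame, with the freshly labeled frame serving as the next reference, yields the per-frame propagation scheme without any online fine-tuning.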


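Aspects | Module | Unsupervised: mean J | Unsupervised: Δ mean J | Weakly-supervised: mean J | Weakly-supervised: Δ mean J
Reference | Full model (2 iterations) | 58.0 | - | 61.2 | -
Initial Fore-/Background Heuristic | Saliency [70] | 37.2 | -20.8 | - | -
Initial Fore-/Background Heuristic | CAM [73] | - | - | 45.3 | -15.9
Multi-Granularity Analysis | w/o. Frame Granularity | 40.2 | -17.8 | 40.2 | -21.0
Multi-Granularity Analysis | w/o. Short-term Granularity | 51.3 | -6.7 | 57.1 | -4.1
Multi-Granularity Analysis | w/o. Long-term Granularity | 52.8 | -5.2 | 56.0 | -5.2
Multi-Granularity Analysis | w/o. Video Granularity | 56.4 | -1.6 | 60.4 | -0.8
Iterative Training via Bootstrapping | 1 iteration | 50.8 | -7.2 | 54.9 | -6.3
Iterative Training via Bootstrapping | 3 iterations | 58.0 | 0.0 | 61.2 | 0.0
Iterative Training via Bootstrapping | 4 iterations | 58.0 | 0.0 | 61.2 | 0.0
More Data | + LaSOT dataset [8] | 59.5 | +1.5 | 62.3 | +1.1
Post-Process | w/o. CRF | 55.3 | -2.7 | 58.7 | -2.5
Table 1: Ablation study on DAVIS [36] val set, under the object-level Z-VOS setting. Please see §4.2 for details.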

4 Experiment

4.1 Common Setup

Implementation Details. We train the whole network from scratch on the OxUvA [51] tracking dataset, as in [22]. OxUvA comprises 366 video sequences with more than 1.5 million frames in total. We train our model with SGD optimizer. For our bootstrapping based iterative training, two iterations are used and each takes about 8 hours.


Method | Supervision | J Mean | J Recall | J Decay | F Mean | F Recall | F Decay | T Mean
TRC [10] | Non-learning | 47.3 | 49.3 | 8.3 | 44.1 | 43.6 | 12.9 | 39.1
CVOS [48] | Non-learning | 48.2 | 54.0 | 10.5 | 44.7 | 52.6 | 11.7 | 25.0
KEY [24] | Non-learning | 49.8 | 59.1 | 14.1 | 42.7 | 37.5 | 10.6 | 26.9
MSG [31] | Non-learning | 53.3 | 61.6 | 2.4 | 50.8 | 60.0 | 5.1 | 30.1
NLC [7] | Non-learning | 55.1 | 55.8 | 12.6 | 52.3 | 51.9 | 11.4 | 42.5
FST [33] | Non-learning | 55.8 | 64.7 | 0.0 | 51.1 | 51.6 | 2.9 | 36.6
Motion Masks [34] | Unsupervised learning | 48.9 | 44.7 | 19.2 | 39.1 | 28.6 | 17.9 | 36.4
TSN [17] | Unsupervised learning | 31.2 | 18.7 | -0.4 | 18.4 | 5.6 | 1.9 | 37.5
Ours | Unsupervised learning | 58.0 | 65.3 | 2.0 | 51.5 | 53.2 | 2.1 | 30.1
COSEG [50] | Weakly-supervised | 52.8 | 50.0 | 10.7 | 49.3 | 52.7 | 10.5 | 28.2
Ours | Weakly-supervised | 61.2 | 65.9 | 11.6 | 56.1 | 54.6 | 20.3 | 23.6
Table 2: Evaluation of object-level Z-VOS on DAVIS val set [36] (§4.3), with region similarity J, boundary accuracy F and time stability T. (The best scores in each supervision setting are marked in bold. These notes also apply to the other tables.)


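Method | Supervision | J Mean
CRANE [47] | Non-learning | 23.9
NLC [7] | Non-learning | 27.7
FST [33] | Non-learning | 53.8
ARP [19] | Non-learning | 46.2
Motion Masks [34] | Unsupervised learning | 32.1
TSN [17] | Unsupervised learning | 52.2
Ours | Unsupervised learning | 57.7
SOSD [75] | Weakly-supervised learning | 54.1
BBF [42] | Weakly-supervised learning | 53.3
COSEG [50] | Weakly-supervised learning | 58.1
Ours | Weakly-supervised learning | 62.4
Table 3: Evaluation of object-level Z-VOS on Youtube-Objects [39] (§4.3), with mean J. See the supplementary for more details.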

Configuration and Reproducibility. MuG is implemented in PyTorch. All experiments are conducted on an Nvidia TITAN Xp GPU and an Intel(R) Xeon E5 CPU. All our implementations, trained models, and segmentation results will be released to provide the full details of our approach.

4.2 Diagnostic Experiments

A series of ablation studies are performed for assessing the effectiveness of each essential component of MuG.

Initial Fore-Background Knowledge. Baselines Heuristic Saliency and CAM give the scores of the initial fore-background knowledge, based on their CRF-binarized outputs. As seen in Table 1, starting from this low-quality initial knowledge, MuG gains large performance improvements (mean J rises from 37.2 to 58.0 in the unsupervised setting and from 45.3 to 61.2 in the weakly supervised setting, i.e., gains of 20.8 and 15.9 points), showing the significance of our multi-granularity video object pattern learning scheme.

Multi-Granularity Analysis. Next we investigate the contributions of multi-granularity cues in depth. As shown in Table 1, the intrinsic, multi-granularity properties are indeed meaningful, as disabling any granularity analysis component causes performance to erode. For instance, removing the frame granularity analysis during learning hurts performance (mean J: 58.0 to 40.2 in the unsupervised setting, 61.2 to 40.2 in the weakly supervised setting), due to the lack of fore-/background information. Similarly, performance drops when excluding short- or long-term granularity analysis, suggesting the importance of capturing local consistency and semantic correspondence. Moreover, considering video granularity information also improves the final performance, demonstrating the value of comprehensive video content understanding in video object pattern modeling.

Iterative Training Strategy. From Table 1, we can see that with more iterations of our bootstrapping training strategy (from 1 to 2 iterations), better performance can be obtained. However, further iterations (3 or 4) bring no additional gain in Table 1. We thus use two iterations in all the experiments.

More Training Data. To show the potential of our unsupervised/weakly supervised VOS learning scheme, we probe the upper bound by training on additional videos. With more training data (1400 videos) from the LaSOT dataset [8], performance gains can be observed in both settings (+1.5 and +1.1 in Table 1).

4.3 Performance for Object-Level Z-VOS

Datasets. Experiments are conducted on two famous Z-VOS datasets: DAVIS [36] and Youtube-Objects [39], which have pixel-wise, object-level annotations. DAVIS has 50 videos (3,455 frames), covering a wide range of challenges, such as fast motion, occlusion, dynamic background, etc. It is split into a train set (30 videos) and a val set (20 videos). Youtube-Objects contains 126 video sequences that belong to 10 categories (such as cat, dog, etc.) and has 25,673 frames in total. The val set of DAVIS and whole Youtube-Objects are used for evaluation.

Evaluation Criteria. For fair comparison, we follow the official evaluation protocols of each dataset. For DAVIS, we report region similarity $\mathcal{J}$, boundary accuracy $\mathcal{F}$ and time stability $\mathcal{T}$. For Youtube-Objects, the performance is evaluated in terms of region similarity $\mathcal{J}$.
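For readers unfamiliar with the region similarity measure, it is the intersection-over-union (Jaccard index) between a predicted mask and the ground-truth mask. The sketch below is a plain illustration rather than the official evaluation code; the function name and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def region_similarity(pred, gt):
    """Intersection-over-union of two equally sized binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 1.0  # both masks empty: treated as a perfect match

pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [0, 0]]
score = region_similarity(pred_mask, gt_mask)
```

Per-sequence scores are typically averaged over frames, and the mean over sequences gives the "Mean" entries reported in the tables.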

Post-processing. Following the common protocol in this area [49, 45, 6], the final segmentation results are optimized by CRF [21] (about 0.3s per frame).

Quantitative Results. Table 2 presents the comparison with several non-learning, unsupervised, and weakly supervised learning competitors on the DAVIS dataset. In particular, MuG exceeds current leading unsupervised learning-based methods (i.e., Motion Masks [34] and TSN [17]) by large margins (58.0 vs 48.9 and 58.0 vs 31.2). MuG also outperforms the classical weakly-supervised Z-VOS method COSEG [50] and all the previous heuristic methods. Table 3 summarizes the comparison on the Youtube-Objects dataset, again showing our superior performance in both unsupervised and weakly supervised learning settings.

Runtime Comparison. The inference time of MuG is about 0.6s per frame, which is faster than most deep learning based competitors (e.g., Motion Masks [34] (1.1s), TSN [17] (0.9s)). This is because, apart from CRF [21], MuG uses no other pre-/post-processing step (e.g., superpixel [50], optical flow [33], etc.) and no online fine-tuning [19].


Method | Supervision | J&F Mean | J Mean | J Recall | J Decay | F Mean | F Recall | F Decay
AGS [63] | Fully supervised | 45.6 | 42.1 | 48.5 | 2.6 | 49.0 | 51.5 | 2.6
PDB [45] | Fully supervised | 40.4 | 37.7 | 42.6 | 4.0 | 43.0 | 44.6 | 3.7
RVOS [53] | Fully supervised | 22.5 | 17.7 | 16.2 | 1.6 | 27.3 | 24.8 | 1.8
Ours* | Unsupervised | 36.5 | 33.8 | 38.2 | 2.1 | 38.0 | 38.6 | 3.2
Ours | Unsupervised | 37.3 | 35.0 | 39.3 | 3.8 | 39.6 | 41.1 | 4.6
Ours* | Weakly-supervised | 40.6 | 37.7 | 42.5 | 1.9 | 43.5 | 44.9 | 1.0
Ours | Weakly-supervised | 41.7 | 38.9 | 44.3 | 2.7 | 44.5 | 46.6 | 1.7
Table 4: Evaluation of instance-level Z-VOS on DAVIS test-dev set [4] (§4.4). * denotes the purely unsupervised/weakly-supervised protocol with non-learning Edgebox [78] and GrabCut.


Method | Supervision | J Mean | J Recall | J Decay | F Mean | F Recall | F Decay | T Mean
HVS [29] | Non-learning | 54.6 | 61.4 | 23.6 | 52.9 | 61.0 | 22.7 | 36.0
JMP [9] | Non-learning | 57.0 | 62.6 | 39.4 | 53.1 | 54.2 | 38.4 | 15.9
FCP [37] | Non-learning | 58.4 | 71.5 | -2.0 | 49.2 | 49.5 | -1.1 | 30.6
SIFT Flow [25] | Non-learning | 51.1 | 58.6 | 18.8 | 44.0 | 50.3 | 20.0 | 16.4
BVS [28] | Non-learning | 60.0 | 66.9 | 28.9 | 58.8 | 67.9 | 21.3 | 34.7
Vondrick et al. [56] | Unsupervised learning | 38.9 | 37.1 | 22.4 | 30.8 | 21.7 | 16.7 | 45.9
mgPFF [20] | Unsupervised learning | 40.5 | 34.9 | 18.8 | 34.0 | 24.2 | 13.8 | 53.1
TimeCycle [65] | Unsupervised learning | 55.8 | 64.9 | 0.0 | 51.1 | 51.6 | 2.9 | 36.6
CorrFlow [22] | Unsupervised learning | 48.9 | 44.7 | 19.2 | 39.1 | 28.6 | 17.9 | 36.4
Ours | Unsupervised learning | 63.1 | 71.9 | 28.1 | 61.8 | 64.2 | 30.5 | 43.0
FlowNet2 [16] | Weakly-supervised | 41.6 | 45.7 | 19.9 | 40.1 | 38.3 | 26.6 | 29.8
Ours | Weakly-supervised | 65.7 | 77.6 | 26.4 | 63.5 | 67.7 | 27.2 | 44.4
Table 5: Evaluation of O-VOS on DAVIS val set [36] (§4.5), with region similarity J, boundary accuracy F and time stability T.


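| Metric | SIFT Flow [25] | BVS [28] | DeepCluster [5] | Transitive Inv [64] | Vondrick et al. [56] | mgPFF [20] | TimeCycle [65] | CorrFlow [22] | Ours | FlowNet2 [16] | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Supervision | Non-learning | Non-learning | Unsupervised | Unsupervised | Unsupervised | Unsupervised | Unsupervised | Unsupervised | Unsupervised | Weakly supervised | Weakly supervised |
| J&F Mean | 34.0 | 37.3 | 35.4 | 29.4 | 34.0 | 44.6 | 42.8 | 50.3 | 54.3 | 26.0 | 56.1 |
| J Mean | 33.0 | 32.9 | 37.5 | 32.0 | 34.6 | 42.2 | 43.0 | 48.4 | 52.6 | 26.7 | 54.0 |
| J Recall | - | 31.8 | - | - | 34.1 | 41.8 | 43.7 | 53.2 | 57.4 | 23.9 | 60.7 |
| F Mean | 35.0 | 41.7 | 33.2 | 26.8 | 32.7 | 46.9 | 42.6 | 52.2 | 56.1 | 25.2 | 58.2 |
| F Recall | - | 41.4 | - | - | 26.8 | 44.4 | 41.3 | 56.0 | 58.1 | 24.6 | 62.2 |

Table 6: Evaluation of O-VOS on DAVIS17 val set [38] (§4.5), with region similarity J, boundary accuracy F and their average J&F.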

4.4 Performance for Instance-Level Z-VOS

Datasets. We evaluate instance-level Z-VOS on the DAVIS17 [4] dataset, which has 120 videos and 8,502 frames in total. It has three subsets, namely, train, val, and test-dev, containing 60, 30, and 30 video sequences, respectively. We use the ground-truth masks provided by the newest DAVIS challenge [4], as the original annotations are biased towards the O-VOS scenario.

Evaluation Criteria. Three standard evaluation metrics provided by the DAVIS benchmark are used, i.e., region similarity J, boundary accuracy F and their average J&F.

Quantitative Results. Three top-performing Z-VOS methods from the DAVIS benchmark are included. As shown in Table 4, our model achieves performance comparable to the fully supervised methods (i.e., AGS [63] and PDB [45]). Notably, it significantly outperforms the recent RVOS [53] (mean J&F: 37.3 and 41.7 vs 22.5 in the unsupervised and weakly supervised learning settings, respectively).
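The margin over RVOS [53] can be read directly off the first metric row of Table 4 (the mean J&F, i.e., the average of the J and F means). The short sketch below only redoes that subtraction; the variable names are illustrative, and the numbers are copied from the table.

```python
# Mean J&F values copied from the first metric row of Table 4.
rvos = 22.5
ours = {"unsupervised": 37.3, "weakly supervised": 41.7}

# Absolute gain of MuG over RVOS in each learning setting, in points.
gains = {setting: round(score - rvos, 1) for setting, score in ours.items()}
print(gains)  # {'unsupervised': 14.8, 'weakly supervised': 19.2}
```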

Runtime Comparison. The processing time for each frame is about 0.7s which is comparable to AGS [63] and PDB [45], and slightly slower than RVOS [53] (0.3s).

4.5 Performance for O-VOS

Datasets. The DAVIS16 [36] and DAVIS17 [38] datasets are used for performance evaluation under the O-VOS setting.

Evaluation Criteria. Three standard evaluation criteria are reported: region similarity J, boundary accuracy F and their average J&F. For the DAVIS16 dataset, we further report the time stability T.
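The Mean, Recall and Decay rows in the result tables summarize per-frame scores. The text does not define them, so the following is a minimal sketch under the convention commonly used with the DAVIS benchmark: recall is the fraction of frames scoring above 0.5, and decay is the mean over the first quarter of the frames minus the mean over the last quarter. The function names, the default threshold and the equal-size binning are assumptions for illustration and may differ in detail from the official evaluation code.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize(scores: Sequence[float], threshold: float = 0.5,
              bins: int = 4) -> Dict[str, float]:
    """Mean, recall and decay of per-frame scores in [0, 1]."""
    n = len(scores)
    if n < bins:
        raise ValueError("need at least one frame per temporal bin")
    # Split the sequence into `bins` contiguous, near-equal chunks.
    edges = [(i * n) // bins for i in range(bins + 1)]
    chunks = [scores[edges[i]:edges[i + 1]] for i in range(bins)]
    return {
        "mean": mean(scores),
        # Fraction of frames whose score exceeds the threshold.
        "recall": sum(s > threshold for s in scores) / n,
        # Positive decay means quality drops towards the end of the video.
        "decay": mean(chunks[0]) - mean(chunks[-1]),
    }


def j_and_f(j_mean: float, f_mean: float) -> float:
    """J&F: the plain average of the mean J and mean F scores."""
    return (j_mean + f_mean) / 2.0


# Hypothetical per-frame J values for one sequence.
frame_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
stats = summarize(frame_scores)
print(stats)
```

Under this reading, a decay near zero (or negative) indicates that segmentation quality does not deteriorate over the course of a sequence.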

Quantitative Results. Table 5 and Table 6 give evaluation results on DAVIS16 and DAVIS17, respectively. Table 5 shows that our unsupervised method exceeds representative self-supervised methods (i.e., TimeCycle [65] and CorrFlow [22]) and the best non-learning method (i.e., BVS [28]) across most metrics. In particular, with the learned CAM as supervision, our weakly supervised method further improves the performance, e.g., a mean J of 65.7. Table 6 again verifies that our method performs favorably against the current best unsupervised method, CorrFlow, in terms of mean J&F (54.3 vs 50.3). Note that CorrFlow and our method use the same training data. This demonstrates that MuG is able to learn more powerful video object patterns than previous self-learning counterparts.

Figure 5: Visual results on three videos (top: blackswan, middle: tram, bottom: scooter-black) under object-level Z-VOS, instance-level Z-VOS and O-VOS setting, respectively (see §4.6). For scooter-black, its first-frame annotation is also depicted.

Runtime Comparison. In the O-VOS setting, MuG runs at about 0.4s per frame. This is faster than matching-based methods (e.g., SIFT Flow [25] (5.1s) and mgPFF [20] (1.3s)), and compares favorably with self-supervised learning methods, e.g., TimeCycle [65] and CorrFlow [22].

4.6 Qualitative Results

Fig. 5 presents some visual results for object-level Z-VOS (top row), instance-level Z-VOS (middle row) and O-VOS (bottom row). For blackswan in DAVIS16 [36], the primary object undergoes view changes and background clutter, but MuG still generates accurate foreground segments. The effectiveness of instance-level Z-VOS can be observed in tram of DAVIS17 [4]. In addition, MuG produces high-quality results from the given first-frame annotations in the O-VOS setting (see the last row for scooter-black in DAVIS17 [38]), although the different instances suffer from fast motion and scale variation. More results can be found in the supplementary materials.

5 Conclusion

We proposed MuG, an end-to-end trainable, unsupervised/weakly supervised learning approach for segmenting objects from videos. Unlike current popular supervised VOS solutions, which require extensive amounts of elaborately annotated training samples, MuG models video object patterns by comprehensively exploring supervision signals from different granularities of unlabeled videos. Our model sets a new state of the art among unsupervised and weakly supervised methods over diverse VOS settings, including object-level Z-VOS, instance-level Z-VOS, and O-VOS. It opens up the possibility of learning VOS from a nearly unlimited amount of unlabeled videos and of unifying different VOS settings from a single view of video object pattern understanding.


  • [1] D. Alexey, S. J. Tobias, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, Cited by: §3.1, §3.1.
  • [2] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr (2016) Fully-convolutional siamese networks for object tracking. In ECCV, Cited by: §3.1.
  • [3] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixe, D. Cremers, and L. Van Gool (2017) One-shot video object segmentation. In CVPR, Cited by: §2.1, footnote 1.
  • [4] S. Caelles, J. Pont-Tuset, F. Perazzi, A. Montes, K. Maninis, and L. Van Gool (2019) The 2019 DAVIS challenge on VOS: unsupervised multi-object segmentation. arXiv:1905.00737. Cited by: §4.4, §4.6, Table 4.
  • [5] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In ECCV, Cited by: Table 6.
  • [6] J. Cheng, Y. Tsai, S. Wang, and M. Yang (2017) Segflow: joint learning for video object segmentation and optical flow. In ICCV, Cited by: §2.1, §3.3, §4.3.
  • [7] A. Faktor and M. Irani (2014) Video segmentation by non-local consensus voting. In BMVC, Cited by: §2.1, §2.2, Table 2, Table 3.
  • [8] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling (2019) LaSOT: a high-quality benchmark for large-scale single object tracking. In CVPR, Cited by: Table 1, §4.2.
  • [9] Q. Fan, F. Zhong, D. Lischinski, D. Cohen-Or, and B. Chen (2015) JumpCut: non-successive mask transfer and interpolation for video cutout. ACM Trans. Graph. 34 (6), pp. 195:1–195:10. Cited by: §2.1, Table 5.
  • [10] K. Fragkiadaki, G. Zhang, and J. Shi (2012) Video segmentation by tracing discontinuities in a trajectory embedding. In CVPR, Cited by: §2.1, Table 2.
  • [11] G. Hartmann, M. Grundmann, J. Hoffman, D. Tsai, V. Kwatra, O. Madani, S. Vijayanarasimhan, I. Essa, J. Rehg, and R. Sukthankar (2012) Weakly supervised learning of object segmentations from web-scale video. In ECCV, Cited by: §2.2.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In ICCV, Cited by: §3.3.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.2.
  • [14] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, Cited by: §1.
  • [15] J. Hurri and A. Hyvärinen (2003) Simple-cell-like receptive fields maximize temporal coherence in natural video. Neural Computation 15 (3), pp. 663–691. Cited by: §3.1.
  • [16] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In CVPR, Cited by: Table 5, Table 6.
  • [17] C. Ioana, B. Simion-Vlad, and L. Marius (2017) Unsupervised learning from video to detect foreground objects in single images. In ICCV, Cited by: §2.2, §4.3, §4.3, Table 2, Table 3.
  • [18] V. Jampani, R. Gadde, and P. V. Gehler (2017) Video propagation networks. In CVPR, Cited by: §2.1.
  • [19] Y. J. Koh and C. Kim (2017) Primary object segmentation in videos based on region augmentation and reduction. In CVPR, Cited by: §2.1, §4.3, Table 3.
  • [20] S. Kong and C. Fowlkes (2019) Multigrid predictive filter flow for unsupervised learning on videos. arXiv preprint arXiv:1904.01693. Cited by: §4.5, Table 5, Table 6.
  • [21] P. Krähenbühl and V. Koltun (2011) Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, Cited by: §4.3, §4.3.
  • [22] Z. Lai and W. Xie (2019) Self-supervised learning for video correspondence flow. In BMVC, Cited by: §4.1, §4.5, Table 5, Table 6.
  • [23] J. Lee, E. Kim, S. Lee, J. Lee, and S. Yoon (2019) Frame-to-frame aggregation of active regions in web videos for weakly supervised semantic segmentation. In ICCV, Cited by: §2.2.
  • [24] Y. J. Lee, J. Kim, and K. Grauman (2011) Key-segments for video object segmentation. In ICCV, Cited by: §2.1, Table 2.
  • [25] C. Liu, J. Yuen, and A. Torralba (2010) SIFT flow: dense correspondence across scenes and its applications. IEEE TPAMI 33 (5), pp. 978–994. Cited by: §4.5, Table 5, Table 6.
  • [26] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli (2019) See more, know more: unsupervised video object segmentation with co-attention siamese networks. In CVPR, Cited by: §2.1.
  • [27] J. Luiten, P. Voigtlaender, and B. Leibe (2018) PReMVOS: proposal-generation, refinement and merging for video object segmentation. In ACCV, Cited by: §2.1, §3.3.
  • [28] N. Marki, F. Perazzi, O. Wang, and A. Sorkine-Hornung (2016) Bilateral space video segmentation. In CVPR, Cited by: §2.1, §4.5, Table 5, Table 6.
  • [29] G. Matthias, K. Vivek, H. Mei, and E. Irfan (2010) Efficient hierarchical graph-based video segmentation. In CVPR, Cited by: §2.1, Table 5.
  • [30] H. Mobahi, C. Ronan, and W. Jason (2009) Deep learning from temporal coherence in video. In ICML, Cited by: §3.1.
  • [31] P. Ochs and T. Brox (2011) Object segmentation in video: A hierarchical variational approach for turning point trajectories into dense regions. In ICCV, Cited by: §2.1, Table 2.
  • [32] B. Pang, K. Zha, H. Cao, C. Shi, and C. Lu (2019) Deep RNN framework for visual sequential applications. In CVPR, Cited by: §2.1.
  • [33] A. Papazoglou and V. Ferrari (2013) Fast object segmentation in unconstrained video. In ICCV, Cited by: §2.1, §4.3, Table 2, Table 3.
  • [34] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan (2017) Learning features by watching objects move. In CVPR, Cited by: §2.2, §4.3, §4.3, Table 2, Table 3.
  • [35] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung (2017) Learning video object segmentation from static images. In CVPR, Cited by: §2.1, §3.3.
  • [36] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, Cited by: Table 1, §4.3, §4.5, §4.6, Table 2, Table 5, footnote 1.
  • [37] F. Perazzi, O. Wang, M. H. Gross, and A. Sorkine-Hornung (2015) Fully connected object proposals for video segmentation. In ICCV, Cited by: §2.1, Table 5.
  • [38] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool (2017) The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675. Cited by: §4.5, §4.6, Table 6.
  • [39] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari (2012) Learning object class detectors from weakly annotated video. In CVPR, Cited by: §4.3, Table 3.
  • [40] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich (2014) Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596. Cited by: §3.2.
  • [41] I. Rocco, R. Arandjelović, and J. Sivic (2018) End-to-end weakly-supervised semantic alignment. In CVPR, Cited by: §3.1, §3.2.
  • [42] F. Sadat Saleh, M. Sadegh Aliakbarian, M. Salzmann, L. Petersson, and J. M. Alvarez (2017) Bringing background into the foreground: making all classes equal in weakly-supervised video semantic segmentation. In ICCV, Cited by: Table 3.
  • [43] H. Shen (2018) Towards a mathematical understanding of the difficulty in learning with feedforward neural networks. In CVPR, Cited by: §1.
  • [44] M. Siam, C. Jiang, S. Lu, L. Petrich, M. Gamal, M. Elhoseiny, and M. Jagersand (2019) Video segmentation using teacher-student adaptation in a human robot interaction (HRI) setting. In ICRA, Cited by: §2.1.
  • [45] H. Song, W. Wang, S. Zhao, J. Shen, and K. Lam (2018) Pyramid dilated deeper ConvLSTM for video salient object detection. In ECCV, Cited by: §2.1, §4.3, §4.4, §4.4, Table 4.
  • [46] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus (2015) End-to-end memory networks. In NIPS, Cited by: §3.1.
  • [47] K. Tang, R. Sukthankar, J. Yagnik, and L. Fei-Fei (2013) Discriminative segment annotation in weakly labeled video. In CVPR, Cited by: §2.2, Table 3.
  • [48] B. Taylor, V. Karasev, and S. Soatto (2015) Causal video object segmentation from persistence of occlusions. In CVPR, Cited by: Table 2.
  • [49] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning video object segmentation with visual memory. In ICCV, Cited by: §2.1, §4.3.
  • [50] Y. Tsai, G. Zhong, and M. Yang (2016) Semantic co-segmentation in videos. In ECCV, Cited by: §4.3, §4.3, Table 2, Table 3.
  • [51] J. Valmadre, L. Bertinetto, J. F. Henriques, R. Tao, A. Vedaldi, A. W. Smeulders, P. H. Torr, and E. Gavves (2018) Long-term tracking in the wild: a benchmark. In ECCV, Cited by: §4.1.
  • [52] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §3.1.
  • [53] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques, and X. Giro-i-Nieto (2019) RVOS: end-to-end recurrent network for video object segmentation. In CVPR, Cited by: §4.4, §4.4, Table 4.
  • [54] P. Voigtlaender, Y. Chai, F. Schroff, H. Adam, B. Leibe, and L. Chen (2019) FEELVOS: fast end-to-end embedding learning for video object segmentation. In CVPR, Cited by: §2.1.
  • [55] P. Voigtlaender and B. Leibe (2017) Online adaptation of convolutional neural networks for video object segmentation. In BMVC, Cited by: §2.1, §3.3.
  • [56] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy (2018) Tracking emerges by colorizing videos. In ECCV, Cited by: §2.2, Table 5, Table 6.
  • [57] N. Wang, Y. Song, C. Ma, W. Zhou, W. Liu, and H. Li (2019) Unsupervised deep tracking. In CVPR, Cited by: §3.1, §3.1.
  • [58] W. Wang, X. Lu, J. Shen, D. J. Crandall, and L. Shao (2019) Zero-shot video object segmentation via attentive graph neural networks. In ICCV, Cited by: §2.1.
  • [59] W. Wang, J. Shen, F. Porikli, and R. Yang (2018) Semi-supervised video object segmentation with super-trajectories. IEEE TPAMI 41 (4), pp. 985–998. Cited by: §2.1.
  • [60] W. Wang, J. Shen, and F. Porikli (2015) Saliency-aware geodesic video object segmentation. In CVPR, Cited by: §2.1, footnote 1.
  • [61] W. Wang, J. Shen, and L. Shao (2015) Consistent video saliency using local gradient flow optimization and global refinement. IEEE TIP 24 (11), pp. 4185–4196. Cited by: §2.1.
  • [62] W. Wang, J. Shen, and L. Shao (2017) Video salient object detection via fully convolutional networks. IEEE TIP 27 (1), pp. 38–49. Cited by: §2.1.
  • [63] W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S. C. Hoi, and H. Ling (2019) Learning unsupervised video object segmentation through visual attention. In CVPR, Cited by: §2.1, §4.4, §4.4, Table 4.
  • [64] X. Wang, K. He, and A. Gupta (2017) Transitive invariance for self-supervised visual representation learning. In CVPR, Cited by: Table 6.
  • [65] X. Wang, A. Jabri, and A. A. Efros (2019) Learning correspondence from the cycle-consistency of time. In CVPR, Cited by: §2.2, §4.5, §4.5, Table 5, Table 6.
  • [66] Z. Wang, J. Xu, L. Liu, F. Zhu, and L. Shao (2019) RANet: ranking attention network for fast video object segmentation. In ICCV, Cited by: §2.1.
  • [67] S. Wug Oh, J. Lee, K. Sunkavalli, and S. Joo Kim (2018) Fast video object segmentation by reference-guided mask propagation. In CVPR, Cited by: §2.1.
  • [68] Y. Yan, Q. Zhang, B. Ni, W. Zhang, M. Xu, and X. Yang (2019) Learning context graph for person search. In CVPR, Cited by: §2.1.
  • [69] Y. Yan, N. Zhuang, B. Ni, J. Zhang, M. Xu, Q. Zhang, Z. Zheng, S. Cheng, Q. Tian, X. Yang, W. Zhang, et al. (2019) Fine-grained video captioning via graph-based multi-granularity interaction learning. IEEE TPAMI. Cited by: §3.1.
  • [70] C. Yang, L. Zhang, H. Lu, X. Ruan, and M. Yang (2013) Saliency detection via graph-based manifold ranking. In CVPR, Cited by: 1st item, §1, §3.1, §3.2, Table 1, footnote 2.
  • [71] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos (2018) Efficient video object segmentation via network modulation. In CVPR, Cited by: §2.1.
  • [72] F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. In ICLR, Cited by: §3.2.
  • [73] Y. Zeng, Y. Zhuge, H. Lu, L. Zhang, M. Qian, and Y. Yu (2019) Multi-source weak supervision for saliency detection. In CVPR, Cited by: 1st item, §1, §3.1, §3.2, Table 1, footnote 2.
  • [74] D. Zhang, L. Yang, D. Meng, D. Xu, and J. Han (2017) SPFTN: a self-paced fine-tuning network for segmenting objects in weakly labelled videos. In CVPR, Cited by: §2.2.
  • [75] Y. Zhang, X. Chen, J. Li, C. Wang, and C. Xia (2015) Semantic object segmentation via detection in weakly labeled video. In CVPR, Cited by: Table 3.
  • [76] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In CVPR, Cited by: 1st item, §1, §3.1, §3.2, footnote 2.
  • [77] T. Zhou, S. Wang, Y. Zhou, Y. Yao, J. Li, and L. Shao (2020) Motion-attentive transition for zero-shot video object segmentation. In AAAI, Cited by: §2.1.
  • [78] C. L. Zitnick and P. Dollár (2014) Edge boxes: locating object proposals from edges. In ECCV, Cited by: §3.3, Table 4.