Weakly Supervised Instance Segmentation for Videos with Temporal Mask Consistency

03/23/2021 ∙ by Qing Liu, et al. ∙ Facebook, Johns Hopkins University

Weakly supervised instance segmentation reduces the cost of annotations required to train models. However, existing approaches which rely only on image-level class labels predominantly suffer from errors due to (a) partial segmentation of objects and (b) missing object predictions. We show that these issues can be better addressed by training with weakly labeled videos instead of images. In videos, motion and temporal consistency of predictions across frames provide complementary signals which can help segmentation. We are the first to explore the use of these video signals to tackle weakly supervised instance segmentation. We propose two ways to leverage this information in our model. First, we adapt inter-pixel relation network (IRN) to effectively incorporate motion information during training. Second, we introduce a new MaskConsist module, which addresses the problem of missing object instances by transferring stable predictions between neighboring frames during training. We demonstrate that both approaches together improve the instance segmentation metric AP_50 on video frames of two datasets: Youtube-VIS and Cityscapes by 5% and 3% respectively.




1 Introduction

Instance segmentation is a challenging task, where all object instances in an image have to be detected and segmented. This task has seen rapid progress in recent years [he2017mask, liu2018path, chen2019hybrid], partly due to the availability of large datasets like COCO [lin2014microsoft]. However, it can be forbiddingly expensive to build datasets at this scale for a new domain of images or videos, since segmentation boundaries have to be annotated for every object in an image.

Figure 1: Two types of error for IRN [ahn2019weakly] trained with still images: (a) partial segmentation and (b) missing instance. We observe optical flow is able to capture pixels of the same instance better (circles in (a)) and we propose flowIRN to model this information. In (b), a fish is missed on one frame. We propose MaskConsist to leverage temporal consistency and transfer stable mask predictions to neighboring frames during training.

Alternatively, weak labels like classification labels can be used to train instance segmentation models [zhou2018weakly, cholakkal2019object, zhu2019learning, ahn2019weakly, laradji2019masks, ge2019label, shen2019cyclic, arun2020weakly]. While weak labels are significantly cheaper to annotate, training weakly supervised models can be far more challenging. These models typically suffer from two sources of error: (a) partial instance segmentation and (b) missing object instances, as shown in Fig. 1. Weakly supervised methods often identify only the most discriminative object regions that help predict the class label. This results in partial segmentation of objects, as shown in Fig. 1(a). For instance, the recent work on weakly supervised instance segmentation IRN [ahn2019weakly] relies on class activation maps (CAMs) [zhou2016learning], which suffer from this issue as also observed in other works [kolesnikov2016seed, wei2017object, zhang2018adversarial]. Further, CAMs do not differentiate between overlapping instances of the same class. Such methods can also miss object instances when multiple instances are present in an image, as shown in Fig. 1(b). In particular, an instance could be segmented in one image but not in another image where it is occluded or its pose changes.

Interestingly, these issues are less severe in videos, where object motion provides an additional signal for instance segmentation. As shown in Fig. 1, optical flow in a video is tightly coupled with instance segmentation masks. This is unsurprising since pixels belonging to the same (rigid) object move together and have similar flow vectors. We incorporate such video signals to train weakly supervised instance segmentation models, in contrast to existing methods [ahn2019weakly, laradji2019masks, shen2019cyclic, arun2020weakly] only targeted at images.

Typical weakly supervised approaches involve two steps: (a) generating pseudo-labels, comprising noisy instance segmentation masks consistent with the weak class labels, and (b) training a supervised model like Mask R-CNN based on these pseudo-labels. We leverage video information in both stages. In the first step, we modify IRN to assign similar labels to pixels with similar motion. This helps in addressing the problem of partial segmentation. We refer to the modified IRN as flowIRN. In the second step, we introduce a new module called MaskConsist, which counters the problem of missing instances by leveraging temporal consistency between objects across consecutive frames. It matches predictions between neighboring frames and transfers the stable predictions to obtain additional pseudo-labels missed by flowIRN during training. This is a generic module that can be used in combination with any weakly supervised segmentation method, as we show in our experiments.

To the best of our knowledge, ours is the first work to utilize temporal consistency between frames to train a weakly supervised instance segmentation model for videos. We show that this leads to more than 5% and 3% improvement in average precision compared to image-centric methods, like IRN, on video frames from two challenging video datasets: Youtube-VIS (YTVIS) [yang2019video] and Cityscapes [Cordts2016Cityscapes], respectively. We also observe similar gains on the recently introduced video instance segmentation task [yang2019video] in YTVIS.

2 Related Work

Different types of weak supervision have been used in the past for semantic segmentation: bounding boxes [dai2015boxsup, papandreou2015weakly, khoreva2017simple, song2019box], scribbles [lin2016scribblesup, vernaza2017learning, tang2018normalized], and image-level class labels [kolesnikov2016seed, jin2017webly, hou2018self, wei2018revisiting, huang2018weakly, lee2019ficklenet, shimoda2019self, sun2020mining]. Similarly, for instance segmentation, image-level [zhou2018weakly, cholakkal2019object, zhu2019learning, ahn2019weakly, laradji2019masks, ge2019label, shen2019cyclic, arun2020weakly] and bounding box supervision [khoreva2017simple, hsu2019bbtp] have been explored. In this work, we focus on only using class labels for weakly supervised instance segmentation.

Weakly supervised semantic segmentation: Most weakly supervised semantic segmentation approaches rely on class activation maps (CAMs) [zhou2016learning] to provide noisy pseudo-labels as supervision [kolesnikov2016seed, huang2018weakly, shimoda2019self, sun2020mining]. Sun et al. [sun2020mining] used co-attention maps generated from image pairs to train the semantic segmentation network. Another line of work leverages motion and temporal consistency in videos [tsai2016semantic, tokmakov2016learning, saleh2017bringing, hong2017weakly, lee2019frame, wang2020deep] to learn more robust representations. For instance, frame-to-frame (F2F) [lee2019frame] used optical flow to warp CAMs from neighboring frames and aggregated the warped CAMs to obtain more robust pseudo-labels.

Weakly supervised instance segmentation: For training instance segmentation models with bounding box supervision, Hsu et al. [hsu2019bbtp] proposed a bounding box tightness constraint and a multiple instance learning (MIL) based objective. Another line of work that only uses class labels extracts semantic responses from CAMs or other attention maps and then combines them with object proposals [rother2004grabcut, pont2016multiscale] to generate instance segmentation masks [zhou2018weakly, cholakkal2019object, zhu2019learning, laradji2019masks, shen2019cyclic]. However, the performance of these methods depends heavily on the quality of the proposals used, which mostly come from models pre-trained on other datasets. Shen et al. [shen2019cyclic] extracted attention maps from a detection network and then jointly learned the detection and segmentation networks in a cyclic manner. Arun et al. [arun2020weakly] proposed a conditional network to model the noise in weak supervision and combined it with object proposals to generate instance masks. The first end-to-end network (IRN) [ahn2019weakly] was proposed by Ahn et al. to directly predict instance offsets and semantic boundaries, which were combined with CAMs to generate instance mask predictions. Our method adapts [ahn2019weakly] for the first step of training and combines it with a novel MaskConsist module. However, other weakly supervised methods can also be integrated into our framework if code is available.

Segmentation in videos: A series of approaches have emerged for segmentation in videos [Bertasius_2020_CVPR, hu2019learning, athar2020stem, mohamed2020instancemotseg, liang2020polytransform]. Some works proposed to leverage video consistency [vondrick2018tracking, wang2019learning, lu2020learning, luiten2020unovost]. Recently, Yang et al. [yang2019video] extended the traditional instance segmentation task from images to videos and proposed the Video Instance Segmentation (VIS) task. VIS aims to simultaneously segment and track all object instances in the video. Every pixel is labeled with a class label and an instance track-ID. MaskTrack [yang2019video] added a tracking head to Mask R-CNN [he2017mask] to build a new model for this task. Bertasius et al. [Bertasius_2020_CVPR] improved MaskTrack by proposing a mask propagation head. This head propagated instance features across frames in a clip to get more stable predictions. To the best of our knowledge, there has been no work that has explored weakly supervised learning for the video instance segmentation task. We evaluate our method on this task by combining it with a simple tracking approach.

3 Approach

Figure 2: Our pipeline mainly consists of two modules: flowIRN and MaskConsist. FlowIRN adapts IRN [ahn2019weakly] by incorporating optical flow to modify CAMs (f-CAMs), as well as introducing a new loss function: flow-boundary loss (f-Bound loss). MaskConsist matches the predictions from two successive frames and transfers high-quality predictions from one frame as pseudo-labels to another. It has three components: intra-frame matching, inter-frame matching and temporally consistent labels, shown in orange, green and blue, respectively. First, flowIRN is trained with frame-level class labels. Next, MaskConsist is trained with the pseudo-labels generated by flowIRN.

We first introduce preliminaries of inter-pixel relation network (IRN) [ahn2019weakly] and extend it to incorporate video information, resulting in flowIRN. Next, we introduce MaskConsist which enforces temporal consistency in predictions across successive frames. Our framework has a 2-stage training process: (1) train flowIRN and (2) use masks generated by flowIRN on the training frames as supervision to train the MaskConsist model, as shown in Fig. 2.

3.1 Preliminaries of IRN

IRN [ahn2019weakly] extracts inter-pixel relations from class activation maps (CAMs) and uses them to infer instance locations and class boundaries. For a given image, CAMs provide pixel-level scores for each class that are then converted to class labels. Every pixel is assigned the label corresponding to the highest class activation score at the pixel, if this score is above a foreground threshold. Otherwise, it is assigned the background label.
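The label assignment described above can be illustrated with a minimal sketch. The function name `cam_to_labels`, the dictionary layout of the CAMs, and the threshold value 0.25 are assumptions made for illustration; they are not taken from IRN's code.

```python
def cam_to_labels(cams, fg_threshold):
    """Convert per-class score maps into a per-pixel label map.

    cams: dict mapping class name -> 2D list of scores (all the same shape).
    A pixel takes the highest-scoring class if that score exceeds
    fg_threshold; otherwise it is labeled "background".
    """
    classes = sorted(cams)
    height = len(cams[classes[0]])
    width = len(cams[classes[0]][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            best = max(classes, key=lambda c: cams[c][y][x])
            is_foreground = cams[best][y][x] > fg_threshold
            row.append(best if is_foreground else "background")
        labels.append(row)
    return labels


# Toy 2x2 example with two classes (values are illustrative only).
cams = {
    "cat": [[0.9, 0.2], [0.1, 0.05]],
    "dog": [[0.3, 0.6], [0.2, 0.1]],
}
labels = cam_to_labels(cams, fg_threshold=0.25)
```

In this toy input the bottom row falls below the threshold for every class, so both of its pixels become background.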

IRN is a network with two branches that predict (a) a per-pixel displacement vector pointing towards the center of the instance containing the pixel and (b) a per-pixel boundary likelihood indicating if a pixel lies on the boundary of an object or not. Since the model is weakly supervised, neither displacement nor boundary labels are available during training. Instead, IRN introduces losses that enforce constraints on displacement and boundary predictions based on the foreground/background labels inferred from CAMs.

During inference, a two-step procedure is used to obtain instance segmentation masks. First, all pixels with displacement vectors pointing towards the same centroid are grouped together to obtain per-pixel instance labels. However, these predictions tend to be noisy. In the second step, the predictions are refined using a pairwise affinity term $a_{ij}$. For two pixels $i$ and $j$,

$$a_{ij} = 1 - \max_{k \in \Pi_{ij}} B(k), \tag{1}$$

where $B(k)$ is the boundary likelihood predicted by IRN for pixel $k$, and $\Pi_{ij}$ is the set of pixels lying on the line connecting $i$ and $j$. If two pixels are separated by an object boundary, at least one pixel on the line connecting them should belong to this boundary. This results in low affinity between the two pixels. Conversely, the affinity would be high for pixels which are part of the same instance. In IRN, the affinity term is used to define the transition probability for a random walk algorithm that smooths the final per-pixel instance and class label assignments.
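As a rough illustration of the affinity term, the sketch below computes one minus the largest boundary likelihood found on the straight segment between two pixels. The helper `line_pixels` uses simple linear interpolation to rasterize the segment, which is an assumption made for this example; IRN's own sampling of the line may differ.

```python
def line_pixels(p, q):
    """Integer pixels on the segment from p to q (inclusive).

    Uses plain linear interpolation; this is an illustrative choice.
    """
    (y0, x0), (y1, x1) = p, q
    steps = max(abs(y1 - y0), abs(x1 - x0))
    if steps == 0:
        return [p]
    return [
        (round(y0 + (y1 - y0) * s / steps), round(x0 + (x1 - x0) * s / steps))
        for s in range(steps + 1)
    ]


def affinity(boundary, p, q):
    """One minus the maximum boundary likelihood along the segment p-q."""
    return 1.0 - max(boundary[y][x] for y, x in line_pixels(p, q))


# Toy boundary map: a strong vertical boundary in the third column.
boundary = [
    [0.0, 0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9, 0.0],
]
same_side = affinity(boundary, (0, 0), (1, 1))  # does not cross the boundary
across = affinity(boundary, (0, 0), (0, 3))     # crosses the boundary
```

The pair on the same side of the boundary keeps a high affinity, while the pair straddling it gets a low one.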

3.2 FlowIRN Module

We introduce flowIRN which improves IRN by incorporating optical flow information in two components, flow-amplified CAMs and flow-boundary loss.

Flow-Amplified CAMs: We observed that CAMs identify only the discriminative regions of an object (like the face of an animal) but often miss other regions corresponding to the object. This has been noted in previous works as well [kolesnikov2016seed, wei2017object, zhang2018adversarial]. Since the objects of interest in a video are usually moving foreground objects, we address this issue by first amplifying CAMs in regions where large motion is observed. More specifically, given the estimated optical flow $F$ for the current frame, we replace CAMs used in IRN with:

$$\text{f-CAM}_c(x) = \mathrm{CAM}_c(x) \cdot A^{\,\mathbb{1}\left[\|F(x)\|_2 > T\right]},$$

where $\mathrm{CAM}_c(x)$ is the score of class $c$ at pixel $x$, $A$ is an amplification coefficient and $T$ is a flow magnitude threshold. This operation is applied to CAMs of all classes equally, preserving the relative ordering of class scores. Class labels obtained from CAMs are not flipped; only foreground and background assignments are affected.
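A minimal sketch of flow-amplified CAMs follows, assuming the amplification is a multiplication by a constant wherever the flow magnitude exceeds the threshold. The function name, the data layout, and the values 2.0 and 1.0 are hypothetical and chosen only to make the example concrete.

```python
import math


def flow_amplified_cams(cams, flow, amplification, threshold):
    """Scale every class score at pixels whose flow magnitude is large.

    cams: dict class name -> 2D list of scores.
    flow: 2D list of (dx, dy) flow vectors, same spatial shape as the CAMs.
    All classes are scaled by the same factor, so the ranking of classes
    at each pixel is unchanged.
    """
    amplified = {}
    for cls, cam in cams.items():
        amplified[cls] = [
            [
                score * (amplification if math.hypot(*flow[y][x]) > threshold else 1.0)
                for x, score in enumerate(row)
            ]
            for y, row in enumerate(cam)
        ]
    return amplified


# One static pixel and one moving pixel (flow magnitude 5).
cams = {"cat": [[0.2, 0.2]], "dog": [[0.1, 0.1]]}
flow = [[(0.0, 0.0), (3.0, 4.0)]]
f_cams = flow_amplified_cams(cams, flow, amplification=2.0, threshold=1.0)
```

Only the moving pixel is amplified, and "cat" still outranks "dog" there, so a foreground threshold between 0.2 and 0.4 would now mark that pixel as foreground without changing its class.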

Flow-boundary loss: In IRN, boundary prediction is supervised by the pseudo segmentation labels from CAMs, which do not distinguish instances of the same class, particularly overlapping instances. However, in videos, optical flow could disambiguate such instances, since pixels of the same rigid object move together and have consistent motion. Hence, we use the spatial gradient of optical flow to identify whether two pixels are from the same object instance. Points from the same object can be at different depths relative to the camera, and might not have the same optical flow. In practice, we observed that the gradient is more robust to this depth change. We explain this in detail in the appendix. We use the affinity term $a_{ij}$ from Eq. 1 to define a new flow-boundary loss:

$$\mathcal{L}_{B}^{F} = \sum_{j \in \mathcal{N}_i} \left\| G(i) - G(j) \right\| a_{ij} + \lambda \left| 1 - a_{ij} \right|,$$

where $G(i)$ is a two-dimensional vector denoting the gradient of the optical flow at pixel $i$ with respect to its spatial coordinates, $\mathcal{N}_i$ is a small pixel neighborhood around $i$ as defined in IRN, and $\lambda$ is the regularization parameter. The first term implies that pixels with similar flow gradients can have high affinity (belonging to the same instance), while pixels with different flow gradients should have low affinity (belonging to different instances). The second term regularizes the loss and prevents the trivial solution of $a_{ij}$ being constantly 0. We train flowIRN with the above loss and the original losses in IRN.
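The interplay of the two terms can be seen in a small numeric sketch. It assumes the loss sums, over neighboring pixel pairs, the flow-gradient difference weighted by the affinity plus a regularizer that penalizes low affinity; the exact norm and weighting in the original formulation may differ, and the names and values here are hypothetical.

```python
import math


def flow_boundary_loss(flow_grad, affinity, pairs, reg_weight):
    """Sum of (gradient difference * affinity) + reg_weight * (1 - affinity).

    flow_grad: dict pixel id -> (gx, gy) flow-gradient vector.
    affinity: dict (i, j) -> affinity value in [0, 1].
    pairs: list of neighboring pixel pairs (i, j).
    """
    total = 0.0
    for i, j in pairs:
        gi, gj = flow_grad[i], flow_grad[j]
        diff = math.hypot(gi[0] - gj[0], gi[1] - gj[1])
        a = affinity[(i, j)]
        total += diff * a + reg_weight * (1.0 - a)
    return total


# Pixels p and q move alike; pixel r has a very different flow gradient.
flow_grad = {"p": (0.0, 0.0), "q": (0.0, 0.0), "r": (3.0, 4.0)}
pairs = [("p", "q"), ("p", "r")]

good = {("p", "q"): 1.0, ("p", "r"): 0.0}     # high only where motion agrees
trivial = {("p", "q"): 0.0, ("p", "r"): 0.0}  # affinity zero everywhere
bad = {("p", "q"): 1.0, ("p", "r"): 1.0}      # ignores the motion difference

good_loss = flow_boundary_loss(flow_grad, good, pairs, reg_weight=0.5)
trivial_loss = flow_boundary_loss(flow_grad, trivial, pairs, reg_weight=0.5)
bad_loss = flow_boundary_loss(flow_grad, bad, pairs, reg_weight=0.5)
```

Without the regularizer, the all-zero affinity would reach a loss of zero; with it, the assignment that follows the motion pattern scores best.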

3.3 MaskConsist

Figure 3: Three steps of MaskConsist module. (a) Intra-frame matching expands the original predictions of current Mask R-CNN by merging highly overlapping predictions. (b) Inter-frame matching identifies one-to-one matching between predictions across frames. (c) Temporally consistent labels transfer matched predictions from one frame to another after warping with optical flow.

The instance-level masks generated by flowIRN can now be used as pseudo-labels to train a fully supervised instance segmentation model, like Mask R-CNN. In practice, this yields better performance on the validation set than the original flowIRN model. However, the pseudo-labels generated by flowIRN can miss object instances on some frames if both CAMs and optical flow fail to identify them.

MaskConsist solves this by transferring “high-quality” mask predictions from a neighboring frame as new pseudo-labels to the current frame while training a Mask R-CNN. At each training iteration, we train the network with a pair of neighboring frames $t$ and $t+\delta$ in the video. In addition to the pseudo-labels from flowIRN, a subset of predictions by the current Mask R-CNN on frame $t$ are used as additional pseudo-labels for frame $t+\delta$ and vice-versa. Predictions are transferred only if (a) they are temporally stable and (b) they overlap with existing pseudo-labels from flowIRN. This avoids false positives introduced by MaskConsist. MaskConsist contains three steps: intra-frame matching, inter-frame matching, and temporally consistent label assignment. These steps are explained next and visualized in Fig. 3.

Intra-frame matching: At each training iteration, we first generate a large set of candidate masks that can be transferred to neighboring frames. The pseudo-labels from flowIRN might be incomplete, but we expect the predictions from the Mask R-CNN to become more robust as training proceeds. Hence, high-confidence predictions from the model at a given iteration can be used as the candidate set. However, we empirically observed that during early stages of training, masks predicted by Mask R-CNN can be fragmented. We overcome this by also considering the union of all mask predictions which have good overlap with a flowIRN pseudo-label of the same class. The candidate set of predictions after this step includes the original predictions (in practice, we use the top 100 predictions) for the frame, as well as the new predictions obtained by combining overlapping predictions, as shown in Fig. 3(a). For a frame at time $t$, we refer to the original set of predictions from the model as $P_t$, and this expanded set as $P_t^{\mathrm{exp}}$. Each prediction $p_i$ corresponds to a triplet: mask, bounding box and the class with highest score for the box, denoted by $(m_i, b_i, c_i)$ respectively.
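The expansion step can be sketched as follows, assuming that "good overlap" is measured by bounding-box IoU against each pseudo-label and that masks are stored as sets of (row, column) pixels. The function names, the dictionary layout, and the threshold 0.3 are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1) with exclusive maxima."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0) * max(ih, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def bounding_box(mask):
    """Tight box around a non-empty set of (row, col) pixels."""
    ys = [y for y, _ in mask]
    xs = [x for _, x in mask]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def make_pred(cls, mask):
    return {"cls": cls, "mask": set(mask), "box": bounding_box(mask)}


def expand_predictions(preds, pseudo_labels, box_iou_threshold):
    """Original predictions plus unions of fragments that share a pseudo-label."""
    expanded = list(preds)
    for label in pseudo_labels:
        group = [
            p for p in preds
            if p["cls"] == label["cls"]
            and box_iou(p["box"], label["box"]) > box_iou_threshold
        ]
        if len(group) > 1:  # merge only when there is something to combine
            merged = set().union(*(p["mask"] for p in group))
            expanded.append(make_pred(label["cls"], merged))
    return expanded


# Two fragments of one fish, each covering half of the pseudo-label.
left = make_pred("fish", {(y, x) for y in range(2) for x in range(0, 2)})
right = make_pred("fish", {(y, x) for y in range(2) for x in range(2, 4)})
label = make_pred("fish", {(y, x) for y in range(2) for x in range(0, 4)})
expanded = expand_predictions([left, right], [label], box_iou_threshold=0.3)
```

The expanded set keeps both fragments and adds their union, which here coincides with the full pseudo-label mask.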

Inter-frame matching: Next, we wish to transfer some predictions from the current frame $t$ as pseudo-labels to the neighboring frame $t+\delta$ and vice-versa. We only transfer a prediction if it is stably predicted by the current model on both frames. To do this, we first create a bipartite graph between the two frames. The nodes from each frame correspond to the expanded prediction sets $P_t^{\mathrm{exp}}$ and $P_{t+\delta}^{\mathrm{exp}}$ respectively, as shown in Fig. 3(b). The edge weight between predictions $p_i \in P_t^{\mathrm{exp}}$ and $p_j \in P_{t+\delta}^{\mathrm{exp}}$ is defined as:

$$e_{ij} = \mathbb{1}\left[c_i = c_j\right] \cdot \mathrm{IoU}\big(\mathcal{W}_{t \to t+\delta}(m_i),\, m_j\big),$$

where $m_i$ and $c_i$ denote the mask and class of prediction $p_i$, and $\mathcal{W}_{t \to t+\delta}$ is a bi-linear warping function that warps the prediction from one frame to another based on the optical flow between them (explained in the appendix). The edge weight is non-zero only if the two predictions share the same class. The weight is high if the warped mask from frame $t$ has high overlap with the mask in frame $t+\delta$.

The correspondence between predictions of both frames is then obtained by solving the bipartite graph-matching problem with these edge weights, using the Hungarian algorithm. This results in a one-to-one matching between a subset of predictions from $P_t^{\mathrm{exp}}$ and $P_{t+\delta}^{\mathrm{exp}}$. We denote the matching result as $M$, containing pairs of matched predictions from both frames. This comprises pairs of predictions that are temporally stable.
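The matching step can be illustrated on a toy example. To stay dependency-free, this sketch swaps the Hungarian algorithm for an exhaustive search over assignments, which finds the same maximum-weight one-to-one matching but is only practical for a handful of predictions. It also replaces flow-based bilinear warping with a constant integer shift. Both substitutions, and all names, are assumptions made for illustration.

```python
from itertools import permutations


def iou(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def shift(mask, dy, dx):
    """Stand-in for flow-based warping: translate every pixel by (dy, dx)."""
    return {(y + dy, x + dx) for y, x in mask}


def edge_weight(pred_a, pred_b, motion):
    """Mask IoU after warping, or 0 when the classes differ."""
    if pred_a["cls"] != pred_b["cls"]:
        return 0.0
    return iou(shift(pred_a["mask"], *motion), pred_b["mask"])


def match(preds_a, preds_b, motion):
    """Exhaustive max-weight matching; assumes len(preds_a) <= len(preds_b)."""
    best_score, best_pairs = -1.0, []
    for perm in permutations(range(len(preds_b)), len(preds_a)):
        weights = [edge_weight(preds_a[i], preds_b[j], motion)
                   for i, j in enumerate(perm)]
        if sum(weights) > best_score:
            best_score = sum(weights)
            # Keep only pairs that actually overlap.
            best_pairs = [(i, j) for (i, j), w in zip(enumerate(perm), weights) if w > 0]
    return best_pairs


frame_t = [
    {"cls": "fish", "mask": {(0, 0), (0, 1)}},
    {"cls": "fish", "mask": {(2, 0), (2, 1)}},
]
frame_next = [
    {"cls": "fish", "mask": {(2, 1), (2, 2)}},
    {"cls": "fish", "mask": {(0, 1), (0, 2)}},
    {"cls": "cat", "mask": {(0, 1), (0, 2)}},
]
matched = match(frame_t, frame_next, motion=(0, 1))
```

For realistic numbers of predictions, a polynomial-time solver for the assignment problem would replace the permutation loop.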

Temporally consistent labels: We use the predictions from frame $t$ which are matched to some predictions in frame $t+\delta$ in the previous step to define new pseudo-labels for frame $t+\delta$, as shown in Fig. 3(c). Since there can be many spuriously matched predictions, we only transfer high-quality predictions that have some overlap with the original pseudo-labels in frame $t$. As presented in Alg. 1, let $L_t$ be the original set of pseudo-labels obtained from flowIRN for frame $t$ and $M$ be the matched prediction pairs between the two frames. We transfer only those masks from $P_t^{\mathrm{exp}}$ which have an overlap greater than a fixed threshold with any of the original masks in $L_t$ of the same class. Further, when transferring to frame $t+\delta$, we (a) warp the mask using optical flow and (b) merge it with the matched prediction in $P_{t+\delta}^{\mathrm{exp}}$, as shown in Fig. 3(c), to ensure that the mask is not partially transferred. This new set of labels transferred from $t$ to $t+\delta$ is denoted by $L_{t \to t+\delta}$. The steps are listed in Alg. 1. Here, $\mathrm{merge}(\cdot, \cdot)$ simply takes the union of masks from two predictions to form a new prediction.

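Input: matched pairs $M$, original pseudo-labels $L_t$, overlap threshold $\tau$
Output: transferred labels $L_{t \to t+\delta}$
$L_{t \to t+\delta} \leftarrow \emptyset$
for $(p_i, p_j) \in M$ do
       for $l \in L_t$ do
              if $\mathrm{overlap}(m_i, m_l) > \tau$, $c_i = c_l$ then
                     $L_{t \to t+\delta} \leftarrow L_{t \to t+\delta} \cup \{\mathrm{merge}(\mathcal{W}_{t \to t+\delta}(p_i), p_j)\}$
                     break
              end if
       end for
end for
Algorithm 1 Temporally consistent assignment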
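A compact Python rendering of the assignment loop is given below as a sketch. It assumes masks are sets of pixels, uses IoU as the overlap measure, and takes the warping function as a parameter; the toy warp is a one-pixel shift rather than flow-based warping. These choices and all names are hypothetical.

```python
def iou(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def transfer_labels(matches, pseudo_labels, warp, overlap_threshold):
    """Build labels for the neighboring frame from matched prediction pairs.

    matches: list of (source prediction, matched prediction in the other frame).
    pseudo_labels: original pseudo-labels of the source frame.
    A pair is transferred only if the source mask overlaps a same-class
    pseudo-label; the warped source mask is merged (set union) with its match.
    """
    transferred = []
    for pred_src, pred_dst in matches:
        for label in pseudo_labels:
            same_class = pred_src["cls"] == label["cls"]
            if same_class and iou(pred_src["mask"], label["mask"]) > overlap_threshold:
                merged = warp(pred_src["mask"]) | pred_dst["mask"]
                transferred.append({"cls": pred_src["cls"], "mask": merged})
                break  # one supporting pseudo-label is enough
    return transferred


def shift_right(mask):
    """Toy warp: every pixel moves one column to the right."""
    return {(y, x + 1) for y, x in mask}


matches = [
    ({"cls": "fish", "mask": {(0, 0), (0, 1)}}, {"cls": "fish", "mask": {(0, 2)}}),
    ({"cls": "cat", "mask": {(5, 5)}}, {"cls": "cat", "mask": {(5, 6)}}),
]
pseudo_labels = [{"cls": "fish", "mask": {(0, 0), (0, 1), (1, 0)}}]
new_labels = transfer_labels(matches, pseudo_labels, shift_right, overlap_threshold=0.5)
```

The matched cat pair is stable across frames but has no supporting pseudo-label in the source frame, so only the fish is transferred.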

Simultaneously, new pseudo-labels are obtained for frame $t$ by transferring predictions from frame $t+\delta$ in a similar fashion. We combine them with the original pseudo-labels from flowIRN to obtain the final set of pseudo-labels for each frame. We also note that while combining these two sets of labels, it is important to suppress smaller masks that are contained within others. Concretely, we apply non-maximal suppression (NMS) based on an Intersection over Minimum (IoM) threshold. IoM is calculated between two masks as the intersection area over the area of the smaller mask. This avoids label redundancy and helps improve performance, as we demonstrate later in ablation experiments. The merged pseudo-labels are used as supervision to train the Mask R-CNN model as shown in Fig. 2, without altering the Mask R-CNN in any way.
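The IoM-based suppression can be sketched as below. The IoM definition follows the text; visiting masks from largest to smallest is an assumption made so that contained masks are the ones removed, and the threshold 0.5 is illustrative.

```python
def iom(a, b):
    """Intersection over Minimum: intersection area / area of the smaller mask."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def iom_nms(masks, threshold):
    """Keep masks from largest to smallest, dropping any mask whose IoM
    with an already kept mask exceeds the threshold."""
    kept = []
    for mask in sorted(masks, key=len, reverse=True):
        if all(iom(mask, k) <= threshold for k in kept):
            kept.append(mask)
    return kept


big = {(y, x) for y in range(3) for x in range(3)}  # 3x3 block
inner = {(0, 0), (0, 1)}                            # fully inside `big`
far = {(10, 10)}                                    # disjoint from both
kept = iom_nms([inner, far, big], threshold=0.5)
```

Plain IoU would give the contained mask a score of only 2/9 against the large one, so an IoU-based NMS with a typical threshold would keep both; IoM gives 1.0 and removes the redundant label.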

Our overall MaskConsist approach does not require an extra forward or backward pass, and only adds a small overhead to the original Mask R-CNN during training. During inference, the matching is unnecessary and MaskConsist works similarly to Mask R-CNN.

4 Experiments

Unless otherwise specified, models in this section are trained only with frame-level class labels and do not use bounding-box or segmentation labels. We evaluate our model on two tasks: frame-level instance segmentation and video-level instance segmentation. We report performances on two popular video datasets.

4.1 Datasets

Youtube-VIS (YTVIS) [yang2019video] is a recently proposed benchmark for the task of video instance segmentation. It contains 2,238 training, 302 validation, and 343 test videos collected from YouTube, covering 40 categories. Every 5th frame in the training split is annotated with instance segmentation masks.

As the annotation of validation and test splits are not released and only video-level instance segmentation performance is available on the evaluation server, we hold out a subset of videos from the original training split by randomly selecting videos from each category. This results in a train_val split of videos (there are videos belonging to multiple object categories) to conduct frame-level and video-level instance segmentation evaluations. The remaining videos are used as the train_train split.

Cityscapes [Cordts2016Cityscapes] contains high-quality pixel-level annotations for frames collected in street scenes from different cities. object categories are annotated with semantic segmentation masks and of them are annotated with instance segmentation masks. The standard training frames and their neighboring and frames are used for training, and the frames in validation split are used for evaluation.

4.2 Implementation details

Optical flow network: We use the self-supervised DDFlow [liu2019ddflow] for optical flow extraction. The model is pre-trained on “Flying Chairs” dataset [DFIB15] and then fine-tuned on YTVIS or Cityscapes training videos in an unsupervised way. The total training time is hours on four P100 GPUs and the average inference time per frame is 280ms.

flowIRN: To get flow-amplified CAMs, we set the amplification co-efficient and threshold for YTVIS, and and for Cityscapes. The optical flow is extracted between two consecutive frames (frame and ). The regularization weight is set to . We train the network for epochs. Other training and inference hyper-parameters are set the same as in [ahn2019weakly]. Empirically, we observe that IRN (and flowIRN) is limited by lack of good CAMs when trained only on Cityscapes data. Hence, for experiments on Cityscapes, we train the first-stage of all weakly supervised models (before the Mask R-CNN/MaskConsist stage) first on PASCAL VOC 2012 [everingham2010pascal] training-split and then fine-tune on Cityscapes.


MaskConsist: We use ResNet-50 as the backbone, initialized with ImageNet pre-trained weights. For both datasets, the bounding-box IoU threshold is set at for intra-frame matching, and IoM-NMS threshold at for label combining. The model is trained for iterations for YTVIS, and iterations for Cityscapes, with base learning rate . SGD optimizer is used with step schedule , decay at and of total steps. The temporal consistency is calculated between frame and () for YTVIS, frame and () for Cityscapes. Inference on one frame (short side px) takes 210ms. Nvidia Tesla P100 GPU is used in training and test. All hyper-parameters for flowIRN and MaskConsist are selected based on the performance on a small held-out validation split of the corresponding training set.

Methods | Video Info | Supervision | $AP_{50}$
Mask R-CNN [he2017mask] | | Mask |
WSIS-BBTP [hsu2019bbtp] | | Bbox |
WISE [laradji2019masks] | | Class |
F2F [lee2019frame] + MCG [pont2016multiscale] | | Class |
IRN [ahn2019weakly] | | Class |
IRN [ahn2019weakly] + F2F [lee2019frame] | | Class |
Ours | | Class |
Ours (self-training) | | Class | 36.00
Table 1: Frame-level instance segmentation performance ($AP_{50}$) on YTVIS train_val split.

Experiment setup: On YTVIS, all methods are trained using the training frames (every th frame) in train_train split. On Cityscapes, all methods are trained with training frames (frame ) and their two neighboring frames ( and ). Unless otherwise specified, our model is trained in two-steps: first train flowIRN on training frames, then use the pseudo-labels generated by the flowIRN on the training frames to train MaskConsist. For fair comparison, all baseline methods are also trained in two steps: first train the weakly supervised model (e.g., IRN) with frame-level class labels, then use pseudo-labels obtained to train a Mask R-CNN model. This is common practice in weakly supervised segmentation works [ahn2019weakly, laradji2019masks], and improves of all models by at least in our experiments. The same hyper-parameters reported in the original work or published code are retained for all baselines.

We also observe that a three-step training process, where the masks generated by our MaskConsist model are used to train another MaskConsist model, further improves performance. We refer to this as ours self-training. Note that unlike other baselines, this involves an additional round of training. On other baseline methods, we also attempted self-training: another round of training using pseudo-labels from the trained Mask R-CNN. However, this either degraded or did not improve performance on the validation set.

During frame-level inference, the trained MaskConsist or Mask R-CNN (for other baselines) is applied on each frame with score threshold of and NMS threshold of to obtain prediction masks. For video-level evaluation, we apply an unsupervised tracking method [luiten2020unovost] on per-frame instance mask predictions to obtain instance mask tracks, with the same hyper-parameters as the original work. We will release our code after paper acceptance.

Methods | Supervision | Instance seg ($AP_{50}$) | Semantic seg (IoU)
Mask R-CNN [he2017mask] | Mask | |
WISE [laradji2019masks] | Class | |
F2F [lee2019frame] + MCG [pont2016multiscale] | Class | |
IRN [ahn2019weakly] | Class | |
IRN [ahn2019weakly] + F2F [lee2019frame] | Class | |
Ours | Class | |
Ours (self-training) | Class | 16.82 | 41.31
Table 2: Frame-level instance segmentation ($AP_{50}$) and semantic segmentation (IoU) on Cityscapes validation split.

4.3 Frame instance segmentation

First, we compare frame-level performance with existing instance segmentation models on YTVIS and Cityscapes.

Evaluation metrics: On both YTVIS and Cityscapes, the average precision with mask intersection over union (IoU) threshold at 0.5 ($AP_{50}$) is used as the metric for instance segmentation. Cityscapes is a popular benchmark for semantic segmentation, so we also report semantic segmentation performance using the standard IoU metric.

Supervision | Methods | Train_Val split (mAP, AP50, AP75, AR1, AR10) | Validation split (mAP, AP50, AP75, AR1, AR10)
Fully supervised | IoUTracker+ [yang2019video] | - - - - - |
Fully supervised | DeepSORT [wojke2017simple] | - - - - - |
Fully supervised | MaskTrack [yang2019video] | - - - - - |
Weakly supervised | WISE [laradji2019masks] | |
Weakly supervised | IRN [ahn2019weakly] | |
Table 3: Video instance segmentation results on Youtube-VIS dataset.
Figure 4: Example Video instance segmentation results from our method on Youtube-VIS dataset.

Baselines: To the best of our knowledge, there is no existing weakly supervised instance segmentation model designed for videos. Existing works are designed for still images and report results on standard image benchmarks like [everingham2010pascal]. To compare with these models on video data, we train them (where code is available) with independent video frames of YTVIS or Cityscapes. We also extend existing weakly supervised “video” semantic segmentation models to perform instance segmentation. For upper-bound comparisons, we report results from Mask R-CNN [he2017mask] trained with ground truth masks, and WSIS-BBTP [hsu2019bbtp] trained with bounding box annotations. We list the baselines below and more details can be found in the appendix:

  • WISE [laradji2019masks]: train on independent frame with class label.

  • IRN [ahn2019weakly]: train on independent frame with class label.

  • F2F [lee2019frame] + MCG [pont2016multiscale]: use videos with class labels to train F2F to obtain semantic segmentation and combine MCG proposals to obtain instance-level masks as in [zhou2018weakly].

  • F2F [lee2019frame] + IRN [ahn2019weakly]: use optical flow to aggregate CAMs as in F2F to train IRN.

Results: Results on YTVIS are shown in Tab. 1. All methods use two-step training as stated in the experiment setup. WISE and F2F+MCG both use processed CAMs as weak labels and combine results with object proposals (MCG) to distinguish instances. Comparing WISE and F2F+MCG, F2F uses video information that boosts its performance by around . IRN+F2F is the closest comparison to our approach, since it is also built on top of IRN and uses video information. Our model outperforms IRN+F2F by more than , and can also benefit from an additional round of self-training (Ours self-training). However, we do not observe any gains when training the Mask R-CNN for another round for other methods.

In Tab. 2, we report frame-level instance segmentation and semantic segmentation results on Cityscapes. For instance segmentation, our method outperforms WISE and IRN by more than under . We convert the instance segmentation results to semantic segmentation by merging instance masks of the same class and assigning labels based on scores. On semantic segmentation, our method still outperforms IRN by a large margin.

4.4 Video instance segmentation

Given per-frame instance segmentation predictions, we apply the Forest Path Cutting algorithm [luiten2020unovost] to obtain a mask-track for each instance and report VIS results.

Evaluation metric: We use the same metrics as [yang2019video]: mean average precision for IoU between 0.5 and 0.95 (mAP), average precision with IoU threshold at 0.5 / 0.75 (AP50 / AP75), and average recall for top 1 / 10 (AR1 / AR10). As each instance in a video contains a sequence of masks, the computation of IoU uses the sum of intersections over the sum of unions across all frames in a video. The evaluation is carried out on the YTVIS train_val split using the YTVIS code (https://github.com/youtubevos), and also on the YTVIS validation split using the official YTVIS server.
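The track-level IoU described above can be written as a short sketch. Tracks are represented here as dictionaries from frame index to a set of mask pixels, with a missing frame treated as an empty mask; this representation and the function name are assumptions for illustration and are not the evaluation server's code.

```python
def track_iou(pred_track, gt_track):
    """Sum of per-frame intersections over sum of per-frame unions.

    pred_track, gt_track: dict frame index -> set of (row, col) pixels.
    A frame missing from a track counts as an empty mask.
    """
    inter = 0
    union = 0
    for frame in set(pred_track) | set(gt_track):
        p = pred_track.get(frame, set())
        g = gt_track.get(frame, set())
        inter += len(p & g)
        union += len(p | g)
    return inter / union if union else 0.0


pred = {0: {(0, 0), (0, 1)}, 1: {(0, 1)}}
gt = {0: {(0, 0), (0, 1)}, 1: {(0, 1), (0, 2)}, 2: {(0, 3)}}
video_iou = track_iou(pred, gt)  # (2 + 1 + 0) / (2 + 2 + 1)
```

Because the sums run over the whole video, a track that drops out on some frames is penalized through the union term even if its masks are accurate elsewhere.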

Baselines: Since there is no existing work on weakly supervised video instance segmentation, we construct our own baselines by combining the tracking algorithm in [luiten2020unovost] with two weakly supervised instance segmentation baselines: WISE [laradji2019masks] and IRN [ahn2019weakly]. We also present published results from fully supervised methods [yang2019video, wojke2017simple] for reference.

As presented in Tab. 3, our model outperforms IRN and WISE by a large margin. On the metric, there is a boost of more than on both train_val and validation splits. We also observe that the performance gap between WISE and IRN decreases compared with the frame-level results in Tab. 1, implying that temporal consistency is important to realize gains in video instance segmentation. Note that the fully supervised methods are first trained on MS-COCO [lin2014microsoft] and then fine-tuned on the YTVIS training split, while ours is trained only on YTVIS data. Qualitative VIS results from our method are shown in Fig. 4. Our method generates temporally stable instance predictions and is able to capture different overlapping instances. One failure case is shown in the bottom row. As skateboard and person always appear and move together in YTVIS, our assumption that different instances have different motion does not hold. Thus, these two instances are not well distinguished.

4.5 Effect of modeling temporal information

Our framework explicitly models temporal information in both flowIRN and MaskConsist modules. We explore the effectiveness of each module in this section.

YTVIS Cityscapes
IRN [ahn2019weakly]
flowIRN 28.45 10.75
Table 4: Ablation study of flowIRN components. Results are reported on training data to evaluate pseudo-label quality. No second-step Mask R-CNN or MaskConsist training is applied here.

Ablation study of flowIRN: In Tab. 4, we present the instance segmentation results () of different flowIRN variants. All models are tested directly on the training data to evaluate pseudo-label quality, and no second-step training is used in this experiment. Compared to the original IRN [ahn2019weakly], both flow-amplified CAMs (f-CAMs) and flow-boundary loss (f-Bound) incorporate optical flow information and improve IRN performance. Combining the two leads to our design of flowIRN, which improves by on YTVIS and on Cityscapes compared to IRN.

Figure 5: Improvement introduced by f-CAMs and f-Bound. Top: output of IRN. Middle: optical flow extracted for the input frame. Bottom: output after incorporating f-CAMs or f-Bound.

In Fig. 5, we show two qualitative examples of incorporating f-CAMs and f-Bound. In the left example, the car (in the circle) moves fast and is partially missed by IRN. After applying f-CAMs, the whole object is well captured in the segmentation mask. In the right example, IRN fails to separate two overlapping persons, while the boundary between them is recognizable in the optical flow. After applying the f-Bound loss, the two instances are correctly predicted.

Ablation study of MaskConsist: In Tab. 5, we explore the contribution of different components of MaskConsist by disabling one of the three components each time. We observe that inter-frame matching plays the most important role in MaskConsist. It enables the model to incorporate temporal consistency during training and achieves the largest performance boost. IoM-NMS helps avoid false positives corresponding to partial masks from inter-frame matching and improves the performance on top of intra-frame and inter-frame matching. Our best results on both datasets are achieved by combining all three components.
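To illustrate why an intersection-over-minimum (IoM) criterion removes partial masks that an IoU criterion would keep, here is a minimal sketch of greedy NMS with IoM. The set-of-pixels mask format, the greedy procedure, and the threshold of 0.5 are assumptions for illustration; the exact definition used by MaskConsist is the one given in the method section.

```python
def iom(mask_a, mask_b):
    """Intersection over the smaller of the two mask areas (pixel sets)."""
    smaller = min(len(mask_a), len(mask_b))
    return len(mask_a & mask_b) / smaller if smaller else 0.0


def iom_nms(detections, threshold=0.5):
    """Greedy NMS over (score, mask) pairs using IoM instead of IoU.

    Returns the indices of kept detections in descending score order.
    The threshold value is a hypothetical default.
    """
    order = sorted(range(len(detections)),
                   key=lambda i: detections[i][0], reverse=True)
    kept = []
    for i in order:
        if all(iom(detections[i][1], detections[k][1]) <= threshold
               for k in kept):
            kept.append(i)
    return kept


# Hypothetical example: a full object mask, a partial mask inside it,
# and an unrelated mask elsewhere in the frame.
full_mask = {(r, c) for r in range(4) for c in range(4)}
partial_mask = {(r, c) for r in range(2) for c in range(2)}
other_mask = {(r, c) for r in range(10, 12) for c in range(10, 12)}
kept_indices = iom_nms([(0.6, partial_mask), (0.9, full_mask), (0.8, other_mask)])
```

In this toy case the partial mask has an IoU of only 0.25 with the full mask, so IoU-based NMS at 0.5 would keep it, whereas its IoM with the full mask is 1.0 and it is suppressed.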

MaskConsist Components
Intra-F Inter-F IoM-NMS YTVIS Cityscapes
34.66 16.05
Table 5: Ablation study of MaskConsist components. The numbers in this table are generated by models with two-step training.

In Tab. 6, we further explore the effectiveness of the MaskConsist module by combining it with other weakly supervised instance segmentation methods: WISE [laradji2019masks] and IRN [ahn2019weakly]. A cross in the “w/ MC” column denotes the use of Mask R-CNN instead of MaskConsist. The results show that, by incorporating mask matching and consistency in the second stage of training, the MaskConsist module consistently improves the original weakly supervised methods by about . Combining the flowIRN module with MaskConsist achieves the best performance on both YTVIS and Cityscapes.

We also quantitatively evaluate how consistent the predictions of MaskConsist are on consecutive frames. As presented in the fifth column of Tab. 6, we report the temporal consistency () metric similar to [liu2020efficient]. This metric measures the between mask predictions and flow warped masks on consecutive frames in YTVIS. We observe consistent improvement in by adding MaskConsist to training.

Methods w/ MC
YTVIS Cityscapes
WISE [laradji2019masks]
IRN [ahn2019weakly]
Table 6: MaskConsist works on top of different weakly supervised instance segmentation methods and improves both and .
Figure 6: Comparison of Mask R-CNN and MaskConsist on YTVIS. Both models are trained with flowIRN pseudo-labels.

In Fig. 6, we present two examples of Mask R-CNN and MaskConsist predictions on YTVIS clips. Both models are trained with flowIRN pseudo-labels. Mask R-CNN predictions are more susceptible to noisy pseudo-labels and less consistent across frames, while MaskConsist achieves more stable segmentation results.

Further discussion: Regarding the two types of errors presented in Fig. 1, we observe that our model has a larger relative improvement over IRN on the stricter metric: ( vs. ) on , compared with ( vs. ) on , indicating that our model generates more accurate masks at high IoU thresholds. While our method outperforms IRN on , it also predicts more instances per frame (avg. instances for ours vs. avg. instances for IRN), indicating that our method is able to predict more instances with higher accuracy. These results demonstrate that the two problems of partial segmentation and missing instances are both alleviated by our model.

5 Conclusion

We observed that image-centric weakly supervised instance segmentation methods often segment an object instance partially or miss an instance completely. We proposed the use of temporal consistency between frames in a video to address these issues when training models from video frames. Our model (a) leveraged the constraint that pixels from the same instance move together, and (b) transferred temporally stable predictions to neighboring frames as pseudo-labels. We demonstrated the efficacy of these two approaches through comprehensive experiments on two video datasets of different scenes. Our model outperformed the state-of-the-art approaches on both datasets.

Appendix A Supplementary Material

a.1 Explanation of using optical flow gradient

Figure 7: 3D motion results in 2D optical flow following imaging principle.

Here we explain why the spatial gradient of optical flow helps identify whether two pixels belong to the same instance. This information is encoded in the flow-boundary loss (Eq. (3) in the manuscript).

As shown in Fig. 7, let pixel $(x, y)$ in the image be the projection of a point $(X, Y, Z)$ in the physical world moving with velocity $(V_X, V_Y, V_Z)$. The 3D motion results in optical flow $(u, v)$ on the image plane. We assume that, within a short time window, most objects in Youtube VIS and Cityscapes move parallel to the image plane, thus $V_Z \approx 0$. To simplify notation, we consider 3D motion only along the X-axis, i.e., $V_Y = 0$. The optical flow along the image $x$ axis can then be written as:

$$u = \frac{f V_X}{Z},$$

where $f$ is the camera focal length. We explain below why we use the difference of the first-order gradient of optical flow instead of the difference of optical flow itself.

For two neighboring pixels $p_i$ and $p_j$ with depths $Z_i$ and $Z_j$, if they are from the same rigid object, we have $V_{X,i} = V_{X,j} = V_X$ and $V_{Z,i} = V_{Z,j} = 0$. Then, their optical flow difference is:

$$u_i - u_j = f V_X \left( \frac{1}{Z_i} - \frac{1}{Z_j} \right),$$

which is not equal to zero when the two pixels differ in depth.

We propose to use the difference of the spatial gradient of optical flow. The spatial gradient of flow along the $x$ axis is defined as:

$$\frac{\partial u}{\partial x} = -\frac{f V_X}{Z^2} \frac{\partial Z}{\partial x}.$$

For two neighboring pixels $p_i$ and $p_j$ on the same instance surface, assuming the surface is smooth and thus $\frac{\partial Z_i}{\partial x} \approx \frac{\partial Z_j}{\partial x} = \frac{\partial Z}{\partial x}$, their difference of flow gradient is written as:

$$\frac{\partial u_i}{\partial x} - \frac{\partial u_j}{\partial x} = -f V_X \frac{\partial Z}{\partial x} \left( \frac{1}{Z_i^2} - \frac{1}{Z_j^2} \right).$$

In practice, the two pixels are in a local neighborhood and depth values are often large, thus $\frac{\partial u_i}{\partial x} - \frac{\partial u_j}{\partial x} \approx 0$ is a better approximation than $u_i - u_j \approx 0$. This indicates that the difference of optical flow gradient is a better signal for pixel affinity calculation.

A similar derivation applies to the $y$ axis. In practice, we calculate the first-order gradient difference in both directions and encourage its norm to be zero.
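As a quick numeric sanity check of this argument, the sketch below compares the two quantities for a pair of neighboring pixels on a slanted surface. It assumes a pinhole camera with motion parallel to the image plane, so that the horizontal flow is f * V_X / Z; the focal length, velocity, depth, and depth slope are hypothetical values chosen only for illustration.

```python
def flow_u(focal, v_x, depth):
    """Horizontal optical flow for motion parallel to the image plane."""
    return focal * v_x / depth


def flow_gradient_u(focal, v_x, depth, ddepth_dx):
    """Spatial derivative of flow_u along image x, with v_x held constant."""
    return -focal * v_x * ddepth_dx / (depth * depth)


# Hypothetical scene: rigid object, same velocity at both pixels.
focal = 1000.0     # focal length in pixels (assumed)
v_x = 1.0          # velocity along X per frame (assumed)
ddepth_dx = 0.1    # depth change per pixel along x (assumed, smooth surface)
z_i = 20.0
z_j = z_i + ddepth_dx  # neighboring pixel, one step along x

flow_diff = flow_u(focal, v_x, z_i) - flow_u(focal, v_x, z_j)
grad_diff = (flow_gradient_u(focal, v_x, z_i, ddepth_dx)
             - flow_gradient_u(focal, v_x, z_j, ddepth_dx))

print("flow difference:    ", abs(flow_diff))
print("gradient difference:", abs(grad_diff))
```

Under these assumptions the gradient difference is much closer to zero than the flow difference, because it scales with the inverse square of depth rather than the inverse of depth.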

a.2 Prediction warping with optical flow

In order to warp Mask R-CNN predictions with optical flow, we first warp the predicted masks using bi-linear interpolation as used in Spatial Transformer Network [jaderberg2015spatial]. The warped mask is then converted to binary mask at threshold of . Then the bounding box of warped prediction is generated from the warped mask and class label is directly copied. In practice, we implement the mask warping function using torch.nn.functional.grid_sample in PyTorch [NEURIPS2019_9015] framework.
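The warping steps above (bilinear sampling, binarization, box extraction) can be sketched without dependencies as follows. This is a stand-in for `torch.nn.functional.grid_sample`, not the authors' implementation. The names `warp_mask` and `mask_to_box`, the backward-flow convention (each target pixel stores the offset to its source location), zero padding outside the image, and the 0.5 threshold are all assumptions.

```python
import math


def warp_mask(mask, flow, threshold=0.5):
    """Backward-warp a soft or binary mask and binarize the result.

    mask: H x W nested list of values in [0, 1].
    flow: H x W nested list of (dx, dy); target pixel (row, col) samples
          the source mask at (col + dx, row + dy). Assumed convention.
    """
    height, width = len(mask), len(mask[0])

    def value(r, c):
        # Zero padding for samples that fall outside the image.
        return mask[r][c] if 0 <= r < height and 0 <= c < width else 0.0

    warped = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            dx, dy = flow[r][c]
            x, y = c + dx, r + dy
            x0, y0 = math.floor(x), math.floor(y)
            wx, wy = x - x0, y - y0
            # Bilinear interpolation over the four nearest source pixels.
            top = (1 - wx) * value(y0, x0) + wx * value(y0, x0 + 1)
            bottom = (1 - wx) * value(y0 + 1, x0) + wx * value(y0 + 1, x0 + 1)
            soft = (1 - wy) * top + wy * bottom
            warped[r][c] = 1 if soft >= threshold else 0
    return warped


def mask_to_box(mask):
    """Tight box (x_min, y_min, x_max, y_max) around a binary mask, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


# Hypothetical example: every target pixel samples one pixel to its left,
# which moves the object one pixel to the right.
source_mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
shift_flow = [[(-1, 0)] * 4 for _ in range(3)]
warped_mask = warp_mask(source_mask, shift_flow)
warped_box = mask_to_box(warped_mask)
```

The class label of the prediction would simply be copied alongside the warped mask and box, as the text states.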

a.3 Training details for baseline methods

WISE [laradji2019masks]: We train WISE using code published in [PRMcode] and [WISEcode]. For YTVIS, the model is trained for epochs with learning rate starting at . For Cityscapes, the model is pre-trained on PASCAL VOC 2012 and then fine-tuned for epochs with learning rate starting at . The MCG proposals are generated using code in [MCGcode]. We generate pseudo-labels on the training split and train Mask R-CNN model as stated in Sec. 4.2 Implementation Details in the manuscript.

F2F [lee2019frame]+MCG [pont2016multiscale]: We first use F2F to generate semantic segmentation masks and then combine them with MCG to generate instance masks as pseudo-labels. To train F2F on YTVIS and Cityscapes, we follow the F2F paper to generate the aggregated CAMs: we warp the CAMs from consecutive frames to the key frame. The weakly supervised network backbone used in F2F is not available, so we use SEC [kolesnikov2016seed] (code in [SECcode]), as advised by the F2F authors. For YTVIS, the model is trained for iterations with learning rate starting at . For Cityscapes, the model is pre-trained on PASCAL VOC 2012 and then fine-tuned for epochs with learning rate starting at . To combine with MCG proposals, we adopt an approach similar to that of WISE [laradji2019masks]. The resulting instance masks are used as pseudo-labels to train Mask R-CNN.

IRN [ahn2019weakly]: We train IRN using code published in [IRNcode]. For YTVIS, the model is trained for epochs with learning rate starting at . For Cityscapes, the model is pre-trained on PASCAL VOC 2012 and then fine-tuned for epochs with learning rate starting at . The other training and inference parameters are set as default. The resulting instance masks on the training data are then used as pseudo-labels to train Mask R-CNN.

IRN [ahn2019weakly]+F2F [lee2019frame]: We use the flow-warped CAMs as in F2F to train IRN: we warp CAMs from 5 neighboring frames to the key frame to generate aggregated CAMs. The aggregated CAMs then replace the CAMs used in the original IRN. The training parameters are the same as those used to train the original IRN model.

a.4 More visualization of video instance segmentation results

More video instance segmentation results are presented in Fig. 8. Each row shows every 5th frame of a video clip from the YTVIS train_val split. In the top nine examples, our method generates consistent instance masks with good object coverage and accurate instance boundaries. We also include two failure cases in the last two rows, where the object is heavily occluded or the object classes have a strong co-occurrence pattern.

Figure 8: More video instance segmentation results from our proposed method.