Video object segmentation in the 2017 DAVIS Challenge [13] is non-trivial: a video typically contains more than one annotated object, along with many distractors, small objects and fine structures. The complexity of the problem increases with severe inter-object occlusions and fast motion.
Conventional approaches that rely on temporal continuity suffer from issues such as drifting and an inability to handle large displacements. To overcome these issues, we formulate an effective mechanism that prevents the target from being lost via adaptive object re-identification. Specifically, our Video Object Segmentation with Re-identification (VS-ReID) model includes a mask propagation module and a ReID module. The mask propagation module is a two-stream convolutional neural network, inspired by prior work. The RGB branch of the mask propagation module accepts a bounding box and a guided probability map as input, and produces a segmentation mask for the main instance as output. The guided probability map is obtained from the predictions of adjacent frames by flow warping. In addition to the RGB branch, we also train an optical flow branch to incorporate temporal information. The final segmentation mask of the image patch is obtained by averaging the predictions of these two branches.
To cope with frequent occlusions and large pose variations in dynamic scenes, we leverage an object re-identification module to retrieve missing instances. Specifically, when missing instances are re-identified with high confidence, they are assigned a higher priority to be recovered during the mask propagation process. For each retrieved instance, we take its frame as the starting point and use the mask propagation module to bi-directionally generate the probability maps in its adjacent frames.
With the updated probability maps, the mask propagation module and the ReID module of VS-ReID are iteratively applied to the whole video sequence until no more high-confidence instances can be found. Finally, for each frame, the instance segmentation results are obtained by merging the probability maps of all instances. With flow warping to ensure temporal continuity and object re-identification to recover missing objects, VS-ReID achieves a global mean (Region Jaccard and Boundary F measure) of 0.699, the best performance in the 2017 DAVIS Challenge.
2 Related Work
Object segmentation has seen rapid progress recently, including the marriage of deep learning and graphical models [17, 9] and efforts to enable real-time inference on high-resolution images [7, 16]. Since most visual sensory data are videos, it is crucial to extend object segmentation from images to videos. Existing video segmentation methods [10, 11] rely on temporal continuity to establish spatio-temporal correlation. However, real-life videos exhibit severe deformation and occlusion, so methods built on this assumption suffer from issues such as drifting and an inability to handle large object displacements. In this work, we propose a novel method, Video Object Segmentation with Re-identification (VS-ReID), to overcome these issues.
Our VS-ReID model includes a mask propagation module and a re-identification module. The mask propagation module propagates the probability map from the predicted frame to the adjacent frames. Meanwhile, we employ the re-identification module to retrieve instances that go missing during the mask propagation process. The two modules are iteratively applied to the whole video sequence. We first present these two modules in Sec. 3.1 and Sec. 3.2, respectively, and then introduce the VS-ReID algorithm in Sec. 3.3.
3.1 Mask Propagation Module
The inference algorithm of mask propagation is summarized in Algorithm 1. Given two adjacent frames and the pixel-level probability map of an instance in one of them, we aim to predict the probability map of that instance in the other frame.
We first obtain a coarse estimate of the target probability map from the known one by flow-guided warping. We use FlowNet2.0 [5] to extract the optical flow between the two frames, and the known probability map is warped to the target frame according to this flow by a bilinear warping function. After that, we employ a convolutional neural network, the mask propagation network, to further refine the coarse estimate. Rather than full-resolution images as in [11, 6], our mask propagation network accepts size-normalized patches that enclose the objects of interest as input and produces a refined probability patch. Using object patches as input allows our model to better cope with objects of different scales. More specifically, we crop the input patches from the full image using the instance bounding box. Then we resize those patches to a fixed size and feed them into the mask propagation network to obtain the probability patch. Finally, the probability patch is resized back to its original size and filled into a full-size zero map to generate the predicted probability map. Unlike full-image based networks [11, 6], our method can easily capture small objects and fine structures.
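The flow-guided warping step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes backward warping, where the flow stored at each target pixel points to the location to sample in the known probability map, and it treats samples outside the image as zero. The function names and the nested-list data layout are hypothetical.

```python
import math


def bilinear_sample(prob, x, y):
    """Bilinearly interpolate `prob` (a list of rows) at real-valued (x, y).

    Neighbours that fall outside the map contribute zero (an assumption).
    """
    height, width = len(prob), len(prob[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= xx < width and 0 <= yy < height:
                value += wx * wy * prob[yy][xx]
    return value


def warp_probability_map(prob, flow):
    """Warp `prob` into the target frame.

    flow[y][x] = (dx, dy) means target pixel (x, y) takes its value from
    position (x + dx, y + dy) in `prob` (backward warping, assumed here).
    """
    height, width = len(prob), len(prob[0])
    return [
        [
            bilinear_sample(prob, x + flow[y][x][0], y + flow[y][x][1])
            for x in range(width)
        ]
        for y in range(height)
    ]
```

In practice the flow would come from an optical flow network and the warping would run on dense arrays; the nested lists here only keep the sketch dependency-free.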
Mask Propagation Network. As shown in Fig. 1, our mask propagation network is a two-stream convolutional neural network, inspired by prior work. However, several important modifications are necessary to further improve the network performance. First, we adopt the much deeper ResNet-101 [4] network to increase the model capacity. Second, as mentioned above, since our network takes patches as input, it is capable of capturing more details than a full-image based network. We also slightly enlarge the bounding box to keep more contextual information. Third, to increase the resolution of the prediction, we enlarge the feature maps by decreasing the convolutional stride and replacing convolutions with dilated convolutions. Atrous spatial pyramid pooling and multi-scale testing are also employed. Last but not least, after the branches are trained independently, the two streams are jointly fine-tuned to further improve the performance.
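The patch handling around the network (enlarging the box, cropping, averaging the two streams, and pasting the result into a zero map) can be sketched as below. This is a hedged illustration under stated assumptions: boxes are `(x0, y0, x1, y1)` in half-open pixel coordinates, the enlargement factor is a hypothetical parameter, and the resize to and from the fixed network input size is omitted for brevity.

```python
import math


def enlarge_box(box, scale, width, height):
    """Expand a box around its centre by `scale` and clip it to the image."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w, half_h = (x1 - x0) * scale / 2.0, (y1 - y0) * scale / 2.0
    return (
        max(0, math.floor(cx - half_w)),
        max(0, math.floor(cy - half_h)),
        min(width, math.ceil(cx + half_w)),
        min(height, math.ceil(cy + half_h)),
    )


def crop(image, box):
    """Cut the region covered by `box` out of a list-of-rows image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def average_streams(rgb_patch, flow_patch):
    """Average the probability patches predicted by the two branches."""
    return [
        [(a + b) / 2.0 for a, b in zip(rgb_row, flow_row)]
        for rgb_row, flow_row in zip(rgb_patch, flow_patch)
    ]


def paste_into_zero_map(patch, box, width, height):
    """Place a patch (already at box size) into a full-size map of zeros."""
    x0, y0, x1, _ = box
    full = [[0.0] * width for _ in range(height)]
    for dy, row in enumerate(patch):
        full[y0 + dy][x0:x1] = row
    return full
```

Working on a box-sized patch keeps small objects at a usable resolution once the patch is resized to the fixed network input, which is the motivation given in the text for patch inputs.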
3.2 Re-identification Module
Our mask propagation module relies on short-term memory and depends heavily on temporal continuity. However, frequent occlusions and large pose variations are common in dynamic scenes and are likely to cause failures in mask propagation. To overcome these issues, we leverage an object re-identification module to retrieve missing instances. The re-identification module incorporates long-term memory, which complements the mask propagation module and makes our system more robust.
As summarized in Algorithm 2, during the iterative refinement in VS-ReID, our re-identification module takes three inputs for instance k in frame i: the frame itself, the current pixel-level probability map predicted in the previous round of inference, and the template of the instance. It produces the retrieved bounding box and the corresponding re-identification score. In this module, we obtain candidate bounding boxes in the frame through a detection network. For each candidate bounding box, the re-identification score between the candidate and the template is computed as the cosine similarity between their features, which are extracted by a re-identification network. The most similar candidate is accepted as the final result only if two conditions are satisfied. First, the candidate is sufficiently similar to the template, that is, its re-identification score is larger than a threshold. Second, the current probability map is not consistent with the candidate; otherwise we do not need to retrieve instance k in frame i. To evaluate the second condition, we compute the IoU between the candidate and the current bounding box derived from the probability map. If the IoU is less than another threshold, the two are inconsistent: we conclude that the prediction made in the previous rounds was wrong and accept the candidate as the retrieved bounding box. Both thresholds are selected on the validation set.
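The acceptance rule described above can be sketched as follows. This is an illustrative sketch, not the authors' code: features are plain lists of floats standing in for the output of a re-identification network, boxes are `(x0, y0, x1, y1)` tuples, passing `current_box=None` for an instance with no current prediction is an assumption, and the threshold values are left to the caller.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def box_iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def retrieve_instance(template_feat, candidates, current_box, sim_thresh, iou_thresh):
    """Return (box, score) for the accepted candidate, or None.

    `candidates` is a list of (box, feature) pairs from a detector.
    """
    if not candidates:
        return None
    scored = [(cosine_similarity(template_feat, feat), box) for box, feat in candidates]
    score, best_box = max(scored, key=lambda item: item[0])
    # Condition 1: the best candidate must be similar enough to the template.
    if score <= sim_thresh:
        return None
    # Condition 2: the current prediction must disagree with the candidate.
    if current_box is not None and box_iou(best_box, current_box) >= iou_thresh:
        return None
    return best_box, score
```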
In this section, we introduce the VS-ReID algorithm, which combines the two preceding components to infer the masks of all instances over the whole video sequence.
As shown in Fig. 2, given a video sequence and the masks (i.e. ground-truth probability maps) of the objects in the first frame, VS-ReID first initializes the probability maps of all frames: we enumerate all instances and propagate their probability maps forward from the first frame to the last frame with the mask propagation module. After initialization, the re-identification module and the mask propagation module are iteratively applied to the whole video sequence until no more high-confidence instances can be found.
To be more specific, we first apply the re-identification module to the whole video for all instances and keep the retrieved bounding box with the highest similarity score. Given this bounding box of a particular instance in a particular frame, we then try to recover the probability map of that instance in that frame. The recovery process is similar to mask propagation, with one difference: there is no guided probability map from adjacent frames, so we replace it with the probability patch of the instance cropped from the first frame. Once we obtain the recovered probability map, we take it as the starting point and use the mask propagation module to bi-directionally recover more probability maps of the instance in adjacent frames.
However, existing probability maps are sometimes impaired during this iterative refinement. An example is shown in Fig. 3 (a). Suppose a video sequence has 6 frames. In the first round of iterative refinement, we retrieve the instance in the first frame and propagate the recovered mask to the end of the sequence. In the second round, we retrieve the instance again in the last frame and propagate backward. In this case, all probability maps predicted in the first round are overwritten. Because of the longer propagation distance, the probability map of the instance in the second frame might be impaired in the second round.
To avoid this issue, we devise a checkpoint mechanism: for each probability map, a new variable records the starting point from which it was last updated, and every probability map prefers to be updated from a closer starting point. As shown in Fig. 3 (b), the backward propagation is interrupted at the fourth frame, since the first frame is closer to the third frame than the last one is. Finally, we combine the probability maps of all instances to generate the mask prediction. The combination is computed for each frame and each pixel location over all instances in the video sequence, and includes a normalizing factor.
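The checkpoint mechanism can be sketched on the 6-frame example as follows. This is a minimal sketch under stated assumptions: `None` stands for "not yet updated" (the initial value is a placeholder chosen here), ties keep the existing starting point, and the actual mask propagation is replaced by bookkeeping of which frames would be updated. All names are hypothetical.

```python
def propagate_with_checkpoint(start_of, start, direction):
    """Propagate from frame `start` in `direction` (+1 forward, -1 backward).

    `start_of[i]` records the starting point that last updated frame i.
    A frame is only overwritten if `start` is strictly closer to it than
    its recorded starting point; otherwise propagation stops.
    Returns the list of frame indices that were updated.
    """
    start_of[start] = start  # the retrieved frame is its own starting point
    updated = []
    i = start + direction
    while 0 <= i < len(start_of):
        previous = start_of[i]
        if previous is not None and abs(i - previous) <= abs(i - start):
            break  # an equally close or closer starting point already exists
        start_of[i] = start
        updated.append(i)
        i += direction
    return updated


# Round 1: start at the first frame (index 0) and propagate forward.
start_of = [None] * 6
propagate_with_checkpoint(start_of, 0, +1)

# Round 2: the instance is retrieved in the last frame (index 5);
# backward propagation stops once the first frame becomes the closer start.
second_round = propagate_with_checkpoint(start_of, 5, -1)
# second_round is [4, 3] and start_of is [0, 0, 0, 5, 5, 5]
```

With zero-based indices, the fourth frame (index 3) is the last one rewritten in the second round, while the third frame keeps the prediction that originated from the first frame.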
3.4 Implementation Details
The two branches of the mask propagation network are first trained individually. The RGB branch is pre-trained on the MS-COCO [8] and PASCAL VOC [3] datasets. During pre-training, we use randomly deformed ground-truth masks as the guided probability maps. Subsequently, the network is fine-tuned on the DAVIS training set. The flow branch is initialized with the RGB branch's weights and fine-tuned on the DAVIS training set. Finally, the two branches are jointly fine-tuned on the DAVIS training and validation sets.
We evaluate VS-ReID on the DAVIS 2017 [13] dataset, which contains 150 video sequences with all frames annotated with high-quality object masks. There are 60 videos in the train set, 30 videos in the val set, 30 videos in the test-dev set and 30 videos in the test-challenge set. In our experiments, we use both the train set and the val set for training, and all results are reported on the test-dev set. Following the DAVIS evaluation protocol, we adopt the region (J) and boundary (F) measures to evaluate the performance.
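As a small illustration of the metrics, the sketch below computes the region measure J as the Jaccard index (intersection over union) of two binary masks, and the global mean as the average of the J mean and F mean. It is not the official evaluation code: the boundary measure F is not implemented, and returning 1.0 when both masks are empty is a convention assumed here.

```python
def region_jaccard(pred, gt):
    """Jaccard index (IoU) between two binary masks given as lists of rows."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Both masks empty: treated as a perfect match (assumed convention).
    return intersection / union if union else 1.0


def global_mean(j_mean, f_mean):
    """Average of the region (J) mean and the boundary (F) mean."""
    return (j_mean + f_mean) / 2.0
```

For example, a J mean of 0.532 and an F mean of 0.577 give a global mean of about 0.5545, which agrees up to rounding with the 0.555 listed for the "+ full-image to bbox" row of the ablation table.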
4.1 Ablation Study
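| Component | Region (J) Mean | Boundary (F) Mean | Global Mean | Boost |
|---|---|---|---|---|
| + full-image to bbox | 0.532 | 0.577 | 0.555 | +0.038 |
| + flow-stream | 0.568 | 0.600 | 0.584 | +0.007 |
| + re-id module | 0.633 | 0.670 | 0.652 | +0.068 |
| + multi-scale testing | 0.644 | 0.678 | 0.661 | +0.009 |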
In this section, we investigate the effect of each component of the VS-ReID model. Table 1 summarizes how the performance improves as each component is added step by step.
We start from a full-image baseline model. After changing the input from the full image to a bounding box, the global mean increases by 0.038 and the boundary (F) measure improves significantly. This demonstrates that the bounding box input overcomes large scale variations and helps capture boundary details. As mentioned in Sec. 3.1, to incorporate temporal information, we train an optical flow branch and jointly fine-tune it with the RGB branch. This two-stream architecture also slightly improves the performance. Employing the iterative refinement introduced in Sec. 3.3 improves the global mean by 0.068, which shows that the re-identification module and iterative refinement are essential. We also visualize example videos that are improved by this iterative refinement in Fig. 5. Once an instance is recovered, it benefits the predictions in adjacent frames. Finally, multi-scale testing further improves the results.
As shown in Table 2, VS-ReID achieves a global mean of 0.699 on the test-challenge set, the best performance in the 2017 DAVIS Challenge. On closer inspection, VS-ReID ranks first on both the J-Mean and F-Mean measures and outperforms the second-place method. Thanks to the re-identification module, which incorporates long-term memory, our J-Decay and F-Decay are also relatively small. In Fig. 5, we show some examples of VS-ReID predictions on the DAVIS test-dev and test-challenge sets.
In this work, we tackle the problem of video object segmentation and explore the utility of object re-identification. We propose the Video Object Segmentation with Re-identification (VS-ReID) model, which includes two dedicated modules: a mask propagation module and a ReID module. We observe that the ReID module combined with bidirectional refinement is capable of retrieving missing instances and greatly improves the performance. The two modules are employed iteratively, enabling our final model to win the DAVIS video segmentation challenge.
- [1] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
- [2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- [3] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2), 2010.
- [4] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- [5] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
- [6] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. Lucid data dreaming for object tracking. arXiv preprint arXiv:1703.09554, 2017.
- [7] X. Li, Z. Liu, P. Luo, C. C. Loy, and X. Tang. Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade. In CVPR, 2017.
- [8] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- [9] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
- [10] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Deep learning Markov random field for semantic segmentation. TPAMI, 2017.
- [11] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In CVPR, 2017.
- [12] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
- [13] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
- [14] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- [15] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. Joint detection and identification feature learning for person search. In CVPR, 2017.
- [16] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. ICNet for real-time semantic segmentation on high-resolution images. arXiv preprint arXiv:1704.08545, 2017.
- [17] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.