Video Object Segmentation with Re-identification

08/01/2017 · by Xiaoxiao Li, et al.

Conventional video segmentation methods often rely on temporal continuity to propagate masks. Such an assumption suffers from issues like drifting and an inability to handle large displacement. To overcome these issues, we formulate an effective mechanism to prevent the target from being lost via adaptive object re-identification. Specifically, our Video Object Segmentation with Re-identification (VS-ReID) model includes a mask propagation module and a ReID module. The former produces an initial probability map by flow warping, while the latter retrieves missing instances by adaptive matching. With these two modules iteratively applied, our VS-ReID records a global mean (Region Jaccard and Boundary F measure) of 0.699, the best performance in the 2017 DAVIS Challenge.


1 Introduction

Video object segmentation in the 2017 DAVIS Challenge [13] is non-trivial: a video typically contains more than one annotated object, with many distractors, small objects and fine structures. The complexity of the problem increases with severe inter-object occlusions and fast motion.

Conventional approaches that rely on temporal continuity suffer from issues like drifting and an inability to handle large displacement. To overcome these issues, we formulate an effective mechanism to prevent the target from being lost via adaptive object re-identification. Specifically, our Video Object Segmentation with Re-identification (VS-ReID) model includes a mask propagation module and a ReID module. The mask propagation module is a two-stream convolutional neural network, inspired by [11]. The RGB branch of the mask propagation module accepts a bounding box and a guided probability map as input, and produces a segmentation mask for the main instance as output. The guided probability map is obtained from adjacent frames' predictions by flow warping. In addition to the RGB branch, we also train an optical flow branch to incorporate the temporal information. The final segmentation mask of the image patch is obtained by averaging the predictions of these two branches.

To cope with frequent occlusions and large pose variations in dynamic scenes, we leverage an object re-identification module to retrieve missing instances. Specifically, when missing instances are re-identified with high confidence, they are assigned a higher priority to be recovered during the mask propagation process. For each retrieved instance, we take its frame as the starting point and use the mask propagation module to bi-directionally generate the probability maps in its adjacent frames.

With the updated probability maps, the mask propagation module and the ReID module of VS-ReID are iteratively applied to the whole video sequence until no more high-confidence instances can be found. Finally, for each frame, the instance segmentation results are obtained by merging the probability maps of all the instances. With both flow warping to ensure temporal continuity and object re-identification to recover missing objects, VS-ReID records a global mean (Region Jaccard and Boundary F measure) of 0.699, the best performance in the 2017 DAVIS Challenge.

2 Related Work

The realm of object segmentation has witnessed drastic progress in recent years, including the marriage of deep learning and graphical models [17, 9] and efforts to enable real-time inference on high-resolution images [7, 16]. Since most visual sensory data are videos, it is crucial to extend object segmentation from images to videos. Existing video segmentation methods [10, 11] rely on temporal continuity to establish spatio-temporal correlation. However, real-life videos exhibit severe deformation and occlusion, which makes such an assumption prone to drifting and an inability to handle large object displacement. In this work, we propose a novel method, Video Object Segmentation with Re-identification (VS-ReID), to overcome these issues.

3 Approach

Our VS-ReID model includes a mask propagation module and a re-identification module. The mask propagation module propagates the probability map from a predicted frame to its adjacent frames. Meanwhile, we employ the re-identification module to retrieve instances that go missing during the mask propagation process. The two modules are iteratively applied to the whole video sequence. Next, we first present the two modules in Sec. 3.1 and Sec. 3.2, and then introduce the full VS-ReID algorithm in Sec. 3.3.

procedure MaskProp(I_{i-1}, I_i, P_{i-1,k})
    P̃_{i,k} ← 0                                            ▷ initialize
    F_{i→i-1} ← FlowNet2.0(I_i, I_{i-1})                    ▷ extract the optical flow
    P̃_{i,k} ← W(P_{i-1,k}, F_{i→i-1})                       ▷ flow guided warp
    b_{i,k} ← enlarged box enclosing P̃_{i,k}                ▷ obtain the bounding box
    P_{i,k} ← N_prop(crop(I_i, b_{i,k}), crop(P̃_{i,k}, b_{i,k}), crop(F_{i→i-1}, b_{i,k})), resized back and pasted into a full-size zero map
    return P_{i,k}
Algorithm 1 Mask propagation for single object

3.1 Mask Propagation Module

The inference algorithm of mask propagation is summarized in Algorithm 1. Given two adjacent frames $I_{i-1}$ and $I_i$, and the pixel-level probability map $P_{i-1,k}$ for instance $k$ in frame $I_{i-1}$, we aim to predict the probability map $P_{i,k}$ for instance $k$ in frame $I_i$.

Following [11, 6], we first obtain a coarse estimation of $P_{i,k}$, denoted $\tilde{P}_{i,k}$, from $P_{i-1,k}$ by flow guided warping. We use FlowNet2.0 [5] to extract the optical flow $F_{i \to i-1}$ between frames $I_i$ and $I_{i-1}$. The probability map $P_{i-1,k}$ is warped to $\tilde{P}_{i,k}$ according to $F_{i \to i-1}$ by a bilinear warping function $W(\cdot,\cdot)$. After that, we employ a convolutional neural network, the mask propagation network $\mathcal{N}_{prop}$, to further refine the coarse estimation. Rather than taking full-resolution images as input as in [11, 6], our mask propagation network accepts size-normalized patches that enclose the objects of interest and produces a refined probability patch. Using object patches as input allows our model to better cope with objects of different scales. More specifically, we crop the image, optical flow and probability patches from the full image by the instance bounding box $b_{i,k}$ that encloses $\tilde{P}_{i,k}$. Then we resize these patches to a fixed size and feed them into the mask propagation network to get the refined probability patch. Finally, the refined patch is resized back to its original size and filled into a full-size zero map to generate the prediction $P_{i,k}$. Unlike full-image based networks [11, 6], our method can easily capture small objects and fine structures.
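To make the propagation step concrete, the following sketch outlines the flow-guided warping and patch-based refinement described above. It is a minimal PyTorch-style illustration, not the authors' released code: flownet and mask_prop_net stand in for FlowNet2.0 and the mask propagation network $\mathcal{N}_{prop}$, and the patch size, box margin and flow channel order are placeholder assumptions.

import torch
import torch.nn.functional as F

def flow_warp(prob_map, flow):
    # Bilinearly warp a (1, 1, H, W) probability map with a (1, 2, H, W) backward
    # flow field (frame i -> frame i-1, channels x then y); plays the role of W(., .).
    _, _, h, w = prob_map.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).float()   # (1, 2, H, W) pixel coords
    coords = grid + flow                                       # where to sample in frame i-1
    coords[:, 0] = 2.0 * coords[:, 0] / (w - 1) - 1.0          # normalize x to [-1, 1]
    coords[:, 1] = 2.0 * coords[:, 1] / (h - 1) - 1.0          # normalize y to [-1, 1]
    return F.grid_sample(prob_map, coords.permute(0, 2, 3, 1), align_corners=True)

def crop(tensor, box):
    x0, y0, x1, y1 = box
    return tensor[..., y0:y1, x0:x1]

def bounding_box(mask, margin=20):
    # Enlarged (x0, y0, x1, y1) box around the foreground of a (1, 1, H, W) mask.
    _, _, h, w = mask.shape
    ys, xs = torch.nonzero(mask[0, 0] > 0.5, as_tuple=True)
    return (max(int(xs.min()) - margin, 0), max(int(ys.min()) - margin, 0),
            min(int(xs.max()) + margin, w), min(int(ys.max()) + margin, h))

def propagate_mask(frame_prev, frame_cur, prob_prev, flownet, mask_prop_net,
                   patch_size=(385, 385)):
    # One propagation step for a single instance (cf. Algorithm 1).
    # frame_prev / frame_cur: (1, 3, H, W) images; prob_prev: (1, 1, H, W).
    flow = flownet(frame_cur, frame_prev)                 # optical flow, (1, 2, H, W)
    coarse = flow_warp(prob_prev, flow)                   # coarse estimate by warping
    box = bounding_box(coarse)                            # enlarged box around the warped mask
    rgb_patch  = F.interpolate(crop(frame_cur, box), size=patch_size, mode="bilinear")
    prob_patch = F.interpolate(crop(coarse, box),    size=patch_size, mode="bilinear")
    flow_patch = F.interpolate(crop(flow, box),      size=patch_size, mode="bilinear")
    refined = mask_prop_net(rgb_patch, prob_patch, flow_patch)   # refined probability patch
    prob_cur = torch.zeros_like(prob_prev)                # paste back into a full-size zero map
    x0, y0, x1, y1 = box
    prob_cur[..., y0:y1, x0:x1] = F.interpolate(refined, size=(y1 - y0, x1 - x0),
                                                mode="bilinear")
    return prob_cur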

Figure 1: Network architecture of mask propagation network. Best viewed in color.
Figure 2: Pipeline of our Video Object Segmentation with Re-identification (VS-ReID) model. Best viewed in color.

Mask Propagation Network. As shown in Fig. 1, our mask propagation network is a two-stream convolutional neural network, inspired by [6]. However, several important modifications are necessary to further improve the network performance. First, we adopt the much deeper ResNet-101 [4] network to increase the model capacity. Second, as mentioned before, since our network takes patches as input, it is capable of capturing more details compared with full-image based networks. We also slightly enlarge the bounding box to keep more contextual information. Third, to increase the resolution of the prediction, we enlarge the feature maps by decreasing the convolutional stride and replacing convolutions with dilated convolutions. Similar to [1], atrous spatial pyramid pooling and multi-scale testing are also employed. Last but not least, after independent branch training, the two streams are jointly fine-tuned to further improve the performance.
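The resolution trick described above (reducing the convolutional stride and dilating convolutions in the deeper stages) is a standard DeepLab-style modification. The snippet below sketches it on torchvision's ResNet-101 purely as an illustration of the technique; it is not the authors' implementation, and the dilation rates are the usual defaults rather than values reported in the paper.

import torch.nn as nn
from torchvision.models import resnet101

def make_stage_dilated(stage, dilation):
    # Set stride 1 in every convolution of the stage and dilate its 3x3 kernels,
    # so spatial resolution is preserved while the receptive field is kept.
    for module in stage.modules():
        if isinstance(module, nn.Conv2d):
            if module.stride == (2, 2):
                module.stride = (1, 1)
            if module.kernel_size == (3, 3):
                module.dilation = (dilation, dilation)
                module.padding = (dilation, dilation)

backbone = resnet101(weights=None)
make_stage_dilated(backbone.layer3, dilation=2)   # stop downsampling after layer2
make_stage_dilated(backbone.layer4, dilation=4)   # overall output stride: 32 -> 8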

procedure ReID(I_i, P_{i,k}, t_k)
    B ← N_det(I_i)                                           ▷ obtain the candidate boxes
    for each candidate box b ∈ B do
        s(b) ← cos(N_reid(crop(I_i, b)), N_reid(t_k))        ▷ cos(·,·) denotes the cosine similarity
    b* ← arg max_{b ∈ B} s(b)
    b_cur ← bounding box of the current prediction P_{i,k}
    if s(b*) > τ_reid and IoU(b*, b_cur) < τ_iou then
        return (b*, s(b*))
    else
        return fail or unnecessary
Algorithm 2 Re-identification module

3.2 Re-identification Module

Our mask propagation module is based on short-term memory, and it highly relies on temporal continuity. However, frequent occlusions and large pose variations are very common in dynamic scenes and are likely to cause failures in mask propagation. To overcome these issues, we leverage an object re-identification module to retrieve missing instances. The re-identification module incorporates long-term memory, which complements the mask propagation module and makes our system more robust.

As summarized in Algorithm 2, during the iterative refinement in VS-ReID, our re-identification module takes as input a single frame $I_i$, the current pixel-level probability map $P_{i,k}$ for instance $k$ in frame $I_i$ predicted in the previous round of inference, and the template $t_k$ of instance $k$; it produces the retrieved bounding box and the corresponding re-identification score. In this module, we obtain the candidate bounding boxes in frame $I_i$ through a detection network $\mathcal{N}_{det}$. For each candidate bounding box $b$, the re-identification score between $b$ and $t_k$ is computed as the cosine similarity between their features extracted by a re-identification network $\mathcal{N}_{reid}$. Let $b^*$ be the most similar candidate bounding box; it is only accepted as the final result if two conditions are satisfied. First, $b^*$ is sufficiently similar to the template $t_k$, that is, the re-identification score between $b^*$ and $t_k$ is larger than a threshold $\tau_{reid}$. Second, the current prediction $P_{i,k}$ is not consistent with $b^*$; otherwise we do not need to retrieve instance $k$ in frame $i$. To evaluate this condition, we compute the IoU between $b^*$ and the current bounding box derived from $P_{i,k}$. If it is less than another threshold $\tau_{iou}$, which means they are inconsistent, we conclude that we made a wrong prediction of $P_{i,k}$ in the previous rounds and accept $b^*$ as the retrieved bounding box. Both thresholds are selected on the validation set.
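The acceptance test can be made concrete with the short sketch below. It is a hypothetical illustration: detector, reid_net, and the threshold values tau_reid and tau_iou are placeholders for $\mathcal{N}_{det}$, $\mathcal{N}_{reid}$ and the validation-selected thresholds, not values from the paper.

import numpy as np

def iou(box_a, box_b):
    # Intersection-over-union of two (x0, y0, x1, y1) boxes.
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def reidentify(frame, current_box, template_feat, detector, reid_net,
               tau_reid=0.8, tau_iou=0.5):
    # Return (box, score) for a re-identified instance, or None if the best candidate
    # is not similar enough or is already consistent with the current prediction.
    best_box, best_score = None, -1.0
    for box in detector(frame):                       # candidate bounding boxes
        feat = reid_net(frame, box)                   # re-id feature of the candidate
        score = float(np.dot(feat, template_feat) /
                      (np.linalg.norm(feat) * np.linalg.norm(template_feat)))
        if score > best_score:
            best_box, best_score = box, score
    if best_box is None:
        return None
    # Accept only if similar to the template AND inconsistent with the current box.
    if best_score > tau_reid and iou(best_box, current_box) < tau_iou:
        return best_box, best_score
    return None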

Detection & Re-identification Network. We directly adopt Faster R-CNN [14] as our detection network $\mathcal{N}_{det}$. For the re-identification network $\mathcal{N}_{reid}$, we employ the architecture of the 'Identification Net' in [15] and retrain this network for the general object re-identification task.

procedure VS-ReID({I_1, ..., I_N}, {P_{1,1}, ..., P_{1,K}})
    N ← number of frames
    K ← number of instances
    for k = 1 to K do
        t_k ← patch of instance k cropped from I_1 by P_{1,k}       ▷ obtain the template from the first frame
    for i = 2 to N do                                               ▷ initialize probability maps
        for k = 1 to K do
            P_{i,k} ← MaskProp(I_{i-1}, I_i, P_{i-1,k})
            c_{i,k} ← 1                                             ▷ checkpoint: starting frame of P_{i,k}
    loop
        (i*, k*, b*, s*) ← (0, 0, ∅, 0)
        for i = 2 to N do                                           ▷ retrieve instances
            for k = 1 to K do
                (b, s) ← ReID(I_i, P_{i,k}, t_k)
                if ReID did not fail and s > s* then
                    (i*, k*, b*, s*) ← (i, k, b, s)
        if s* = 0 then
            break                                                   ▷ no instance retrieved
        else
            P_{i*,k*} ← MaskProp inside b*, with the template patch t_{k*} as guide    ▷ recover
            c_{i*,k*} ← i*
            for i = i* + 1 to N do                                  ▷ forward propagate
                if |i − i*| < |i − c_{i,k*}| then
                    P_{i,k*} ← MaskProp(I_{i-1}, I_i, P_{i-1,k*})
                    c_{i,k*} ← i*
            for i = i* − 1 downto 2 do                              ▷ backward propagate
                if |i − i*| < |i − c_{i,k*}| then
                    P_{i,k*} ← MaskProp(I_{i+1}, I_i, P_{i+1,k*})
                    c_{i,k*} ← i*
    return {P_{i,k}}
Algorithm 3 VS-ReID algorithm
Figure 3: Existing probability maps might be impaired during the iterative refinement. Therefore, we devise a checkpoint mechanism to avoid this issue. Best viewed in color.

3.3 VS-ReID

In this section, we introduce the VS-ReID algorithm, which combines the previous two components to infer the masks of all instances over the whole video sequence.

As shown in Fig. 2, given a video sequence $\{I_1, \dots, I_N\}$ and the masks (i.e. ground-truth probability maps) of the objects in the first frame, VS-ReID first initializes the probability maps $P_{i,k}$. We enumerate all instances and forward propagate their probability maps from the first frame to the last frame by the mask propagation module. After initialization, the re-identification module and the mask propagation module are iteratively applied to the whole video sequence until no more high-confidence instances can be found. To be more specific, we first apply the re-identification module to the whole video for all instances. We keep the retrieved bounding box with the highest similarity score. Suppose it is the bounding box of instance $k$ in frame $i$; we then try to recover the probability map $P_{i,k}$ of instance $k$ in frame $i$. The recovery process is quite similar to the process of mask propagation, with one difference: there is no guided probability map from adjacent frames, so we replace it with the probability patch of instance $k$ cropped from the first frame. Once we obtain the recovered probability map, we take it as the starting point and use the mask propagation module to bi-directionally recover more probability maps of instance $k$ in adjacent frames.

However, existing probability maps may be impaired during this iterative refinement. An example is shown in Fig. 3 (a): suppose we have 6 frames in a video sequence. In the first round of iterative refinement, we retrieve the instance in the first frame and propagate the recovered mask to the end of the video sequence. In the second round, we retrieve the instance again in the last frame and perform backward propagation. In this case, all probability maps predicted in the first round would be overwritten, and because of the longer propagation distance, the probability map for the instance in the second frame might be impaired in the second round. To avoid this issue, we devise a checkpoint mechanism with a new variable $c_{i,k}$ recording the starting point by which $P_{i,k}$ was updated. The initial value of $c_{i,k}$ is 1, and every probability map prefers to be updated from a closer starting point. As shown in Fig. 3 (b), the backward propagation is interrupted at the fourth frame, since the first frame is closer to the third frame than the last one is. Finally, we combine all $P_{i,k}$ to generate the mask prediction $M_i$ through:

$$M_i(p) = \operatorname*{arg\,max}_{k \in \{1,\dots,K\}} \frac{P_{i,k}(p)}{Z}, \qquad Z = \sum_{k'=1}^{K} P_{i,k'}(p),$$

where $Z$ is a per-pixel normalizing factor, $i$ is the frame index, $p$ is a pixel's location, and $K$ is the number of instances in the video sequence.
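One way to read the merge above in code: normalize the per-instance probability maps at each pixel and assign the pixel to the most likely instance. This is a minimal NumPy sketch; the background rule (threshold 0.5) is our assumption, since the paper does not specify how background pixels are handled.

import numpy as np

def merge_probability_maps(prob_maps, background_thresh=0.5, eps=1e-8):
    # prob_maps: (K, H, W) array with one propagated probability map per instance.
    # Returns an (H, W) label map: 0 for background (assumed rule), k for instance k.
    z = prob_maps.sum(axis=0) + eps                        # per-pixel normalizing factor Z
    labels = (prob_maps / z).argmax(axis=0) + 1            # instances are labelled 1..K
    labels[prob_maps.max(axis=0) < background_thresh] = 0  # assumed background handling
    return labels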

3.4 Implementation Details

The two branches of the mask propagation network are first trained individually. The RGB branch is pre-trained on the MS-COCO [8] and PASCAL VOC [3] datasets. During pre-training, we use a randomly deformed ground-truth mask as the guided probability map. Subsequently, the network is fine-tuned on the DAVIS training set. The flow branch is initialized with the RGB branch's weights and fine-tuned on the DAVIS training set. Finally, the two branches are jointly fine-tuned on the DAVIS training and validation sets.
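The paper does not detail how the ground-truth mask is "randomly deformed"; the sketch below shows one plausible form of this augmentation (random affine jitter followed by dilation), with arbitrary jitter ranges, purely to illustrate the idea.

import numpy as np
import cv2

def deform_mask(mask, max_shift=0.05, max_scale=0.1, max_dilate=4):
    # Turn a binary ground-truth mask (H, W) into a noisy guided probability map.
    # The jitter ranges below are illustrative choices, not values from the paper.
    h, w = mask.shape
    dx = np.random.uniform(-max_shift, max_shift) * w
    dy = np.random.uniform(-max_shift, max_shift) * h
    s = 1.0 + np.random.uniform(-max_scale, max_scale)
    affine = np.float32([[s, 0, dx + (1 - s) * w / 2],
                         [0, s, dy + (1 - s) * h / 2]])
    deformed = cv2.warpAffine(mask.astype(np.float32), affine, (w, h))
    k = 2 * np.random.randint(1, max_dilate + 1) + 1       # odd structuring-element size
    deformed = cv2.dilate(deformed, np.ones((k, k), np.uint8))
    return np.clip(deformed, 0.0, 1.0)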

The detection and re-identification networks are trained on the ImageNet [2] dataset, following the training strategies in the original papers [14, 15]. In particular, for the person category, we directly use the network in [15] as our re-identification network.

4 Experiments

We evaluate our VS-ReID on the DAVIS 2017 [13] dataset. The DAVIS 2017 dataset contains 150 video sequences with all frames annotated with high-quality object masks. There are 60 videos in the train set, 30 in the val set, 30 in the test-dev set and 30 in the test-challenge set. In our experiments, we employ both the train set and the val set for training, and all performances are reported on the test-dev set. Following [12], we adopt the region ($\mathcal{J}$) and boundary ($\mathcal{F}$) measures to evaluate the performance.
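For reference, the region measure $\mathcal{J}$ is the Jaccard index (intersection over union) between the predicted and ground-truth masks of an object; a minimal implementation of this measure (not the official DAVIS evaluation code) is:

import numpy as np

def region_jaccard(pred_mask, gt_mask):
    # Region measure J: intersection-over-union of two binary masks.
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0          # both masks empty: count as perfect agreement
    return float(np.logical_and(pred, gt).sum()) / float(union)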

4.1 Ablation Study

                       J-mean   F-mean   global-mean   boost
baseline [11]          0.509    0.526    0.517         -
+ full-image to bbox   0.532    0.577    0.555         +0.038
+ flow-stream          0.568    0.600    0.584         +0.007
+ re-id module         0.633    0.670    0.652         +0.068
+ multi-scale testing  0.644    0.678    0.661         +0.009
Table 1: Ablation study of each module in VS-ReID.

In this section, we investigate the effect of each component in the VS-ReID model. Table 1 summarizes how the performance improves as each component is added step by step to VS-ReID.

We choose [11] as our baseline model. After modifying the input from the full image to a bounding box, the global-mean increases by 0.038 and the boundary ($\mathcal{F}$) measure achieves a significant improvement of 0.051. This demonstrates that the bounding box input overcomes large scale variations and helps capture boundary details. As mentioned in Sec. 3.1, to incorporate temporal information, we train an optical flow branch and jointly fine-tune it with the RGB branch. This two-stream architecture also slightly improves the performance. Employing the iterative refinement introduced in Sec. 3.3 greatly improves the global-mean by 0.068, which shows that the re-identification module and iterative refinement are essential. We also visualize example videos that are improved by this iterative refinement in Fig. 5. Once an instance is recovered, it benefits the adjacent frames' predictions. Finally, multi-scale testing further improves the results.

Figure 4: Missing instances are retrieved by re-identification module. We annotate the retrieved instances by blue bounding boxes. Best viewed in color.
Measure Ours Apata Vanta Haamo Voigt Lalal Cjc YXLKJ Wasid Froma Zwrq0 Drbea Anews Ilanv Koh Make Kozab Xn881 Zpd Griff Nitin Team5
Ranking 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Global Mean 69.9 67.8 63.8 61.5 57.7 56.9 56.9 55.8 54.8 53.9 53.6 51.9 50.9 49.7 49.1 48.0 47.8 47.6 47.1 42.0 25.6 11.2
J-Mean 67.9 65.1 61.5 59.8 54.8 54.8 53.6 53.8 51.6 50.7 50.5 50.5 49.0 46.0 45.9 46.3 43.9 47.8 44.9 40.6 24.9 11.8
J-Recall 74.6 72.5 68.6 71.0 60.8 60.7 59.5 60.1 56.3 54.9 54.9 56.4 55.1 49.3 50.2 50.0 45.8 56.3 48.0 42.1 12.3 7.3
J-Decay 22.5 27.7 17.1 21.9 31.0 34.4 25.3 37.7 26.8 32.5 28.0 34.1 21.3 33.1 36.1 40.2 33.0 16.7 31.8 37.4 13.1 14.0
F-Mean 71.9 70.6 66.2 63.2 60.5 59.1 60.2 57.8 57.9 57.1 56.7 53.3 52.8 53.3 52.3 49.7 51.6 47.3 49.3 43.3 26.3 10.6
F-Recall 79.1 79.8 79.0 74.6 67.2 66.7 67.9 62.1 64.8 63.2 63.5 57.9 58.3 58.4 57.1 52.8 56.0 53.0 54.4 43.2 9.1 3.0
F-Decay 24.1 30.2 17.6 23.7 34.7 36.1 27.6 42.9 28.8 33.7 30.4 39.5 23.7 36.4 39.2 44.8 36.3 21.6 36.2 40.2 13.0 12.6
Table 2: Results on 2017 DAVIS Challenge test-challenge set.

4.2 Benchmark

As shown in Table 2, VS-ReID achieves a global mean of 0.699 on the test-challenge set, the best performance in the 2017 DAVIS Challenge. On closer inspection, we observe that VS-ReID wins on both the $\mathcal{J}$-Mean and $\mathcal{F}$-Mean measures and outperforms the second-place method by more than 2% in global mean. Thanks to the re-identification module that incorporates long-term memory, our $\mathcal{J}$-Decay and $\mathcal{F}$-Decay are also relatively small. In Fig. 5, we show some examples of VS-ReID predictions on the DAVIS test-dev and test-challenge sets.

Figure 5: Qualitative results of our VS-ReID model on DAVIS 2017 test-dev set and test-challenge set.

5 Conclusion

In this work, we tackle the problem of video object segmentation and explore the utility of object re-identification. We propose the Video Object Segmentation with Re-identification (VS-ReID) model, which includes two dedicated modules: a mask propagation module and a ReID module. We observe that the ReID module, combined with bidirectional refinement, is capable of retrieving missing instances and greatly improves the performance. The two modules are applied iteratively, enabling our final model to win the 2017 DAVIS video segmentation challenge.

References