Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement

10/15/2020 ∙ by Yongqing Liang, et al. ∙ Northeastern University Louisiana State University 0

We propose a new matching-based framework for semi-supervised video object segmentation (VOS). Recently, state-of-the-art VOS performance has been achieved by matching-based algorithms, in which feature banks are created to store features for region matching and classification. However, how to effectively organize information in the continuously growing feature bank remains under-explored, and this leads to inefficient design of the bank. We introduce an adaptive feature bank update scheme to dynamically absorb new features and discard obsolete features. We also design a new confidence loss and a fine-grained segmentation module to enhance the segmentation accuracy in uncertain regions. On public benchmarks, our algorithm outperforms existing state-of-the-arts.



There are no comments yet.


page 3

page 6

page 7

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Video object segmentation (VOS) is a fundamental step in many video processing tasks, like video editing and video inpainting. In the semi-supervised setting, the first frame annotation is given, which depicts the objects of interest of the video sequence. The goal is to segment mask of that object in the subsequent frames. Many deep learning based methods have been proposed to solve this problem in recent years. When people tackle the semi-supervised VOS task, the segmentation performance is affected by two main steps: (1) distinguish the object regions from the background, (2) segment the object boundary clearly.

A key question in VOS is how to learn the cues of target objects. We divide recent works into two categories, implicit learning and explicit learning. Conventional implicit approaches include detection-based and propagation-based methods Caelles et al. (2017); Perazzi et al. (2017); Hu et al. (2017); Voigtlaender and Leibe (2017); Bao et al. (2018); Khoreva et al. (2017); Maninis et al. (2019). They often adopt the fully convolutional network (FCN) Long et al. (2015) pipeline to learn object features by the network weights implicitly; then, before segmenting a new video, these methods often need an online learning to fine-tune their weights to learn new object cues from the video. Explicit approaches learn object appearance explicitly. They often formulate the segmentation as pixel-wise classification in a learnt embedding space Voigtlaender et al. (2019); Chen et al. (2018); Oh et al. (2018); Hu et al. (2018b); Wang et al. (2019); Lin et al. (2019); Johnander et al. (2019); Oh et al. (2019); Liang et al. (2020). These approaches first construct an embedding space to memorize the object appearance, then segment the subsequent frames by computing similarity. Therefore, they are also called matching-based methods. Recently, matching-based methods achieve the state-of-the-art results in the VOS benchmark.

A fundamental issue in matching-based VOS segmentation is how to effectively exploit previous frames’ information to segment the new frame. Since the memory size is limited, it is not possible and unnecessary to memorize information from all the previous frames. Most methods Oh et al. (2018); Wang et al. (2019); Voigtlaender et al. (2019); Lin et al. (2019) only utilize the first and the latest frame. However, when the given video becomes longer, these methods often either miss sampling on some key-frames or encounter out-of-memory crash. To tackle this problem, we propose an adaptive feature bank (AFB) to organize the target object features. This adaptive feature bank absorbs new features by weighted averaging and discards obsolete features according to the least frequently used (LFU) index. As results, our model could memorize the characteristics of multi objects and segment them simultaneously in long videos under a low memory consumption.

Besides identifying the target object, clearly segmenting object boundary is also critical to VOS performance: (1) People are often sensitive to boundary segmentation. (2) When estimated masks on some boundary regions are ambiguous and hard to classify, their misclassificaton is easily accumulated in video. However, most recent VOS methods follow an encoder-decoder mode to estimate the object masks, the boundary of the object mask becomes vague when it is iteratively upscaled from a lower resolution. Therefore, we propose an

uncertain-region refinement (URR) scheme to improve the segmentation quality. It includes a novel classification confidence loss to estimate the ambiguity of segmentation, and a local fine-grained segmentation to refine the ambiguous regions.

Our main contributions of this work are three-folded: (1) We propose an adaptive and efficient feature bank to maintain most useful information for video object segmentation. (2) We propose a confidence loss to estimate the ambiguity of the segmentation results. We also design a local fine-grained segmentation module to refine these ambiguous regions. (3) We demonstrate the effectiveness of our method on segmenting long videos, which are often seen in practical applications.

2 Related Work

Recent video object segmentation works can be divided into two categories: implicit learning and explicit learning. The implicit learning approaches include detection-based methods Caelles et al. (2017); Maninis et al. (2019) which segment the object mask without using temporal information, and propagation-based methods Perazzi et al. (2017); Hu et al. (2017); Voigtlaender and Leibe (2017); Bao et al. (2018); Khoreva et al. (2017); Hu et al. (2018a) which use masks computed in previously frames to infer masks in the current frame. These methods often adopt a fully convolutional network (FCN) structure to learn object appearance by network weights implicitly; so they often require an online learning to adapt to new objects in the test video.

The explicit learning methods first construct an embedding space to memorize the object appearance, then classify each pixel’s label using their similarity. Thus, the explicit learning is also called matching-based method. A key issue in matching-based VOS segmentation is how to build the embedding space. DMM Zeng et al. (2019) only uses the first frame’s information. RGMP Oh et al. (2018), FEELVOS Voigtlaender et al. (2019), RANet Wang et al. (2019) and AGSS Lin et al. (2019) store information from the first and the latest frames. VideoMatch Hu et al. (2018b) and WaterNet Liang et al. (2020) store information from several latest frames using a slide window. STM Oh et al. (2019) stores features every frames ( in their experiments). However, when the video to segment is long, these static strategies could encounter out-of-memory crashes or miss sampling key-frames. Our proposed adaptive feature bank (AFB) is a first non-uniform frame-sampling strategy in VOS that can more flexibly and dynamically manage objects’ key features in videos. AFB performs dynamic feature merging and removal, and can handle videos with any length effectively.

Recent image segmentation techniques introduce fine-grained modules to improve local accuracy. ShapeMask Kuo et al. (2019) revise the decoder from FCN to refine the segmentation. PointRend Kirillov et al. (2020) defines uncertainty on a binary mask, and does a one-pass detection and refinement on uncertain regions. We proposed an uncertainty-region refinement (URR) strategy to perform boundary refinement in video segmentation. URR includes (1) a more general multi-object uncertainty score for estimated masks, (2) a novel confidence loss to generate cleaner masks, and (3) a non-local mechanism to more reliably refine uncertain regions.

Figure 1: An overview of the proposed algorithm. We use a typical matching-based module to estimate the initial segmentation, marked in blue. An adaptive feature bank is proposed to organize the feature space. In the red region, a novel uncertain-region refinement mechanism is design for fine-grained segmentation.

3 Approach

The overview of our framework is illustrated in Fig. 1. First, as shown in blue region, we use a basic pipeline of matching-based segmentation (Sec. 3.1) to generate initial segmentation masks. In Sec. 3.2, we propose an adaptive feature bank module to dynamically organize the past frame information. In Sec. 3.3, given the initial segmentation, we design a confidence loss to estimate the ambiguity of misclassification, and a fine-grained module to classify the uncertain regions. These two components are marked in the red region in Fig. 1.

3.1 Matching-based segmentation

Given an evaluation video, we encode the the first frame and its groundtruth annotation to build the feature bank. Then, we use the feature bank to match and segment the target objects starting from the second frame. The decoder takes the output of the matching results to estimate the frame’s object masks.


A query encoder is designed to encode the current frame, named query frame, to its feature map for segmentation. We use the ResNet-50 He et al. (2016) as backbone and takes the output of the layer-3 as feature map , where and are the height and width.

For segmenting the th frame, we treat the past frames from 1 to as reference frames. A reference encoder is designed for memorizing the characteristics of the target objects. Suppose there are objects of interest, we encode the reference frame object by object and output feature maps , . The reference encoder is a modification of the original ResNet-50. For each object , it takes both the reference frame and its corresponding mask as inputs, then extracts object-level feature map, . Combining the object-level feature maps together, we obtain the feature maps of the reference frame at index , , where .

Feature map embedding.

Traditional matching-based methods directly compare the feature maps of the query feature map and the reference feature maps. Although such design is good for classification, they are lack of semantic information to estimate the object masks. Inspired by STM Oh et al. (2019), we utilize the similar feature map embedding module. The feature maps are encoded into two embedding spaces, named key and value by two convolutional modules. Specifically, we match the feature maps by their keys , while allow their values being different in order to preserve as much semantic information as possible. The feature bank stores pairs of keys and values of the past frames. In the next section, we compare the feature maps of the current frame with the feature bank to estimate the object masks. The details of maintaining the feature bank are elaborated in Section 3.2.


A query frame is encoded into pairs of key and value through the query encoder and feature map embedding. We maintain feature banks from the past frames for each object . The similarity between the query frame and the feature banks is calculated object by object. For each point in the query frame, we use a weighted summation to retrieve the closet value in the th object feature bank,


where , and is the softmax function , where

represents the dot production between two vectors. We concatenate the query value map with its most similar retrieval value map as

, where is the matching results between the query frame and the feature banks for the object .


The decoder takes the output of the matcher to estimate the object mask independently, where depicts the semantic information for the object . We follow the refinement module used in Oh et al. (2018); Lin et al. (2019); Oh et al. (2019) that gradually upscales the feature map by a set of residual convolutional blocks. At each stage, the refinement module takes both the output of the previous stage and a feature map from the query encoder at corresponding scale through skip connections. After the decoder module, we obtain the initial object masks for each object . We minimize the cross entropy loss between the object masks and the groundtruth labels ,


3.2 Adaptive feature bank

We build a feature bank to store features of each object and classify new pixels/regions in the current frame . Storing features from all the previous frames is impossible, because it will make the bank prohibitively big (as the length of the video clip grows) and make the query slow. Recent approaches either store the features from every few frames or from the several latest frames. For example, STM Oh et al. (2019) uniformly stores every one of frames in feature bank; and on an NVIDIA 1080Ti card with 11GB Memory, this can only handle a single-object video at most frames. Practical videos are often longer (e.g., average YouTube video has about 12 minutes or 22K frames); and for a 10-min video, STM needs to set

, which will probably miss many important frames and information. Hence, we propose an

adaptive feature bank (AFB) to more effectively manage object’s key features. AFB contains two operations: absorbing new features and removing obsolete ones.

Absorbing new features.

In video segmentation, although features from most recent frames are often more important, earlier frames may contain useful information. Therefore, rather than simply ignoring those earlier frames, we keep earlier features and organize all the features by a weighted averaging (which has been shown effective for finding optimal data representations such as Neural Gas Fritzke (1995)). Fig. 2 illustrates how the adaptive feature bank absorbs new features. Existing features and new features are marked in blue and red, separately. When a new feature is extracted, if it is close enough to some existing one, then merge them. Such a merge avoids storing redundant information and helps the memory efficiency. Meanwhile, it allows a flexible update on the stored features according to the object’s changing appearance.

At the beginning of the video segmentation, the feature bank is initialized by the features of the first frame. We build an independent feature bank for each object. Since the object-level feature banks are maintained separately, we omit the object symbol here to simplify the formulae. Suppose we have estimated the object mask of the th frame, then the th frame and the estimated mask are encoded into features through the reference encoder and the embedding module. For each new feature and old features that store in the feature bank , we employ an inner product as the similarity function,


For each new feature , we select the most similar feature from the feature bank that . If is large enough, since these two features are similar, we will merge the new one to the feature bank. Specifically, when ,


where controls the merging rate, and controls the impact of the moving averaging. Otherwise, for all , because the new features are so distinct from all the existing ones, we append the new features to the feature bank,


From our experiments, we find about of new features that satisfy merging operation and we only need to add the rest each time.

(a) Initial state (b) New features (c) Traditional update (d) Our update
Figure 2: An illustration of the Adaptive Feature Bank. (a) shows an initial state of a feature bank, where blue points are the key features. In (b), new extracted features to add are marked in red. In the traditional feature bank (c), features are directly added and it produces redundant features. In our design, as (d) shows, some features are alternated (in dark blue) to absorb those similar new features. The black arrows show the directions of the moving average.

Removing obsolete features.

Though the above updating strategy reliefs memory pressure significantly (e.g., less memory consumption), feature bank sizes still gradually expand with the growth of frame number. Similar to the cache replacement policy, we measure which old features are least likely to be useful and may be eliminated. We build a measurement using the least-frequently used (LFU) index.

Each time when we use the feature bank to match the query frame in Eqn. 1, if the similarity function is greater than a threshold , we increase the count of this feature. Specifically, for , the LFU index is counted by


where is the time span that the feature stays in the feature bank and the function is used to smooth the LFU index. In practice, when the size of the feature bank is about to exceed the predefined budget, we remove the features with the least LFU index until the size of the feature bank is below the budget. The LFU index counting and feature removing procedure are very efficient. This adaptive feature bank scheme can be generalized to other matching-based video processing methods to maintain the bank size, making it suitable to handle videos of arbitrary length.

3.3 Uncertain-Regions Refinement

Figure 3: An illustration of the Uncertain-regions Refinement. (a) is an input frame. (b) is the initial uncertainty map , where brightness indicates the uncertainty. (c) is the initial segmentation, where blue regions are the object masks and red regions highlights regions whose uncertainty score . In (d), the uncertain-regions refinement helps classify ambiguous areas. (e) is the final segmentation.

In the decoding stage, the object masks are computed from the upscaled low-resolution images. Therefore, the object boundaries of such estimated masks are often ambiguous. The classification accuracy of the boundary regions, however, is critical to the segmentation results. Hence, we propose a new scheme, named uncertain-region refinement (URR), to tackle boundary and other uncertain regions. It includes a new loss to evaluate such uncertainty and a novel local refinement mechanism to adjust fine-grained segmentation.

Confidence loss.

After decoding and a softmax normalization, we have a set of initial segmentations for each object , . The object mask represents the likelihood of each pixel belonging to the object , where the value range of is in , and . In other words, for each pixel , there are values in , indicating the likelihood of being one of these objects. We can simply pick the index of the largest value as ’s label.

More adaptively, we define a pixel-wise uncertainty map to measure classification ambiguity on each pixel, using the ratio of the largest likelihood value to the second largest value ,


where . The uncertainty map is in , where smaller value means more confidence. The confidence loss of a set of object masks is defined as


During the training stage, our framework is optimized using the following loss function,


where the is a weight scalar. (Eqn. 2) is the cross entropy loss for pixel-wise classification. is designed for minimizing the ambiguities of the estimated masks, i.e., pushing each object mask towards a map.

Local refinement mechanism.

We propose a novel local refinement mechanism to refine the ambiguous regions. Experimentally, given two neighbor points in the spatial space, if they belong to the same object, their features are usually close. The main intuition is that we use the pixels which have high confidence in classification to refine the other uncertain points in its neighborhood. Specifically, for each uncertain pixel , we compose its local reference features from ’s neighborhood, where is the number of target objects. If the local feature of is close to the , we say pixel is likely to be classified as the object . The local reference feature is computed by weighted average in a small neighborhood ,


where the weight is the object mask for the object . Then, a residual network module is designed to learn to predict the local similarity. We assign a local refinement mask for each pixel by comparing the similarity between and ,


where . are confidence scores for adjusting the impact of the local refinement mask.

Finally, we obtain the final segmentation for each object by adding the local refinement mask to the initial object mask ,


Fig. 3 shows the effectiveness of the proposed uncertain-region refinement (URR). The initial segmentation (marked in blue) is ambiguous, where some areas are lack of confidence (marked in red). As Fig. 3 (d)(e) shown, when our model is trained with the , the uncertain-region refinement improves the segmentation quality.

4 Training details

Our model is first pretrained on simulation videos which is generated from static image datasets. Then, for different benchmarks, our model is further trained on their training videos.

Pretraining on image datasets.

Since we don’t introduce any temporal smoothness assumptions, the learnable modules in our model does not require long videos for training. Pretraining on image datasets is widely used in VOS methods Perazzi et al. (2017); Oh et al. (2019), we simulate training videos by static image datasets Cheng et al. (2015); Shi et al. (2015); Li et al. (2014); Lin et al. (2014); Everingham et al. (2010) ( images in total). A synthetic video clip has 1 first frame and 5 subsequent frames, which are generated from the same image by data augmentation (random affine, color, flip, resize, and crop). We use the first frame to initialize the feature bank and the rest 5 frames consist of a mini-batch to train our framework by minimizing the loss function in Eqn. 9.

Main training on the benchmark datasets.

Similar to the pretrianing routine, we randomly select 6 frames per training video as a training sample and apply data augmentation on those frames. The input frames are randomly resized and cropped into px for all training. For each training sample, we randomly select at most 3 objects for training. We minimize our loss uing AdamW Loshchilov and Hutter (2017) optimizer (, , and the weight decay is ). The initial learning rate is for pretraining and for main training. Note that we directly use the network output without post-processing or video-by-video online training.

Figure 4:

The qualitative results of our framework on DAVIS dataset. Frames of challenging moments (occlusion or deformation) are shown. The target objects are marked in red, green, yellow, etc.

5 Experiments

5.1 Datasets and evaluation metrics

We evaluated our model (AFB-URR) on DAVIS17 Pont-Tuset et al. (2018) and YouTube-VOS18 Xu et al. (2018), two large-scale VOS benchmarks with multiple objects. DAVIS17 contains 60 training videos and 30 validation videos. YouTube-VOS18 (YV) contains

training videos and 474 videos for validation. We implemented our framework in PyTorch 

Paszke et al. (2017) and conducted experiments on a single NVIDIA 1080Ti GPU. Qualitative results of our framework on DAVIS17 dataset are shown in Fig. 4.

We adopt the evaluation metrics from the DAVIS benchmark. The region accuracy

calculates the intersection-over-union (IoU) of the estimated masks and the groundtruth masks. The boundary accuracy measures the accuracy of boundaries, via bipartite matching between the boundary pixels.

Methods OL M R D M R D M
RANet Wang et al. (2019) 63.2 73.7 18.6 68.2 78.8 19.7 65.7
AGSS Lin et al. (2019) 63.4 - - 69.8 - - 66.6
RGMP Oh et al. (2018) 64.8 74.1 18.9 68.6 77.7 19.6 66.7
OSVOS Maninis et al. (2019) Yes 64.7 74.2 15.1 71.3 80.7 18.5 68.0
CINM Bao et al. (2018) Yes 67.2 74.5 24.6 74.0 81.6 26.2 70.6
A-GAME (+YV)  Johnander et al. (2019) 68.5 78.4 14.0 73.6 83.4 15.8 71.0
FEELVOS (+YV) Voigtlaender et al. (2019) 69.1 79.1 17.5 74.0 83.8 20.1 71.5
STM Oh et al. (2019) 69.2 - - 74.0 - - 71.6
Ours 73.0 85.3 13.8 76.1 87.0 15.5 74.6
Table 1: The quantitative evaluation on the validation set of the DAVIS17 benchmark Pont-Tuset et al. (2018) in percentages. +YV indicates the use of YouTube-VOS for training. OL means it needs online learning.

5.2 State-of-the-art comparison

Results on DAVIS benchmarks.

We select state-of-the-art methods from both implicitly learning and explicitly learning to make fair comparisons. Methods with lower scores are omitted due to the limited space. We report three accuracy scores Pont-Tuset et al. (2018) in percentages, mean (M) is the average of Intersection over Union (IoU), recall (R) measures the fraction of sequences scoring higher than a threshold , and decay (D) measures how the performance changes over time. The quantitative results are shown in Table 1. Our method significantly outperforms the existing methods. Our M score is 74.6 without any online fine-tune. Besides that, our model has better runtime performance than the baseline STM Oh et al. (2019). On DAVIS17, with an NVIDIA 1080Ti, STM achieves 3.4fps ( 71.6), and ours achieves 4.0fps ( 74.6). Our runtime is a trade-off between latency and accuracy depends on requirement: if we limit the memory usage under , it achieves 5.7fps ( 71.7).

Perazzi et al. (2017) Oh et al. (2018) Voigtlaender and Leibe (2017) Johnander et al. (2019) Luiten et al. (2018) Xu et al. (2018) Lin et al. (2019) Oh et al. (2019)
Need OL Yes Yes Yes Yes
seen 59.9 59.5 60.1 66.9 71.4 71.0 71.3 79.7 78.8
unseen 45.0 - 46.6 61.2 56.5 55.5 65.5 72.8 74.1
seen 59.5 45.2 62.7 - 75.9 70.0 75.2 84.2 83.1
unseen 47.9 - 51.4 - 63.7 61.2 73.1 80.9 82.6
Overall 53.1 53.8 55.2 66.0 66.9 64.4 71.3 79.4 79.6
Table 2: The quantitative evaluation on the validation set of the YouTube-VOS18 benchmark Xu et al. (2018) in percentages. OL means it needs online learning.

Results on YouTube-VOS benchmarks.

The validation sets contains 474 videos with the first frame annotation. It includes objects from 65 training categories, and 26 unseen categories in training. We resize the video sequences into pixels as inputs, then resize the segmentation masks to the original resolutions.

Table 2 shows a comparison with previous state-of-the-art methods on the open evaluation server. Our framework achieves the best overall score 79.6 because the adaptive feature bank improves the robustness and reliability for different scenarios. For those videos whose objects are already been seen in the training videos, STM’s results are somewhat better than ours. The reason could be that their model and ours are evaluated on different memory budgets. STM Oh et al. (2019) evaluated their work on an NVIDIA V100 GPU with 16GB memory, while we evaluated ours on a weaker machine (one NVIDIA 1080Ti GPU with 11GB memory). However, our framework is more robust in segmenting unseen objects, 74.1 in and 82.6 in , compared with STM (72.8 in and 80.9 in ). Overall, our proposed model has great generalization and achieves the state-of-the-art performance.

5.3 Long-time video comparison

Methods M R D M R D M
RVOS Ventura et al. (2019) 10.2 6.67 13.0 14.3 11.7 10.1 12.2
A-GAME Johnander et al. (2019) 50.0 58.3 39.6 50.7 58.3 45.2 50.3
STM Oh et al. (2019) 79.1 88.3 11.6 79.5 90.0 15.4 79.3
Ours 82.7 91.7 11.5 83.8 91.7 13.9 83.3
Table 3: The quantitative evaluation on the Long-time Video dataset in percentages.

Current benchmarks DAVIS17 (average 67 frames per video) and YouTube-VOS (average 132 frames per video) only contain short clips, which cannot show the advantages of AFB in dealing with long videos in real-world. Hence, we build a new dataset of 3 long videos (average 2K+ frames per video) to show the performance in real-world. Each video has uniformly annotated 20 frames for evaluation. We select three state-of-the-art works to build the long-time video benchmark. All methods were evaluated on the same machine, NVIDIA 1080Ti (11GB Memory). We use their official implementation codes and pretrained models on YouTube-VOS dataset.

In Table 3, RVOS Ventura et al. (2019) and A-GAME Johnander et al. (2019) achieved lower scores because they failed to recognize the object of interest after 1K frames. We found that STM Oh et al. (2019) was able to store at most 50 frames per video. We tuned this hyper-parameter and left the others unchanged. STM’s score is 79.3. Because the total frames that STM can store is fixed, when the video length becomes longer, STM has to increase the key frame interval and has higher chance of missing important frames and information. In contrast, the proposed AFB mechanism can dynamically manage the key information from past frames. We achieved the best score 83.3.

5.4 Ablation study

Variants M R D M R D M
Keep first frame +URR 63.3 70.2 31.4 69.0 77.8 30.2 66.2
Keep latest frame + URR 66.6 76.6 19.1 70.2 83.3 21.7 68.4
Keep first & latest frames +URR 71.4 82.9 17.4 74.9 85.7 21.0 73.1
Keep first & latest 5 frames +URR 69.7 79.8 19.0 73.6 84.6 20.8 71.6
AFB only 68.5 80.4 17.3 72.0 84.0 19.5 70.2
AFB + Confidence loss 71.8 83.5 15.5 73.6 84.1 19.5 72.7
AFB + Local refinement 70.5 80.7 17.2 73.6 85.1 22.5 72.1
Full (AFB+URR) pretrain only 59.0 66.3 27.9 62.9 70.8 31.6 60.9
Full (AFB+URR) 73.0 85.3 13.8 76.1 87.0 15.5 74.6
Table 4: Ablation study using the validation set of the DAVIS17 benchmark Pont-Tuset et al. (2018).

We conducted an extensive ablation analysis of our framework on the DAVIS17 dataset. In Table 4, the quantitative results show the effectiveness of the proposed key modules.

First, we analyze the impact of the adaptive feature bank (AFB), and evaluate memory management schemes: (1) keeping features from the first frame, (2) from latest frame, (3) from the first latest frame, and (4) from the first latest 5 frames. The remaining modules follow the full framework (i.e., with uncertain-regions refinement (URR) included). From the first four rows in Table 4, while reference frames help the segmentation, simply adding multiple frames may not further improve the performance. Our adaptive feature bank more effectively organizes the key information of all the previous frames. Consequently, our framework AFB+URR has the best performance.

Second, we analyze the efficiency of the proposed uncertain-regions refinement (URR). We disable URR by training the framework without the confidence loss in Eqn. 9 or/and local refinement. Because object boundary regions are often ambiguous, and their uncertainty errors are easily accumulated and harm segmentation results. Our uncertainty evaluation and local refinement significantly improve performance in these regions.

6 Conclusion

We present a novel framework for semi-supervised video object segmentation. Our framework includes an adaptive feature bank (AFB) module and an uncertain-region refinement (URR) module. The adaptive feature bank effectively organizes key features for segmentation. The uncertain-region refinement is designed for refining ambiguous regions. We train our framework by minimizing the typical segmentation cross-entropy loss plus an innovative confidence loss. Our approach outperforms the state-of-the-art methods on two large-scale benchmark datasets.

Broader Impact

Our framework is designed for the semi-supervised video object task, also known as one-shot video object segmentation task. Given the first frame annotation, our model could segment the object of interest in the subsequent frames. Due to the robust generalization of our model, the category of the target object is unrestricted. Our adaptive feature bank and the matching based framework can be modified to benefit other video processing tasks in autonomous driving, robot interaction, and video surveillance monitoring that need to handle long videos and appearance-changing contents. For example, one application is in real-time flood detection/monitoring using surveillance cameras. Flooding constitutes the largest portion of insured losses among all disasters in the world Aerts et al. (2014). Nowadays, many cameras in city including traffic monitoring and security surveillance cameras are able to capture time-lapse images and videos. By leveraging our video object segmentation framework, flood can be located from the videos and the water level can be estimated. The societal impact is immense because such flood monitoring system can predict and alert a flooding event from rainstorms or hurricanes in time. Our framework is trained and evaluated on large-scale segmentation datasets, we do not leverage biases in the data.

This work is partly supported by Louisiana Board of Regents ITRS LEQSF(2018-21)-RD-B-03, and National Science Foundation of USA OIA-1946231.


  • [1] J. C. Aerts, W. W. Botzen, K. Emanuel, N. Lin, H. De Moel, and E. O. Michel-Kerjan (2014) Evaluating flood resilience strategies for coastal megacities. Science 344 (6183), pp. 473–475. Cited by: Broader Impact.
  • [2] L. Bao, B. Wu, and W. Liu (2018-06) CNN in MRF: Video Object Segmentation via Inference in a CNN-Based Higher-Order Spatio-Temporal MRF. In

    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Salt Lake City, UT, pp. 5977–5986 (en). External Links: ISBN 978-1-5386-6420-9, Link, Document Cited by: §1, §2, Table 1.
  • [3] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool (2017) One-shot video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 221–230. Cited by: §1, §2.
  • [4] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool (2018) Blazingly fast video object segmentation with pixel-wise metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1189–1198. Cited by: §1.
  • [5] M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S. Hu (2015) Global contrast based salient region detection. IEEE TPAMI 37 (3), pp. 569–582. External Links: Document Cited by: §4.
  • [6] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §4.
  • [7] B. Fritzke (1995) A growing neural gas network learns topologies. In Advances in neural information processing systems, pp. 625–632. Cited by: §3.2.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. pp. 770–778. External Links: Link Cited by: §3.1.
  • [9] P. Hu, G. Wang, X. Kong, J. Kuen, and Y. Tan (2018) Motion-guided cascaded refinement network for video object segmentation. pp. 1400–1409. External Links: Link Cited by: §2.
  • [10] Y. Hu, J. Huang, and A. G. Schwing (2018) Videomatch: matching based video object segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 54–70. Cited by: §1, §2.
  • [11] Y. Hu, J. Huang, and A. Schwing (2017) MaskRNN: Instance Level Video Object Segmentation. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 325–334. External Links: Link Cited by: §1, §2.
  • [12] J. Johnander, M. Danelljan, E. Brissman, F. S. Khan, and M. Felsberg (2019) A generative appearance model for end-to-end video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8953–8962. Cited by: §1, §5.3, Table 1, Table 2, Table 3.
  • [13] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele (2017) Lucid data dreaming for object tracking. In The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, Cited by: §1, §2.
  • [14] A. Kirillov, Y. Wu, K. He, and R. Girshick (2020) PointRend: image segmentation as rendering. pp. 9799–9808. External Links: Link Cited by: §2.
  • [15] W. Kuo, A. Angelova, J. Malik, and T. Lin (2019) ShapeMask: learning to segment novel objects by refining shape priors. pp. 9207–9216. External Links: Link Cited by: §2.
  • [16] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille (2014) The secrets of salient object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 280–287. Cited by: §4.
  • [17] Y. Liang, N. Jafari, X. Luo, Q. Chen, Y. Cao, and X. Li (2020) WaterNet: an adaptive matching pipeline for segmenting water with volatile appearance. Computational Visual Media, pp. 1–14. Cited by: §1, §2.
  • [18] H. Lin, X. Qi, and J. Jia (2019) AGSS-VOS: Attention Guided Single-Shot Video Object Segmentation. pp. 3949–3957. External Links: Link Cited by: §1, §1, §2, §3.1, Table 1, Table 2.
  • [19] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014)

    Microsoft coco: common objects in context

    In European conference on computer vision, pp. 740–755. Cited by: §4.
  • [20] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
  • [21] I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §4.
  • [22] J. Luiten, P. Voigtlaender, and B. Leibe (2018-07) PReMVOS: Proposal-generation, Refinement and Merging for Video Object Segmentation. (en). External Links: Link Cited by: Table 2.
  • [23] K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixe, D. Cremers, and L. Van Gool (2019-06) Video Object Segmentation without Temporal Information. IEEE transactions on pattern analysis and machine intelligence 41 (6), pp. 1515–1530. Note: Place: United States Publisher: IEEE Computer Society External Links: ISSN 1939-3539, Document Cited by: §1, §2, Table 1.
  • [24] S. W. Oh, J. Lee, K. Sunkavalli, and S. J. Kim (2018-06) Fast Video Object Segmentation by Reference-Guided Mask Propagation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7376–7385. Note: ISSN: 2575-7075 External Links: Document Cited by: §1, §1, §2, §3.1, Table 1, Table 2.
  • [25] S. W. Oh, J. Lee, N. Xu, and S. J. Kim (2019) Video Object Segmentation Using Space-Time Memory Networks. pp. 9226–9235. External Links: Link Cited by: §1, §2, §3.1, §3.1, §3.2, §4, §5.2, §5.2, §5.3, Table 1, Table 2, Table 3.
  • [26] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §5.1.
  • [27] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A.Sorkine-Hornung (2017) Learning video object segmentation from static images. In Computer Vision and Pattern Recognition, Cited by: §1, §2, §4, Table 2.
  • [28] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool (2018-03) The 2017 DAVIS Challenge on Video Object Segmentation. arXiv:1704.00675 [cs]. Note: arXiv: 1704.00675Comment: Challenge website: External Links: Link Cited by: §5.1, §5.2, Table 1, Table 4.
  • [29] J. Shi, Q. Yan, L. Xu, and J. Jia (2015) Hierarchical image saliency detection on extended cssd. IEEE transactions on pattern analysis and machine intelligence 38 (4), pp. 717–729. Cited by: §4.
  • [30] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques, and X. Giro-i-Nieto (2019) Rvos: end-to-end recurrent network for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5277–5286. Cited by: §5.3, Table 3.
  • [31] P. Voigtlaender, Y. Chai, F. Schroff, H. Adam, B. Leibe, and L. Chen (2019) Feelvos: fast end-to-end embedding learning for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9481–9490. Cited by: §1, §1, §2, Table 1.
  • [32] P. Voigtlaender and B. Leibe (2017)

    Online adaptation of convolutional neural networks for video object segmentation

    In BMVC, Cited by: §1, §2, Table 2.
  • [33] Z. Wang, J. Xu, L. Liu, F. Zhu, and L. Shao (2019) RANet: Ranking Attention Network for Fast Video Object Segmentation. pp. 3978–3987. External Links: Link Cited by: §1, §1, §2, Table 1.
  • [34] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price, S. Cohen, and T. Huang (2018) Youtube-vos: sequence-to-sequence video object segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 585–601. Cited by: Table 2.
  • [35] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang (2018-09) YouTube-VOS: A Large-Scale Video Object Segmentation Benchmark. arXiv:1809.03327 [cs]. Note: arXiv: 1809.03327Comment: Dataset Report. arXiv admin note: substantial text overlap with arXiv:1809.00461 External Links: Link Cited by: §5.1, Table 2.
  • [36] X. Zeng, R. Liao, L. Gu, Y. Xiong, S. Fidler, and R. Urtasun (2019) DMM-Net: Differentiable Mask-Matching Network for Video Object Segmentation. pp. 3929–3938. External Links: Link Cited by: §2.