Video object segmentation (VOS) aims at segmenting the foreground objects from a given video with motion. It is a classic task in computer vision with many applications, including surveillance, video compression, and motion understanding. In this paper, we focus on the most practical and widely studied setting, i.e., semi-supervised VOS liang_AFB_URR_NIPS_2020; yang_CFBI_ECCV_2020; Robinson_frtm-vos_CVPR_2020; oh_STM_ICCV_2019; yin2021agunet_PR, where the goal is to segment target objects over video sequences given only the initial mask of the first frame as prior and visual guidance. This is a challenging problem because the target objects can be confused with similar instances of the same category, and their appearance might vary drastically over time due to scale changes, pose changes, fast motion, truncation, blur, occlusions, etc. Essentially, these challenges cannot be addressed with image appearance information alone.
Recently, various deep learning based VOS approaches have been proposed sun2020adaptive_PR; zhao2021real_PR, which can be roughly categorized as propagation-based methods Perazzi_masktrack_CVPR_2017; oh_seoung_fastvos_RGMP_CVPR_2018; li_vsreid_cvprw_davis_2017 and feature matching based methods oh_STM_ICCV_2019; Li_STM-cycle_NeurIPS_2020; liang_AFB_URR_NIPS_2020; lu_episodicvos_eccv_2020; seong_kernelizedvos_eccv_2020. The former generally formulate the task as object mask propagation, while the latter leverage memory networks to retrieve relevant information. Nowadays, feature matching-based methods such as Space-Time Memory (STM) Networks oh_STM_ICCV_2019; liang_AFB_URR_NIPS_2020; Li_STM-cycle_NeurIPS_2020; liyu_fast_vos_ECCV_2020; lu_episodicvos_eccv_2020; seong_kernelizedvos_eccv_2020 have achieved state-of-the-art performance in VOS. The key to their success lies in matching features of historical frames using non-local operations with a well-designed feature-memory-bank mechanism. They match the current frame against all previous frames and their object segmentation results through a feature similarity query, and infer the object mask of the current frame. However, these methods rely heavily on matching object appearance between frames, while motion information, a critical cue between video frames, tends to be ignored. Prior to the success of these feature matching-based methods, explicit motion modeling in the form of dense optical flow has been exploited cheng_segflow_cvpr_2017; jang_online_vos_ctn_cvpr_2017; Perazzi_masktrack_CVPR_2017; Dave_seg_anything_moves_ICCV_2019; khoreva_lucid_dream_DAVIS_2017; li_vsreid_cvprw_davis_2017; tian2020joint_PR, where the dense optical flow is pre-computed by optical flow estimation networks ilg_flownet2_cvpr_2017; hui_liteflownet_cvpr_2018; revaud_epicflow_cvpr_2015; hu_efficient_flow_cvpr_2016; kroeger_fast_flow_eccv_2016; wang_DICL_NIPS_2020; teed_raft_eccv_2020.
However, the explicit use of optical flow not only requires training an optical flow network on an additional large dataset (which has a domain gap with VOS datasets) but also fails to handle the critical challenges in VOS, i.e., occlusion, textureless regions, fast motion, and blur.
In this paper, we advocate that VOS should not only rely on image appearance similarity matching but also emphasize the essential motion information from the video, which exists in any adjacent frames and will not disappear over time. We propose a novel framework for semi-supervised VOS named MUNet that embeds motion information into a single branch pipeline without a pre-computed optical flow. Given a video sequence, MUNet uses a dense matching based method with a feature-memory-bank to store both appearance and motion features, as shown in Fig. 2. To avoid problems caused by explicitly using optical flow, we propose a lightweight Motion Uncertainty-aware Layer (MU-Layer) to implicitly model motion information from adjacent frames. Specifically, we use a cost volume to model the displacement and motion uncertainty as a motion feature to establish spatio-temporal relationships, which is calculated from high-level semantic features of adjacent frames. In addition, we design a Motion-aware Spatial Attention Module (MSAM) to effectively fuse the appearance feature and the motion uncertainty-aware feature, then we use this module to guide the segmentation of video sequences. Different from the two-stream methods that need pre-computed optical flow as input, MUNet only requires the raw images and does not need the supervision of optical flow, which greatly expands the practicability and scope of application. Meanwhile, under the experimental protocol using a small amount of data, the powerful motion information brings a significant performance boost without complex tricks.
We conduct comprehensive experiments on DAVIS17 and YouTube-VOS18. Experimental results show that our method achieves state-of-the-art accuracy on the validation set of DAVIS17 (ranks 1st under protocol(1), ranks 2nd under protocol(2), ranks 1st under protocol(3-1)), which exceeds all competing methods under the same settings. We provide qualitative comparisons with four SOTA methods liang_AFB_URR_NIPS_2020; yang_CFBI_ECCV_2020; Robinson_frtm-vos_CVPR_2020; oh_STM_ICCV_2019 in Fig. 1. MUNet can accurately demarcate the boundaries of multiple objects and does not overly cover or ignore small structures. As shown in Fig. 5, the MU-Layer can implicitly learn a reasonable uncertainty map and displacement vector, and effectively activate the area of the moving object through the motion-aware spatial attention module. Our main contributions are summarized as follows.
To the best of our knowledge, we are the first to embed motion information into an end-to-end VOS pipeline with a single branch and without a precomputed optical flow.
We introduce the MU-Layer and MSAM to learn the motion features with uncertainty, which provides critical information for VOS.
Experimental evaluation on benchmark datasets under a wide range of protocols shows that our method is competitive with the existing SOTA methods and achieves the best performance under the low-data protocols, i.e., (1) and (3-1).
2 Related Work
In this section, we review previous works in two categories of VOS: feature matching based methods and motion based methods.
Feature matching methods. STM oh_STM_ICCV_2019 achieves great success by performing dense feature matching across the entire spatio-temporal volume of the video through a dynamic memory bank, which saves the spatio-temporal information of previous frames. To alleviate the problem of missing samples and out-of-memory crashes when processing long videos, AFB-URR liang_AFB_URR_NIPS_2020 introduces an adaptive feature bank that organizes the object features by weighted averaging and discards obsolete features according to a least-frequently-used index. STM-Cycle Li_STM-cycle_NeurIPS_2020 incorporates cycle consistency to mitigate the error propagation problem. liyu_fast_vos_ECCV_2020 proposes a global context module using attention to reduce temporal information and guide the segmentation. RANet Ziqin_RANet_CVPR_2019 is a hybrid strategy that integrates the insights of matching based and propagation based methods to learn pixel-level similarity. KMN seong_kernelizedvos_eccv_2020 and GraphMem lu_episodicvos_eccv_2020 focus more on memory bank optimization to achieve more accurate key-value matching. Both are accompanied by data augmentation, such as Hide-and-Seek and Label Shuffling, and their performance drops considerably if the augmentation is removed. We agree that these schemes would also improve our results, but this paper is dedicated to using simple motion cues to emphasize the essentials. Besides, although these methods use temporal information to improve accuracy, they ignore the most essential motion information.
Motion based methods. Such methods can be roughly classified as mask propagation methods Perazzi_masktrack_CVPR_2017; oh_seoung_fastvos_RGMP_CVPR_2018; li_vsreid_cvprw_davis_2017; xie_RMNet_CVPR_2021 and two-stream methods cheng_segflow_cvpr_2017; khoreva_lucid_dream_DAVIS_2017; Dave_seg_anything_moves_ICCV_2019; jang_online_vos_ctn_cvpr_2017. The mask propagation methods start from an annotated frame and propagate masks through the entire video sequence, and the propagation process is often guided by optical flow. Two-stream architectures usually fuse features from appearance (RGB branch) and motion (optical flow branch).
However, such methods are highly dependent on high-quality pre-computed optical flow ilg_flownet2_cvpr_2017; hui_liteflownet_cvpr_2018; revaud_epicflow_cvpr_2015; hu_efficient_flow_cvpr_2016; kroeger_fast_flow_eccv_2016; teed_raft_eccv_2020, and their performance is limited by the generalization ability of the optical flow network, because there is no optical flow ground truth in the common VOS task. RMNet xie_RMNet_CVPR_2021 uses the mask of the previous frame and pre-computed optical flow to generate the attention mask for the current frame, but different optical flow networks ilg_flownet2_cvpr_2017; teed_raft_eccv_2020 are selected for different datasets. This also shows that methods of this kind cannot escape the generalization problem. As for mask propagation methods, they rely on temporal continuity from optical flow and spatio-temporal context from the previous frames. As a result, they have difficulty dealing with occlusions, rapid motion, and complex deformation of objects, and they suffer from performance drift over time once the propagation becomes unreliable.
3 Motion Uncertainty-aware VOS
Given a video sequence with $T$ frames, $I_t$ is the $t$-th frame RGB image, $M_t$ is the ground truth object segmentation mask and $\hat{M}_t$ denotes the predicted object mask. Semi-supervised VOS aims to predict the object masks of all following video frames with the first RGB image $I_1$ and its object annotation mask $M_1$ as prior, which can be formulated as $\hat{M}_t = \mathcal{F}_{\theta}(I_t \mid I_1, M_1, \mathcal{H}_t)$, where $\mathcal{F}_{\theta}$ is an object segmentation network with learnable weights $\theta$. $\mathcal{H}_t$ denotes the historical frames before the current frame that can be used directly or implicitly to infer the object mask.
3.1 Network Overview
Our MUNet is a pixel-level dense matching method. The network architecture as illustrated in Fig. 2 consists of four seamless parts: 1) two encoders as feature extractors, 2) a Motion Uncertainty-aware Layer (MU-Layer), 3) a Motion-aware Spatial Attention Module (MSAM), and 4) a decoder with a feature memory bank.
First, the reference image and mask are fed into the reference encoder Enc-R, while the query image and its adjacent frame are each fed into the query encoder Enc-Q with shared weights. Second, the semantic features from Enc-Q are input to the MU-Layer to obtain the motion feature; more details are described in Sec. 3.2. Then, we use MSAM to take advantage of the motion feature from adjacent frames to enhance the semantic feature embedded by Enc-Q, as described in Sec. 3.3.
In oh_STM_ICCV_2019; Li_STM-cycle_NeurIPS_2020; liang_AFB_URR_NIPS_2020, the semantic features output by Enc-R and Enc-Q are directly used for key-value embedding, where the keys are used for addressing and matching, while the values are used to preserve feature information. This causes the query process to rely heavily on the similarity matching of appearance semantic features, which may lead to appearance confusion and wrong predictions caused by similar instances. Different from them, for a query image, we use the motion-enhanced feature from MSAM for key-value embedding, and obtain the key-value pairs to match the most similar features in the memory bank. The key-value pairs of historical images are stored in the memory bank and updated dynamically over time. Finally, the matching results from the memory bank are fed into the decoder, which outputs the segmentation mask of each object. We use ResNet50 kaiming_resnet50_cvpr_2016 as the backbone of the two encoders. For the query encoder Enc-Q, we take the output feature map of Layer-4 (res4) as the semantic feature, whose spatial size is a downsampled fraction of the raw image height and width. The reference encoder Enc-R is slightly different from the vanilla ResNet50: it takes the RGB image (3-channel) and its segmentation label mask (1-channel) as inputs to extract an object-level semantic feature.
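The key-value matching against the memory bank can be illustrated with a minimal NumPy sketch. It assumes a plain dot-product similarity followed by a softmax over memory locations, in the style of STM-like readers; the function name `memory_read` and the flattened `(locations, channels)` layout are illustrative choices and are not taken from the paper.

```python
import numpy as np

def memory_read(query_key, memory_keys, memory_values):
    """Soft key-value lookup (illustrative, not the authors' code).

    query_key:     (N_q, C_k) keys of the query frame, one row per pixel
    memory_keys:   (N_m, C_k) keys stored in the memory bank
    memory_values: (N_m, C_v) values stored in the memory bank
    Returns the retrieved values (N_q, C_v) and the matching weights (N_q, N_m).
    """
    scores = np.asarray(query_key, dtype=float) @ np.asarray(memory_keys, dtype=float).T
    # Subtract the row-wise maximum for numerical stability before the softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Each query pixel receives a similarity-weighted average of memory values.
    return weights @ np.asarray(memory_values, dtype=float), weights

if __name__ == "__main__":
    keys = 10.0 * np.eye(3)
    values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    read, weights = memory_read(np.array([[0.0, 10.0, 0.0]]), keys, values)
    print(read.round(3), weights.round(3))
```

Because the keys alone drive the addressing, anything that makes two objects' keys similar (for example, near-identical appearance) also makes their retrieved values similar, which is the confusion the motion-enhanced keys are meant to reduce.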
3.2 Motion Uncertainty-aware Layer (MU-Layer)
Here we introduce MU-Layer, which establishes spatio-temporal relationships based on the motion feature calculated from high-level semantic features of adjacent frames. We construct a cost volume to model pixel-wise feature similarity that indicates the rough motion.
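Cost Volume. Let $F_r, F_q \in \mathbb{R}^{h \times w \times C}$ be the semantic features of the reference and query images from Enc-Q, where $h \times w$ is the downsampled spatial size relative to the input image shape and $C$ is the feature dimension. To keep the computation lightweight, we first reduce the feature dimension with a convolution. Then we define the correlation as

$$C(\mathbf{x}, \mathbf{d}) = \frac{\langle F_r(\mathbf{x}),\, F_q(\mathbf{x}+\mathbf{d}) \rangle}{\lVert F_r(\mathbf{x}) \rVert_2 \, \lVert F_q(\mathbf{x}+\mathbf{d}) \rVert_2}, \qquad (1)$$

which can be efficiently computed as the cosine similarity between each pixel in the reference image and a set of candidate targets in a search window, where $\mathbf{x}$ is the source pixel coordinate and $\mathbf{d}$ is the pixel displacement in the search window. We set the window size to 25×25.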
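A local cost volume of this kind can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function name `local_cost_volume`, the `(H, W, C)` layout, and the zero padding outside the image are choices made for the example, not details confirmed by the paper. A 25×25 window corresponds to `radius=12`.

```python
import numpy as np

def local_cost_volume(feat_src, feat_tgt, radius):
    """Cosine-similarity cost volume over a local search window (illustrative).

    feat_src, feat_tgt: (H, W, C) feature maps of two adjacent frames.
    Returns cost of shape (H, W, K, K) with K = 2 * radius + 1, where
    cost[y, x, dy + radius, dx + radius] is the cosine similarity between
    feat_src[y, x] and feat_tgt[y + dy, x + dx] (0 outside the image).
    """
    eps = 1e-8
    # L2-normalise channels so that a dot product equals cosine similarity.
    src = feat_src / (np.linalg.norm(feat_src, axis=-1, keepdims=True) + eps)
    tgt = feat_tgt / (np.linalg.norm(feat_tgt, axis=-1, keepdims=True) + eps)
    h, w, _ = src.shape
    k = 2 * radius + 1
    # Zero-pad the target so every displacement in the window is addressable.
    padded = np.pad(tgt, ((radius, radius), (radius, radius), (0, 0)))
    cost = np.zeros((h, w, k, k), dtype=float)
    for iy in range(k):
        for ix in range(k):
            # padded[y + iy, x + ix] equals tgt[y + iy - radius, x + ix - radius].
            shifted = padded[iy:iy + h, ix:ix + w]
            cost[:, :, iy, ix] = (src * shifted).sum(axis=-1)
    return cost

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.standard_normal((6, 7, 8))
    tgt = np.roll(src, shift=(1, 2), axis=(0, 1))  # content moves by (dy, dx) = (1, 2)
    cost = local_cost_volume(src, tgt, radius=3)
    print(cost.shape, np.unravel_index(cost[2, 2].argmax(), (7, 7)))
```

The loop runs over the K×K displacements rather than over pixels, so the cost is K² vectorised products, which is why restricting the search to a local window keeps the layer lightweight.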
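Displacement Calculation. To implicitly represent the motion without optical flow ground truth as supervision, we exploit the information represented by the cost volume $C(\mathbf{x}, \mathbf{d})$, indexed by source pixel $\mathbf{x}$ and displacement $\mathbf{d}$, and calculate the displacement with the highest matching cost. We use a soft-argmin operator, $\hat{\mathbf{d}}(\mathbf{x}) = \sum_{\mathbf{d}} \sigma\big(C(\mathbf{x}, \mathbf{d})\big)\, \mathbf{d}$, to solve this problem, where $\sigma$ denotes the softmax function. We use a 2D softmax along the displacement hypothesis space of $\mathbf{d}$ to calculate the matching probability. Therefore, $\hat{\mathbf{d}}(\mathbf{x})$ is a two-channel displacement vector.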
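The soft-argmin step can be sketched as a probability-weighted average of candidate displacements. The sketch below assumes a cost volume of shape `(H, W, K, K)` with the window centred on zero displacement; the function name `soft_displacement` and the `temperature` parameter are hypothetical additions for illustration and do not appear in the paper.

```python
import numpy as np

def soft_displacement(cost, temperature=1.0):
    """Differentiable displacement estimate from a local cost volume (illustrative).

    cost: (H, W, K, K) matching scores; higher means a better match.
    Returns (H, W, 2) holding the expected (dy, dx) under a 2D softmax
    taken over the K x K displacement hypotheses of each pixel.
    """
    h, w, k, _ = cost.shape
    radius = k // 2
    logits = cost.reshape(h, w, k * k) / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=-1, keepdims=True)
    prob = prob.reshape(h, w, k, k)
    offsets = np.arange(k) - radius                # candidate displacements
    dy = (prob.sum(axis=3) * offsets).sum(axis=-1)  # marginal over dx, then expectation
    dx = (prob.sum(axis=2) * offsets).sum(axis=-1)  # marginal over dy, then expectation
    return np.stack([dy, dx], axis=-1)

if __name__ == "__main__":
    cost = np.zeros((2, 2, 5, 5))
    cost[..., 3, 1] = 50.0   # sharp peak at dy = +1, dx = -1
    print(soft_displacement(cost)[0, 0])
```

When the cost distribution is flat (for example in textureless regions), the expectation collapses toward zero displacement regardless of the true motion, which is one reason a separate confidence signal is useful.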
Uncertainty Estimation. The last two dimensions of the cost volume represent the correspondence between the features of the two adjacent frames at all spatial positions. Previously, we built the pixel-wise displacement between features of adjacent frames by soft-argmin, but it may fail in low-texture, large-motion, and motion-blur areas. To remedy this issue, we propose an uncertainty branch to measure the matching confidence. We use the matching cost value of the correlation matrix at each spatial position as a score to measure the matching uncertainty of each pair of spatial correspondences. Specifically, we use max-pooling to take the maximum value over the displacement coordinate $\mathbf{d}$ as the matching uncertainty, i.e., the highest response at each spatial position of the 4D cost volume $C(\mathbf{x}, \mathbf{d})$. We define the uncertainty map as

$$U(\mathbf{x}) = \max_{\mathbf{d}} C(\mathbf{x}, \mathbf{d}). \qquad (2)$$
As shown in Fig. 3, we randomly select a pixel in the query image; the red box is the corresponding search window in the reference image, and the yellow line represents the correspondence between the two adjacent frames. The right subfigure shows the local matching cost of the search window, which is assembled into the full uncertainty map of the image as in Eq. (2).
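The max-pooling over the search window can be written in one line. The sketch below assumes the same `(H, W, K, K)` cost-volume layout as a local correlation layer would produce; the function name `uncertainty_map` is illustrative.

```python
import numpy as np

def uncertainty_map(cost):
    """Per-pixel peak matching score over the search window (illustrative).

    cost: (H, W, K, K) local cost volume.
    Returns (H, W): the highest response among all K x K displacement
    hypotheses at each spatial position.
    """
    return cost.max(axis=(2, 3))

if __name__ == "__main__":
    cost = np.zeros((2, 3, 5, 5))
    cost[0, 1, 2, 4] = 0.9   # confident match for pixel (0, 1)
    cost[1, 2, 0, 0] = 0.3   # weak best match for pixel (1, 2)
    print(uncertainty_map(cost))
```

A pixel whose best candidate still has a low score (such as 0.3 above) has no convincing correspondence in the window, so its soft-argmin displacement should be trusted less.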
Motion Feature Representation. After obtaining motion information from the cost volume by soft-argmin and max-pooling, we also need a high-level motion feature so that it can be fused with the high-level semantic feature on an equivalent level. Different from many two-stream based VOS methods li_MotionAttention_ICCV_2019; zhou_matnet_aaai_2020_tip that directly use a deep encoder parallel to the RGB encoder to encode the motion feature, we design a lightweight CNN model to project the original motion information into a high-dimensional feature space. Table 1 shows the structure of this lightweight motion feature representation network in our proposed MU-Layer. How to use the motion feature to enhance the semantic feature is described in Sec. 3.3.
3.3 Motion-aware Spatial Attention Module (MSAM)
To enhance the semantic feature from Enc-Q with the motion feature provided by the preceding MU-Layer, a direct and simple way is to fuse the motion feature with the semantic feature by an element-wise addition, as shown on the right of Fig. 2(a), which can be formulated as $F_f = F_m + F_s$, where $F_m$ and $F_s$ denote the motion and semantic features respectively, and $F_f$ is the fused feature, which is subsequently used to embed the key-value pair of the query image.
Considering that the motion information is different in each spatial position, to use it more efficiently, we design a lightweight attention module to fuse the motion feature and semantic feature. In this way, the semantic feature from Enc-Q can be enhanced by the motion feature in some important regions of motion. Thus, we exploit motion feature as spatial attention weights, as shown in Fig. 2(b), and this module can be formulated as,
where indicates the broadcast multiply, the shape of is and the output channel of is . So after and , we can get one-channel attention map with shape .
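The idea of using the motion feature as spatial attention weights can be sketched as below. This is only one plausible form and rests on explicit assumptions: the one-channel map is produced here by a per-pixel linear projection (standing in for a 1×1 convolution) followed by a sigmoid, and the names `motion_spatial_attention`, `proj_weight` and `proj_bias` are hypothetical. The exact layers used in MSAM are not reproduced by this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def motion_spatial_attention(semantic, motion, proj_weight, proj_bias=0.0):
    """Motion-guided spatial attention (assumed form, for illustration only).

    semantic:    (H, W, C_s) semantic feature from the query encoder
    motion:      (H, W, C_m) motion feature
    proj_weight: (C_m,) weights of a hypothetical 1x1 projection to one channel
    Returns the re-weighted semantic feature (H, W, C_s) and the
    one-channel attention map (H, W, 1).
    """
    logits = motion @ proj_weight + proj_bias   # (H, W)
    attention = sigmoid(logits)[..., None]      # (H, W, 1), values in (0, 1)
    fused = semantic * attention                # broadcast multiply over channels
    return fused, attention

if __name__ == "__main__":
    semantic = np.ones((2, 2, 3))
    motion = np.full((2, 2, 1), -20.0)
    motion[0, 0, 0] = 20.0                      # only pixel (0, 0) shows strong motion
    fused, attention = motion_spatial_attention(semantic, motion, np.array([1.0]))
    print(attention[..., 0].round(3))
```

Compared with element-wise addition, the multiplicative map lets each spatial position scale all semantic channels together, so regions without reliable motion can be suppressed rather than merely offset.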
3.4 Loss function
Cross-Entropy (CE) is the most widely used loss in segmentation tasks. However, the CE loss calculates the error of each pixel independently and treats all pixels equally, which ignores the global structure of the image. Here we adopt the Bootstrap Cross-Entropy loss (BsCE) wu_bridging_bootstrap_celoss_2016 to force the network to focus on the hard and valuable pixels during training, and we select only the hardest pixels for back-propagation. Besides, we use the mask-IoU loss to optimize the global structure instead of focusing on single independent pixels; this loss is not affected by the unbalanced class distribution. Thus, we use the combination of the BsCE loss and the mask-IoU loss as supervision.
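A minimal NumPy sketch of the two loss terms for a single binary mask is given below. The fraction of hardest pixels (`top_ratio`), the soft-IoU formulation, and the equal weighting of the two terms are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def bootstrapped_ce(prob_fg, target, top_ratio=0.25):
    """Mean cross-entropy over the hardest pixels only (illustrative).

    prob_fg: (H, W) predicted foreground probability
    target:  (H, W) binary ground truth mask
    top_ratio: hypothetical fraction of pixels kept, ranked by per-pixel loss
    """
    eps = 1e-7
    p = np.clip(prob_fg, eps, 1.0 - eps)
    ce = -(target * np.log(p) + (1 - target) * np.log(1.0 - p)).ravel()
    k = max(1, int(round(top_ratio * ce.size)))
    return np.sort(ce)[-k:].mean()   # average of the k largest per-pixel losses

def mask_iou_loss(prob_fg, target):
    """Soft IoU loss on the whole mask: 1 - intersection / union."""
    eps = 1e-7
    inter = (prob_fg * target).sum()
    union = (prob_fg + target - prob_fg * target).sum()
    return 1.0 - (inter + eps) / (union + eps)

def total_loss(prob_fg, target, top_ratio=0.25):
    # Equal weighting of the two terms is an assumption of this sketch.
    return bootstrapped_ce(prob_fg, target, top_ratio) + mask_iou_loss(prob_fg, target)

if __name__ == "__main__":
    target = np.array([[1.0, 0.0], [0.0, 1.0]])
    prob = np.array([[0.9, 0.1], [0.6, 0.4]])
    print(bootstrapped_ce(prob, target, 0.5), mask_iou_loss(prob, target))
```

In the example, keeping only the harder half of the pixels averages the two misclassified ones, so the bootstrapped loss is larger than the plain mean and the gradient concentrates on the errors.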
3.5 Implementation Details
We implement our model in PyTorch paszke_pytorch_NIPS_2019 on a single NVIDIA RTX 2080Ti GPU. We use ResNet50 kaiming_resnet50_cvpr_2016 as the feature extractor (Enc-Q, Enc-R), pretrained on ImageNet. During training, we select 6 continuous frames per sequence as a batch (one as the reference frame and the other five as query frames). During inference, we feed the video sequence frame by frame and do not use online fine-tuning. We simply apply common data augmentation on current frames, including flip, color jitter and affine transformation. The input frames are randomly resized and cropped to 400×400, and we use the raw image size during inference. We minimize the BsCE and mask-IoU losses with the AdamW loshchilov_AdamW_ICLR_2018 optimizer with default parameters. The initial learning rate is and the weight decay is .
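The clip sampling described above can be sketched in a few lines of standard-library Python. The sketch assumes that the first frame of each sampled window serves as the reference frame; the function name `sample_training_clip` is hypothetical.

```python
import random

def sample_training_clip(num_frames, clip_len=6, rng=random):
    """Pick `clip_len` consecutive frame indices from a sequence (illustrative).

    Returns (reference_index, query_indices). Treating the first frame of
    the window as the reference is an assumption of this sketch.
    """
    if num_frames < clip_len:
        raise ValueError("sequence is shorter than the requested clip")
    start = rng.randrange(num_frames - clip_len + 1)
    indices = list(range(start, start + clip_len))
    return indices[0], indices[1:]

if __name__ == "__main__":
    ref, queries = sample_training_clip(25, rng=random.Random(0))
    print(ref, queries)
```

Sampling consecutive frames matters here because the motion cue is computed between adjacent frames, so skipping frames during training would change the displacement statistics the MU-Layer sees.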
4.1 Datasets, Evaluation metrics, and Protocols
We train and evaluate our method on DAVIS 2017 Pont_DAVIS_arxiv_2017 and YouTube-VOS 2018 Xu_YouTube_ECCV_2018 datasets. Considering that many methods use extra static image datasets for pre-training Cheng_PAMI_2015; everingham_pascal_voc_2010; shi_hierarchical_pami_2015; lin_coco_microsoft_2014; li_secrets_2014 or conduct fine-tuning, for a fair comparison, we categorize and compare solutions based on whether they use extra static datasets and online fine-tuning.
DAVIS 2017. The DAVIS17 dataset contains 120 video sequences in total, where 60 sequences are split for training, 30 for validation, and 30 for testing. Each video contains one or several annotated objects to track. Each video sequence has 25 to 104 annotated continuous frames.
YouTube-VOS 2018. The YouTube-VOS18 dataset contains 4453 videos with one or more target objects, including 3471 videos for training (65 categories) and 474 sequences for validation (with an additional 26 unseen categories). Each video sequence has 20 to 180 discontinuous frames, where only every fifth frame is provided and annotated.
Evaluation metrics. Following the standard DAVIS protocol, we measure region accuracy by the average intersection-over-union (IoU), and boundary accuracy via bipartite matching between boundary pixels. In addition, we compare inference speed in frames per second (FPS), averaged over the sequences of the validation set.
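The region accuracy for one frame and one object is the Jaccard index of the predicted and ground truth masks. The sketch below is a minimal version; returning 1.0 when both masks are empty is a convention assumed here, and the function name `region_iou` is illustrative. The boundary measure is more involved and is not sketched.

```python
import numpy as np

def region_iou(pred_mask, gt_mask):
    """Intersection-over-union of two binary masks (illustrative)."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0   # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, gt).sum() / union)

if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    per_frame = [region_iou(pred, gt), region_iou(gt, gt)]
    print(per_frame, float(np.mean(per_frame)))   # per-frame scores and their mean
```

Sequence-level and dataset-level scores are then obtained by averaging the per-frame values.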
Evaluation protocols. There are various training protocols for the existing VOS methods. For a fair comparison with more methods, we train our model and report the results under different data-use protocols, including (1) only using DAVIS17 for training, (2) jointly training on DAVIS17 and YouTube-VOS18, and (3) using static images Cheng_PAMI_2015; everingham_pascal_voc_2010; shi_hierarchical_pami_2015; lin_coco_microsoft_2014; li_secrets_2014 to pretrain and then fine-tuning on DAVIS17 (3-1) or on both datasets jointly (3-2), as illustrated in Fig. 4. Different training settings may cause unfair comparisons, such as using different backbones (e.g. ResNet50, ResNet101), using well-designed data augmentation methods (e.g. Hide-and-Seek, Label Shuffling, Balanced Random-Crop), using different weight initializations (e.g. Mask-RCNN, DeepLab), and whether online fine-tuning is performed on the test video or not. Here, we report the detailed comparisons as fairly as possible.
It is worth mentioning that the low-data experimental setups greatly reduce the training time and no longer require a large amount of data for pretraining. Although pretraining on static images under protocol(3) is a smart way to initialize the network and improve the metrics, researchers have gradually realized its shortcomings, such as the long training time. In our setting, one epoch of pretraining under protocol(3) needs about 17 hours, and the whole pretraining process may take 67 days on a single 2080Ti GPU. In contrast, one epoch takes about 25 minutes under protocol(2) and about 12 minutes under protocol(1), so training protocols (1) and (2) are much lighter. We believe that abandoning the pretraining phase by emphasizing the motion information in the target data itself is an improvement for the VOS field.
4.2 Quantitative Comparison
DAVIS17. Results on the DAVIS17 validation set are reported in Table 2. Our method ranks 1st under training protocol(1) and protocol(3-1), and ranks 2nd under protocol(2). Under protocol(1), using only the DAVIS17 dataset for training, our vanilla version outperforms all other methods. A better initialization with Mask-RCNN ResNet50 further boosts the performance of our method, which then significantly outperforms STM and CFBI. In addition, results for joint training under protocol(2) are reported in the same table, and the performance increases further if static images are used for pretraining. Overall, our method surpasses most competitors in Table 2, which demonstrates its competitiveness.
|STM oh_STM_ICCV_2019||-||-||43.0||6.25|
Note on Tables 2 and 3: “S” denotes static images for training, “Y” denotes YouTube-VOS18, and “O” denotes the fine-tune-on-test strategy. Separate markers indicate a backbone initialized from DeepLabv3-ResNet101 and from Mask-RCNN-ResNet50. The best result is bold-faced, and the second-best result is underlined.
YouTube-VOS18. We report the results on the YouTube-VOS18 validation set in Table 3. The subscripts denote the unseen and seen categories, and the overall score is the average of all four measures. Our method achieves competitive results under different training settings. Compared with the baseline STM oh_STM_ICCV_2019, the motion information brings a performance gain. Notably, our method also brings a gain without pretraining on static images. In addition, we found that the improvement on YouTube-VOS is not as significant as on DAVIS. The reason is that DAVIS consists of continuous frames, while only every fifth frame (discontinuous) is used in YouTube-VOS. The MU-Layer cannot effectively capture long-range motion information without optical flow ground truth supervision.
4.3 Ablation Study
In this section, we conduct several ablation experiments on the DAVIS17 validation set via training protocol(2) in Fig. 4 to discuss the effectiveness of our proposed modules. The baseline version of these experiments uses ResNet50, without any proposed modules.
MU-Layer. We train a network without our MU-Layer to study how it influences matching based methods. As shown in Table 4 (a) and (c), our proposed MU-Layer brings a performance gain of 1.5 in J&F (76.5 vs. 78.0). We also visualize the displacement and uncertainty map in Fig. 5 and Fig. 6, which shows that our MU-Layer extracts reliable motion information.
MSAM. To evaluate the effectiveness of our fusion module, we report the results of addition (f) and MSAM (g) in Table 4, based on the complete structure (e). The comparison shows that MSAM improves on the vanilla addition based fusion method by 1.0 in J&F (80.1 vs. 81.1). We visualize the one-channel spatial attention map before the broadcast multiply in Fig. 7. We can observe that the attention map focuses on the salient area of the moving objects, which demonstrates that MSAM helps strengthen the object information.
Segmentation Loss. We study how a well-designed segmentation loss affects the performance. The results of (a) and (b) are listed in Table 4, and the performance drops without this loss (76.5 vs. 78.6 in J&F). This loss therefore provides us with a strong baseline.
|Variant||Baseline and components||J||F||J&F||Δ|
|(a)||w/o disp. (baseline)||74.5||78.4||76.5||-|
|(b)||+ BsCE&IoU loss||76.2||80.9||78.6||+2.1|
|(c)||+ disp. + uncertainty map||75.9||80.1||78.0||+1.5|
|(d)||+ disp. + BsCE&IoU||76.9||82.5||79.7||+3.2|
|(e)||+ disp. + BsCE&IoU + uncertainty||78.3||83.9||81.1||+4.6|
|(f)||All three cmpt. w [addition]||77.9||82.3||80.1||+3.6|
|(g)||All three cmpt. w [motion attention]||78.3||83.9||81.1||+4.6|
4.4 Some Failure Cases
Here we provide some failure cases of our scheme in Fig. 8. After the previous quantitative and qualitative comparisons, our solution has achieved great success by exploiting the effective motion information. As common optical flow networks cannot handle extremely complex/fast motions and large-scale occlusion well, and there is no ground truth for supervision in our implicit modeling method, we also have flaws in some extreme examples. However, our method still outperforms other competing methods under these challenging scenarios.
4.5 Qualitative Comparison
We give a qualitative comparison with some SOTA methods liang_AFB_URR_NIPS_2020; yang_CFBI_ECCV_2020; Robinson_frtm-vos_CVPR_2020; oh_STM_ICCV_2019 in Fig. 1. In soapbox#74 and scooter-black#42, our method handles the boundaries between objects well. In paragliding-launch#20, our method deals with moving thin lines better. In contrast, FRTM Robinson_frtm-vos_CVPR_2020 loses parts of objects or over-covers the thin object. AFB-URR liang_AFB_URR_NIPS_2020 struggles to discriminate between different objects and almost fails to detect small and thin objects.
In Fig. 9, we show how IoU changes over time by per-frame evaluation. Because the motion information we use exists between any adjacent frames and does not disappear over time, our method can still maintain a high IoU in the last 50% of frames. Compared with STM and AFB-URR, we achieve more robust segmentation. More qualitative results of our method are shown in Fig. 10. Thanks to the accurate modeling of motion information, our method is robust to occlusion and complex/large motion in row#1,2,4 and can distinguish the small/thin objects in row#3. In Fig. 11, we select 3 video sequences of YouTube-VOS18 for visual comparison with AFB-URR liang_AFB_URR_NIPS_2020.
In this work, we advocated the return of motion information in the state-of-the-art dense-matching based semi-supervised VOS approaches. We proposed a novel motion uncertainty-aware pipeline for semi-supervised VOS, where the motion information is implicitly modeled. We implicitly built correlation volume by matching pixel-pairs between the reference frame and the query frame, enabling the learning of motion features. To address the challenging cases of occlusion and textureless regions, we incorporated the motion uncertainty into building dense correspondences. Furthermore, we proposed a motion-enhanced module to effectively fuse the motion feature and the semantic feature. Extensive experiments on benchmark datasets proved the superiority of our proposed framework.
This research was supported in part by the National Key Research and Development Program of China under Grant 2018AAA0102803 and the National Natural Science Foundation of China under Grants 61871325, 62001394 and 61901387.