Self-Supervised Video Object Segmentation by Motion-Aware Mask Propagation
We propose a self-supervised spatio-temporal matching method coined Motion-Aware Mask Propagation (MAMP) for semi-supervised video object segmentation. During training, MAMP leverages the frame reconstruction task to train the model without the need for annotations. During inference, MAMP extracts high-resolution features from each frame to build a memory bank from the features as well as the predicted masks of selected past frames. MAMP then propagates the masks from the memory bank to subsequent frames according to our motion-aware spatio-temporal matching module, also proposed in this paper. Evaluation on DAVIS-2017 and YouTube-VOS datasets show that MAMP achieves state-of-the-art performance with stronger generalization ability compared to existing self-supervised methods, i.e. 4.9% higher mean 𝒥&ℱ on DAVIS-2017 and 4.85% higher mean 𝒥&ℱ on the unseen categories of YouTube-VOS than the nearest competitor. Moreover, MAMP performs on par with many supervised video object segmentation methods. Our code is available at: <https://github.com/bo-miao/MAMP>.
Video object segmentation (VOS) is a fundamental problem in visual understanding where the aim is to segment objects of interest from the background in unconstrained videos. VOS enables machines to sense the motion pattern, location, and boundaries of the objects of interest in videos, which is useful in a wide range of applications. For example, in video editing, manual frame-wise segmentation is laborious and does not maintain temporal consistency, whereas VOS can segment all frames automatically using the mask of one frame as a guide. The problem of segmenting objects of interest in a video using the ground truth object masks provided for only the first frame is referred to as semi-supervised video object segmentation. This is challenging because the appearance of objects in a video changes significantly due to fast motion, occlusion, and scale variation. Moreover, other similar-looking non-target objects may confuse the model into segmenting incorrect objects.
Semi-supervised video object segmentation techniques fall into two categories: supervised and self-supervised. Supervised approaches [24, 43] use the rich annotation information in training data to learn the model, achieving great success in video object segmentation. Despite their success, these methods are not attractive given their reliance on accurate pixel-level annotations for training (see Fig. 1), which are expensive to generate. Moreover, supervised approaches struggle to maintain the same performance in the wild. In contrast, self-supervised approaches [12, 19] learn feature representations based on the intrinsic properties of the video frames, and thus do not require any annotations and can better generalize to unseen objects. Even though the motivations behind existing self-supervised methods are different, they share the same objective of learning to extract general feature representations and construct precise spatio-temporal correspondences to propagate the object masks in the video sequences. Exploiting the spatio-temporal coherence in videos, the pretext tasks of self-supervised methods can be designed either to minimize a reconstruction or prediction loss or to maximize temporal cycle-correspondence consistency. Once trained, the models are able to extract general feature representations and build spatio-temporal correspondences between the reference and query frames. Therefore, the pixels in the query frames can be classified according to the mask labels of their corresponding regions of interest (ROIs) in the reference frames. Despite their simplicity, existing self-supervised methods perform poorly in the case of fast motion and long-term matching scenarios.
To deal with the above challenges, we propose a self-supervised method, coined motion-aware mask propagation (MAMP). Similar to previous self-supervised methods, MAMP learns to represent image features and build spatio-temporal correspondences without any annotations during training. During inference, MAMP first leverages the feature representations and the given object masks for the first frame to build a memory bank. The proposed motion-aware spatio-temporal matching module in MAMP then takes advantage of the motion cues to mitigate the issues caused by fast motion and long-term correspondence mismatches, and propagates the mask from the memory bank to subsequent frames. Moreover, the proposed size-aware image feature alignment module fixes the misalignment during mask propagation and the memory bank is constantly updated by the past frames to provide the most appropriate spatio-temporal guidance. We evaluate MAMP on the DAVIS-2017 and YouTube-VOS benchmarks to verify its effectiveness as well as generalization ability.
Our contributions are summarized as follows:
We propose Motion-Aware Mask Propagation (MAMP) for semi-supervised video object segmentation that trains the model end-to-end without any annotations and effectively propagates the masks across frames.
We propose a motion-aware spatio-temporal matching module to mitigate errors caused by fast motion and long-term correspondence mismatches. The proposed module improves the performance of MAMP on YouTube-VOS by 6.4%.
Without any bells and whistles (e.g., fancy data augmentations, online adaptation, or external datasets), MAMP significantly outperforms existing self-supervised methods by 4.9% on DAVIS-2017 and 4.85% on the unseen categories of YouTube-VOS.
Experiments on the YouTube-VOS dataset show that our MAMP has the best generalization ability compared to existing self-supervised and supervised methods.
Semi-supervised video object segmentation aims to leverage the ground truth object mask given (only) in the first frame to segment the objects of interest in subsequent frames. Existing semi-supervised video object segmentation methods can be divided into online-learning methods and offline-learning methods depending on whether online adaptation is needed during inference.
Online-learning methods [21, 22] usually update the networks dynamically during inference to make them object-specific in each video. OSVOS and MaskTrack simply fine-tune the networks on the first frame of each video to make the networks object-specific. Lucid Tracker leverages data augmentation to generate more synthetic training data to fine-tune the model. OnAVOS fine-tunes the model multiple times on different past frames to adapt to the appearance changes across frames. PReMVOS fine-tunes three different networks and combines their outputs with optical flow for more accurate segmentation. TAN-DTTM fine-tunes the detection network on the first frame, and segments the objects on the extracted object-level images. RANet takes online-learning as an optional step to boost the model's performance. Previous methods have shown the effectiveness of online-learning; however, online-learning is time-consuming and adversely affects the models' generalization ability.
Offline-learning methods [14, 23, 6, 32, 17] usually propagate the given object mask of the first frame either explicitly or implicitly to subsequent frames, making the expensive online adaptation no longer necessary. RVOS leverages recurrent networks to implicitly propagate the predicted masks of past frames. PML stores the embeddings of the first frame and propagates the mask according to a nearest-neighbor method. CFBI implements spatio-temporal matching on both foreground and background features for mask propagation. STM, GC, and KMN [24, 16, 28] leverage a memory bank and attention mechanism to propagate the spatio-temporal features. MAST stores the reference features and masks separately in the memory bank and propagates the reference masks to subsequent frames according to the local spatio-temporal correspondences. Existing offline-learning methods usually perform local or non-local spatio-temporal matching for temporal association and mask propagation. However, non-local matching contains too much noise and has a large memory footprint, while local matching struggles to cope with problems from fast motion and long-term correspondence mismatches.
The proposed MAMP belongs to the offline-learning methods, which means that the time-consuming online adaptation is not required. MAMP leverages a dynamically updated memory bank to store features and masks from selected past frames, and propagates masks effectively according to our proposed motion-aware spatio-temporal matching module. In contrast to previous local and non-local matching methods, the proposed motion-aware spatio-temporal matching module not only excludes the noisy matching results but also mitigates the problems caused by fast motion and long-term correspondence mismatches.
Memory networks aim to capture long-term dependencies by storing temporal features or different categories of features in a memory module. LSTM and GRU implicitly represent spatio-temporal features with local memory cells and update them in a recurrent process. However, the information within the memory cells is highly compressed and has limited representation ability. To overcome this issue, memory networks were introduced to explicitly store the important features. A commonly used memory network in video object segmentation is STM, which incrementally adds the features of past frames to the memory bank, and leverages non-local spatio-temporal matching to provide spatio-temporal features. However, the incremental memory bank updates are impractical when segmenting long videos due to the growing memory cost. In this work, we divide the memory into long-term and short-term memory. The former is fixed, whereas the latter is updated dynamically using the past few frames, making our MAMP more memory efficient.
Self-supervised learning can learn general feature representations and spatio-temporal correspondences based on the intrinsic properties of videos. It has shown promising capacity on various downstream tasks because it does not require any annotations to train the model and can better generalize to other datasets [4, 10, 25, 36, 29]. Many pretext tasks have been explored for self-supervised learning on videos, such as query frame reconstruction, future frame prediction, patch re-localization, and motion statistics prediction. In this work, we use a generative reconstruction task to train the model.
Figure 2 shows an overview of our MAMP method for video object segmentation. As shown in Fig. 2(a), MAMP is trained with the reconstruction task to learn feature representations and construct robust spatio-temporal correspondences between a pair of frames in the same video. Hence, zero annotation is required to train the model. During inference, MAMP segments the frames in a sequential manner. As shown in Fig. 2(b), selected past frames with their features (Key) and object masks (Value) are stored in the memory bank for future reference, and the current frame is encoded into Query. Subsequently, the motion-aware spatio-temporal matching module calculates the spatio-temporal affinity matrix between Key and Query, and mask propagation is achieved by multiplying the warped Value with the spatio-temporal affinity matrix. The size-aware image feature alignment module also participates in the mask propagation process to prevent misalignment.
We use the reconstruction task for feature representation learning and robust spatio-temporal matching. Since the channel correlation in the Lab color space is smaller than that of the RGB color space, we randomly drop one channel from the Lab channels as the target to be reconstructed. Dropout preserves enough information for the input and avoids trivial solutions. The model is forced to learn the general feature representations and the spatio-temporal correspondences between the reference frames and query frames instead of learning how to predict the missing channel based on the observed channels. Therefore, in this paper, frames are converted into the Lab color space and channel dropout on the Lab channels is used to generate reconstruction targets.
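As a sketch, the channel-dropout step described above can be implemented as follows. The function name and the choice to remove (rather than zero out) the dropped channel are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def make_reconstruction_target(frame_lab, rng=None):
    """Randomly drop one channel of an (H, W, 3) Lab frame.

    Returns the 2-channel input, the dropped channel as the
    reconstruction target, and the dropped channel index.
    """
    rng = rng if rng is not None else np.random.default_rng()
    c = int(rng.integers(0, 3))                 # channel index to drop
    target = frame_lab[..., c]                  # (H, W) target to reconstruct
    keep = [i for i in range(3) if i != c]
    inp = frame_lab[..., keep]                  # (H, W, 2) network input
    return inp, target, c
```

In practice the Lab conversion itself would be done by an image library; here the frame is assumed to already be a Lab array.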
To minimize the reconstruction loss, the semantically similar pixels between reference frames and query frames are forced to have highly correlated feature representations, while the semantically dissimilar pixels are forced to have weakly correlated feature representations. Finally, the reconstruction target of the query frames is predicted according to the highly correlated ROIs in the reference frames.
Specifically, given a reference frame $I_r$ and a query frame $I_q$ from the same video, a parameter-sharing convolutional encoder $\Phi$ is used to extract the feature representations of the two frames, $f_r = \Phi(I_r)$ and $f_q = \Phi(I_q)$, where the reduced feature dimension acts as an information bottleneck to prevent trivial solutions. The dropped channels of the two frames are downsampled to the resolution of the features, yielding $y_r$ and $y_q$.
To enable $f_q$ to represent and reconstruct $y_q$, the local affinity matrix $A$ that represents the strength of the correlation between $f_q$ and $f_r$ is calculated:
$$A_{ij} = \frac{\exp\left(\langle f_q(i), f_r(j)\rangle / \sqrt{C}\right)}{\sum_{j' \in \Omega_i} \exp\left(\langle f_q(i), f_r(j')\rangle / \sqrt{C}\right)},$$
where $i$ and $j$ are the locations in $f_q$ and $f_r$, respectively, $\langle\cdot,\cdot\rangle$ is the dot product between two vectors, and $C$ refers to the number of channels, used to re-scale the correlation value. $\Omega_i$ is the ROI of $i$, i.e., the set of reference locations within radius $r$ of $i$, where $r$ is the radius of the ROI. Next, a location in $y_q$ is represented by the weighted sum of the corresponding ROI in $y_r$:
$$\hat{y}_q(i) = \sum_{j \in \Omega_i} A_{ij}\, y_r(j).$$
Finally, $\hat{y}_q$ is upsampled to the input resolution, and the Huber loss is used to force $\hat{y}_q$ to be close to $y_q$:
$$\mathcal{L} = \mathrm{Huber}\left(\hat{y}_q,\, y_q\right).$$
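The affinity computation and weighted-sum reconstruction above can be sketched in NumPy. This is a didactic reference implementation with illustrative names (`reconstruct_query`), not the paper's optimized code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def reconstruct_query(f_ref, f_qry, y_ref, radius=1):
    """Reconstruct the query target from the reference target.

    f_ref, f_qry: (H, W, C) features; y_ref: (H, W) dropped channel
    of the reference frame, downsampled to feature resolution.
    Each query location attends to a (2r+1)^2 ROI in the reference.
    """
    H, W, C = f_qry.shape
    y_hat = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            roi_f = f_ref[i0:i1, j0:j1].reshape(-1, C)   # ROI features
            roi_y = y_ref[i0:i1, j0:j1].reshape(-1)      # ROI targets
            logits = roi_f @ f_qry[i, j] / np.sqrt(C)    # scaled dot product
            y_hat[i, j] = softmax(logits) @ roi_y        # weighted sum
    return y_hat
```

A vectorized implementation (e.g., with unfold-style sliding windows) would be used in practice; the per-pixel loop is kept here for clarity.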
Most of the object instances present in videos move over time. Therefore, segmentation models should be able to retrieve the corresponding ROIs from the reference frames for mask propagation under motion. To meet this constraint, non-local spatio-temporal matching methods consider all locations in the reference frames as potential ROIs. However, non-local matching occupies too much memory and generates many noisy matches. Local spatio-temporal matching methods retrieve the ROIs in the reference frames based on the location coordinates in the query frame and a pre-defined retrieval radius. Although local matching is more efficient, existing local matching methods have limited receptive fields and cannot localize the most correlated ROIs when they encounter fast motion or after long-term matching.
Intuitively, motion cues can help the query locations retrieve the most correlated ROIs for fast moving objects and in scenarios of long-term matching. Therefore, we propose a motion-aware spatio-temporal matching module that leverages optical flow to enable the query locations to retrieve the most correlated ROIs from the reference frames as shown in Fig. 3(a). We use RAFT to compute optical flow with an efficient setting that costs only about 18ms and 13ms for a pair of frames in the DAVIS-2017 and YouTube-VOS datasets, respectively. As shown in Fig. 3(b), the vanilla local spatio-temporal matching method cannot retrieve the most correlated ROIs for the query pixel. However, with our motion-aware spatio-temporal matching, the query pixel can find its most correlated ROIs even if the pixels in the ROIs are not contiguous in the original image space.
Formally, given a reference frame $I_r$ and a query frame $I_q$ from the same video, the optical flow from $I_q$ to $I_r$ is first computed. Therefore, for a location $(x, y)$ in the query frame, the center of the corresponding ROI in the reference frame is $(x + \Delta x,\, y + \Delta y)$, where $\Delta x$ and $\Delta y$ are the displacements along the horizontal and vertical directions, respectively. Subsequently, the reference frame is warped according to the optical flow, making the locations with the same coordinates in the reference frame and the query frame the most similar pairs. During inference, with all the ROIs being dynamically sampled from different reference frames in the memory bank based on the optical flow, the mask propagation becomes more accurate and the problems caused by fast motion and long-term correspondence mismatches are alleviated.
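A minimal sketch of flow-guided ROI retrieval, assuming a dense backward flow field from the query to the reference frame; the nearest-neighbour warp and the function names are illustrative simplifications (RAFT produces the flow in the actual pipeline):

```python
import numpy as np

def flow_guided_center(x, y, flow):
    """ROI center in the reference frame for query location (x, y).

    flow: (H, W, 2) flow field storing (dx, dy) per query location.
    """
    dx, dy = flow[y, x]
    return x + dx, y + dy

def warp_reference(f_ref, flow):
    """Nearest-neighbour warp of the reference so that each warped
    location aligns with its flow-matched query location."""
    H, W = f_ref.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
    return f_ref[src_y, src_x]
```

After the warp, an ordinary local matching window around each query coordinate covers the flow-displaced ROI.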
To reduce memory consumption, previous methods perform bilinear downsampling on the supervision signals (i.e., masks) of the reference frames and propagate the supervision signals at the feature resolution. However, this operation introduces misalignment between the strided convolution layers and the supervision signals from naïve bilinear downsampling. Prior work proposed an image feature alignment module to deal with this problem and achieved great success. As shown in Fig. 4(a), it handles the misalignment at the downsampling stage by directly sampling the supervision signals at the convolution kernel centers. However, this does not cater for the misalignment caused at the upsampling stage. To deal with this problem, we propose a size-aware image feature alignment module, which leverages simple padding and unpadding to fix the misalignment at the upsampling stage. As shown in Fig. 4(b), if the input size is not divisible by the downsampling factor, it will be padded automatically to satisfy this constraint, allowing vanilla image feature alignment to be effectively used on the padded inputs. Hence, the misalignment at both the downsampling and the upsampling stages is fixed.
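The padding/unpadding idea can be sketched as follows, assuming a stride-4 encoder; the function names and the `edge` padding mode are assumptions for illustration:

```python
import numpy as np

def pad_to_multiple(img, stride=4):
    """Pad an (H, W, ...) array so H and W are divisible by `stride`."""
    H, W = img.shape[:2]
    ph = (-H) % stride                     # rows to add
    pw = (-W) % stride                     # columns to add
    pad = [(0, ph), (0, pw)] + [(0, 0)] * (img.ndim - 2)
    return np.pad(img, pad, mode="edge"), (H, W)

def unpad(img, orig_hw):
    """Crop back to the original spatial size after upsampling."""
    H, W = orig_hw
    return img[:H, :W]
```

With the padded input, the feature map upsampled by the same factor lands exactly on the padded grid, and `unpad` recovers the original resolution without sub-pixel shift.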
| Layer Name | Output Size | Configuration |
| --- | --- | --- |
| Conv1 | | 7×7, 64, stride 2 |
| Method | Backbone | Num. Param. | Train. Dataset | Video Length | 𝒥&ℱ (Mean) | 𝒥 (Mean) | ℱ (Mean) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Vid. Color. | ResNet-18 | 5M | K | 800 hrs | 34.0 | 34.6 | 32.7 |
| CycleTime | ResNet-50 | 9M | V | 344 hrs | 48.7 | 46.4 | 50.0 |
| CorrFlow | ResNet-18 | 5M | O | 14 hrs | 50.3 | 48.4 | 52.2 |
| UVC | ResNet-18 | 3M | K | 800 hrs | 59.5 | 57.7 | 61.3 |
| RPM-Net | ResNet-101 | 43M | DY | 5.67 hrs | 41.6 | 41.0 | 42.2 |
| Mug | ResNet-50 | 9M | O | 14 hrs | 56.1 | 54.0 | 58.2 |
| MAST | ResNet-18 | 5M | Y | 5.58 hrs | 65.5 | 63.3 | 67.6 |
Evaluation on DAVIS-2017 validation set. Note that each method modifies vanilla backbone models to suit their framework. Training Dataset notations: C=COCO, D=DAVIS, E=ECSSD, H=HKU-IS, I=ImageNet, K=Kinetics, M=Mapillary, O=OxUvA, P=PASCAL-VOC, S=MSRA10K, V=VLOG, Y=YouTube-VOS.
Training: We modify ResNet-18 and use it as the encoder to extract image features with a spatial resolution of 1/4 of the input images. Table 1 shows the detailed architecture. The parameters of the encoder are randomly initialized without pre-training. A pair of frames that are close in time and from the same video are randomly sampled as the reference frame and query frame, and the reconstruction task with the Huber loss is used to train the model. During training, all frames are converted into the Lab color space and channel dropout is used only on the Lab channels to generate the reconstruction target. Hence, our method does not require an annotation mask. For pre-processing, the frames are resized to 256×256, and no data augmentation is used.
We train our model with pairwise frames for 35 epochs on YouTube-VOS using a batch size of 24 for all experiments. We adopt the Adam optimizer with a base learning rate of 1e-3, and the learning rate is divided by 2 after 0.4M, 0.6M, 0.8M, and 1.0M iterations, respectively. Our model is trained end-to-end without any multi-stage training strategies, such as fine-tuning with multiple reference frames or fine-tuning in a sequential manner. The training takes about 11 hours on one NVIDIA GeForce 3090 GPU.
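The step-decay schedule described above can be written as a small helper; the milestone values follow the text, while the function name is illustrative:

```python
def learning_rate(step, base_lr=1e-3,
                  milestones=(400_000, 600_000, 800_000, 1_000_000)):
    """Step decay: halve the learning rate at each passed milestone."""
    lr = base_lr
    for m in milestones:
        if step >= m:
            lr /= 2
    return lr
```

For example, the rate stays at 1e-3 until 0.4M iterations, then halves at each subsequent milestone down to 1/16 of the base rate after 1.0M iterations.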
Testing: The proposed MAMP does not require time-consuming online adaptation to fine-tune the model during testing. To be consistent with benchmarks, the 2018 version of the YouTube-VOS validation set and the DAVIS-2017 validation set are used to evaluate MAMP. DAVIS-2017 is evaluated at the raw resolution, and YouTube-VOS 2018 is evaluated at half resolution for efficiency.
During testing, MAMP leverages the size-aware image feature alignment module to fix the misalignment issues, and uses the trained encoder to extract Key and Query. After that, MAMP uses the proposed motion-aware spatio-temporal matching module to propagate the masks from Value to subsequent frames. The memory bank of MAMP is updated dynamically to include fixed early frames as long-term memory and several recent past frames as short-term memory. To further filter out noise and redundancy, only the top 36 correlated locations in the ROIs are used for mask propagation (see Table 6).
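A minimal sketch of such a memory bank, with illustrative sizes (two fixed long-term slots and a three-frame short-term sliding window) that approximate, but may not exactly match, the paper's frame selection:

```python
class MemoryBank:
    """Long-term memory keeps the first few frames fixed;
    short-term memory is a sliding window over recent frames."""

    def __init__(self, long_term_size=2, short_term_size=3):
        self.long_term = []            # fixed (key, value) pairs
        self.short_term = []           # sliding window of recent pairs
        self.long_term_size = long_term_size
        self.short_term_size = short_term_size

    def update(self, key, value):
        if len(self.long_term) < self.long_term_size:
            self.long_term.append((key, value))   # fill fixed slots first
        else:
            self.short_term.append((key, value))  # then slide the window
            if len(self.short_term) > self.short_term_size:
                self.short_term.pop(0)

    def entries(self):
        return self.long_term + self.short_term
```

Keeping the window bounded is what makes inference memory-efficient on long videos, in contrast to incrementally growing banks such as STM's.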
We benchmark MAMP on two widely-used datasets DAVIS-2017 and YouTube-VOS. DAVIS-2017 is a commonly used dataset with short videos and complex scenes. It contains 150 videos with over 200 objects. YouTube-VOS is the largest dataset with long videos. YouTube-VOS contains more than 4000 high-resolution videos with more than 7000 objects. Unlike previous methods that leverage several external datasets to train the model, we only train our model on YouTube-VOS and test our model on both DAVIS-2017 and YouTube-VOS.
We use Region Similarity and Contour Accuracy to evaluate the performance of MAMP. Additionally, we report the Generalization Gap as in prior work to evaluate the generalization ability of MAMP on YouTube-VOS. The Generalization Gap computes the model's performance gap between inference on seen categories and inference on unseen categories, and its value is inversely proportional to the generalization ability of the model.
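Assuming the Generalization Gap is the mean seen score minus the mean unseen score (a definition consistent with the numbers reported in the results table), it can be computed as:

```python
def generalization_gap(j_seen, f_seen, j_unseen, f_unseen):
    """Mean seen score minus mean unseen score (assumed definition).

    Lower (or negative) values indicate better generalization.
    """
    return (j_seen + f_seen) / 2 - (j_unseen + f_unseen) / 2
```

For instance, seen scores of 43.1/38.6 and unseen scores of 36.6/37.4 give a gap of about 3.85, matching the rounded 3.9 reported for Vid. Color.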
| Method | Overall | 𝒥 (Seen) | ℱ (Seen) | 𝒥 (Unseen) | ℱ (Unseen) | Gen. Gap |
| --- | --- | --- | --- | --- | --- | --- |
| Vid. Color. | 38.9 | 43.1 | 38.6 | 36.6 | 37.4 | 3.9 |
We compare MAMP with existing methods on DAVIS-2017 and YouTube-VOS. The validation sets of DAVIS-2017 and YouTube-VOS contain 30 videos and 474 videos, respectively. It is noteworthy that research on video object segmentation is developing rapidly, and there exist various training strategies, architectures, and post-processing methods. We do our best to compare the results as fairly as possible. For example, no multi-stage training strategies, external datasets, data augmentations, or online adaptation are used in this work. Moreover, we only use an efficient modified ResNet-18 as the encoder.
Table 2 and Table 3 summarize the performance of the state-of-the-art methods and MAMP on DAVIS-2017 and YouTube-VOS. MAMP significantly outperforms benchmark self-supervised methods by over 4.9% on DAVIS-2017, 4% on YouTube-VOS, and 4.85% on the unseen categories of YouTube-VOS. Moreover, MAMP is also comparable to some previous supervised methods. These results demonstrate the effectiveness of MAMP.
To evaluate the generalization ability of MAMP, we evaluate it on both “seen” categories and “unseen” categories of YouTube-VOS. Objects in “unseen” categories do not appear in the training set. As shown in Table 3, MAMP performs well on “unseen” categories and has the best generalization ability compared with other methods. Surprisingly, it performs better on “unseen” categories than on “seen” categories because of the better boundary segmentation performance on “unseen” objects. These results indicate that MAMP can learn general feature representations that are not restricted by the specific object categories in the training set. The most comparable supervised method in generalization ability is GC  (1.8 vs -1.2). However, GC is trained with several external datasets with precise ground truth annotations.
| Memory | 𝒥&ℱ (DAVIS) | 𝒥 (DAVIS) | ℱ (DAVIS) | Overall (YouTube-VOS) |
| --- | --- | --- | --- | --- |
| Long-term & Short-term | 70.4 | 68.7 | 72.0 | 68.2 |

| Motion-aware matching | Size-aware alignment | 𝒥&ℱ (DAVIS) | 𝒥 (DAVIS) | ℱ (DAVIS) | Overall (YouTube-VOS) |
| --- | --- | --- | --- | --- | --- |
| ✓ | ✓ | 70.4 | 68.7 | 72.0 | 68.2 |
| | ✓ | 68.6 (−1.8) | 66.9 (−1.8) | 70.2 (−1.8) | 61.8 (−6.4) |
| ✓ | | 69.4 (−1.0) | 67.1 (−1.6) | 71.7 (−0.3) | 68.2 (−0.0) |
| | | 66.7 (−3.7) | 64.3 (−4.4) | 69.1 (−2.9) | 61.8 (−6.4) |
Figure 5 shows qualitative results of MAMP under various challenging scenarios, e.g., occlusion, fast motion, large deformations, and scale variations. Notice how MAMP is able to handle these challenging scenarios effectively.
Long-term Memory and Short-term Memory: We compared the results using different memory settings. As shown in Table 4, all memory settings have reasonable performance. Long-term memory provides accurate ground truth information for query frames, while short-term memory offers up-to-date information from past neighboring frames. The results show that MAMP using short-term memory performs better than using long-term memory. This is because the appearance and scale of the objects usually change significantly over time and using only long-term memory makes it difficult to adapt to these changes. Furthermore, it can be seen that MAMP using both memory types has the best performance as both memories are complementary.
Motion-aware Spatio-temporal Matching: As shown in Table 5, when we replaced the proposed motion-aware spatio-temporal matching module with the vanilla local spatio-temporal matching module, the performance dropped by 1.8% on DAVIS-2017 and 6.4% on YouTube-VOS. The vanilla local spatio-temporal matching module retrieves the corresponding ROIs according to a pre-defined radius. However, objects usually move over time, so the most correlated ROIs are prone to fall outside the search radius and cannot be retrieved. Fast motion and long-term correspondence mismatches cause this type of issue. Without these most correlated ROIs, the label of a location in the query frame will be determined by the labels of several uncorrelated or weakly-correlated locations in the reference frames. Therefore, the segmentation results can be further improved if we retrieve the most correlated ROIs for each location in the query frame.
The motion-aware spatio-temporal matching module leverages the motion cues to register the reference frames in the memory bank to the query frame before computing the local spatio-temporal correspondences. Therefore, the above issues are alleviated even if the reference frames are far from the query frame in time. As shown in Table 5, the motion-aware spatio-temporal matching module brings larger performance gains on YouTube-VOS, because YouTube-VOS has longer videos than DAVIS-2017 and thus suffers more from fast motion and long-term correspondence mismatches.
Size-aware Image Feature Alignment: As shown in Table 5, we replaced the proposed size-aware image feature alignment module with the vanilla image feature alignment module. The performance dropped by 1.0% on DAVIS-2017 and remained unchanged on YouTube-VOS. The size-aware image feature alignment module is built on the image feature alignment module; the difference is that the size-aware module can also fix the misalignment at the upsampling stage caused by an improper input size. To explain the results in Table 5, we computed the percentage of videos that have improper input sizes. We found that 96.7% of the videos in the DAVIS-2017 validation set have improper input sizes, while only 1.9% of the videos in the YouTube-VOS validation set do. That is why we observe performance gains only on DAVIS-2017.
TopK Correlated Locations for Mask Propagation: If we retrieve ROIs with a radius of 12 on 5 reference frames, the corresponding ROIs for one location in the query frame will include 3125 locations. However, noisy matches among these 3125 locations may adversely affect the model's performance. Hence, we filter out redundant noise and select the top-K correlated locations in the ROIs for mask propagation. As shown in Table 6, leveraging the top 36 or top 9 correlated locations in the ROIs for mask propagation improves the performance of MAMP compared to using all 3125 locations. Using the top 36 correlated locations obtains the best performance. Moreover, compared to other methods, MAMP still maintains the best performance even if only one of the 3125 locations in the ROIs is used for mask propagation. These results further demonstrate the effectiveness of the proposed motion-aware spatio-temporal matching module.
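Top-K filtering over the affinity scores can be sketched as follows; the softmax re-normalization over the kept scores is an assumption about how the retained weights are combined:

```python
import numpy as np

def topk_propagate(affinity, values, k=36):
    """Keep only the top-k correlated ROI locations per query position.

    affinity: (N, M) scores of N query locations against M ROI locations.
    values:   (M,) mask labels of the ROI locations.
    """
    N, M = affinity.shape
    k = min(k, M)
    idx = np.argpartition(-affinity, k - 1, axis=1)[:, :k]  # top-k indices
    rows = np.arange(N)[:, None]
    top = affinity[rows, idx]
    w = np.exp(top - top.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                       # softmax over top-k
    return (w * values[idx]).sum(axis=1)                    # weighted labels
```

Discarding the long tail of weak matches both denoises the propagated mask and reduces the cost of the weighted sum.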
In this paper, we proposed MAMP that enables general feature representation and motion-guided mask propagation. MAMP can train the model without any annotations, and outperforms existing self-supervised methods by a large margin. Moreover, MAMP demonstrates the best generalization ability compared to previous methods. We believe that MAMP has the potential to propagate spatio-temporal features and masks in practical video segmentation tasks. In the future, we will develop more effective pretext tasks and adaptive memory selection methods to further improve the performance of MAMP.
This research was supported by the ARC Industrial Transformation Research Hub IH180100002.