Self-Supervised Video Object Segmentation by Motion-Aware Mask Propagation

07/27/2021 · by Bo Miao et al. · The University of Western Australia, Griffith University

We propose a self-supervised spatio-temporal matching method coined Motion-Aware Mask Propagation (MAMP) for semi-supervised video object segmentation. During training, MAMP leverages the frame reconstruction task to train the model without the need for annotations. During inference, MAMP extracts high-resolution features from each frame to build a memory bank from the features as well as the predicted masks of selected past frames. MAMP then propagates the masks from the memory bank to subsequent frames according to our motion-aware spatio-temporal matching module, also proposed in this paper. Evaluation on the DAVIS-2017 and YouTube-VOS datasets shows that MAMP achieves state-of-the-art performance with stronger generalization ability compared to existing self-supervised methods, i.e., 4.9% higher mean 𝒥&ℱ on DAVIS-2017 and 4.85% higher mean 𝒥&ℱ on the unseen categories of YouTube-VOS than the nearest competitor. Moreover, MAMP performs on par with many supervised video object segmentation methods. Our code is available at: <https://github.com/bo-miao/MAMP>.

1 Introduction

Video object segmentation (VOS) is a fundamental problem in visual understanding whose aim is to segment objects of interest from the background in unconstrained videos. VOS enables machines to sense the motion pattern, location, and boundaries of the objects of interest in videos, which is useful in a wide range of applications. For example, in video editing, manual frame-wise segmentation is laborious and does not maintain temporal consistency, whereas VOS can segment all frames automatically using the mask of one frame as a guide. The problem of segmenting objects of interest in a video using ground truth object masks provided only for the first frame is referred to as semi-supervised video object segmentation. This is challenging because the appearance of objects in a video changes significantly due to fast motion, occlusion, scale variation, etc. Moreover, other similar-looking non-target objects may confuse the model into segmenting incorrect objects.

Figure 1: Comparison with other methods on DAVIS-2017. MAMP outperforms existing self-supervised methods and is on par with some supervised methods trained with large amounts of annotated data. Notation: Video Colorization [34], RPM-Net [11], CycleTime [38], CorrFlow [13], MuG [19], UVC [15], MAST [12], OSVOS [1], RANet [39], OSVOS-S [21], GC [16], OSMN [42], SiamMask [37], OnAVOS [33], FEELVOS [32], AFB-URR [17], PReMVOS [20], STM [24], KMN [28], CFBI [43]

Semi-supervised video object segmentation techniques fall into two categories: supervised and self-supervised. Supervised approaches [24, 43] use the rich annotation information in the training data to learn the model, achieving great success in video object segmentation. Despite their success, these methods are less attractive given their reliance on accurate pixel-level annotations for training (see Fig. 1), which are expensive to generate. Moreover, supervised approaches struggle to maintain the same performance in the wild. In contrast, self-supervised approaches [12, 19] learn feature representations based on the intrinsic properties of the video frames, and thus do not require any annotations and can better generalize to unseen objects. Even though the motivations behind existing self-supervised methods differ, they share the same objective of learning to extract general feature representations and construct precise spatio-temporal correspondences to propagate the object masks through video sequences. Exploiting the spatio-temporal coherence in videos, the pretext tasks of self-supervised methods can be designed to either minimize a reconstruction or prediction loss [12] or maximize temporal cycle-correspondence consistency [19]. Once trained, the models are able to extract general feature representations and build spatio-temporal correspondences between the reference and query frames. Therefore, the pixels in the query frames can be classified according to the mask labels of their corresponding regions of interest (ROIs) in the reference frames. Despite their simplicity, existing self-supervised methods perform poorly in the case of fast motion and in long-term matching scenarios.

To deal with the above challenges, we propose a self-supervised method coined Motion-Aware Mask Propagation (MAMP). Similar to previous self-supervised methods, MAMP learns to represent image features and build spatio-temporal correspondences without any annotations during training. During inference, MAMP first leverages the feature representations and the given object masks of the first frame to build a memory bank. The proposed motion-aware spatio-temporal matching module in MAMP then takes advantage of motion cues to mitigate the issues caused by fast motion and long-term correspondence mismatches, and propagates the masks from the memory bank to subsequent frames. Moreover, the proposed size-aware image feature alignment module fixes misalignment during mask propagation, and the memory bank is constantly updated with past frames to provide the most appropriate spatio-temporal guidance. We evaluate MAMP on the DAVIS-2017 and YouTube-VOS benchmarks to verify its effectiveness and generalization ability.

Our contributions are summarized as follows:

  • We propose Motion-Aware Mask Propagation (MAMP) for semi-supervised video object segmentation that trains the model end-to-end without any annotations and effectively propagates the masks across frames.

  • We propose a motion-aware spatio-temporal matching module to mitigate errors caused by fast motion and long-term correspondence mismatches. The proposed module improves the performance of MAMP on YouTube-VOS by 6.4%.

  • Without any bells and whistles (e.g., fancy data augmentations, online adaptation, or external datasets), MAMP significantly outperforms existing self-supervised methods by 4.9% on DAVIS-2017 and 4.85% on the unseen categories of YouTube-VOS.

  • Experiments on the YouTube-VOS dataset show that MAMP has the best generalization ability compared to existing self-supervised and supervised methods.

Figure 2: Framework of the proposed MAMP. (a) A random pair of neighboring frames from the same video is sampled to train the model. The frames are converted to the Lab color space, and channel dropout is applied to generate the reconstruction target for self-supervision. During training, a vanilla spatio-temporal matching module is used because the selected frames are adjacent. Only the encoder weights are learned during training. (b) The trained encoder is used to encode the selected past frames into Key and store them in the memory bank, along with their object masks as Value, for mask propagation. The query frame is encoded into Query to retrieve spatio-temporal correspondences from Key and Value. The affinity matrix is computed in a local manner between each Query location and its ROI in every feature map of Key.

2 Related Work

2.1 Semi-supervised Video Object Segmentation

Semi-supervised video object segmentation aims to leverage the ground truth object mask given (only) in the first frame to segment the objects of interest in subsequent frames. Existing semi-supervised video object segmentation methods can be divided into online-learning methods and offline-learning methods depending on whether online adaptation is needed during inference.

Online-learning methods [21, 22] usually update the networks dynamically during inference to make them object-specific in each video. OSVOS [1] and MaskTrack [26] simply fine-tune the networks on the first frame of each video to make the networks object-specific. Lucid Tracker [9] leverages data augmentation to generate more synthetic training data to fine-tune the model. OnAVOS [33] fine-tunes the model multiple times on different past frames to adapt to appearance changes across frames. PReMVOS [20] fine-tunes three different networks and combines their outputs with optical flow for more accurate segmentation. TAN-DTTM [7] fine-tunes the detection network on the first frame and segments the objects on the extracted object-level images. RANet [39] takes online learning as an optional step to boost the model's performance. Previous methods have shown the effectiveness of online learning; however, online learning is time-consuming and adversely affects the models' generalization ability.

Offline-learning methods [14, 23, 6, 32, 17] usually propagate the given object mask of the first frame, either explicitly or implicitly, to subsequent frames, making expensive online adaptation unnecessary. RVOS [31] leverages recurrent networks to implicitly propagate the predicted masks of past frames. PML [2] stores the embeddings of the first frame and propagates the mask according to a nearest-neighbor method. CFBI [43] performs spatio-temporal matching on both foreground and background features for mask propagation. STM, GC, and KMN [24, 16, 28] leverage a memory bank and attention mechanisms to propagate spatio-temporal features. MAST [12] stores the reference features and masks separately in the memory bank and propagates the reference masks to subsequent frames according to local spatio-temporal correspondences. Existing offline-learning methods usually perform local or non-local spatio-temporal matching for temporal association and mask propagation. However, non-local matching contains too much noise and has a large memory footprint, while local matching struggles to cope with fast motion and long-term correspondence mismatches.

The proposed MAMP belongs to the offline-learning methods, which means that time-consuming online adaptation is not required. MAMP leverages a dynamically updated memory bank to store features and masks from selected past frames, and propagates masks effectively using our proposed motion-aware spatio-temporal matching module. In contrast to previous local and non-local matching methods, the proposed motion-aware spatio-temporal matching module not only excludes noisy matching results but also mitigates the problems caused by fast motion and long-term correspondence mismatches.

2.2 Memory Networks

Memory networks aim to capture long-term dependencies by storing temporal features or different categories of features in a memory module. LSTM [5] and GRU [3] implicitly represent spatio-temporal features with local memory cells and update them in a recurrent process. However, the information within the memory cells is highly compressed and has limited representation ability. To overcome this issue, memory networks [40] were introduced to explicitly store the important features. A commonly used memory network in video object segmentation is STM [24], which incrementally adds the features of past frames to the memory bank and leverages non-local spatio-temporal matching to provide spatio-temporal features. However, incremental memory bank updates are impractical when segmenting long videos due to the growing memory cost. In this work, we divide the memory into long-term and short-term memory similar to [12]. The former is fixed, whereas the latter is updated dynamically using the past few frames, making MAMP more memory-efficient.

2.3 Self-supervised Learning on Videos

Self-supervised learning can learn general feature representations and spatio-temporal correspondences based on the intrinsic properties of videos. It has shown promising capacity on various downstream tasks because it does not require any annotations to train the model and can better generalize to other datasets [4, 10, 25, 36, 29]. Many pretext tasks have been explored for self-supervised learning on videos, such as query frame reconstruction [12], future frame prediction [18], patch re-localization [19], and motion statistics prediction [35]. In this work, we use a generative reconstruction task as in [12] to train the model.

3 Method

Figure 2 shows an overview of our MAMP method for video object segmentation. As shown in Fig. 2(a), MAMP is trained with the reconstruction task to learn feature representations and construct robust spatio-temporal correspondences between a pair of frames in the same video. Hence, zero annotation is required to train the model. During inference, MAMP segments the frames in a sequential manner. As shown in Fig. 2(b), selected past frames with their features (Key) and object masks (Value) are stored in the memory bank for future reference, and the current frame is encoded into Query. Subsequently, the motion-aware spatio-temporal matching module calculates the spatio-temporal affinity matrix between Key and Query, and mask propagation is achieved by multiplying the warped Value with the spatio-temporal affinity matrix. The size-aware image feature alignment module also participates in the mask propagation process to prevent misalignment.

3.1 Self-supervised Feature Representation Learning

We use the reconstruction task for feature representation learning and robust spatio-temporal matching. Since the channel correlation in the Lab color space is smaller than that in the RGB color space [27], we randomly drop one channel of the Lab frames and use it as the target to be reconstructed. Channel dropout preserves enough information in the input and avoids trivial solutions. The model is thus forced to learn general feature representations and the spatio-temporal correspondences between the reference frames and query frames, instead of learning to predict the missing channel from the observed channels. Therefore, in this paper, frames are converted into the Lab color space and channel dropout is used to generate the reconstruction targets.

To minimize the reconstruction loss, the semantically similar pixels between reference frames and query frames are forced to have highly correlated feature representations, while the semantically dissimilar pixels are forced to have weakly correlated feature representations. Finally, the reconstruction target of the query frames is predicted according to the highly correlated ROIs in the reference frames.

Specifically, given a reference frame $I_r$ and a query frame $I_q$ from the same video, a parameter-sharing convolutional encoder $\Phi$ is used to extract the feature representations of the two frames, $F_r = \Phi(I_r)$ and $F_q = \Phi(I_q)$, where the reduced feature dimensionality acts as an information bottleneck to prevent trivial solutions. The dropped channels of the two frames, $Y_r$ and $Y_q$, are downsampled to the resolution of the features.

To enable $F_q$ to represent and reconstruct $Y_q$, the local affinity matrix $A$ that represents the strength of the correlation between $F_q$ and $F_r$ is calculated:

$$A(i, j) = \frac{\exp\big(\langle F_q(i), F_r(j) \rangle / \sqrt{C}\,\big)}{\sum_{j' \in \Omega(i)} \exp\big(\langle F_q(i), F_r(j') \rangle / \sqrt{C}\,\big)},$$

where $i$ and $j$ are locations in $F_q$ and $F_r$, respectively, $\langle \cdot, \cdot \rangle$ is the dot product between two vectors, and $C$ is the number of channels used to re-scale the correlation value. $\Omega(i) = \{\, j : \lVert j - i \rVert_\infty \le r \,\}$ is the ROI of location $i$, and $r$ is the radius of the ROI. Next, a location in the reconstruction $\hat{Y}_q$ is represented by the weighted sum of the corresponding ROI in $Y_r$:

$$\hat{Y}_q(i) = \sum_{j \in \Omega(i)} A(i, j)\, Y_r(j).$$

Finally, $\hat{Y}_q$ is upsampled to the input resolution, and the Huber loss is used to force $\hat{Y}_q$ to be close to $Y_q$:

$$\mathcal{L} = \frac{1}{n} \sum_{i} z_i,
\qquad
z_i =
\begin{cases}
0.5\,\big(\hat{Y}_q(i) - Y_q(i)\big)^2, & \text{if } \big|\hat{Y}_q(i) - Y_q(i)\big| < 1,\\
\big|\hat{Y}_q(i) - Y_q(i)\big| - 0.5, & \text{otherwise,}
\end{cases}$$

where $n$ is the number of locations in $Y_q$.
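To make the local matching concrete, below is a minimal sketch of local attention-based reconstruction under the assumptions above (PyTorch-style; the names `local_reconstruct`, `fq`, `fr`, `yr`, and `radius` are hypothetical, not the authors' code). It unfolds a window around every reference location, computes a softmax over dot-product similarities scaled by the square root of the channel dimension, and reconstructs the query target as a weighted sum of the reference labels.

```python
import torch
import torch.nn.functional as F

def local_reconstruct(fq, fr, yr, radius=6):
    """Reconstruct query labels from a reference frame via local attention.

    fq, fr: query / reference features, shape (B, C, H, W)
    yr:     reference labels (e.g. dropped color channel or mask logits), (B, K, H, W)
    radius: ROI radius; each query location attends to a (2r+1)^2 window.
    """
    B, C, H, W = fq.shape
    k = 2 * radius + 1
    # Gather a (2r+1)^2 neighborhood around every reference location.
    fr_unf = F.unfold(fr, kernel_size=k, padding=radius)          # (B, C*k*k, H*W)
    fr_unf = fr_unf.view(B, C, k * k, H * W)
    yr_unf = F.unfold(yr, kernel_size=k, padding=radius)          # (B, K*k*k, H*W)
    yr_unf = yr_unf.view(B, yr.shape[1], k * k, H * W)

    # Dot-product affinity between each query pixel and its ROI, scaled by sqrt(C).
    q = fq.view(B, C, 1, H * W)
    affinity = (q * fr_unf).sum(dim=1) / (C ** 0.5)               # (B, k*k, H*W)
    affinity = affinity.softmax(dim=1)

    # Weighted sum of the reference labels inside the ROI.
    out = (affinity.unsqueeze(1) * yr_unf).sum(dim=2)             # (B, K, H*W)
    return out.view(B, yr.shape[1], H, W)
```

During training, the output of such a function can be compared against the reconstruction target with `torch.nn.functional.smooth_l1_loss`, which implements the Huber loss written above.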

3.2 Motion-aware Spatio-temporal Matching

Most of the object instances in videos move over time. Therefore, segmentation models should be able to retrieve the corresponding ROIs from the reference frames for mask propagation under motion. To meet this constraint, non-local spatio-temporal matching methods [24] consider all locations in the reference frames as potential ROIs. However, non-local matching occupies too much memory and generates many noisy matches. Local spatio-temporal matching methods [12] retrieve the ROIs in the reference frames based on the location coordinates in the query frame and a pre-defined retrieval radius. Although local matching is more efficient, existing local matching methods have limited receptive fields and cannot localize the most correlated ROIs when they encounter fast motion or after long-term matching.

Intuitively, motion cues can help the query locations retrieve the most correlated ROIs for fast-moving objects and in long-term matching scenarios. Therefore, we propose a motion-aware spatio-temporal matching module that leverages optical flow to enable the query locations to retrieve the most correlated ROIs from the reference frames, as shown in Fig. 3(a). We use RAFT [30] to compute optical flow with an efficient setting that costs only about 18 ms and 13 ms per pair of frames on the DAVIS-2017 and YouTube-VOS datasets, respectively. As shown in Fig. 3(b), the vanilla local spatio-temporal matching method cannot retrieve the most correlated ROIs for the query pixel. With our motion-aware spatio-temporal matching, however, the query pixel can find its most correlated ROIs even if the pixels in the ROIs are not spatially contiguous in the original frame.

Formally, given a reference frame $I_r$ and a query frame $I_q$ from the same video, the optical flow between $I_q$ and $I_r$ is first computed. For a location $(x, y)$ in the query frame, the center of the corresponding ROI in the reference frame is $(x + \Delta x, y + \Delta y)$, where $\Delta x$ and $\Delta y$ are the flow displacements along the horizontal and vertical directions, respectively. Subsequently, the reference frame is warped according to the optical flow so that locations with the same coordinates in the reference and query frames become the most similar pairs. During inference, with all the ROIs being dynamically sampled from different reference frames in the memory bank based on the optical flow, mask propagation becomes more accurate and the problems caused by fast motion and long-term correspondence mismatches are alleviated.
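Below is a hedged sketch of how this motion-aware step could be realized on top of the local matching sketch above, assuming a backward optical flow from the query to each reference frame has already been computed (e.g., with RAFT); `flow_warp` and `motion_aware_propagate` are hypothetical names, not the authors' code.

```python
import torch
import torch.nn.functional as F

def flow_warp(x, flow):
    """Warp reference maps x (B, C, H, W) with a flow field (B, 2, H, W) in pixels,
    so that warped_x[..., i, j] samples x at the flow-displaced location."""
    B, _, H, W = x.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=x.device),
                            torch.arange(W, device=x.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0)      # (1, 2, H, W)
    coords = grid + flow                                          # displaced coordinates
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)         # (B, H, W, 2)
    return F.grid_sample(x, norm_grid, align_corners=True)

def motion_aware_propagate(fq, fr, yr, flow, radius=6):
    # Register the reference frame to the query frame, then apply local matching.
    fr_w = flow_warp(fr, flow)
    yr_w = flow_warp(yr, flow)
    return local_reconstruct(fq, fr_w, yr_w, radius=radius)
```

Warping the feature maps and labels once per memory frame, as in Fig. 3(a), avoids warping each ROI separately.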

Figure 3: (a) Implementation of our motion-aware spatio-temporal matching module. We use an efficient implementation that first warps the masks and feature maps instead of warping each ROI separately. (b) Comparison of the matching results between the vanilla local spatio-temporal matching module and the motion-aware spatio-temporal matching module.

3.3 Size-aware Image Feature Alignment

To reduce memory consumption, previous methods perform bilinear downsampling on the supervision signals (i.e., the masks) of the reference frames and propagate the supervision signals at the feature resolution. However, this operation introduces misalignment between the strided convolution layers and the supervision signals obtained by naïve bilinear downsampling. The work in [12] proposed an image feature alignment module to deal with this problem and achieved great success. As shown in Fig. 4(a), [12] handles the misalignment at the downsampling stage by directly sampling the supervision signals at the convolution kernel centers. However, this does not cater for the misalignment caused at the upsampling stage. To deal with this problem, we propose a size-aware image feature alignment module, which leverages simple padding and unpadding to fix the misalignment at the upsampling stage. As shown in Fig. 4(b), if the input size is not an integer multiple of the size after downsampling, the input is padded automatically to satisfy this constraint, allowing the vanilla image feature alignment to be applied effectively on the padded inputs. Hence, the misalignment at both the downsampling and the upsampling stages is fixed.
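A minimal sketch of the padding/unpadding idea, assuming a total downsampling factor of 4 as in the modified ResNet-18 (the function names are hypothetical and the padding mode is an assumption, not taken from the paper):

```python
import torch.nn.functional as F

def pad_divisible(x, stride=4):
    """Zero-pad (B, C, H, W) on the right/bottom so H and W are multiples of stride."""
    H, W = x.shape[-2:]
    pad_h = (stride - H % stride) % stride
    pad_w = (stride - W % stride) % stride
    return F.pad(x, (0, pad_w, 0, pad_h)), (H, W)

def unpad(x, orig_size):
    """Crop a padded (and upsampled) prediction back to the original spatial size."""
    H, W = orig_size
    return x[..., :H, :W]
```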

Figure 4: Implementation of size-aware image feature alignment module with comparison to the image feature alignment in [12]. n is the ratio of the input size to the size after downsampling. The proposed size-aware image feature alignment module fixes the misalignment in both downsampling and upsampling stages.
Layer Name Output Size Configuration
Conv1 7 7, 64, stride 2
Conv2 2
Conv3 2
Conv4 2
Conv5 2
Table 1: Architecture of the modified ResNet-18.
Method Backbone Num. Param. Train. Dataset Video Length Supervised 𝒥&ℱ (Mean) 𝒥 (Mean) ℱ (Mean)
Vid. Color. [34] ResNet-18 5M K 800 hrs 34.0 34.6 32.7
CycleTime [38] ResNet-50 9M V 344 hrs 48.7 46.4 50.0
CorrFlow [13] ResNet-18 5M O 14 hrs 50.3 48.4 52.2
UVC [15] ResNet-18 3M K 800 hrs 59.5 57.7 61.3
RPM-Net [11] ResNet-101 43M DY 5.67 hrs 41.6 41.0 42.2
MuG [19] ResNet-50 9M O 14 hrs 56.1 54.0 58.2
MAST [12] ResNet-18 5M Y 5.58 hrs 65.5 63.3 67.6
Ours ResNet-18 5M Y 5.58 hrs 70.4 68.7 72.0
OSVOS [1] VGG-16 15M ID 60.3 56.6 63.9
OSMN [42] VGG-16 15M ICD 54.8 52.5 57.1
OSVOS-S [21] VGG-16 15M IPD 68.0 64.7 71.3
PReMVOS [20] ResNet-101 43M ICPMD 77.8 73.9 81.8
SiamMask [37] ResNet-50 9M ICY 56.4 54.3 58.5
FEELVOS [32] Xception-65 38M ICDY 71.5 69.1 74.0
STM [24] ResNet-50 9M ICPSEDY 81.8 79.2 84.3
GC [16] ResNet-50 9M ISEHD 71.4 69.3 73.5
AFB-URR [17] ResNet-50 9M ICPSED 74.6 73.0 76.1
KMN [28] ResNet-50 9M ICPSEDY 82.8 80.0 85.6
CFBI [43] ResNet-101 43M ICDY 81.9 79.1 84.6
Table 2: Evaluation on the DAVIS-2017 validation set. Note that each method modifies the vanilla backbone model to suit its framework. Training dataset notations: C=COCO, D=DAVIS, E=ECSSD, H=HKU-IS, I=ImageNet, K=Kinetics, M=Mapillary, O=OxUvA, P=PASCAL-VOC, S=MSRA10K, V=VLOG, Y=YouTube-VOS.

4 Implementation Details

Training: We modify ResNet-18 and use it as the encoder to extract image features at 1/4 of the spatial resolution of the input images. Table 1 shows the detailed architecture. The parameters of the encoder are randomly initialized without pre-training. A pair of frames that are close in time and from the same video is randomly sampled as the reference frame and the query frame, and the reconstruction task with the Huber loss is used to train the model. During training, all frames are converted into the Lab color space and channel dropout is applied to generate the reconstruction target. Hence, our method does not require any annotation masks. For pre-processing, the frames are resized to 256×256, and no data augmentation is used.

We train our model with pairwise frames for 35 epochs on YouTube-VOS using a batch size of 24 for all experiments. We adopt the Adam optimizer with a base learning rate of 1e-3, and the learning rate is halved after 0.4M, 0.6M, 0.8M, and 1.0M iterations, respectively. Our model is trained end-to-end without any multi-stage training strategies, such as fine-tuning with multiple reference frames or fine-tuning in a sequential manner. Training takes about 11 hours on one NVIDIA GeForce RTX 3090 GPU.
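For reference, below is a minimal sketch of this optimization recipe using standard PyTorch components; `model` and `train_loader` are placeholders, and the loop structure is an illustration rather than the authors' training code.

```python
import torch

# `model` and `train_loader` are placeholders for the encoder and the pairwise-frame loader.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate after 0.4M, 0.6M, 0.8M, and 1.0M iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[400_000, 600_000, 800_000, 1_000_000], gamma=0.5)

for ref_frame, query_frame in train_loader:      # pairs of nearby frames from one video
    loss = model(ref_frame, query_frame)         # Huber reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                             # the schedule is stated per iteration
```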

Testing: The proposed MAMP does not require time-consuming online adaptation to fine-tune the model during testing. To be consistent with the benchmarks, the 2018 version of the YouTube-VOS validation set and the DAVIS-2017 validation set are used to evaluate MAMP. DAVIS-2017 is evaluated at the raw resolution, and YouTube-VOS 2018 is evaluated at half resolution for efficiency.

During testing, MAMP leverages the size-aware image feature alignment module to fix misalignment issues and uses the trained encoder to extract Key and Query. After that, MAMP uses the proposed motion-aware spatio-temporal matching module to propagate the masks from Value to subsequent frames. The memory bank of MAMP is updated dynamically to include two early frames as long-term memory and three recent past frames as short-term memory. To further filter out noise and redundancy, only the top 36 correlated locations in the ROIs are used for mask propagation (see Table 6).
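Below is a minimal sketch of this inference-time memory logic and the top-K filtering; the exact long-term frame indices are not recoverable from this rendering, so `long_term_ids` is a placeholder, and applying the softmax after the top-K selection is an implementation choice rather than a detail taken from the paper.

```python
import torch

def update_memory(memory, frame_idx, key, value,
                  long_term_ids=(0, 5), short_term_len=3):
    """Keep two early frames as long-term memory and the most recent
    `short_term_len` frames as short-term memory (indices are placeholders)."""
    memory[frame_idx] = (key, value)
    keep = set(long_term_ids) | set(sorted(memory)[-short_term_len:])
    for idx in list(memory):
        if idx not in keep:
            del memory[idx]
    return memory

def topk_propagate(affinity, values, k=36):
    """affinity: (B, R, HW) raw scores over R candidate ROI locations,
    values: (B, K, R, HW) mask labels of those locations."""
    scores, idx = affinity.topk(k, dim=1)                          # keep top-K matches
    weights = scores.softmax(dim=1).unsqueeze(1)                   # (B, 1, k, HW)
    gathered = values.gather(2, idx.unsqueeze(1).expand(-1, values.shape[1], -1, -1))
    return (weights * gathered).sum(dim=2)                         # (B, K, HW)
```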

5 Experiment and Results

We benchmark MAMP on two widely used datasets: DAVIS-2017 and YouTube-VOS. DAVIS-2017 is a commonly used dataset with short videos and complex scenes; it contains 150 videos with over 200 objects. YouTube-VOS is the largest VOS dataset and has longer videos; it contains more than 4,000 high-resolution videos with more than 7,000 objects. Unlike previous methods that leverage several external datasets to train the model, we only train our model on YouTube-VOS and test it on both DAVIS-2017 and YouTube-VOS.

5.1 Evaluation Metrics

We use Region Similarity 𝒥 and Contour Accuracy ℱ to evaluate the performance of MAMP. Additionally, we report the Generalization Gap as in [12] to evaluate the generalization ability of MAMP on YouTube-VOS. The Generalization Gap computes the model's performance gap between inference on seen categories and inference on unseen categories, and its value is inversely proportional to the generalization ability of the model.
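As a worked example of the metric (assuming the gap is the average of the seen 𝒥 and ℱ scores minus the average of the unseen 𝒥 and ℱ scores), the MAST [12] row of Table 3 gives:

$$\text{Gen. Gap} = \frac{63.9 + 64.9}{2} - \frac{60.3 + 67.7}{2} = 64.4 - 64.0 = 0.4,$$

which matches the reported value; negative values, as for MAMP, indicate better performance on unseen categories than on seen ones.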

Method Sup. Overall 𝒥 (Seen) ℱ (Seen) 𝒥 (Unseen) ℱ (Unseen) Gen. Gap
Vid. Color. [34] 38.9 43.1 38.6 36.6 37.4 3.9
CorrFlow [13] 46.6 50.6 46.6 43.8 45.6 3.9
MAST [12] 64.2 63.9 64.9 60.3 67.7 0.4
Ours 68.2 67.0 68.4 64.5 73.2 -1.2
OSMN [42] 51.2 60.0 60.1 40.6 44.0 17.8
RGMP [23] 53.8 59.5 - 45.2 - 14.3
OnAVOS [33] 55.2 60.1 62.7 46.6 51.4 12.4
S2S [41] 64.6 71.0 70.0 55.5 61.2 12.2
A-GAME [8] 66.1 67.8 - 60.8 - 7.0
STM [24] 79.4 79.7 84.2 72.8 80.9 5.1
KMN [28] 81.4 81.4 85.6 75.3 83.3 4.2
GC [16] 73.2 72.6 75.6 68.9 75.7 1.8
AFB-URR [17] 79.6 78.8 83.1 74.1 82.6 2.6
CFBI [43] 81.4 81.1 85.8 75.3 83.4 4.1
Table 3: Evaluation on the YouTube-VOS validation set for "seen" and "unseen" categories, where "unseen" refers to object categories that do not appear in the training set. For Overall, 𝒥, and ℱ, higher values are better; for Gen. Gap (Generalization Gap), lower values are better.

5.2 Quantitative Results

We compare MAMP with existing methods on DAVIS-2017 and YouTube-VOS. The validation sets of DAVIS-2017 and YouTube-VOS contain 30 and 474 videos, respectively. It is noteworthy that research on video object segmentation is developing rapidly, and there exist various training strategies, architectures, and post-processing methods. We do our best to compare the results as fairly as possible. For example, multi-stage training strategies, external datasets, data augmentations, and online adaptation are not used in this work. Moreover, we only use an efficient modified ResNet-18 as the encoder.

Table 2 and Table 3 summarize the performance of the state-of-the-art methods and MAMP on DAVIS-2017 and YouTube-VOS. MAMP significantly outperforms existing self-supervised methods, by 4.9% on DAVIS-2017, 4.0% on YouTube-VOS overall, and 4.85% on the unseen categories of YouTube-VOS. Moreover, MAMP is also comparable to some previous supervised methods. These results demonstrate the effectiveness of MAMP.

To evaluate the generalization ability of MAMP, we evaluate it on both “seen” categories and “unseen” categories of YouTube-VOS. Objects in “unseen” categories do not appear in the training set. As shown in Table 3, MAMP performs well on “unseen” categories and has the best generalization ability compared with other methods. Surprisingly, it performs better on “unseen” categories than on “seen” categories because of the better boundary segmentation performance on “unseen” objects. These results indicate that MAMP can learn general feature representations that are not restricted by the specific object categories in the training set. The most comparable supervised method in generalization ability is GC [16] (1.8 vs -1.2). However, GC is trained with several external datasets with precise ground truth annotations.

Memory 𝒥&ℱ (DAVIS-2017) 𝒥 (DAVIS-2017) ℱ (DAVIS-2017) Overall (YouTube-VOS)
Long-term & Short-term 70.4 68.7 72.0 68.2
Long-term 52.4 49.8 55.1 58.7
Short-term 65.2 63.4 66.9 66.4
Table 4: Ablation experiment for long-term and short-term memory. Using both memory types achieves the best performance.
M A 𝒥&ℱ (DAVIS-2017) 𝒥 (DAVIS-2017) ℱ (DAVIS-2017) Overall (YouTube-VOS)
✓ ✓ 70.4 68.7 72.0 68.2
✗ ✓ 68.6 (-1.8) 66.9 (-1.8) 70.2 (-1.8) 61.8 (-6.4)
✓ ✗ 69.4 (-1.0) 67.1 (-1.6) 71.7 (-0.3) 68.2 (-0.0)
✗ ✗ 66.7 (-3.7) 64.3 (-4.4) 69.1 (-2.9) 61.8 (-6.4)
Table 5: Ablation experiment for motion-aware spatio-temporal matching module (M) and size-aware image feature alignment module (A).

5.3 Qualitative Results

Figure 5: Qualitative results of MAMP on DAVIS-2017 and YouTube-VOS. The frames are sampled at fixed intervals and the first frame in each video is assigned index 0. MAMP performs well in challenging scenarios of occlusion/dis-occlusion, fast motion, large deformations, and scale variations.

Figure 5 shows qualitative results of MAMP under various challenging scenarios, e.g., occlusion, fast motion, large deformations, and scale variations. Notice how MAMP is able to handle these challenging scenarios effectively.

5.4 Ablation Studies

Long-term Memory and Short-term Memory: We compared the results using different memory settings. As shown in Table 4, all memory settings have reasonable performance. Long-term memory provides accurate ground truth information for query frames, while short-term memory offers up-to-date information from past neighboring frames. The results show that MAMP using short-term memory performs better than using long-term memory. This is because the appearance and scale of the objects usually change significantly over time and using only long-term memory makes it difficult to adapt to these changes. Furthermore, it can be seen that MAMP using both memory types has the best performance as both memories are complementary.

Motion-aware Spatio-temporal Matching: As shown in Table 5, when we replaced the proposed motion-aware spatio-temporal matching module with the vanilla local spatio-temporal matching module in [12], the performance dropped by 1.8% on DAVIS-2017 and 6.4% on YouTube-VOS. The vanilla local spatio-temporal matching module retrieves the corresponding ROIs according to a pre-defined radius. However, objects usually move over time, so the most correlated ROIs are prone to fall outside the search radius and cannot be retrieved. Fast motion and long-term correspondence mismatches cause this type of issue. Without these most correlated ROIs, the label of a location in the query frame is determined by the labels of several uncorrelated or weakly correlated locations in the reference frames. Therefore, the segmentation results can be further improved if the most correlated ROIs are retrieved for each location in the query frame.

The motion-aware spatio-temporal matching module leverages motion cues to register the reference frames in the memory bank to the query frame before computing the local spatio-temporal correspondences. Therefore, the above issues are alleviated even if the reference frames are far from the query frame in time. As shown in Table 5, the motion-aware spatio-temporal matching module brings larger performance gains on YouTube-VOS because YouTube-VOS has longer videos than DAVIS-2017 and thus suffers more from fast motion and long-term correspondence mismatches.

Size-aware Image Feature Alignment: As shown in Table 5, we replaced the proposed size-aware image feature alignment module with the image feature alignment module of [12]. The performance dropped by 1.0% on DAVIS-2017 and remained unchanged on YouTube-VOS. The size-aware image feature alignment module is built on the image feature alignment module; the difference is that it also fixes the misalignment at the upsampling stage caused by improper input sizes. To explain the results in Table 5, we computed the percentage of videos with improper input sizes: 96.7% of the videos in the DAVIS-2017 validation set have improper input sizes, whereas only 1.9% of the videos in the YouTube-VOS validation set do. This is why observable performance gains are obtained only on DAVIS-2017.

TopK 𝒥&ℱ (DAVIS-2017) 𝒥 (DAVIS-2017) ℱ (DAVIS-2017) Overall (YouTube-VOS)
ALL 69.6 68.1 71.1 67.9
Top 1 67.3(-2.3) 65.0(-3.1) 69.5(-1.6) 66.3(-1.6)
Top 9 70.1(+0.5) 68.1(+0.0) 72.1(+1.0) 68.4(+0.5)
Top 36 70.4(+0.8) 68.7(+0.6) 72.0(+0.9) 68.2(+0.3)
Table 6: Ablation of TopK correlated locations in the ROIs for mask propagation.

TopK Correlated Locations for Mask Propagation: If we retrieve ROIs with a radius of 12 from 5 reference frames, the corresponding ROIs for one location in the query frame include 3125 locations. However, noisy matches among these 3125 locations may adversely affect the model's performance. Hence, we filter out redundant and noisy matches by selecting only the TopK correlated locations in the ROIs for mask propagation. As shown in Table 6, leveraging the top 36 or top 9 correlated locations in the ROIs improves the performance of MAMP compared to using all 3125 locations, with the top 36 correlated locations giving the best performance. Moreover, compared to other methods, MAMP still maintains the best performance even if only one of the 3125 locations in the ROIs is used for mask propagation. These results further demonstrate the effectiveness of the proposed motion-aware spatio-temporal matching module.

6 Conclusion

In this paper, we proposed MAMP, which learns general feature representations and performs motion-guided mask propagation. MAMP trains the model without any annotations and outperforms existing self-supervised methods by a large margin. Moreover, MAMP demonstrates the best generalization ability compared to previous methods. We believe that MAMP has the potential to propagate spatio-temporal features and masks in practical video segmentation tasks. In the future, we will develop more effective pretext tasks and adaptive memory selection methods to further improve the performance of MAMP.

7 Acknowledgements

This research was supported by the ARC Industrial Transformation Research Hub IH180100002.

References

  • [1] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool (2017) One-shot video object segmentation. In CVPR, Cited by: Figure 1, §2.1, Table 2.
  • [2] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool (2018) Blazingly fast video object segmentation with pixel-wise metric learning. In CVPR, Cited by: §2.1.
  • [3] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP, Cited by: §2.2.
  • [4] T. Han, W. Xie, and A. Zisserman (2019) Video representation learning by dense predictive coding. In ICCV Workshops, Cited by: §2.3.
  • [5] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation. Cited by: §2.2.
  • [6] Y. Hu, J. Huang, and A. G. Schwing (2018) Videomatch: matching based video object segmentation. In ECCV, Cited by: §2.1.
  • [7] X. Huang, J. Xu, Y. Tai, and C. Tang (2020) Fast video object segmentation with temporal aggregation network and dynamic template matching. In CVPR, Cited by: §2.1.
  • [8] J. Johnander, M. Danelljan, E. Brissman, F. S. Khan, and M. Felsberg (2019) A generative appearance model for end-to-end video object segmentation. In CVPR, Cited by: Table 3.
  • [9] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele (2019) Lucid data dreaming for video object segmentation. International Journal of Computer Vision. Cited by: §2.1.
  • [10] D. Kim, D. Cho, and I. S. Kweon (2019) Self-supervised video representation learning with space-time cubic puzzles. In AAAI, Cited by: §2.3.
  • [11] Y. Kim, S. Choi, H. Lee, T. Kim, and C. Kim (2020) Rpm-net: robust pixel-level matching networks for self-supervised video object segmentation. In WACV, Cited by: Figure 1, Table 2.
  • [12] Z. Lai, E. Lu, and W. Xie (2020) MAST: a memory-augmented self-supervised tracker. In CVPR, Cited by: Figure 1, §1, §2.1, §2.2, §2.3, Figure 4, §3.2, §3.3, Table 2, §5.1, §5.4, §5.4, Table 3.
  • [13] Z. Lai and W. Xie (2019) Self-supervised learning for video correspondence flow. In BMVC, Cited by: Figure 1, Table 2, Table 3.
  • [14] X. Li and C. C. Loy (2018) Video object segmentation with joint re-identification and attention-aware mask propagation. In ECCV, Cited by: §2.1.
  • [15] X. Li, S. Liu, S. De Mello, X. Wang, J. Kautz, and M. Yang (2019) Joint-task self-supervised learning for temporal correspondence. In NeurIPS, Cited by: Figure 1, Table 2.
  • [16] Y. Li, Z. Shen, and Y. Shan (2020) Fast video object segmentation using the global context module. In ECCV, Cited by: Figure 1, §2.1, Table 2, §5.2, Table 3.
  • [17] Y. Liang, X. Li, N. Jafari, and J. Chen (2020) Video object segmentation with adaptive feature bank and uncertain-region refinement. In NeurIPS, Cited by: Figure 1, §2.1, Table 2, Table 3.
  • [18] W. Liu, W. Luo, D. Lian, and S. Gao (2018) Future frame prediction for anomaly detection – a new baseline. In CVPR, Cited by: §2.3.
  • [19] X. Lu, W. Wang, J. Shen, Y. Tai, D. J. Crandall, and S. C. Hoi (2020) Learning video object segmentation from unlabeled videos. In CVPR, Cited by: Figure 1, §1, §2.3, Table 2.
  • [20] J. Luiten, P. Voigtlaender, and B. Leibe (2018) Premvos: proposal-generation, refinement and merging for video object segmentation. In ACCV, Cited by: Figure 1, §2.1, Table 2.
  • [21] K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool (2018) Video object segmentation without temporal information. IEEE transactions on pattern analysis and machine intelligence. Cited by: Figure 1, §2.1, Table 2.
  • [22] T. Meinhardt and L. Leal-Taixe (2020) Make one-shot video object segmentation efficient again. In NeurIPS, Cited by: §2.1.
  • [23] S. W. Oh, J. Lee, K. Sunkavalli, and S. J. Kim (2018) Fast video object segmentation by reference-guided mask propagation. In CVPR, Cited by: §2.1, Table 3.
  • [24] S. W. Oh, J. Lee, N. Xu, and S. J. Kim (2019) Video object segmentation using space-time memory networks. In ICCV, Cited by: Figure 1, §1, §2.1, §2.2, §3.2, Table 2, Table 3.
  • [25] T. Pan, Y. Song, T. Yang, W. Jiang, and W. Liu (2021) Videomoco: contrastive video representation learning with temporally adversarial examples. In CVPR, Cited by: §2.3.
  • [26] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung (2017) Learning video object segmentation from static images. In CVPR, Cited by: §2.1.
  • [27] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley (2001) Color transfer between images. IEEE Computer graphics and applications. Cited by: §3.1.
  • [28] H. Seong, J. Hyun, and E. Kim (2020) Kernelized memory network for video object segmentation. In ECCV, Cited by: Figure 1, §2.1, Table 2, Table 3.
  • [29] L. Tao, X. Wang, and T. Yamasaki (2020) Self-supervised video representation learning using inter-intra contrastive framework. In ACM MM, Cited by: §2.3.
  • [30] Z. Teed and J. Deng (2020) Raft: recurrent all-pairs field transforms for optical flow. In ECCV, Cited by: §3.2.
  • [31] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques, and X. Giro-i-Nieto (2019) RVOS: end-to-end recurrent network for video object segmentation. In CVPR, Cited by: §2.1.
  • [32] P. Voigtlaender, Y. Chai, F. Schroff, H. Adam, B. Leibe, and L. Chen (2019) Feelvos: fast end-to-end embedding learning for video object segmentation. In CVPR, Cited by: Figure 1, §2.1, Table 2.
  • [33] P. Voigtlaender and B. Leibe (2017) Online adaptation of convolutional neural networks for video object segmentation. In BMVC, Cited by: Figure 1, §2.1, Table 3.
  • [34] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy (2018) Tracking emerges by colorizing videos. In ECCV, Cited by: Figure 1, Table 2, Table 3.
  • [35] J. Wang, J. Jiao, L. Bao, S. He, Y. Liu, and W. Liu (2019) Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In CVPR, Cited by: §2.3.
  • [36] J. Wang, J. Jiao, and Y. Liu (2020) Self-supervised video representation learning by pace prediction. In ECCV, Cited by: §2.3.
  • [37] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. Torr (2019) Fast online object tracking and segmentation: a unifying approach. In CVPR, Cited by: Figure 1, Table 2.
  • [38] X. Wang, A. Jabri, and A. A. Efros (2019) Learning correspondence from the cycle-consistency of time. In CVPR, Cited by: Figure 1, Table 2.
  • [39] Z. Wang, J. Xu, L. Liu, F. Zhu, and L. Shao (2019) Ranet: ranking attention network for fast video object segmentation. In ICCV, Cited by: Figure 1, §2.1.
  • [40] J. Weston, S. Chopra, and A. Bordes (2014) Memory networks. arXiv preprint arXiv:1410.3916. Cited by: §2.2.
  • [41] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang (2018) Youtube-vos: a large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327. Cited by: Table 3.
  • [42] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos (2018) Efficient video object segmentation via network modulation. In CVPR, Cited by: Figure 1, Table 2, Table 3.
  • [43] Z. Yang, Y. Wei, and Y. Yang (2020) Collaborative video object segmentation by foreground-background integration. In ECCV, Cited by: Figure 1, §1, §2.1, Table 2, Table 3.