Dual Temporal Memory Network for Efficient Video Object Segmentation

03/13/2020 ∙ by Kaihua Zhang, et al.

Video Object Segmentation (VOS) is typically formulated in a semi-supervised setting. Given the ground-truth segmentation mask on the first frame, the task of VOS is to track and segment the single or multiple objects of interest in the remaining frames of the video at the pixel level. One of the fundamental challenges in VOS is how to make the best use of the temporal information to boost the performance. We present an end-to-end network which stores short- and long-term video sequence information preceding the current frame as temporal memories to address the temporal modeling in VOS. Our network consists of two temporal sub-networks: a short-term memory sub-network and a long-term memory sub-network. The short-term memory sub-network models the fine-grained spatial-temporal interactions between local regions across neighboring frames via a graph-based learning framework, which can well preserve the visual consistency of local regions over time. The long-term memory sub-network models the long-range evolution of the object via a Simplified-Gated Recurrent Unit (S-GRU), making the segmentation robust against occlusions and drift errors. In our experiments, we show that our proposed method achieves favorable and competitive performance in terms of both speed and accuracy on three frequently-used VOS datasets: DAVIS 2016, DAVIS 2017 and YouTube-VOS.




1 Introduction

Figure 1: Previous methods [34, 21, 5, 38, 47, 37] capture temporal dependencies in a short-term or long-term video sequence for VOS (a, b). Our proposed method leverages both short- and long-term temporal information (c). RNN is short for Recurrent Neural Network.

Video Object Segmentation (VOS) aims to separate the foreground objects from the background in all frames of a video sequence. The common approach casts the problem into a semi-supervised learning task, i.e., the segmentation ground truth of the target object in the first frame is provided and the goal is to infer the segmentation masks of the object in all other frames [2, 40, 34, 49, 50, 45, 39, 38, 19]. Fast and accurate VOS methods are beneficial to many applications such as video editing [25, 42], object tracking [41, 51] and activity recognition [9].

Modelling the inter-frame temporal correlation is one of the essential challenges in VOS. Some existing methods [34, 21, 5] model the short-term consistency of the object appearance across neighboring frames in the video (Figure 1(a)). The predicted mask of the previous frame is propagated to the current frame either by feature map aggregation or by optical-flow-guided pixel matching. The major issue with these methods is that they ignore the fine-grained interactions between local regions over time. In other words, the spatial regions in the predicted mask of the previous frames are integrated into the corresponding spatial regions in the current frame individually, without exploring their spatial-temporal correlations. As a result, local prediction errors may easily be propagated and amplified during temporal modeling, especially at the border of object regions. Ideally, we need a mechanism to model the fine-grained spatial-temporal interactions of local frame regions, so that the consistency of the object can be preserved.

Other works [38, 47, 37] apply Convolutional Recurrent Units to capture the evolution of the frame's convolutional feature map over a long-term time range, and map the output of the recurrent units into a segmentation map of the current frame (Figure 1(b)). These methods can get a long view of the video sequence preceding the current frame so that the long-range dynamics of the object can be captured, making the network robust against occlusions and drift errors. Nevertheless, the major issue with these methods is that the feature maps fed into the recurrent units describe the holistic frame, which not only unnecessarily involves the background region in the learning process, but also dramatically increases the computational complexity. In fact, only the object mask regions are needed to model the evolution of the object over time.

Motivated by the above issues in VOS, we propose an end-to-end Dual Temporal Memory Network (DTMNet) that stores both short-term and long-term video sequence information as memories to assist the segmentation of the current frame (Figure 1(c)). In our network, the short-term memory sub-network is designed as a spatial-temporal feature correlation module to capture the fine-grained inter-frame object appearance consistency. Given a current frame, we collect a small window of the preceding frames as its short-term memory. The frame and its memory frames are each encoded into a feature map in which each spatial location denotes one local region in the frame, and the same feature location across different frames naturally encodes the evolution of a region over time. Then a spatial-temporal graph is built over all local regions, in which each region is a node and edges are established between regions within a local spatial-temporal window. A Graph Convolution [18] operation is performed to update each region feature on the node according to its relations to the others. By doing this, we model the spatial-temporal consistency of local regions across frames, leading to improved segmentation performance.

The long-term sub-network models the evolution of the object across a long time range. Given a current frame, we collect all preceding frames from the beginning of the video as its long-term memory. Instead of using the convolutional features of the memory frames to model the dynamics of the object over time, we propose to pool an object-oriented feature vector from the object mask on each frame, and apply a Simplified-Gated Recurrent Unit (S-GRU) to learn a hidden-state vector that characterizes the evolution of the object over the long time range covered by the memory. This relieves the distractions of the background regions and significantly reduces the computational complexity.

The outputs from the short-term and the long-term sub-networks are sent to the segmentation sub-network as supportive information to perform object segmentation. Extensive evaluations on three benchmark VOS datasets demonstrate that our DTMNet yields state-of-the-art performance in terms of both speed and accuracy. Our main contributions include:

(1) DTMNet for VOS, through which both the short-term spatial-temporal local region consistency and the long-term object evolution can be exploited.

(2) A graph-based learning framework to model the short-term spatial-temporal interactions of the local regions from neighboring frames in the video.

(3) An object-oriented-feature-based S-GRU module to model the object evolution over a long time range.

Figure 2: Pipeline of the proposed DTMNet for VOS. The network includes three key components: (a) a short-term memory sub-network to capture the spatial-temporal consistency of local regions over time; (b) a long-term memory sub-network to model the evolution of the object over a long time range to ensure robustness against occlusions and drift errors; (c) a segmentation sub-network that seamlessly fuses the short- and long-term memory information and the ground-truth information from the first frame to accurately predict the segmentation mask.

2 Related Work

Video Object Segmentation. There is a line of research on unsupervised VOS which leverages visual saliency [13, 17], point trajectories [3] and motion [32] to segment objects from the background. Many semi-supervised VOS methods heavily rely on online fine-tuning on the first-frame mask to predict the masks of other frames during testing. OSVOS [2] and its extensions [40] ignore the temporal dimension and fine-tune a pre-trained fully convolutional network on the first frame to remember the object appearance. MHP-VOS [49] proposes a novel method called Multiple Hypotheses Propagation to defer the decision until a global view can be established. Other methods take temporal information into consideration. MSK [34] and LucidTracker [21] use the predicted mask of the last frame as an additional input of the current frame. PReMVOS [28] combines four different sub-networks to achieve impressive performance. Although online fine-tuning boosts test accuracy, it severely sacrifices running efficiency in practical applications.

A growing line of research attempts to avoid the time-consuming online fine-tuning at the expense of a little accuracy reduction. VideoMatch [14] explores pixel-level embedding matching. OSMN [50] uses two novel modulators to capture visual and spatial information of the target object and injects them into the segmentation branch. FAVOS [4] utilizes tracking to obtain object bounding boxes and performs segmentation within the boxes. AGAM-VOS [19] learns a probabilistic generative model to find a representation of the target and background appearance. Our method shares the same spirit of not performing online fine-tuning as these methods, but the key difference is that we design a dedicated dual temporal memory mechanism to make the VOS accuracy even higher than some state-of-the-art online fine-tuning methods (see Table 1).

Temporal Modeling in VOS. Temporal sequence modeling plays an important role in VOS. Some methods try to model the long-term object dynamics in the video sequence using Recurrent Neural Networks (RNNs). RNNs are designed for sequence modeling by propagating and accumulating a hidden state over time [8, 20]. Gated Recurrent Unit (GRU) [6] and Long Short-Term Memory (LSTM) [12] are the two classic RNN components, both of which have been extended to CNNs. Existing methods introduce ConvLSTM and ConvGRU to model the long-term temporal dynamics in the video sequence. [37] designs a visual memory module based on ConvGRU to capture the evolution of the object mask over time. RVOS [38] presents a spatial-temporal recurrence module and applies ConvLSTM as the decoder. [47] proposes a sequence-to-sequence network built on ConvLSTM. Without any RNN unit, STCNN [46] designs a novel temporal coherence branch inspired by the video prediction task, and is also able to model the long-term dynamics in video. STMN [31] stores the first, intermediate and previous frames in the memory and uses them as references to infer the object mask of the current frame. The advantage of modeling long-term temporal dynamics is to allow the network to get a long view of the video sequence preceding the current frame, making the network robust against occlusions and drift errors.

Other VOS methods model the short-term visual consistency across neighboring frames in the video. MSK [34] uses the predicted mask of the last frame as an additional input of the current frame to help with the mask prediction. RGMP [45] utilizes a Siamese encoder-decoder network to extract the first, previous and current frame features to propagate the previously predicted mask to the current frame. Optical flow is also commonly used to match the pixel correspondence between successive frames, through which the object masks of sequential frames can be estimated [21, 5, 28]. It turns out that modeling short-term visual consistency is an important prior to enhance the performance of VOS, and is commonly applied in VOS works [34, 45, 21, 5, 28]. In contrast to these existing methods, we integrate the long- and short-term temporal modeling into a unified framework, and each sub-network in our design is dedicated to resolving the issues of the existing methods.

3 Proposed Method

3.1 Framework Overview

Given a video sequence and the binary ground-truth mask of the first frame, our task is to predict the segmentation masks of all subsequent frames. To this end, we develop the DTMNet for VOS, as illustrated in Figure 2. Our DTMNet is composed of three components: (a) a short-term memory sub-network; (b) a long-term memory sub-network; and (c) a segmentation sub-network. Among them, the segmentation module takes full advantage of the complementary characteristics of the rich supportive information provided by the short- and long-term memory modules, i.e., good adaptation to appearance changes and robustness against occlusions and drift errors, thereby enabling it to predict an accurate segmentation mask.

Specifically, when segmenting a video frame, we take it as the query frame and the preceding frames with their masks as the short-term memories. The frames are fed into the backbone encoder to extract convolutional feature maps. Afterwards, as shown in Figure 2(a), the features are fed into a spatial-temporal graph convolutional filtering (GCF) module, generating refined features for the query frame. The GCF leverages Laplacian smoothing, which can be viewed as a low-pass filtering process [24]. The smoothing makes the features in the same cluster similar, facilitating the subsequent classification task in the segmentation sub-network.

Meanwhile, as shown in Figure 2(b), we leverage an S-GRU, which simplifies the GRU proposed in [6] by keeping only the update gate, to model the long-term memory. The output of the S-GRU is a hidden-state vector that memorizes all object appearances observed before the current frame. The S-GRU updates its state incrementally across the video frames, which effectively captures the long-range dynamics of the object, making the model robust against occlusions and drifting with low memory overhead.

Finally, as shown in Figure 2(c), the learned hidden-state vector is correlated with the query image features to generate a target-specific attention map, which highlights the target-specific region while suppressing other distractors. Then, we concatenate the attention map and the GCF features to further refine the target-specific features of the query image. Moreover, we also concatenate the features of the first frame and its ground-truth mask to further strengthen the target-specific feature representation. The concatenated features are then fed into the decoder module to produce the final segmentation mask, with skip-connections to fuse multi-scale features of different layers like U-Net [36].

3.2 Short-term Memory Sub-network

As aforementioned, the short-term memory sub-network models the inter-frame temporal correlation. Previous works achieve this by propagating the mask from the previous frame to the current frame, either by directly concatenating the previously predicted mask and the current frame [34, 45, 19, 39] or by relying on optical-flow-guided pixel matching between two sequential frames [15, 5, 37, 27]. However, the former easily introduces noisy background into the object regions, especially at the object boundaries, leading to sub-optimal accuracy. Although the latter seems reasonable, it has two limitations: first, estimating optical flow is computationally expensive; second, optical flow computes a point-to-point mapping between pixels, which is too restrictive [26]. High-level feature maps encode both the strength of the responses and their spatial locations [10], where each feature corresponds to a single site in the predicted mask. Hence, mask propagation can be implemented via feature propagation. Moreover, because each feature in the high-level feature maps represents a local region inside the receptive field of the CNN filter instead of a single image pixel, a linear combination of these features serves well to model the spatial-temporal interactions between local regions across video frames, thereby preserving their spatial-temporal consistency. Motivated by this analysis, we propose to propagate features by spatial-temporal GCF, that is, using graph convolutions to linearly combine spatial-temporal neighbors.

Notations. As shown in Figure 2(a), given the short-term memory set, we define a spatial-temporal graph as in [44], where the node set collects all local regions, the edge set models the pairwise relations between any two nodes, the adjacency matrix stores the weight of each edge, and the feature matrix stacks the feature representations of the nodes.

Sparse Adjacency Matrix A. For each node, we construct its edge set to capture the spatial-temporal interactions of the pairwise local regions across video frames, connecting the node to a local window centered at its position in the current frame and to a local window centered at the same position in the next frame. Restricting edges to these local windows leads to a sparse A with less computational cost. To learn a task-specific similarity between nodes i and j for adaptive graph learning, we define the weight of edge (i, j) as

A_{ij} = \sigma\big( (\Theta \mathbf{x}_i)^{\top} (\Phi \mathbf{x}_j) \big), \quad (1)

where \sigma(\cdot) is the sigmoid function, \mathbf{x}_i and \mathbf{x}_j denote the node features, and \Theta, \Phi are learnable weight matrices.
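As a sketch of how such a sparse, learnable adjacency matrix might be built (in NumPy; the function names and the exact similarity form σ((Θxᵢ)ᵀ(Φxⱼ)) are our assumptions, since the paper's equation symbols did not survive extraction), the snippet below connects every node to a small window in its own frame and in the next frame:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_weight(x_i, x_j, theta, phi):
    """Learnable similarity between two node features
    (assumed form: a_ij = sigmoid((theta @ x_i) . (phi @ x_j)))."""
    return sigmoid(np.dot(theta @ x_i, phi @ x_j))

def build_sparse_adjacency(feats, win=3):
    """feats: (T, H, W, C) node features for T short-term memory frames.
    Edges connect each node to a win x win window at the same frame and at
    the next frame, so most entries of A stay zero (sparse adjacency)."""
    T, H, W, C = feats.shape
    rng = np.random.default_rng(0)
    theta = rng.standard_normal((C, C)) * 0.1  # learnable in practice
    phi = rng.standard_normal((C, C)) * 0.1
    n = T * H * W
    A = np.zeros((n, n))
    idx = lambda t, y, x: (t * H + y) * W + x
    r = win // 2
    for t in range(T):
        for y in range(H):
            for x in range(W):
                for dt in (0, 1):  # same frame and next frame only
                    if t + dt >= T:
                        continue
                    for dy in range(-r, r + 1):
                        for dx in range(-r, r + 1):
                            yy, xx = y + dy, x + dx
                            if 0 <= yy < H and 0 <= xx < W:
                                A[idx(t, y, x), idx(t + dt, yy, xx)] = edge_weight(
                                    feats[t, y, x], feats[t + dt, yy, xx], theta, phi)
    return A
```

In a real implementation the loops would be vectorized and Θ, Φ trained end-to-end; the point here is only the sparsity pattern and the sigmoid similarity.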

Figure 3: Laplacian smoothing effect of the GCF features. Top: three query frames selected from the sequence pigs in the DAVIS 2017 val set [35]; bottom: the corresponding GCF feature responses, showing that the features of the same object across frames well preserve the spatial-temporal consistency.

Spatial-temporal GCF. To perform GCF, we apply the graph convolutional networks (GCNs) proposed in [23]. We design a one-layer graph convolution in our DTMNet as

\mathbf{Z} = \hat{\mathbf{A}} \mathbf{X} \mathbf{W}, \quad (2)

where \mathbf{W} denotes the weight vector of the FC layer and

\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2}, \quad (3)

where \tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}, \mathbf{I} denotes the identity matrix, and \tilde{\mathbf{D}} is a diagonal matrix with \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}.

After GCF using (3), the i-th node representation can be formulated as

\mathbf{z}_i = \sum_{j \in \mathcal{N}(i)} \hat{A}_{ij} \mathbf{x}_j \mathbf{W}. \quad (4)

It is obvious that \mathbf{z}_i in (4) is a linear combination of the nodes in its spatial-temporal neighborhood \mathcal{N}(i), thereby expressing feature propagation more accurately than existing optical-flow-based methods [15, 5, 37, 27] that are limited by restrictive point-to-point mappings. Moreover, (4) is a Laplacian smoothing process [24] that calculates the new feature as the weighted average of a node's neighboring features in \mathcal{N}(i) and the node itself. The smoothing makes the features in the same cluster similar, which favorably preserves the spatial-temporal consistency of the segmented object across frames as illustrated by Figure 3, rendering a great benefit to the downstream pixel-wise classification task in the segmentation sub-network (§ 3.4).
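A minimal NumPy sketch of the one-layer graph convolutional filtering described above, following the standard propagation rule of the GCN the text cites (the function name is ours):

```python
import numpy as np

def gcf_layer(A, X, W):
    """One-layer graph convolutional filtering:
    Z = D~^{-1/2} (A + I) D~^{-1/2} X W.
    Each output row is a weighted average of a node's spatial-temporal
    neighbours and itself, i.e. Laplacian smoothing of the node features."""
    n = A.shape[0]
    A_tilde = A + np.eye(n)                      # add self-loops
    d = A_tilde.sum(axis=1)                      # degree of each node
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # symmetric normalization
    return A_hat @ X @ W
```

For a two-node graph with a single edge and identity features, each output row becomes the average of the two nodes, which is the smoothing effect the paper relies on.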

Semi-supervised Classification. When training our model, we assume that in the short-term memory set the ground-truth masks are given, corresponding to the labeled nodes, while the query image mask is to be propagated from these labeled nodes. We leverage a one-layer GCN which applies a softmax classifier on the output features \mathbf{z} in (2):

\mathbf{Y} = \mathrm{softmax}(\mathbf{Z}). \quad (5)

The loss function is defined as the cross-entropy error over all the labeled nodes:

\mathcal{L}_{gcn} = - \sum_{i \in \mathcal{V}_l} \sum_{c} Y^{*}_{ic} \ln Y_{ic}, \quad (6)

where Y^{*}_{ic} denotes the ground-truth label of node i.
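The semi-supervised node classification loss can be sketched as below (a NumPy illustration; the classifier weight `W_cls` and the boolean labeled-node mask are our assumptions about how the labeled memory nodes are selected):

```python
import numpy as np

def softmax(z):
    # numerically stable row-wise softmax
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_node_loss(Z, W_cls, labels, labeled_mask):
    """Cross-entropy over labeled nodes only: the memory-frame nodes carry
    ground-truth foreground/background labels, query nodes are unlabeled.
    Z: (n, d) GCF output features, W_cls: (d, num_classes),
    labels: (n,) int class ids, labeled_mask: (n,) bool."""
    probs = softmax(Z @ W_cls)                        # (n, num_classes)
    n = Z.shape[0]
    nll = -np.log(probs[np.arange(n), labels] + 1e-12)
    return nll[labeled_mask].mean()                   # average over labeled nodes
```

The mask simply excludes the query-frame nodes from the loss, which is the standard semi-supervised GCN training setup.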

3.3 Long-term Memory Sub-network

Using short-term memory for VOS can deal with target appearance changes well. However, it suffers from the drifting problem under challenging scenarios such as severe occlusion or fast motion between sequential frames. To address this issue, we further develop the S-GRU module to capture long-term memory information as a complement, as illustrated by Figure 2(b).

For frame t, given its features \mathbf{F}_t and segmentation mask \mathbf{M}_t, we first mask out the object features as \mathbf{F}_t \odot \mathbf{M}_t, where \odot denotes the pixel-wise product. Then, we feed the object features into a global average pooling (GAP) layer, yielding

\mathbf{x}_t = \mathrm{GAP}(\mathbf{F}_t \odot \mathbf{M}_t), \quad (7)

which captures global context information that is robust against object appearance variations. Next, the S-GRU leverages \mathbf{x}_t in (7) and the previous state \mathbf{h}_{t-1} to compute the new state \mathbf{h}_t. The state vector \mathbf{h} plays a key role in the S-GRU since it captures the long-term dynamics of the object across frames. The learning process is formulated as

\mathbf{z}_t = \sigma(\mathbf{W}_z [\mathbf{h}_{t-1}; \mathbf{x}_t]), \qquad \mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \mathbf{x}_t, \quad (8)

where \odot denotes element-wise multiplication, \sigma is the sigmoid function, and \mathbf{W}_z is a learnable weight matrix. Different from the ConvGRU [37] that consists of update and reset gates, our S-GRU in (8) only has the update gate \mathbf{z}_t, which reduces computational complexity significantly. In (8), the new state \mathbf{h}_t is a weighted sum of the current object representation \mathbf{x}_t and the previous state \mathbf{h}_{t-1} that memorizes the dynamic object appearances across all previous frames. If the update gate is close to one, the memories encoded in \mathbf{h}_{t-1} will be forgotten.
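A compact NumPy sketch of the long-term memory step under the description above (function names are ours, and the single-update-gate equation is our reconstruction of the stripped formula from the text's description):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def object_feature(feat_map, mask):
    """Global-average-pool the features inside the predicted object mask,
    so the background does not enter the long-term memory.
    feat_map: (H, W, C), mask: (H, W) in {0, 1}."""
    masked = feat_map * mask[..., None]
    return masked.sum(axis=(0, 1)) / max(mask.sum(), 1)

def s_gru_step(h_prev, x_t, W_z):
    """Simplified GRU with only an update gate (assumed form):
    z = sigmoid(W_z [h_prev; x_t]),  h = (1 - z) * h_prev + z * x_t.
    When z -> 1 the old memories in h_prev are forgotten."""
    z = sigmoid(W_z @ np.concatenate([h_prev, x_t]))
    return (1.0 - z) * h_prev + z * x_t
```

Because the recurrent state is a single vector rather than a feature map, each step is a small matrix-vector product, which is the efficiency argument the paper makes.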

Method OL J&F J Mean J Recall J Decay F Mean F Recall F Decay Time (s)
MSK [34] 77.6 79.7 93.1 8.9 75.4 87.1 9.0 12
LIP [29] 78.5 78.0 88.6 5.0 79.0 86.8 6.0 -
OSVOS [2] 80.2 79.8 93.6 14.9 80.6 92.6 15.0 9
Lucid [21] 83.6 84.8 - - 82.3 - - >30
STCNN [46] 83.8 83.8 96.1 4.9 83.8 91.5 6.4 3.9
CINM [1] 84.2 83.4 94.9 12.3 85.0 92.1 14.7 >30
OnAVOS [40] 85.5 86.1 96.1 5.2 84.9 89.7 5.8 13
OSVOS-S [30] 86.6 85.6 96.8 5.5 87.5 95.9 8.2 4.5
PReMVOS [28] 86.8 84.9 96.1 8.8 88.6 94.7 9.8 >30
MHP-VOS [49] 86.9 85.7 96.6 - 88.1 94.8 - 14
VPN [16] 67.9 70.2 82.3 12.4 65.5 69.0 14.4 0.63
OSMN [50] 73.5 74.0 87.6 9.0 72.9 84.0 10.6 0.14
VideoMatch [14] - 81.0 - - - - - 0.32
FAVOS [4] 81.0 82.4 96.5 4.5 79.5 89.4 5.5 1.8
FEELVOS [39] 81.7 81.1 90.5 13.7 82.2 86.6 14.1 0.45
RGMP [45] 81.8 81.5 91.7 10.9 82.0 90.8 10.1 0.13
AGAM-VOS [19] 81.8 81.4 93.6 9.4 82.1 90.2 9.8 0.07
DTMNet 85.4 85.9 96.0 4.7 84.9 92.0 5.7 0.12
Table 1: Comparison of our DTMNet with the state of the art on DAVIS 2016 val. Red and blue bold fonts indicate the best and the second-best performance, respectively.
Figure 4: Illustration of the target-specific attention maps. The same query frames are shown in Figure 3, including three pigs (ID numbers: 1⃝, 2⃝, 3⃝) to be segmented.

3.4 Segmentation Sub-network

Figure 2(c) illustrates the architecture of our segmentation sub-network. Similar to U-Net [36], our segmentation network uses skip-connections to fuse multi-scale features from the encoder into the decoder. The encoder uses the ResNet101 backbone [11] with dilated convolutions to set the stride of the deepest layer to 16. For the query frame, the deepest layer outputs the feature maps, which we correlate with the hidden-state vector learned by the long-term memory sub-network, yielding the target-specific attention map. As illustrated by Figure 4, the learned attention map can effectively highlight the target-specific regions while suppressing other distractors such as other objects and backgrounds. Next, after obtaining the GCF features from the short-term memory sub-network, we feed the concatenated features into the decoder. Meanwhile, we leverage skip-connections to concatenate the feature maps from the decoder with their counterparts from the encoder, which are gradually upscaled by a factor of two at a time and then concatenated with the following layer features. Finally, the aggregated features are fed into a convolutional layer followed by a softmax layer to predict the object mask. As in the short-term memory sub-network, the loss function here is also defined as the cross-entropy loss for the pixel-wise classification task:

\mathcal{L}_{seg} = - \sum_{p} \mathbf{M}^{*}(p) \ln \mathbf{M}(p), \quad (9)

where \mathbf{M}^{*} denotes the ground-truth mask of the frame.

Finally, the loss function for training the whole network is defined as

\mathcal{L} = \mathcal{L}_{seg} + \lambda \mathcal{L}_{gcn}, \quad (10)

where \mathcal{L}_{gcn} is defined in (6) and \lambda is a pre-defined trade-off parameter.
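The target-specific attention step of the segmentation sub-network, i.e., correlating the long-term hidden state with the query feature map, might look as follows (a NumPy sketch; the spatial-softmax normalization is our assumption):

```python
import numpy as np

def target_attention(feat_map, h):
    """Correlate the long-term hidden state h (C,) with the query feature
    map (H, W, C) to obtain a target-specific attention map (H, W):
    locations whose features align with the remembered object score high."""
    scores = np.tensordot(feat_map, h, axes=([2], [0]))  # (H, W) dot products
    e = np.exp(scores - scores.max())                    # stable exponentiation
    return e / e.sum()                                   # spatial softmax
```

The attention map is then concatenated with the GCF features and the first-frame features/mask before decoding, per the pipeline in Figure 2(c).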

4 Experimental Results

4.1 Implementation Details

Following AGAM-VOS [19], the training process of our DTMNet is divided into two stages:

Stage 1. Firstly, we train our DTMNet using the Adam optimizer [22] to minimize the loss in (10) on the DAVIS 2017 [35] and YouTube-VOS [48] datasets, with all training images resized to a fixed resolution. Each batch contains several videos, with frames randomly selected for training from each video. The hyperparameters of our DTMNet, namely the learning rate, learning rate decay and weight decay, are set empirically.

Stage 2. Next, we fine-tune the model trained at Stage 1 on the same datasets for additional epochs, with the images resized to twice the input size of Stage 1. Each batch again contains several videos with randomly selected frames from each video. The learning rate, learning rate decay and weight decay are likewise set empirically.

The DTMNet is implemented in PyTorch and an NVIDIA RTX 2080 Ti GPU is used for acceleration. All of the training procedures can be completed within one day.

4.2 Datasets and Evaluation Metrics

Datasets. We train and evaluate the DTMNet on three VOS benchmark datasets: DAVIS 2016 [33], DAVIS 2017 [35] and YouTube-VOS [48]. DAVIS 2016 is a densely-annotated VOS dataset containing high-quality training and validation video sequences with highly accurate pixel-wise annotations. DAVIS 2017 enlarges DAVIS 2016 by introducing additional videos with multiple objects; it contains a training set, a validation set, a test-dev set and a test-challenge set. YouTube-VOS is the first large-scale VOS dataset, many times larger than the largest existing dataset at the time of its release. It consists of a training set and a validation set, where the validation set contains both categories seen in the training set and unseen categories that do not appear in it.

Evaluation Metrics. We use the standard metrics provided by the DAVIS challenge [35], including the region similarity \mathcal{J}, the contour accuracy \mathcal{F} and the mean of the two metrics \mathcal{J}\&\mathcal{F}. Given the estimated segmentation mask \hat{\mathbf{M}} and the ground-truth mask \mathbf{M}, the region similarity is calculated as the intersection-over-union \mathcal{J} = \frac{|\hat{\mathbf{M}} \cap \mathbf{M}|}{|\hat{\mathbf{M}} \cup \mathbf{M}|}. The contour accuracy is measured by the F-measure between the contour-based precision P_c and recall R_c as \mathcal{F} = \frac{2 P_c R_c}{P_c + R_c}.
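Both metrics are straightforward to compute for binary masks, for example:

```python
import numpy as np

def region_similarity(pred, gt):
    """Jaccard index J = |pred ∩ gt| / |pred ∪ gt| for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # both empty -> perfect match

def f_measure(precision, recall):
    """Contour accuracy F = 2 P R / (P + R) from contour precision/recall."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```

In the official DAVIS toolkit the contour precision and recall are computed via a bipartite matching of mask boundaries; here they are taken as given scalars.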

4.3 Comparison with the State-of-the-arts

We compare our DTMNet with state-of-the-art online-learning (OL) VOS methods and offline ones on the DAVIS 2016, DAVIS 2017 and YouTube-VOS benchmark datasets. It is worth noting that our DTMNet does not resort to any post-processing or OL technique.

Results on DAVIS 2016. Table 1 lists the evaluation results on DAVIS 2016 of our DTMNet and the state-of-the-art OL and offline VOS methods in comparison. Among the offline methods, our DTMNet achieves the best performance in terms of J&F (85.4), J Mean (85.9), F Mean (84.9) and F Recall (92.0), and the second-best performance with a J Recall of 96.0, a J Decay of 4.7 and an F Decay of 5.7. Furthermore, the DTMNet is the first runner-up in speed at 0.12 s/frame, closely following AGAM-VOS, which runs at 0.07 s/frame. Even compared with the OL methods, the DTMNet still has a competitive J&F of 85.4, only 1.5 lower than the best-performing MHP-VOS at 86.9. Besides, the DTMNet runs much faster than MHP-VOS, which needs 14 s/frame.

Method OL J&F J Mean F Mean Time (s)
MSK [34] 54.3 51.2 57.3 15
OSVOS [2] 60.3 56.6 63.9 11
LIP [29] 61.1 59.0 63.2 -
STCNN [46] 61.7 58.7 64.6 6
OnAVOS [40] 65.4 61.6 69.1 26
OSVOS-S [30] 68.0 64.7 71.3 8
CINM [1] 70.6 67.2 74.0 50
MHP-VOS [49] 75.3 71.8 78.8 20
OSMN [50] 54.8 52.5 57.1 0.28
SiamMask [41] 56.4 54.3 58.5 0.02
FAVOS [4] 58.2 54.6 61.8 1.2
VideoMatch [14] 62.4 56.6 68.2 0.35
RANet [43] 65.7 63.2 68.2 -
RGMP [45] 66.7 64.8 68.6 0.28
AGSS-VOS [27] 67.4 64.9 69.9 -
AGAM-VOS [19] 70.0 67.2 72.7 -
DMM-Net [52] 70.7 68.1 73.3 0.13
FEELVOS [39] 71.6 69.1 74.0 0.51
DTMNet 71.5 69.1 73.9 0.17
Table 2: Comparison of our DTMNet with the state of the art on DAVIS 2017 val.

Results on DAVIS 2017. DAVIS 2017 considers multi-object scenarios, making it more challenging than DAVIS 2016, which only targets single-object segmentation. Table 2 lists the comparison results of our DTMNet with state-of-the-art OL and offline methods. We can observe that our DTMNet achieves the best J Mean (69.1, tied with FEELVOS), and the second-best J&F of 71.5 and F Mean of 73.9, closely following the best-performing FEELVOS in terms of J&F (71.6) and F Mean (74.0) with only a gap of 0.1; however, our DTMNet runs at 0.17 s/frame on DAVIS 2017 val, which is much faster than FEELVOS at 0.51 s/frame. Furthermore, the DTMNet even outperforms the second best-performing OL method CINM in terms of J&F and J Mean by 0.9 and 1.9, respectively, demonstrating the effectiveness of the dual temporal memory learning strategy in our DTMNet.

Method OL Overall J seen F seen J unseen F unseen
MSK [34] 53.1 59.9 59.5 45.0 47.9
OnAVOS [40] 55.2 60.1 62.7 46.6 51.4
OSVOS [2] 58.8 59.8 60.5 54.2 60.7
S2S [47] 64.4 71.0 70.0 55.5 61.2
OSMN [50] 51.2 60.0 60.1 40.6 44.0
DMM-Net [52] 51.7 58.3 60.7 41.6 46.3
SiamMask [41] 52.8 60.2 58.2 45.1 47.7
RGMP [45] 53.8 59.5 - 45.2 -
RVOS [38] 56.8 63.6 67.2 45.5 51.0
CapsuleVOS [7] 62.3 67.3 68.1 53.7 59.9
DTMNet 65.6 66.1 68.9 60.5 66.8
Table 3: Comparison of our DTMNet with the state of the art on the YouTube-VOS dataset.

Results on YouTube-VOS. The YouTube-VOS benchmark computes J and F separately on seen and unseen categories, reported as the J seen, F seen, J unseen and F unseen columns in Table 3. The seen categories are included in both the training and the validation sets, while the unseen categories only exist in the validation set. As listed in Table 3, our DTMNet achieves the best overall mean of 65.6, outperforming the second best-performing CapsuleVOS (62.3) by a large margin. Besides, our DTMNet even outperforms the best-performing OL method S2S (64.4) by 1.2 in terms of the overall mean. In particular, our DTMNet achieves excellent performance on the unseen categories with a J of 60.5 and an F of 66.8, significantly outperforming the second-best method CapsuleVOS by 6.8 and 6.9 and even outperforming the best OL method S2S by 5.0 and 5.6, respectively. These results demonstrate the favorable generalization capability of our DTMNet to unseen categories. We argue that this is because the short-term memory sub-network learning is guided by the semi-supervised loss (6).

Figure 5: Some qualitative results of our DTMNet on DAVIS 2016 val (the first column), DAVIS 2017 val (top of the second column) and YouTube-VOS (bottom of the second column) respectively. The sequences are soapbox, parkour, motocross-jump and 3e03f623bb. Best viewed in color.

4.4 Ablation Study

We compare three variants of our DTMNet: without the long-term memory sub-network (DTMNet-L), without the short-term memory sub-network (DTMNet-S), and without the graph learning model (DTMNet-W, which removes the learnable weights in (1)). We evaluate them on DAVIS 2016 val and list the results in Table 4. The DTMNet-S achieves 81.5, which is lower than the DTMNet by 4.4, verifying the effectiveness of the short-term temporal information in boosting the accuracy of VOS. Moreover, the DTMNet-L only obtains 71.0, which is significantly lower than the DTMNet by 14.9. This shows the key role of the long-term temporal information in making the model robust against occlusions and drifting, which significantly affects the performance of our model. Finally, we observe that the score drops from 85.9 to 85.2 when removing the weights of the adjacency matrix in (1), which verifies that the graph learning structure also helps to boost the performance of our model to some extent.

4.5 Qualitative Results

Figure 5 shows some qualitative visual results on the DAVIS 2016, DAVIS 2017 and YouTube-VOS datasets. We select some challenging videos from these three datasets. We can observe that our DTMNet still achieves favorable segmentation results when the targets suffer from various challenges such as fast motion (first column, top), large scale variations (first column, bottom, and second column, top) and interacting objects (second column, bottom).

Method DTMNet DTMNet-S DTMNet-L DTMNet-W
J Mean 85.9 81.5 71.0 85.2
Table 4: Ablative experiments of our DTMNet on DAVIS 2016 val. DTMNet-A, A ∈ {S, L, W}, denotes the DTMNet without the short-term memory, long-term memory and graph learning modules, respectively.

5 Conclusions

In this paper, we have proposed an end-to-end DTMNet for VOS which mainly includes a short-term and a long-term memory sub-network. The former models the fine-grained spatial-temporal interactions between local regions across neighboring frames via a graph-based learning framework, which can well preserve the visual consistency of local regions over time. The latter models the long-range dynamics of the object via an S-GRU, making the segmentation robust against occlusions and drift errors. Extensive evaluations on three benchmark datasets, DAVIS 2016, DAVIS 2017 and YouTube-VOS, demonstrate favorable performance of our method over state-of-the-art methods in terms of both speed and accuracy.


  • [1] L. Bao, B. Wu, and W. Liu (2018) CNN in mrf: video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In CVPR, pp. 5977–5986. Cited by: Table 1, Table 2.
  • [2] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool (2017) One-shot video object segmentation. In CVPR, pp. 221–230. Cited by: §1, §2, Table 1, Table 2, Table 3.
  • [3] L. Chen, J. Shen, W. Wang, and B. Ni (2015) Video object segmentation via dense trajectories. TMM 17 (12), pp. 2225–2234. Cited by: §2.
  • [4] J. Cheng, Y. Tsai, W. Hung, S. Wang, and M. Yang (2018) Fast and accurate online video object segmentation via tracking parts. In CVPR, pp. 7415–7424. Cited by: §2, Table 1, Table 2.
  • [5] J. Cheng, Y. Tsai, S. Wang, and M. Yang (2017) Segflow: joint learning for video object segmentation and optical flow. In ICCV, pp. 686–695. Cited by: Figure 1, §1, §2, §3.2, §3.2.
  • [6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP. Cited by: §2, §3.1.
  • [7] K. Duarte, Y. S. Rawat, and M. Shah (2019) CapsuleVOS: semi-supervised video object segmentation using capsule routing. In ICCV, pp. 8480–8489. Cited by: Table 3.
  • [8] J. L. Elman (1990) Finding structure in time. COGS 14 (2), pp. 179–211. Cited by: §2.
  • [9] J. Guo, Z. Li, L. Cheong, and S. Zhiying Zhou (2013) Video co-segmentation for meaningful action extraction. In ICCV, pp. 2232–2239. Cited by: §1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI 37 (9), pp. 1904–1916. Cited by: §3.2.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.4.
  • [12] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. NEURAL COMPUT 9 (8), pp. 1735–1780. Cited by: §2.
  • [13] Y. Hu, J. Huang, and A. G. Schwing (2018) Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation. In ECCV, pp. 786–802. Cited by: §2.
  • [14] Y. Hu, J. Huang, and A. G. Schwing (2018) Videomatch: matching based video object segmentation. In ECCV, pp. 54–70. Cited by: §2, Table 1, Table 2.
  • [15] Y. Hu, J. Huang, and A. Schwing (2017) Maskrnn: instance level video object segmentation. In NeurIPS, pp. 325–334. Cited by: §3.2, §3.2.
  • [16] V. Jampani, R. Gadde, and P. V. Gehler (2017) Video propagation networks. In CVPR, pp. 451–461. Cited by: Table 1.
  • [17] W. Jang, C. Lee, and C. Kim (2016) Primary object segmentation in videos via alternate convex optimization of foreground and background distributions. In CVPR, pp. 696–704. Cited by: §2.
  • [18] B. Jiang, Z. Zhang, D. Lin, J. Tang, and B. Luo (2019) Semi-supervised learning with graph learning-convolutional networks. In CVPR, pp. 11313–11320. Cited by: §1.
  • [19] J. Johnander, M. Danelljan, E. Brissman, F. S. Khan, and M. Felsberg (2019) A generative appearance model for end-to-end video object segmentation. In CVPR, pp. 8953–8962. Cited by: §1, §2, §3.2, Table 1, §4.1, Table 2.
  • [20] K. Kawakami (2008) Supervised sequence labelling with recurrent neural networks. Ph. D. thesis. Cited by: §2.
  • [21] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele (2017) Lucid data dreaming for object tracking. In The DAVIS Challenge on Video Object Segmentation, Cited by: Figure 1, §1, §2, §2, Table 1.
  • [22] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [23] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §3.2.
  • [24] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, pp. 3538–3545. Cited by: §3.1, §3.2.
  • [25] Y. Li, J. Sun, and H. Shum (2005) Video object cut and paste. In ToG, Vol. 24, pp. 595–600. Cited by: §1.
  • [26] Y. Li, J. Shi, and D. Lin (2018) Low-latency video semantic segmentation. In CVPR, pp. 5997–6005. Cited by: §3.2.
  • [27] H. Lin, X. Qi, and J. Jia (2019) AGSS-vos: attention guided single-shot video object segmentation. In ICCV, pp. 3949–3957. Cited by: §3.2, §3.2, Table 2.
  • [28] J. Luiten, P. Voigtlaender, and B. Leibe (2018) PReMVOS: proposal-generation, refinement and merging for video object segmentation. In ACCV, pp. 565–580. Cited by: §2, §2, Table 1.
  • [29] Y. Lyu, G. Vosselman, G. Xia, and M. Y. Yang (2019) LIP: learning instance propagation for video object segmentation. arXiv preprint arXiv:1910.00032. Cited by: Table 1, Table 2.
  • [30] K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool (2018) Video object segmentation without temporal information. TPAMI 41 (6), pp. 1515–1530. Cited by: Table 1, Table 2.
  • [31] S. W. Oh, J. Lee, N. Xu, and S. J. Kim (2019) Video object segmentation using space-time memory networks. arXiv preprint arXiv:1904.00607. Cited by: §2.
  • [32] A. Papazoglou and V. Ferrari (2013) Fast object segmentation in unconstrained video. In ICCV, pp. 1777–1784. Cited by: §2.
  • [33] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, pp. 724–732. Cited by: §4.2.
  • [34] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung (2017) Learning video object segmentation from static images. In CVPR, pp. 2663–2672. Cited by: Figure 1, §1, §1, §2, §2, §3.2, Table 1, Table 2, Table 3.
  • [35] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool (2017) The 2017 davis challenge on video object segmentation. arXiv:1704.00675. Cited by: Figure 3, §4.1, §4.2, §4.2.
  • [36] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §3.1, §3.4.
  • [37] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning video object segmentation with visual memory. In ICCV, pp. 4481–4490. Cited by: Figure 1, §1, §2, §3.2, §3.2, §3.3.
  • [38] C. Ventura, M. Bellver, A. Girbau, A. Salvador, F. Marques, and X. Giro-i-Nieto (2019) Rvos: end-to-end recurrent network for video object segmentation. In CVPR, pp. 5277–5286. Cited by: Figure 1, §1, §1, §2, Table 3.
  • [39] P. Voigtlaender, Y. Chai, F. Schroff, H. Adam, B. Leibe, and L. Chen (2019) Feelvos: fast end-to-end embedding learning for video object segmentation. In CVPR, pp. 9481–9490. Cited by: §1, §3.2, Table 1, Table 2.
  • [40] P. Voigtlaender and B. Leibe (2017) Online adaptation of convolutional neural networks for video object segmentation. arXiv preprint arXiv:1706.09364. Cited by: §1, §2, Table 1, Table 2, Table 3.
  • [41] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. Torr (2019) Fast online object tracking and segmentation: a unifying approach. In CVPR, pp. 1328–1338. Cited by: §1, Table 2, Table 3.
  • [42] W. Wang, J. Shen, and F. Porikli (2017) Selective video object cutout. TIP 26 (12), pp. 5645–5655. Cited by: §1.
  • [43] Z. Wang, J. Xu, L. Liu, F. Zhu, and L. Shao (2019) Ranet: ranking attention network for fast video object segmentation. In ICCV, pp. 3978–3987. Cited by: Table 2.
  • [44] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596. Cited by: §3.2.
  • [45] S. Wug Oh, J. Lee, K. Sunkavalli, and S. Joo Kim (2018) Fast video object segmentation by reference-guided mask propagation. In CVPR, pp. 7376–7385. Cited by: §1, §2, §3.2, Table 1, Table 2, Table 3.
  • [46] K. Xu, L. Wen, G. Li, L. Bo, and Q. Huang (2019) Spatiotemporal cnn for video object segmentation. In CVPR, pp. 1379–1388. Cited by: §2, Table 1, Table 2.
  • [47] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price, S. Cohen, and T. Huang (2018) Youtube-vos: sequence-to-sequence video object segmentation. In ECCV, pp. 585–601. Cited by: Figure 1, §1, §2, Table 3.
  • [48] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang (2018) Youtube-vos: a large-scale video object segmentation benchmark. In ECCV, pp. 585–601. Cited by: §4.1, §4.2.
  • [49] S. Xu, D. Liu, L. Bao, W. Liu, and P. Zhou (2019) MHP-vos: multiple hypotheses propagation for video object segmentation. In CVPR, pp. 314–323. Cited by: §1, §2, Table 1, Table 2.
  • [50] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos (2018) Efficient video object segmentation via network modulation. In CVPR, pp. 6499–6507. Cited by: §1, §2, Table 1, Table 2, Table 3.
  • [51] D. Yeo, J. Son, B. Han, and J. Hee Han (2017) Superpixel-based tracking-by-segmentation using markov chains. In CVPR, pp. 1812–1821. Cited by: §1.
  • [52] X. Zeng, R. Liao, L. Gu, Y. Xiong, S. Fidler, and R. Urtasun (2019) DMM-net: differentiable mask-matching network for video object segmentation. ICCV, pp. 3929–3938. Cited by: Table 2, Table 3.