RANet: Ranking Attention Network for Fast Video Object Segmentation

08/19/2019 ∙ by Ziqin Wang, et al.

Although online learning (OL) techniques have boosted the performance of semi-supervised video object segmentation (VOS) methods, the huge time cost of OL greatly restricts their practicality. Matching based and propagation based methods run at a faster speed by avoiding OL techniques. However, they are limited by sub-optimal accuracy, due to mismatching and drifting problems. In this paper, we develop a real-time yet very accurate Ranking Attention Network (RANet) for VOS. Specifically, to integrate the insights of matching based and propagation based methods, we employ an encoder-decoder framework to learn pixel-level similarity and segmentation in an end-to-end manner. To better utilize the similarity maps, we propose a novel ranking attention module, which automatically ranks and selects these maps for fine-grained VOS performance. Experiments on the DAVIS16 and DAVIS17 datasets show that our RANet achieves the best speed-accuracy trade-off, e.g., 33 milliseconds per frame and J&F=85.5, exceeding state-of-the-art VOS methods. The code can be found at https://github.com/Storife/RANet.


1 Introduction

Figure 1: Comparison of different VOS frameworks. (a) Matching based framework; (b) Propagation based framework; and (c) Proposed RANet. We propose a novel Ranking Attention module to rank and select important features.

Semi-supervised Video Object Segmentation (VOS) [davis2016, davis2017, davis2018] aims to segment the object(s) of interest from the background throughout a video, where only the annotated segmentation mask of the first frame is provided as the template at test time. This challenging task is of great importance for large-scale video processing and editing [semi02, semi01, zsvos], and for many video analysis applications such as video understanding [Fan2019VideoSal, un02] and object tracking [siammask].

Early VOS methods [osvos, masktrack, osvos-s, onavos] mainly resort to online learning (OL) techniques, which fine-tune a pre-trained classifier on the first frame of each test video. Matching based and propagation based methods have also been proposed for VOS. Matching based methods [videomatch, pml] segment pixels according to the pixel-level matching scores between the features of the first frame and those of each subsequent frame (Fig. 1 (a)), while propagation based methods [masktrack, rgmp, favos, osmn, sfl, semi02] mainly rely on temporally deforming the annotated mask of the first frame via predictions of the previous frame [masktrack] (Fig. 1 (b)).

The respective benefits and drawbacks of these methods are clear. Specifically, OL based methods [osvos, masktrack, osvos-s, onavos] achieve accurate VOS at the expense of speed, requiring several seconds to segment each frame [osvos]. In contrast, simple matching or propagation based methods [masktrack, pml, plm] are faster, but with sub-optimal VOS accuracy. Matching based methods [videomatch, pml, rgmp] suffer from the mismatching problem, i.e., violating the temporal consistency of the primary object, whose appearance constantly changes throughout the video. On the other hand, propagation based methods [masktrack, rgmp, favos, osmn, sfl, ofl] suffer from the drifting problem caused by occlusions or fast motion between two consecutive frames. In summary, most existing methods cannot tackle the VOS task with both satisfactory accuracy and speed, which are essential for practical applications. More efficient methods are still required to reach a better speed-accuracy trade-off for the VOS task.

With the above considerations, in this work, we develop a real-time network for fine-grained VOS performance. The developed network benefits from an encoder-decoder structure, and learns pixel-level matching, mask propagation, and segmentation in an end-to-end manner. Fig. 1 (c) shows a glimpse of the proposed network. A Siamese network [siamfc] is employed as the encoder to extract pixel-level matching features, and a pyramid-like decoder is used for simultaneous mask propagation and high-resolution segmentation.

A key problem in our framework is how to connect the pixel-level matching encoder and the propagation based decoder in a meaningful manner. The encoder produces dynamic foreground and background similarity maps, which cannot be directly fed into the decoder. To this end, we propose a Ranking Attention Module (RAM, see Fig. 1 (c)) to reorganize (i.e., rank and select) the similarity maps according to their importance for fine-grained VOS performance. The proposed Ranking Attention Network (RANet) can better utilize the pixel-level similarity maps for fine-grained VOS, greatly alleviating the drawbacks of previous matching or propagation based methods. Experiments on the DAVIS16 and DAVIS17 datasets [davis2016, davis2017] demonstrate that the proposed RANet outperforms previous VOS methods in terms of speed and accuracy, e.g., achieving a J&F Mean of 85.5 at a speed of 30 FPS on DAVIS16.

The contributions of this work are three-fold:

  • We integrate the benefits of matching and propagation frameworks in an end-to-end manner and develop a real-time network for the semi-supervised VOS task.

  • We propose a novel Ranking Attention Module to rank and select conformable feature maps according to their importance for fine-grained VOS performance.

  • Experiments on the DAVIS datasets show that the proposed RANet achieves competitive or even better performance than previous VOS methods, at real-time speed. The proposed RANet achieves accurate VOS results even when trained only with static images.

2 Related Works

Online learning based methods. OL based methods [osvos, onavos, masktrack, osvos-s, reid, lucid, premvos, premvos2, premvos3] fine-tune on the first frame of a video to extract the primary object(s), and then segment the video frame-by-frame. OSVOS [osvos] uses a pre-trained object segmentation network and fine-tunes it on the first frame of the test video. OnAVOS [onavos] extends OSVOS with an online adaptation mechanism, and OSVOS-S [osvos-s] utilizes semantic information from an instance segmentation network. LucidTracker [lucid] introduces a data augmentation mechanism for online fine-tuning. DyeNet [reid] integrates instance re-identification and temporal propagation, and uses OL to boost performance. PReMVOS [premvos, premvos2, premvos3] integrates techniques from instance segmentation [mask-rcnn], optical flow [flownet, flownet2], refinement, and re-identification [personsearch] together with extensive fine-tuning, and achieves satisfactory performance. In summary, OL is very effective for the VOS task, and subsequent methods [masktrack, cinm, reid] regard OL as a conventional technique to boost VOS performance. However, OL based methods are computationally too expensive for practical applications. In this work, we solve the VOS problem with a very fast network that obtains competitive accuracy at a speed of 30 FPS on DAVIS16, far faster than previous OL based methods [osvos, masktrack, osvos-s, onavos].

Figure 2: Illustration of the proposed RANet. We compute the correlation of the features extracted by the Siamese network. The output similarity maps and the template mask are fed into the RAM module to rank and select the foreground/background similarity maps. Then these maps, together with the previous frame's mask, are fed into the decoder for final segmentation.

Propagation or matching based methods. Propagation based methods additionally resort to the previous frame(s) for better VOS performance. MaskTrack [masktrack] tackles VOS by combining the image and the segmentation mask of the previous frame as the input. This strategy is also used in CINM [cinm], OSMN [osmn] and RGMP [rgmp]. RGMP [rgmp] stacks the features of the first, previous and current frames during propagation through a Siamese architecture. In this work, we also utilize a Siamese network, but use a pixel-level matching technique instead of simple stacking, and feed the previous frame's mask into the decoder instead of the encoder as in RGMP [rgmp]. OSMN [osmn] introduces a modulator that manipulates the intermediate layers of the segmentation network using visual and spatial guidance. Optical flow [flownet, flownet2] is also used to guide the propagation process in many methods [sfl, masktrack, ofl, ctn]; however, it fails to distinguish non-rigid objects from motionless parts of the background. All these strategies are effective, but still suffer from the drifting problem. MaskTrack [masktrack] embraces OL to remember the target object, which alleviates this problem and improves VOS performance. However, since OL is time-consuming, we instead employ more efficient matching techniques to handle the drifting problem.

Matching based methods [videomatch, pml, plm, Voigtlaender2019FEELVOS] are very efficient. They first compute pixel-level matching between the features of the template frame and the current frame, and then segment each pixel of the current frame directly from the matching results. Pixel-Wise Metric Learning [pml] predicts each pixel by nearest-neighbor matching in pixel space to the template frame. However, this point-to-point correspondence strategy [deepmatching, plm] often results in noisy predictions. To ease this problem, we apply a decoder that uses the matching results as guidance. Hu et al. proposed a soft matching mechanism in VideoMatch [videomatch], which performs soft segmentation upon the averaged similarity score maps of matched features to generate smooth predictions. However, due to the lack of temporal information, it still suffers from the mismatching problem. In this work, we employ both point-to-point correspondence matching for pixel-level object localization and temporal propagation, to handle the mismatching and drifting problems. FEELVOS [Voigtlaender2019FEELVOS] employs global and local matching for more stable pixel-level matching, but only keeps extreme-value maps for the final segmentation, losing most of the information in the similarity maps. Our RAM can better utilize this similarity information. Moreover, for faster speed, we use a light-weight decoder and employ a standard ResNet [resnet] pre-trained on ImageNet [imagenet] as the backbone, instead of the time-consuming semantic segmentation networks [deeplab, deeplabv3, deeplabv3+, YazhaoSegICCV2019] used in previous methods [videomatch, masktrack].

3 Proposed Method

In this section, we first provide an overview of the developed Ranking Attention Network (RANet) in §3.1. In §3.2, we describe the proposed Ranking Attention Module (RAM), and extend it for multi-object VOS in §3.3. Finally, we present the implementation details and training strategies for RANet in §3.4 and §3.5, respectively.

3.1 Network Overview

Our RANet consists of three seamless parts: an encoder for feature extraction, an integration of correlation and RAM, and a decoder for feature merging and final segmentation. An illustration of our RANet is shown in Fig. 2.

Siamese Encoder. To obtain correlation information for accurate VOS, we employ a Siamese network [siamfc] (with shared weights) as the encoder to extract the features of the first frame and of the current frame. We then reshape the pixel-level features of the first frame into a conformable shape and use them as the template features for the correlation calculation.

Correlation and RAM for Matching. Correlation is widely used in object tracking. In SiamFC [siamfc], correlation is used to locate the position of the object via similarity maps. In our RANet, to locate each pixel of the object(s) for segmentation, we need pixel-level similarity maps, obtained by computing the correlation between each pixel-level feature of the template frame and the current frame. Note that there is one similarity map for each pixel-level template feature. The detailed formulation of the correlation is described in §3.2. We then utilize the mask of the first frame to select foreground (FG) or background (BG) similarity maps as FG or BG features for segmentation. Since the number of FG or BG pixels varies across videos, the number of FG or BG similarity maps is dynamic, and hence the decoder has to deal with FG or BG similarity features with dynamic channel sizes. To handle this dynamic channel-size problem, we propose a RAM module to rank and select the most important similarity maps and organize them into a conformable shape. This part is explained in detail in §3.2. The RAM module provides abundant and ordered features for segmentation, and leads to better performance, as shown in the ablation study in §4.3. For simplicity, we only consider single-object VOS in §3.2. The extension of our RANet to multi-object VOS is described in §3.3.

Propagation. Here we utilize the simple mask propagation method [masktrack], while other propagation [reid, flownet2] or local-matching [Voigtlaender2019FEELVOS] methods would potentially improve our RANet. We feed the predicted mask of the previous frame, together with the selected features of FG (or BG) by the proposed RAM, into the subsequent decoder. In this way, our RANet utilizes both matching and propagation techniques.

Light-weight Decoder. This part contains a merge module and a pyramid-like network, which are described in the Supplementary File. The merge module refines the two streams of ranked similarity maps and then concatenates these maps with the previous frame's mask. In the merge module, the two streams share the same parameters. A pyramid-like network [u-net, refinenet, accv] is employed to obtain the final segmentation, with skip connections to utilize the multi-scale features of different layers.
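To make the data flow concrete, the following is a minimal sketch of how the two ranked similarity streams, the previous frame's mask, and the encoder skip features could be combined. The layer widths and the exact wiring are assumptions for illustration only (the actual merge module and decoder are specified in the Supplementary File); the sketch is in PyTorch-style Python.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MergeAndDecode(nn.Module):
    """Illustrative merge-and-decode block: a shared merge branch refines the
    ranked FG and BG similarity features, the previous frame's mask is
    concatenated, and a small top-down path with skip connections predicts
    the full-resolution mask. Channel sizes are placeholders."""
    def __init__(self, sim_ch=256, skip_chs=(128, 64)):
        super().__init__()
        self.merge = nn.Sequential(                       # shared by FG and BG streams
            nn.Conv2d(sim_ch, 128, 3, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(128 * 2 + 1, 128, 3, padding=1)   # FG + BG + prev. mask
        self.up1 = nn.Conv2d(128 + skip_chs[0], 64, 3, padding=1)
        self.up2 = nn.Conv2d(64 + skip_chs[1], 32, 3, padding=1)
        self.head = nn.Conv2d(32, 1, 1)                   # mask logits

    def forward(self, fg, bg, prev_mask, skips):
        prev = F.interpolate(prev_mask, size=fg.shape[-2:],
                             mode='bilinear', align_corners=False)
        x = F.relu(self.fuse(torch.cat([self.merge(fg), self.merge(bg), prev], dim=1)))
        for up, skip in zip((self.up1, self.up2), skips):  # pyramid with skip connections
            x = F.interpolate(x, size=skip.shape[-2:],
                              mode='bilinear', align_corners=False)
            x = F.relu(up(torch.cat([x, skip], dim=1)))
        return self.head(x)
```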

3.2 Correlation and Ranking Attention Module

Figure 3: Mechanism of the proposed Ranking Attention Module. In the FG (or BG) path, only the FG (or BG) similarity maps are selected. The maps are ranked from top to bottom according to ranking scores learned by the attention network, and padding or discarding is applied to obtain a fixed number of FG (or BG) maps. Finally, these maps are concatenated along the channel dimension as features of a fixed size.

Correlation. We utilize correlation to find the matching between pixels in the template and current frames. Denote $\mathcal{F}_t \in \mathbb{R}^{C \times H_t \times W_t}$ and $\mathcal{F}_c \in \mathbb{R}^{C \times H_c \times W_c}$ as the features of the template and current frames extracted by the Siamese encoder, where $C$ is the number of feature channels, and $H_t$ ($W_t$) and $H_c$ ($W_c$) are the heights (widths) of the template and current frame feature maps, respectively. We reshape the template features into the set $\{p_i\}_{i=1}^{H_t W_t}$ of $H_t W_t$ pixel-wise features, each of size $C \times 1 \times 1$. In our RANet, the correlation is computed between the $\ell_2$-normalized template features $\bar{p}_i$ and the $\ell_2$-normalized current frame features $\bar{\mathcal{F}}_c$. After correlation, we obtain the similarity tensor $\mathbf{S} \in \mathbb{R}^{H_t W_t \times H_c \times W_c}$, whose $i$-th channel is the similarity map of the $i$-th template pixel:

$\mathbf{S}_i = \bar{p}_i * \bar{\mathcal{F}}_c, \quad i = 1, \dots, H_t W_t,$   (1)

where $*$ denotes the correlation (i.e., convolution with a $1 \times 1$ kernel) operator.
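As a concrete illustration of Eq. (1), the similarity maps can be computed by treating each L2-normalized template pixel as a 1×1 convolution kernel; the following PyTorch sketch is one possible realization, not necessarily the released implementation.

```python
import torch
import torch.nn.functional as F

def pixel_correlation(feat_t, feat_c):
    """feat_t: (C, Ht, Wt) template features; feat_c: (C, Hc, Wc) current features.
    Returns S of shape (Ht*Wt, Hc, Wc): one similarity map per template pixel."""
    C, Ht, Wt = feat_t.shape
    # Each template pixel becomes a 1x1 filter with C input channels.
    kernels = feat_t.reshape(C, Ht * Wt).t().reshape(Ht * Wt, C, 1, 1)
    kernels = F.normalize(kernels, dim=1)             # L2-normalize each template pixel
    cur = F.normalize(feat_c.unsqueeze(0), dim=1)     # L2-normalize along channels
    sim = F.conv2d(cur, kernels)                      # (1, Ht*Wt, Hc, Wc)
    return sim.squeeze(0)
```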

In Fig. 4, we present some examples of the similarity maps. Each similarity map is associated with a certain pixel in the template frame, whose new position in the current frame is at the maximum (i.e., the brightest point) of the similarity map. Additionally, in contrast with SiamFC [siamfc], since we obtain these maps in a weakly-supervised manner, the contours of the bear, which are essential for segmentation, are preserved. On the right side of Fig. 4, we show some output features of the merge module; the object can be clearly distinguished after merging.

Ranking Attention Module (RAM). We first utilize the mask of the first frame to filter FG and BG similarity maps. Then we design an FG path and a BG path to process the similarity features. Since the number of FG or BG pixels varies across videos, the number of FG or BG similarity maps changes dynamically. However, regular CNNs require input features with a fixed number of channels. To tackle this issue, we propose a Ranking Attention Module (RAM) to rank and select important features. That is, we learn a scoring scheme for the similarity maps, and then rank and select these maps according to their scores.

Figure 4: Visualization of the similarity maps. Left: the template and current frames, and 4 foreground similarity maps. Right: the similarity maps after merging.

As shown in Fig. 2, there are three steps in our RAM. In the first step, we filter the FG (or BG) similarity maps using the mask of the first frame. Specifically, we treat the template's spatial dimension of $\mathbf{S}$ as the channel dimension (i.e., each of the $H_t W_t$ channels corresponds to one template pixel) and multiply these channels with the FG or BG mask of the first frame (resized to $H_t \times W_t$), respectively. Thus, we obtain the FG (or BG) features $\mathbf{S}^{FG}$ (or $\mathbf{S}^{BG}$); in the FG tensor, the channels corresponding to BG pixels are set to zero, and vice versa. In the second step, for each similarity map we learn a ranking score that indicates its importance. Taking the FG tensor $\mathbf{S}^{FG}$ for instance, to calculate the ranking scores of the similarity maps in $\mathbf{S}^{FG}$, we use a two-layer network $f_{att}(\cdot)$, strengthened by summing its output with the channel-wise global max-pooling of the tensor in an element-wise manner. The channel-wise maximum of each similarity map represents the possibility that the corresponding template pixel finds a matching pixel in the current frame, and a larger score indicates greater importance of the corresponding similarity map in $\mathbf{S}^{FG}$. We define the final FG ranking scores as

$\mathbf{R}^{FG} = f_{att}(\mathbf{S}^{FG}) + \mathrm{GMP}(\mathbf{S}^{FG}),$   (2)

where $\mathrm{GMP}(\cdot)$ denotes channel-wise global max-pooling. We then reshape $\mathbf{R}^{FG}$ into a vector $r^{FG} \in \mathbb{R}^{H_t W_t}$, with one score per similarity map. Similarly, we obtain the BG ranking score vector $r^{BG}$.

Finally, we rank the similarity maps in $\mathbf{S}^{FG}$ according to the corresponding scores in $r^{FG}$, from largest to smallest:

$\hat{\mathbf{S}}^{FG} = \mathrm{Rank}(\mathbf{S}^{FG}; r^{FG}).$   (3)

If the number of FG similarity maps is less than the target channel size (set as 256), we pad the ranked features with zero maps; if the number is larger than the target channel size, the redundant features are discarded, so that the channel size is fixed. The BG tensor $\mathbf{S}^{BG}$ is processed in the same way. An illustration of the proposed ranking mechanism is shown in Fig. 3.
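A minimal sketch of the scoring, ranking, and pad/truncate steps in Eqs. (2)-(3) follows. The exact form of the two-layer scoring network $f_{att}$ is not reproduced here, so the small per-map network below is only an assumed stand-in.

```python
import torch
import torch.nn as nn

class RankingAttention(nn.Module):
    """Rank FG (or BG) similarity maps by a learned score plus their global
    maximum, then pad or truncate to a fixed channel size (Eqs. (2)-(3))."""
    def __init__(self, n_out=256):
        super().__init__()
        # Scores each similarity map independently, so a variable number of maps is fine.
        self.score_net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, stride=2, padding=1), nn.AdaptiveMaxPool2d(1))
        self.n_out = n_out

    def forward(self, sim, mask_t):
        # sim: (K, Hc, Wc) with K = Ht*Wt; mask_t: (K,) binary FG (or BG) template mask.
        maps = sim[mask_t.bool()]                          # keep only FG (or BG) maps
        gmp = maps.flatten(1).max(dim=1).values            # channel-wise global maximum
        score = self.score_net(maps.unsqueeze(1)).flatten() + gmp   # Eq. (2)
        order = torch.argsort(score, descending=True)      # Eq. (3): rank by score
        ranked = maps[order]
        if ranked.shape[0] < self.n_out:                   # pad with zero maps ...
            pad = ranked.new_zeros(self.n_out - ranked.shape[0], *ranked.shape[1:])
            ranked = torch.cat([ranked, pad], dim=0)
        return ranked[:self.n_out]                         # ... or discard the surplus
```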

3.3 Extension for Multi-object VOS

A trivial extension of single-object VOS methods to multi-object VOS is to process the objects in a video one-by-one. However, this strategy is inefficient when there are many objects. To make the proposed RANet efficient for multi-object VOS, we share the features extracted by the encoder, as well as the similarity maps computed by correlation, across all the objects. Then, for each object, we generate its FG mask and the corresponding BG mask, and segment the FG (or BG) independently using the light-weight decoder. Finally, we use a softmax function over the per-object predictions to compute the final VOS results.
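The sharing scheme can be summarized as below; `correlate` and `decode` are hypothetical sub-modules standing in for the correlation/RAM stage and the light-weight decoder, and the softmax-over-objects merge is a simplified rendering of the description above.

```python
import torch

def multi_object_step(ranet, template_feat, cur_feat, object_masks, prev_masks):
    """Encoder features and correlation are computed once and shared; only the
    light-weight per-object decoding is repeated for each object."""
    sim = ranet.correlate(template_feat, cur_feat)         # shared across objects
    logits = [ranet.decode(sim, m, p)                      # per-object FG/BG decoding
              for m, p in zip(object_masks, prev_masks)]
    probs = torch.softmax(torch.stack(logits, dim=0), dim=0)   # merge with a softmax
    return probs.argmax(dim=0)                             # final object-label map
```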

Figure 5: Illustrations of the training samples.

3.4 Implementation Details

Here, we briefly describe the encoder and decoder; the detailed network structure is presented in the Supplementary File.

Encoder. The backbone of the two-stream Siamese encoder [siamfc] is a ResNet-101 network [resnet] pre-trained on ImageNet [imagenet]. We replace batch normalization [bn2015] with instance normalization [in2016]. The features from the last three blocks are extracted as multi-scale features, and we reduce their channel sizes four-fold via convolutional layers. The features are also resized to a conformable size. Channel-wise normalization [hoffer2018norm] is added after each convolutional layer for feature pruning and multi-scale merging.
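The two encoder modifications described above (swapping batch normalization for instance normalization in an ImageNet-pretrained ResNet-101, and four-fold channel reduction of the multi-scale features) can be sketched as follows; which three blocks are used and how the reduced features are later merged are assumptions here, not taken from the released code.

```python
import torch.nn as nn
from torchvision.models import resnet101

def bn_to_in(module: nn.Module) -> nn.Module:
    """Recursively replace BatchNorm2d layers with InstanceNorm2d."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.InstanceNorm2d(child.num_features, affine=True))
        else:
            bn_to_in(child)
    return module

backbone = bn_to_in(resnet101(pretrained=True))   # ImageNet-pretrained ResNet-101

# Four-fold channel reduction of the multi-scale features; assuming the "last
# three blocks" are layer2-layer4 (512/1024/2048 output channels).
reducers = nn.ModuleList([nn.Conv2d(c, c // 4, kernel_size=1)
                          for c in (512, 1024, 2048)])
```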

Decoder. The decoder is a three-level pyramid-like network with skip connections. The multi-scale features of the current frame extracted by the encoder are fed into the decoder. However, using all these features directly would incur a huge computational cost. To speed up our RANet, we first reduce the channel sizes of the multi-scale features using convolutional layers, and then feed them into the decoder.

3.5 Network Training

We train our network using the Adam optimizer [adam] to minimize a binary cross-entropy loss. During training and testing, the input images are resized to a fixed resolution. We use random Thin Plate Spline (TPS) transformations, rotations, scaling, and random cropping for data augmentation, as in [masktrack]. The random TPS transformations are performed by setting control points and randomly shifting them within a small margin of the image size.
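A minimal training step consistent with the description above (Adam plus a binary cross-entropy loss) might look like the following; the model interface and the learning-rate value are placeholders, since the exact value is not reproduced here.

```python
import torch
import torch.nn as nn

def make_train_step(ranet: nn.Module, lr: float = 1e-5):   # lr value is illustrative only
    criterion = nn.BCEWithLogitsLoss()                     # binary cross-entropy on mask logits
    optimizer = torch.optim.Adam(ranet.parameters(), lr=lr)

    def step(template_img, template_mask, current_img, prev_mask, gt_mask):
        logits = ranet(template_img, template_mask, current_img, prev_mask)
        loss = criterion(logits, gt_mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    return step
```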

Pre-train on static images. Following [masktrack], we pre-train the proposed RANet using static images. To train our RANet for single-object VOS, we use images from the MSRA10K [msra10k], ECSSD [ecssd], and HKU-IS [hku] datasets from the saliency detection community [fan2019rethinking, Zhao2019RgbdSal, Fan2019VideoSal, Zhao2019ebd, un03, Lu_2019_CVPR]. To train RANet for multi-object VOS, we add the SOC [soc] and ILSO [ilso] datasets, which contain multi-object images. Fig. 5 (a) shows a pair of generated static training images. As shown in §4.2 and §4.3, the proposed RANet achieves competitive results even when trained only with static images.

Video fine-tuning. Although our RANet achieves satisfactory results when trained only with static images, we further improve its performance by fine-tuning on video data. For the single-object VOS task, we fine-tune the network on the training set of the DAVIS16 dataset [davis2016]. During training, we randomly select two frames (with data transformations) from one video as the template and current frames, and randomly select the mask of a frame near the current frame (with a maximum interval of 5) as the previous frame's mask. For the multi-object VOS task, we fine-tune our RANet on the training set of the DAVIS17 dataset [davis2017]. Fig. 5 (b) shows an example of paired video training images.
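The sampling of one training triplet can be sketched as follows; whether the mask frame may also lie after the current frame is an assumption, as the text only states a maximum interval of 5.

```python
import random

def sample_training_triplet(frames, masks, max_interval=5):
    """Pick a template frame, a current frame, and a nearby frame whose mask
    plays the role of the previous frame's mask during fine-tuning."""
    n = len(frames)
    t_idx, c_idx = random.sample(range(n), 2)              # template and current frames
    lo, hi = max(0, c_idx - max_interval), min(n - 1, c_idx + max_interval)
    p_idx = random.randint(lo, hi)                         # frame near the current one
    return (frames[t_idx], masks[t_idx],                   # template image and mask
            frames[c_idx], masks[c_idx],                   # current image and ground truth
            masks[p_idx])                                  # propagated ("previous") mask
```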

 

Method | OL | Time (ms) | J&F Mean | J Mean | J Recall | J Decay | F Mean | F Recall | F Decay
OSVOS [osvos] | ✓ | 5000 | 80.2 | 79.8 | 93.6 | 14.9 | 80.6 | 92.6 | 15.0
MaskTrack [masktrack] | ✓ | 12000 | 77.6 | 79.7 | 93.1 | 8.9 | 75.4 | 87.1 | 9.0
CINM [cinm] | ✓ | 70000 | 84.2 | 83.4 | 94.9 | 12.3 | 85.0 | 92.1 | 14.7
OSVOS-S [osvos-s] | ✓ | 4500 | 86.6 | 85.6 | 96.8 | 5.5 | 87.5 | 95.9 | 8.2
OnAVOS [onavos] | ✓ | 13000 | 85.5 | 86.1 | 96.1 | 5.2 | 84.9 | 89.7 | 5.8
PReMVOS [premvos] | ✓ | 38000 | 86.8 | 84.9 | 96.1 | 8.8 | 88.6 | 94.7 | 9.8
RANet+ | ✓ | 4000 | 87.1 | 86.6 | 97.0 | 7.4 | 87.6 | 96.1 | 8.2
PLM [plm] | | 500 | 66.4 | 70.2 | 86.3 | 11.2 | 62.5 | 73.2 | 14.7
VPN [vpn] | | 630 | 67.9 | 70.2 | 82.3 | 12.4 | 65.5 | 69.0 | 14.4
SiamMask [siammask] | | 28 | 70.0 | 71.7 | 86.8 | 3.0 | 67.8 | 79.8 | 2.1
CTN [ctn] | | 30000 | 71.4 | 73.5 | 87.4 | 15.6 | 69.3 | 79.6 | 12.9
OSMN [osmn] | | 130 | 73.5 | 74.0 | 87.6 | 9.0 | 72.9 | 84.0 | 10.6
SFL [sfl] | | 7900 | 76.1 | 76.1 | 90.6 | 12.1 | 76.0 | 85.5 | 10.4
PML [pml] | | 280 | 77.4 | 75.5 | 89.6 | 8.5 | 79.3 | 93.4 | 7.8
VideoMatch [videomatch] | | 320 | - | 81.0 | - | - | - | - | -
FAVOS [favos] | | 1800 | 81.0 | 82.4 | 96.5 | 4.5 | 79.5 | 89.4 | 5.5
FEELVOS [Voigtlaender2019FEELVOS] | | 510 | 81.7 | 81.1 | 90.5 | 13.7 | 82.2 | 86.6 | 14.1
RGMP [rgmp] | | 130 | 81.8 | 81.5 | 91.7 | 10.9 | 82.0 | 90.8 | 10.1
RANet | | 33 | 85.5 | 85.5 | 97.2 | 6.2 | 85.4 | 94.9 | 5.1
Table 1: Comparison of objective metrics and running time (in milliseconds) of different methods on the DAVIS16-val dataset. The upper group consists of online learning (OL) based methods; the lower group consists of offline methods.

 

Method | J Mean | J Recall | J Decay
BVS [bvs] | 66.5 | 76.4 | 26.0
OFL [ofl] | 71.1 | 80.0 | 22.7
VPN [vpn] | 75.0 | 90.1 | 9.3
CTN [ctn] | 75.5 | 89.0 | 14.4
MaskTrack [masktrack] | 80.3 | 93.5 | 8.9
RANet | 83.2 | 94.2 | 9.3
RANet+OL | 86.2 | 96.2 | 7.6
Table 2: Comparison of different methods without video fine-tuning on the DAVIS16-trainval dataset. "RANet+OL" denotes the proposed RANet boosted by OL techniques.

4 Experiments

In this section, we first describe our experimental protocol (§4.1), and then compare the proposed ranking attention network (RANet) with the state-of-the-art VOS methods (§4.2). We next perform a comprehensive ablation study to gain deeper insights into the proposed RANet, especially on the effectiveness of the ranking attention module (§4.3). Finally, we present the visual results to show the robustness of RANet against challenging scenarios (§4.4). More results are provided in the Supplementary File.

4.1 Experimental Protocol

Datasets. We evaluate the proposed RANet on the DAVIS16 [davis2016] and DAVIS17 [davis2017] datasets. The DAVIS16 dataset [davis2016] contains 50 videos (480p) densely annotated with pixel-level object masks (one object per sequence) on a total of 3455 frames; it is divided into a training set (30 videos) and a validation set (20 videos). The DAVIS17 dataset [davis2017], which contains videos with multiple annotated objects, is an extension of DAVIS16; it provides a training set with 60 videos, a validation set with 30 videos, and a test-dev set with 30 videos. In all datasets, there is no overlap among the training, validation, and test sets.

Testing phase. Similar to SiamFC [siamfc], we crop the first frame and extract its features as the template features ($\{p_i\}$ in §3.2), then compute the similarity maps between the template features and the features of each test frame one-by-one, and finally segment the current test frame. The video data are used for different goals: 1) to evaluate our RANet for single-object VOS, we test it on the validation set (20 videos) of DAVIS16 [davis2016]; 2) to judge the effectiveness of our RANet trained only on static images, we evaluate it on all 50 videos of the DAVIS16 dataset; 3) to assess our RANet for multi-object VOS, we evaluate it on the validation and test-dev sets of DAVIS17 [davis2017], which contain 30 videos each. To compare with OL based methods, we follow [osvos, masktrack] and fine-tune on the first frame of each video with data augmentation, using the same training strategy as in pre-training on static images but with a different learning rate.
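The test-time flow described above amounts to encoding the template once per video and then propagating frame-by-frame; the sketch below uses hypothetical `encode`/`correlate`/`decode` sub-modules to make that flow explicit.

```python
import torch

@torch.no_grad()
def segment_video(ranet, frames, first_mask):
    """Encode the first frame once, then segment the remaining frames one-by-one,
    feeding each prediction back as the previous frame's mask."""
    template_feat = ranet.encode(frames[0])                # computed once per video
    prev_mask, outputs = first_mask, [first_mask]
    for frame in frames[1:]:
        cur_feat = ranet.encode(frame)
        sim = ranet.correlate(template_feat, cur_feat)     # pixel-level similarity maps
        prev_mask = ranet.decode(sim, first_mask, prev_mask)   # RAM + decoder
        outputs.append(prev_mask)
    return outputs
```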

Evaluation metrics. We use seven standard metrics suggested by [davis2016]: three region similarity metrics (J Mean, J Recall, J Decay), three boundary accuracy metrics (F Mean, F Recall, F Decay), and the J&F Mean, which is the average of J Mean and F Mean.
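For reference, the region similarity J of a single frame is the intersection-over-union between the predicted and ground-truth masks (F is the analogous boundary F-measure, omitted here); a short NumPy version:

```python
import numpy as np

def region_similarity(pred, gt):
    """J for one frame: IoU of the binary predicted and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union
```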

4.2 Comparison to the state of the art

Comparison Methods. For single object VOS, we compare our RANet with state-of-the-art OL based and offline methods [osvos-s, masktrack, osvos, cinm, onavos, premvos, plm, vpn, siammask, ctn, osmn, sfl, pml, videomatch, favos, Voigtlaender2019FEELVOS, rgmp] in Table 1, including OSVOS-S [osvos-s], PReMVOS [premvos], RGMP [rgmp], FEELVOS [Voigtlaender2019FEELVOS], etc. To evaluate our RANet trained with static images, we compare it with some methods [bvs, ofl, vpn, ctn, masktrack] without using DAVIS training set. For multi-object VOS, we compare with some state-of-the-art offline methods [osvos, onavos, favos, osmn, videomatch], and also list results of some OL based methods [cinm, osvos-s, onavos, osvos, videomatch] for reference.

Results on DAVIS16-val. As shown in Table 1, without the online learning (OL) technique, our RANet achieves a J&F Mean of 85.5 at a speed of 33 milliseconds per frame (30 FPS). The metric results of RANet are higher than those of all methods without OL, while its speed is higher than that of all compared methods except SiamMask [siammask]. Note, however, that SiamMask performs considerably worse on the objective metrics, e.g., a J&F Mean of 70.0, which is 15.5 points lower than our RANet. Even compared with state-of-the-art OL based methods such as OSVOS-S [osvos-s] and OnAVOS [onavos], our offline RANet achieves comparable results. RANet can be further improved by OL techniques: the OL-boosted RANet, denoted as RANet+, achieves a J&F Mean of 87.1, outperforming all OL based VOS methods.

 

Method | OL | J&F Mean (DAVIS17-val) | J Mean (DAVIS17-val) | J&F Mean (DAVIS17-testdev) | J Mean (DAVIS17-testdev)
CINM [cinm] | ✓ | 70.6 | 67.2 | 67.5 | 64.5
OSVOS-S [osvos-s] | ✓ | 68.0 | 64.7 | 57.5 | 52.9
OnAVOS [onavos] | ✓ | 65.4 | 61.6 | 52.8 | 49.9
OSVOS [osvos] | ✓ | 60.3 | 56.6 | 50.9 | 47.0
VideoMatch [videomatch] | ✓ | 61.4 | - | - | -
OSVOS [osvos] | | 36.6 | - | - | -
OnAVOS [onavos] | | 39.5 | - | - | -
FAVOS [favos] | | 58.2 | 54.6 | 43.6 | 42.9
OSMN [osmn] | | 54.8 | 52.5 | 41.3 | 37.7
VideoMatch [videomatch] | | 56.5 | - | - | -
RANet | | 65.7 | 63.2 | 55.3 | 53.4
Table 3: Comparison of different methods on the DAVIS17-val and DAVIS17-testdev datasets. The methods are divided into two groups according to whether the online learning (OL) technique is employed.

Results on DAVIS16-trainval. We also evaluate the performance of our RANet trained only with static images (i.e., without video fine-tuning). MaskTrack [masktrack] has the setting most similar to our RANet in this case, since it also uses only static images to train its networks. In contrast to MaskTrack, our RANet does not rely on OL techniques and is thus nearly a hundred times faster. In Table 2, we list the results of different methods that do not require fine-tuning/training on video data. Again, our RANet outperforms all the other methods by a clear margin.

DAVIS17 dataset: The DAVIS17 dataset is challenging due to its multi-object scenarios. To evaluate our RANet on the DAVIS17-val and DAVIS17-testdev sets, we use the RANet trained on multi-instance static images and the DAVIS17 training set, as described in §3.5. In Table 3, we compare our RANet with state-of-the-art VOS methods. On the DAVIS17-val dataset, our RANet achieves higher metric results than the methods without OL. Furthermore, on the more challenging DAVIS17-testdev dataset, our RANet even outperforms the OL based method OnAVOS in terms of J&F Mean.

Speed. Here, we evaluate the speed-accuracy trade-off of different methods on the DAVIS16-val set. Our RANet runs on a TITAN Xp GPU. In Table 1, we list the average time each method needs to process a frame of 480p resolution. The proposed RANet spends 33 milliseconds per frame, much faster than most previous methods. As shown in Fig. 6, the recently proposed SiamMask [siammask] is slightly faster than our RANet, but at the expense of a much lower Mean accuracy than ours.

4.3 Validation of the Proposed RANet

We now conduct a more detailed examination of the proposed RANet on the VOS task. We assess 1) the contribution of the proposed ranking attention module (RAM) to RANet; 2) the importance of the correlation layer (CL); 3) the influence of propagating the previous frame's mask (PM); 4) the effect of static image pre-training (IP) and video fine-tuning (VF); and 5) the impact of the online learning (OL) technique on RANet.

Figure 6: Comparison of accuracy (Mean) and speed (in FPS) of different methods on the DAVIS16-val dataset.

 

Variant | w/ RAM | w/o Ranking | Maximum
J Mean | 85.5 | 81.9 | 81.1
Table 4: Comparison of J Mean of different variants of RANet on the DAVIS16-val dataset.

1. Does the proposed ranking attention module contribute to RANet? To evaluate the contribution of the proposed RAM module, we compare the original RANet (denoted w/ RAM) with two baselines. For the first baseline, w/o Ranking, we keep all the similarity maps in $\mathbf{S}$ and obtain the FG (or BG) similarity maps $\mathbf{S}^{FG}$ (or $\mathbf{S}^{BG}$) by setting the corresponding BG (or FG) channels to zero according to the template mask, without ranking or selection. For the second baseline, Maximum, instead of using RAM to obtain abundant embedding maps, we apply a channel-wise maximum operation, as also used in [Voigtlaender2019FEELVOS], on the similarity maps $\mathbf{S}^{FG}$ and $\mathbf{S}^{BG}$, respectively, to obtain one FG and one BG map. These maps are then fed into the decoder.
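The Maximum baseline reduces each stream to a single map by a channel-wise maximum; a short sketch (shapes as in §3.2):

```python
import torch

def maximum_baseline(fg_sim, bg_sim):
    """Collapse the FG and BG similarity tensors (K, Hc, Wc each) into one map
    per stream via a channel-wise maximum, instead of ranking with RAM."""
    fg_map, _ = fg_sim.max(dim=0, keepdim=True)            # (1, Hc, Wc)
    bg_map, _ = bg_sim.max(dim=0, keepdim=True)
    return fg_map, bg_map
```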

The comparison of RANet w/ RAM, w/o Ranking, and Maximum is listed in Table 4. The RANet w/ RAM achieves a J Mean 3.6 and 4.4 points higher than the baselines w/o Ranking and Maximum, respectively. The RANet w/o Ranking organizes the similarity maps only according to the spatial layout of the template frame, while the RANet with Maximum loses most of the useful information in the similarity maps by extracting only the maximum values.

Figure 7: Qualitative results of the proposed RANet on challenging VOS scenarios. The test frames are from videos in the DAVIS16 set (1st and 2nd rows), the DAVIS17-val set (3rd row), and the DAVIS17-testdev set (4th and 5th rows).

2. How important are the correlation layer and RAM to our RANet? To evaluate the importance of the correlation layer, we remove it and simply concatenate the features extracted by the encoder, as RGMP [rgmp] does. The subsequent RAM module then becomes meaningless and is also removed. This gives a new variant of RANet: -CL. As shown in Table 5, the performance of this variant degrades drastically (67.5 on J Mean). Thus, the correlation layer is important to our RANet and serves as the basis for the proposed RAM module.

 

Method | origin | -CL | -PM | -IP | -VF
RGMP [rgmp] | 81.5 | - | 73.5 | 68.6 | 55.0
RANet | 85.5 | 67.5 | 81.4 | 73.2 | 79.9
Table 5: Ablation study of RANet on J Mean. CL, PM, IP, and VF denote the Correlation Layer, the Previous frame's Mask, static Image Pre-training, and Video Fine-tuning, respectively.

 

Metric | offline | +OL | +OL | +OL | +OL
J&F Mean | 85.5 | 86.2 | 86.8 | 86.9 | 87.1
Time (s) | 0.033 | 0.30 | 1.00 | 1.50 | 4.00
Table 6: Influence of online learning on RANet in terms of J&F Mean and runtime (in seconds); the number of OL iterations increases from left to right.

3. How does the previous frame's mask (PM) influence our RANet? To study this, we set all pixels of the PM to zero and re-train our RANet, obtaining the baseline -PM. The results in Table 5 show that the -PM variant of RANet drops the J Mean by 4.1 points. This indicates that the temporal information propagated by PM is very useful for our RANet.

4. What are the effects of pre-training on static images and video fine-tuning in our RANet? To answer this question, we study how each training strategy affects the performance of RANet. We first train RANet only on video data, giving the baseline -IP, and then train RANet only on static images, giving the second baseline -VF. The J Mean results of the variants -IP and -VF on the DAVIS16-val dataset are listed in Table 5. Both baselines drop significantly compared to the original RANet. Specifically, static image pre-training (IP) improves the J Mean from 73.2 to 85.5, while video fine-tuning (VF) improves the J Mean by 5.6 points (from 79.9 to 85.5). The large drop when removing IP (from 85.5 to 73.2) is mainly due to over-fitting of RANet on the DAVIS16 training set, which contains only 30 single-object videos.

5. The trade-off between performance and speed using online learning. In Table 6, we show the performance and runtime of RANet with and without the OL technique. As the number of OL iterations increases, the J&F Mean of our RANet improves continuously, to varying extents, at the cost of speed.

4.4 Qualitative Results

In Fig. 7, we show some qualitative results of the proposed RANet on the DAVIS16 and DAVIS17 datasets. The RANet is robust against many challenging scenarios, such as appearance changes (1st row), fast motion (2nd row), occlusions, and multiple objects (3rd to 5th rows).

5 Conclusion

In this work, we proposed a real-time and accurate VOS network, which runs at 30 FPS on a single TITAN Xp GPU. The proposed Ranking Attention Network (RANet) learns pixel-level feature matching and mask propagation for VOS in an end-to-end manner. A ranking attention module was proposed to better utilize the similarity features for fine-grained VOS performance. The network treats the point-to-point matching features as guidance rather than as final results, which avoids noisy predictions. Experiments on the DAVIS datasets demonstrate that our RANet achieves state-of-the-art performance in both segmentation accuracy and speed.

This work can be further extended. First, the proposed ranking attention module can be applied to other applications such as object tracking [siammask] and stereo vision [Khamis2018StereoNet]. Second, better propagation [flownet, flownet2] or local matching [Voigtlaender2019FEELVOS] techniques can be employed for better VOS performance.

Acknowledgements. We thank Dr. Song Bai on the initial discussion of this project.

References