CRVOS: Clue Refining Network for Video Object Segmentation
Encoder-decoder based methods for semi-supervised video object segmentation (Semi-VOS) have received extensive attention due to their superior performance. However, most of them rely on complex intermediate networks that generate strong specifiers in order to be robust in challenging scenarios, which is quite inefficient in relatively simple scenarios. To solve this problem, we propose a real-time Clue Refining Network for Video Object Segmentation (CRVOS), which has no complex intermediate network. In this work, we propose a simple specifier, referred to as the Clue, which consists of the previous frame's coarse mask and coordinate information. We also propose a novel refine module that outperforms general ones by using a deconvolution layer instead of bilinear upsampling. Our proposed network, CRVOS, is the fastest method with competitive performance. On the DAVIS16 validation set, CRVOS achieves 61 FPS and a J&F score of 81.6%.
Semi-supervised video object segmentation (Semi-VOS) is the task of labeling every pixel in a video given a mask, i.e. the initial frame’s mask. Semi-VOS methods can be divided into two categories: those with and those without online fine-tuning. The methods with online fine-tuning [caelles2017one, voigtlaender2017online, perazzi2017learning, hu2017maskrnn] generally perform better than those without, since they can adapt to the given sequence. However, they are usually slower, since online fine-tuning is computationally expensive. On the other hand, the methods without online fine-tuning [marki2016bilateral, jampani2017video, cheng2018fast, yang2018efficient] are generally faster, but their performance is not comparable.
For this reason, the performance of methods with and without online fine-tuning used to differ greatly. RGMP [wug2018fast] reduced the gap by achieving satisfactory performance without online fine-tuning. Since RGMP succeeded with an encoder-decoder architecture, many recent methods [johnander2019generative, lin2019agss, zeng2019dmm, oh2019video, wang2019ranet] have been built on the same architecture, and most of them show competitive performance. They extract features from the current frame’s input image with an encoder. These features are passed to a decoder through skip connections, and the decoder predicts the current frame’s mask from the features and the specifier.
Semi-VOS deals with an arbitrary target given in the initial frame, which makes it hard to specify the target with an encoder and a decoder alone. Therefore, encoder-decoder based methods place intermediate networks between their encoders and decoders to generate additional information about the target. We call this additional information a specifier, since it is used to specify the given target. For example, RGMP [wug2018fast] uses a Siamese encoder with a Global Convolution Block, STM [oh2019video] uses Space-Time Memory Networks, and RANet [wang2019ranet] uses a Ranking Attention Module as the intermediate network that generates the specifier. The problem is that these intermediate networks are generally complex and computationally expensive, yet they only pay off in challenging scenarios. In other words, the existing encoder-decoder based methods are inefficient in most relatively simple scenarios.
In this work, we propose a novel method that does not need a complex intermediate network. Instead of a strong specifier that requires such a network, we use a simple specifier, the Clue. In addition, we propose a novel refine module for Semi-VOS that performs better than general ones. Our proposed network, CRVOS, runs in real time while maintaining competitive performance. In Fig. 1, we visualize quantitative results on the DAVIS16 [perazzi2016benchmark] validation set. Our contributions can be summarized as follows.
We suggest a novel method to generate a simple specifier. Our specifier, the Clue, enables our network to remain competitive while running in real time.
We propose a novel refine module for Semi-VOS. It is more effective than general ones, since it uses a deconvolution layer instead of bilinear upsampling.
The goal of Semi-VOS is to segment the target in a video when the initial frame’s mask is given. When the current frame’s input image is received, we should predict the current frame’s mask from the accumulated information, e.g. the initial frame’s input image, the initial frame’s mask, the previous frame’s input image and the previous frame’s mask. In Fig. 2, we show the network architecture of CRVOS, which consists of an encoder, a decoder and the Clue. Our encoder is based on ResNet-50 [he2016deep], pretrained on ImageNet [deng2009imagenet]. It extracts features from the current frame’s input image, and these features are connected to the refine modules with skip connections. The size of the last features is 1/16 of the input image size. The Clue is composed of the previous frame’s coarse mask and coordinate information, and it is also 1/16 of the input image size. Our decoder predicts the current frame’s mask by refining the Clue with three novel refine modules.
Most current encoder-decoder based methods use complex intermediate networks to obtain strong specifiers. However, as mentioned earlier, these methods are inefficient in relatively simple scenarios. Therefore, to deal with such scenarios efficiently, we look for a simple specifier that does not need a complex intermediate network but can still specify the given target. We assume that the previous frame’s coarse mask can serve this purpose. Since the positional change of the target between adjacent frames is generally small, the previous frame’s mask carries enough positional information about the current frame’s target, and even more so at the coarse level. In addition, we add three Coord [liu2018intriguing] layers to explicitly reinforce the positional information of the coarse mask. The layers hold values in the range [-1,1], ordered along the height, along the width and by the distance from the center, respectively. In conclusion, the specifier that we use in CRVOS has two channels for the previous frame’s coarse mask and three channels for the coordinate information. We call our specifier the Clue, since it is simple but plays a key role.
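As a rough illustration of the five-channel layout described above, the following NumPy sketch assembles a Clue-like tensor from a previous-frame mask. The function names, the average-pooling downsampling, the background/foreground ordering of the two mask channels and the rescaling of the radial distance to [-1,1] are all assumptions made for this example and are not specified in the text.

```python
import numpy as np

def coord_channels(h, w):
    """Three coordinate maps in [-1, 1]: row position, column position, distance from center."""
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    # Radial distance lies in [0, sqrt(2)]; rescale it to [-1, 1] (assumed normalisation).
    rr = np.sqrt(yy ** 2 + xx ** 2) / np.sqrt(2.0) * 2.0 - 1.0
    return np.stack([yy, xx, rr]).astype(np.float32)

def build_clue(prev_mask, stride=16):
    """prev_mask: (H, W) foreground probability of the previous frame.

    Rows and columns beyond a multiple of `stride` are cropped before pooling.
    Returns a (5, H // stride, W // stride) array.
    """
    h, w = prev_mask.shape[0] // stride, prev_mask.shape[1] // stride
    cropped = prev_mask[: h * stride, : w * stride]
    # Average-pool to the coarse resolution (assumed downsampling scheme).
    fg = cropped.reshape(h, stride, w, stride).mean(axis=(1, 3))
    coarse = np.stack([1.0 - fg, fg]).astype(np.float32)  # background, foreground
    return np.concatenate([coarse, coord_channels(h, w)], axis=0)

# Hypothetical 480x864 previous-frame mask with a rectangular target.
prev = np.zeros((480, 864), dtype=np.float32)
prev[160:320, 320:480] = 1.0
clue = build_clue(prev)
print(clue.shape)  # (5, 30, 54)
```

The coordinate channels depend only on the feature size, so in practice they could be computed once per resolution and reused for every frame.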
General refine modules for Semi-VOS upscale the features with bilinear upsampling. However, bilinear upsampling cannot generate detailed spatial information. Therefore, we use a deconvolution layer instead of bilinear upsampling. Since a deconvolution layer is usually slower than bilinear upsampling, we reduce its output channel size to 2 to avoid a loss of speed. Furthermore, we add a skip connection to our refine module. Our decoder uses the outputs of all three refine modules to predict the current frame’s mask more accurately.
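To make the idea concrete, here is a minimal PyTorch sketch of a refine module built around a two-channel deconvolution. It is not the authors' implementation: the class name, kernel sizes, the way encoder features are concatenated, the residual form of the skip connection and the example channel counts are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class RefineModuleSketch(nn.Module):
    """Hypothetical refine module: learned 2x upsampling followed by one fusing convolution."""

    def __init__(self, in_channels, skip_channels, out_channels=2):
        super().__init__()
        # Deconvolution with only `out_channels` (2) outputs to keep it cheap.
        # Kernel 4, stride 2, padding 1 doubles the height and width exactly.
        self.deconv = nn.ConvTranspose2d(
            in_channels, out_channels, kernel_size=4, stride=2, padding=1
        )
        # Single convolution that fuses encoder features with the upsampled map.
        self.conv = nn.Conv2d(
            skip_channels + out_channels, out_channels, kernel_size=3, padding=1
        )

    def forward(self, x, encoder_features):
        up = self.deconv(x)
        fused = self.conv(torch.cat([encoder_features, up], dim=1))
        # Residual skip connection around the fusing convolution (assumed form).
        return fused + up

# Example with Clue-shaped input (5 channels at 1/16 scale) and
# hypothetical 64-channel encoder features at 1/8 scale.
module = RefineModuleSketch(in_channels=5, skip_channels=64)
coarse = torch.randn(1, 5, 30, 54)
features = torch.randn(1, 64, 60, 108)
refined = module(coarse, features)
print(tuple(refined.shape))  # (1, 2, 60, 108)
```

Replacing `self.deconv` with `nn.Upsample(scale_factor=2, mode="bilinear")` followed by a convolution would give a module in the spirit of the general refine modules that the text compares against.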
We use a two-step training strategy. First, we pre-train the network on the YouTube-VOS [xu2018youtube_1] train set, and then we fine-tune it on the DAVIS16 [perazzi2016benchmark] train set. For pre-training and fine-tuning, we use 8 and 16 randomly chosen image frames, respectively, as the input of the network.
We use the negative log-likelihood (NLL) function as the loss function and Adam as the optimizer. For pre-training, each input image is resized, and we train the network for 100 epochs with a learning rate of 1e-4 and no learning rate decay. For data augmentation, we use horizontal flip. For fine-tuning, each input image keeps its original size, and we train the network for 200 epochs with a learning rate of 1e-4 and an exponential learning rate decay of 0.995. For data augmentation, we use horizontal flip, rotation, shearing and scaling. If there are multiple targets, we randomly choose one during training. At test time, we predict the masks of all targets and merge them by comparing the scores of each target.
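The test-time merging of per-target predictions can be sketched as a per-pixel comparison of scores. The text does not say how the background is scored, so the snippet below assumes a background score equal to the product of the complements of the foreground probabilities. The function name and this rule are illustrative assumptions only.

```python
import numpy as np

def merge_targets(fg_probs):
    """Merge per-target foreground probabilities into one label map.

    fg_probs: (N, H, W) array, one foreground probability map per target.
    Returns an (H, W) integer map: 0 = background, k = k-th target (1-based).
    """
    fg_probs = np.asarray(fg_probs, dtype=np.float64)
    # Assumed background score: probability that no target claims the pixel.
    bg = np.prod(1.0 - fg_probs, axis=0, keepdims=True)
    scores = np.concatenate([bg, fg_probs], axis=0)
    # Each pixel goes to whichever entry (background or a target) scores highest.
    return scores.argmax(axis=0)

# Two targets on a 1x3 image.
probs = np.array([
    [[0.9, 0.2, 0.1]],  # target 1
    [[0.6, 0.7, 0.1]],  # target 2
])
labels = merge_targets(probs)
print(labels)  # [[1 2 0]]
```

With a single target this assumed rule reduces to thresholding the foreground probability at 0.5.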
DAVIS16 [perazzi2016benchmark] is the most popular dataset that deals only with single-target scenarios. In Table 1, we show a quantitative evaluation on the DAVIS16 validation set. CRVOS runs at 61 FPS, which is the fastest among the existing methods, and achieves a J&F score of 81.6%, which is comparable to state-of-the-art methods. These results indicate that CRVOS is the most efficient method for single-target scenarios. DAVIS17 [pont20172017] is the extended version of DAVIS16, which also includes multi-target scenarios. In Table 2, we show the evaluation on the DAVIS17 [pont20172017] validation set. CRVOS achieves a score of 54.3%, which is satisfactory considering its speed.
We conduct ablation studies to validate our design choices. In Table 3, we compare the performance of four networks with different specifiers and refine modules.
Specifier: We show the effect of using different specifiers by comparing networks 1, 2 and 3. Network 1 has no specifier. Since it is hard for it to specify the target, it achieves a score of 73.4%, which is the worst among networks 1, 2 and 3. Network 2 uses the previous frame’s coarse mask as a specifier. It can specify the target, and therefore it achieves a score of 80.6%, which is considerably better than network 1. Network 3 uses the Clue, i.e. the previous frame’s coarse mask and coordinate information, as a specifier. Since the coordinate information reinforces the positional information of the coarse mask, network 3 achieves a score of 81.6%, which is the best among networks 1, 2 and 3.
Refine module: We show the effect of using different refine modules by comparing networks 3 and 4. Network 4 uses a general refine module, which is composed of one convolution layer and bilinear upsampling, and achieves a score of 77.6%. Network 3 uses our refine module, which is composed of a single convolution layer, a deconvolution layer and a skip connection. Since the deconvolution layer generates more detailed spatial information than bilinear upsampling, and using the outputs of multiple refine modules allows more accurate masks to be predicted, network 3 achieves a score of 81.6%, which is better than network 4.
Fig. 3 shows qualitative results of CRVOS. The upper four rows are from the DAVIS16 [perazzi2016benchmark] validation set, which deals with single-target scenarios, and the rest are from the DAVIS17 [pont20172017] validation set, which deals with multi-target scenarios.
Single-target: In the first scenario, the target moves fast and its scale changes rapidly. In this case, the Clue provides imperfect information about the target, since the Clue assumes that the positional change of the target between adjacent frames is small. As a result, CRVOS misses some parts of the target in the middle of the sequence, but it recovers from the error right afterwards, since the target is quite distinct. The second scenario shows that CRVOS is robust against appearance changes of the target, since the Clue gives reliable information about the target. The third and fourth scenarios are difficult, since the appearance and color of the target are similar to those of the background. Nevertheless, CRVOS produces accurate segmentations, since the position of the target is precisely tracked by the Clue.
Multi-target: In the fifth and the last scenarios, there are multiple targets with relatively small movements. These results show that, if the positional changes of the targets between adjacent frames are not large, CRVOS is able to predict accurate masks in multi-target scenarios.
In this work, we have proposed a novel real-time network for Semi-VOS. With the Clue and our novel refine modules, the proposed network, CRVOS, is the fastest among the existing methods while maintaining competitive performance. The results on the DAVIS16 [perazzi2016benchmark] and DAVIS17 [pont20172017] validation sets demonstrate that CRVOS is the state-of-the-art method in terms of the trade-off between speed and performance when dealing with comparatively simple scenarios.