Fast Video Object Segmentation via Mask Transfer Network

08/28/2019 ∙ by Tao Zhuo, et al. ∙ National University of Singapore

Accuracy and processing speed are two important factors that affect the use of video object segmentation (VOS) in real applications. With the advanced techniques of deep neural networks, the accuracy has been significantly improved; however, the speed is still far below real-time requirements because of complicated network designs, such as the first frame fine-tuning step. To overcome this limitation, we propose a novel mask transfer network (MTN), which greatly boosts the processing speed of VOS while achieving reasonable accuracy. The basic idea of MTN is to transfer the reference mask to the target frame via an efficient global pixel matching strategy. The matching is performed globally between the reference frame and the target frame to ensure good matching results. To enhance the matching speed, we perform the matching on a downsampled feature map with 1/32 of the original frame size. At the same time, to preserve the detailed mask information in such a small feature map, a mask network is designed to encode the annotated mask information with 512 channels. Finally, an efficient feature warping method is used to transfer the encoded reference mask to the target frame. Based on this design, our method avoids the fine-tuning step on the first frame and does not rely on temporal cues or particular object categories. Therefore, it runs very fast, can be conveniently trained only with images, and is robust to unseen objects. Experiments on the DAVIS datasets demonstrate that MTN achieves a speed of 37 fps and shows competitive accuracy in comparison to the state-of-the-art methods.


1 Introduction

Video Object Segmentation (VOS) is a fundamental problem that can be applied in many computer vision tasks including video stabilization, retrieval, summarization, editing and scene understanding. In this paper, we focus on the semi-supervised VOS setting, which aims to segment a specific object in a video based on the annotated mask for this object in the first frame [28]. In recent years, due to the advance of deep neural networks, great progress has been achieved, especially on segmentation accuracy (e.g., OnAVOS [35] achieves 86.1% on the J mean metric [28]). Although the processing speed has also been greatly improved (from 0.1 to 8 fps), it is still far from real-time, which significantly limits the use of these methods in practice. Fig. 1 shows the overall performance of the state-of-the-art semi-supervised VOS methods in terms of J mean and processing speed. As we can see, the segmentation accuracy of current methods is reasonable, while the processing speed still falls far short of real-time.

Figure 1: Accuracy (Jaccard index J mean) versus processing time on the DAVIS-16 validation set. Our method MTN achieves 37 fps (frames per second), which exceeds the real-time speed of 25 fps and is significantly faster than existing methods by a large margin. MTN also achieves comparable accuracy to the state-of-the-art methods.

The recent semi-supervised VOS methods mainly follow two strategies: propagation-based and detection-based. The former approaches formulate the VOS task as a mask propagation problem from the first frame to the subsequent frames [29, 19]. To leverage the temporal consistency between two adjacent frames, many propagation-based methods [29, 19, 5] transfer the mask of the previous frame to the current frame with optical flow estimation. However, because dense and accurate optical flow estimation is computationally expensive, their processing speed is severely limited. For example, SFL [5], which simultaneously predicts pixel-wise object segmentation and optical flow in videos, runs at 7.9 seconds per frame.

The detection-based strategy [1, 38, 2, 16] addresses the VOS task as a pixel-level object retrieval problem given the annotated object mask in the first frame. These methods often adopt a two-stage learning strategy that first trains an offline model to extract semantic regions of generic objects, and then makes the offline model focus on the particular object by fine-tuning it on the first frame with the annotated object mask. The problem is that the first frame fine-tuning step is often time-consuming. A typical example is OSVOS [1], which can run at 10 fps but needs more than 10 minutes of first frame fine-tuning on each video to achieve high accuracy.

To speed up video object segmentation, researchers have recently paid more attention to improving the VOS speed while sacrificing some accuracy, such as OSMN [37], BFVOS [2], VM [16] and RGMP [36]. Although great progress has been achieved so far (e.g., RGMP can run at approximately 8 fps), these methods are still far from real-time. In this paper, we aim to develop a real-time yet still accurate VOS method. To achieve this goal, we propose a novel method called Mask Transfer Network (MTN), which segments the target object by transferring the annotated mask from the first frame to the target frame with an efficient global pixel matching strategy. Different from existing approaches, our method enjoys two main merits: it avoids the time-consuming first frame fine-tuning step, and it does not rely on particular object categories or temporal cues. Therefore, it is robust to unseen objects and can be trained on general image object segmentation datasets.

Specifically, given a video sequence with an annotated object mask on the first frame (i.e., the reference frame), the key idea of our method is to transfer the annotated object mask to the target frame by an efficient global pixel matching strategy between the reference frame and the target frame. In an arbitrary video frame, the target object may be located at any position. Consequently, the corresponding locations of the annotated object must be searched globally over the target frame. However, performing the global pixel matching step on all pixels at the original image resolution would be very time-consuming. To solve this problem, we propose to efficiently match pixels on a much smaller feature map with a high channel dimension.

As shown in Fig. 2, to speed up the global pixel matching step, we encode the image features of the reference frame and the target frame into a very small size (i.e., 1/32 of the input image size). Meanwhile, in order to transfer the reference mask to the target frame on the downsampled feature map (i.e., 1/32) and simultaneously preserve detailed mask information (e.g., object boundaries), we build a mask encoder network (5 convolutional layers with a stride of 2) to encode the reference mask. In this way, the encoded mask of the reference frame can be effectively transferred to the target frame by a simple warping operation based on the pixel matching results. Finally, the encoded image features of the target frame and the transferred mask are concatenated and fed into a bottom-up decoder for target object segmentation.

Empirical studies on the DAVIS benchmark dataset [28] show that the proposed MTN achieves a speed of 37 fps, which is much faster than all existing methods. Even when compared to RGMP [36], one of the fastest existing methods (less than 8 fps), on the same platform, MTN is still 4.8× faster. At the same time, MTN also achieves competitive accuracy from different perspectives: (1) The proposed MTN does not rely on temporal cues, and thus it can be trained on general image object segmentation datasets without any annotated video sequences. By training the proposed network on two image object segmentation datasets (PASCAL VOC [10] and MSRA10K [6]), MTN achieves 75.3% J mean on the DAVIS-16 trainval set (50 videos), which is highly competitive with the state-of-the-art methods (see Sec. 4.2). (2) As MTN does not rely on particular object categories, it can be used to segment unseen objects. Compared to other recently developed mask transfer methods, MTN significantly improves the accuracy by a large margin of 14.9% (see Sec. 4.3). (3) Compared to the state-of-the-art methods on the DAVIS-16 validation set and on multi-object segmentation on the DAVIS-17 validation set, MTN achieves competitive accuracy (see Sec. 4.4 and 4.5).

Figure 2: The architecture of the proposed method MTN. The relative spatial sizes and channel dimensions of feature maps are denoted below each module.

2 Related Work

Unsupervised methods. Unsupervised VOS methods [11, 13, 33, 9, 22, 23] aim to automatically segment prominent objects without any user annotations. These methods usually rely on visual saliency cues such as motion and long-term trajectories [11, 13]. Based on motion cues, recent methods [33, 9] often detect the moving regions that indicate semantic objects with deep learning networks, by jointly using optical flow and object proposal methods. On the other hand, long-term trajectory-based methods [11, 13, 22] depend on the temporal consistency of pixels, superpixels or object proposals, with the assumption that pixels with consistent trajectories belong to foreground objects and the remaining pixels are background. Due to the lack of information about the target object, however, unsupervised methods often fail to accurately segment a specific object in videos. Besides, they also easily suffer from motion confusion between the dynamic background and other objects [33, 9], resulting in poor performance. Therefore, we mainly focus on the semi-supervised setting in this paper.

Semi-supervised methods. Given the first video frame with annotated object masks, semi-supervised methods [24, 38, 1, 29, 37, 2, 19] aim to segment specific objects across the entire video sequence. To deal with fast motion and heavy occlusion, many approaches [1, 29, 35, 19] first train an offline model to generate generic object proposals, and then fine-tune the offline model on the first video frame for the particular target object. Although this fine-tuning strategy significantly improves the accuracy, its expensive computational cost makes those algorithms unsuitable for practical applications.

Recently, many approaches [2, 36, 38] have been developed for fast VOS by avoiding the first frame fine-tuning step. FAVOS [4] adopts a part-based tracking method to predict bounding boxes of object parts and then segments the target object with a segmentation network. OSMN [37] uses a network modulation approach to manipulate intermediate layers of the segmentation network. RGMP [36] proposes a hybrid model that fuses mask detection and propagation in a Siamese encoder-decoder network. RGMP achieves good segmentation accuracy and can run at approximately 8 fps. However, due to inefficient network architecture designs, the processing speed of existing algorithms is still far from real-time. In this work, we propose to apply a global pixel matching strategy for fast and accurate VOS. More details on several matching-based approaches are discussed in Sec. 3.4.

3 Mask Transfer Network

Given a reference frame (the first frame of a video) with an annotated object mask, our goal is to achieve fast and accurate object segmentation over the entire video sequence. The key idea of our method is to transfer the annotated object mask to the target frame based on an efficient global pixel matching strategy between the reference frame and the target frame. However, performing global pixel matching at the original frame size would be very time-consuming. Thus we need to downsample the frames for fast processing. At the same time, the annotated mask needs to be downsampled accordingly for mask transfer. A dilemma here is that the reference mask cannot be directly resized to a very small size, since this would cause significant loss of mask information (e.g., object boundaries; see Fig. 3). Moreover, for objects that are already very small, this strategy is inapplicable. To solve this problem, we apply a mask encoder network to preserve detailed mask information for accurate mask transfer.

Figure 3: An example of the mask transfer. In order to clearly illustrate the mask transfer module, we show the transferred mask on the 1/16-sized image with one channel (the reference mask is directly resized to 1/16 scale), which provides a coarse location of the target object. Besides, the correlation score vector of a pixel (with one entry per location of the feature map) is resized to a 2D score map, in which the pixel with the maximum correlation score indicates its corresponding pixel in the target frame.

3.1 Network Architecture

The architecture of the proposed method MTN is shown in Fig. 2. The key modules of our method include an image encoder for feature extraction, a global pixel matching module and a mask encoder for reference mask transfer, and a bottom-up decoder for the target object segmentation. In the following, we introduce these modules in sequence.

Image encoder. The image encoder is used to extract features from the input RGB images. To match the extracted image features of the reference frame and the target frame in the same feature space, a Siamese network [21] with shared weights is used as the image encoder. In our implementation, ResNet50 [14] is used as the image encoder and its weights are initialized from a model pre-trained on ImageNet [7]. Notice that the initialized parameters could be fine-tuned during the training of MTN for better performance; for simplicity, we fix them in our experiments.
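As a concrete illustration, a minimal PyTorch sketch of such a Siamese encoder is given below; it uses a frozen, ImageNet-pretrained ResNet-50 from torchvision truncated before global pooling, which yields 2048-channel features at 1/32 resolution, and obtains weight sharing simply by applying the same module to both frames. The class name and the exact truncation point are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SiameseImageEncoder(nn.Module):
    """Frozen ResNet-50 backbone truncated before global pooling, producing
    2048-channel features at 1/32 of the input resolution. Applying the same
    module to both frames gives the shared-weight (Siamese) behaviour."""

    def __init__(self):
        super().__init__()
        # Assumes torchvision >= 0.13; older versions use pretrained=True.
        backbone = resnet50(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        for p in self.features.parameters():   # backbone parameters are kept fixed
            p.requires_grad = False

    def forward(self, ref: torch.Tensor, tgt: torch.Tensor):
        return self.features(ref), self.features(tgt)

encoder = SiameseImageEncoder().eval()
with torch.no_grad():
    f_ref, f_tgt = encoder(torch.randn(1, 3, 480, 864), torch.randn(1, 3, 480, 864))
print(f_ref.shape)  # torch.Size([1, 2048, 15, 27])
```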

Pixel matching. Mask transfer is built upon the pixel correspondences between the reference frame and the target frame, which are obtained by a global pixel matching method and a simple feature warping operation [32]. Our pixel matching algorithm is inspired by optical flow estimation methods [8, 32]. In those methods, the pixel displacement is usually small, since optical flow is estimated between two adjacent frames. In VOS, however, the pixel displacements can be very large because the target object may be located at any position in the target frame. Consequently, the pixel matching needs to be performed globally over the whole feature map. To speed up the global pixel matching step, we downscale the feature map to 1/32 of the original size, reducing the number of candidate pixels for matching to (1/32)² = 1/1024 of the original. Hence, the processing speed can be greatly accelerated. Besides, we adopt an embedding layer for robust pixel matching, since pixels are easier to match in a learned embedding space [31]. As illustrated in Fig. 2, our embedding layer consists of two convolutional layers with 128 output channels. Notice that the processing speed of the matching step is further improved by compacting the image feature dimension from 2048 to 128.
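A minimal sketch of such an embedding head is shown below; the text only specifies two convolutional layers with 128 output channels on top of the 2048-dimensional backbone features, so the kernel sizes and the ReLU in between are our assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Projects 2048-d backbone features (1/32 resolution) into a compact 128-d
    embedding space used for pixel matching. Two conv layers as stated in the
    paper; kernel sizes and the activation are illustrative assumptions."""

    def __init__(self, in_channels: int = 2048, embed_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, padding=1),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, 2048, H/32, W/32) -> (B, 128, H/32, W/32)
        return self.proj(feat)

# Example: a 480x864 input gives a 15x27 feature map at 1/32 resolution.
head = EmbeddingHead()
emb = head(torch.randn(1, 2048, 15, 27))
print(emb.shape)  # torch.Size([1, 128, 15, 27])
```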

1) Local pixel matching. In deep learning based optical flow estimation [8, 32], efficient pixel matching can be achieved with a correlation layer [8] that measures the patch similarity between two images. Formally, let $p_i^r$ denote the $i$-th pixel in the reference frame $F_r$ and $p_j^t$ denote the $j$-th pixel in the target frame $F_t$. Given the learned embedding features $f_r$ and $f_t$ of two patches (with a patch size $K = 2k+1$) centered at $p_i^r$ in $F_r$ and $p_j^t$ in $F_t$, the "correlation score" [8] of the two pixels $p_i^r$ and $p_j^t$ is computed by the cross-correlation of the two patches:

$$c(p_i^r, p_j^t) = \sum_{o \in [-k,\,k] \times [-k,\,k]} \big\langle f_r(p_i^r + o),\; f_t(p_j^t + o) \big\rangle \qquad (1)$$

where $o$ is the pixel offset within the patch and the patch size is $K = 2k+1$. Since optical flow estimation methods [8, 32] often assume that only small pixel displacements exist between two adjacent frames, the patch size is often set to a small value for local pixel matching. Note that Eq. 1 computes the patch similarity of two input features, and thus it does not involve any trainable parameters.
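For illustration, the sketch below computes the correlation score of Eq. 1 for a single pair of pixel locations on two embedding maps; zero padding at the feature-map border and the function name are our assumptions.

```python
import torch
import torch.nn.functional as F

def correlation_score(f_ref: torch.Tensor, f_tgt: torch.Tensor,
                      p_ref: tuple, p_tgt: tuple, k: int = 1) -> torch.Tensor:
    """Cross-correlation of two (2k+1)x(2k+1) patches centered at p_ref in the
    reference embedding and p_tgt in the target embedding (Eq. 1).
    f_ref, f_tgt: (C, H, W) embedding features. Returns a scalar score.
    Border handling (zero padding) is an assumption."""
    f_ref = F.pad(f_ref, (k, k, k, k))
    f_tgt = F.pad(f_tgt, (k, k, k, k))
    yr, xr = p_ref[0] + k, p_ref[1] + k   # shift indices into the padded maps
    yt, xt = p_tgt[0] + k, p_tgt[1] + k
    patch_ref = f_ref[:, yr - k:yr + k + 1, xr - k:xr + k + 1]
    patch_tgt = f_tgt[:, yt - k:yt + k + 1, xt - k:xt + k + 1]
    # Sum over channels and spatial offsets of the element-wise products,
    # i.e. the sum of dot products <f_r(p_r + o), f_t(p_t + o)> over offsets o.
    return (patch_ref * patch_tgt).sum()

# Example on random 128-d embeddings of a 15x27 feature map.
score = correlation_score(torch.randn(128, 15, 27), torch.randn(128, 15, 27),
                          p_ref=(7, 10), p_tgt=(8, 12), k=1)
print(score.item())
```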

2) Global pixel matching. As the target object can be located at any position in the target frame, we design a global pixel matching strategy that computes the similarity between all pixels of the reference frame and all pixels of the target frame.

Let $w$ and $h$ denote the width and height of the feature map, respectively. For global pixel matching, the maximum displacement of a pixel is set large enough to cover the entire feature map. The output of the correlation layer then contains, for a given pixel, the patch similarity scores with respect to all $w \times h$ locations of the target feature map, i.e., the output size is $w \times h \times (w \cdot h)$. Fig. 3 illustrates the resized correlation score map of a pixel on the 1/16-sized image.

The global pixel matching step finds, for each reference pixel, the most similar pixel (with the maximum correlation score) over the whole target frame. Let $p_i^r$ be the $i$-th pixel in the reference frame $F_r$, $p_j^t$ the $j$-th pixel in the target frame $F_t$, and $c_i \in \mathbb{R}^{N}$ the correlation scores between $p_i^r$ and all the pixels in $F_t$, where $N = w \cdot h$ is the total number of pixels on the feature map. Let $j^{*}$ be the index of the pixel with the maximum correlation score in $c_i$; the displacement between the pixel $p_i^r$ and its corresponding pixel $p_{j^{*}}^t$ is computed as:

$$\Delta u_i = (j^{*} \bmod w) - (i \bmod w), \qquad \Delta v_i = \lfloor j^{*} / w \rfloor - \lfloor i / w \rfloor \qquad (2)$$

where $\Delta u_i$ and $\Delta v_i$ represent the pixel displacement in the horizontal and vertical directions, respectively, $\bmod$ denotes the modulo operator and $\lfloor \cdot \rfloor$ represents integer division. It can be seen that performing the global pixel matching at the original frame size would be very time-consuming; therefore, we downscale the frames to 1/32 of the original size, as mentioned above. The reference mask then also needs to be downscaled accordingly for transfer, which is described below.
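The sketch below illustrates the global matching and the displacement decoding of Eq. 2 in the simplest case of 1x1 patches, where the correlation reduces to a dense dot product between all reference and target embeddings followed by an argmax; the actual method uses patch-level correlation, so this is a simplification.

```python
import torch

def global_match(emb_ref: torch.Tensor, emb_tgt: torch.Tensor):
    """Global pixel matching on 1/32-sized embeddings (simplified to 1x1
    patches). emb_ref, emb_tgt: (C, h, w). Returns, for every reference pixel i,
    the index j* of its best-matching target pixel and the (du, dv)
    displacement of Eq. 2."""
    C, h, w = emb_ref.shape
    ref = emb_ref.reshape(C, h * w).t()        # (N, C), N = h*w reference pixels
    tgt = emb_tgt.reshape(C, h * w)            # (C, N) target pixels
    scores = ref @ tgt                         # (N, N) correlation scores
    j_star = scores.argmax(dim=1)              # best target index per reference pixel

    i = torch.arange(h * w)
    du = (j_star % w) - (i % w)                # horizontal displacement
    dv = torch.div(j_star, w, rounding_mode='floor') - \
         torch.div(i, w, rounding_mode='floor')  # vertical displacement
    return j_star, du, dv

# Example: a 15x27 feature map has only N = 405 pixels, so the full similarity
# matrix is 405x405 instead of matching at the original image resolution.
j_star, du, dv = global_match(torch.randn(128, 15, 27), torch.randn(128, 15, 27))
print(j_star.shape, du.min().item(), du.max().item())
```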

Mask encoder and transfer. As the pixel correspondences are computed at a downsampled scale, the reference mask also needs to be scaled to the same size for mask transfer. However, directly resizing the original mask to a very small size loses detailed information of the reference mask [2, 16] (e.g., object boundaries; see Fig. 3). Especially for objects of very small sizes, this strategy is obviously inapplicable. To address this problem, we build a mask encoder to preserve the detailed information of the reference mask during downscaling. As shown in Fig. 2, the mask encoder consists of 5 convolutional layers with a stride of 2, which encode the original mask into a feature map with 1/32 of the input image size and a high channel dimension (512 channels). This mask encoder effectively preserves the detailed mask information (according to the experimental results). Finally, based on the pixel matching results, a feature warping method [32] (assigning values to corresponding pixels based on the computed pixel displacements) is used to transfer the encoded reference mask to the target frame. Fig. 3 illustrates the effectiveness of the mask transfer at a downsampled scale with a single-channel representation. Although the appearances of the reference frame and the target frame are very different, the important location cues of the target object are still obtained by the proposed mask transfer method. Because of the proposed global pixel matching strategy and the mask encoder design, we can achieve accurate and fast target object mask estimation in the downsampled feature space.
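The following sketch illustrates both pieces: a mask encoder with five stride-2 convolutions that maps the full-resolution reference mask to a 512-channel, 1/32-sized feature map, and a warping step that copies each encoded reference pixel to its matched target location. The intermediate channel widths, the ReLU activations, and the scatter-style warping are our assumptions.

```python
import torch
import torch.nn as nn

class MaskEncoder(nn.Module):
    """Encodes the full-resolution reference mask into a 1/32-sized, 512-channel
    feature map with 5 stride-2 convolutions (intermediate widths are guesses)."""

    def __init__(self, widths=(16, 32, 64, 128, 512)):
        super().__init__()
        layers, in_ch = [], 1
        for out_ch in widths:
            layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.encoder = nn.Sequential(*layers)

    def forward(self, mask: torch.Tensor) -> torch.Tensor:
        # mask: (B, 1, H, W) -> (B, 512, H/32, W/32)
        return self.encoder(mask)

def warp_mask(enc_mask: torch.Tensor, j_star: torch.Tensor) -> torch.Tensor:
    """Transfers the encoded reference mask to the target frame: the feature of
    reference pixel i is written to its matched target location j* (a simple
    scatter; the paper's warping operation may differ in detail).
    enc_mask: (C, h, w); j_star: (h*w,) indices from global matching."""
    C, h, w = enc_mask.shape
    src = enc_mask.reshape(C, h * w)
    out = torch.zeros_like(src)
    out[:, j_star] = src                      # assign values to matched pixels
    return out.reshape(C, h, w)

# Example: encode a 480x864 mask and warp it with dummy matching indices.
enc = MaskEncoder()(torch.rand(1, 1, 480, 864))       # (1, 512, 15, 27)
warped = warp_mask(enc[0], torch.randperm(15 * 27))   # (512, 15, 27)
print(enc.shape, warped.shape)
```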

Bottom-up decoder. After representing the target object mask at a downsampled scale, the next step is to produce the segmentation of the target object at the original size. This is achieved by several de-convolutional layers (5 layers) in the bottom-up decoder module. Specifically, the encoded image features of the target frame and the transferred mask features are concatenated and fed into a global convolutional block [27] to extract the semantic features of the target object. Next, several residual-based boundary refinement blocks [27] are used to generate a score map with 64 channels. More details about these decoder blocks are discussed in [27]. An additional convolutional layer then produces a single-channel score map as the foreground probability map. Since the bottom-up decoder operates on a downsampled image feature map, we upsample the foreground probability map to the original image size by an interpolation operation.
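A simplified sketch of such a decoder is given below, with a global convolutional block and residual boundary refinement blocks in the spirit of [27]; the number of upsampling stages, the channel widths, and the block internals are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNBlock(nn.Module):
    """Global convolutional block in the spirit of [27]: a large k x k receptive
    field factorized into (k x 1 + 1 x k) and (1 x k + k x 1) conv branches."""
    def __init__(self, in_ch, out_ch, k=7):
        super().__init__()
        p = k // 2
        self.branch_a = nn.Sequential(nn.Conv2d(in_ch, out_ch, (k, 1), padding=(p, 0)),
                                      nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, p)))
        self.branch_b = nn.Sequential(nn.Conv2d(in_ch, out_ch, (1, k), padding=(0, p)),
                                      nn.Conv2d(out_ch, out_ch, (k, 1), padding=(p, 0)))

    def forward(self, x):
        return self.branch_a(x) + self.branch_b(x)

class BoundaryRefine(nn.Module):
    """Residual boundary refinement block in the spirit of [27]."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class BottomUpDecoder(nn.Module):
    """Simplified decoder sketch: fuse target image features (2048-d) with the
    transferred mask features (512-d), refine, upsample with a few deconvolutions,
    and predict a single-channel foreground probability map."""
    def __init__(self, img_ch=2048, mask_ch=512, mid_ch=64, num_up=3):
        super().__init__()
        self.fuse = GCNBlock(img_ch + mask_ch, mid_ch)
        self.refine = BoundaryRefine(mid_ch)
        self.up = nn.Sequential(*[nn.Sequential(
            nn.ConvTranspose2d(mid_ch, mid_ch, 4, stride=2, padding=1),
            BoundaryRefine(mid_ch)) for _ in range(num_up)])
        self.classify = nn.Conv2d(mid_ch, 1, 1)

    def forward(self, feat_tgt, warped_mask, out_size):
        x = torch.cat([feat_tgt, warped_mask], dim=1)   # (B, 2560, H/32, W/32)
        x = self.refine(self.fuse(x))                   # 64-channel score map
        x = self.up(x)                                  # partially upsampled
        logits = self.classify(x)                       # single-channel score map
        # Final interpolation to the original image size, then a probability map.
        return torch.sigmoid(F.interpolate(logits, size=out_size,
                                           mode='bilinear', align_corners=False))

dec = BottomUpDecoder()
prob = dec(torch.randn(1, 2048, 15, 27), torch.randn(1, 512, 15, 27), (480, 864))
print(prob.shape)  # torch.Size([1, 1, 480, 864])
```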

3.2 Model Training

Given a reference frame (the first frame) with an annotated object mask (the manually labeled reference mask), MTN automatically segments the target object in the subsequent frames. As shown in Fig. 2, all target frames are matched with the reference frame independently, which means that MTN does not rely on temporal cues. Therefore, MTN can be trained from images without any annotated video sequences (see Sec. 4.2).

During the training stage, a pair of images or video frames with corresponding masks is used as a training sample for MTN. We adopt the dice coefficient loss [26], commonly used in image object segmentation, to measure the overlap between the predicted mask and the ground truth. Specifically, the employed dice loss is defined as:

$$\mathcal{L} = 1 - \frac{2\sum_{i} g_i\, p_i + \varepsilon}{\sum_{i} g_i + \sum_{i} p_i + \varepsilon} \qquad (3)$$

where $g_i$ and $p_i$ are the ground-truth label and the predicted foreground probability of the pixel $i$, respectively, and $\varepsilon$ is a smooth factor that we set to 1 in our implementation.
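For reference, a minimal PyTorch implementation of this dice loss with the smooth factor set to 1 could look as follows.

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Dice coefficient loss with a smooth factor eps (set to 1 as in the paper).
    pred: predicted foreground probabilities in [0, 1]; target: binary mask."""
    pred, target = pred.flatten(), target.flatten()
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

loss = dice_loss(torch.rand(1, 1, 480, 864), (torch.rand(1, 1, 480, 864) > 0.5).float())
print(loss.item())
```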

3.3 Multi-object segmentation

To deal with the multi-object segmentation problem, we first compute the foreground probability of each object independently, and then compute the background probability of the target frame as:

$$p_{i,0} = \prod_{m=1}^{M} \left(1 - p_{i,m}\right) \qquad (4)$$

where $p_{i,m}$ denotes the foreground probability of the $m$-th object at the $i$-th pixel, the index $0$ represents the background, and $M$ indicates the number of objects. The label with the maximum probability at each pixel is taken as the final multi-object segmentation.
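One possible implementation of this aggregation is sketched below: the background probability is computed as the product of (1 - p_m) over all objects, following Eq. 4, and each pixel receives the label with the maximum probability.

```python
import torch

def aggregate_multi_object(fg_probs: torch.Tensor) -> torch.Tensor:
    """Combines per-object foreground probabilities into a label map.
    fg_probs: (M, H, W), where fg_probs[m] is the foreground probability of
    object m+1. The background probability (label 0) is the product of
    (1 - p_m) over all objects (Eq. 4); each pixel then takes the label with
    the maximum probability."""
    bg = torch.prod(1.0 - fg_probs, dim=0, keepdim=True)   # (1, H, W)
    probs = torch.cat([bg, fg_probs], dim=0)               # (M+1, H, W)
    return probs.argmax(dim=0)                              # (H, W), labels in 0..M

labels = aggregate_multi_object(torch.rand(3, 480, 864))
print(labels.unique())
```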

3.4 Comparison with Matching-based Methods

A big advantage of our method MTN is that it does not need any model fine-tuning for VOS, which greatly saves processing time. Besides, the encoder-decoder network is carefully designed to accelerate the processing speed while maintaining segmentation accuracy (see Tab. 1). In the following, we discuss the main differences between our MTN and the approaches that also adopt a pixel-wise matching strategy.

PLM [38] combines a pixel-wise matching module with object proposals in a unified network. It uses a two-stage learning strategy that requires fine-tuning the pre-trained model on the first video frame to segment the target object. Unlike PLM, our proposed approach MTN does not rely on first frame fine-tuning. BFVOS [2] and VM [16] use a nearest neighbour classifier to match pixels between two video frames. To reduce the computational cost of the matching step, they use image feature maps at a downsampled scale (i.e., 1/8 of the input image size), and then directly resize the reference mask to the same scale for target mask prediction. In contrast, our method performs the pixel matching on a feature map with a smaller size, i.e., 1/32 of the input image size. As discussed in Sec. 3.1, without the mask encoder module, BFVOS and VM cannot perform the matching step at such a small size, because directly resizing the reference mask would cause significant information loss, leading to a large degradation of segmentation accuracy. Besides, instead of using the pixel-wise distance measurement of BFVOS and VM, our approach MTN employs a correlation layer [8] to measure the patch similarity centered at a pixel, which takes the spatial information of the feature map into account and thus obtains more accurate matching results.

4 Experiments

In the following, we first introduce the implementation details, and then report and analyze the performance of our method from different perspectives.

4.1 Implementation Details

We use the ResNet50 network (pre-trained on ImageNet) as the image encoder to extract image features, and its parameters are kept fixed. During the training stage, the network is optimized with the Adam optimizer [20]. The bottom-up decoder is designed based on the semantic image segmentation module of [27]. In our experiments, the proposed method is implemented and evaluated on a single NVIDIA GeForce 1080 Ti GPU.

4.2 Results on the Entire DAVIS-16 Dataset

To demonstrate the advantage that our method can be trained on images without any annotated video sequences, we train the proposed network MTN on two image segmentation datasets, PASCAL VOC [10] and MSRA10K [6]. Then we evaluate the trained MTN on the entire DAVIS-16 trainval set [28], in which all 50 videos (from both the training and validation sets) are completely unseen by PASCAL VOC and MSRA10K. PASCAL VOC is a popular semantic image segmentation dataset with instance-level object masks, which consists of 2913 images covering 20 object categories. MSRA10K is a salient object segmentation dataset that consists of 10,000 images. The DAVIS-16 benchmark dataset focuses on single-object segmentation. It contains 50 high-resolution videos with pixel-level binary mask annotations, of which 30 videos are for training and 20 for validation.

In our experiment, we randomly choose an image with a single object mask from PASCAL VOC as a training sample. To simulate two video frames as the input to MTN for training, we augment the image with a set of random transformations (e.g., horizontal flipping, brightness and contrast changes, scaling, affine transformation, and center cropping) to generate a pair of images. During training, we use a fixed learning rate, and the number of iterations is set to 400. To improve the diversity of training samples, we further fine-tune the trained model on the MSRA10K dataset with a fixed learning rate and 3 iterations.
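The sketch below shows one way to simulate such a pair from a single annotated image using torchvision transforms; the specific parameter ranges (flip probability, jitter strengths, affine limits, crop size) are our assumptions, as the paper only lists the transformation types.

```python
import random
import torch
import torchvision.transforms.functional as TF

def random_view(img: torch.Tensor, mask: torch.Tensor, crop=(384, 384)):
    """Applies one set of random transformations jointly to an image and its
    object mask, producing one synthetic 'frame'. Parameter ranges are illustrative."""
    if random.random() < 0.5:                               # horizontal flip
        img, mask = TF.hflip(img), TF.hflip(mask)
    img = TF.adjust_brightness(img, random.uniform(0.8, 1.2))
    img = TF.adjust_contrast(img, random.uniform(0.8, 1.2))
    angle = random.uniform(-15, 15)                          # small affine warp
    translate = [random.randint(-20, 20), random.randint(-20, 20)]
    scale = random.uniform(0.9, 1.1)
    img = TF.affine(img, angle=angle, translate=translate, scale=scale, shear=[0.0])
    mask = TF.affine(mask, angle=angle, translate=translate, scale=scale, shear=[0.0])
    return TF.center_crop(img, crop), TF.center_crop(mask, crop)

def make_training_pair(img: torch.Tensor, mask: torch.Tensor):
    """Simulates a (reference, target) frame pair from one annotated image by
    drawing two independent random views."""
    return random_view(img, mask), random_view(img, mask)

(ref, ref_m), (tgt, tgt_m) = make_training_pair(torch.rand(3, 480, 864),
                                                (torch.rand(1, 480, 864) > 0.5).float())
print(ref.shape, tgt_m.shape)
```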

Method     | FT | PF | OF | Time (s) | J mean | F mean
MSK [29]   | Y  | Y  | Y  | 12       | 80.3   | 75.8
MTN (P+M)  |    |    |    | 0.027    | 75.3   | 76.1
MTN (P)    |    |    |    | 0.027    | 72.4   | 74.9
VPN [17]   |    | Y  | Y  | 0.63     | 75.0   | 72.4
CTN [18]   |    | Y  | Y  | 30       | 75.5   | 71.4
OFL [34]   |    | Y  | Y  | 120      | 71.1   | 67.9
BVS [25]   |    | Y  |    | 0.37     | 66.5   | 65.6
JMP [12]   |    | Y  | Y  | -        | 60.7   | 58.6
FCP [30]   |    | Y  | Y  | 14.5     | 63.1   | 54.6
Table 1: Comparisons on the DAVIS-16 trainval set (50 videos). FT: first frame fine-tuning; PF: previous frame for mask propagation; OF: optical flow.
Figure 4: J&F mean versus runtime on the DAVIS-16 trainval set.

Notice that our model is trained only on image object segmentation datasets (PASCAL VOC and MSRA10K) without any annotated video sequences. In order to demonstrate the robustness of our approach, we directly evaluate the proposed method MTN on the entire DAVIS-16 trainval set (i.e., 50 videos). We compare with all the available high-performing methods on this benchmark (https://davischallenge.org/davis2016/soa_compare.html, trainval set of the DAVIS-16 benchmark), including MSK [29], CTN [18], VPN [17], OFL [34], BVS [25], FCP [30] and JMP [12]. MSK requires first frame fine-tuning on each video sequence to achieve high accuracy. CTN, VPN, OFL, FCP and JMP require optical flow to propagate the previous mask to the current frame. BVS segments the target object in a bilateral space with graph cut. The standard evaluation metrics [28] include the average region similarity J, the contour accuracy F, and the processing time. The results of these competitors are taken from the corresponding published papers.

As reported in Tab. 1, the proposed approach MTN takes only 0.027 seconds per frame for single object segmentation, which is significantly faster than existing methods; in particular, MTN is over 400× faster than MSK and more than 10× faster than BVS. Besides, Fig. 4 reports the overall performance in terms of J&F mean and runtime, i.e., accuracy and processing speed. It can be seen that only the proposed method achieves a real-time speed. Moreover, our method achieves the second best accuracy, which is very close to the best method MSK (a small margin of 2.35% on J&F mean).

As shown in Tab. 1, by training on the image segmentation dataset PASCAL VOC, our method MTN (P) achieves an accuracy of 72.4% on J and 74.9% on F. By further increasing the diversity of training samples with the MSRA10K dataset, the accuracy of MTN (P+M) is improved by 2.9% on J and 1.2% on F. Notice that even without first frame fine-tuning and temporal cues for mask propagation, our method still achieves competitive performance when compared to the state-of-the-art methods. In particular, the J mean of MTN (P+M) is 75.3%, which is only 5.0% lower than the best method MSK, which requires 12 seconds to process one frame (vs. 0.027s for our method), and outperforms all the other methods. Moreover, in terms of contour accuracy F, the proposed method MTN (P+M) achieves the best performance.

Method   | bear | bswan | camel | eleph | goat | malw | rhino | Avg.
TFN [15] | 73.7 | 83.4  | 65.5  | 76.1  | 78.1 | 17.9 | 42.4  | 62.4
UOS [3]  | 89.8 | 76.7  | 72.0  | 73.8  | 83.3 | 41.6 | 71.0  | 72.6
MTN (P)  | 91.6 | 90.6  | 80.3  | 84.9  | 83.9 | 89.2 | 91.6  | 87.5
Table 2: Results (J mean metric) on the DAVIS-16 videos whose object categories are excluded from the PASCAL VOC dataset.

4.3 Results on Unseen Objects in DAVIS-16 Dataset

Because our approach does not rely on particular object categories, it is robust to unseen object segmentation. Similar to the previous work UOS (unseen object segmentation in videos) [3], we manually exclude all videos whose object categories exist among the 20 categories of the PASCAL VOC dataset [10]. To evaluate our method, we compare with two mask transfer based approaches, UOS [3] and TFN [15], which focus on transferring information from seen objects to unseen objects. As shown in Tab. 2, our method achieves the best performance on all 7 videos. Two examples on the unseen object categories goat and rhino are shown in the first and second rows of Fig. 6, respectively. Furthermore, our method achieves a high accuracy of 87.5% on J mean, which improves the overall performance by 14.9% on average.

Method        | FT | PF | OF | Time (s) | J mean | F mean
OnAVOS [35]   | Y  | Y  |    | 13       | 86.1   | 84.9
OSVOS [1]     | Y  |    |    | 10       | 80.6   | 79.8
VM* [16]      |    |    |    | 0.17     | 79.2   | -
MTN           |    |    |    | 0.027    | 75.9   | 76.2
BFVOS [2]     |    | Y  |    | 0.28     | 75.5   | 79.3
OSMN [37]     |    | Y  |    | 0.14     | 74.0   | 72.9
RGMP* [36]    |    |    |    | 0.13     | 73.5   | 74.2
CTN [18]      |    | Y  | Y  | 29.95    | 73.5   | 69.3
OnAVOS* [35]  |    | Y  |    | 3.55     | 72.7   | -
VPN [17]      |    | Y  | Y  | 0.63     | 70.2   | 65.5
PLM [38]      | Y  | Y  |    | 0.30     | 70.0   | 62.5
BVS [25]      |    | Y  |    | 0.37     | 60.0   | 58.8
OSVOS* [1]    |    |    |    | 0.10     | 52.5   | -
  • OnAVOS* and OSVOS*: without the first frame fine-tuning; VM* and RGMP*: without the previous frame for mask propagation.

Table 3: Comparisons on the DAVIS-16 validation set (20 videos). FT, PF and OF are the same as in Tab. 1.

4.4 Results on the DAVIS-16 Validation Dataset

In order to compare the proposed approach with more available state-of-the-art methods, we randomly choose pairs of video frames from the DAVIS-16 training set and fine-tune the model pre-trained on the image segmentation datasets (PASCAL VOC and MSRA10K). Specifically, on the DAVIS-16 validation set, we compare with a set of representative methods, OnAVOS [35], OSVOS [1], PLM [38], CTN [18], BVS [25] and VPN [17], and the most recent approaches BFVOS [2], OSMN [37], RGMP [36] and VM [16]. Besides, to fairly compare with the detection-based strategy, we also evaluate RGMP [36] and VM [16] without their mask propagation components. In the following, we report the region similarity J, the contour accuracy F, and the processing time (obtained from the corresponding published papers) of the compared methods.

Method        | FT | PF | OF | Time (s) | J mean | F mean
OnAVOS [35]   | Y  | Y  |    | 26       | 64.5   | 71.1
OSVOS [1]     | Y  |    |    | 20       | 52.1   | 62.1
OSMN [37]     |    | Y  |    | 0.28     | 52.5   | 57.1
MSK [29]      | Y  | Y  | Y  | 18       | 51.2   | 57.3
MTN           |    |    |    | 0.048    | 49.4   | 59.0
MSK* [29]     |    | Y  | Y  | 18       | 44.6   | 47.6
OnAVOS* [35]  |    | Y  |    | 7.10     | 39.5   | -
OSVOS* [1]    |    |    |    | 0.20     | 36.4   | 39.5
  • OnAVOS*, OSVOS*, MSK*: without the first frame fine-tuning.

Table 4: Comparisons on the DAVIS-17 validation set (30 videos). FT, PF and OF are the same as in Tab. 1.
Figure 5: J&F mean versus runtime on the DAVIS-17 validation set.

As shown in Tab. 3, among all the methods that do not use the first frame fine-tuning step, including VM*, BFVOS, OSMN, RGMP*, CTN, OnAVOS*, VPN, BVS and OSVOS*, the J mean of our method MTN is highly competitive and is only slightly lower than VM* (75.9% vs 79.2%). However, in terms of processing speed, MTN is about 6× faster than the matching-based method VM* (0.027s vs 0.17s per frame). In addition, when compared to another matching-based algorithm, BFVOS, our method is about 10× faster and achieves a better J mean. It is worth mentioning that the accuracy of OSVOS and OnAVOS degrades a lot without the first frame fine-tuning, i.e., 79.8% vs 52.5% and 86.1% vs 72.7%, respectively.

Figure 6: Example results of the proposed method. The first and second rows show the performance on the unseen object categories goat and rhino, respectively. The last two rows show results on the video horsejump-high for multi-object segmentation.

4.5 Results on Multi-object Segmentation

To validate the effectiveness of our method on multi-object segmentation, we carry out experiments on the DAVIS-17 validation set, which consists of 30 videos with 61 different objects, i.e., about 2 objects per video sequence on average. Since this dataset contains many similar objects within a video sequence, it is very challenging to obtain accurate segmentation results. Therefore, the overall performance of all methods on this dataset is lower than for single object segmentation, as shown in Tab. 4. Fig. 5 shows the overall performance in terms of J&F mean and runtime. In terms of speed, MTN takes only 0.048 seconds per frame during testing, i.e., about 20 fps, which is significantly faster than all the compared methods and close to the real-time speed of 25 fps. Besides, even without first frame fine-tuning and mask propagation from the previous frame, the performance of our method is still comparable to the state-of-the-art methods; e.g., MTN outperforms the recent method OSMN on F mean (59.0% vs 57.1%). Without the first frame fine-tuning, the performance of OnAVOS*, OSVOS* and MSK* degrades a lot; for example, the J mean of OnAVOS degrades from 64.5% to 39.5%.

Module    | Image encoder | Mask encoder | Pixel matching | Bottom-up decoder
Time (ms) | 10            | 0.78         | 7.6            | 5.0
Table 5: Runtime of the main modules in MTN.

4.6 Runtime Analysis

The proposed method is implemented and evaluated on a single NVIDIA GeForce 1080 Ti GPU, and it achieves 37 fps. For a detailed runtime analysis, we report the runtime of each module. As shown in Tab. 5, the most time-consuming component is the image encoder module for image feature extraction. The processing time of the other components is kept low thanks to the efficient network design of MTN. For an intuitive comparison, against RGMP [36] running on the same platform, our method is 4.8× faster (RGMP runs at less than 8 fps).

4.7 Qualitative Analysis

To illustrate the effectiveness of our method, we present some representative examples for both unseen object segmentation and multi-object segmentation, as shown in Fig. 6. The appearance of the object goat in the first row is very similar to the background, and the object rhino in the second row is heavily occluded by the tree. These results demonstrate that our method is robust to background clutter and heavy occlusion and performs well on unseen object segmentation. The last two rows show that our method can also be used for multi-object segmentation.

5 Conclusion

In this paper, we have presented a novel mask transfer network (MTN) for video object segmentation, which achieves a real-time processing speed with reasonable accuracy. MTN is built on the idea of transferring the annotated object mask from the reference frame to the target frame. A global pixel matching method combined with a mask encoder network is proposed to accelerate the processing speed while ensuring segmentation accuracy. Experiments on the DAVIS datasets demonstrate that our method achieves a processing speed of 37 fps, which exceeds the real-time requirement of 25 fps. By training MTN on image datasets only, it achieves an accuracy of 75.3% (J mean) on the entire DAVIS-16 trainval set, which is comparable to the state-of-the-art methods. Besides, since the proposed method does not rely on particular object categories, it significantly improves the overall performance on unseen object segmentation by 14.9% on average. Moreover, our experiments demonstrate that the method can also be used for multi-object segmentation.

References

  • [1] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool (2017) One-shot video object segmentation. In CVPR, Cited by: §1, §2, §4.4, Table 3, Table 4.
  • [2] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool (2018) Blazingly fast video object segmentation with pixel-wise metric learning. In CVPR, Cited by: §1, §1, §2, §2, §3.1, §3.4, §4.4, Table 3.
  • [3] Y. Chen, Y. Tsai, C. Yang, Y. Lin, and M. Yang (2018) Unseen object segmentation in videos via transferable representations. In ACCV, Cited by: §4.3, Table 2.
  • [4] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang (2018) Fast and accurate online video object segmentation via tracking parts. In CVPR, Cited by: §2.
  • [5] J. Cheng, Y. Tsai, S. Wang, and M. Yang (2017) Segflow: joint learning for video object segmentation and optical flow. In ICCV, Cited by: §1.
  • [6] M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S. Hu (2015) Global contrast based salient region detection. TPAMI 37 (3), pp. 569–582. Cited by: §1, §4.2.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §3.1.
  • [8] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In ICCV, Cited by: §3.1, §3.1, §3.4.
  • [9] S. Dutt Jain, B. Xiong, and K. Grauman (2017) FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In CVPR, Cited by: §2.
  • [10] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV 88 (2), pp. 303–338. Cited by: §1, §4.2, §4.3.
  • [11] A. Faktor and M. Irani (2014) Video segmentation by non-local consensus voting. In BMVC, Cited by: §2.
  • [12] Q. Fan, F. Zhong, D. Lischinski, D. Cohen-Or, and B. Chen (2015) JumpCut: non-successive mask transfer and interpolation for video cutout.. ACM SIGGRAPH 34 (6), pp. 195–1. Cited by: §4.2, Table 1.
  • [13] K. Fragkiadaki, P. Arbelaez, P. Felsen, and J. Malik (2015) Learning to segment moving objects in videos. In CVPR, Cited by: §2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.1.
  • [15] S. Hong, J. Oh, H. Lee, and B. Han (2016) Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In CVPR, Cited by: §4.3, Table 2.
  • [16] Y. Hu, J. Huang, and A. G. Schwing (2018) VideoMatch: matching based video object segmentation. In ECCV, Cited by: §1, §1, §3.1, §3.4, §4.4, Table 3.
  • [17] V. Jampani, R. Gadde, and P. V. Gehler (2017) Video propagation networks. In CVPR, Cited by: §4.2, §4.4, Table 1, Table 3.
  • [18] W. Jang and C. Kim (2017) Online video object segmentation via convolutional trident network. In CVPR, Cited by: §4.2, §4.4, Table 1, Table 3.
  • [19] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele (2017) Lucid data dreaming for multiple object tracking. In arXiv preprint arXiv: 1703.09554, Cited by: §1, §1, §2.
  • [20] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. ICLR. Cited by: §4.1.
  • [21] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Cited by: §3.1.
  • [22] Y. J. Koh and C. Kim (2017) Primary object segmentation in videos based on region augmentation and reduction.. In CVPR, Cited by: §2.
  • [23] S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C. J. Kuo (2018) Instance embedding transfer to unsupervised video object segmentation. In CVPR, Cited by: §2.
  • [24] X. Li and C. C. Loy (2018) Video object segmentation with joint re-identification and attention-aware mask propagation. In CVPR, Cited by: §2.
  • [25] N. Märki, F. Perazzi, O. Wang, and A. Sorkine-Hornung (2016) Bilateral space video segmentation. In CVPR, Cited by: §4.2, §4.4, Table 1, Table 3.
  • [26] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 3DV, Cited by: §3.2.
  • [27] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun (2017) Large kernel matters – improve semantic segmentation by global convolutional network. In CVPR, Cited by: §3.1, §4.1.
  • [28] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, Cited by: §1, §1, §4.2, §4.2.
  • [29] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung (2017) Learning video object segmentation from static images. In CVPR, Cited by: §1, §2, §4.2, Table 1, Table 4.
  • [30] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung (2015) Fully connected object proposals for video segmentation. In ICCV, Cited by: §4.2, Table 1.
  • [31] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In CVPR, Cited by: §3.1.
  • [32] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) PWC-net: cnns for optical flow using pyramid, warping, and cost volume. In CVPR, Cited by: §3.1, §3.1, §3.1.
  • [33] P. Tokmakov, K. Alahari, and C. Schmid (2017) Learning motion patterns in videos. In CVPR, Cited by: §2.
  • [34] Y. Tsai, M. Yang, and M. J. Black (2016) Video segmentation via object flow. In CVPR, Cited by: §4.2, Table 1.
  • [35] P. Voigtlaender and B. Leibe (2017) Online adaptation of convolutional neural networks for video object segmentation. In BMVC, Cited by: §1, §2, §4.4, Table 3, Table 4.
  • [36] S. Wug Oh, J. Lee, K. Sunkavalli, and S. Joo Kim (2018) Fast video object segmentation by reference-guided mask propagation. In CVPR, Cited by: §1, §1, §2, §4.4, §4.6, Table 3.
  • [37] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos (2018) Efficient video object segmentation via network modulation. In CVPR, Cited by: §1, §2, §2, §4.4, Table 3, Table 4.
  • [38] J. S. Yoon, F. Rameau, J. Kim, S. Lee, S. Shin, and I. S. Kweon (2017) Pixel-level matching for video object segmentation using convolutional neural networks. In ICCV, Cited by: §1, §2, §2, §3.4, §4.4, Table 3.