Demo code of the paper "Fast and Accurate Online Video Object Segmentation via Tracking Parts", in CVPR 2018
Online video object segmentation is a challenging task, as it entails processing the image sequence both timely and accurately. To segment a target object through a video, numerous CNN-based methods have been developed that heavily finetune on the object mask in the first frame, which is time-consuming for online applications. In this paper, we propose a fast and accurate video object segmentation algorithm that can start the segmentation process immediately upon receiving the images. We first utilize a part-based tracking method to deal with challenging factors such as large deformation, occlusion, and cluttered background. Based on the tracked bounding boxes of parts, we construct a region-of-interest segmentation network to generate part masks. Finally, a similarity-based scoring function is adopted to refine these object parts by comparing them to the visual information in the first frame. Our method performs favorably against state-of-the-art algorithms in accuracy on the DAVIS benchmark dataset, while achieving much faster runtime performance.
Video object segmentation aims at separating target objects from the background and other instances at the pixel level. Segmenting objects in videos is a fundamental task in computer vision because of its wide applications such as video surveillance, video editing, and autonomous driving. However, it is a challenging task due to camera motion, object deformation, occlusion between instances, and cluttered backgrounds. For online applications in particular, significantly different issues arise when methods are required to be robust and fast without access to future frames. In this paper, we focus on the problem of online video object segmentation. Given the object in the first frame, our goal is to immediately perform online segmentation of this target object without knowing future frames. For real-world applications, the difficulty lies in the requirement of efficient runtime performance while maintaining accurate segmentation. Figure 1 illustrates comparisons of state-of-the-art methods in terms of speed and performance, where we show that the proposed algorithm is fast, accurate, and applicable to online tasks.
Existing video object segmentation algorithms can be broadly classified into unsupervised and semi-supervised settings. Unsupervised methods [9, 14, 18, 35] mainly segment moving objects from the background without any prior knowledge of the target, e.g., initial object masks. However, these methods cannot handle multiple-object segmentation as they are not capable of identifying a specific instance. In addition, several methods require batch-mode processing (i.e., all frames must be available) before segmenting the object [21, 41], which cannot be applied to online applications.
On the other hand, semi-supervised methods [6, 16, 19, 20, 44] are given an initial object mask that provides critical visual cues of the target. Thus, these methods can handle multi-instance cases and usually perform better than the unsupervised approaches. However, many state-of-the-art semi-supervised methods heavily rely on the segmentation mask in the first frame. For instance, before making predictions on the test video, state-of-the-art methods need to finetune the networks for each video [4, 6, 19, 44], or the model for each instance [5, 34]. This finetuning step on the video or instance level is computationally expensive, usually taking more than ten minutes to update a model [4, 6]. In addition, data preparation (e.g., optical flow generation) and training data augmentation require additional processing time. As such, these methods cannot be used for time-sensitive online applications that require fast and accurate segmentation results of a specific target object (see Figure 1).
In this paper, we propose a video object segmentation algorithm that can immediately start to segment a specific object through the entire video, quickly and accurately. To this end, we utilize a part-based tracking method and exploit a convolutional neural network (CNN) for representations, without the need for a time-consuming finetuning stage on the target video. The proposed method mainly consists of three parts: part-based tracking, region-of-interest segmentation, and similarity-based aggregation.
Naturally, object tracking is an effective way to localize the target in the next frame. However, non-rigid objects often undergo large deformation with fast movement, making it difficult to accurately localize the target [2, 8, 30]. To better utilize the tracking cues, we adopt a part-based tracking scheme to resolve challenging issues such as occlusions and appearance changes. We first randomly generate object proposals around the target in the first frame, and select representative parts based on their overlap scores with the initial mask. We then apply the tracker to each part to provide temporally consistent regions of interest (ROIs) for subsequent frames.
Once each part is localized in the next frame, we construct a CNN-based ROI SegNet to predict the segmentation mask that belongs to the target object. Different from conventional foreground segmentation networks [4, 6, 26] that focus on segmenting the entire object, our ROI SegNet learns to segment partial objects given the bounding box of a part.
With part tracking and ROI segmentation, the object location and segmentation mask can be roughly identified. However, there could be false positives due to incorrect tracking results. To reduce noisy segmentation parts, we design a similarity-based method to aggregate parts by computing the feature distance between tracked parts and the initial object mask. Figure 2 shows the main steps of the proposed algorithm.
To validate the proposed algorithm, we conduct extensive experiments with comparisons and an ablation study on the DAVIS benchmark datasets [36, 38]. We show that the proposed method performs favorably against state-of-the-art approaches in accuracy, while achieving much better runtime performance. The contributions of this work are as follows. First, we propose a fast and accurate video object segmentation method that is applicable to online tasks. Second, we develop part-based tracking and similarity-based aggregation methods that effectively utilize the information contained in the first frame, without adding much computational load. Third, we design an ROI SegNet that takes bounding boxes of parts as input and outputs the segmentation mask for each part.
Unsupervised Video Object Segmentation.
Unsupervised video object segmentation methods aim to automatically discover and separate prominent objects from the background. These methods are based on probabilistic models [23, 31], motion [18, 17, 35], and object proposals [24, 46]. Existing approaches often rely on visual cues such as superpixels, saliency maps, or optical flow to obtain initial object regions, and need to process the entire video in batch mode to refine the object segmentation. In addition, generating and processing thousands of candidate regions in each frame is usually time-consuming. Recently, CNN-based methods [14, 40, 41] exploit rich hierarchical features (e.g., from ImageNet pre-training) and large augmented datasets to achieve state-of-the-art segmentation results. However, these unsupervised methods are not able to segment a specific object due to motion confusion between different instances and dynamic backgrounds.
Semi-supervised Video Object Segmentation.
Semi-supervised methods aim to segment a specific object given an initial mask. Numerous algorithms have been proposed based on tracking, object proposals, graphical models, and optical flow. Similar to the unsupervised approaches, CNN-based methods [4, 6, 20] have achieved significant improvements for video object segmentation. However, these methods usually rely heavily on finetuning models on the first frame [4, 20], data augmentation, online model adaptation, and joint training with optical flow. These steps are computationally expensive (e.g., finetuning on the first frame takes more than 10 minutes for each video) and are not suitable for online vision applications.
To alleviate the computational load, a few methods propagate the object mask in the first frame through the entire video [15, 16]. Without exploiting much information in the first frame, these approaches suffer from error accumulation after propagating over a long period of time and thus do not perform as well as other methods. In contrast, the proposed algorithm incorporates part-based tracking and constantly refers back to the first frame via a similarity-based part aggregation strategy.
Tracking has been widely used to localize objects in videos as an additional cue for object segmentation. Conventional methods [3, 13] adopt correlation filter-based trackers to account for appearance changes. Recently, numerous methods have been developed based on deep neural networks and classifiers. The CF2 method learns correlation filters adaptively based on CNN features, thereby enhancing the ability to handle challenging factors such as deformation and occlusion. In addition, the SINT scheme utilizes a Siamese network to learn feature similarities between proposals and the initial observation of the target object. The SiaFC algorithm develops an end-to-end Siamese tracking network with fully-convolutional layers, which allows the tracker to compute similarity scores for all proposals in one forward pass. In this work, we adopt the Siamese network for tracking object parts, where each part is locally representative and endures less deformation through the video.
In this section, we describe each component of the proposed method. First, we present the part-based tracker, where the goal is to localize object parts through the entire video. Second, we construct the ROI SegNet, a general and robust network to predict segmentation results for object parts. Third, we introduce our part aggregation method to generate final segmentation results by computing similarity scores in the feature space.
Object tracking is a difficult task due to challenging factors such as object deformation, fast movement, occlusion, and background noise. To deal with these issues, part-based methods have been developed to track local regions instead of the entire object, which undergoes larger appearance changes. Since our goal is to localize most object regions in the next frame for further segmentation, a part-based method matches our need and can effectively maintain a high recall rate.
In order to track parts, one critical problem is how to generate these parts in the first place. Conventionally, object parts are discovered from a large amount of intra-class data via discriminability and consistency. However, this assumption does not hold for online video segmentation, as only one object mask is provided in the first frame of the target video. To resolve this issue, we propose a simple yet effective way to generate representative parts guided by the object mask. First, we randomly generate part proposals with various sizes and locations around the object, and remove the ones with a low overlap ratio to the object mask. We compute the intersection-over-union (IoU) score between each proposal and the object, and keep the ones with scores larger than a threshold. To ensure that each part contains mostly pixels from the object, we further measure the coverage score r = |b ∩ o| / |b|, where b is the bounding box of a proposal and o is the known object box in the first frame. Part proposals with r larger than a threshold are used as candidates for a non-maximum suppression (NMS) step. Based on this selection process, we reduce thousands of proposals to a small set of representative parts, depending on the object size. Note that we also transform the bounding box of each part to be tight within the object mask, reducing background noise for more effective tracking and segmentation. Figure 3 shows some example results of generated parts (with high scores) in the first frame.
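The proposal filtering and NMS steps above can be sketched as follows. This is a minimal illustration with boxes as axis-aligned (x1, y1, x2, y2) tuples; the threshold values and all function names are our assumptions, not the paper's:

```python
def box_iou(a, b):
    """Intersection-over-union between two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def coverage(part, obj):
    """Fraction of the part box covered by the object box (the score r)."""
    ix1, iy1 = max(part[0], obj[0]), max(part[1], obj[1])
    ix2, iy2 = min(part[2], obj[2]), min(part[3], obj[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    part_area = (part[2] - part[0]) * (part[3] - part[1])
    return inter / part_area if part_area > 0 else 0.0

def select_parts(proposals, obj_box, iou_thr, cov_thr, nms_thr):
    """Keep proposals that overlap the object and mostly contain object
    pixels, then run greedy NMS to obtain representative parts."""
    cands = [p for p in proposals
             if box_iou(p, obj_box) > iou_thr and coverage(p, obj_box) > cov_thr]
    # Greedy NMS: keep larger parts first, drop highly overlapping ones.
    cands.sort(key=lambda r: (r[2] - r[0]) * (r[3] - r[1]), reverse=True)
    kept = []
    for p in cands:
        if all(box_iou(p, q) < nms_thr for q in kept):
            kept.append(p)
    return kept
```

In practice the candidates would be generated by random sampling around the object, and the kept boxes would then be tightened to the mask as described above.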
Given a set of parts {p_i} in frame t, our goal is to output a score map S_i that measures the location likelihood of part p_i appearing in the next frame t+1: S_i = φ(p_i, I_{t+1}), where φ is a function that computes similarity scores between the part p_i and the image I_{t+1}. We use the SiaFC method as our baseline tracker to compute the score map S_i. Due to its fully-convolutional architecture, we can compute score maps for multiple parts in one forward pass. After obtaining the score map, we select the bounding box with the largest response as the tracking result. Some tracking results are shown in Figure 3.
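As a rough single-channel stand-in for this scoring step, the sketch below slides a part template over the search image and records a similarity score at each location. Note this is only illustrative: the actual SiaFC tracker correlates learned CNN embeddings in one forward pass, whereas here plain normalized cross-correlation on raw pixels is used, and all names are ours:

```python
import numpy as np

def score_map(template, search):
    """Slide the part template over the search region and return a map
    of normalized cross-correlation scores (higher = more similar)."""
    th, tw = template.shape
    sh, sw = search.shape
    t = template - template.mean()
    tn = np.linalg.norm(t) + 1e-8
    out = np.zeros((sh - th + 1, sw - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            w = search[y:y + th, x:x + tw]
            w = w - w.mean()
            out[y, x] = (t * w).sum() / (tn * (np.linalg.norm(w) + 1e-8))
    return out

def best_box(template, search):
    """Pick the location with the largest response as the tracked box."""
    s = score_map(template, search)
    y, x = np.unravel_index(np.argmax(s), s.shape)
    return (x, y, x + template.shape[1], y + template.shape[0])
```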
Based on the tracking results of object parts, the next task is to segment the partial object within each bounding box. Recent instance-level segmentation methods [11, 7] have demonstrated state-of-the-art results by training networks for certain categories and outputting their segmentations. Our part segmentation problem is similar to the instance-level segmentation task, but for partial objects. In addition, training such a network would require an alignment step for different parts, as they may vary significantly in size, shape, and appearance across instances and object categories. Hence, we utilize an ROI data layer that crops image patches from parts as inputs to the network, where these patches are aligned through resizing. Similar to semantic segmentation, our objective is to minimize a weighted cross-entropy loss for the binary (foreground/background) task:
L(θ) = −Σ_j [ (1 − w) y_j log ŷ_j(θ) + w (1 − y_j) log(1 − ŷ_j(θ)) ], where θ denotes the CNN parameters, ŷ_j(θ) denotes the network prediction for the input part at pixel j, y_j is the binary ground-truth label, and w is the foreground-background pixel-number ratio used to balance the weights.
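The balanced loss can be sketched in numpy as follows. The exact weighting form is not fully specified above, so the common class-balancing scheme (weighting each class by the other's pixel fraction) is assumed, and the function name is ours:

```python
import numpy as np

def weighted_bce(pred, target, eps=1e-7):
    """Class-balanced binary cross-entropy.  `pred` holds foreground
    probabilities in [0, 1], `target` the binary part mask.  The weight
    w is the foreground pixel fraction, so the rarer class is not
    drowned out by the more frequent one."""
    pred = np.clip(pred, eps, 1 - eps)      # avoid log(0)
    w = target.mean()                        # fraction of foreground pixels
    loss = -((1 - w) * target * np.log(pred)
             + w * (1 - target) * np.log(1 - pred))
    return loss.mean()
```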
We utilize the ResNet-101 architecture as the base network for segmentation and transform it into a fully-convolutional network. To enhance the feature representations, we up-sample the feature maps from the last three convolution modules and concatenate them. The concatenated features are then fed to a convolution layer for the binary prediction. Figure 4 shows the architecture of our ROI SegNet.
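The multi-scale fusion in the segmentation head can be sketched with nearest-neighbor upsampling. The channel counts and spatial sizes below are illustrative only (ResNet-101's real modules have far more channels), and the names are ours:

```python
import numpy as np

def upsample(feat, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_features(c3, c4, c5):
    """Bring the outputs of the last three convolution modules to the
    same spatial size and concatenate them along the channel axis,
    mirroring the feature-fusion step before the prediction layer."""
    target_h = c3.shape[1]
    f4 = upsample(c4, target_h // c4.shape[1])
    f5 = upsample(c5, target_h // c5.shape[1])
    return np.concatenate([c3, f4, f5], axis=0)
```

A real implementation would use learned (e.g., bilinear or deconvolution) upsampling; the point here is only the shape bookkeeping of the concatenation.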
To train the proposed network, we first augment images from the training set of the DAVIS dataset  via random scaling and affine transformations (i.e., flipping, shifting, scaling,
rotation). Then, parts are extracted for each instance using the same method as introduced in part-based tracking. We use the Stochastic Gradient Descent (SGD) optimizer with a fixed input patch size and a batch size of 100 for training. The learning rate decreases by half from its initial value every 50,000 iterations. We train the network for 200,000 iterations.
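The stated step schedule, halving every 50,000 iterations, reduces to a one-liner (the base rate is not given in the text, so it is left as a parameter):

```python
def learning_rate(base_lr, iteration, step=50_000):
    """Step-decay schedule: halve the learning rate every `step`
    iterations, as described for the ROI SegNet training."""
    return base_lr * 0.5 ** (iteration // step)
```

Over the 200,000 training iterations this yields four plateaus at 1x, 0.5x, 0.25x, and 0.125x the base rate.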
After obtaining the segmentation results of all parts, one simple way to generate the final segmentation is to average the score maps of the parts. However, parts may be tracked off the object or include background noise, resulting in inaccurate part segments. To avoid adding these false positives, we develop a scoring function that looks back at the initial object mask. That is, we check whether the current part is similar to any of the parts in the first frame. Although the object may appear quite different from the first frame, we find that local parts are more robust to such appearance changes.
Specifically, we first compute the similarity score between each part p_i at frame t and the initial parts p_j in the feature space. We then select the initial part with the highest similarity to the current part: j* = argmax_j sim(F(p_i), F(p_j)), (3) where F(·) is the feature vector representing each part, extracted from the last layer of our ROI SegNet with average pooling over the part mask. Overall, our scoring function consists of three components:
S = S_avg ⊙ S_sim ⊙ S_conf, (4) where the initial parts involved in S_sim and S_conf are those selected by Equation (3), and ⊙ is the element-wise multiplication operation. The first term, S_avg, is the simple averaging score of the part segments in the current frame t: S_avg = (1/|P_t|) Σ_{p∈P_t} M_p, where P_t is the set of parts at frame t and M_p is the segmentation score map of each part p. Second, S_sim is the similarity score between current and initial parts in the feature space based on (3). Since the selected initial part segment may itself have poor quality, we add S_conf by forwarding the selected initial part to the ROI SegNet and measuring the overlap between its segmentation and the initial mask as a confidence score: S_conf = IoU(G(p_j*), M_0), where IoU(·,·) is the intersection-over-union measurement, G is the ROI SegNet, and M_0 is the object mask in the first frame. Guided by the initial object mask and parts, and without an expensive model-finetuning step, our part aggregation method can effectively remove false positives. Figure 5 shows examples of score maps with different scoring functions.
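A minimal sketch of the aggregation pipeline under one plausible reading of Equations (3) and (4). The data layout, function names, and the idea of carrying a precomputed per-part confidence value with each initial part are all our assumptions:

```python
import numpy as np

def part_feature(feat, mask):
    """Average-pool a (C, H, W) feature map over the part mask,
    giving the vector F(p) used for similarity matching."""
    m = mask.astype(bool)
    return feat[:, m].mean(axis=1)

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def aggregate(parts, init_parts, frame_shape):
    """Each tracked part is (box, seg, feat); each initial part is
    (feat, conf), where conf is its IoU-based confidence score.
    Average the pasted part segmentations, then down-weight regions
    whose parts match initial parts poorly (similarity * confidence)."""
    avg = np.zeros(frame_shape)
    weight = np.zeros(frame_shape)
    for box, seg, feat in parts:
        x1, y1, x2, y2 = box
        sims = [cosine(feat, f0) for f0, _ in init_parts]
        j = int(np.argmax(sims))             # Equation (3): best match
        conf = sims[j] * init_parts[j][1]    # S_sim * S_conf term
        avg[y1:y2, x1:x2] += seg
        weight[y1:y2, x1:x2] += conf
    n = max(len(parts), 1)
    # Element-wise product of the averaged segmentation and the
    # similarity/confidence map, in the spirit of Equation (4).
    return (avg / n) * (weight / n)
```

A part whose feature matches no initial part (e.g., a tracker drifting onto the background) contributes near-zero weight, so its segment is suppressed in the final score map.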
We conduct experiments on the DAVIS benchmark datasets [38, 36], which contain high-quality videos with dense pixel-level object segmentation annotations. The DAVIS 2016 dataset consists of 50 sequences (30 for training and 20 for validation), with 3,455 annotated frames of real-world moving objects. Each video in the DAVIS 2016 dataset contains a single annotated foreground object, so both semi-supervised and unsupervised methods can be evaluated. The DAVIS 2017 dataset contains 150 videos with 10,459 annotated frames and 376 object instances. It is a challenging dataset, as there are multiple instances in each video and objects can occlude each other. In this setting, it is difficult for unsupervised methods to separate different instances. For performance evaluation, we use the mean region similarity (J mean), contour accuracy (F mean), and temporal stability (T mean), as in the benchmark setting [38, 36]. The source code and models are available at https://github.com/JingchunCheng/FAVOS. More results and analysis are presented in the supplementary material.
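Of these metrics, the region similarity J reduces to the average mask IoU over annotated frames; a minimal sketch (the function name is ours):

```python
import numpy as np

def j_mean(preds, gts):
    """Mean region similarity (J): the average intersection-over-union
    between predicted and ground-truth binary masks over all frames."""
    scores = []
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        scores.append(inter / union if union else 1.0)
    return float(np.mean(scores))
```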
Our part-based tracker focuses on tracking local regions and cannot directly output the object location in the next frame. However, we can roughly find the object center based on the aggregated part segments. Motivated by tracking-by-detection algorithms, we utilize detection proposals as candidate object bounding boxes, and select the one closest to the object center as the tracking result. We then validate this part-based tracker on the DAVIS 2016 dataset with comparisons to our baseline SiaFC method and other tracking algorithms including CF2, ECO, and MDNet. Experimental results are presented in Figures 3 and 6, where we show that our part-based tracker consistently maintains better IoU-recall curves for localizing objects.
Although our ultimate goal is video object segmentation, this evaluation is useful for understanding the challenges of the DAVIS dataset. Intuitively, a good tracker should help the segmentation task. Thus, a high recall rate under a high IoU is required, since once a portion of the object is missing, it is not possible to recover the corresponding segment. As shown in Figure 6, most trackers achieve around a 60% recall rate at 0.5 IoU, while ours reaches 80%, which enables applying our tracker to improve segmentation results. We present the results of integrating this tracker in the ablation study section.
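The IoU-recall curves discussed here can be computed directly from per-frame box IoUs (names are ours):

```python
import numpy as np

def iou_recall_curve(ious, thresholds):
    """Recall at each IoU threshold: the fraction of frames whose
    tracked box overlaps the ground-truth box by at least that much."""
    ious = np.asarray(ious)
    return [float((ious >= t).mean()) for t in thresholds]
```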
Table 2 (excerpt):

| Method | Initial mask | Future frames | Pre-processing | Online | Speed | J mean | F mean | T mean |
|--------|--------------|---------------|------------------|--------|-------|--------|--------|--------|
| Lucid  | yes          | no            | data, finetuning | weak   | 40s   | 0.848  | 0.823  | 0.158  |
| MSK    | yes          | no            | flow, finetuning | weak   | 12s   | 0.797  | 0.754  | 0.218  |
We present an ablation study in Table 1 on the DAVIS 2016 validation set to evaluate the effectiveness of each component in the proposed video object segmentation framework. We start with the unsupervised version of SFL as our baseline due to its balance between speed and accuracy. To demonstrate the usefulness of parts, we first conduct an experiment that combines the baseline result with the score map obtained by tracking the entire object. Specifically, we average the foreground probability from the tracker and the segmentation map produced by the ROI SegNet. However, we find that whole-object tracking accuracy is highly unstable: it often loses the object and can even result in worse performance than the baseline segmentation (a 1.1% drop in J mean). This shows that combining tracking and segmentation is not a trivial task, and we use the part-based model to achieve a better combination.
After adopting our part-based tracker and ROI SegNet to obtain part segments, we compare results with and without part aggregation. The variant that utilizes part aggregation via the scoring function in Equation (4) performs better (a 4% improvement in J mean) than computing the simple averaging score alone. This shows that, by taking the initial object mask into consideration, false part segmentations can be largely reduced, as they are not similar to any of the object parts in the first frame. In addition, we take advantage of our tracker combined with detection proposals, as described in Section 4.2, to further improve our results, denoted as "+ Tracker" in Table 1. To further improve the boundary accuracy, we add a refinement step using a dense CRF. In Figure 1, we denote the result of using part aggregation as Ours-part, and the one further refined with our tracker and the CRF as Ours-ref.
We evaluate our proposed method on the validation set of DAVIS 2016 with comparisons to state-of-the-art algorithms in both semi-supervised and unsupervised settings. In Table 2, we show results under different settings, including the need for an initial object mask, future frames, and pre-processing steps. Based on these requirements and their runtime speed, we then analyze the capability of each method for online applications.
Unsupervised methods, which do not need the initial mask, usually need to compute optical flow as a motion cue (FSEG and LMP) or foresee the entire video (LVO and ARP) to improve performance, which is not applicable to online usage. In addition, these methods cannot distinguish between instances to perform segmentation on a specific object.
In the semi-supervised setting, recent methods require various pre-processing steps before starting to segment the object in the video, which weakens their suitability for online applications. These pre-processing steps include model finetuning (OnAVOS, Lucid, OSVOS, MSK, SFL), data synthesis (Lucid), and flow computation (MSK, CTN, and OFL). For fair comparisons in the online setting, these pre-processing steps are included in the runtime by averaging over all frames.
The settings closest to ours are VPN and BVS, which do not involve heavy pre-processing steps. However, these approaches may propagate segmentation errors after tracking over a long period of time. In contrast, our algorithm constantly refers to the initial object mask via parts and can reduce such errors in the long run, improving over VPN by more than 12% in J mean. Overall, the proposed video object segmentation framework runs at the fastest speed, and achieves the third-best J mean with further refinement while still maintaining a fast runtime compared to state-of-the-art methods. Some qualitative comparisons are presented in Figure 8.
To evaluate how our method deals with multiple instances in videos, we conduct experiments on the DAVIS 2017 validation set, which consists of 30 challenging videos with two instances per video on average. Existing methods all rely on sophisticated processing steps to achieve better performance; hence, in Table 3 we compare our method with SPN, which only involves a finetuning step. For the baseline algorithm, we start with our part-based aggregation method via the part-based tracker and ROI SegNet, while SPN finetunes a CNN-based model for each instance. The baseline results show that, without the computationally expensive finetuning process, our method even outperforms the existing method. One reason is that as videos become more complicated, finetuning-based methods may suffer from confusion between instances. In contrast, our method employs a part-based tracker that can effectively capture local cues for further segmentation.
Following the same protocol, we then sequentially add different components, including foreground/background regularization, a Spatial Propagation Network, and a region-based refinement step. In addition, we integrate the object tracker proposed in Section 4.2 to further refine the segmentation. Overall, without the need for finetuning on each instance, our approach performs comparably to or better than the method that requires finetuning. We also note that finetuning is expensive not only in time but also in storage, as hundreds of objects would result in a huge number of stored models, which is not practical in real-world applications. In Figure 9, we present some example results on the DAVIS 2017 dataset.
In the proposed framework, our method runs in 0.60 seconds per instance per frame on average without the refinement step, including part-based tracking (0.2s), ROI segmentation (0.3s), and part aggregation (0.1s). With the CRF (1s) and tracker (0.2s) refinements, our method runs in 1.8 seconds per instance per frame with better performance. We note that for tracking and segmenting parts, we use Titan X GPUs in parallel to handle hundreds of parts for faster inference.
In this paper, we propose a fast and accurate video object segmentation method that is applicable to online applications. Different from existing algorithms that heavily rely on pre-processing the object mask in the first frame, our method exploits the initial mask via a part-based tracker and an effective part aggregation strategy. The part-based tracker provides good localization of the local regions composing the object, ensuring that most of the object is retained for subsequent segmentation. We then design an ROI segmentation network to accurately output partial object segmentations. Finally, a similarity-based scoring function is developed to aggregate parts and generate the final result. Our algorithm exploits the strength of CNN-based frameworks for tracking and segmentation to achieve fast runtime speed, while closely monitoring the information contained in the first frame for state-of-the-art performance. The proposed algorithm can be applied to other video analytics tasks that require fast and accurate online video object segmentation.
This project is supported in part by the NSF CAREER Grant #1149783, gifts from Adobe and NVIDIA.