The task of Video Object Segmentation (VOS) is to separate objects (foreground) from the background at the pixel level. This is important for a wide range of video understanding applications, such as video surveillance, unmanned vehicle navigation and action recognition. Traditionally, most approaches in VOS focused on background modeling in stationary camera scenarios. Recently, with the widespread use of moving cameras, the focus has shifted from the static camera condition to the freely moving camera environment [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. Due to the complex video content (such as object deformation, occlusion, and background clutter) and the dynamic background caused by camera motion, the segmentation of moving objects under the moving camera environment remains a challenging problem.
Existing VOS algorithms can be broadly categorized into semi-supervised and unsupervised approaches. Generally, semi-supervised approaches [8, 12, 13, 14] require users to manually provide the object region on one or a few frames; the annotated object regions are then propagated to the remaining video frames for the final segmentation. On the other hand, unsupervised methods [3, 9, 11, 15, 16, 17] aim to automatically segment the moving objects without any manual input. The most popular unsupervised methods often focus on clustering the long-term trajectories of pixels, superpixels [3, 18, 19] or object proposals [17, 20, 21] across the whole video, and the pixels with consistent trajectories are clustered as foreground. This long-term trajectory-based strategy often requires the entire video sequence to achieve good results. It must therefore operate in an offline manner, and it suffers from the following problems.
The requirement of the entire video implies that offline methods cannot segment moving objects frame-by-frame. Due to memory limitations, this strategy becomes infeasible when a long video sequence is given.
The offline approaches are also impractical for streaming video applications (e.g., video surveillance).
In order to overcome the limitations of offline approaches, the development of unsupervised online VOS frameworks has attracted more attention. To segment the moving objects in an online manner, Wang et al. combined the current processing frame with several forward-backward neighboring frames to generate short-term trajectories. Based on the spatio-temporal saliency map generated by the optical flow field and salient object detection, the moving objects are automatically segmented within the neighboring frames. However, the moving object is not always salient in some videos. Consequently, the spatio-temporal saliency map cannot achieve good segmentation results in such cases. Different from the short-term trajectory-based online strategy, some researchers adopted another unsupervised online framework for VOS [23, 24, 25, 26]: the object regions are automatically initialized on a few frames with different motion cues, and an online tracking method then propagates the initialized object regions to subsequent frames. However, when the initialized object regions are not accurate enough, the segmentation results are unsatisfactory. In addition, this framework easily suffers from the error accumulation problem when tracking the initialized object region to subsequent video frames.
More recently, deep learning models have also been applied to automatically segment the moving objects with motion cues. For example, Tokmakov et al. adopted an end-to-end deep learning framework to learn the motion patterns on the ground truth optical flow field and motion segmentation, followed by an object proposals model to extract the candidate objects and a CRF model for segmentation refinement. Jain et al. proposed a two-stream fully convolutional neural network, which combines the object proposals and motion information in a unified framework, for segmenting generic objects in videos. Different from the traditional methods, the deep learning approaches require a large amount of well-annotated data for training. In addition, when the object movements and video scenes are very different from the training data, the performance of deep learning based algorithms may degrade considerably.
Based on the above analysis, although much progress has been achieved by the existing methods, robust online VOS and accurate moving object extraction are still not well developed. In this paper, we are motivated by the definition of a moving object, namely that a segmented region should be moving and indicate a generic object; we call this the motion property. Based on it, we propose a novel fully unsupervised online VOS framework for more accurate moving object segmentation. To extract the regions that satisfy both moving object properties (moving and generic object), we propose a novel motion segmentation method that segments the moving objects between two video frames based on salient motion detection and object proposal methods. The salient motion detection method is used to extract the moving regions (denoted as a salient motion mask) on the optical flow, and the object proposal method is applied to detect the generic object regions (denoted as an objectness mask) on the video frames. However, neither the salient motion mask nor the objectness mask alone can accurately detect regions with both properties of "moving" and "a generic object". Therefore, we propose a pixel-level fusion method that operates on the intersection of the regions detected by the salient motion map and the objectness map. As the examples in Figure 1 show, by fusing the salient motion detection and object proposals, the moving background regions and static objects can be effectively removed by our method. Different from the existing deep learning methods [9, 29] that learn the motion segmentation model from a large amount of well-annotated data, our method does not require any additional training data and is able to directly use the pretrained object proposals model without fine-tuning.
In addition, due to the complex video scenes, the salient motion detection and object proposals in an individual video frame are not always reliable. Based on the observation that the video content in neighboring frames changes continuously, we propose a forward propagation refinement method to obtain more accurate moving regions and generic object regions. By propagating the segmentation of a few previous frames to the current processing frame, a more accurate segmentation result is obtained with the refined salient motion map and objectness map (see Section III-C).
Finally, to produce accurate object boundaries, we adopt a CRF model for further segmentation refinement. Based on the proposed motion segmentation and online refinement, our method is able to automatically segment the moving objects in an online manner. To demonstrate the performance of our method, we evaluate the proposed approach on the DAVIS-2016 and SegTrack-v2 benchmark datasets. Experimental results show that our method outperforms the state-of-the-art methods. Besides, our method significantly outperforms the unsupervised online methods by , and even achieves better performance than the best unsupervised offline method on the DAVIS-2016 dataset.
In summary, our main contributions are as follows.
We propose a novel unsupervised online VOS framework, which is derived from the motion property. In particular, we design a pixel-wise fusion method for the salient motion detection and object proposals, which can effectively remove the moving background and static object noises.
To deal with the unreliable salient motion and object proposals in the complex videos, we propose a forward propagation method by leveraging the segmentation of previous frames to refine the results.
We conducted comprehensive experiments on two benchmark datasets. The experimental results show that our method achieves much better results than the state-of-the-art unsupervised offline and online methods, demonstrating the potential of our framework.
II Related Work
II-A Semi-supervised VOS
Given the manual object annotation on one or more frames in a long video sequence, the semi-supervised methods propagate the annotated labels to the entire video sequence. The recent semi-supervised methods [14, 31, 32, 33, 34] often assume that the object mask is known in the first frame, and then use a tracking method to segment it in the subsequent video frames. In order to alleviate the drift problem in the tracking stage, Fan et al. annotated the object mask in a few frames, and adopted a local mask transfer method to propagate the source annotation to terminal images in both forward and backward directions. In addition, based on several annotated video frames, Maerki et al. exploited the non-local connections by minimizing the graph energy in bilateral space to model the global spatio-temporal constraints. Recently, deep learning based approaches often utilize large image classification datasets for pretraining, and have shown good VOS performance with a robust online update strategy [8, 14]. Due to the object annotation requirement for videos, semi-supervised approaches are not very convenient and are even infeasible in some real applications, such as video surveillance systems.
II-B Unsupervised VOS
Unsupervised algorithms aim to automatically segment moving objects without any user annotation. The early unsupervised methods [15, 36] are often based on geometric scene modeling, and they use the geometric model fitting error to classify the foreground/background label of the corresponding pixels. Sheikh et al. adopted a homography model to distinguish foreground/background trajectories, but they assumed an affine model instead of a more accurate perspective camera model. For more accurate scene modeling, Jung adopted multiple fundamental matrices to describe each moving object and segmented the moving objects with an epipolar geometry constraint. Unfortunately, this method is only valid for rigid objects and scenes. Later, the long-term trajectory-based strategy [3, 16, 17, 18] became a common method in unsupervised VOS frameworks. Depending on the level of analysis, the long-term trajectories are often generated on pixels, superpixels [3, 18, 19] or object proposals [17, 20, 21], in which pixels with consistent trajectories are clustered as foreground and others are regarded as background. In order to obtain accurate segmentation results, the long-term trajectory-based methods often require the entire video sequence as input, and thus they cannot segment moving objects in an online manner.
II-C Motion segmentation
In an online VOS framework, motion segmentation between two consecutive frames is the key to segmenting the moving objects frame-by-frame. Because the early geometry-based methods are sensitive to the model selection (2D homography or 3D fundamental matrix), recent methods try to distinguish the foreground from the background with different motion cues. Papazoglou and Ferrari first detected the motion boundaries based on the magnitude of the gradient of the optical flow field, and then used the filled binary motion boundaries to represent the moving regions. However, this method is very sensitive to the motion boundary extraction and it ignores the object information. In order to remove the camera translation and rotation, Bideau et al. utilized the angle and magnitude of the optical flow to maximize the information about how objects are moving differently. This method requires the camera's focal length to estimate the camera rotation and translation. However, given an arbitrary video sequence, the camera's focal length is often unknown. Inspired by salient object detection [38, 39, 40, 41] on static images, salient motion detection methods [7, 42] have been applied to the optical flow field for moving object segmentation, where the pixels with high motion contrast are classified as foreground. Lacking object information, the salient motion detection methods cannot handle moving background (e.g., moving water) that does not indicate a generic object.
More recently, deep learning based methods have been widely applied in VOS. For example, Tokmakov et al. proposed an end-to-end CNN-based framework to automatically learn motion patterns from the optical flow field, followed by an object proposals model and a CRF model for segmentation refinement. To fuse the motion and appearance in a unified framework, Jain et al. designed a two-stream CNN network, where an appearance stream is used to detect the object regions and a motion stream is used to find the moving regions. Very different from previous methods, in this paper we propose a new motion segmentation method. According to the motion property that a segmented region should be moving and indicate a generic object, we apply the off-the-shelf techniques of salient motion detection and object proposals for accurate motion segmentation. Unlike the end-to-end deep learning based motion segmentation methods [9, 29] that require a large number of training samples to learn the motion patterns, our method is able to directly use the pretrained object proposal model without any fine-tuning. Experiments on two datasets demonstrate the generalization and effectiveness of our method.
II-D Semantic segmentation with object proposals
A comprehensive review of object proposals is beyond the scope of this paper. Here, we only focus on the most related and recent works. The purpose of semantic segmentation [30, 43, 44] is to identify a set of generic objects in a given image with segmented regions. To generate the proposals, Krähenbühl et al. identified a set of automatically placed seed superpixels to hit all objects in the given image; the foreground and background masks are then generated by computing the geodesic distance transform on these seeds. Finally, they applied critical level sets on the geodesic distance transform to discover objects. Recently, with the success of deep learning in object detection, the DeepMask model is learned to propose candidate object segments with Fast R-CNN. More recently, He et al. proposed the Mask R-CNN framework for instance-level recognition and segmentation. By incorporating a mask branch for segmentation, Mask R-CNN extended Faster R-CNN and achieved good segmentation results. In this paper, we directly use the pretrained Mask R-CNN model to generate the objectness map without any fine-tuning.
III Our Approach
Let represent the foreground (denoted by ) or background (denoted by ) label of -th pixel in -th video frame . Given an input video stream , our goal is to predict a set of binary foreground/background masks for each video frame in a fully unsupervised and online manner.
Different from the existing methods, our method is based on the motion property that the segmented region in VOS should be “moving” and indicate a “generic object”. We propose a new moving object segmentation framework built on salient motion detection and object proposal methods. More specifically, for each frame (note that our method processes a given video in a frame-by-frame manner), the salient motion detection method is applied to detect the moving regions (i.e., salient motion mask) and the object proposal method is used to detect generic objects (i.e., objectness mask). Then the detected results of the two methods are fused by a proposed fusion method for better results (Section III-B). The results detected by the salient motion detection and object proposal methods are not always reliable, especially for complex video scenes. To alleviate this problem, we propose a forward propagation refinement method to improve the segmentation results (Section III-C). In addition, a CRF model is applied to further refine the results (Section III-D).
III-B Motion Segmentation
In the following, the salient motion mask represents the moving regions and the objectness mask denotes the generic objects. As mentioned, our motion segmentation is an effective fusion of salient motion segmentation and object proposal techniques. Next, we introduce the two techniques in sequence, and then describe the proposed fusion method.
III-B1 Salient motion mask
Motion reveals how foreground pixels move differently than their surrounding background pixels, and thus it is very useful for moving regions extraction. Different from the static camera environment in traditional background subtraction problems, the foreground pixel displacements and camera movements are unknown under moving camera environments.
In this work, we adopt saliency detection on the optical flow to separate moving regions from the static background. This method computes the global motion contrast of each pixel in a video frame and has shown good performance in motion segmentation tasks [7, 42]. Specifically, let be the backward optical flow field between two frames and , where each element is the optical flow vector of pixel in horizontal and vertical directions, is the total number of the frame pixels. Let be the salient motion map on optical flow field , the global motion contrast of each pixel is computed as:
where , the function is a distance metric  between the two flow vectors and . For the efficiency purpose, we adopt the Minimum Barrier Distance (MBD) transform for the salient motion detection .
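The global motion contrast described above can be illustrated with a brute-force sketch. This is not the MBD transform the paper adopts for efficiency; it assumes the Euclidean distance between flow vectors as the distance metric and a min-max normalization of the result, and the function name `global_motion_contrast` is hypothetical.

```python
import numpy as np

def global_motion_contrast(flow):
    """Brute-force motion contrast for a (H, W, 2) optical flow field.

    Each pixel's contrast is the sum of distances between its flow vector
    and the flow vectors of all other pixels. The Euclidean metric and the
    min-max normalization are assumptions made for this sketch.
    """
    h, w, _ = flow.shape
    vectors = flow.reshape(-1, 2).astype(np.float64)
    # Pairwise differences between all flow vectors: O(N^2) memory,
    # so this is only suitable for very small inputs.
    diff = vectors[:, None, :] - vectors[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    contrast = dist.sum(axis=1)
    span = contrast.max() - contrast.min()
    if span > 0:
        contrast = (contrast - contrast.min()) / span
    else:
        contrast = np.zeros_like(contrast)
    return contrast.reshape(h, w)

# Toy flow field: a 3x3 patch moves 3 pixels horizontally, the rest is static.
flow = np.zeros((8, 8, 2))
flow[2:5, 3:6] = [3.0, 0.0]
saliency = global_motion_contrast(flow)
```

In this toy field the moving patch receives the highest contrast because it differs from the majority of pixels, which is the behavior the salient motion map relies on.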
Given an unconstrained video sequence, both the object movements and the camera motion are unknown. In order to extract the moving regions under varying motion contrast, we utilize an adaptive threshold method to split the salient motion map. The pixels with high motion contrast are then classified as foreground and the rest are regarded as background pixels. Let be the binary split function of our adaptive threshold method, the salient motion mask is computed as:
where each element denotes the binary foreground/background label of each pixel .
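As an illustration of the adaptive split, the following sketch binarizes a saliency map with a single-threshold version of Otsu's method (the implementation details later in the paper name Otsu's method and use several thresholds; the single-threshold simplification, the histogram bin count, and the name `otsu_threshold` are assumptions of this sketch).

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Single-threshold Otsu: pick the bin center maximizing between-class variance."""
    hist, edges = np.histogram(values.ravel(), bins=bins)
    hist = hist.astype(np.float64)
    centers = (edges[:-1] + edges[1:]) / 2.0
    weight_low = np.cumsum(hist)                 # pixel count at or below each bin
    weight_high = weight_low[-1] - weight_low    # pixel count above each bin
    sum_low = np.cumsum(hist * centers)
    sum_high = sum_low[-1] - sum_low
    valid = (weight_low > 0) & (weight_high > 0)
    mean_low = sum_low / np.maximum(weight_low, 1.0)
    mean_high = sum_high / np.maximum(weight_high, 1.0)
    between = np.where(valid,
                       weight_low * weight_high * (mean_low - mean_high) ** 2,
                       0.0)
    return centers[np.argmax(between)]

# Toy saliency map: low contrast everywhere except a 3x3 moving patch.
saliency_map = np.full((8, 8), 0.1)
saliency_map[2:5, 3:6] = 0.9
threshold = otsu_threshold(saliency_map)
motion_mask = saliency_map > threshold   # True marks foreground (moving) pixels
```

Because the threshold is derived from the histogram of each map, no fixed contrast level has to be chosen in advance for videos with different amounts of motion.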
Different from moving object segmentation, the salient motion mask only represents the moving regions. Without any prior information about the object, moving background (e.g., moving water) may be classified as a moving object, as shown in Figure 2. This is why we incorporate object proposals for generic object detection.
III-B2 Objectness mask
As mentioned, the salient motion segmentation method cannot differentiate the moving objects from moving background. Therefore, the object proposal technique is applied on video frames to extract the generic objects. Based on the success of deep learning in object detection, by adding a branch for predicting segmentation masks on each region of interest, the recent algorithm Mask R-CNN  extends the Faster R-CNN  and achieves the state-of-the-art detection and segmentation performance in static images. In this work, we directly adopt the pretrained Mask R-CNN  model in VOS to remove the moving background regions.
In order to obtain an objectness mask with high recall, we adopt a low object confidence threshold (set to 0.5 in our experiments) to extract the generic object regions. Though the object proposal model is not reliable enough in some complex video scenes, such as the false positive detections and missing objects shown in Figure 3, it still provides very useful object information about the scenes. It is worth mentioning that we use the pretrained model of Mask R-CNN on COCO dataset  without any fine-tuning in our implementation. In spite of that, it can already achieve promising segmentation results on two different datasets (see Section IV-E). It shows the potential of our method, since it is very different from many existing methods (such as [8, 33]), which all require careful fine-tuning of the pretrained model on the testing datasets for better results.
III-B3 Mask fusion
As mentioned, the goal of motion segmentation is to detect moving objects. In this subsection, we introduce the proposed fusion method to compute the intersection region of the salient motion mask and the objectness mask, which satisfies both the moving and generic object conditions.
In practice, directly extracting the intersection region may result in inaccurate segmentation. For example, as the motion segmentation results in Figure 4 show, when only a part of a non-rigid object moves, the segmentation result does not cover the whole object region. To alleviate such cases, we first dilate the salient motion mask to produce moving regions with higher segmentation recall, and then use the dilated moving regions for mask fusion. Although some background regions might be included in the dilated moving regions, our experiments show that they can be effectively removed by fusing the dilated regions with the objectness mask. In the following, we describe our mask fusion method.
Let be the salient motion mask on optical flow field , be the objectness mask on current processing video frame , denote the image dilation function and represent the dilated radius. Then our fused segmentation mask of frame is obtained by fusing the binary mask and as:
where each element denotes the binary foreground/background label of each pixel , operator indicates the pixel-wise multiplication on the two components and . Our experiments on two datasets show that the salient motion detection and object proposals are complementary to each other in VOS (see Section IV-C).
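The fusion step (dilate the salient motion mask, then multiply it pixel-wise with the objectness mask) can be sketched as follows. The square structuring element, the radius value, and the function names `dilate` and `fuse_masks` are assumptions for illustration; the paper does not specify the shape of the dilation kernel here.

```python
import numpy as np

def dilate(mask, radius):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element."""
    mask = mask.astype(bool)
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    # OR together all shifted copies of the mask within the square window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def fuse_masks(motion_mask, objectness_mask, radius):
    """Pixel-wise product of the dilated motion mask and the objectness mask."""
    return dilate(motion_mask, radius) & objectness_mask.astype(bool)

# Toy scene: one object of which only a thin strip moves, plus a static object.
objectness = np.zeros((10, 10), dtype=bool)
objectness[3:6, 2:7] = True      # partly moving object
objectness[8:10, 8:10] = True    # static object
motion = np.zeros((10, 10), dtype=bool)
motion[3:6, 4:5] = True          # only the middle column of the object moves

fused = fuse_masks(motion, objectness, radius=2)
```

In this example the dilation recovers the full extent of the partly moving object, while the static object is removed because no motion is detected near it.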
III-C Forward Propagation Refinement
In some complex video scenes, it is difficult to obtain reliable salient motion detection and object proposal results for each video frame, as shown in Figures 2, 3 and 4. Note that the video content changes continuously and slowly. In other words, the content in a frame is very similar to that of the previous frame. For example, when the object moves only slightly, only a small part of the background pixels change. The temporal continuity of video content in sequential frames can be used to improve the individual motion segmentation between two frames. Therefore, we propose a forward propagation refinement method, which leverages the segmentation masks of previous frames for the segmentation of the current frame, and thus achieves more robust and accurate segmentation.
Let denote the segmentation mask of -th frame without using forward propagation refinement (namely, obtained by Eqn. 3); denote the segmentation mask of -th frame with the refinement method. For the frame , suppose we consider the segmentation masks of the previous frames, i.e., , which are propagated to current processing frame (based on the pixel-wise tracking with optical flow) as for segmentation refinement.
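The pixel-wise propagation of a previous mask to the current frame can be sketched as a nearest-neighbor warp. The flow convention is an assumption of this sketch: the backward flow at a current-frame pixel is taken to point to its position in the previous frame, with channel 0 horizontal and channel 1 vertical. The name `propagate_mask` is hypothetical.

```python
import numpy as np

def propagate_mask(prev_mask, backward_flow):
    """Warp a previous-frame mask to the current frame (nearest neighbor).

    backward_flow[y, x] = (dx, dy) is assumed to map the current-frame pixel
    (x, y) to its location (x + dx, y + dy) in the previous frame.
    """
    h, w = prev_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.rint(xs + backward_flow[..., 0]).astype(int)
    src_y = np.rint(ys + backward_flow[..., 1]).astype(int)
    # Pixels whose source falls outside the previous frame stay background.
    inside = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    warped = np.zeros_like(prev_mask)
    warped[inside] = prev_mask[src_y[inside], src_x[inside]]
    return warped

# Toy example: a 2x2 object moved 2 pixels to the right between frames,
# so every current-frame pixel looks 2 pixels to the left in the previous frame.
prev_mask = np.zeros((6, 8), dtype=bool)
prev_mask[2:4, 2:4] = True
backward_flow = np.zeros((6, 8, 2))
backward_flow[..., 0] = -2.0
warped_mask = propagate_mask(prev_mask, backward_flow)
```

Applying this warp repeatedly, or with flow accumulated over several frames, would bring each of the previous segmentation masks into the coordinate frame of the current image.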
The refined salient motion map of current processing frame is recomputed with original salient motion map (obtained from Eqn. 1) and propagated masks as:
where is a weight to balance these two components. As shown in Figure 6, the unreliable salient motion segmentation can be improved by forward propagating a set of previous segmentation masks.
Similarly, the objectness mask of current processing frame can be also improved by the same strategy as:
Similar to the motion segmentation between two video frames, the improved segmentation mask is obtained by fusing the refined masks and as:
where means the initial motion segmentation between the first and second video frames.
Compared to the individual motion segmentation between two video frames, by propagating previous segmentations to the current processing frame, our method is able to improve both the unreliable salient motion segmentation and the object proposals. As the examples in Figure 5 show, the performance of our approach is improved by the forward propagation refinement.
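A possible form of the refinement is sketched below. It assumes that the refined map is a convex combination of the current map and the average of the propagated masks, with the weight applied to the current map; this form is only inferred from the described behavior (a larger weight means less influence from previous frames) and is not taken from the paper's equations. The name `refine_map` is hypothetical.

```python
import numpy as np

def refine_map(current_map, propagated_masks, alpha):
    """Blend the current map with the mean of masks propagated from previous frames.

    Assumed form: alpha * current + (1 - alpha) * mean(propagated masks).
    With alpha = 1 the previous frames have no influence.
    """
    current = np.asarray(current_map, dtype=np.float64)
    if len(propagated_masks) == 0:
        return current
    history = np.mean(np.stack(propagated_masks).astype(np.float64), axis=0)
    return alpha * current + (1.0 - alpha) * history

# Toy example: the current map responds weakly on a pixel that both
# propagated masks label as foreground.
current_map = np.array([[0.2, 0.9],
                        [0.0, 0.1]])
propagated = [np.array([[1, 1], [0, 0]]), np.array([[1, 1], [0, 0]])]
refined = refine_map(current_map, propagated, alpha=0.85)
```

The weakly detected pixel is pulled towards foreground by its history, while pixels that were background in the previous frames are slightly suppressed; the refined map would then be binarized and fused as in the two-frame case.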
III-D CRF Refinement
Notice that the segmentation based on the motion cannot detect the object boundary very accurately in some cases [3, 9], which may degrade the results of our method, even after the forward propagation refinement (denoted by ). To alleviate the problem, the CRF model  can be applied to our framework (based on the segmentation ) to further improve the final segmentation result (denoted by ).
Based on the forward propagation refinement result , we use the binary label value of to compute the unary terms and standard color-based pairwise terms. The refined segmentations are shown in the second row of Figure 7, which improves over the initial segmentations (the first row of Figure 7) on the object boundaries. Finally, our entire approach is summarized in Algorithm 1.
IV Experiments and Results
In this section, we first describe the implementation details of our method, followed by an introduction to the experimental datasets. Next, we detail the baselines and evaluation metrics, and finally report and analyze the experimental results.
IV-A Implementation Details
Inspired by salient object detection on static images, previous works often applied salient object detection to the optical flow field as salient motion detection, which has been demonstrated to be effective in [7, 29]. In this work, for efficiency, we adopt the salient object detection method MBD on SIFT flow to detect the moving regions. The objectness mask is detected by Mask R-CNN, which is the state-of-the-art method. In our implementation, we directly used the trained Mask R-CNN model (based on the COCO dataset) without any fine-tuning. We adopted the CRF model for final segmentation refinement. It is worth mentioning that, for all the above models, we directly used the default parameters provided with these approaches without any fine-tuning, and can already achieve state-of-the-art performance (as shown in Section IV-E).
Unless otherwise specified, the reported results are based on the following parameter settings: for object proposals, the confidence threshold of the object detection is set to 0.5 and the radius for the image dilation operator is . Otsu's method is used for adaptive threshold segmentation. The number of adaptive thresholds is set to 3 for salient motion segmentation and 2 for the multi-frame object mask in our experiments. Besides, the number of previous frames for the forward propagation refinement method is set to 2 (i.e., in Section III-C).
Our method is mainly implemented in MATLAB and evaluated on a desktop with 1.7GHz Intel Xeon CPU and 32GB RAM. Given an image of resolution pixels, the average processing time of the key components is shown in Table I. From the table, we can see that the main computational cost is for optical flow estimation and other steps are very fast. We will release our code if the paper is accepted.
|Key component|Average processing time|
|Salient motion detection|0.01|
|Forward propagation refinement|0.05|
IV-B Datasets and Evaluation Metrics
The DAVIS-2016 dataset is currently the most challenging VOS benchmark. It contains 50 high-resolution video sequences of diverse object categories with 3455 densely annotated pixel-wise ground-truth frames. Videos in this dataset are unconstrained, and the challenges include appearance change, dynamic background, fast motion, motion blur and occlusion.
The SegTrack-v2 dataset is a widely used benchmark for VOS, which consists of 14 low-resolution videos with a total of 1066 frames. The ground truth of this dataset is also annotated pixel-wise. The main challenges in the SegTrack-v2 dataset include drastic appearance change, complex background, occlusion, abrupt motion and multiple moving objects. Similar to previous methods [18, 29], we treated multiple objects with individual ground truth as a single foreground for evaluation.
IV-B2 Evaluation metrics
For quantitative analysis, the standard evaluation metrics region similarity $\mathcal{J}$, contour accuracy $\mathcal{F}$ and temporal stability $\mathcal{T}$ are adopted. Region similarity $\mathcal{J}$ is defined as the mean Intersection-over-Union (mIoU) of the estimated segmentation and the ground-truth mask. $\mathcal{F}$ measures the accuracy of the contours and $\mathcal{T}$ measures the temporal stability of the segmentation results in VOS. More descriptions of the evaluation metrics can be found in the benchmark paper. For performance comparison between the proposed segmentation and the state-of-the-art approaches, we utilized the provided codes and parameter configurations from the benchmark website (https://graphics.ethz.ch/~perazzif/davis/code.html). Since mIoU denotes the region similarity between the segmentation result and the ground truth, we mainly analyze the performance of each algorithm with the mIoU metric, as in previous works [9, 18, 21, 29].
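The region similarity used throughout the experiments is the Intersection-over-Union between a predicted mask and the ground truth, averaged over frames. A minimal sketch is given below; the function names and the convention of returning 1.0 when both masks are empty are assumptions of this sketch, not taken from the benchmark code.

```python
import numpy as np

def region_similarity(pred, gt):
    """Intersection-over-Union of two binary masks."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(preds, gts):
    """Average IoU over a sequence of (prediction, ground-truth) frame pairs."""
    return float(np.mean([region_similarity(p, g) for p, g in zip(preds, gts)]))

# Toy frame: prediction and ground truth overlap on 2 of 6 covered pixels.
pred = np.zeros((4, 4), dtype=bool)
pred[1, 0:4] = True
gt = np.zeros((4, 4), dtype=bool)
gt[1, 2:4] = True
gt[2, 2:4] = True
iou = region_similarity(pred, gt)
sequence_miou = mean_iou([pred, gt], [gt, gt])
```

For a dataset with several annotated objects per frame, the individual ground-truth masks would first be merged with a logical OR into a single foreground mask before computing the score.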
|Components|DAVIS-2016 mIoU (%)|SegTrack-v2 mIoU (%)|
|+|69.6 (+12.5)|55.3 (+8.0)|
|++|74.6 (+5.0)|61.5 (+6.2)|
|+++|77.2 (+2.6)|64.3 (+2.8)|
IV-C Influence of Different Components
To demonstrate the influence of each component in the proposed method, we report the performance of fusing different modalities on the two datasets DAVIS-2016 and SegTrack-v2. In addition, we use the same parameters and for our forward propagation refinement method on the two datasets to show the robustness of the proposed algorithm. For ease of presentation, we denote the key components of our approach as follows.
: salient motion segmentation on the optical flow field.
: object proposals on current processing video frame.
: forward propagation refinement with several previous segmentations.
: coarse-to-fine segmentation with CRF.
Based on these components, the improvements obtained by fusing different components are reported in Table II. We analyze the effectiveness of each component in detail as follows.
IV-C1 Effectiveness of the mask fusion
As a reminder, the mask fusion is to remove some potential segmentation noise, such as moving background and static objects, in the moving regions and object regions, which are detected by the salient motion detection method and object proposal method, respectively.
As shown in Table II, on the DAVIS-2016 dataset, the mIoU of salient motion detection is , which denotes the accuracy of moving region segmentation. Similarly, the performance of object proposals is , which denotes the accuracy of object region detection. Based on our pixel-wise fusion method, moving background regions and static object regions can be effectively removed. Compared to the salient motion detection component , the mIoU after fusion (+, 69.6%) is improved by 12.5% on the DAVIS-2016 dataset.
Similarly, on the SegTrack-v2 dataset, the mIoU of salient motion detection component is and the object proposals is . Based on the proposed mask fusion method, the fused results + (55.3%) have achieved an 8.0% improvement when compared to ().
Because the videos in the SegTrack-v2 dataset are of low resolution, the semantic object segmentation results of the object proposals model pretrained on the COCO dataset are not very good on some videos, and thus the improvement is not as high as on the DAVIS-2016 dataset. As shown in Figure 8, the moving objects are accurately extracted by the salient motion segmentation. However, due to the low-resolution video frames and background clutter in some complex scenes, Mask R-CNN fails to provide accurate generic object regions when the pretrained model is used directly. It is expected that the performance can be further improved by fine-tuning the object proposal model.
IV-C2 Effectiveness of the forward propagation refinement
In order to handle the unreliable salient motion detection and object proposals in each individual video frame, we propose a forward propagation refinement method to improve the segmentation accuracy. As shown in Table II, compared to the motion segmentation + between two video frames, the forward propagation refinement (++) achieves 5.0% and 6.2% absolute improvements on the DAVIS-2016 and SegTrack-v2 datasets, respectively. Although the object movements and video quality are very different in these two datasets, the proposed method is still robust enough to improve the results.
IV-C3 Effectiveness of the CRF refinement
We also applied a CRF model for result refinement, which further improves the segmentation accuracy. As shown in Table II, we obtained improvements of 2.6% and 2.8% on the two datasets, respectively. From the results and the above analysis, we can see that each component of our model is useful and indeed improves the performance.
IV-D Influence of Key Parameters
In this section, we analyze the influence of the key parameters in our approach, namely the accumulation weight and the number of frames used for forward propagation refinement, on both datasets.
The accumulation weight decides how much the segmentation results of previous frames affect the segmentation of the current frame. When , no information from the previous frames propagates to the current frame. As shown in the left of Figure 9, when the value of the weight decreases from to , the performance first increases and then decreases slightly on both datasets. The best performance is achieved at 0.85 and 0.75 for the DAVIS-2016 dataset and the SegTrack-v2 dataset, respectively. Note that the smaller the value of the weight, the more information (segmentation results) from the previous frames is propagated to the current frame. Therefore, when the weight becomes too small, the information from previous frames dominates and thus degrades the performance. (Footnote 3: The extreme case is when , which means that the information from previous frames overwrites the current frame and will of course mislead the result.) In our experiments, till , the performance on the two datasets is still better than . This demonstrates the effectiveness of the proposed forward propagation refinement, which is quite robust and improves the performance within a wide range of weight values.
IV-D2 Frame number
Another key parameter is the number of previous frames (n), which decides how many previous segmentation masks are used for the forward propagation refinement. denotes the motion segmentation between two consecutive video frames. To analyze the forward propagation refinement with different numbers of frames, the performance with different numbers of previous frames on the two datasets is shown in the right of Figure 9, by setting . From the results, we can see that the performance is improved when on both datasets. The larger n is, the more previous frames are considered. When n is very large (e.g., n = 100), information from frames that are far from the current frame (e.g., the -th frame before this frame) is also considered, which may introduce noise. (Footnote 4: Noisy information is very likely to be introduced because frames that are far from the current frame may contain very different content.) In particular, the proposed method achieves the best performance when on the DAVIS-2016 dataset and when on the SegTrack-v2 dataset. Because SegTrack-v2 is a low-resolution dataset, the object proposals on SegTrack-v2 are not as reliable as on DAVIS-2016, which is why the performance variation on the DAVIS-2016 dataset is smoother, as shown in Figure 9.
In our experiments, we set and on the two datasets for evaluation. Notice that we have fixed these parameters for the two datasets in all experiments. The superiority of our approach over the baselines across different datasets with the same parameter setting also demonstrates the robustness of our approach (see Section IV-E).
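The roles of the two parameters can be made concrete with a small sketch. The update rule of the forward propagation refinement is not given in this section, so the code below assumes a simple linear blend between the current soft mask and the mean of the last n previous masks; the function name `forward_refine` and its arguments are hypothetical. The blend is chosen only because it reproduces the limiting behavior described above: a weight of 1 keeps the current frame unchanged, and a weight of 0 lets the previous frames overwrite it.

```python
import numpy as np


def forward_refine(current, previous, weight, n):
    """Blend the current soft mask with the mean of the last n previous masks.

    Assumed rule (not taken from the paper):
        refined = weight * current + (1 - weight) * mean(previous[-n:])
    """
    current = np.asarray(current, dtype=float)
    # Use at most the n most recent masks; n <= 0 disables propagation.
    history = [np.asarray(m, dtype=float) for m in previous[-n:]] if n > 0 else []
    if not history:
        return current
    prior = np.mean(history, axis=0)
    return weight * current + (1.0 - weight) * prior


# Toy example: the object pixel at index 0 is missed in the current frame
# but was detected in the two previous frames.
current = np.array([0.0, 1.0])
previous = [np.array([1.0, 1.0]), np.array([1.0, 1.0])]

print(forward_refine(current, previous, weight=1.0, n=2))   # current frame only
print(forward_refine(current, previous, weight=0.75, n=2))  # partial recovery of pixel 0
print(forward_refine(current, previous, weight=0.0, n=2))   # previous frames overwrite
```

Under this assumed rule, a larger n averages over older masks, which matches the observation that distant frames may contribute content unrelated to the current frame.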
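Table III: Quantitative comparison on the DAVIS-2016 dataset. Compared methods: ARP, FST, NLC, MSG, KEY, TRC, FSEG, LMP, CVOS, SAL, SFM, Ours.
Table IV: Quantitative comparison on the SegTrack-v2 dataset. Compared methods: FST, KEY, NLC, FSEG, Ours.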
IV-E Comparison to the State-of-the-Art Methods
We compared our method with several state-of-the-art unsupervised moving object segmentation methods to verify its effectiveness. Based on whether they operate in an offline or online manner, we group these competitors into two categories.
Unsupervised offline methods: To achieve good segmentation performance, the offline methods often require the entire video sequence to generate long-term trajectories, and the moving objects are identified by motion or objectness cues. Based on the results provided by the DAVIS-2016 benchmark dataset, the compared offline methods include ARP, FST, NLC, MSG, KEY and TRC.
Unsupervised online methods: Instead of generating long-term trajectories on the entire video sequence, the online methods are able to segment the moving objects frame-by-frame. The compared online methods include FSEG, LMP, CVOS, SAL and SFM. Specifically, FSEG and LMP are deep learning based methods that attempt to learn the moving patterns from the optical flow field. FSEG fuses appearance and motion in a two-stream fully convolutional neural network, where the appearance stream is used to extract the candidate object regions and the motion stream is used to produce the moving foreground. LMP is also a fully convolutional network, which is learned from synthetic videos and their ground-truth optical flow and motion segmentation. Based on the coarse motion segmentation, LMP adopts object proposals and a CRF to refine the initial result. CVOS automatically segments the moving objects using several frames, and then a tracking strategy is used to propagate the initialized mask to subsequent frames. SAL is based on spatio-temporal saliency detection and performs VOS on multiple frames for online processing. SFM is a salient motion detection method that operates between two consecutive frames.
IV-E2 Quantitative Analysis
To demonstrate the performance of our approach, we compare it with several unsupervised methods on the DAVIS-2016  and SegTrack-v2  benchmark datasets. The quantitative comparison results on DAVIS-2016 dataset are shown in Table III and the results on SegTrack-v2  dataset are reported in Table IV, respectively. In addition, the compared algorithms and results on SegTrack-v2  dataset are obtained from recent work . Similar to , we mainly analyze our method on the larger dataset DAVIS-2016.
Performance on DAVIS-2016: As shown in Table III, in terms of the mean and recall of the region similarity and contour accuracy metrics, our method achieves the best performance among all compared algorithms, including the best offline method ARP. In particular, our method obtains a significant improvement in the recall of both region similarity () and contour accuracy (), with gains of and , respectively, over ARP. Moreover, the decay of and , and the temporal stability of our method are also better than those of ARP.
Because our method is an online method, we mainly analyze the state-of-the-art online methods. FSEG and LMP adopt an end-to-end deep learning framework for motion segmentation between two consecutive frames, and both of them fuse the optical flow field and object proposals for moving object segmentation. However, although video content changes continuously, these two methods ignore the important temporal connection of the video content. Our method is based on salient motion detection and object proposals, and it does not require further training on a large amount of well-annotated data. Specifically, our method achieves a segmentation result ( for +, as shown in Table II) that is competitive with FSEG () and LMP (). By incorporating the forward propagation refinement, even without the CRF model, the accuracy of our method ++ reaches , which is better than FSEG and LMP. In addition, with the help of CRF optimization, our result is further improved to , which outperforms the compared methods by a large margin. The online method CVOS is very sensitive to the object initialization and suffers from the drift problem when tracking the initialized object mask. As shown in Table III, due to this unreliable online segmentation strategy, the accuracy of CVOS is only . Another online approach, SAL, uses spatio-temporal saliency detection to extract the moving object regions. However, the moving object is not always salient in some videos, and thus its segmentation result () is also not good enough. SFM is a salient motion detection method; because it considers neither the object information nor the temporal connection of the video content, its segmentation result () is also not very good.
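The mean, recall, and decay statistics discussed for DAVIS-2016 can be sketched as follows. This is an illustrative approximation rather than the official evaluation code: it assumes that recall is the fraction of frames whose score exceeds 0.5 and that decay is the mean score of the first quarter of the sequence minus that of the last quarter, which follows our understanding of the DAVIS-2016 benchmark definitions. The function name `sequence_statistics` is hypothetical.

```python
import numpy as np


def sequence_statistics(per_frame_scores, recall_threshold=0.5):
    """Summarize per-frame scores (e.g., region similarity) of one sequence.

    Returns (mean, recall, decay). The threshold and the quartile-based decay
    are assumptions stated in the lead-in; at least four frames are required.
    """
    scores = np.asarray(per_frame_scores, dtype=float)
    mean = float(scores.mean())
    # Recall: fraction of frames scoring above the threshold.
    recall = float((scores > recall_threshold).mean())
    # Decay: performance loss over time, first quarter minus last quarter.
    quarters = np.array_split(scores, 4)
    decay = float(quarters[0].mean() - quarters[-1].mean())
    return mean, recall, decay


# Toy per-frame IoU values that degrade over time.
scores = [0.9, 0.8, 0.8, 0.7, 0.6, 0.6, 0.4, 0.4]
mean, recall, decay = sequence_statistics(scores)
print(mean, recall, decay)
```

A lower decay means that the segmentation quality degrades less over the course of a sequence, which is the sense in which a method's decay is "better".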
Performance on SegTrack-v2: To demonstrate the performance of our method on a low-resolution dataset, we compare our method with the results available for this dataset in FSEG. Compared to the high-resolution dataset DAVIS-2016, it is more difficult to predict accurate object regions with a pretrained object proposal model on the SegTrack-v2 dataset, as illustrated in Figure 8. NLC achieves the best performance on this dataset. However, it is an offline method that is based on non-local consensus voting of short-term and long-term motion saliency. Among the online methods, our approach outperforms FSEG and achieves better performance on most of the videos, as shown in Table IV.
IV-E3 Qualitative Evaluation
To qualitatively evaluate our method, we compare it with several unsupervised offline and online methods on some challenging cases: multiple moving objects, heavy occlusion, dynamic background, fast motion and motion blur, and non-planar scenes. Specifically, we compare our method with the offline method NLC (Footnote 5: The results provided for the best offline method ARP in the DAVIS-2016 benchmark dataset are not correct.), the automatic initialization and tracking based method CVOS, and two deep learning based methods, FSEG and LMP. The segmentation results of the compared methods on the above challenging scenarios are illustrated in Figure 10. We analyze the results of each scenario as follows.
Multiple moving objects: An unconstrained video often contains multiple moving objects, and the proposed method is able to segment them automatically. As in FSEG and LMP, for videos with multiple moving objects, we treat them as a single foreground. As shown in the first row of Figure 10, our method is able to segment both of the moving objects in this video. For the offline method NLC, the moving person is classified as background, possibly because of the small region size of the person. CVOS cannot automatically initialize the moving person, and thus it fails to segment both moving objects. The appearance stream of FSEG does not reliably extract the object regions in this frame, and it also fails to segment the moving person. Based on accurate motion segmentation and object proposals, LMP and our method successfully segment both objects. More results on the segmentation of multiple moving objects are reported in Table IV, such as the videos bmx, drift, monkeydog and penguin.
Heavy occlusion: Occlusion is a very challenging problem in VOS, which can cause disconnections in long-term trajectory generation and drift in tracking. As shown in the second row of Figure 10, because of the disconnected trajectories caused by heavy occlusion, some background regions are classified as foreground by NLC. Besides, its segmentation does not cover the whole bus. CVOS adopts an automatic object initialization and tracking strategy, and thus it suffers from the drift problem of tracking. The segmentation result of CVOS is also incomplete. LMP is learned from the ground-truth optical flow and motion segmentation of a specific dataset, and thus the performance of LMP is stable on that dataset, as in the result shown for this video frame. FSEG achieves better performance by fusing the object proposals and motion segmentation in a unified framework, and our method is slightly better than FSEG.
Dynamic background: Dynamic background regions are difficult to remove without prior knowledge about the object. As shown in the third row of Figure 10, NLC and CVOS cannot obtain an accurate segmentation in this video. LMP fails to segment the moving object in this video because it adopts an end-to-end framework that learns the motion pattern from the ground-truth optical flow and binary motion segmentation of rigid scenes, and thus it is difficult to obtain accurate results when the motion is caused by a non-rigid scene (such as waving water). Based on the salient motion detection and robust object proposals, our approach achieves good segmentation results.
Fast motion and motion blur: When the object moves fast, it causes unreliable optical flow estimation and motion blur. As shown in the fourth row of Figure 10, due to the fast motion of the car, the computed optical flow field is not accurate enough to indicate the region of the moving car. Therefore, the segmentation result of NLC is incomplete and that of CVOS contains too many background regions. Similar to the dynamic background condition, LMP cannot obtain a good segmentation when the optical flow field is not reliable enough. Based on the robust forward propagation refinement, our method achieves better performance than FSEG in this video frame.
Non-planar scene: Because the 3D world is projected onto a 2D plane (the optical flow field), it is difficult to distinguish the moving foreground from the static background when the scene is non-planar. As shown in the last row of Figure 10, due to the lack of prior knowledge about the object, the foreground masks segmented by NLC and CVOS are very different from each other, and both methods fail to obtain reliable segmentation results. With the help of robust object proposals, our method achieves good performance, comparable to FSEG and LMP.
In this paper, we presented a new framework for the unsupervised online VOS problem. Motivated by the two key properties of moving objects - “moving” and “generic”, we proposed to apply salient motion detection and object proposal techniques to this challenging problem. Moreover, we designed a pixel-level fusion method and a forward propagation refinement method to improve the segmentation performance. Comprehensive experiments have been performed on two benchmark datasets. Without any fine-tuning of the applied pre-trained models, the results show that our method can outperform existing state-of-the-art methods by a large margin. Besides, we analyzed the results in detail and examined how the proposed method deals with some very challenging scenarios. This work explores the potential of combining salient motion detection and object proposal techniques for the VOS problem. We hope that it can motivate more work on this new framework in the future.
This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its International Research Centre in Singapore Funding Initiative. This research is also supported by National Natural Science Foundation of China 61571362 and Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2018JM6015).
-  M. Song, D. Tao, and S. J. Maybank, “Sparse camera network for visual surveillance–a comprehensive survey,” arXiv:1302.0446, 2013.
-  F. Guo, W. Wang, J. Shen, L. Shao, J. Yang, D. Tao, and Y. Y. Tang, “Video saliency detection using object proposals,” IEEE Transactions on Cybernetics, 2017.
-  A. Papazoglou and V. Ferrari, “Fast object segmentation in unconstrained video,” in ICCV, 2013.
-  J. Yang, B. Price, X. Shen, Z. Lin, and J. Yuan, “Fast appearance modeling for automatic primary video object segmentation,” TIP, vol. 25, no. 2, pp. 503–515, 2016.
-  X. Liu, D. Tao, M. Song, Y. Ruan, C. Chen, and J. Bu, “Weakly supervised multiclass video segmentation,” in CVPR, 2014.
-  L. Yang, J. Han, D. Zhang, N. Liu, and D. Zhang, “Segmentation in weakly labeled videos via a semantic ranking and optical warping network,” TIP, vol. 27, no. 8, pp. 4025–4037, 2018.
-  F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in CVPR, 2016.
-  A. Khoreva, F. Perazzi, R. Benenson, B. Schiele, and A. Sorkine-Hornung, “Learning video object segmentation from static images,” in CVPR, 2017.
-  P. Tokmakov, K. Alahari, and C. Schmid, “Learning motion patterns in videos,” in CVPR, 2017.
-  L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos, “Efficient video object segmentation via network modulation,” in CVPR, 2018.
-  S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C.-C. Jay Kuo, “Instance embedding transfer to unsupervised video object segmentation,” in CVPR, 2018.
-  Q. Fan, F. Zhong, D. Lischinski, D. Cohen-Or, and B. Chen, “JumpCut: Non-successive mask transfer and interpolation for video cutout,” SIGGRAPH ASIA, vol. 34, no. 6, 2015.
-  N. Maerki, F. Perazzi, O. Wang, and A. Sorkine-Hornung, “Bilateral space video segmentation,” in CVPR, 2016.
-  S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool, “One-shot video object segmentation,” in CVPR, 2017.
-  Y. Sheikh, O. Javed, and T. Kanade, “Background subtraction for freely moving cameras,” in ICCV, 2009.
-  P. Ochs and T. Brox, “Higher order motion models and spectral clustering,” in CVPR, 2012.
-  F. Xiao and Y. Jae Lee, “Track and segment: An iterative unsupervised approach for video object proposals,” in CVPR, 2016.
-  A. Faktor and M. Irani, “Video segmentation by non-local consensus voting,” in BMVC, 2014.
-  S. D. Jain and K. Grauman, “Supervoxel-consistent foreground propagation in video,” in ECCV, 2014.
-  K. Fragkiadaki, P. Arbelaez, P. Felsen, and J. Malik, “Learning to segment moving objects in videos,” in CVPR, 2015.
-  Y. J. Koh and C.-S. Kim, “Primary object segmentation in videos based on region augmentation and reduction,” in CVPR, 2017.
-  W. Wang, J. Shen, and F. Porikli, “Saliency-aware geodesic video object segmentation,” in CVPR, 2015.
-  F. Li, T. Kim, A. Humayun, D. Tsai, and J. Rehg, “Video segmentation by tracking many figure-ground segments,” in ICCV, 2013.
-  B. Taylor, V. Karasev, and S. Soatto, “Causal video object segmentation from persistence of occlusions,” in CVPR, 2015.
-  Y. Yang, G. Sundaramoorthi, and S. Soatto, “Self-occlusion and disocclusion in causal video object segmentation supplementary material,” in ICCV, 2015.
-  P. Bideau and E. Learned-Miller, “It’s moving! a probabilistic model for causal motion segmentation in moving camera videos,” in ECCV, 2016.
-  P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, “Learning to refine object segments,” in ECCV, 2016.
-  P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in NIPS, 2011.
-  S. Dutt Jain, B. Xiong, and K. Grauman, “Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos,” in CVPR, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
-  Y.-H. Tsai, M.-H. Yang, and M. J. Black, “Video segmentation via object flow,” in CVPR, 2016.
-  L. Wen, D. Du, Z. Lei, S. Z. Li, and M.-H. Yang, “JOTS: Joint online tracking and segmentation,” in CVPR, 2015.
-  P. Voigtlaender and B. Leibe, “Online adaptation of convolutional neural networks for video object segmentation,” in BMVC, 2017.
-  J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang, “Fast and accurate online video object segmentation via tracking parts,” in CVPR, 2018.
-  F. Liu, C. Gong, X. Huang, T. Zhou, J. Yang, and D. Tao, “Robust visual tracking revisited: From correlation filter to template matching,” TIP, vol. 27, no. 6, pp. 2777–2790, 2018.
-  H. Jung, J. Ju, and J. Kim, “Rigid motion segmentation using randomized voting,” in CVPR, 2014.
-  R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge university press, 2003.
-  M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, “Global contrast based salient region detection,” TPAMI, vol. 37, no. 3, pp. 569–582, 2015.
-  C. Gong, D. Tao, W. Liu, S. J. Maybank, M. Fang, K. Fu, and J. Yang, “Saliency propagation from simple to difficult,” in CVPR, 2015.
-  J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, “Minimum barrier salient object detection at 80 fps,” in ICCV, 2015.
-  Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr, “Deeply supervised salient object detection with short connections,” in CVPR, 2017.
-  W.-T. Li, H.-S. Chang, K.-C. Lien, H.-T. Chang, and Y.-C. F. Wang, “Exploring visual and motion saliency for automatic video object extraction,” TIP, vol. 22, no. 7, pp. 2600–2610, 2013.
-  P. Krähenbühl and V. Koltun, “Geodesic object proposals,” in ECCV, 2014.
-  P. O. Pinheiro, R. Collobert, and P. Dollár, “Learning to segment object candidates,” in NIPS, 2015.
-  R. B. Girshick, “Fast R-CNN,” in ICCV, 2015.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
-  N. Ohtsu, “A threshold selection method from gray-level histograms,” IEEE Transactions on Systems Man Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
-  C. Liu, J. Yuen, and A. Torralba, “SIFT flow: Dense correspondence across scenes and its applications,” TPAMI, vol. 33, no. 5, pp. 978–994, 2011.
-  N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Transactions on Systems, Man and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
-  P. Ochs and T. Brox, “Object segmentation in video: a hierarchical variational approach for turning point trajectories into dense regions,” in ICCV, 2011.
-  Y. J. Lee, J. Kim, and K. Grauman, “Key-segments for video object segmentation,” in ICCV, 2011.
-  K. Fragkiadaki, G. Zhang, and J. Shi, “Video segmentation by tracing discontinuities in a trajectory embedding,” in CVPR, 2012.
-  F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in CVPR, 2012.