1 Introduction

† Part of the work was done during an internship at Tencent AI Lab. ‡ Equal contributions. Corresponding authors.
Semi-supervised Video Object Segmentation (VOS) is the task of automatically segmenting the objects of interest in a video given their annotations in the first frame. It is a fundamental task with wide applications in video editing, video summarization, action recognition, etc. Although tremendous progress has been made with semantic segmentation CNNs [24, 7, 8, 28] recently, VOS remains challenging due to missing objects and association errors caused by occlusions, large deformations, complex object interactions, rapid motions, etc., as shown in Fig. 1.
To tackle these challenges, many recent works [22, 35, 25] resort to object proposal schemes [13, 34] to recover missing objects or re-establish object associations. In these works, proposals of target objects are either generated individually in each frame [22, 35] by semantic detectors, or further merged across a few neighboring frames. However, these approaches rely on a greedy selection of the best object proposal for a given object at each time step, which makes them utterly dependent on a reliable Re-ID network that can provide accurate similarity scores. In this paper, we instead deal with this problem with a multiple hypotheses propagation approach, which builds a tracking tree of different hypotheses over time steps, enabling us to defer the selection of the best object proposal for each target until a whole proposal tree along the temporal domain is established. This delayed decision making provides a global view for determining data associations in each frame by considering object information over the entire video, which proves more reliable than greedy methods.
The idea of tracking using multiple hypotheses is not new. In the seminal work by Cox and Hingorani, Multiple Hypotheses Tracking (MHT) was first introduced to the vision community and applied in the context of visual tracking. Unfortunately, the performance of MHT was limited by the unreliable target detectors of that time, and the method was later abandoned for decades. More recently, it has again been demonstrated to achieve state-of-the-art performance for multiple object tracking when implemented with modern techniques. The basic idea of MHT is to build up a tracking tree with proposals from each frame, and then prune the tree using the tracking scores until only the best track is left. The key ingredients for the success of MHT are the gating scheme and the scoring function used during the construction and pruning of the tracking trees. In the gating scheme, Kalman filtering is employed to restrict proposal children to be spawned within a certain gating area near their parent, so that the tree does not expand too quickly. The scoring function determines the similarity between two hypotheses using motion and appearance cues. However, the algorithm is not reliable when applied to VOS, especially when there are large object deformations or sudden changes of object movement (see the carousel in Fig. 1 as an example). In such cases, the simple motion model of Kalman filtering breaks down and the appearance score becomes very brittle.
In this paper, we adapt MHT to VOS and propose a novel method called Multiple Hypotheses Propagation for Video Object Segmentation (MHP-VOS). Starting from the initial bounding box (bbox) of the object mask in the first frame, multiple hypotheses are spawned from the proposals of a class-agnostic detector within a novel motion-gated region, instead of relying on Kalman filtering. We also design a novel mask propagation score to replace the appearance similarity score, which can be brittle under large deformations in challenging cases. The mask propagation score, together with the motion score, determines the affinity between hypotheses during tree pruning. After pruning the proposal tree, the final instance segmentation is generated and propagated with a mask refinement CNN for each object of interest, and conflicts between objects are further handled with a novel mask merging strategy. Compared to state-of-the-art approaches, our method is much more robust and achieves the best performance on the DAVIS datasets.
Our main contributions are summarized as follows:
We adapt the multiple hypotheses tracking method to the VOS task, building a bbox proposal tracking tree for each object with new gating and pruning methods, which can be regarded as delaying decisions for global consideration.
We apply a motion model to proposal gating instead of using Kalman filtering, and design a novel hybrid pruning score of motion and mask propagation, both tailored for VOS tasks. We also design a novel mask merging strategy for multi-object tasks.
We conduct extensive experiments to show the effectiveness of our method in distinguishing similar objects, handling occluded and re-appearing objects, modeling long-term object deformations, etc., which are very difficult for previous approaches.
2 Related Work
In this section, we briefly summarize recent research related to our work, including semi-supervised video object segmentation and multiple hypotheses tracking.
Matching-based Video Object Segmentation. This type of approach generally utilizes the mask given in the first frame to extract appearance information for the objects of interest, which is then used to find similar objects in succeeding frames. Yoon et al. proposed a siamese network to match objects between frames in a deep feature space. Caelles et al. trained a parent network on still images and then finetuned the pre-trained network with one-shot online learning. To further improve the finetuning performance, Khoreva et al. synthesized more training data on the basis of the first frame to enrich the appearance variations. In addition, Chen et al. and Hu et al. used pixel-wise embeddings learned from supervision in the first frame to classify each pixel in succeeding frames. Cheng et al. proposed to track different parts of the target object to deal with challenges like deformations and occlusions.
Propagation-based Video Object Segmentation. Different from the appearance matching methods, mask propagation methods utilize temporal information to refine segmentation masks propagated from preceding frames. MaskTracker is a typical method following this line, which is trained on segmentation masks of static images with mask augmentation techniques. Hu et al. extended MaskTracker by applying active contours on optical flow to find motion cues. To overcome the problem of target loss when fast motion or occlusion occurs, methods [40, 38] combine temporal information from nearby frames to track the target. The CNN-in-MRF method embeds the mask propagation step into the inference of a spatiotemporal MRF model to further improve temporal coherency. Oh et al. applied instance detection to mask propagation using a siamese network without online finetuning on a given video. Another method that does not need online learning uses Conditional Batch Normalization (CBN) to gather spatiotemporal features.
Detection-based Video Object Segmentation. Object detection has been widely used to crop out the target from a frame before sending it to a segmentation model. Li et al. proposed the VS-ReID algorithm to detect missing objects in video object segmentation. Sharir et al. produced object proposals using Faster R-CNN to gather proper bounding boxes. Luiten et al. used Mask R-CNN to detect the supervised targets across frames and crop them as inputs to Deeplabv3+. Most detection-based works greedily select one proposal at each time step. In contrast, we keep multiple proposals at each time step and make decisions globally for the segmentation.
Multiple Hypotheses Tracking. The MHT method is widely used in the field of target tracking [3, 4]. The original hypotheses tracking algorithm demonstrated its usefulness in the context of visual tracking and motion correspondence, and the modern MHT proposed a scoring function to prune the hypothesis space efficiently and accurately, making it suitable for the current visual tracking context. Vazquez et al. first adopted MHT for the semantic video segmentation task, but without pruning. In our method, we adapt the approach to the class-agnostic video object segmentation scenario, where pruning relies on class-agnostic motion rules and propagation correspondences instead of unreliable appearance scores.
3 Method

The overall architecture of our proposed MHP-VOS is illustrated in Fig. 2. We first generate bbox object proposals for each frame with a class-agnostic detection approach (Sec. 3.1), and then apply multiple hypotheses propagation recurrently while building the hypotheses propagation tree (Sec. 3.2) with our novel gating and scoring strategies, filtering out disturbing hypotheses by N-scan pruning (Sec. 3.3) to introduce long-term knowledge into the hypotheses decision. To take advantage of spatial information between different objects in a sequence, the propagation trees for all objects are built at the same time. After acquiring the bounding box proposal associated with the best hypothesis for each object, we obtain the current mask for that object using a segmentation model (Sec. 3.4). At last, we merge the instance masks into a multi-object mask while handling inter-object conflicts (Sec. 3.5).
3.1 Proposal Generation
There are many approaches [34, 13] for detecting the target object in each video frame. In this paper, we take a Mask R-CNN network fine-tuned on each sequence as the base model to generate coarse object proposals, i.e., bboxes around the objects. Specifically, we reduce the category number of Mask R-CNN to a single class to make it class-agnostic for detecting foreground objects. Note that segmentation results from the mask branch are not used for VOS, as this branch shares the classification confidence, which is not suitable for the segmentation task. For each input frame, we extract coarse object bounding box proposals whose detection confidence exceeds a threshold, and apply a permissive non-maximum suppression threshold to retain all possible proposals for the mask proposal propagation in the next step. Here, we denote the proposals of a frame as the set of all boxes output by the detection step.
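The proposal filtering step above can be sketched as a confidence threshold followed by greedy NMS. This is a minimal illustration, not the authors' code; the threshold values are placeholders, and the point is that both are deliberately permissive so that the later tree pruning, not the detector, makes the final decision.

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_proposals(boxes, scores, conf_thresh=0.05, nms_thresh=0.9):
    """Keep every plausible proposal: a low confidence threshold and a
    high NMS threshold (both hypothetical values) retain candidates that
    a greedy tracker would discard."""
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thresh),
                   key=lambda i: -scores[i])
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < nms_thresh for j in kept):
            kept.append(i)
    return [boxes[i] for i in kept]
```

For example, two near-duplicate detections of the same object collapse to one proposal, while spatially distinct candidates all survive for the tree to consider.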
3.2 Hypotheses Tree Construction
After generating coarse object proposals, we construct the hypotheses propagation tree, whose data structure is designed as follows: each hypothesis node in the tree consists of a bounding box proposal and its corresponding mask hypothesis. For each target object, the tree starts from the ground-truth mask in the first frame, and is extended by appending child proposals from the next frame. In this child-spawning step, only proposals within a gated region are considered, and the mask hypothesis for each child proposal is obtained using the method detailed in Sec. 3.4. This process is repeated until the final hypotheses tree is constructed completely. In addition, each proposal outside the gated region is treated as the starting node of a new tree, to catch missing objects. During tree construction, a novel mask propagation score is recorded at each node and later used for tree pruning; it is more robust than an appearance score.
Gating. To build the hypotheses tree, we need to gate the most likely proposals in the next frame as child nodes, as shown in Fig. 3 (a). In general, the bounding box of an object in a frame depends on two main variables: its size $s$ and its center point $c$. Thus, the historical movements over the previous $n$ frames are adopted as prior knowledge to predict the probable bbox in the current frame. For the position prediction, the average velocity of the center over those frames is estimated as $v = (c_{t-1} - c_{t-n})/(n-1)$. The predicted center point is then obtained as $\hat{c}_t = c_{t-1} + v$, and the average of the recent sizes is taken as the predicted object size $\hat{s}_t$, since the change in size is tiny and smooth. The estimates $\hat{c}_t$ and $\hat{s}_t$ together give the bbox candidate for comparison in gating.
In order to filter out disturbing proposals, we gate the candidate proposals by computing the IoU between each candidate and the predicted bounding box, accepting a candidate only when the IoU exceeds a gating threshold. With the proposals chosen by gating, we can build up the propagation tree to simulate multiple hypotheses proposal propagation.
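The motion-gating step can be sketched as follows. This is an illustrative reading of the scheme above, with a hypothetical history length and gating threshold; bbox histories are `(cx, cy, w, h)` tuples, oldest first.

```python
def iou(a, b):
    # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def predict_box(history, n=3):
    """Predict the next bbox from the last n (cx, cy, w, h) entries:
    constant-velocity center plus averaged size."""
    recent = history[-n:]
    if len(recent) >= 2:
        # average per-frame velocity of the center (telescoping sum)
        vx = (recent[-1][0] - recent[0][0]) / (len(recent) - 1)
        vy = (recent[-1][1] - recent[0][1]) / (len(recent) - 1)
    else:
        vx = vy = 0.0
    cx, cy = recent[-1][0] + vx, recent[-1][1] + vy
    # sizes change slowly, so use the recent average
    w = sum(b[2] for b in recent) / len(recent)
    h = sum(b[3] for b in recent) / len(recent)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def gate(candidates, predicted, tau=0.3):
    """Keep only candidates overlapping the predicted box (tau is a
    placeholder gating threshold)."""
    return [c for c in candidates if iou(c, predicted) > tau]
```

A proposal near the extrapolated position passes the gate and becomes a child node; a far-away proposal instead seeds a new tree for a possibly re-appearing object.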
Scoring. In the propagation tree, each hypothesis is associated with a class-agnostic score for further pruning. The score is computed recurrently at each tree node as a weighted combination of a motion score and a mask propagation score, with two weights controlling the ratio between them. There is no Re-ID score involved, since Re-ID may cause ambiguity when objects of similar appearance exist.
For each bounding box proposal node in the propagation tree, the motion score is composed of two parts: a) the IoU between proposals of the same hypothesis in consecutive frames, which contributes positively to the decision; and b) the IoU between the current proposal of the track and the proposal nodes of other hypothesis tracks, which is expected to be small.
The motion score gives a good indication when the propagation track is smooth and continuous. However, it becomes unreliable when severe occlusion occurs. To handle such cases, the mask propagation score is proposed, which measures the quality of the segmentation propagated into the target proposal: the mask of the last frame is warped to the current frame with optical flow, and compared against the single-object mask segmentation obtained by the method of Sec. 3.4 for the proposal. The mask hypothesis is composed progressively along the bounding box proposals: it starts from the ground truth in the first frame and is propagated forward during the construction of the proposal tree, with each mask warped to the next frame as the prior mask for mask generation. For a tree newly started for a missing object, the mask of the tree root is obtained with a blank prior mask.
At last, the final score of a long-term hypothesis is computed recursively by accumulating the per-frame scores, weighted by the probability of detection.
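A minimal sketch of the two scoring ingredients, assuming flat binary masks and placeholder weights `alpha`/`beta` (the actual weighting and recursion of the paper are tuned in the ablation study):

```python
def mask_iou(a, b):
    """IoU of two flat binary masks (lists of 0/1) -- a simple stand-in
    for comparing the flow-warped mask against the generated mask."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

def hybrid_score(motion, propagation, alpha=0.5, beta=0.5):
    """Weighted combination of the motion score and the mask
    propagation score used for pruning (weights are placeholders)."""
    return alpha * motion + beta * propagation

def track_score(per_frame):
    """Accumulate per-frame (motion, propagation, det_prob) triples into
    a long-term hypothesis score, weighting each frame's evidence by its
    detection probability -- an additive reading of the recursion."""
    return sum(d * hybrid_score(m, p) for m, p, d in per_frame)
```

A high warped-vs-generated mask IoU keeps a hypothesis alive through occlusions where the motion score alone would fail.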
3.3 Hypotheses Tree Pruning
During the construction of the hypotheses tree, the number of hypothesis tracks increases exponentially during propagation, which leads to an explosion of memory and computation. Thus, we have to take a pruning step to limit the size of the tree. In other words, we need to determine the most likely propagation tracks in the long term.
Here, a proposal propagation hypothesis is a track path from the root to a leaf node in the propagation tree, and the hypothesis space for an object contains all such tracks.
To find the best tracks among the candidate propagation tracks, the task can be formulated as a Maximum Weighted Independent Set (MWIS) problem. For the track tree at the current frame, we build an undirected graph with each propagation hypothesis taken as a node. An edge connects a pair of hypotheses that share the same proposal at the same frame, which means the two hypotheses are conflicting and cannot co-exist in the final independent set. With the track score described in Eq. (8) as the weight of each track branch, we optimize to find the maximum weight independent set.
We utilize the existing phased local search (PLS) algorithm [32, 33, 2] to solve the MWIS optimization problem. We also take the N-scan pruning method to prune disturbing branches gradually instead of pruning the whole tree at once. First, we apply Eq. (9) to choose the maximum independent set as the best hypotheses from the hypothesis space, and then trace the nodes in the current frame back to their ancestor nodes N frames earlier, forming sub-trees. Finally, we prune all sub-trees except those containing the selected independent tracks. A larger N means a longer decision delay, which improves precision at the price of time efficiency. In addition, we limit the number of branches to keep the proposal tree from growing too large: if the number of branches at any node in any frame exceeds the limit, we retain the top branches by propagation score and prune the others.
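The MWIS selection can be illustrated with an exhaustive search over the conflict graph. The paper uses phased local search for scale; the brute force below is only a small, hypothetical sketch of the same objective, with tracks represented as lists of proposal ids per frame.

```python
from itertools import combinations

def conflicts(t1, t2):
    """Two tracks conflict if they claim the same proposal id at the
    same time step (an edge in the undirected conflict graph)."""
    return any(a == b for a, b in zip(t1, t2))

def best_hypotheses(tracks, weights):
    """Exhaustive maximum weight independent set: enumerate every
    subset of tracks, discard conflicting ones, keep the heaviest.
    Only viable for small trees; PLS replaces this at scale."""
    n = len(tracks)
    best, best_w = [], 0.0
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            if any(conflicts(tracks[i], tracks[j])
                   for i, j in combinations(subset, 2)):
                continue
            w = sum(weights[i] for i in subset)
            if w > best_w:
                best, best_w = list(subset), w
    return best, best_w
```

Two tracks that branch from the same proposal are mutually exclusive, so the selected set always assigns each proposal to at most one object hypothesis.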
3.4 Single Object Segmentation
We employ a Deeplabv3+ network with a ResNet101 backbone as our segmentation module, to generate segmentation results from bounding box proposals. Similar to MaskTracker, the segmentation network takes an additional rough mask as input, which is warped from the mask of the previous frame to the current frame using optical flow estimated by FlowNet2. This module is used to generate a mask hypothesis from each proposal during tree construction, and produces the final segmentation result once the best proposal for an object is obtained after tree pruning. Taking the final segmentation as an example, we crop the bounding box of a single object and its previous mask with a margin ratio, and then concatenate the cropped RGB image with the warped mask as a fourth input channel. After obtaining the segmentation probability map from Deeplabv3+, we threshold it to obtain the instance-specific mask for each of the objects in the sequence.
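The context crop fed to the segmentation network can be sketched as a margin expansion of the proposal box, clamped to the image; the margin ratio and image size here are placeholders, not the paper's values.

```python
def crop_box(box, margin=0.4, img_w=854, img_h=480):
    """Expand a bbox (x1, y1, x2, y2) by a margin ratio of its own
    width/height, then clamp to the image bounds, so the network sees
    the object plus surrounding context."""
    x1, y1, x2, y2 = box
    mw, mh = (x2 - x1) * margin, (y2 - y1) * margin
    return (max(0, x1 - mw), max(0, y1 - mh),
            min(img_w, x2 + mw), min(img_h, y2 + mh))
```

The same crop is applied to the warped previous mask so that the RGB channels and the fourth mask channel stay aligned.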
[Table 1: Quantitative comparison of OSMN, FAVOS, OSVOS, OnAVOS, OSVOS-S, CINM, and our method on the DAVIS2017 dataset.]
3.5 Conflicts Handling for Multiple Objects
To merge the instance-specific masks into the final multi-instance segmentation, we propose a merging strategy as shown in Algorithm 1. In general, there are two cases when deciding the id of each pixel in the final segmentation. For a pixel belonging to one object, we set the object id to be the same as that of the corresponding pixel among the single-instance masks. However, a pixel may belong to several objects at the same time when overlap conflicts happen between instance masks. To determine the object id for an overlapped region, we first take the top two candidate object ids, sorted by the corresponding values in the probability maps from Deeplabv3+. We then accept the object id with the higher probability only when there is a large margin between the two probability values, controlled by a marginal ratio. Otherwise, we take the temporal coherency of the warped mask into consideration, since it is ambiguous to rely on spatial information alone. Besides, a two-dimensional Gaussian map is generated from the proposal, parameterized by the width and height of the proposal, as prior knowledge to obtain a weighted mask that suppresses noise outside the region of interest.
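The conflict-resolution rule above can be sketched as follows. This is an illustrative simplification, not Algorithm 1 itself: probability maps and warped masks are nested lists, the margin is a placeholder, and the Gaussian prior weighting is omitted.

```python
def merge_masks(probs, warped, margin=0.2):
    """probs: dict object_id -> 2D probability map;
    warped: dict object_id -> 2D binary mask warped from the previous
    frame. Overlaps are resolved by probability margin, falling back to
    the temporal cue when the spatial evidence is ambiguous."""
    h = len(next(iter(probs.values())))
    w = len(next(iter(probs.values()))[0])
    out = [[0] * w for _ in range(h)]  # 0 = background
    for y in range(h):
        for x in range(w):
            cand = [(p[y][x], oid) for oid, p in probs.items()
                    if p[y][x] > 0.5]
            if not cand:
                continue
            cand.sort(reverse=True)
            if len(cand) == 1 or cand[0][0] - cand[1][0] > margin:
                out[y][x] = cand[0][1]  # clear spatial winner
            else:
                # ambiguous: prefer the object whose warped previous
                # mask already covers this pixel (temporal coherency)
                top_two = cand[:2]
                covered = [oid for _, oid in top_two if warped[oid][y][x]]
                out[y][x] = covered[0] if covered else top_two[0][1]
    return out
```

In the ambiguous branch, a pixel whose two candidate probabilities are close is assigned to the object that occupied it in the previous frame, which is exactly how temporal coherency breaks spatial ties.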
4 Experiments

In this section, we investigate the performance of our method on the standard benchmark datasets DAVIS2016 and DAVIS2017. We compare our model with state-of-the-art methods and perform an ablation study to demonstrate the advantage of each component of MHP-VOS.
4.1 Implementation Details
We first train the class-agnostic Mask R-CNN detector on the COCO dataset with pre-trained ImageNet weights, and then finetune it on the DAVIS dataset. Before testing, we finetune the parent model weights on each sequence with the corresponding synthetic in-domain image pairs of Lucid Dreaming. Coarse proposals are then selected with the confidence and NMS thresholds described in Sec. 3.1.
During the training of the Deeplabv3+ network with a ResNet101 backbone, we crop the bbox of the four-channel input using the spatial information of the annotation with a margin ratio. We then resize the cropped data to 512×512, jitter the image color, and train for 100 epochs on both the COCO and DAVIS [30, 31, 6] datasets. We use the BCEWithLogits loss function and the Adam optimizer, with a learning rate that decays by a power of 0.9 every 10 epochs. For fine-tuning, we only train the parent model on synthetic image pairs for 50 epochs, with the learning rate again decaying by a power of 0.9 every 10 epochs. We threshold the probability map to obtain the valid mask, and finally merge the instance masks as described in Sec. 3.5. In the N-scan pruning phase, we fix N and the branch limit as discussed in the ablation study. All experiments are run on a single NVIDIA 1080 GPU. The code is available at https://github.com/shuangjiexu/MHP-VOS.
4.2 Datasets and Evaluation
DAVIS2016. The DAVIS2016 dataset was recently proposed to evaluate VOS methods and contains 50 video sequences divided into training and validation parts. Each video sequence contains a single object, annotated with the corresponding mask throughout the sequence.
DAVIS2017. The DAVIS2017 dataset extends DAVIS2016 and is more challenging, with multiple objects corresponding to different targets. It provides an extra test-dev split with 30 challenging videos, which contain similar objects within the same video as well as object occlusions and disappearances across consecutive frames. Background clutter with appearance similar to the target objects is a further challenge.
Comparison to the State-of-the-art. Table 1 shows the quantitative comparison on the DAVIS2017 validation and test-dev sets, where MHP-VOS achieves state-of-the-art results on most evaluation metrics. On the validation set in particular, MHP-VOS beats all the latest methods and achieves a higher mean value. As illustrated in Table 1, our model also obtains strong results on the more challenging test-dev set, outperforming the state-of-the-art CINM by 2.1%, 2.2% and 2.0% on the respective metrics, with neither CRF nor MRF applied.
Improvement. Many previous works struggle with occlusion, similar objects, or fast motion. However, as shown in Fig. 5, our method handles these challenges well. In the case of similar objects like "carousel", where OSVOS mistakenly switches identities, our propagated proposals track the different instances well and identify each object. We also observe that our method is robust to fast motion and small instances, especially in the "monkeys-tree" sequence. For the occlusion problem, the segmentation on "salsa" remains identifiable, which demonstrates the strong representation power of our model. The performance on these challenging sequences is also illustrated in Fig. 4, where we achieve the state of the art on almost all videos.
Ablation Study. Table 2 shows how much each presented component contributes to the final result. We start from a baseline model that uses only the motion score for pruning, with no multiple hypotheses (N=1), no merge strategy (i.e., simply choosing the label with the larger probability on conflicts), and the traditional gating strategy. Results show that the hybrid scoring of motion and propagation achieves 4.8 higher than the original motion score alone. Multiple hypotheses and the conflict handling strategy each bring the largest improvement, of 7.6. Finally, our gating strategy brings another improvement of 2.2 over using the Kalman filter.
In the scoring phase, four hyper-parameters are introduced to balance the weights between the motion and propagation scores. We apply a grid search over these parameters with a fixed step; part of the grid search result is shown in Fig. 6, and MHP-VOS achieves its best result at the selected setting. In the proposal tree formation phase, we apply N-scan pruning, where the parameter N controls the delay of the proposal decision. In practice, N is an interesting parameter that trades off performance against speed. As shown in Table 3, a larger decision delay (larger N) yields a performance boost but is penalized in speed. We set N to achieve a balanced performance.
Weakness. Here we report typical failure cases on the DAVIS2017 test-dev set. In the first video sequence, the segmentation of the deer on the left (green) is partly missing, due to the similar appearance of the surrounding pixels: the instance detector may regard the body of the deer as part of the tree, and only generates a proposal for the head, which contrasts with the background. In the middle sequence, the racket is segmented well in earlier frames but missed later, because the proposed merging strategy classifies the identity of the overlapping region wrongly in this ambiguous case. In the last video, the person in yellow gradually switches to blue, meaning that the proposal for this person is propagated wrongly during tree building due to the two overlapping bounding boxes of these interfering objects.
As illustrated in Table 4, our method achieves strong results of 85.7%, 88.1% and 86.9%, outperforming the state-of-the-art OSVOS-S by 0.1%, 0.6% and 0.3% respectively. Compared to the classical MSK method, our MHP-VOS improves the global mean substantially, by 9.3%. Our performance is also better than many recent models, such as FAVOS and MoNet. Although our method performs well on the DAVIS2016 validation set, the improvement over the state-of-the-art models is not huge, because proposal propagation is not essential for single-object tracking and the CNN-based segmentation module alone is capable of locating the foreground instance. As shown in Fig. 5, each target object is segmented accurately even in motion blur or occlusion cases.
5 Conclusion

In this work, we presented a novel detection-based Multiple Hypotheses Propagation method (MHP-VOS) for semi-supervised video object segmentation. The key idea of MHP-VOS is that the decision for the proposal in each frame is delayed, so that ambiguity can be eliminated with long-term information. To this end, a hypothesis propagation tree was introduced to capture more potential proposals in each frame for tracking, together with a novel class-agnostic gating and scoring strategy adapted to the VOS scenario. In addition, a novel conflict handling method was proposed to extend MHP-VOS to the multiple-object setting. Our experiments evaluate the full pipeline and each component module, demonstrating significant performance gains over the state of the art.
-  L. Bao, B. Wu, and W. Liu. Cnn in mrf: Video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5977–5986, 2018.
-  L. Barth, B. Niedermann, M. Nöllenburg, and D. Strash. Temporal map labeling: A new unified framework with experiments. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 23, 2016.
-  S. Blackman and R. Popoli. Design and analysis of modern tracking systems (artech house radar library). Artech house, 1999.
-  S. S. Blackman. Multiple hypothesis tracking for multiple target tracking. IEEE Aerospace and Electronic Systems Magazine, 19(1):5–18, 2004.
-  S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  S. Caelles, A. Montes, K.-K. Maninis, Y. Chen, L. Van Gool, F. Perazzi, and J. Pont-Tuset. The 2018 davis challenge on video object segmentation. arXiv preprint arXiv:1803.00557, 1(2), 2018.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
-  L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611, 2018.
-  Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1189–1198, 2018.
-  J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang. Fast and accurate online video object segmentation via tracking parts. arXiv preprint arXiv:1806.02323, 2018.
-  I. J. Cox and S. L. Hingorani. An efficient implementation of reid’s multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(2):138–150, 1996.
-  J. Deng, W. Dong, R. Socher, and L. J. Li. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  P. Hu, G. Wang, X. Kong, J. Kuen, and Y.-P. Tan. Motion-guided cascaded refinement network for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1400–1409, 2018.
-  Y.-T. Hu, J.-B. Huang, and A. G. Schwing. Videomatch: Matching based video object segmentation. arXiv preprint arXiv:1809.01123, 2018.
-  E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.
-  A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. Lucid data dreaming for multiple object tracking. arXiv preprint arXiv:1703.09554, 2017.
-  C. Kim, F. Li, A. Ciptadi, and J. M. Rehg. Multiple hypothesis tracking revisited. In Proceedings of the IEEE International Conference on Computer Vision, pages 4696–4704, 2015.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  X. Li, Y. Qi, Z. Wang, K. Chen, Z. Liu, J. Shi, P. Luo, X. Tang, and C. C. Loy. Video object segmentation with re-identification. arXiv preprint arXiv:1708.00197, 2017.
-  T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, pages 740–755, 2014.
-  J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
-  J. Luiten, P. Voigtlaender, and B. Leibe. Premvos: Proposal-generation, refinement and merging for the davis challenge on video object segmentation 2018, 2018.
-  K.-K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
-  D. J. Papageorgiou and M. R. Salpukas. The maximum weight independent set problem for data association in multiple hypothesis tracking. In Optimization and Cooperative Control Strategies, pages 235–255. Springer, 2009.
-  C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters—improve semantic segmentation by global convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1743–1751, 2017.
-  F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016.
-  J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv:1704.00675, 2017.
-  W. Pullan and H. H. Hoos. Phased local search for the maximum clique problem. Journal of Combinatorial Optimization, 12(3):303–323, 2006.
-  W. Pullan. Optimisation of unweighted/weighted maximum independent sets and minimum vertex covers. Discrete Optimization, 6(2):214–219, 2009.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):1137–1149, 2017.
-  G. Sharir, E. Smolyansky, and I. Friedman. Video object segmentation using tracked object proposals. arXiv preprint arXiv:1707.06545, 2017.
-  A. Vazquez-Reina, S. Avidan, H. Pfister, and E. Miller. Multiple hypothesis video segmentation from superpixel flows. In Proceedings of the European Conference on Computer Vision, pages 268–281, 2010.
-  P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. arXiv preprint arXiv:1706.09364, 2017.
-  S. Wang, Y. Zhou, J. Yan, and Z. Deng. Fully motion-aware network for video object detection. In Proceedings of the European Conference on Computer Vision, pages 542–557, 2018.
-  S. Wug Oh, J.-Y. Lee, K. Sunkavalli, and S. Joo Kim. Fast video object segmentation by reference-guided mask propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7376–7385, 2018.
-  H. Xiao, J. Feng, G. Lin, Y. Liu, and M. Zhang. Monet: Deep motion exploitation for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1140–1148, 2018.
-  L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos. Efficient video object segmentation via network modulation. algorithms, 29:15, 2018.
-  J. S. Yoon, F. Rameau, J. Kim, S. Lee, S. Shin, and I. S. Kweon. Pixel-level matching for video object segmentation using convolutional neural networks. In IEEE International Conference on Computer Vision, pages 2186–2195, 2017.