Video object segmentation (VOS) is the task of segmenting foreground objects from the background across all frames of a video clip. VOS methods can be classified into two categories: semi-supervised and unsupervised. Semi-supervised VOS methods [23, 34, 35, 2, 28] require the ground-truth segmentation mask of the first frame as input and then segment the annotated object in the remaining frames. Unsupervised VOS methods [15, 32, 12, 4, 17, 18] identify and segment the main object in the video automatically.
These methods have achieved great success due to the emergence of deep neural networks such as the fully convolutional network (FCN). The one-shot video object segmentation (OSVOS) method  uses large classification datasets in pretraining and applies the foreground/background segmentation information obtained from the first frame to object segmentation in the remaining frames of the video clip. It converts an image-based segmentation method into a semi-supervised video-based segmentation method by processing each frame independently without using temporal information.
However, since manual annotation is expensive, it is desirable to develop the more challenging unsupervised VOS solution. This is feasible due to the following observation. Vision studies  suggest that moving objects attract the attention of infants and young animals, who can group things properly without knowing what kinds of objects they are. Furthermore, we tend to group moving objects and separate them from the background and other static objects. In other words, semantic grouping is acquired after motion-based grouping in the VOS task.
In this paper, we propose to tag the main object in a video clip by combining motion information with instance segmentation results. We use optical flow to group segmented pixels into a single object as the pseudo ground truth and then take it as the first-frame mask to perform OSVOS. The pseudo ground truth is the estimated object mask for the first frame, replacing the true ground truth used in semi-supervised VOS methods. The main idea is sketched below. We apply a powerful instance segmentation algorithm, Mask R-CNN, to the first frame of a video clip as shown in Figure 1, where different objects have different labels. Then, we extract optical flow from the first two frames and select and group different instance segmentations to estimate the foreground object. Next, we finetune a pretrained CNN using the estimated foreground object from the first frame as the pseudo ground truth and propagate the foreground/background segmentation to the remaining frames of the video one frame at a time. Finally, we achieve state-of-the-art performance on the benchmark datasets by incorporating online adaptation . Example results are shown in Figure 2.
Our goal is to segment the primary video object without manual annotations. The proposed method does not use the temporal information of the whole video clip at once but processes one frame at a time, so errors in individual frames do not propagate over time. As a result, the proposed method has higher tolerance against occlusion and fast motion. We evaluate the proposed method extensively on the DAVIS dataset  and the FBMS dataset . Our method gives state-of-the-art performance on both, with a mean intersection-over-union (IoU) of 79.3% on DAVIS and 77.9% on FBMS.
The main contributions of this work are summarized below. First, we introduce a novel unsupervised video object segmentation method that combines instance segmentation and motion information. Second, we transfer a recent semi-supervised network architecture to the unsupervised setting. Finally, the proposed method outperforms state-of-the-art unsupervised methods on several benchmark datasets.
The rest of this paper is organized as follows. Related work is reviewed in Sec. 2. Our novel unsupervised video object segmentation method is proposed in Sec. 3. Experimental results are shown in Sec. 4. Finally, concluding remarks are given in Sec. 5.
2 Related Work
2.0.1 Instance segmentation.
Many video object segmentation methods [35, 12, 28, 2] are based on semantic segmentation networks  for static images. State-of-the-art semantic segmentation techniques are dominated by fully convolutional networks [22, 3]. Semantic segmentation labels all objects of the same category with one mask, while instance segmentation provides a segmentation mask for each instance independently. One key reason that deep learning based instance segmentation has developed so rapidly is the availability of large datasets with instance mask annotations such as COCO. However, it is difficult to annotate all categories of objects for supervised training, and it is even more difficult to extend image instance segmentation to video instance segmentation due to the lack of large-scale manually labeled video instance segmentation datasets. In contrast, we focus on generic object segmentation in video, and we do not care whether the object category is in the training dataset or not. We propose a method that transfers image instance segmentation results to enable finetuning of a pretrained fully convolutional network.
2.0.2 Semi-supervised video object segmentation.
Semi-supervised VOS requires a manual label for the first frame and then propagates it to the rest of the video. The annotation provides a good initialization for the object appearance model, and the problem can be considered as foreground/background segmentation guided by the first-frame annotation. Deep learning approaches have achieved higher performance ; most recent works are based on OSVOS  and MaskTrack . OSVOS creates a new model for each new video, initialized with the pretrained model and finetuned on the first frame without using any temporal information. OSVOS treats each frame independently, while MaskTrack considers the relationship between consecutive frames when training the network. Lucid data dreaming  proposed a data augmentation technique that cuts out the foreground, in-paints the background, perturbs both foreground and background, and finally reconstructs the frames. VOS with re-identification  adds a re-identification step to recover instances lost over the long term by feeding the network a cropped patch containing the object instead of the entire image. OnAVOS  proposed an online finetuning approach that segments future frames based on the first-frame annotation and previously predicted segmentations.
2.0.3 Unsupervised video object segmentation.
Unsupervised VOS methods discover the primary object in a video and assume no manual annotations. Some approaches formulate segmentation as a foreground/background labeling problem, using tools such as Gaussian mixture models and graph cuts [23, 34]. ARP  proposed an unsupervised video object segmentation approach that iteratively refines the segmentation by augmenting it with missing parts or reducing it by excluding noisy parts. Recently, more CNN-based approaches identify the primary object by using motion boundaries and saliency maps [32, 12]. LMP  trains an encoder-decoder architecture using ground-truth optical flow and motion segmentation, then refines the result with an objectness map. Both LVO  and FSEG  use two-stream fully convolutional neural networks that combine appearance features and motion features; LVO further improves the segmentation by forwarding the features to a bidirectional convolutional GRU, while FSEG fuses the two streams and trains end-to-end. The unsupervised approach is more desirable since it needs no human interaction, and we focus on unsupervised VOS in this paper.
3 Proposed Method
Our goal is to segment the generic object in a video in an unsupervised manner. Semi-supervised VOS needs the ground-truth label of the first frame. Inspired by the semi-supervised approach, we propose a method that tags the “pseudo ground truth”, takes it as input to a pretrained network, and then outputs segmentation masks for the rest of the video. To the best of our knowledge, this is the first attempt to transfer a semi-supervised VOS approach to the unsupervised setting by utilizing a “pseudo ground truth”. Figure 3 shows an overview of the proposed method, which includes three key components: the criterion to tag the primary object, the appearance model, and online adaptation.
3.1 Learning to tag the foreground object
3.1.1 Image instance segmentation.
We apply an image-based instance segmentation algorithm to the first frame of the given video. Specifically, we choose Mask R-CNN  as our instance segmentation framework and generate instance masks. Our error analysis further demonstrates that better initial instance segmentations improve the performance by a large margin, which suggests that the proposed method has the potential to improve further with more advanced instance segmentation methods.
Mask R-CNN is a simple yet high-performance instance segmentation model. It adds an additional FCN mask branch to the original Faster R-CNN  model. The mask branch and the bounding-box branch are trained simultaneously, while at inference time the instance masks are generated from the detection results. The box prediction branch generates bounding boxes based on the proposals, followed by non-maximum suppression. The mask branch is then applied to predict segmentation masks from the 100 detection boxes with the highest scores. This step speeds up inference and improves accuracy, in contrast to the parallel computation used during training. For each region of interest (RoI), the mask branch predicts n masks, where n is the number of classes in the training set, and only the k-th mask, where k is the class predicted by the classification branch, is used.
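The per-RoI mask selection described above can be sketched as follows; this is an illustrative NumPy sketch, and the function name and array shapes are our assumptions, not Mask R-CNN's actual API.

```python
import numpy as np

def select_instance_masks(mask_logits, class_ids):
    """For each RoI, keep only the mask of the class predicted by the
    classification branch (a sketch; names and shapes are assumptions).

    mask_logits: (R, n, H, W) array, one mask per class per RoI.
    class_ids:   length-R sequence of predicted class indices.
    """
    return np.stack([mask_logits[r, k] for r, k in enumerate(class_ids)])
```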
We note that the mask branch generates class-specific instance segmentation masks for a given image, whereas VOS focuses on class-agnostic object segmentation. Our experiments show that even though Mask R-CNN can only generate labels for the limited classes of COCO  and PASCAL , it still outputs instance segmentation masks with the closest class label for unseen categories. Since our algorithm then collapses all classes into a single foreground class, such misclassification has little influence on the performance of VOS.
3.1.2 Optical flow thresholding.
There are two important cues in video object segmentation: appearance and motion. To use information from both the spatial and temporal domains, we incorporate optical flow with instance segmentation to learn to segment the primary object. Instance segmentation generates precise class-specific segmentation masks, but it cannot determine the primary object in the video. Optical flow can separate moving objects from the background, but optical flow estimation is still far from perfect. Motivated by the observation that moving objects attract people's attention , we use motion information to select and group the static-image instance segmentation proposals, which takes advantage of the merits of both optical flow and instance segmentation. We apply the Coarse2Fine optical flow algorithm  to extract the optical flow between the first and second frames of a given video clip. To combine it with the instance segmentation proposals, we normalize the flow magnitude and then threshold it, motivated by the observation that faster motions are more likely to attract attention.
We select instance segmentation proposals that have more than 80% overlap with the optical flow mask, and then group the selected proposal masks of different class labels into a single, unlabeled foreground class. In image-based instance segmentation, the same object may be separated into different parts due to differences in color and texture or the influence of occlusion. Our grouping efficiently merges the different parts into one primary object without knowing the categories of the objects. We name this foreground object the “pseudo ground truth” and forward it to the pretrained appearance model. Sample “pseudo ground truths” are shown in Figure 1.
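The thresholding-and-grouping step can be sketched as below. The 80% overlap ratio is from the text; the normalized-magnitude threshold of 0.5 is an illustrative choice, not the paper's actual setting.

```python
import numpy as np

def tag_pseudo_ground_truth(instance_masks, flow, mag_thresh=0.5, overlap=0.8):
    """Select and group instance proposals by motion (a sketch;
    mag_thresh is an assumed value).

    instance_masks: list of (H, W) boolean masks from instance segmentation.
    flow:           (H, W, 2) optical flow between the first two frames.
    """
    mag = np.linalg.norm(flow, axis=-1)
    mag = mag / (mag.max() + 1e-8)        # normalize the flow magnitude
    moving = mag > mag_thresh             # binary motion mask
    fg = np.zeros(moving.shape, dtype=bool)
    for m in instance_masks:
        # keep proposals that overlap the motion mask by more than 80%
        if m.any() and (m & moving).sum() / m.sum() >= overlap:
            fg |= m                       # group into one foreground object
    return fg
```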
3.2 Unsupervised video object segmentation
Our proposed method is built on one-shot video object segmentation (OSVOS) , which finetunes the pretrained appearance model on the first annotated frame. We replace the first annotated frame with our estimated “pseudo ground truth” so that the semi-supervised network architecture can be used in our approach. Our goal is to train a ConvNet to segment a generic object in a video.
3.2.1 Network overview.
We adopt a recent ResNet  architecture pretrained on ImageNet  and MS-COCO  to learn powerful features. In more detail, the network uses model A of the wider ResNet with 38 hidden layers as the backbone. Since the DAVIS training data are scarce, we further pretrain the network on PASCAL  by mapping all 20 class labels to one foreground label while keeping the background unchanged. As demonstrated in OnAVOS , the two finetuning steps on PASCAL and DAVIS are complementary. Hence, we finetune the network on the DAVIS training set to obtain the final pretrained network. All the above training steps are performed offline to construct a model that identifies the foreground object. At inference time, we finetune the network on the “pseudo ground truth” of the first frame to tell the network which object to segment. However, the first frame does not provide all the information needed throughout the whole video, so online adaptation is required at test time.
3.2.2 Online adaptation.
The major difficulty in video object segmentation is that the appearance may change dramatically throughout the video. A model learned only from the first frame cannot address severe appearance changes. Therefore, online adaptation is needed to exploit the information in the remaining frames during inference.
We adopt the test-time data augmentation of Lucid Data Dreaming  and the online adaptation approach of OnAVOS  to perform online finetuning. We generate augmentations of the first frame using the Lucid Data Dreaming approach. As each frame is segmented, foreground pixels with high-confidence predictions are taken as further positive training examples, while pixels far away from the last assumed object position are taken as negative examples. An additional round of finetuning is then performed on the newly acquired data.
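The positive/negative pixel selection for online adaptation can be sketched as follows; both threshold values are illustrative assumptions, not the settings used in the paper, and the distance computation is a brute-force stand-in for a proper distance transform.

```python
import numpy as np

def select_online_examples(fg_prob, last_mask, pos_thresh=0.97, dist_thresh=20.0):
    """Pick online-adaptation training pixels for one frame (a sketch;
    both thresholds are illustrative values).

    fg_prob:   (H, W) predicted foreground probability.
    last_mask: (H, W) boolean mask of the last assumed object position.
    """
    # high-confidence foreground predictions become positive examples
    positives = fg_prob > pos_thresh
    # brute-force distance of every pixel to the last object mask
    yy, xx = np.indices(last_mask.shape)
    pts = np.argwhere(last_mask)          # (M, 2) object pixel coordinates
    dist = np.sqrt((yy[..., None] - pts[:, 0]) ** 2 +
                   (xx[..., None] - pts[:, 1]) ** 2).min(axis=-1)
    # pixels far from the last object position become negative examples
    negatives = dist > dist_thresh
    return positives, negatives
```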
4 Experiments

4.1 Datasets and evaluation metrics

To evaluate the effectiveness of the proposed method, we conduct experiments on three challenging video object segmentation datasets: DAVIS , the Freiburg-Berkeley Motion Segmentation (FBMS) dataset , and SegTrack-v2 . We use the region similarity, defined as the intersection-over-union (IoU) between the estimated segmentation and the ground-truth mask, and the F-score evaluation protocol to estimate the accuracy.
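The region similarity metric is simply the IoU of two binary masks; a minimal sketch:

```python
import numpy as np

def region_similarity(pred, gt):
    """Intersection-over-union between a predicted and a ground-truth
    binary mask; returns 1.0 when both masks are empty, by convention."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```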
We provide a detailed introduction to the evaluation benchmarks below.
The DAVIS dataset is composed of 50 high-definition video sequences, 30 in the training set and the remaining 20 in the validation set. There are 3,455 densely annotated, pixel-accurate frames in total. The videos contain challenges such as occlusion, motion blur, and appearance changes. Only the primary moving objects are annotated in the ground truth.
The Freiburg-Berkeley motion segmentation dataset is composed of 59 video sequences with 720 annotated frames. In contrast to DAVIS, several of its videos contain multiple moving objects with instance-level annotations. We do not train on any of these sequences and evaluate using mIoU and F-score. We also convert the instance-level annotations to binary ones by merging all foreground annotations, as in .
The SegTrack-v2 dataset contains 14 videos with a total of 1,066 pixel-level annotated frames. For videos with multiple objects, each with an individual ground-truth segmentation, each object is segmented in turn, treating the task as segmenting that object from the background.
4.2 Implementation details
We jointly use optical flow and semantic instance segmentation to group foreground objects that move together into a single object. We use the optical flow from a re-implementation of Coarse2Fine optical flow . We implement the objectness network in TensorFlow  with the wider ResNet  (38 hidden layers) as the backbone. The segmentation network is simple, without upsampling, skip connections, or multi-scale structures. In some convolution layers, we increase the dilation rates and remove the corresponding down-sampling operations to generate score maps at 1/8 resolution. The large field-of-view setting of Deeplabv2  is used to replace the top linear classifier and global pooling layer of the classification network. In addition, the batch normalization layers are frozen during finetuning.
We adopt the initial network weights provided by the repository, which were pretrained on ImageNet and COCO. We further finetune the objectness network on the augmented PASCAL VOC ground truth from , with a total of 12,051 training images. Note that we merge all foreground objects in an image into one single foreground object and keep the background unchanged.
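The 20-to-1 class mapping used here can be sketched as follows (the ignore label 255 follows the standard PASCAL VOC annotation encoding):

```python
import numpy as np

def pascal_to_binary(label, ignore=255):
    """Map the 20 PASCAL VOC foreground classes (1..20) to a single
    foreground label 1, keeping background (0) and the ignore label
    unchanged, as done when pretraining the objectness network."""
    return np.where((label > 0) & (label != ignore), 1, label)
```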
For the DAVIS dataset evaluation, we further train the network on the DAVIS training set and then apply one-shot finetuning on the first frame with the “pseudo ground truth”. The segmentation network is trained on the first-frame image/“pseudo ground truth” pair by Adam with learning rate . We set the number of finetuning iterations n on the first frame to 100; we found that a relatively small n improves the accuracy, which is the opposite of the semi-supervised VOS setting. For the online part, we use the default parameters of OnAVOS , setting the number of finetuning iterations to 15, the finetuning interval to 5 frames, and the learning rate to , and adopt the CRF parameters from DeepLab . For completeness, we also conduct experiments on the FBMS and SegTrack-v2 datasets; for FBMS we follow the same procedure as for DAVIS. To check the effectiveness of the “pseudo ground truth”, we perform only the one-shot branch for SegTrack-v2, without online adaptation.
4.3 Comparison with state-of-the-art methods
We compare our proposed approach with state-of-the-art unsupervised techniques, NLC , LMP , FSEG , LVO , and ARP , in Table 1. We achieve the best performance for unsupervised video object segmentation: 3.1% higher than the second best, ARP. In addition, we achieve an mIoU of 71.2% on the DAVIS validation set when extracting the pseudo ground truth on every frame of a given video. When we break down the performance per DAVIS sequence, we outperform on the majority of the videos, as shown in Table 2; for drift-straight, libby and scooter-black in particular, our results are more than 10% higher than the second best. Our approach can segment unknown object classes that are not in the PASCAL/COCO vocabulary. The goat in the third row is an unseen category in the training data, so the closest semantic category, horse, is matched instead. Note that our algorithm only needs the foreground mask, without knowing the specific category, and still performs better than state-of-the-art methods. Our method performs even better when the object classes are in MS COCO: the top two rows show a single instance segmentation under large appearance changes (first row) and viewing-angle and gesture changes (second row). The bottom two rows show that our algorithm works well when merging multiple object masks into one single mask under viewing-angle changes (fourth row) and a cluttered background (fifth row).
To verify where the improvements come from, we use a backbone similar to that of previous methods. We test OSVOS  by replacing the first-frame annotations with pseudo ground truths. OSVOS uses the VGG architecture, and we set the number of first-frame finetuning iterations to 500 without applying boundary snapping. The mIoUs of our approach and the original OSVOS are 72.3% and 75.7%, respectively. Our approach with the VGG architecture still outperforms FSEG (70.7%) without online adaptation, CRF, or test-time data augmentation.
We further analyze the number of finetuning iterations on the first frame for both the semi-supervised and unsupervised approaches in Table 3. The second column shows that, for the semi-supervised approach, the mIoU improves as the number of finetuning iterations increases, indicating that more finetuning on image/ground-truth pairs yields better predictions. The right two columns show a different relationship between mIoU and the number of first-frame finetuning iterations for the unsupervised approach. Both settings achieve the highest performance with 100 finetuning iterations, indicating that the model learns best with a moderate amount of finetuning since the pseudo ground truth is not as accurate as the true ground truth.
| Finetuning times | Semi-supervised oneshot | Unsupervised oneshot | Unsupervised online |
We evaluate the proposed approach on the test set, with 30 sequences in total. The results are shown in Table 4. Our method leads in both evaluation metrics, with an F-score of 85.1%, which is 7.3% higher than the second best method, LVO, and an mIoU of 77.9%, which is 18.1% better than ARP , the second best on DAVIS. Figure 5 shows qualitative results: our algorithm performs well on most of the sequences. The last row shows a failure case on rabbits04, where there are severe occlusions and the rabbit is also an unseen category in MS COCO. To recover a better prediction mask, further motion information should be used to address this problem.
| NLC | FST | CVOS | MP-Net-V | LVO | ARP | Ours |
Our method achieves an mIoU of 58.7% on this dataset, higher than other methods that do well on DAVIS: CUT  (47.8%), FST  (54.3%), and LVO  (57.3%). Note that we did not apply online adaptation on this dataset, which could further improve the performance. Our method performs worse than NLC  (67.2%) due to the low resolution of SegTrack-v2 and the fact that NLC was designed and evaluated on this dataset; we outperform NLC on both FBMS and DAVIS by a large margin. Figure 6 shows qualitative results of the proposed method on SegTrack-v2. These visual results demonstrate the effectiveness of our approach even when the object category does not exist in MS COCO  or PASCAL VOC 2012 . The exact category is not needed in our approach, as long as the foreground object is consistent throughout the video. The worm in the third row cannot be detected by the instance segmentation algorithm; in this case, the thresholded flow magnitude is used as the pseudo ground-truth mask instead.
4.4 Ablation studies
Table 5 presents our ablation study on the DAVIS 2016 validation set for the three major components: online adaptation, CRF , and test-time data augmentation. The baseline, ours-oneshot in Table 5, is the wider ResNet trained on the PASCAL VOC 2012 dataset and the DAVIS 2016 training set. Online adaptation provides a 1.4% improvement over the baseline in mIoU. Additional CRF post-processing brings a further 1.1% boost. Combining with test-time data augmentation (TTDA) gives the best performance of 79.3% mIoU, 3.5% higher than the baseline without any post-processing.
Figure 7 shows qualitative comparisons of the oneshot and online approaches on the video sequences camel and car-roundabout. The online approach outperforms the oneshot approach on car-roundabout in the second row because the bottom-right pixels are treated as negative training examples based on the previous frames; the additional round of finetuning on the newly acquired data removes the false-positive masks. The first row shows a failure case for both approaches: both wrongly predict the foreground mask when the moving camel walks across the static camel. This example shows the weakness of oneshot approaches, which propagate through the whole video without using motion information.
| Ours-oneshot | +Online | +Online +CRF | +Online +CRF +TTDA |
4.4.1 Error analysis.
To analyze the effect of first-frame tagging, we apply OSVOS to the entire DAVIS validation set using the pseudo ground truth and the true ground truth of the first frame, respectively; the mIoUs of the entire dataset and two difficult sequences are shown in Table 6. The mIoU over the entire DAVIS validation set is 5.5% lower when using the pseudo ground truth of the first frame. This demonstrates that a more accurate first-frame mask yields better segmentation masks for the remaining frames of the video, which shows the potential performance improvement from more advanced tagging techniques.
We also erode and dilate the pseudo ground truth by 5 pixels and use the eroded and dilated masks as new pseudo ground truths when applying OSVOS to the videos. The performance degrades by 3.2% to 10.9% compared with the original pseudo ground truth. This demonstrates that accurate tagging is the key component of our tagging-and-segmenting approach.
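The 5-pixel erosion/dilation perturbation can be reproduced with simple binary morphology; the sketch below uses a 4-connected structuring element in plain NumPy (an assumption — the paper does not specify the structuring element).

```python
import numpy as np

def erode(mask, iters=5):
    """Binary erosion with a 4-connected element, applied `iters` times."""
    for _ in range(iters):
        p = np.pad(mask, 1)               # pad the border with False
        mask = (p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1]
                & p[1:-1, :-2] & p[1:-1, 2:])
    return mask

def dilate(mask, iters=5):
    """Binary dilation with a 4-connected element, applied `iters` times."""
    for _ in range(iters):
        p = np.pad(mask, 1)
        mask = (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
                | p[1:-1, :-2] | p[1:-1, 2:])
    return mask
```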
| Sequences | Erosion | Dilation | Pseudo Ground Truth | Ground Truth |
5 Conclusion and future work
In this paper, we presented a simple yet intuitive approach for unsupervised video object segmentation. Specifically, instead of manually annotating the first frame as in existing semi-supervised methods, we proposed to automatically generate an approximate annotation, the pseudo ground truth, by jointly employing instance segmentation and optical flow. Experimental results on DAVIS, FBMS and SegTrack-v2 demonstrate that our approach enables an effective transfer from semi-supervised to unsupervised VOS and improves mask prediction by a large margin. Our error analysis shows that better instance segmentation gives a dramatic performance boost, indicating great potential for further improvement. Our approach can also extend from single-object tracking to tracking multiple arbitrary objects based on category-agnostic ground truths or pseudo ground truths.
References

-  Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: A system for large-scale machine learning. In: OSDI. vol. 16, pp. 265–283 (2016)
-  Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: CVPR 2017. IEEE (2017)
-  Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915 (2016)
-  Cheng, J., Tsai, Y.H., Wang, S., Yang, M.H.: Segflow: Joint learning for video object segmentation and optical flow. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 686–695. IEEE (2017)
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. pp. 248–255. IEEE (2009)
-  Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88(2), 303–338 (2010)
-  Faktor, A., Irani, M.: Video segmentation by non-local consensus voting. In: BMVC. vol. 2, p. 8 (2014)
-  Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Simultaneous detection and segmentation. In: European Conference on Computer Vision. pp. 297–312. Springer (2014)
-  He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Computer Vision (ICCV), 2017 IEEE International Conference on. pp. 2980–2988. IEEE (2017)
-  Huang, Q., Xia, C., Li, S., Wang, Y., Song, Y., Kuo, C.C.J.: Unsupervised clustering guided semantic segmentation. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1489–1498. IEEE (2018)
-  Huang, Q., Xia, C., Wu, C., Li, S., Wang, Y., Song, Y., Kuo, C.C.J.: Semantic segmentation with reverse attention. arXiv preprint arXiv:1707.06426 (2017)
-  Jain, S.D., Xiong, B., Grauman, K.: Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos. arXiv preprint arXiv:1701.05384 2(3), 6 (2017)
-  Keuper, M., Andres, B., Brox, T.: Motion trajectory segmentation via minimum cost multicuts. In: Computer Vision (ICCV), 2015 IEEE International Conference on. pp. 3271–3279. IEEE (2015)
-  Khoreva, A., Benenson, R., Ilg, E., Brox, T., Schiele, B.: Lucid data dreaming for object tracking. arXiv preprint arXiv:1703.09554 (2017)
-  Koh, Y.J., Kim, C.S.: Primary object segmentation in videos based on region augmentation and reduction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. vol. 1, p. 6 (2017)
-  Li, F., Kim, T., Humayun, A., Tsai, D., Rehg, J.M.: Video segmentation by tracking many figure-ground segments. In: Computer Vision (ICCV), 2013 IEEE International Conference on. pp. 2192–2199. IEEE (2013)
-  Li, S., Seybold, B., Vorobyov, A., Fathi, A., Huang, Q., Kuo, C.C.J.: Instance embedding transfer to unsupervised video object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6526–6535 (2018)
-  Li, S., Seybold, B., Vorobyov, A., Lei, X., Jay Kuo, C.C.: Unsupervised video object segmentation with motion-based bilateral networks. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 207–223 (2018)
-  Li, X., Qi, Y., Wang, Z., Chen, K., Liu, Z., Shi, J., Luo, P., Tang, X., Loy, C.C.: Video object segmentation with re-identification. arXiv preprint arXiv:1708.00197 (2017)
-  Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
-  Liu, C., et al.: Beyond pixels: exploring new representations and applications for motion analysis. Ph.D. thesis, Massachusetts Institute of Technology (2009)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)
-  Märki, N., Perazzi, F., Wang, O., Sorkine-Hornung, A.: Bilateral space video segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 743–751 (2016)
-  Ochs, P., Malik, J., Brox, T.: Segmentation of moving objects by long term video analysis. IEEE transactions on pattern analysis and machine intelligence 36(6), 1187–1200 (2014)
-  Papazoglou, A., Ferrari, V.: Fast object segmentation in unconstrained video. In: Computer Vision (ICCV), 2013 IEEE International Conference on. pp. 1777–1784. IEEE (2013)
-  Pathak, D., Girshick, R., Dollár, P., Darrell, T., Hariharan, B.: Learning features by watching objects move. In: Proc. CVPR. vol. 2 (2017)
-  Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Computer Vision and Pattern Recognition (2016)
-  Perazzi, F., Khoreva, A., Benenson, R., Schiele, B., Sorkine-Hornung, A.: Learning video object segmentation from static images. In: Computer Vision and Pattern Recognition (2017)
-  Port, R.F., Van Gelder, T.: Mind as motion: Explorations in the dynamics of cognition. MIT press (1995)
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. pp. 91–99 (2015)
-  Taylor, B., Karasev, V., Soatto, S.: Causal video object segmentation from persistence of occlusions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4268–4276 (2015)
-  Tokmakov, P., Alahari, K., Schmid, C.: Learning motion patterns in videos. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 531–539. IEEE (2017)
-  Tokmakov, P., Alahari, K., Schmid, C.: Learning video object segmentation with visual memory. arXiv preprint arXiv:1704.05737 (2017)
-  Tsai, Y.H., Yang, M.H., Black, M.J.: Video segmentation via object flow. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3899–3908 (2016)
-  Voigtlaender, P., Leibe, B.: Online adaptation of convolutional neural networks for video object segmentation. arXiv preprint arXiv:1706.09364 (2017)
-  Wu, Z., Shen, C., Hengel, A.v.d.: Wider or deeper: Revisiting the resnet model for visual recognition. arXiv preprint arXiv:1611.10080 (2016)