1 Introduction
Discovering objects in videos, as they move and change appearance over space and time, is one of the most challenging and still unsolved problems in computer vision. It impacts the way we learn about objects and how we process large amounts of video data. One of our core scientific goals is to understand how much we can learn automatically from a video about its main objects of interest. We are interested in how object properties relate in both space and time, and in how we can exploit these consistencies in order to discover objects fast and accurately.
We formulate and validate two main assumptions: 1) pixels that belong to the same object are highly likely to be connected through long range optical flow chains, as we describe in detail later; the main desirable property of optical flow is its ability to track the same points from one frame to the next; 2) pixels belonging to the same object are also likely to have similar motion and distinctive appearance patterns in spacetime. What moves together belongs together; what looks alike is also likely to be related. While these ideas are not new, we use them to define a novel spacetime graph structure with motion and appearance constraints, which enables us to solve the problem efficiently.
We define a node for each pixel in the video, which would normally make the problem intractable. Despite this challenge, we propose a fast optimization algorithm, able to find the object of interest as a strong cluster of pixels in the spacetime graph. While the nodes are pixels in the video, long range edges are established by connecting optical flow chains. While optical flow ultimately defines the structure of the graph, it is the features along the motion chains that ensure that nodes with similar motion and appearance receive similar labels. These features could be simple (e.g. color and motion cues) or learned, which makes our formulation suitable for both the unsupervised and the supervised case. Depending on the specific features and cues used, we could in principle discover any object in the video, by simply using cues that favor one object over the others. While our formulation uses a graph representation, the graph itself is only used implicitly. The optimization algorithm, which provides the solution, is based on a fast fixed-point iteration process that does not access the full graph explicitly. That makes our approach efficient in practice and suitable for real-world applications.
Main contribution: while there are many papers on object discovery in video that use different local features and optical flow, we are the first to introduce a graph structure in space and time, at the pixel level, that couples long range motion and appearance in a single data clustering formulation, with a fast optimization algorithm. While directly working with such a graph is not feasible in practice, our algorithm does not access the full, huge graph explicitly, but only through a sequence of efficient fixed-point iterations that ensure both improvement of the objective score and convergence.
Related work: video object segmentation is increasingly popular in the literature, with a relatively wide range of methods being proposed. The specific tasks related to this problem differ in the amount of supervision, the precision of the segmentation (bounding box vs. precise segmentation) and the number of target objects.
One of the most important aspects that differentiate approaches and tasks is the amount of supervision used, which can vary from the complete absence of any human-supervised cues to the use of features pretrained in a supervised manner. In some scenarios, methods may even have access to ground-truth object masks provided in one or multiple frames of the test video sequence. For example, Perazzi et al. [27] define two main tasks: unsupervised and semi-supervised video object segmentation, where the amount of supervision is considered w.r.t. the test video sequence: the unsupervised task has no object mask given for the test video sequence, while the semi-supervised task has access to the ground-truth mask in the first frame of the sequence. The state-of-the-art methods on these tasks all use some kind of supervision, usually making extensive use of features that are pretrained for the task of video object segmentation or for strongly related tasks [21, 22, 35, 1, 37, 5, 3, 26, 4, 30, 32, 11].
The completely unsupervised solutions for video object segmentation are based on low-level features, general observations and common-sense insights about objects, their changes over time and the nature of the video sequence [15, 25, 13, 8, 9].
Another important aspect of video object segmentation and discovery is computation time, with many approaches requiring more than 2 seconds per frame [25, 15]. Our approach can accommodate both supervised and completely unsupervised cases, depending on the features used to define our objective score, and we demonstrate its capabilities in both. We achieve top results among unsupervised methods when functioning standalone, with no supervised features, and we are also able to improve over the state of the art when including features from other, supervised approaches. Regarding speed, our method requires less than 1 sec/frame to reach convergence, and can be further optimized.
There are several notable works that use optical flow in the video object segmentation literature. For example, Brox and Malik [2] introduce a motion clustering method that simultaneously estimates the optical flow and the object segmentation, which is similar in spirit to the approach of Tsai et al. [34]. Zhuo et al. [38] build salient motion masks which are further combined with objectness masks in order to generate the final segmentation. Li et al. [19] and Wang et al. [36] introduce approaches that are based on averaging saliency masks over connections defined by optical flow. Different from ours, most approaches using flow connections operate at the superpixel level, avoiding direct connections between pixels.

2 Our approach
Given a sequence of consecutive video frames, our aim is to extract a set of soft-segmentation masks, one per frame, containing the main object of interest. We represent the entire video as a graph with one node per pixel and a structure defined by optical flow chains, as shown in Sec. 2.1. We formulate segmentation as a clustering problem in Sec. 2.2, for which we find an analytical solution in Sec. 2.3. We then introduce the algorithm in Sec. 3, where we also discuss its properties and implementation details.
2.1 Spacetime graph
Graph of pixels in spacetime: in the spacetime graph $G = (V, E)$, each node $i \in V$ is associated with a pixel in one of the video frames. $G$ has $N$ nodes, with $N = nf$, where $f$ is the frame size (number of pixels per frame) and $n$ the number of frames.
Optical flow chains: given optical flows between pairs of consecutive frames, both forward and backward, we form optical flow chains by following the flow (in the same direction) starting from a given pixel in a frame, all the way to the end of the video. Thus, multiple chains could pass through a pixel: at least one moving forward and one moving backward. A chain in a given direction could start at that pixel, if there is no incoming flow in that direction, or it could pass through that pixel and start at an earlier frame w.r.t. that particular direction. Note that, for a given direction, a pixel could have zero or several incoming flows, whereas it will always have exactly one outgoing flow chain. These flow chains are important in our graph as they define its edges: there is an undirected edge between two nodes if they are connected by an optical flow chain in one direction or both. Note that, based on the definition above, there can be at most two different optical flow chains between two nodes, one per direction.
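To make the chain construction concrete, here is a minimal sketch, assuming forward flow fields stored as (H, W, 2) arrays of (dy, dx) displacements and nearest-pixel rounding at every step (the rounding policy and array layout are our assumptions, not details fixed by the text):

```python
import numpy as np

def follow_chain(flows, t0, y, x):
    """Trace the forward optical-flow chain that starts at pixel (y, x)
    of frame t0. flows[t] is the flow from frame t to frame t+1, an
    (H, W, 2) array of (dy, dx) displacements. Returns the (t, y, x)
    positions visited, stopping at the last frame or when the chain
    leaves the image."""
    H, W, _ = flows[0].shape
    chain = [(t0, y, x)]
    for t in range(t0, len(flows)):
        dy, dx = flows[t][y, x]
        y, x = int(round(float(y + dy))), int(round(float(x + dx)))
        if not (0 <= y < H and 0 <= x < W):  # chain leaves the frame
            break
        chain.append((t + 1, y, x))
    return chain
```

A backward chain is obtained the same way from the backward flows; running this from every pixel, in both directions, yields the chains that define the graph edges.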
Adjacency matrix: we introduce the adjacency matrix $M \in \mathbb{R}^{N \times N}$, defined as $M_{ij} = g(i, j)\,c_{ij}$, where $g(i, j)$ is a Gaussian kernel as a function of the temporal distance between nodes $i$ and $j$, while $c_{ij} = 1$ if there is an edge between $i$ and $j$ and zero otherwise. Thus, $M_{ij} > 0$ if $i$ and $j$ are connected and zero otherwise. By definition, $M$ is symmetric, semi-positive definite, has nonnegative elements and is expected to be very sparse. $M$ is a Mercer kernel, since the pairwise terms satisfy Mercer's condition. In Figure 1 we give a visual representation of how edges in the spacetime graph are formed through long range optical flow chains.
Node labels and their features: besides the graph structure, completely described by pairwise functions between nodes, each node $i$ is also described by a unary, node-level feature vector $\mathbf{f}_i \in \mathbb{R}^d$, computed along the two outgoing chains starting at $i$, one per direction (Figure 1.b). We stack all features into a feature matrix $F \in \mathbb{R}^{N \times d}$. In practice we can consider different features, pretrained or not; we refer the reader to Sec. 3 and Sec. 3.1 for details.
Each node $i$ has a (soft) segmentation label $x_i \in [0, 1]$ which, at any moment in time, represents our belief that the node is part of the object of interest. Thus, we can represent a solution to the segmentation problem, over the whole video, as a vector of labels $\mathbf{x} \in \mathbb{R}^N$, with a label $x_i$ for each pixel $i$. The second assumption we make in this paper is that nodes with similar features should have similar labels. From a mathematical point of view, we want to be able to regress the labels on the features: the features associated with a node should suffice for predicting its label. If the regression is possible with sufficiently small error, then the assumption that pixels with similar features have similar labels is automatically satisfied. Now we are prepared to formulate the segmentation problem mathematically. On one hand, we want to find a strong cluster in the graph, as defined by $M$; on the other, we want to be able to regress $\mathbf{x}$ on the node features $F$. In the next section we show how these factors interact and define object segmentation in video as an optimization problem.
2.2 Problem formulation
Nodes belonging to the main object of interest should form a strong cluster in the spacetime graph, such that they are strongly connected through long range flow chains and their features are able to predict their labels. The vector $\mathbf{x}$ represents the segmentation labels of individual nodes and also defines the segmentation cluster: nodes with label 1 are part of this cluster, those with label 0 are not. We define the intra-cluster score as $S_c(\mathbf{x}) = \sum_{i,j} M_{ij} x_i x_j$ [17], which can be written in matrix form as:
$S_c(\mathbf{x}) = \mathbf{x}^\top M \mathbf{x} \qquad (1)$
We relax the condition on $\mathbf{x}$, allowing continuous values for the labels. For the purpose of video object segmentation we only care about the labels' relative values, so for stability of convergence we impose the L2 norm of $\mathbf{x}$ to be 1. The pairwise links $M_{ij}$ are stronger when nodes $i$ and $j$ are linked through flow chains and close to each other in time, therefore we want to maximize the clustering score. Under the constraint $\|\mathbf{x}\|_2 = 1$, the score $\mathbf{x}^\top M \mathbf{x}$ is maximized by the leading eigenvector of $M$, which must have nonnegative values by the Perron-Frobenius theorem, since the matrix has nonnegative elements. Finding the main cluster by inspecting the main eigenvector of the adjacency matrix is a classic case of spectral clustering [24] and is also related to spectral approaches in graph matching [17]. However, in our case $M$ alone is not sufficient, since it is defined by simple connections between nodes, with no information regarding their appearance or higher level features that could better capture their similarity. As mentioned previously, we impose the constraint that nodes having similar features should have similar labels. We require that $\mathbf{x}$ should be predicted from the features $F$ through a linear mapping: $\mathbf{x} \approx F\mathbf{w}$, for some $\mathbf{w} \in \mathbb{R}^d$. Thus, besides maximizing the clustering score $\mathbf{x}^\top M \mathbf{x}$, we also aim to minimize an error term $\|F\mathbf{w} - \mathbf{x}\|_2^2$, which enforces a feature-label consistency such that labels can be predicted well from features. After including a regularization term $\beta\|\mathbf{w}\|_2^2$, which should also be minimized, we obtain the final objective score for segmentation:
$S(\mathbf{x}, \mathbf{w}) = \mathbf{x}^\top M \mathbf{x} - \alpha(\|F\mathbf{w} - \mathbf{x}\|_2^2 + \beta\|\mathbf{w}\|_2^2) \qquad (2)$
Our goal is to maximize this objective subject to the constraint $\|\mathbf{x}\|_2 = 1$, resulting in our optimization problem:
$(\mathbf{x}^*, \mathbf{w}^*) = \arg\max_{\mathbf{x}, \mathbf{w}} S(\mathbf{x}, \mathbf{w}), \quad \text{s.t. } \mathbf{x}^\top \mathbf{x} = 1 \qquad (3)$
2.3 Finding the optimal segmentation
The optimization problem defined in Eq. 3 requires that we find a maximum of the function $S(\mathbf{x}, \mathbf{w})$, subject to the equality constraint $\mathbf{x}^\top\mathbf{x} = 1$. In order to solve this problem, we introduce the Lagrange multiplier $\lambda$ and define the Lagrange function:
$L(\mathbf{x}, \mathbf{w}, \lambda) = \mathbf{x}^\top M \mathbf{x} - \alpha(\|F\mathbf{w} - \mathbf{x}\|_2^2 + \beta\|\mathbf{w}\|_2^2) - \lambda(\mathbf{x}^\top \mathbf{x} - 1) \qquad (4)$
The stationary points of our problem (Eq. 3) satisfy $\nabla L = 0$. In the discussion section we show that the stationary point is also a global optimum: the principal eigenvector of a specific matrix (Eq. 8), which we do not use, since it is expensive to compute in practice. Next, we obtain the following system of equations:
$\partial L / \partial \mathbf{x} = 2M\mathbf{x} + 2\alpha(F\mathbf{w} - \mathbf{x}) - 2\lambda\mathbf{x} = 0, \qquad \partial L / \partial \mathbf{w} = -2\alpha\left(F^\top(F\mathbf{w} - \mathbf{x}) + \beta\mathbf{w}\right) = 0 \qquad (5)$
From $\partial L / \partial \mathbf{w} = 0$ we arrive at the closed-form solution for ridge regression, with optimum $\mathbf{w}^* = (F^\top F + \beta I)^{-1} F^\top \mathbf{x}$, as a function of $\mathbf{x}$. All we have to do now is compute the optimum $\mathbf{x}$, for which we take a fixed-point iteration approach, which (as discussed in more detail later) should converge to a solution of the equation $\nabla L = 0$. We rewrite $\partial L / \partial \mathbf{x} = 0$ in the form $\mathbf{x} = H(\mathbf{x})$, such that any fixed point of $H$ will be a solution of our initial equation. Thus, we apply a fixed-point iteration scheme and iteratively update the value of $\mathbf{x}$ as a function of its previous value. We immediately obtain $\mathbf{x}_{k+1} \propto (M + \alpha P - \alpha I)\mathbf{x}_k$, where $P = F(F^\top F + \beta I)^{-1} F^\top$. The $-\alpha I$ term only shifts the eigenvalues of the remaining symmetric matrix, without changing its eigenvectors, so it cancels out and we end up with the following power iteration scheme that optimizes the segmentation objective (Eq. 2) under the L2 constraint $\|\mathbf{x}\|_2 = 1$:
$\mathbf{w}_k = (F^\top F + \beta I)^{-1} F^\top \mathbf{x}_k, \qquad \mathbf{x}_{k+1} = \frac{M\mathbf{x}_k + \alpha F\mathbf{w}_k}{\|M\mathbf{x}_k + \alpha F\mathbf{w}_k\|_2} \qquad (6)$
where $\mathbf{x}_k$ and $\mathbf{w}_k$ are the values of $\mathbf{x}$ and $\mathbf{w}$, respectively, at iteration $k$. We have reached a set of compact segmentation updates at each iteration step, which efficiently combine the clustering score and the regression loss, while imposing a constraint on the norm of $\mathbf{x}$. Note that the actual norm is not important: values in $\mathbf{x}$ are always nonnegative and only their relative values matter. They can easily be scaled and shifted to range between 0 and 1 (without changing the direction of the vector), which we actually do in practice.
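On a toy problem, the updates of Eq. 6 can be sketched with dense matrices (a hedged illustration with made-up sizes and hyper-parameters; on real videos $M$ is never materialized, as discussed in Sec. 3):

```python
import numpy as np

def govos_iterations(M, F, alpha=1.0, beta=0.1, iters=50, seed=0):
    """Toy-scale sketch of the Eq. 6 fixed-point updates.
    M: (N, N) symmetric nonnegative adjacency; F: (N, d) node features.
    alpha/beta are assumed hyper-parameter values, not the paper's.
    Returns the L2-normalised soft labels x."""
    rng = np.random.default_rng(seed)
    N, d = F.shape
    x = rng.random(N)
    x /= np.linalg.norm(x)
    R = np.linalg.inv(F.T @ F + beta * np.eye(d)) @ F.T  # ridge operator
    for _ in range(iters):
        w = R @ x                      # regression: w_k from x_k
        x = M @ x + alpha * (F @ w)    # propagation plus feature term
        x /= np.linalg.norm(x)         # keep the L2 norm equal to 1
    return x
```

On a graph with one strongly connected block and one weakly connected block, the labels concentrate on the strong block, as expected from the clustering score.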
3 Algorithm
In practice we need to estimate the free parameter $\alpha$, in order to balance in an optimal way the graph term $\mathbf{x}^\top M \mathbf{x}$, which depends on node pairs, with the regression term $\|F\mathbf{w} - \mathbf{x}\|_2^2$, which depends on features at individual nodes. To keep the algorithm as simple and efficient as possible, we drop $\alpha$ completely and reformulate the iterative process in terms of three separate operations: a propagation step, a regression step and a projection step.
The propagation step is equivalent to the multiplication $M\mathbf{x}$, which can be written for a node $i$ as $(M\mathbf{x})_i = \sum_j M_{ij} x_j$. The equation can be implemented efficiently, for all nodes, by propagating the soft labels $x_i$, weighted by the pairwise terms $M_{ij}$, to all the nodes in other frames to which $i$ is connected in the graph according to the forward and backward optical flow chains. Thus, starting from a given node $i$ we move along its flow chains, one in each direction, and cast node $i$'s votes to all points $j$ met along the chain: $x_j \leftarrow x_j + g(i, j)\,x_i$. We also increase the value at node $i$ by the same amount. By doing so for all pixels in the video, in both directions, we perform, in fact, one iteration of $M\mathbf{x}$. Since $g(i, j)$ decreases rapidly towards zero with the temporal distance between $i$ and $j$ along a chain, in practice we cast votes only between frames that are within a radius of $k$ time steps, which greatly speeds up the process. Thus, the complexity of propagation is reduced from $O(n^2 f)$ to $O(nfk)$, where $n$ is the number of frames, $f$ the number of pixels per frame and $k$ a relatively small constant in our tests.
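The voting scheme can be sketched as follows (a hypothetical implementation with per-frame label maps, a list of chains and a truncated Gaussian kernel of the temporal distance; the exact bookkeeping in the real implementation may differ):

```python
import numpy as np

def propagate_votes(labels, chains, sigma=2.0, radius=5):
    """One propagation step (M @ x) implemented by voting along chains
    instead of touching the huge adjacency matrix.
    labels[t]: (H, W) soft mask of frame t.
    chains: list of pixel chains, each a list of (t, y, x) positions.
    Votes are weighted by a Gaussian of the temporal distance and cast
    only within `radius` chain positions."""
    out = [np.zeros_like(l) for l in labels]
    for chain in chains:
        for i, (ti, yi, xi) in enumerate(chain):
            vote = labels[ti][yi, xi]
            # cast this node's vote to nearby chain positions (incl. itself)
            for tj, yj, xj in chain[max(0, i - radius): i + radius + 1]:
                g = np.exp(-(ti - tj) ** 2 / (2 * sigma ** 2))
                out[tj][yj, xj] += g * vote
    return out
```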
The regression step estimates the $\mathbf{w}$ for which $F\mathbf{w}$ best approximates the current $\mathbf{x}$ in the least-squares sense. Note that this step is equivalent to ridge regression, where the target values are unsupervised, given by the current solution $\mathbf{x}$ that we want to solve for.
The projection step: once we compute the optimal $\mathbf{w}$ at the current iteration, we reset the values in $\mathbf{x}$ to their predicted values $F\mathbf{w}$. Thus, if the propagation step is a power iteration that pulls the solution towards the main eigenvector of $M$, the regression and projection steps pull the solution towards the space in which labels can be predicted from actual node features.
Algorithm: the final GOVOS algorithm (Alg. 1) is a slightly simplified version of Eq. 6 and brings together, in sequence, the three steps discussed above: propagation, regression and projection.
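A matrix-form sketch of the three steps (toy-sized; the real implementation replaces the `M @ x` product with chain voting). Unlike the Eq. 6 updates, the projection step fully resets the labels to their predictions $F\mathbf{w}$, so one iteration amounts to $\mathbf{x} \leftarrow PM\mathbf{x}$ up to normalization:

```python
import numpy as np

def govos_algorithm(M, F, beta=0.1, iters=30, seed=0):
    """Sketch of the propagation / regression / projection loop.
    M: (N, N) symmetric nonnegative adjacency; F: (N, d) features.
    beta is an assumed regularization value."""
    rng = np.random.default_rng(seed)
    N, d = F.shape
    x = rng.random(N)
    x /= np.linalg.norm(x)
    R = np.linalg.inv(F.T @ F + beta * np.eye(d)) @ F.T
    for _ in range(iters):
        x = M @ x                 # propagation along flow chains
        w = R @ x                 # regression: ridge fit of x on F
        x = F @ w                 # projection: reset labels to F w
        x /= np.linalg.norm(x)    # keep the L2 norm equal to 1
    return x
```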
Discussion: in practice it is simpler, and also more accurate, to compute $\mathbf{w}$ per frame, using only the features of the nodes from that frame, such that we get a different $\mathbf{w}$ for each frame. This brings a richer representation power, which explains the superior accuracy.
Initialization: in the iterative process defined in Alg. 1 we need to establish the initial labels associated with the graph nodes. Our algorithm is robust and flexible with respect to different initializations, as we show in Sec. 3.1. We could start from soft-segmentation masks generated by other video object segmentation solutions, such as [15, 30, 9], or from non-informative masks, e.g. a central Gaussian mask or even a completely random initialization.
The adjacency matrix $M$ is constructed using the optical flow provided by the solution of [10], pretrained on synthetic data: the FlyingThings3D [23] and FlyingChairs [7] datasets. Thus, the optical flow solution is in essence unsupervised, as no human annotations are required. Other powerful optical flow methods that do not require human supervision are also available [29]. Thus, creating $M$ is safely done in an unsupervised manner.
The supervised case: our algorithm can accommodate different degrees of supervision, depending on our choice of feature matrix $F$. In our experiments we tested both scenarios: when $F$ contained only simple, untrained features and when it also included supervised features, such as the output of other available VOS methods (Sec. 3.1). It is a convenient property of our approach that it can very easily incorporate different kinds of features (by simply including them in $F$) and adapt to different levels of supervision.
The unsupervised case: the simple motion and appearance features we experiment with, in the fully unsupervised case, are patterns of colors and motions along the flow chains centered at a given node $i$. Motions are computed as optical flow displacements along the chain around the node, starting in both directions. Thus, points that move similarly (regardless of absolute location) will have similar motion features. As appearance features, we collect pixel colors along the flow chain around the same pixel. The motion and appearance vectors for a given pixel are then concatenated into a single descriptor vector $\mathbf{f}_i$ (Figure 1).
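A sketch of such a descriptor, assuming a fixed window of k chain positions on each side of the node (the window size and sampling scheme are our assumptions, not values fixed by the text):

```python
import numpy as np

def chain_features(frames, flows, chain, idx, k=2):
    """Unsupervised descriptor for the node at position `idx` of a chain:
    pixel colors and flow displacements collected at the k chain
    positions on each side, concatenated into one vector.
    frames[t]: (H, W, 3) colors; flows[t]: (H, W, 2) flow t -> t+1;
    chain: list of (t, y, x) positions."""
    window = chain[max(0, idx - k): idx + k + 1]
    colors = [frames[t][y, x] for t, y, x in window]
    motion = [flows[t][y, x] for t, y, x in window if t < len(flows)]
    return np.concatenate([np.ravel(colors), np.ravel(motion)])
```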
3.1 Algorithm analysis
In the following we perform a more in-depth analysis of our algorithm and its performance, addressing different aspects such as initialization, convergence and the actual features used. Tests are performed on the validation set of the DAVIS dataset [27] and we adopt its metrics (Sec. 4.1).
Convergence to global optimum: next we show that our algorithm should always converge to the same solution, regardless of the initialization: the solution depends only on the graph structure $M$ and the feature matrix $F$. More precisely, we show that the stationary point of our optimization problem, as defined by Eq. 5, is in fact a global optimum, namely the principal eigenvector of a specific matrix, which we construct below. This implies that the final solution should not depend on the initialization. We observed this behaviour in practice every time, which validates the theoretical conclusion.
In Eq. 6, if we write $\mathbf{w}_k$ in terms of $\mathbf{x}_k$ and replace it in all equations, we can then write $\mathbf{x}_{k+1}$ as a function of $M$, $F$ and $\mathbf{x}_k$:
$\mathbf{x}_{k+1} = \frac{A\mathbf{x}_k}{\|A\mathbf{x}_k\|_2} \qquad (7)$
where the matrix $A$ is defined as:
$A = M + \alpha F (F^\top F + \beta I)^{-1} F^\top \qquad (8)$
Matrix $A$ is symmetric, therefore the power iteration in Eq. 7 will converge to its principal eigenvector. Thus, at least in theory, the stationary point is the global optimum: the principal eigenvector of matrix $A$. We do not implement this method in practice and choose the simpler and equally effective Algorithm 1, since it would be practically impossible to explicitly compute $A$ (which is dense and of size $N \times N$, with $N$ the number of pixels).
Motion structure vs. feature projection: Algorithm 1 could also be written as a power iteration method, with a slightly different matrix. Therefore, the actual Algorithm 1 is also guaranteed to converge to the leading eigenvector of its own matrix, which can be factored as the product $PM$, where $M$ can be seen as a motion structure matrix and $P = F(F^\top F + \beta I)^{-1} F^\top$ is the feature projection matrix. Thus, at the point of convergence, the segmentation reaches an equilibrium between its motion structure in spacetime and its consistency with the features.
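This equivalence is easy to verify numerically on a random toy problem: iterating propagate-then-project indeed aligns the solution with the leading eigenvector of the product of the projection and motion matrices (sizes, seed and regularization below are arbitrary choices):

```python
import numpy as np

def check_convergence(N=8, d=3, beta=0.1, iters=200, seed=0):
    """Run x <- P M x (normalised) on random data and measure the
    alignment |x . v| with the leading eigenvector v of P M.
    Returns a value close to 1.0 when the two directions coincide."""
    rng = np.random.default_rng(seed)
    M = rng.random((N, N)); M = (M + M.T) / 2   # symmetric, nonnegative
    F = rng.random((N, d))
    P = F @ np.linalg.inv(F.T @ F + beta * np.eye(d)) @ F.T
    x = rng.random(N)
    x /= np.linalg.norm(x)
    for _ in range(iters):                      # propagate, then project
        x = P @ (M @ x)
        x /= np.linalg.norm(x)
    vals, vecs = np.linalg.eig(P @ M)           # leading eigenvector of P M
    v = np.real(vecs[:, np.argmax(np.abs(vals))])
    v /= np.linalg.norm(v)
    return abs(x @ v)
```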
Convergence in practice: the role of initialization. Our experiments verify the theoretical result above: we observed that the method approaches the same point of convergence regardless of the initialization. We considered different choices for the initial $\mathbf{x}$, ranging from uninformative masks, such as an isotropic Gaussian soft-mask placed at the center of each frame (with varied standard deviations), a randomly initialized mask or a uniform full-white mask, to masks given by state-of-the-art methods such as ELM [15] and PDB [30]. In Figure 3 we present an example result, showing the evolution of the soft-segmentation masks over three iterations of our algorithm when starting from a random mask. We observe that the main object of interest emerges from this initial random mask, as its soft-segmentation mask visibly improves after each iteration. In Figure 2 we present more examples of the evolution of our soft-segmentation masks over several iterations. In Figure 4 we present quantitative results which confirm the theoretical insights: the performance, in terms of Jaccard index (J Mean), evolves over seven iterations of our algorithm towards the same common segmentation. Note, as expected, that convergence is faster for initializations that start closer to the convergence point.
Convergence in practice: the role of features. We further study experimentally the influence of the features, which are expected to have a strong impact on the final result. Our tests clearly show that adding strong, informative features boosts performance significantly. For these experiments we considered two different starting points: soft-segmentation masks provided by the method of Song et al. [30] (PDB) and non-informative Gaussian soft-segmentation masks. Regarding the features used in $F$, we also have two options: use only motion vectors along the flow chains (as explained previously), or combine (concatenate) the motion patterns with the soft-segmentation values along the same path, as provided by [30]. The first choice is completely unsupervised, while the second is supervised, as the method of Song et al. is based on features pretrained with a supervised signal. For simplicity, we refer to the two solutions as the unsupervised solution (motion-only features) and the supervised solution (the soft-segmentation masks of [30] are used as features, along with the motion features). We present results in Figure 5. Note that regardless of the starting point, the method converges towards the global optimum, which is strictly dictated by the set of considered features.
To conclude, if we initialize with a poor solution, we expect to improve over the iterations. However, if the initialization point is better than the point of convergence (in terms of the segmentation evaluation metric), then the solution will degrade over the iterations. The main point is that the only way to raise the quality of the solution is to improve the features in $F$ or the optical flow that defines the graph structure in $M$ (through the optical flow chains). That is because the solution is unique and depends only on $M$ and $F$.

4 Experiments
We compare our proposed approach, GOVOS, against state-of-the-art solutions for video object segmentation on three challenging datasets: DAVIS [27], SegTrack v2 [18] and YouTube-Objects [12]. In all the experiments we initialized the soft-segmentation masks with non-informative Gaussian soft-masks placed at the center of each frame. Unless otherwise specified, we used only unsupervised features: motion and color cues. We present qualitative results in Figure 6, in comparison to other methods, on the DAVIS 2016 dataset, which we present next.
4.1 DAVIS dataset
Perazzi et al. introduce in [27] a benchmark dataset and evaluation methodology for video object segmentation tasks. The original dataset is composed of 50 videos (30 train + 20 test), each accompanied by accurate, per-pixel annotations of the foreground region. The foreground may be composed of multiple connected objects (e.g. the bicyclist and the bicycle), but for the 2016 version of the dataset those connected objects are considered as a single object, and this is the version we test on. DAVIS is a challenging dataset, as it contains many difficult cases such as appearance changes, occlusions and motion blur.
Metric: for evaluation we compute both region-based (J Mean) and contour-based (F Mean) measures, as established in [27]. J Mean is computed as the intersection over union between the estimated segmentation and the ground truth. F Mean is the F-measure of the segmentation contour points (for details see [27]).
Results: in Table 1 we compare our method (GOVOS) against both supervised and unsupervised methods on the task of unsupervised single object segmentation, on the DAVIS validation set. For the supervised formulation we consider the soft-segmentation masks provided by various state-of-the-art solutions as additional features, concatenated into the feature matrix $F$ (Sec. 3.1). We highlight that our fully unsupervised GOVOS achieves state-of-the-art results among fully unsupervised methods, and also improves over the methods that use supervised pretrained features.
Task | Method | J Mean | F Mean | sec/frame
Unsupervised, with supervised features:
  PDB [30] | 77.2 | 74.5 | 0.05
  ARP [14] | 76.2 | 70.6 | N/A
  LVO [32] | 75.9 | 72.1 | N/A
  FSEG [11] | 70.7 | 65.3 | N/A
  LMP [31] | 70.0 | 65.9 | N/A
  GOVOS supervised + features of [30] | 79.9 (+2.7) | 78.1 | 0.61
  GOVOS supervised + features of [14] | 78.7 (+2.5) | 73.1 | 0.61
  GOVOS supervised + features of [32] | 77.0 (+1.1) | 73.7 | 0.61
  GOVOS supervised + features of [11] | 74.1 (+3.5) | 69.9 | 0.61
  GOVOS supervised + features of [31] | 73.7 (+3.7) | 69.2 | 0.61
Fully unsupervised:
  ELM [15] | 61.8 | 61.2 | 20
  FST [25] | 55.8 | 51.1 | 4
  CUT [13] | 55.2 | 55.2 | 1.7
  NLC [8] | 55.1 | 52.3 | 12
  GOVOS unsupervised | 65.0 | 61.1 | 0.91
4.2 SegTrack v2 dataset
The SegTrack dataset was originally introduced in [33] and further adapted for the task of video object segmentation in [18]. The latest version of the dataset (v2) contains 14 videos with pixel-level annotations for the main objects of interest (8 videos with one primary object and 6 videos with multiple objects). In contrast to the multiple-object videos of the DAVIS dataset, these multiple-object videos include objects that are separated from each other. SegTrack contains deformable and dynamic objects, in videos of relatively poor resolution, making it a very challenging dataset for video object segmentation.
Metric: for evaluation we used the average intersection-over-union score.
Results: in Table 2 we present quantitative results of our method, in the fully unsupervised case, and compare to published methods that test on all 14 videos of SegTrack v2. Our solution achieves the top score and is the second fastest. We did not include the solution of [8] in the comparison since, although it is designed and tested on SegTrack, it was evaluated on only 12 videos.
Task | Method | IoU | sec/frame
Unsupervised, with supervised features:
  KEY [16] | 57.3 | 120
  FSEG [11] | 61.4 | N/A
  LVO [32] | 57.3 | N/A
  [20] | 59.3 | N/A
Fully unsupervised:
  FST [25] | 54.3 | 4
  CUT [13] | 47.8 | 1.7
  HPP [9] | 50.1 | 0.35
  GOVOS unsupervised | 62.2 | 0.91
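The IoU (Jaccard) score used in the evaluations above is simply:

```python
import numpy as np

def iou(mask, gt):
    """Intersection-over-union (Jaccard index) between two binary masks,
    the region measure used for J Mean and for SegTrack evaluation."""
    mask, gt = mask.astype(bool), gt.astype(bool)
    inter = np.logical_and(mask, gt).sum()
    union = np.logical_or(mask, gt).sum()
    return inter / union if union > 0 else 1.0
```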
4.3 YouTubeObjects dataset
The YouTube-Objects (YTO) dataset [28] consists of videos collected from YouTube (about 720,000 frames). It is also very challenging, containing thousands of video shots (2,511). However, the YTO ground-truth annotations are only bounding boxes. Although we do not test against pixel-level annotations, the tests on YTO are relevant considering the large number of videos and their wide diversity. In the paper we show results on the latest version of the dataset (v2.2), which has more annotated boxes (6,975); we also provide results on v1.0 in the supplementary material. Following the methodology of published works on YTO, we test our solution on the training set, which contains videos with only one annotated object.
Metric: we used the CorLoc metric, computing the percentage of correctly localized object bounding boxes. A box is considered correct based on the PASCAL criterion (IoU > 0.5).
Results: in Table 3 we present the results on YTO v2.2 and compare against the published state of the art. All methods are fully unsupervised. We obtain the top average score, while outperforming the other methods on 5 out of 10 object classes.
Method | aero | bird | boat | car | cat | cow | dog | horse | moto | train | avg | sec/frame
[6] | 75.7 | 56.0 | 52.7 | 57.3 | 46.9 | 57.0 | 48.9 | 44.0 | 27.2 | 56.2 | 52.2 | 0.02
HPP [9] | 76.3 | 68.5 | 54.5 | 50.4 | 59.8 | 42.4 | 53.5 | 30.0 | 53.5 | 60.7 | 54.9 | 0.35
GOVOS unsupervised | 79.8 | 73.5 | 38.9 | 69.6 | 54.9 | 53.6 | 56.6 | 45.6 | 52.2 | 56.2 | 58.1 | 0.91
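CorLoc, used above, only needs box-level IoU under the PASCAL criterion. A minimal sketch, assuming boxes are (x1, y1, x2, y2) tuples and one predicted box per annotated frame (both our assumptions):

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def corloc(pred_boxes, gt_boxes, thr=0.5):
    """Fraction of boxes correctly localized under the PASCAL criterion
    (IoU above the threshold)."""
    correct = sum(box_iou(p, g) > thr for p, g in zip(pred_boxes, gt_boxes))
    return correct / len(gt_boxes)
```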
4.4 Computation cost
Considering that our problem is formulated over a dense graph, with one node per pixel in the video and long range connections given by optical flow chains, one would expect the method to be memory expensive and slow. However, since we never construct the adjacency matrix $M$ explicitly and break the optimization into three steps which, as one can show, are all linear in the number of pixels in the video, our algorithm is actually fast (less than 1 sec per frame, total computation time): quicker than most published methods on this task, as shown in Tables 1, 2 and 3. In more detail, the per-frame cost splits between computing the optical flow, computing the information related to matrices $M$ and $F$, and the optimization steps themselves, which we run for 7 iterations, for a total of 0.91 sec/frame. We implement the method in PyTorch; the runtime analysis was performed on a computer with an Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz and a GeForce GTX 1080 GPU.
5 Conclusions
We presented an efficient solution for video object segmentation, defined as an optimization problem over a spacetime graph, with a node at every pixel in the video and long-range links between nodes connected through optical flow chains. Our mathematical formulation enforces that the final segmentation forms a strong cluster defined by the motion structure of the object in the video, while also being consistent with the features at the nodes. The two forces, motion and features, are brought together into a single optimization problem, which our proposed algorithm solves to the global optimum in a few iterations. We show in extensive experiments on three challenging datasets, namely DAVIS 2016, SegTrack v2 and YouTube-Objects v2.2, that our algorithm is both fast and accurate: GOVOS outperforms other unsupervised and supervised methods on these challenging datasets.
References

[1] L. Bao, B. Wu, and W. Liu. CNN in MRF: Video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5977–5986, 2018.
[2] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In European Conference on Computer Vision, pages 282–295. Springer, 2010.
[3] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 221–230, 2017.
[4] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1189–1198, 2018.
[5] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang. Fast and accurate online video object segmentation via tracking parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7415–7424, 2018.
[6] I. Croitoru, S.-V. Bogolin, and M. Leordeanu. Unsupervised learning from video to detect foreground objects in single images. In Proceedings of the IEEE International Conference on Computer Vision, pages 4335–4343, 2017.
[7] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
[8] A. Faktor and M. Irani. Video segmentation by non-local consensus voting. In BMVC, volume 2, page 8, 2014.

[9]
E. Haller and M. Leordeanu.
Unsupervised object segmentation in video by efficient selection of highly probable positive features.
In Proceedings of the IEEE International Conference on Computer Vision, pages 5085–5093, 2017.  [10] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2462–2470, 2017.
 [11] S. D. Jain, B. Xiong, and K. Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos. arXiv preprint arXiv:1701.05384, 2(3):6, 2017.
 [12] V. Kalogeiton, V. Ferrari, and C. Schmid. Analysing domain shift factors between videos and images for object detection. IEEE transactions on pattern analysis and machine intelligence, 38(11):2327–2334, 2016.
 [13] M. Keuper, B. Andres, and T. Brox. Motion trajectory segmentation via minimum cost multicuts. In Proceedings of the IEEE International Conference on Computer Vision, pages 3271–3279, 2015.
 [14] Y. J. Koh and C.S. Kim. Primary object segmentation in videos based on region augmentation and reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 6, 2017.
 [15] D. Lao and G. Sundaramoorthi. Extending layered models to 3d motion. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435–451, 2018.
 [16] Y. J. Lee, J. Kim, and K. Grauman. Keysegments for video object segmentation. In 2011 International conference on computer vision, pages 1995–2002. IEEE, 2011.
 [17] M. Leordeanu, R. Sukthankar, and M. Hebert. Unsupervised learning for graph matching. International journal of computer vision, 96(1):28–45, 2012.
 [18] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figureground segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2192–2199, 2013.
 [19] J. Li, A. Zheng, X. Chen, and B. Zhou. Primary video object segmentation via complementary cnns and neighborhood reversible flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 1417–1425, 2017.
 [20] S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C.C. Jay Kuo. Instance embedding transfer to unsupervised video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6526–6535, 2018.
 [21] J. Luiten, P. Voigtlaender, and B. Leibe. Premvos: Proposalgeneration, refinement and merging for the davis challenge on video object segmentation 2018. In The 2018 DAVIS Challenge on Video Object SegmentationCVPR Workshops, 2018.
 [22] K.K. Maninis, S. Caelles, Y. Chen, J. PontTuset, L. LealTaixé, D. Cremers, and L. Van Gool. Video object segmentation without temporal information. arXiv preprint arXiv:1709.06031, 2017.
 [23] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
 [24] M. Meila and J. Shi. A random walks view of spectral segmentation. In AISTATS, 2001.
 [25] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In Proceedings of the IEEE International Conference on Computer Vision, pages 1777–1784, 2013.
 [26] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. SorkineHornung. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2663–2672, 2017.
 [27] F. Perazzi, J. PontTuset, B. McWilliams, L. Van Gool, M. Gross, and A. SorkineHornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition, 2016.
 [28] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3282–3289. IEEE, 2012.

[29]
J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid.
Epicflow: Edgepreserving interpolation of correspondences for optical flow.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1164–1172, 2015.  [30] H. Song, W. Wang, S. Zhao, J. Shen, and K.M. Lam. Pyramid dilated deeper convlstm for video salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 715–731, 2018.
 [31] P. Tokmakov, K. Alahari, and C. Schmid. Learning motion patterns in videos. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 531–539. IEEE, 2017.
 [32] P. Tokmakov, K. Alahari, and C. Schmid. Learning video object segmentation with visual memory. arXiv preprint arXiv:1704.05737, 2017.
 [33] D. Tsai, M. Flagg, A. Nakazawa, and J. M. Rehg. Motion coherent tracking using multilabel mrf optimization. International journal of computer vision, 100(2):190–202, 2012.
 [34] Y.H. Tsai, M.H. Yang, and M. J. Black. Video segmentation via object flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3899–3908, 2016.

[35]
P. Voigtlaender and B. Leibe.
Online adaptation of convolutional neural networks for the 2017 davis challenge on video object segmentation.
In The 2017 DAVIS Challenge on Video Object SegmentationCVPR Workshops, volume 5, 2017.  [36] W. Wang, J. Shen, and F. Porikli. Saliencyaware geodesic video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3395–3402, 2015.
 [37] S. Wug Oh, J.Y. Lee, K. Sunkavalli, and S. Joo Kim. Fast video object segmentation by referenceguided mask propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7376–7385, 2018.
 [38] T. Zhuo, Z. Cheng, P. Zhang, Y. Wong, and M. Kankanhalli. Unsupervised online video object segmentation with motion property understanding. arXiv preprint arXiv:1810.03783, 2018.