Spacetime Graph Optimization for Video Object Segmentation

07/07/2019 ∙ by Emanuela Haller, et al.

In this paper we address the challenging task of object discovery and segmentation in video. We introduce an efficient method that can be applied in supervised and unsupervised scenarios, using a graph-based representation in both space and time. Our method exploits the consistency in appearance and motion patterns of pixels belonging to the same object. We formulate the task as a clustering problem: graph nodes at the pixel level that belong to the object of interest should form a strong cluster, linked through long range optical flow chains and with similar motion and appearance features along those chains. On one hand, the optimization problem aims to maximize the segmentation clustering score based on the structure of pixel motions through space and time. On the other, the segmentation should be consistent with the features at the level of nodes, s.t. these features should be able to predict the segmentation labels. The solution to our problem relates to spectral clustering as well as to the classical regression analysis. It leads to a fast algorithm that converges in a few iterations to a global optimum of the relaxed problem, using fixed point iteration. The proposed method, namely GO-VOS, is relatively fast and accurate. It can be used both as a standalone and completely unsupervised method or in combination with other segmentation methods. In experiments, we demonstrate top performance on several challenging datasets: DAVIS, SegTrack and YouTube-Objects.



1 Introduction

Discovering objects in videos, as they move and change appearance over space and time, is one of the most challenging and still unsolved problems in computer vision. This impacts the way we learn about objects and how we process large amounts of video data. One of our core scientific goals is to understand how much we could learn automatically from a video about the main objects of interest. We are interested in how object properties relate in both space and time and how we could exploit these consistencies in order to discover the objects in a fast and accurate manner.

We formulate and validate two main assumptions: 1) pixels that belong to the same object are highly likely to be connected through long range optical flow chains, as we will describe in detail later. The main desirable property of optical flow is its ability to track the same points from one frame to the next. 2) pixels belonging to the same object are also likely to have similar motion and distinctive appearance patterns in space-time. What moves together, belongs together. What looks alike is also likely to be related. While these ideas are not new, we use them to define a novel space-time graph structure with motion and appearance constraints, which will enable us to solve the problem efficiently.

We define a node for each pixel in the video, which should normally make the problem intractable. Despite this challenge, we propose a fast optimization algorithm, able to find the object of interest as a strong cluster of pixels in the space-time graph. While the nodes are pixels in the video, long range edges are established through connecting optical flow chains. While optical flow ultimately defines the structure of the graph, it is the features along the motion chains that ensure that those having similar motion and appearance have similar labels. These features could be simple (e.g. color and motion cues) or learned, which makes our formulation suitable for either the unsupervised or supervised cases. Depending on the specific features and cues used, we could discover in principle any object in the video by simply using cues that favor one object vs. others. While our formulation uses a graph representation, the graph itself is only used implicitly. The optimization algorithm, which provides the solution, is based on a fast fixed-point iteration process that does not access the full graph explicitly. That makes our approach efficient in practice, suitable for real-world applications.

Main contribution: While there are many papers on object discovery in video, which use different local features and optical flow, we are the first to introduce a graph structure in space and time at the pixel level, that couples long range motion and appearance in a single data clustering formulation, with a fast optimization algorithm. While directly working with such a graph is not feasible in practice, our algorithm does not access the full huge graph explicitly, but only through a sequence of efficient fixed point iterations that ensure both improvement in objective score and convergence.

Related work: Video object segmentation is increasingly popular in the literature, with a relatively wide range of methods being proposed. The different specific tasks related to this problem differ in the amount of supervision, the precision of the segmentation (bounding box vs precise segmentation) and number of target objects.

One of the most important aspects that differentiates between approaches and tasks is the amount of supervision used, which could vary from complete absence of any human supervised cues, to using features pre-trained in a supervised manner. In some scenarios, methods could even have access to ground truth object masks provided in one or multiple frames of the test video sequence. For example, Perazzi et al. [27] define two main tasks: unsupervised video object segmentation and semi-supervised video object segmentation, where the amount of supervision is defined with respect to the test video sequence: the unsupervised task has no object mask given for that test video sequence, while the semi-supervised task has access to the ground truth mask in the first frame of the sequence. The state of the art methods on these tasks all use some kind of supervision, usually making extensive use of features that are pretrained for the task of video object segmentation or for strongly related tasks [21, 22, 35, 1, 37, 5, 3, 26, 4, 30, 32, 11].

The completely unsupervised solutions for the task of video object segmentation are based on low-level features, general observations and common sense insights about objects, changes over time and the nature of the video sequence [15, 25, 13, 8, 9].

Another important aspect regarding video object segmentation and discovery is that of computation time, with many approaches requiring more than 2 seconds per frame [25, 15]. Our approach can accommodate both supervised and completely unsupervised cases, depending on the features used to define our objective score. We demonstrate the capabilities of our algorithm in both cases. We achieve top results over unsupervised methods, when functioning standalone with no supervised features used. We are also able to improve over the state of the art when including features from other supervised approaches. Regarding speed, for reaching convergence, our method requires less than 1 sec/frame, and can be further optimized.

There are several notable works that use optical flow in the video object segmentation literature. For example, Brox and Malik [2] introduce a motion clustering method that simultaneously estimates the optical flow and the object segmentation. That is also similar to the approach of Tsai et al. [34]. Zhuo et al. [38] build salient motion masks which are further combined with objectness masks in order to generate the final segmentation. Li et al. [19] and Wang et al. [36] introduce approaches that are based on averaging saliency masks over connections defined by optical flow. Different from ours, most approaches using flow connections operate at superpixel level, avoiding direct connections between pixels.

Figure 1: Visual representation of how the space-time graph is formed. a) Pixels in different frames become nodes in the graph. A flow chain connects points that are linked by optical flow in the same direction (forward or backward) along a path of consecutive frames. Through a given node there are always two chains going out, one in each direction. However, there might be several chains, or none, coming in from each direction. So the out-degree of a node is always 2, while its in-degree can be 0 or more. Thus, there could be multiple flow chains connecting any two nodes (pixels) in the graph. The flow chains define a graph structure, in which pixels are connected if there is at least one flow chain between them. One of the assumptions we make is that the stronger a cluster, the more likely it belongs to a single object. Thus, pixels that are strongly connected to other foreground pixels are also likely to be foreground themselves. b) Flow chains are also used to describe node features, by providing patterns of motion (vectors of optical flow displacements) and patterns of appearance and other pre-computed features along the chains. For a given node $i$, the features forming $\mathbf{f}_i$ are collected along the flow chains.

2 Our approach

Given a sequence of consecutive video frames, our aim is to extract a set of soft-segmentation masks, one per frame, containing the main object of interest. We represent the entire video as a graph with one node per pixel and a structure defined by optical flow chains, as shown in Sec. 2.1. We formulate segmentation as a clustering problem in Sec. 2.2, for which we find an analytical solution in Sec. 2.3. Then we introduce the algorithm in Sec. 3, where we also discuss its properties and implementation details.

2.1 Space-time graph

Graph of pixels in space-time: in the space-time graph $G = (V, E)$, each node $i \in V$ is associated to a pixel in one of the video frames. $G$ has $n$ nodes, where $n = m \cdot h \cdot w$, with $(h, w)$ the frame size and $m$ the number of frames.

Optical flow chains: given optical flows between pairs of consecutive frames, both forward and backward, we form optical flow chains by following the flow (in the same direction) starting from a given pixel in a frame, all the way to the end of the video. Thus, multiple chains could pass through a pixel, at least one moving forward and one moving backward. A chain in a given direction could start at that pixel, if there is no incoming flow in that direction, or it could pass through that pixel and start at an earlier frame with respect to that particular direction. Note that, for a given direction, a pixel could have none or several incoming flows, whereas it will always have exactly one outgoing flow chain. These flow chains are important in our graph as they define its edges. Thus, there is an undirected edge between two nodes if they are connected by an optical flow chain in one direction or both. Note that based on our definition above, there could be at most two different optical flow chains between two nodes, one per direction.
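As a rough illustration of how one chain could be traced, the sketch below follows the forward flow from a starting pixel through consecutive frames. It assumes dense flow fields stored as arrays of shape (H, W, 2) holding (dx, dy) displacements, rounds positions to the nearest pixel and stops when the chain leaves the image. These conventions and the name `follow_chain` are illustrative assumptions, not details given in the text.

```python
import numpy as np

def follow_chain(flows, t0, y0, x0):
    """Trace one forward flow chain starting at pixel (y0, x0) of frame t0.

    flows[t] is an (H, W, 2) array with the (dx, dy) displacement of every
    pixel from frame t to frame t + 1 (assumed layout).
    Returns the list of (frame, row, col) nodes visited by the chain.
    """
    chain = [(t0, y0, x0)]
    y, x = float(y0), float(x0)
    for t in range(t0, len(flows)):
        h, w, _ = flows[t].shape
        # Read the displacement at the current (rounded) position.
        dx, dy = flows[t][int(round(y)), int(round(x))]
        y, x = y + dy, x + dx
        yi, xi = int(round(y)), int(round(x))
        if not (0 <= yi < h and 0 <= xi < w):
            break  # the chain left the image, so it ends here
        chain.append((t + 1, yi, xi))
    return chain
```

A backward chain could be traced the same way using the backward flow fields and a decreasing frame index.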

Adjacency matrix: we introduce the adjacency matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$, defined as $\mathbf{M}_{i,j} = l(i,j) \cdot k(i,j)$, where $k(i,j)$ is a Gaussian kernel as a function of the temporal distance between nodes $i$ and $j$, while $l(i,j) = 1$ if there is an edge between $i$ and $j$ and zero otherwise. Thus, $\mathbf{M}_{i,j} = k(i,j)$ if $i$ and $j$ are connected and zero otherwise. According to this definition, $\mathbf{M}$ is symmetric, positive semi-definite, has non-negative elements and is expected to be very sparse. $\mathbf{M}$ is a Mercer kernel, since the pairwise terms $\mathbf{M}_{i,j}$ satisfy Mercer's condition. In Figure 1, we introduce a visual representation of how edges in the space-time graph are formed through long range optical flow chains.
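A single pairwise term of the adjacency matrix can be sketched as follows. The function name `pairwise_weight` and the bandwidth `sigma` are hypothetical; the text only states that the kernel is Gaussian in the temporal distance between two connected nodes.

```python
import math

def pairwise_weight(t_i, t_j, connected, sigma=2.0):
    """Weight between two nodes that live in frames t_i and t_j.

    connected: True if at least one flow chain links the two nodes.
    sigma: kernel bandwidth in frames (an assumed value).
    """
    if not connected:
        return 0.0  # no edge, so the entry is zero
    dt = t_i - t_j
    return math.exp(-(dt * dt) / (2.0 * sigma * sigma))
```

The weight is symmetric in its two frame arguments and decays quickly with temporal distance, which is what later allows votes to be truncated to a small temporal radius.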

Node labels and their features: besides the graph structure, completely described by pairwise functions between nodes in $\mathbf{M}$, each node $i$ is also described by a unary, node-level feature vector $\mathbf{f}_i \in \mathbb{R}^{1 \times d}$, computed along the two outgoing chains starting at $i$, one per direction (Figure 1.b). We stack all features into a feature matrix $\mathbf{F} \in \mathbb{R}^{n \times d}$. In practice we can consider different features, pretrained or not; we refer the reader to Sec. 3 and Sec. 3.1 for details.

Each node $i$ has a (soft) segmentation label $x_i \in [0, 1]$, which, at any moment in time, represents our belief that the node is part of the object of interest. Thus we can represent a solution to the segmentation problem, over the whole video, as a vector of labels $\mathbf{x} \in \mathbb{R}^{n \times 1}$, with a label $x_i$ for each pixel $i$. The second assumption we make in this paper is that nodes with similar features should have similar labels. From a mathematical point of view, we want to be able to regress the labels $\mathbf{x}$ on the features $\mathbf{F}$; that is, the features associated with a node should suffice for predicting its label. If the regression is possible with sufficiently small error, then the assumption that pixels with similar features have similar labels is automatically satisfied.

Now, we are prepared to formulate the segmentation problem mathematically. On one hand we want to find a strong cluster in $G$, as defined by $\mathbf{x}$; on the other we want to be able to regress $\mathbf{x}$ on the node features $\mathbf{F}$. In the next section we show how these factors interact and define object segmentation in video as an optimization problem.

2.2 Problem formulation

Nodes belonging to the main object of interest should form a strong cluster in the space-time graph, such that they are strongly connected through long range flow chains and their features are able to predict their labels $\mathbf{x}$. Vector $\mathbf{x}$ represents the segmentation labels of individual nodes and also defines the segmentation cluster. Nodes with label 1 are part of this cluster, those with label zero are not. We define the intra-cluster score to be $S_C = \sum_{i,j} \mathbf{M}_{i,j} x_i x_j$ [17], which can be written in matrix form as:

$$S_C(\mathbf{x}) = \mathbf{x}^T \mathbf{M} \mathbf{x} \tag{1}$$

We relax the condition on $\mathbf{x}$, allowing continuous values for the labels in $\mathbf{x}$. For the purpose of video object segmentation, we only care about the labels' relative values, so for stability of convergence, we impose the L2-norm of vector $\mathbf{x}$ to be 1. The pairwise links $\mathbf{M}_{i,j}$ are stronger when nodes $i$ and $j$ are linked through flow chains and close to each other, therefore we want to maximize the clustering score. Under the constraint $\|\mathbf{x}\|_2 = 1$, the score $\mathbf{x}^T \mathbf{M} \mathbf{x}$ is maximized by the leading eigenvector of $\mathbf{M}$, which must have non-negative values by the Perron-Frobenius theorem, since the matrix has non-negative elements. Finding the main cluster by inspecting the main eigenvector of the adjacency matrix is a classic case of spectral clustering [24] and is also related to spectral approaches in graph matching [17]. However, in our case $\mathbf{M}$ alone is not sufficient for our problem, since it is defined by simple connections between nodes, with no information regarding their appearance or higher level features to better capture their similarity.
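To make the spectral view concrete, the following toy example runs plain power iteration on a small symmetric matrix with non-negative entries. The 4-node matrix is made up for illustration: nodes 0 to 2 are strongly linked, node 3 is only weakly attached. The dense matrix is used only because the example is tiny.

```python
import numpy as np

def leading_eigenvector(M, n_iter=200):
    """Power iteration: repeatedly multiply by M and renormalize to unit L2 norm."""
    x = np.ones(M.shape[0]) / np.sqrt(M.shape[0])
    for _ in range(n_iter):
        x = M @ x
        x /= np.linalg.norm(x)
    return x

# Toy adjacency matrix (hypothetical values): a strong cluster {0, 1, 2}
# and a weakly connected node 3.
M = np.array([[1.0, 0.9, 0.8, 0.0],
              [0.9, 1.0, 0.7, 0.0],
              [0.8, 0.7, 1.0, 0.1],
              [0.0, 0.0, 0.1, 1.0]])
x = leading_eigenvector(M)
```

In this toy case the entries of `x` are non-negative and the weakly connected node receives the smallest value, which is the behaviour the clustering score relies on.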

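As mentioned previously, we impose the constraint that nodes having similar features should have similar labels. We require that $\mathbf{x}$ should be predicted from the features $\mathbf{F}$, through a linear mapping: $\mathbf{x} \approx \mathbf{F}\mathbf{w}$, for some $\mathbf{w} \in \mathbb{R}^{d \times 1}$. Thus, besides the problem of maximizing the clustering score $\mathbf{x}^T \mathbf{M} \mathbf{x}$, we also aim to minimize an error term $(\mathbf{F}\mathbf{w} - \mathbf{x})^T(\mathbf{F}\mathbf{w} - \mathbf{x})$, which enforces a feature-label consistency such that labels could be predicted well from features. After including a regularization term $\mathbf{w}^T\mathbf{w}$, which should be minimized, we obtain the final objective score for segmentation, with weights $\alpha$ and $\beta$ on the error and regularization terms:

$$S(\mathbf{x}, \mathbf{w}) = \mathbf{x}^T \mathbf{M} \mathbf{x} - \alpha (\mathbf{F}\mathbf{w} - \mathbf{x})^T(\mathbf{F}\mathbf{w} - \mathbf{x}) - \beta \mathbf{w}^T\mathbf{w} \tag{2}$$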

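Our goal is to maximize this objective subject to the constraint $\|\mathbf{x}\|_2 = 1$, resulting in our optimization problem:

$$(\mathbf{x}^*, \mathbf{w}^*) = \arg\max_{\mathbf{x}, \mathbf{w}} \; \mathbf{x}^T \mathbf{M} \mathbf{x} - \alpha (\mathbf{F}\mathbf{w} - \mathbf{x})^T(\mathbf{F}\mathbf{w} - \mathbf{x}) - \beta \mathbf{w}^T\mathbf{w} \quad \text{s.t. } \|\mathbf{x}\|_2 = 1 \tag{3}$$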

2.3 Finding the optimal segmentation

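The optimization problem defined in Eq. 3 requires that we find a maximum of the function $S(\mathbf{x}, \mathbf{w})$, subject to the equality constraint $\mathbf{x}^T\mathbf{x} = 1$. In order to solve this problem, we introduce the Lagrange multiplier $\lambda$ and define the Lagrange function:

$$L(\mathbf{x}, \mathbf{w}, \lambda) = \mathbf{x}^T \mathbf{M} \mathbf{x} - \alpha (\mathbf{F}\mathbf{w} - \mathbf{x})^T(\mathbf{F}\mathbf{w} - \mathbf{x}) - \beta \mathbf{w}^T\mathbf{w} - \lambda (\mathbf{x}^T\mathbf{x} - 1) \tag{4}$$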

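The stationary points of our problem (Eq. 3) satisfy $\nabla L(\mathbf{x}, \mathbf{w}, \lambda) = 0$. In the discussion section we show that the stationary point is also a global optimum, namely the principal eigenvector of a specific matrix (Eq. 8), which we do not use, since it is expensive to compute in practice. Next, we obtain the following system of equations:

$$\begin{cases} \nabla_{\mathbf{x}} L = 2\mathbf{M}\mathbf{x} + 2\alpha(\mathbf{F}\mathbf{w} - \mathbf{x}) - 2\lambda\mathbf{x} = 0 \\ \nabla_{\mathbf{w}} L = -2\alpha\mathbf{F}^T(\mathbf{F}\mathbf{w} - \mathbf{x}) - 2\beta\mathbf{w} = 0 \\ \nabla_{\lambda} L = -(\mathbf{x}^T\mathbf{x} - 1) = 0 \end{cases} \tag{5}$$

From $\nabla_{\mathbf{w}} L = 0$ we arrive at the closed-form solution for ridge regression, with optimum $\mathbf{w}^* = \alpha(\alpha\mathbf{F}^T\mathbf{F} + \beta\mathbf{I}_d)^{-1}\mathbf{F}^T\mathbf{x}$, as a function of $\mathbf{x}$. All we have to do now is compute the optimum $\mathbf{x}^*$, for which we take a fixed-point iteration approach, which (as discussed in more detail later) should converge to a solution of the equation $\nabla_{\mathbf{x}} L = 0$.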

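We rewrite $\nabla_{\mathbf{x}} L = 0$ in the form $\mathbf{x} = g(\mathbf{x})$, such that any fixed point of $g$ will be a solution for our initial equation $\nabla_{\mathbf{x}} L = 0$. Thus, we apply a fixed point iteration scheme and iteratively update the value of $\mathbf{x}$ as a function of its previous value. We immediately obtain $g(\mathbf{x}) = \frac{1}{\alpha + \lambda}(\mathbf{M}\mathbf{x} + \alpha\mathbf{F}\mathbf{w}^*)$, where $\mathbf{w}^* = \alpha(\alpha\mathbf{F}^T\mathbf{F} + \beta\mathbf{I}_d)^{-1}\mathbf{F}^T\mathbf{x}$. The term $\frac{1}{\alpha + \lambda}$ cancels out under the L2 normalization and we end up with the following power iteration scheme that optimizes the segmentation objective (Eq. 2) under the L2 constraint $\|\mathbf{x}\|_2 = 1$:

$$\begin{aligned} \mathbf{w}^{(it)} &= \alpha(\alpha\mathbf{F}^T\mathbf{F} + \beta\mathbf{I}_d)^{-1}\mathbf{F}^T\mathbf{x}^{(it)} \\ \mathbf{x}^{(it+1)} &= \frac{\mathbf{M}\mathbf{x}^{(it)} + \alpha\mathbf{F}\mathbf{w}^{(it)}}{\|\mathbf{M}\mathbf{x}^{(it)} + \alpha\mathbf{F}\mathbf{w}^{(it)}\|_2} \end{aligned} \tag{6}$$

where $\mathbf{x}^{(it)}$ and $\mathbf{w}^{(it)}$ are the values of $\mathbf{x}$ and $\mathbf{w}$, respectively, at iteration $it$. We have reached a set of compact segmentation updates at each iteration step, which efficiently combines the clustering score and the regression loss, while imposing a constraint on the norm of $\mathbf{x}$. Note that the actual norm is not important. Values in $\mathbf{x}$ are always non-negative and only their relative values matter. They can be easily scaled and shifted to range between 0 and 1, which we actually do in practice.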

Figure 2: Qualitative evolution of the soft masks over 7 iterations of the proposed GO-VOS algorithm, in the fully unsupervised scenario: the segmentation is initialized (first row) with a non-informative Gaussian, while the features in $\mathbf{F}$ are only motion directions along optical flow chains. Note how the object of interest (in example frames from DAVIS2016) is discovered in very few iterations, even though no supervised information or features are used.

3 Algorithm

In practice we need to estimate the free parameter $\alpha$ (the weight of the regression term), in order to balance in an optimal way the graph term $\mathbf{M}\mathbf{x}$, which depends on node pairs, with the regression term $\mathbf{F}\mathbf{w}$, which depends on features at individual nodes. To keep the algorithm as simple and efficient as possible, we drop $\alpha$ completely and reformulate the iterative process in terms of three separate operations: a propagation step, a regression step and a projection step.

The propagation step is equivalent to the multiplication $\mathbf{x}^{(it+1)} \leftarrow \mathbf{M}\mathbf{x}^{(it)}$, which can be written for a node $i$ as $x_i^{(it+1)} \leftarrow \sum_j \mathbf{M}_{i,j} x_j^{(it)}$. The equation can be implemented efficiently, for all nodes, by propagating the soft labels $x_i^{(it)}$, weighted by the pairwise terms $\mathbf{M}_{i,j}$, to all the nodes from the other frames to which $i$ is connected in the graph according to forward and backward optical flow chains. Thus, starting from a given node $i$ we move along the flow chains, one in each direction, and cast node $i$'s votes, $\mathbf{M}_{i,j} x_i^{(it)}$, to all points $j$ met along the chain: $x_j^{(it+1)} \leftarrow x_j^{(it+1)} + \mathbf{M}_{i,j} x_i^{(it)}$. We also increase the value at node $i$ in the same manner: $x_i^{(it+1)} \leftarrow x_i^{(it+1)} + \mathbf{M}_{i,j} x_j^{(it)}$. By doing so for all pixels in the video, in both directions, we perform, in fact, one iteration of $\mathbf{x}^{(it+1)} \leftarrow \mathbf{M}\mathbf{x}^{(it)}$. Since $\mathbf{M}_{i,j}$ decreases rapidly towards zero with the temporal distance between $i$ and $j$ along a chain, in practice we cast votes only between frames that are within a radius of $k$ time steps. That greatly speeds up the process. Thus, the complexity of propagation is reduced from $O(m^2 p)$ to $O(m p k)$, where $m$ is the number of frames, $p$ the number of pixels per frame and $k$ a relatively small constant.
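The voting scheme can be sketched without ever building the adjacency matrix. The sketch below assumes each chain is stored as a time-ordered list of (frame, node id) pairs beginning at the node that casts the votes; this storage layout, the name `propagate`, and the default `radius` and `sigma` values are illustrative assumptions. For simplicity it omits self-votes and does not merge the case where two nodes are linked in both directions.

```python
import numpy as np

def propagate(x, chains, radius=5, sigma=2.0):
    """One propagation step computed by casting votes along flow chains.

    x: current soft labels, one per node.
    chains: list of chains; each chain is [(frame, node_id), ...] ordered in
            time and starting at the voting node (assumed layout).
    radius: votes are cast only within this many frames.
    """
    x_new = np.zeros(len(x), dtype=float)
    for chain in chains:
        t_i, i = chain[0]
        for t_j, j in chain[1:]:
            dt = abs(t_j - t_i)
            if dt > radius:
                break  # weights beyond the radius are treated as zero
            m_ij = np.exp(-(dt ** 2) / (2.0 * sigma ** 2))
            x_new[j] += m_ij * x[i]  # node i votes for node j
            x_new[i] += m_ij * x[j]  # symmetric contribution back to node i
    return x_new
```

Each chain is visited once and truncated at `radius` frames, so the cost grows with the number of nodes times the radius rather than with the square of the number of nodes.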

The regression step estimates $\mathbf{w}$ for which $\mathbf{F}\mathbf{w}$ best approximates $\mathbf{x}$ in the least squares error sense. Note that this step is equivalent to ridge regression, where the target values $\mathbf{x}$ are unsupervised, given by the current solution that we want to solve for.
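A minimal version of this step is the standard ridge-regression solve shown below. The name `ridge_regression` and the default `beta` are assumptions for illustration; the linear system is solved directly rather than forming an explicit inverse.

```python
import numpy as np

def ridge_regression(F, x, beta=1e-3):
    """Return w minimizing ||F w - x||^2 + beta * ||w||^2.

    F: (n, d) feature matrix, x: (n,) current soft labels.
    beta: regularization weight (assumed value).
    """
    d = F.shape[1]
    # Normal equations of ridge regression: (F^T F + beta I) w = F^T x
    return np.linalg.solve(F.T @ F + beta * np.eye(d), F.T @ x)
```

Since only a d by d system is solved, the cost is dominated by forming `F.T @ F`, which is linear in the number of nodes.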

The projection step: once we compute the optimal $\mathbf{w}$ at the current iteration, we can reset the values in $\mathbf{x}$ to be equal to their predicted values $\mathbf{F}\mathbf{w}$. Thus, if the propagation step is a power iteration that pulls the solution towards the main eigenvector of $\mathbf{M}$, the regression and projection steps take the solution closer to the space in which labels can be predicted from actual node features.

Algorithm: the final GO-VOS algorithm (Alg. 1) is a slightly simplified version of Eq. 6 and brings together, in sequence, the three steps discussed above: propagation, regression and projection:

Algorithm 1 GO-VOS
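The three steps can be chained into a short loop, sketched below on a toy problem. This is an illustrative reading of the description above, not the authors' implementation: it uses a small dense adjacency matrix, a single weight vector for all nodes, and L2 normalization after the propagation and projection steps; `go_vos_sketch`, `beta` and the toy data are assumptions.

```python
import numpy as np

def go_vos_sketch(M, F, x0, beta=1e-3, n_iter=7):
    """Alternate propagation, regression and projection on soft labels x."""
    d = F.shape[1]
    # Ridge operator mapping labels to weights; computed once since F is fixed.
    R = np.linalg.solve(F.T @ F + beta * np.eye(d), F.T)
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_iter):
        x = M @ x                  # propagation along the graph structure
        x = x / np.linalg.norm(x)
        w = R @ x                  # regression of labels on features
        x = F @ w                  # projection onto the feature space
        x = x / np.linalg.norm(x)
    return x

# Toy data (hypothetical): nodes 0-2 form a strong cluster, nodes 3-5 a weak one.
M = np.full((6, 6), 0.05)
M[:3, :3] = 0.9
M[3:, 3:] = 0.2
np.fill_diagonal(M, 1.0)
F = np.array([[1, 1.0], [1, 0.9], [1, 0.8], [1, 0.1], [1, 0.2], [1, 0.0]])

x_uniform = go_vos_sketch(M, F, np.ones(6), n_iter=100)
x_random = go_vos_sketch(M, F, np.random.default_rng(0).random(6), n_iter=100)
```

On this toy problem both initializations end at the same vector, and the strongly connected nodes receive the largest labels.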

Discussion: in practice it is simpler and also more accurate to compute $\mathbf{w}$ per frame, by using features of nodes from that frame only, such that we get a different $\mathbf{w}$ for each frame. This brings a richer representation power, which explains the superior accuracy.

Initialization: in the iterative process defined in Alg. 1 we need to establish the initial labels associated to the graph nodes. Our algorithm is robust and flexible to different initializations, as we show in Sec. 3.1. We could start from soft-segmentation masks generated by other video object segmentation solutions, such as [15, 30, 9], or from non-informative masks, e.g. a central Gaussian mask or even a completely random initialization.
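A non-informative central Gaussian mask of the kind mentioned above could be generated as follows. The parameter `sigma_frac`, which sets the standard deviation as a fraction of the frame size, is an assumed parameterization.

```python
import numpy as np

def central_gaussian_mask(h, w, sigma_frac=0.25):
    """Soft mask of shape (h, w) that peaks at the frame center."""
    ys = np.arange(h) - (h - 1) / 2.0
    xs = np.arange(w) - (w - 1) / 2.0
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    sy, sx = sigma_frac * h, sigma_frac * w
    return np.exp(-(yy ** 2 / (2.0 * sy ** 2) + xx ** 2 / (2.0 * sx ** 2)))
```

Stacking one such mask per frame and flattening the result gives an initial label vector with one value per node.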

The adjacency matrix $\mathbf{M}$ is constructed using the optical flow provided by the solution of [10], pretrained on synthetic data: the FlyingThings3D dataset [23] and the FlyingChairs dataset [7]. Thus, the optical flow solution is in essence unsupervised, as no human annotations are required. Other powerful optical flow methods that do not require human supervision are also available [29]. Thus, creating $\mathbf{M}$ is safely done in an unsupervised manner.

The supervised case: our algorithm can have different degrees of supervision, depending on our choice of feature matrix $\mathbf{F}$. In our experiments we tested both scenarios: when $\mathbf{F}$ contained only simple, untrained features and when it included supervised features, such as the output of other available VOS methods (Sec. 3.1). This is a convenient property of our approach, as it can very easily incorporate different kinds of features (by simply including them in $\mathbf{F}$) and adapt to different levels of supervision.

The unsupervised case: the simple motion and appearance features we experiment with, in the fully unsupervised case, are patterns of colors and motions along the flow chains centered at a given node $i$. Motions can be computed as optical flow displacements along the chain around a node, starting in both directions. Thus, points that move similarly (regardless of absolute location) will have similar motion features. In terms of appearance features, we collect pixel colors along the flow chain around the same pixel $i$. The motion and appearance vectors, for a given pixel $i$, are then concatenated into a single descriptor vector $\mathbf{f}_i$ (Figure 1).

3.1 Algorithm analysis

Further we perform a more in depth analysis of our algorithm and its performance, while addressing different aspects, such as initialization, convergence and the actual features used. Tests are performed on the validation set of DAVIS dataset [27] and we adopt their metrics (Sec. 4.1).

Convergence to global optimum: next we show that our algorithm should always converge to the same solution, regardless of the initialization; the solution depends only on the graph structure $\mathbf{M}$ and the feature matrix $\mathbf{F}$. More precisely, we show that the stationary point of our optimization problem, as defined by Eq. 5, is in fact a global optimum, namely the principal eigenvector of a specific matrix, which we construct below. This implies that the final solution should not depend on the initialization. We observed this behaviour in practice every time, which validates the theoretical conclusion.

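In Eq. 6, if we write $\mathbf{w}$ in terms of $\mathbf{x}$ and replace it in all equations, we can then write $\mathbf{x}^{(it+1)}$ as a function of $\mathbf{x}^{(it)}$, $\mathbf{F}$ and $\mathbf{M}$:

$$\mathbf{x}^{(it+1)} = \frac{\mathbf{A}\mathbf{x}^{(it)}}{\|\mathbf{A}\mathbf{x}^{(it)}\|_2} \tag{7}$$

where matrix $\mathbf{A}$ is defined as:

$$\mathbf{A} = \mathbf{M} + \alpha^2 \mathbf{F}(\alpha\mathbf{F}^T\mathbf{F} + \beta\mathbf{I}_d)^{-1}\mathbf{F}^T \tag{8}$$

Matrix $\mathbf{A}$ is symmetric, therefore the power iteration in Eq. 7 will converge to its principal eigenvector. Therefore, at least in theory, the stationary point is the global optimum, the principal eigenvector of matrix $\mathbf{A}$. We do not implement this method in practice and choose the simpler and effective Algorithm 1. It would be practically impossible to explicitly compute $\mathbf{A}$ (which is dense and of size $n \times n$, with $n$ the number of pixels).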

Motion structure vs. feature projection: Algorithm 1 could also be written as a power iteration method, with a slightly different $\mathbf{A}$. Therefore the actual Algorithm 1 is also guaranteed to converge to the leading eigenvector of $\mathbf{A}$, which can be factored as the product $\mathbf{A} = \mathbf{P}\mathbf{M}$, where $\mathbf{M}$ can be seen as a motion structure matrix and $\mathbf{P}$ is the feature projection matrix. Thus, at the point of convergence, the segmentation reaches an equilibrium between its motion structure in spacetime and its consistency with the features.

Convergence in practice: the role of initialization. Our experiments verify the theoretical result above. We observed that the method approaches the same point of convergence, regardless of the initialization. We considered different choices for the initialization $\mathbf{x}^{(0)}$, ranging from uninformative masks, such as an isotropic Gaussian soft-mask placed in the center of each frame with varied standard deviations, a randomly initialized mask or a uniform full white mask, to masks given by state of the art methods, such as ELM [15] and PDB [30]. In Figure 3 we present an example result, showing the evolution of the soft-segmentation masks over three iterations of our algorithm, when we start from a random mask. We observe that the main object of interest emerges from this initial random mask, as its soft-segmentation mask is visibly improved after each iteration. In Figure 2 we present more examples regarding the evolution of our soft-segmentation masks over several iterations.

Figure 3: Qualitative results of our method, over three iterations, when initialized with a random soft-segmentation mask and using only unsupervised features (i.e. color and motion along flow chains). Note that the main object emerges with each iteration of our algorithm.

In Figure 4 we present quantitative results which confirm the theoretical insights. The performance evolves in terms of Jaccard index (J Mean), over seven iterations of our algorithm, towards the same common segmentation. Note, as expected, that convergence is faster for methods that start closer to the convergence point.

Figure 4: Quantitative results of our method considering different initial $\mathbf{x}^{(0)}$, but using the same unsupervised features (color and motion along flow chains). Initializations are (examples given on the right): initial soft-segmentation central Gaussians, with diverse standard deviations, random initial soft mask, uniform white mask and two supervised state of the art solutions [30, 15]. Note that, regardless of the initialization, the final metric converges towards the same value, as predicted theoretically in Sec. 3.1. Tests are performed on the full DAVIS validation set.

Convergence in practice: the role of features. We further study experimentally the influence of features, which are expected to have a strong impact on the final result. Our tests clearly show that by adding strong, informative features the performance is boosted significantly. For these experiments we considered two different starting points $\mathbf{x}^{(0)}$: soft-segmentation masks provided by the method of Song et al. [30] (PDB) and non-informative Gaussian soft-segmentation masks. Regarding the features used in $\mathbf{F}$, we also have two options: to use only motion vectors along the flow chains (as explained previously) or to combine (concatenate) the motion patterns with soft-segmentation values along the same path, as they are provided by [30]. The first choice is completely unsupervised, while the second is supervised, as the method of Song et al. is based on features pre-trained with a supervised signal. For simplicity, we will refer to the two solutions as the unsupervised solution (motions-only features) and the supervised solution (soft-segmentation masks of [30] are considered as features, along with the motion features). We present results in Figure 5. Note that regardless of the starting point, the method converges towards the global optimum, which is strictly dictated by the set of considered features.

To conclude, if we initialize with a poor solution, we expect to improve over iterations. However, if the initialization point is better than the point of convergence (in terms of the segmentation evaluation metric), then the solution will degrade over iterations. The main point is that the only way to bring the solution up is to improve the features in $\mathbf{F}$ or the optical flow that defines the graph structure in $\mathbf{M}$ (through optical flow chains). That is because the solution is unique and depends only on $\mathbf{M}$ and $\mathbf{F}$.

Figure 5: Performance evolution over seven iterations of our algorithm, when we consider non-informative or informative initialization and vary the set of features. Tests are performed on full DAVIS validation set.

4 Experiments

We compare our proposed approach, GO-VOS, against state of the art solutions for video object segmentation, on three challenging datasets DAVIS [27], SegTrack v2 [18] and YouTube-Objects [12]. In all the experiments we have initialized the soft-segmentation masks with non-informative Gaussian soft-masks placed in the center of each frame. Unless otherwise specified, we have used only unsupervised features: motion and color cues. We present some qualitative results in Figure 6, in comparison to other methods on the DAVIS2016 dataset, which we present next.

Figure 6: Qualitative results of the proposed GO-VOS algorithm, in the fully unsupervised case. We also present results of four other approaches: PDB [30], ARP[14], ELM [15] and FST [25].

4.1 DAVIS dataset

Perazzi et al. introduce in [27] a new benchmark dataset and evaluation methodology for video object segmentation tasks. The original dataset is composed of 50 videos (30 train + 20 test), each accompanied by accurate, per pixel annotations of the foreground region. The foreground may be composed of multiple connected objects (e.g. a cyclist and the bicycle), but for the 2016 version of the dataset, those connected objects are considered as a single object and this is the version we test on. DAVIS is a challenging dataset as it contains many difficult cases such as appearance changes, occlusions and motion blur. Metric: For evaluation we compute both region-based (J Mean) and contour-based (F Mean) measures as established in [27]. J Mean is computed as the intersection over union between the estimated segmentation and the ground truth. F Mean is the F-measure of the segmentation contour points (for details see [27]). Results: In Table 1 we compare our method (GO-VOS) against both supervised and unsupervised methods on the task of unsupervised single object segmentation on the DAVIS validation set. For the supervised formulation we consider the soft-segmentation masks provided by various state of the art solutions as additional features concatenated in the feature matrix $\mathbf{F}$ (Sec. 3.1). We highlight that our fully unsupervised GO-VOS achieves state of the art results among fully unsupervised methods and also improves over the ones that use supervised pretrained features.

| Task | Features | Method | J Mean | F Mean | sec/frame |
|---|---|---|---|---|---|
| Unsupervised | Supervised features | PDB [30] | 77.2 | 74.5 | **0.05** |
| | | ARP [14] | 76.2 | 70.6 | N/A |
| | | LVO [32] | 75.9 | 72.1 | N/A |
| | | FSEG [11] | 70.7 | 65.3 | N/A |
| | | LMP [31] | 70.0 | 65.9 | N/A |
| | | GO-VOS supervised + features of [30] | **79.9** (+2.7) | **78.1** | 0.61 |
| | | GO-VOS supervised + features of [14] | 78.7 (+2.5) | 73.1 | 0.61 |
| | | GO-VOS supervised + features of [32] | 77.0 (+1.1) | 73.7 | 0.61 |
| | | GO-VOS supervised + features of [11] | 74.1 (+3.5) | 69.9 | 0.61 |
| | | GO-VOS supervised + features of [31] | 73.7 (+3.7) | 69.2 | 0.61 |
| | Unsupervised | ELM [15] | 61.8 | **61.2** | 20 |
| | | FST [25] | 55.8 | 51.1 | 4 |
| | | CUT [13] | 55.2 | 55.2 | 1.7 |
| | | NLC [8] | 55.1 | 52.3 | 12 |
| | | GO-VOS unsupervised | **65.0** | 61.1 | **0.91** |

Table 1: Quantitative results of our method, compared with state of the art solutions on the DAVIS dataset. Values in parentheses give the J Mean gain over the method whose features are used. Our proposed method achieves state of the art results, while also being the fastest among top unsupervised methods. [30] is much faster as it is based on a trained network and requires only a forward pass at test time. (bold - best result within the supervised-features group and within the unsupervised group)
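The region measure J Mean used in Table 1 can be sketched as the mean intersection over union between binary masks. The function names are hypothetical, and scoring a frame with two empty masks as 1 is an assumed convention; the official benchmark code should be used for reported numbers.

```python
import numpy as np

def jaccard(pred, gt):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: counted as a perfect match (assumption)
    return float(np.logical_and(pred, gt).sum() / union)

def j_mean(pred_masks, gt_masks):
    """Average per-frame Jaccard index over a sequence."""
    return float(np.mean([jaccard(p, g) for p, g in zip(pred_masks, gt_masks)]))
```

Soft masks such as the ones produced by the method would first need to be thresholded into binary masks before calling `jaccard`.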

4.2 SegTrack v2 dataset

The SegTrack dataset was originally introduced in [33] and further adapted for the task of video object segmentation in [18]. The latest version of the dataset (v2) contains 14 videos with pixel level annotations for the main objects of interest (8 videos with one primary object and 6 videos with multiple objects). In contrast to the multiple-object videos of the DAVIS dataset, the multiple-object videos here include objects that are separated from each other. SegTrack contains deformable and dynamic objects, with video at relatively poor resolution, making it a very challenging dataset for video object segmentation. Metric: For evaluation we used the average intersection over union score. Results: In Table 2 we present quantitative results of our method, in the fully unsupervised case, and compare to published methods that test on all 14 videos in SegTrack v2. Our solution achieves the top score and is the second fastest. We did not include in the comparison the solution of [8]: although it is designed for and tested on SegTrack, it was tested on only 12 videos.

Task | Method | IoU | sec/frame
Unsupervised (supervised features) | KEY [16] | 57.3 | 120
Unsupervised (supervised features) | FSEG [11] | 61.4 | N/A
Unsupervised (supervised features) | LVO [32] | 57.3 | N/A
Unsupervised (supervised features) | [20] | 59.3 | N/A
Unsupervised | FST [25] | 54.3 | 4
Unsupervised | CUT [13] | 47.8 | 1.7
Unsupervised | HPP [9] | 50.1 | 0.35*
Unsupervised | GO-VOS unsupervised | 62.2* | 0.91
Table 2: Quantitative results of our method, compared with state-of-the-art solutions on the SegTrack v2 dataset. (* - best unsupervised result in each column)

4.3 YouTube-Objects dataset

The YouTube-Objects (YTO) dataset [28] consists of videos collected from YouTube (720,000 frames). It is also very challenging, and contains thousands of video shots (2511). However, the ground truth annotations on YTO are only bounding boxes. Although we do not test against pixel-level annotations, the tests on YTO are relevant considering the large number of videos and their wide diversity. In the paper we show results on the latest version of the dataset (v2.2), which has more annotated boxes (6975), but we also provide results on v1.0 in the supplementary material. Following the methodology of published works on YTO, we test our solution on the training set, which contains videos with only one annotated object.

Metric: We used the CorLoc metric, computing the percentage of correctly localized object bounding boxes. A box is considered to be correct based on the PASCAL criterion (IoU ≥ 0.5).

Results: In Table 3 we present the results on YTO v2.2 and compare against the published state of the art. All methods are fully unsupervised. We obtain the top average score, while outperforming the other methods on 5 out of 10 object classes.
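The CorLoc computation can be sketched as follows. This is a minimal illustration, not the evaluation code used for the experiments: the names `box_iou` and `corloc`, the `(x_min, y_min, x_max, y_max)` box format, and the default threshold of 0.5 (the value commonly associated with the PASCAL criterion) are assumptions of this example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(pred_boxes, gt_boxes, threshold=0.5):
    """Percentage of predicted boxes whose IoU with the ground truth
    reaches the threshold (assumed here to be 0.5)."""
    hits = sum(1 for p, g in zip(pred_boxes, gt_boxes)
               if box_iou(p, g) >= threshold)
    return 100.0 * hits / len(gt_boxes)


# Toy example: the first prediction overlaps its ground truth with IoU 0.5,
# the second does not overlap at all.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 5), (20, 20, 30, 30)]
print(corloc(preds, gts))  # 50.0
```

Since the method produces segmentation masks, a predicted box would in practice be derived from the mask, for instance as its tight bounding rectangle; that step is omitted here.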

Method | aero | bird | boat | car | cat | cow | dog | horse | moto | train | avg | sec/frame
[6] | 75.7 | 56.0 | 52.7 | 57.3 | 46.9 | 57.0* | 48.9 | 44.0 | 27.2 | 56.2 | 52.2 | 0.02*
HPP [9] | 76.3 | 68.5 | 54.5* | 50.4 | 59.8* | 42.4 | 53.5 | 30.0 | 53.5* | 60.7* | 54.9 | 0.35
GO-VOS unsupervised | 79.8* | 73.5* | 38.9 | 69.6* | 54.9 | 53.6 | 56.6* | 45.6* | 52.2 | 56.2 | 58.1* | 0.91
Table 3: Quantitative results of our method (CorLoc metric), compared with state-of-the-art solutions on the YouTube-Objects v2.2 dataset. We have state-of-the-art results for 5 out of 10 classes, and on average. [6] is much faster as it is based on a trained network and requires only a forward pass at test time. (* - best solution in each column)

4.4 Computation cost

Our problem formulation is defined over a dense graph, with one node for every pixel in the video and long-range connections given by optical flow chains, so one would expect the method to be slow and memory-expensive. However, we never construct the adjacency matrix, and we break the optimization into three steps which, one could show, are all linear in the number of pixels in the video. As a result, our algorithm is actually fast (less than 1 sec per frame, total computation time), quicker than most published methods on this task, as shown in Tables 1, 2 and 3. In more detail, the per-frame cost consists of computing the optical flow, computing the information related to the matrices of our formulation, and running the optimization steps, which we let run for 7 iterations; together these give the total runtime reported in the tables. We implemented the method in PyTorch and performed the runtime analysis on a computer with the following specifications: Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz, GPU GeForce GTX 1080.
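To make the cost argument concrete, the sketch below shows a generic matrix-free power iteration, in which the product of the adjacency matrix with the current solution is computed directly from per-node link lists, so the full matrix is never stored. It is only an illustration of the general idea and not the GO-VOS update steps: the function `power_iteration`, the `links` structure (a hypothetical stand-in for connections along optical flow chains), and the toy graph are all assumptions of this example. Only the iteration count of 7 is taken from the text.

```python
import math


def power_iteration(links, num_iters=7):
    """Matrix-free power iteration.

    links[i] is a list of (j, weight) pairs describing the neighbors of
    node i. Each iteration costs time proportional to the number of links,
    and no N x N adjacency matrix is ever built.
    """
    n = len(links)
    x = [1.0 / math.sqrt(n)] * n  # uniform start with unit L2 norm
    for _ in range(num_iters):
        # y = M x, computed link by link.
        y = [sum(w * x[j] for j, w in nbrs) for nbrs in links]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x


# Toy graph: nodes 0, 1, 2 form a strongly connected cluster,
# node 3 is only weakly attached to node 2.
links = [
    [(1, 1.0), (2, 1.0)],
    [(0, 1.0), (2, 1.0)],
    [(0, 1.0), (1, 1.0), (3, 0.1)],
    [(2, 0.1)],
]
x = power_iteration(links)
print([round(v, 3) for v in x])
```

In this toy case the three strongly linked nodes receive large values and the weakly attached node a small one, which is the clustering behavior a spectral formulation relies on. With a bounded number of links per node, the per-iteration cost grows linearly with the number of nodes.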

5 Conclusions

We presented an efficient solution for video object segmentation, defined as an optimization problem in a spacetime graph, with nodes at every pixel in the video and long-range links between nodes connected through optical flow chains. Our mathematical formulation enforces that the final segmentation forms a strong cluster defined by the motion structure of the object in the video and that it is also consistent with the features on the nodes. The two forces, motion and features, are brought together into a single optimization problem which reaches, through our proposed algorithm, the global optimum in a few iterations. We show in extensive experiments on three challenging datasets, namely DAVIS 2016, SegTrack v2 and YouTube-Objects v2.2, that our algorithm is relatively fast and accurate. GO-VOS outperforms other unsupervised and supervised methods on these challenging datasets.


References

  • [1] L. Bao, B. Wu, and W. Liu. Cnn in mrf: Video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5977–5986, 2018.
  • [2] T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. In European conference on computer vision, pages 282–295. Springer, 2010.
  • [3] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 221–230, 2017.
  • [4] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1189–1198, 2018.
  • [5] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang. Fast and accurate online video object segmentation via tracking parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7415–7424, 2018.
  • [6] I. Croitoru, S.-V. Bogolin, and M. Leordeanu. Unsupervised learning from video to detect foreground objects in single images. In Proceedings of the IEEE International Conference on Computer Vision, pages 4335–4343, 2017.
  • [7] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 2758–2766, 2015.
  • [8] A. Faktor and M. Irani. Video segmentation by non-local consensus voting. In BMVC, volume 2, page 8, 2014.
  • [9] E. Haller and M. Leordeanu. Unsupervised object segmentation in video by efficient selection of highly probable positive features. In Proceedings of the IEEE International Conference on Computer Vision, pages 5085–5093, 2017.
  • [10] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2462–2470, 2017.
  • [11] S. D. Jain, B. Xiong, and K. Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos. arXiv preprint arXiv:1701.05384, 2(3):6, 2017.
  • [12] V. Kalogeiton, V. Ferrari, and C. Schmid. Analysing domain shift factors between videos and images for object detection. IEEE transactions on pattern analysis and machine intelligence, 38(11):2327–2334, 2016.
  • [13] M. Keuper, B. Andres, and T. Brox. Motion trajectory segmentation via minimum cost multicuts. In Proceedings of the IEEE International Conference on Computer Vision, pages 3271–3279, 2015.
  • [14] Y. J. Koh and C.-S. Kim. Primary object segmentation in videos based on region augmentation and reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 6, 2017.
  • [15] D. Lao and G. Sundaramoorthi. Extending layered models to 3d motion. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435–451, 2018.
  • [16] Y. J. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In 2011 International conference on computer vision, pages 1995–2002. IEEE, 2011.
  • [17] M. Leordeanu, R. Sukthankar, and M. Hebert. Unsupervised learning for graph matching. International journal of computer vision, 96(1):28–45, 2012.
  • [18] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2192–2199, 2013.
  • [19] J. Li, A. Zheng, X. Chen, and B. Zhou. Primary video object segmentation via complementary cnns and neighborhood reversible flow. In Proceedings of the IEEE International Conference on Computer Vision, pages 1417–1425, 2017.
  • [20] S. Li, B. Seybold, A. Vorobyov, A. Fathi, Q. Huang, and C.-C. Jay Kuo. Instance embedding transfer to unsupervised video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6526–6535, 2018.
  • [21] J. Luiten, P. Voigtlaender, and B. Leibe. Premvos: Proposal-generation, refinement and merging for the davis challenge on video object segmentation 2018. In The 2018 DAVIS Challenge on Video Object Segmentation-CVPR Workshops, 2018.
  • [22] K.-K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. Video object segmentation without temporal information. arXiv preprint arXiv:1709.06031, 2017.
  • [23] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
  • [24] M. Meila and J. Shi. A random walks view of spectral segmentation. In AISTATS, 2001.
  • [25] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In Proceedings of the IEEE International Conference on Computer Vision, pages 1777–1784, 2013.
  • [26] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2663–2672, 2017.
  • [27] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition, 2016.
  • [28] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3282–3289. IEEE, 2012.
  • [29] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Epicflow: Edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1164–1172, 2015.
  • [30] H. Song, W. Wang, S. Zhao, J. Shen, and K.-M. Lam. Pyramid dilated deeper convlstm for video salient object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 715–731, 2018.
  • [31] P. Tokmakov, K. Alahari, and C. Schmid. Learning motion patterns in videos. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 531–539. IEEE, 2017.
  • [32] P. Tokmakov, K. Alahari, and C. Schmid. Learning video object segmentation with visual memory. arXiv preprint arXiv:1704.05737, 2017.
  • [33] D. Tsai, M. Flagg, A. Nakazawa, and J. M. Rehg. Motion coherent tracking using multi-label mrf optimization. International journal of computer vision, 100(2):190–202, 2012.
  • [34] Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmentation via object flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3899–3908, 2016.
  • [35] P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for the 2017 davis challenge on video object segmentation. In The 2017 DAVIS Challenge on Video Object Segmentation-CVPR Workshops, volume 5, 2017.
  • [36] W. Wang, J. Shen, and F. Porikli. Saliency-aware geodesic video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3395–3402, 2015.
  • [37] S. Wug Oh, J.-Y. Lee, K. Sunkavalli, and S. Joo Kim. Fast video object segmentation by reference-guided mask propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7376–7385, 2018.
  • [38] T. Zhuo, Z. Cheng, P. Zhang, Y. Wong, and M. Kankanhalli. Unsupervised online video object segmentation with motion property understanding. arXiv preprint arXiv:1810.03783, 2018.