Accurate video segmentation is an important step in many high-level computer vision tasks. It can provide, for example, window proposals for object detection [12, 26] or action tubes for action recognition [14, 13]. One of the key challenges in video segmentation is handling the large amount of data. Traditionally, methods either build upon some fine-grained image segmentation or supervoxel method [11, 10, 21, 22, 34], or they group previously computed point trajectories (e.g. [24, 19]) and transform them into dense segmentations in a postprocessing step. The latter approach is well suited for motion segmentation applications but has general issues with segmenting non-moving, or only slightly moving, objects. Indeed, image segmentation into small segments forms the basis for many high-level video segmentation methods such as [10, 22, 34]. A key question when employing such preprocessing is the error it introduces. While state-of-the-art image segmentation methods [2, 18, 3] offer highly precise boundary localization, they usually suffer from low temporal consistency, i.e., the superpixel shapes and sizes can change drastically from one frame to the next. This causes flickering effects in high-level segmentation methods.
In this paper, we present a low-level video segmentation method that aims at producing spatio-temporal superpixels with high temporal consistency in a bottom-up way. To this end, we employ an affinity measure that has recently been proposed for image segmentation. While other, learning-based methods slightly outperform this measure on the image segmentation task, they can hardly be transferred to video data because their boundary detection requires training data that is currently not available for videos. In contrast, the employed measure learns boundary probabilities in an unsupervised way from local image statistics, which can be transferred to video data.
To generate hierarchical segmentations, we build upon an established method for low-level image segmentation and make it applicable to video data. More specifically, we build an affinity matrix according to this measure for each entire video over space and time.
Solving for the eigenvectors and eigenvalues of the resulting affinity matrices would require an enormous amount of computational resources. Instead, we show that solving the eigensystem for small temporal windows is sufficient and even produces results superior to those computed on the full system. To generate spatio-temporal segmentations from these eigenvectors, following what has been proposed for images, we need to generate small spatio-temporal segments from the eigenvectors. We do so by extending the oriented watershed transform to three dimensions. This is a substantial step since, apart from inferring object boundaries within a single frame, it also allows us to predict where the same objects are located in the next frame. We achieve this without the need to compute optical flow. Instead, temporal consistency is maintained simply by the local affinities computed between frames and the smoothness within the resulting eigenvectors.
Once we have estimated the spatio-temporal boundary probabilities, we can apply the ultrametric contour map approach to the three-dimensional data.
We show that the proposed low-level video segmentation method can compete with high-level learning-based approaches on the VSB100 [11, 28] video segmentation benchmark. In terms of temporal consistency, measured by the region metric VPR (volume precision recall), we outperform the state of the art.
1.1 Related work
An important key to reliable image segmentation is boundary detection. Most recent methods compute informative image boundaries with learning-based methods, using either random forests [8, 17] or deep networks [4, 5, 30]. Provided a sufficient amount of training data, these methods improve over the spectral analysis-based methods [2, 18] that defined the state of the art before. However, the output of such methods is a proxy for boundary probabilities and does not provide a segmentation into closed boundaries.
Actual segmentations can be built from these boundaries by the well-established oriented watershed and ultrametric contour map approach as in [2, 18, 3] or, as recently proposed, by minimum cost lifted multicuts.
Our proposed algorithm is most closely related to the image segmentation method in which the PMI measure was originally defined. The advantage of this measure is that it does not rely on any training data but estimates image affinities from local image statistics. We give some details of this approach in section 3.
The use of supervoxels and supervoxel hierarchies has been strongly promoted in recent video segmentation methods [32, 7, 31, 16]. These supervoxels provide small spatio-temporal segments built from basic image cues such as color and edge information. While some approaches tackle the problem of finding the best flattening of a supervoxel hierarchy, others build a graph upon supervoxels to introduce higher-level knowledge. Similarly, [10, 21, 22, 34] propose to build graphs upon superpixel segmentations and use learned information [21, 22] or multiple high-level cues to generate state-of-the-art video segmentations.
An attempt towards temporally consistent superpixels has been made on the basis of highly optimized image superpixels and optical flow. Similar to the proposed method, it tries to make use of advances in image segmentation for video segmentation. However, the result is still a (temporally somewhat more consistent) frame-wise segmentation that is processed into a video segmentation by a graph-based method. In the benchmark paper, a baseline method for temporally consistent video segmentation has been proposed: starting from a state-of-the-art hierarchical image segmentation computed on one video frame, the segmentation is propagated to the remaining frames by optical flow. The relatively good performance of this simple approach indicates that low-level cues from the individual video frames have high potential to improve video segmentation over the current state of the art.
An extension of an image segmentation method to video data has also been proposed. In that work, the temporal link is established by optical flow and the pixel-wise eigensystem is solved for the whole video based on heavy GPU parallelization. Temporally consistent labelings are computed from the eigenvectors by direct spectral clustering, thus avoiding the need to handle temporal gradients.
In contrast, our method neither needs precomputed optical flow nor depends on solving the full eigensystem. Instead, temporal consistency is established by an in-between-frame evaluation of the point-wise mutual information. Further, we extend the oriented watershed approach to the spatio-temporal domain so that we can directly follow the established approach for computing the ultrametric contour map. In this setup, we can show that solving the eigensystem for temporal windows even improves over segmentations computed from solving the full eigensystem.
2 Method Overview
The proposed method is a video segmentation adaptation of a pipeline that has been used, with slight variations, in several previous works on image segmentation [2, 18, 3]. The key steps are given in Fig. 2. We start from the entire video sequence and compute a full affinity matrix at multiple scales. Eigenvectors are computed for these scales within small overlapping temporal windows of three frames. On the three-dimensional spatio-temporal volumes of eigenvectors, spatial and temporal boundaries can be estimated. These can be fed into the ultrametric contour map hierarchical segmentation adapted for three-dimensional data.
3 Point-wise mutual information
We follow the original formulation in defining the point-wise mutual information (PMI) measure employed for the definition of pairwise affinities. Let the random variables $A$ and $B$ denote a pair of neighboring features. The joint probability $P(A,B)$ is defined as a weighted sum of the joint probabilities $p(A,B;d)$ of features $A$ and $B$ occurring at a Euclidean distance of $d$:
$$P(A,B) = \frac{1}{Z} \sum_{d=1}^{\infty} w(d)\, p(A,B;d).$$
Here, $Z$ is a normalization constant and the weighting function $w$ is a Gaussian normal distribution with mean value two. The marginals of the above distribution are used to define $P(A)$ and $P(B)$. To define the affinity of two neighboring points, the direct use of the joint probability has the disadvantage of being biased by the frequency of occurrence of $A$ and $B$, i.e. if a feature occurs frequently in an image, it will have a relatively high probability of co-occurring with any other feature. The PMI corrects for this imbalance:
$$\mathrm{PMI}_{\rho}(A,B) = \log \frac{P(A,B)^{\rho}}{P(A)\,P(B)}.$$
The parameter $\rho$ was optimized on the training set of the BSD500 image segmentation benchmark. We stick to the resulting parameter choice.
The crucial part of this affinity measure that makes it easily applicable to unsupervised boundary detection is that $P(A,B)$ is learned specifically for every image from local image statistics. More specifically, for 10000 random sample locations per image, features $A$ and $B$ with mutual distance $d$ are sampled. To model the distribution, kernel density estimation is employed.
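To make the effect of the PMI normalization concrete, the following minimal sketch estimates an exponent-weighted PMI from sampled pairs of discrete feature values. It is an illustration under simplifying assumptions, not the pipeline described here: it replaces the kernel density estimate on continuous features by plain co-occurrence counts, and the function name `pmi_table` and the default exponent `rho=1.0` (the standard PMI) are hypothetical choices.

```python
import math
from collections import Counter


def pmi_table(pairs, rho=1.0):
    """Estimate PMI_rho for discrete feature pairs from co-occurrence samples.

    `pairs` is a list of (a, b) feature values sampled at neighboring
    locations. Counts are symmetrized so that PMI(a, b) == PMI(b, a).
    Assumption: counting replaces the kernel density estimate.
    """
    joint = Counter()
    for a, b in pairs:
        joint[(a, b)] += 1
        joint[(b, a)] += 1
    total = sum(joint.values())

    # Marginal P(a), obtained by summing the joint over the second feature.
    marginal = Counter()
    for (a, _), count in joint.items():
        marginal[a] += count

    table = {}
    for (a, b), count in joint.items():
        p_ab = count / total
        p_a = marginal[a] / total
        p_b = marginal[b] / total
        table[(a, b)] = math.log(p_ab ** rho / (p_a * p_b))
    return table


# Two features that mostly co-occur with themselves and rarely with each other.
samples = [("sky", "sky")] * 8 + [("tree", "tree")] * 8 + [("sky", "tree")] * 2
pmi = pmi_table(samples)
```

In this toy example the pair `("sky", "sky")` receives a positive value and the rare pair `("sky", "tree")` a negative one, so the latter would be treated as evidence for a boundary.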
3.1 Spatio-temporal affinities
Such statistics could be computed on entire videos and used to generate segmentations. However, there is a strong reason not to compute them over all video frames: image statistics can change drastically during a video or image sequence, e.g. when new agents enter the scene, the camera moves, or the illumination changes. To be robust towards these changes, we chose to estimate the statistics per video frame and use them for the computation of affinities within this frame and between this frame and the next. This is justified by the assumption that changes in local statistics are temporally smooth.
Thus, within every frame $t$, affinities of its elements are computed according to the statistics estimated for this frame on color and local variance, from every pixel to every pixel within a fixed pixel radius. Between every frame $t$ and its successor $t+1$, affinities are computed according to the estimated statistics for every pixel in frame $t$ to every pixel in frame $t+1$ within a fixed spatial distance and, likewise, for every pixel in frame $t+1$ to every pixel in frame $t$ within the same distance. The result is a sparse symmetric spatio-temporal affinity matrix as given in Fig. 2.
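The sparsity pattern of such a spatio-temporal affinity matrix can be sketched as follows. The snippet only enumerates which pixel pairs would receive an affinity; the affinity values themselves are left out. The function name, the pixel indexing scheme, and the use of a single `radius` for both within-frame and between-frame links are assumptions made for illustration.

```python
def spatiotemporal_edges(n_frames, height, width, radius):
    """Enumerate undirected pixel pairs that receive an affinity.

    Pixels are indexed as (t * height + y) * width + x. Pairs are linked
    within a frame and between a frame and its successor whenever their
    spatial distance is at most `radius` (assumed equal in both cases).
    """
    def index(t, y, x):
        return (t * height + y) * width + x

    offsets = [(dy, dx)
               for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)
               if dy * dy + dx * dx <= radius * radius]
    edges = set()
    for t in range(n_frames):
        for y in range(height):
            for x in range(width):
                i = index(t, y, x)
                for dt in (0, 1):  # same frame, then the successor frame
                    if t + dt >= n_frames:
                        continue
                    for dy, dx in offsets:
                        yy, xx = y + dy, x + dx
                        if not (0 <= yy < height and 0 <= xx < width):
                            continue
                        j = index(t + dt, yy, xx)
                        if i != j:
                            # Store each pair once; the matrix is symmetric.
                            edges.add((min(i, j), max(i, j)))
    return edges
```

Because links exist only within a frame and to the direct successor, the resulting matrix is block-tridiagonal over time, which is what makes restricting the eigensystem to short temporal windows natural.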
4 Spectral Boundary Detection
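Given an affinity matrix $W$, spectral clustering can be employed to generate boundary probabilities [2, 18] and segmentations [9, 10, 22] according to a balancing criterion, more precisely, by approximating the normalized cut
$$\mathrm{NCut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{vol}(A)} + \frac{\mathrm{cut}(A,B)}{\mathrm{vol}(B)}$$
with $\mathrm{cut}(A,B) = \sum_{i \in A,\, j \in B} w_{ij}$ and $\mathrm{vol}(A) = \sum_{i \in A} \sum_{j} w_{ij}$, where $A$ and $B$ here denote the two disjoint vertex sets of the partition.
Approximate solutions to the normalized cut are induced by the first eigenvectors of the normalized graph Laplacian $L = D^{-1/2}(D - W)D^{-1/2}$, where $D$ is the diagonal degree matrix of $W$ computed by $d_{ii} = \sum_{j} w_{ij}$.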
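As a small illustration of how eigenvectors of a normalized graph Laplacian induce a partition, the sketch below uses the symmetric normalization $I - D^{-1/2} W D^{-1/2}$ on a six-node toy graph. It uses a dense solver from NumPy, which is only feasible for tiny graphs; the function name and the toy affinities are assumptions for illustration, and a sparse eigensolver would be needed at video scale.

```python
import numpy as np


def ncut_eigenvectors(W, k):
    """Return the k smallest eigenvalues and eigenvectors of the symmetric
    normalized Laplacian L = I - D^{-1/2} W D^{-1/2} of affinity matrix W."""
    degrees = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(L)  # ascending eigenvalues
    return eigenvalues[:k], eigenvectors[:, :k]


# Toy graph: two tightly connected triples joined by one weak edge.
W = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)
values, vectors = ncut_eigenvectors(W, k=2)
# The sign of the second eigenvector separates the two triples.
labels = vectors[:, 1] > 0
```

The first eigenvalue is zero for a connected graph and carries no grouping information; the following eigenvectors vary smoothly within strongly connected regions and change across weak links, which is the property exploited for boundary detection.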
However, the computation of eigenvectors for large affinity matrices rapidly becomes expensive both in terms of computation time and memory consumption. To keep the computation tractable, we can reduce the computation to small temporal windows and employ the spectral graph reduction technique.
Spectral Graph Reduction
Spectral graph reduction is a means of solving a spectral clustering or normalized cut problem on a reduced set of points. In this setup, the matrix $W$ defines the edge weights in a graph $G = (V, E)$. Given some pre-grouping of vertices, for example by superpixels or must-link constraints, the method specifies how to set these weights in a new graph $G' = (V', E')$, where $V'$ represents the set of vertex groups and $E'$ the set of edges between them, such that the normalized cut objective does not change. Its advantages have been shown on low-level image segmentation as well as on high-level video segmentation.
Since we want to remain as close to the low-level problem as possible, we employ a setup similar to the one proposed for image segmentation. More specifically, we compute superpixels at the finest level of a hierarchical image segmentation for every frame, which builds upon learned boundary probabilities. In order not to lose accuracy in boundary localization, it has been proposed to keep single pixels in all regions with high gradients and to investigate the necessary trade-off between pixels and superpixels. Similarly, we keep single pixels in all regions with high boundary probability.
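The idea of reducing a graph without changing the normalized cut objective can be illustrated with a small sketch. It assumes one reweighting with this property: affinities between groups are summed, and affinities inside a group are kept as a self-loop so that group volumes are preserved. The function names are hypothetical, and the cited work should be consulted for the exact definition it uses.

```python
import numpy as np


def reduce_graph(W, groups):
    """Sum affinities between vertex groups. Within-group weights become
    self-loops, so the volume of every group is preserved (assumed scheme)."""
    n_groups = max(groups) + 1
    membership = np.zeros((len(groups), n_groups))
    membership[np.arange(len(groups)), groups] = 1.0
    return membership.T @ W @ membership


def ncut(W, side):
    """Normalized cut of the bipartition given by the boolean vector `side`."""
    side = np.asarray(side, dtype=bool)
    cut = W[side][:, ~side].sum()
    return cut / W[side].sum() + cut / W[~side].sum()


W = np.array([
    [0, 2, 1, 0, 0, 0],
    [2, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.5, 0, 0],
    [0, 0, 0.5, 0, 1, 3],
    [0, 0, 0, 1, 0, 3],
    [0, 0, 0, 3, 3, 0],
], dtype=float)
groups = [0, 0, 1, 2, 2, 3]          # vertices 0,1 and 3,4 are pre-grouped
W_reduced = reduce_graph(W, groups)  # 4 x 4 instead of 6 x 6
```

Any bipartition that does not split a group has the same normalized cut value on `W` and on `W_reduced`, so the eigensystem can be solved on the smaller matrix.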
Multiscale Approach and Boundary Detection
Since it has been shown in the past that spectral clustering based methods benefit from multi-scale information, we also build affinity matrices for videos spatially downsampled by factor 2 and factor 4. In this case, no pixel pre-grouping is necessary. For all three scales, we solve the eigensystems individually and compute the smallest 20 eigenvalues and the corresponding eigenvectors (compare Fig. 2). Note that these eigenvectors are highly consistent over the temporal dimension. We upsample these vectors to the highest resolution and compute oriented edges, using the standard oriented edge filters in 8 sampled spatial orientations and only one temporal gradient, i.e. there is no mixed spatio-temporal gradient. Depending on the frame rate, using finer orientation sampling would certainly make sense. However, on the VSB100 dataset [11, 28], we found that this simple setup works best. Examples of our extracted boundary estimates are given in Fig. 3. Visually, the estimated boundaries look reasonable and are temporally highly consistent. They form the key to the final hierarchical segmentations.
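The separation into spatial gradients and a single temporal gradient can be sketched as follows. This is a simplification under stated assumptions: it uses plain finite differences instead of the oriented edge filters in 8 orientations, and the function name and array layout `(k, T, H, W)` are hypothetical.

```python
import numpy as np


def boundary_strength(eigvecs):
    """eigvecs has shape (k, T, H, W): k eigenvectors reshaped to the video.

    Returns (spatial, temporal). `spatial` has shape (T, H, W) and holds the
    gradient magnitude within each frame; `temporal` has shape (T - 1, H, W)
    and holds the absolute change between consecutive frames. Both are
    averaged over the k eigenvectors.
    """
    gy = np.gradient(eigvecs, axis=2)
    gx = np.gradient(eigvecs, axis=3)
    spatial = np.sqrt(gy ** 2 + gx ** 2).mean(axis=0)
    temporal = np.abs(np.diff(eigvecs, axis=1)).mean(axis=0)
    return spatial, temporal
```

A region border that stays in place yields a spatial response in every frame but no temporal response, whereas a border that shifts between two frames yields a temporal response at the pixels that change region.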
Evaluation of Temporal Boundaries
On the BVSD dataset, benchmark annotations for occlusion boundaries were provided. This data can be used as a proxy to evaluate our temporal boundaries. Occlusion boundaries are object boundaries that occlude other parts of the scene, as opposed to within-object boundaries. The importance of motion cues for such occlusion boundaries has been pointed out previously. In fact, our temporal boundaries indicate boundaries separating regions within one frame that will undergo occlusion or disocclusion between this frame and the next. Thus, they can only provide part of the necessary information for object boundary detection. To extract this motion cue from our data, we apply a pointwise multiplication of the spatial boundaries at frame $t$ with the temporal boundaries between frame $t$ and its two neighboring frames. Thus, if an object does not move in one of the frames, the respective edges are removed. Examples of the resulting motion boundary estimates are given in Fig. 3.
When we evaluate on the closed boundary annotations of this benchmark with the benchmark parameters from BSDS500, we get a surprisingly low F-measure of 0.34 with the best common threshold for the whole dataset, and 0.41 if we allow individual thresholds per sequence. The reason might be the relatively low spatial localization accuracy of our boundaries. We ran all our experiments on the half-resolution version of the VSB100 benchmark, so that boundary estimates need to be upsampled to evaluate on these annotations.
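The pointwise combination of spatial and temporal boundaries can be written in a few lines. The sketch assumes that the temporal boundaries towards both neighboring frames enter the product, which is one reading of the statement that edges are removed if the object does not move in one of the frames; the function name and array shapes are hypothetical.

```python
import numpy as np


def motion_boundaries(spatial, temporal):
    """spatial: (T, H, W) boundary strength per frame.
    temporal: (T - 1, H, W) boundary strength between consecutive frames.

    Returns (T - 2, H, W) motion boundary estimates for the interior
    frames 1 .. T - 2 (the first and last frame lack one neighbor).
    """
    before = temporal[:-1]  # between frame t - 1 and frame t
    after = temporal[1:]    # between frame t and frame t + 1
    return spatial[1:-1] * before * after
```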
5 Closed Spatio-temporal Contours
Given spatio-temporal boundary estimates, closed regions could be generated by different methods such as region growing, agglomerative clustering, watersheds, or the recently proposed minimum cost lifted multicuts. A mathematically sound and widely used (e.g. in [2, 27, 9, 18, 3]) setup exists for the generation of hierarchical segmentations from boundary probabilities and an initial fine-grained segmentation. The Ultrametric Contour Map defined therein provides a duality between the saliency of a contour and the scale of its disappearance from the hierarchy.
This approach can directly be applied to three-dimensional data. The difference is that the region contours are now two-dimensional surfaces that meet each other in one-dimensional curves or points. Each one-dimensional curve is common to at least three contours. As in the two-dimensional case, every contour separates exactly two regions.
Samples from the resulting Ultrametric Contour Maps can be seen in Fig. 4. The brightness of a contour, displayed in hot color maps, indicates its saliency, i.e. its hierarchical level in the segmentation. Over all frames of the videos, the resulting closed contours have consistent saliency.
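The duality between contour saliency and hierarchy level means that a segmentation at any granularity is obtained by removing all contours below a threshold. The following sketch shows this step on a region adjacency structure with a union-find; the function name and the dictionary representation of contours are assumptions for illustration, and the construction of the saliency values themselves is not covered.

```python
def segment_at_level(n_regions, contours, level):
    """Merge regions across every contour whose saliency is below `level`.

    `contours` maps a pair of adjacent region ids (i, j) to the saliency
    of their common contour. Returns one segment label per region.
    """
    parent = list(range(n_regions))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (i, j), saliency in contours.items():
        if saliency < level:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n_regions)]


# Four regions in a chain; the middle contour is the most salient one.
contours = {(0, 1): 0.1, (1, 2): 0.5, (2, 3): 0.2}
fine = segment_at_level(4, contours, level=0.05)
medium = segment_at_level(4, contours, level=0.3)
coarse = segment_at_level(4, contours, level=0.6)
```

Raising the threshold can only merge segments, never split them, so the segmentations at increasing levels are nested.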
6 Experiments and Results
We compute PMI-based affinity matrices on color and local variance within all frames and between every frame and its successive frame as described in section 3.1 for three different scales (1, 0.5 and 0.25). For scale 1, we employ spectral graph reduction, reducing the number of nodes by a factor of 12-15. At each scale, we solve the eigenvalue problems of the normalized graph Laplacians corresponding to the affinity matrices for overlapping temporal windows with stride 1 to generate the first 20 eigenvectors. The best choice of the temporal window size is not obvious because of the eigenvector leakage problem. In the spectrally reduced graph, spatial leakage is probably low, so we hope for an accordingly low temporal leakage and choose a larger temporal window of size 5, while we solve the eigenvalue problem for smaller temporal windows of length 3 for scales 0.5 and 0.25. The resulting eigenvectors are resampled to the original resolution. The average oriented gradients on these eigenvectors form the multiscale boundary estimates we use. We compare the 3D ultrametric contour maps computed from these boundary estimates to those computed on only the original scale with spectral graph reduction. For the original scale, we also compare to the results we get by solving the eigensystem on the full video without temporal windows.
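The overlapping temporal windows can be enumerated as in the following sketch. The function name and the handling of videos shorter than one window are assumptions; how the eigenvectors of overlapping windows are combined is not specified here and is left out.

```python
def temporal_windows(n_frames, size, stride=1):
    """Return (start, end) frame ranges, end exclusive, of overlapping windows."""
    if n_frames <= size:
        # Assumption: a short video is treated as a single window.
        return [(0, n_frames)]
    return [(s, s + size) for s in range(0, n_frames - size + 1, stride)]


# Window length 3 as used for the downsampled scales, on a 5-frame clip.
windows = temporal_windows(5, 3)
```

With stride 1, a video of `n` frames and window length `s` yields `n - s + 1` eigenproblems, each restricted to the affinities of `s` consecutive frames, instead of one eigenproblem over all `n` frames.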
Table 1: Results on the VSB100 dataset. While the proposed method performs worse than the state of the art in terms of boundary precision and recall (BPR), we outperform all competing methods on the region metric VPR.
| VSB100: general benchmark | BPR ODS | BPR OSS | BPR AP | VPR ODS | VPR OSS | VPR AP |
|---|---|---|---|---|---|---|
| Galasso et al. | 0.52 | 0.56 | 0.44 | 0.45 | 0.51 | 0.42 |
| Grundmann et al. | 0.47 | 0.54 | 0.42 | 0.52 | 0.55 | 0.52 |
| Ochs and Brox | 0.14 | 0.14 | 0.04 | 0.25 | 0.25 | 0.12 |
| Xu et al. | 0.40 | 0.48 | 0.33 | 0.45 | 0.48 | 0.44 |
| Galasso et al. | 0.62 | 0.65 | 0.50 | 0.55 | 0.59 | 0.55 |
| Khoreva et al. SC | 0.64 | 0.70 | 0.61 | 0.63 | 0.66 | 0.63 |
| Segmentation Propagation | | | | | | |
| IS - Arbelaez et al. | 0.61 | 0.65 | 0.61 | 0.26 | 0.27 | 0.16 |
| Oracle & IS - Arbelaez et al. | 0.61 | 0.67 | 0.61 | 0.65 | 0.67 | 0.68 |
| Proposed MS TW | | | | | | |
The VSB100 benchmark consists of 40 train and 60 test sequences with a maximum length of 121 frames. Human segmentation annotations are given for every 20th frame. Two evaluation metrics are relevant in this benchmark: boundary precision and recall (BPR) and volume precision and recall (VPR). The BPR measures the accuracy of the boundary localization per frame. Image segmentation methods usually perform well on this measure, since temporal consistency is not taken into account. The VPR is a region metric. Here, exact boundary localization is less important, while the focus lies on temporal consistency. This is the measure on which we expect to perform well.
| VSB100: motion subtask | BPR ODS | BPR OSS | BPR AP | VPR ODS | VPR OSS | VPR AP |
|---|---|---|---|---|---|---|
| Galasso et al. | 0.34 | 0.43 | 0.23 | 0.42 | 0.46 | 0.36 |
| Ochs and Brox | 0.26 | 0.26 | 0.08 | 0.41 | 0.41 | 0.23 |
| Khoreva et al. | 0.45 | 0.53 | 0.33 | 0.56 | 0.63 | 0.56 |
| Segmentation Propagation | | | | | | |
| IS - Arbelaez et al. | 0.47 | 0.43 | 0.35 | 0.22 | 0.22 | 0.13 |
| Oracle & IS - Arbelaez et al. | 0.47 | 0.34 | 0.35 | 0.59 | 0.60 | 0.60 |
| Proposed MS TW | | | | | | |
The results of our PMI-based video segmentation are given in Fig. 5 in terms of BPR and VPR curves for 51 different levels of segmentation granularity, which is the standard for the VSB100 video segmentation benchmark. In terms of BPR, all our results remain below the state of the art. There can be several reasons for this behavior: (1) Competitive methods that are based on per-image superpixels usually compute these superpixels on the highest possible resolution [10, 22], while we start from the half-resolution version of the benchmark data. (2) The number of eigenvectors we compute might be too low. In our current setup, we compute the first 20 eigenvectors per matrix. The optimal number varies strongly depending on the structure of the data as well as the employed affinities. (3) Most importantly, by the definition of our boundary detection method, we force the boundaries to be temporally consistent. However, temporal consistency is not at all required for this measure. If our boundaries show some amount of temporal smoothness, this might cause them to be slightly shifted in an individual frame.
On the VPR, the proposed method benefits from the temporal consistency it optimizes and outperforms all previous methods in the three aggregate measures: ODS (one global segmentation threshold is chosen for the whole dataset), OSS (the best threshold is chosen per sequence), and average precision (AP). The respective numbers are given in Tab. 1. This supports the conclusion that the proposed segmentations are indeed temporally consistent.
In Fig. 5, we also plot the result we get if we only use boundary estimates from the original resolution without using multiscale information (depicted in black). As expected, the segmentation quality remains below the quality we get with the multiscale approach. However, we did not know a priori what to expect from solving the eigensystem for the entire videos without applying temporal windows (dashed black line). Doing so requires large amounts of memory (in the range of gigabytes) for most sequences. The results remain clearly below those computed with temporal windows. In fact, the eigenvectors we compute on the small temporal windows show high temporal consistency, while those computed on the whole video are subject to temporal leakage, meaning that values in the eigenvectors within a region can change smoothly throughout the sequence, resulting in decreased discrimination power.
The best results in Tab. 1 are those from the oracle case, where the individual image segments are temporally linked based on the ground truth. The numbers indicate the best result one could achieve on the benchmark when starting from the given image segmentation.
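The difference between the two threshold-based aggregate measures can be made explicit with a small sketch. It is a simplification under stated assumptions: it averages per-sequence F-measures, whereas the benchmark code may aggregate precision and recall differently, and the function name and the score layout are hypothetical.

```python
def ods_oss(f_scores):
    """f_scores[s][k] is the F-measure of sequence s at hierarchy level k.

    Returns (ods, oss) under the simplifying assumption that scores are
    averaged over sequences.
    """
    n_seq = len(f_scores)
    n_levels = len(f_scores[0])
    # ODS: one level shared by all sequences, chosen to maximize the mean.
    ods = max(sum(seq[k] for seq in f_scores) / n_seq for k in range(n_levels))
    # OSS: the best level is chosen separately for every sequence.
    oss = sum(max(seq) for seq in f_scores) / n_seq
    return ods, oss


# Two sequences evaluated at three hierarchy levels.
ods, oss = ods_oss([[0.2, 0.6, 0.4], [0.7, 0.5, 0.3]])
```

Since OSS may pick a different level for every sequence, it can never be lower than ODS; a large gap between the two indicates that the saliency values are not comparable across sequences.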
Results on Motion Segmentation
Since our segmentations claim to be temporally consistent even under significant motion in videos, we also performed the evaluation on the motion subtask of VSB100. For this motion subtask, only a subset of the videos, showing significant motion, is evaluated. Non-moving objects within these videos are not taken into account. Results are reported in Tab. 2 and Fig. 6. As for the general benchmark, our results are outperformed by the state of the art on the BPR but improve over the state of the art in terms of temporal consistency, measured by the region metric VPR.
7 Conclusion
We have proposed a method for computing temporally consistent boundaries in videos. To this end, the method builds spatio-temporal affinities based on point-wise mutual information at multiple scales. On the video segmentation benchmark VSB100, the resulting hierarchy of spatio-temporal regions outperforms state-of-the-art methods in terms of temporal consistency as measured by the region metric VPR. We believe that the coarser hierarchy levels can help extract high-level content from videos. The finer hierarchy levels can serve as temporally consistent spatio-temporal superpixels for learning-based video segmentation or action recognition.
We acknowledge funding by the ERC Starting Grant VideoLearn.
- [1] P. Arbelaez. Boundary extraction in natural images using ultrametric contour maps. In CVPR Workshop, 2006.
- [2] P. Arbeláez, M. Maire, C. C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE TPAMI, 33(5):898–916, 2011.
- [3] P. Arbeláez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In CVPR, 2014.
- [4] G. Bertasius, J. Shi, and L. Torresani. Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In CVPR, 2015.
- [5] G. Bertasius, J. Shi, and L. Torresani. High-for-low and low-for-high: Efficient boundary detection from deep object features and its applications to high-level vision. CoRR, abs/1504.06201, 2015.
- [6] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE TPAMI, 33(3):500–513, 2011.
- [7] J. Chang, D. Wei, and J. W. Fisher III. A video representation using temporal superpixels. In CVPR, 2013.
- [8] P. Dollár and C. L. Zitnick. Structured forests for fast edge detection. In ICCV, 2013.
- [9] F. Galasso, R. Cipolla, and B. Schiele. Video segmentation with superpixels. In ACCV, 2012.
- [10] F. Galasso, M. Keuper, T. Brox, and B. Schiele. Spectral graph reduction for efficient image and streaming video segmentation. In CVPR, 2014.
- [11] F. Galasso, N. Nagaraja, T. Cardenas, T. Brox, and B. Schiele. A unified video segmentation benchmark: Annotation, metrics and analysis. In ICCV, 2013.
- [12] R. Girshick. Fast R-CNN. In ICCV, 2015.
- [13] G. Gkioxari, R. Girshick, and J. Malik. Actions and attributes from wholes and parts. In ICCV, 2015.
- [14] G. Gkioxari and J. Malik. Finding action tubes. In CVPR, 2015.
- [15] R. C. Gonzalez and R. Woods. Digital Image Processing, 2nd edition. 2002.
- [16] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph based video segmentation. In CVPR, 2010.
- [17] S. Hallman and C. C. Fowlkes. Oriented edge forests for boundary detection. In CVPR, 2015.
- [18] P. Isola, D. Zoran, D. Krishnan, and E. H. Adelson. Crisp boundary detection using pointwise mutual information. In ECCV, 2014.
- [19] M. Keuper, B. Andres, and T. Brox. Motion trajectory segmentation via minimum cost multicuts. In ICCV, 2015.
- [20] M. Keuper, E. Levinkov, N. Bonneel, G. Lavoue, T. Brox, and B. Andres. Efficient decomposition of image and mesh graphs by lifted multicuts. In ICCV, 2015.
- [21] A. Khoreva, F. Galasso, M. Hein, and B. Schiele. Learning must-link constraints for video segmentation based on spectral clustering. In GCPR, 2014.
- [22] A. Khoreva, F. Galasso, M. Hein, and B. Schiele. Classifier based graph construction for video segmentation. In CVPR, 2015.
- [23] P. Ochs and T. Brox. Object segmentation in video: a hierarchical variational approach for turning point trajectories into dense regions. In ICCV, 2011.
- [24] P. Ochs, J. Malik, and T. Brox. Segmentation of moving objects by long term video analysis. IEEE TPAMI, 36(6):1187–1200, 2014.
- [25] E. Parzen. On estimation of a probability density function and mode. Ann. Math. Statist., 33(3):1065–1076, 1962.
- [26] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- [27] N. Sundaram and K. Keutzer. Long term video segmentation through pixel level spectral clustering on GPUs. In ICCV Workshops, 2011.
- [28] P. Sundberg, T. Brox, M. Maire, P. Arbelaez, and J. Malik. Occlusion boundary detection and figure/ground assignment from optical flow. In CVPR, 2011.
- [29] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE TPAMI, 13:583–598, 1991.
- [30] S. Xie and Z. Tu. Holistically-nested edge detection. CoRR, abs/1504.06375, 2015.
- [31] C. Xu and J. Corso. Evaluation of super-voxel methods for early video processing. In CVPR, 2012.
- [32] C. Xu, S. Whitt, and J. Corso. Flattening supervoxel hierarchies by the uniform entropy slice. In ICCV, 2013.
- [33] C. Xu, C. Xiong, and J. Corso. Streaming hierarchical video segmentation. In ECCV, 2012.
- [34] S. Yi and V. Pavlovic. Multi-cue structure preserving MRF for unconstrained video segmentation. CoRR, abs/1506.09124, 2015.