Video segmentation is an important pre-processing step for many high-level video applications such as action recognition [2] or 3D reconstruction. A more compact representation not only reduces the space and time requirements of subsequent processing, but also provides sets of visual segments that contain meaningful cues for higher-level computer vision tasks. However, generating supervoxels from videos is significantly more difficult than superpixel segmentation of images, due to the heavy computational cost and the extra temporal dimension. Well-delineated spatio-temporal video segments can be used for tracking bounded regions, foreground moving objects, or semantic understanding. For example, locating the movement of hands is helpful for gesture or action recognition, and separating foreground from background can pinpoint the region-of-interest for detecting moving objects. These spatio-temporal segments should therefore be temporally consistent in order to benefit such computer vision tasks.
For video segmentations that are initialized from superpixels, the main goal is to consider the connections between neighboring superpixels and to decide which ones belong to the same spatio-temporal cluster. The connections are usually represented as a spatio-temporal graph, where the nodes are the superpixels and the edges connect superpixels that are adjacent to each other. The edges are weighted based on the similarity distances between pairs of superpixels. Previous work [4, 5] proposed a variety of features corresponding to a wide range of low- and mid-level image cues from superpixels. For example, the within-frame similarities were computed from boundary magnitude, color, texture, and shape, and the temporal connections were defined by the direction of optical flow or motion trajectories. Importantly, the aforementioned features that were used for video segmentation encode only local information, extracted from within each superpixel. One would expect improved performance when combining local and global features, if the appropriate global features per superpixel were extracted.
The geodesic distance has been shown to be effective for image segmentation problems [9, 10], but its applications in the video domain have been limited [11, 10, 12, 13]. In this work, we propose a complete methodology for the use of geodesic distance histogram features in the video segmentation problem. The histogram feature describes the superpixel-of-interest by the distribution of the geodesic distances from it to all other superpixels in the same frame. The representation compactly encodes global similarity relations between segments. Thus, we want to use per-frame geodesic distance information to associate superpixels both within and across frames. However, the nature of this global representation poses several challenges that need to be addressed in order to successfully use geodesic distance histograms for video segmentation:
-  The feature needs to be robust across frames in order to support useful superpixel association. That is, if a superpixel has a unique representation in one frame, its representation in the next frame should also be unique, in order to facilitate matching.
-  For relatively small segments, their similar relationship to the global context can dwarf distinctive neighborhood information, which can make them hard to differentiate.
-  The feature does not encode any spatial relationships between segments. Such relationships often offer constraints that allow otherwise similar segments to be distinguished from each other.
In this paper, we address these issues in order to derive a geodesic histogram feature that is appropriate for video segmentation tasks. In essence, we introduce the necessary local information into the global representation, in order to disambiguate associations across frames. For a given superpixel, we first extract the soft boundary map of the frame it belongs to, and then compute geodesic distances from the superpixel-of-interest to all other superpixels in the same frame using the boundary scores. If we were performing per-frame segmentation, a 1D histogram of these scores would suffice. However, due to motion, this 1D histogram is not robust across frames. As observed previously, a 2D joint histogram of intensity and geodesic distance is much more robust. To encode more spatial information into the feature, we compute multiple geodesic histograms in a spatial pyramid. Finally, we weight the bins with respect to their spatial distance from the superpixel-of-interest, in order to favor potentially discriminative neighborhood information. We show in experiments that when we add our complete geodesic histogram feature into existing frameworks, the resulting segmentations are greatly improved, especially in 3D segmentation accuracy and temporal consistency. The feature is also fast to compute, and does not significantly increase the processing time of the existing frameworks. The geodesic histogram features are added into two state-of-the-art video segmentation frameworks that are based on superpixel clustering, and tested on two popular datasets using standard 3D segmentation benchmarks.
The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 discusses the motivation, computation, and analysis of the proposed geodesic histogram features. Implementation details are described in Section 3.4. Section 4 presents the experimental results. Section 5 concludes the paper and discusses other possible applications.
2 Related Work
Many video segmentation works propose diverse features to capture various kinds of information, in order to estimate the similarity between the components of the video. Appearance can be represented by features based on color [5, 15], texture, and soft boundaries. Motion-related features have also been utilized often, including short-term motion features based on optical flow [18, 19] and long-term motion features based on trajectories [20, 21, 22, 23]. Superpixel shape has been used to compute the similarities among superpixels across frames. Some works discuss the choice of features to use, as well as methods to incorporate various kinds of features into affinity matrices.
Geodesic distances provide appearance-based similarity estimates, and have been applied widely to segmentation-related problems on images [9, 13, 10]. A feature based on geodesic distance for matching images of deformed objects has been introduced previously: the authors showed that the geodesic distance can be invariant to object deformations, by encoding each pixel as a color histogram over the surrounding pixels that share the same geodesic distance. The geodesic distance has also been used to propose object segments on images, based on the correlation between object boundaries and changes in the geodesic distance transform. Several video segmentation methods have employed the geodesic distance for various purposes. One salient object segmentation framework uses the geodesic distance in each frame to estimate the objectness of superpixels on a per-frame basis. Later work proposes a spatio-temporal geodesic distance that extends image segmentation to video segmentation. However, the proposed spatio-temporal distance has to be constrained to be temporally non-decreasing to preserve the metric property, thus limiting the robustness of the method.
In this paper, we propose a feature based on the geodesic distance to estimate the similarity between the superpixels in a video. We consider the frame-wise distribution of the geodesic distances, i.e., the histogram of geodesic distances from each superpixel to all other superpixels in the same frame. This representation compactly encodes the relative similarity distances between the segment containing the superpixel-of-interest and all the other segments in the frame. This global information therefore serves as a complement to the set of appearance, motion, and shape-based features, which only encode information from the inner region of the superpixel-of-interest.
3 Geodesic Distance Histogram Feature
Given a frame of the video, let S = {s_1, s_2, ..., s_N} be the set of superpixels of that frame. The frame is then represented by a non-negative, undirected, weighted graph G = (S, E), where each edge in E is associated with a pair of neighboring superpixels in S, and the edge weight is computed as the boundary strength between the two superpixels. The geodesic distance between any two superpixels is defined as the weight of the shortest path between them in G.
Given a superpixel s_i on a frame, the geodesic distances between s_i and all other superpixels in the same frame are computed and pooled into a geodesic distance histogram. This histogram captures the global information of the frame with respect to s_i in terms of the geodesic distance distribution, and can be used for computing pairwise superpixel similarity both within and across frames.
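Concretely, the per-frame graph construction and shortest-path computation described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the edge-dictionary input format and the function name are our own.

```python
import heapq

def geodesic_distances(n_superpixels, edges, source):
    """Single-source geodesic distances on a superpixel adjacency graph.

    `edges` maps a pair (i, j) of neighboring superpixels to a non-negative
    boundary-strength weight (an input format we assume for illustration).
    """
    # Build an undirected adjacency list from the weighted edges.
    adj = [[] for _ in range(n_superpixels)]
    for (i, j), w in edges.items():
        adj[i].append((j, w))
        adj[j].append((i, w))

    # Dijkstra's algorithm: the geodesic distance to each superpixel is the
    # minimal accumulated boundary strength along any path from `source`.
    dist = [float("inf")] * n_superpixels
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist
```

A path crossing weak boundaries accumulates little cost, so superpixels inside the same homogeneous region end up geodesically close to each other.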
3.1 1D Geodesic Distance Histogram.
The simplest approach is to use a 1D histogram to describe the distribution of the geodesic distances, where each bin of the histogram counts the number of superpixels within a particular geodesic distance interval. This is similar to the concept of critical level sets, where each critical level defines a group of superpixels whose geodesic distances are less than a certain threshold. Each bin of the histogram is then associated with a region in the image.
In order to keep our feature relatively constant across frames, the value of each bin should stay approximately the same, which means that the regions associated with each bin must also remain relatively stable. Consider the superpixel (in red) shown in Fig. 2(a); two regions corresponding to the first two bins of the histogram are visualized in Fig. 2(b). The first bin collects the votes of all superpixels within the lowest geodesic distance interval, forming the region indicated by the leftmost arrow. However, the region corresponding to the second bin is a combination of superpixels from different semantic regions. The value of the second bin is therefore not robust, since these regions could move in different ways and end up voting for different bins in subsequent frames.
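As a sketch, pooling the distances of all superpixels into a normalized 1D histogram could look like the following (the 9-bin setting follows Sec. 3.4; normalizing the range by the maximum distance is our assumption):

```python
import numpy as np

def geodesic_histogram_1d(distances, n_bins=9):
    """Normalized 1D histogram of geodesic distances from the
    superpixel-of-interest to all other superpixels in the frame."""
    d = np.asarray(distances, dtype=float)
    d_max = d.max() if d.max() > 0 else 1.0  # assumed normalization cap
    hist, _ = np.histogram(d, bins=n_bins, range=(0.0, d_max))
    return hist / hist.sum()  # normalize so frames are comparable
```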
3.2 2D Intensity-Geodesic Distance Histogram.
Fig. 2: (a) The superpixel-of-interest; (b) 1D histogram; (c) 2D histogram.
We incorporate the intensity feature as an additional cue to complement the geodesic distance, in order to constrain bins to correspond to individual regions instead of disparate groups of regions. The histogram thus becomes a 2D table, where each cell is voted for by the superpixels that have a particular pair of geodesic distance and intensity values. The joint distribution of intensity and geodesic distance was originally proposed for deformation-invariant image matching, where it was shown to be stable and informative under a wide range of deformations.
Fig. 2(c) visualizes the intensity-geodesic distance histogram of a superpixel-of-interest (shown in red in Fig. 2(a)). Notice that the second bin of the 1D histogram equals the sum of all cells in the second row of the 2D histogram, and the region from the second bin of the 1D histogram is now separated into multiple smaller regions corresponding to these cells. This is a desired effect: unlike in the 1D case, each cell of the 2D histogram contains superpixels from a single semantic region. We also visualize the cell with the highest value in Fig. 2(c), which corresponds to the superpixels within the entire grass field. Such a region is likely to remain stable and connected across frames. This implies that as long as the intermediate boundaries remain the same, these regions will keep contributing to the same cells of the histogram.
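A minimal sketch of the joint histogram, assuming each superpixel is summarized by a mean intensity in [0, 1] (the 13x9 bin setting follows Sec. 3.4; the function name is ours):

```python
import numpy as np

def geodesic_histogram_2d(intensities, distances, n_int=13, n_geo=9):
    """Joint intensity-geodesic distance histogram: each superpixel votes
    into the cell indexed by its mean intensity and its geodesic distance
    from the superpixel-of-interest."""
    inten = np.asarray(intensities, dtype=float)
    dist = np.asarray(distances, dtype=float)
    d_max = dist.max() if dist.max() > 0 else 1.0
    hist, _, _ = np.histogram2d(
        inten, dist, bins=[n_int, n_geo], range=[[0.0, 1.0], [0.0, d_max]])
    return hist / hist.sum()  # rows: intensity bins, columns: geodesic bins
```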
To compute the similarity distance between two histograms, we can use the chi-squared distance or the Earth Mover's Distance. The chi-squared distance between two 2D histograms h1 and h2 of size m x n is defined by:

chi2(h1, h2) = (1/2) * sum_{i=1..m} sum_{j=1..n} (h1(i,j) - h2(i,j))^2 / (h1(i,j) + h2(i,j))
The Earth Mover’s Distance (EMD) is computed as the sum of the 1D EMDs at each intensity bin of the 2D histogram.
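Both distances can be sketched as below; for the EMD we use the cumulative-difference form of the 1D EMD per intensity row, which assumes each pair of compared rows carries equal mass (rows are assumed to index intensity, columns geodesic distance):

```python
import numpy as np

def chi_square(h1, h2, eps=1e-12):
    """Chi-squared distance between two normalized histograms (any shape)."""
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def emd_2d(h1, h2):
    """Sum of 1D Earth Mover's Distances, one per intensity bin (row).

    For 1D histograms of equal mass, the EMD equals the L1 norm of the
    difference of their cumulative sums.
    """
    h1, h2 = np.asarray(h1, float), np.asarray(h2, float)
    cum_diff = np.cumsum(h1 - h2, axis=1)  # cumulate along the geodesic axis
    return np.abs(cum_diff).sum()
```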
Fig. 3 visualizes the similarity values computed from the 1D and 2D feature histograms of the superpixel-of-interest in Fig. 2(a) on a later video frame. In the color scheme, higher similarity is represented by warmer colors. The figure shows that the 1D histogram is less robust than the 2D histogram: multiple regions have 1D histograms similar to that of the superpixel-of-interest, and the superpixel with the highest 1D similarity is in the background. In contrast, the superpixel with the highest similarity under the 2D histogram falls within the same upper-body region, a desirable result.
3.3 Spatial Information
Pooling methods such as histograms discard spatial information, e.g., image distance relationships or local neighborhood patterns. We encode spatial cues in two ways: 1) by embedding spatial distances into the voting weight of each superpixel, and 2) by adopting a commonly used spatial pyramid scheme.
3.3.1 Spatial distance voting weight
For a given superpixel s_i, its histogram feature is constructed from its intensity and its geodesic distances to all other superpixels in the same frame. To take the spatial locations of these other superpixels into account, their votes are weighted by their spatial distance to s_i. In particular, the voting weight of superpixel s_j toward the histogram bins of superpixel s_i is defined by:

w_{s_i}(s_j) = A(s_j) * exp(-lambda * d_E(s_i, s_j))

where A(s_j) is the area of s_j and d_E(s_i, s_j) is the Euclidean distance between the two superpixels' center locations.
The area component normalizes the influence of superpixels of different sizes. The exponential ensures that nearby superpixels contribute more to the geodesic histogram of the superpixel-of-interest. This is especially helpful for superpixels that belong to smaller segments, for which most other superpixels have large geodesic distances that would otherwise dominate the histogram; without the weighting, two small regions that are locally different would have very similar histograms. The parameter of the exponential controls the trade-off between global and local information.
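One plausible form of this weight is sketched below; the exponential-decay parameterization and the parameter name lam are our assumptions (setting lam to 0 disables the spatial weighting and recovers the plain histogram):

```python
import numpy as np

def voting_weight(area_j, center_i, center_j, lam=0.02):
    """Spatial voting weight of superpixel s_j toward the histogram of s_i:
    proportional to the area of s_j (normalizing for superpixel size) and
    decaying exponentially with the distance between superpixel centers."""
    d = float(np.linalg.norm(np.asarray(center_i, float) -
                             np.asarray(center_j, float)))
    return area_j * float(np.exp(-lam * d))
```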
3.3.2 Spatial pyramid histogram
Inspired by the popularity of spatial pyramids, we incorporate the pyramid scheme into the construction of our feature histogram to encode more spatial information into the features. We implement two scales of the spatial pyramid: a 1x1 and a 2x2 grid over a given frame. A histogram is extracted from each cell of the grid, and the histograms from all cells are then concatenated.
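The two-level pyramid can be sketched as follows; hist_fn stands in for any per-cell histogram extractor and is a name we introduce for illustration:

```python
import numpy as np

def pyramid_histograms(centers, width, height, hist_fn):
    """Concatenate histograms over a 1x1 and a 2x2 grid of a frame.

    `centers` is an (N, 2) array of superpixel (x, y) centers; `hist_fn`
    maps an array of superpixel indices to a fixed-length histogram.
    """
    centers = np.asarray(centers, dtype=float)
    parts = [hist_fn(np.arange(len(centers)))]  # 1x1 scale: the whole frame
    for gx in range(2):                          # 2x2 scale: four cells
        for gy in range(2):
            in_cell = ((centers[:, 0] // (width / 2.0) == gx) &
                       (centers[:, 1] // (height / 2.0) == gy))
            parts.append(hist_fn(np.flatnonzero(in_cell)))
    return np.concatenate(parts)
```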
3.4 Implementation Details
Our features are constructed from the intensity and boundary probability maps. For more robust boundary extraction, we also experiment with two different boundary map methods: spatial edge maps computed with structured forests, and motion boundary maps.
Given the combined edge map and the superpixel graph, the geodesic distance feature of each superpixel is computed using Dijkstra's algorithm, with the cost of a path being the accumulated boundary scores between one superpixel and another.
We empirically set the intensity dimension of the feature histogram at 13 bins, and the geodesic dimension at 9 bins.
4 Experimental Results

In this section, we describe our experiments using the geodesic histogram features for video segmentation. We incorporated our features into two existing frameworks that are based on different clustering algorithms: spectral clustering and parametric graph partitioning. Spectral clustering performs dimensionality reduction on an affinity matrix based on its eigenvalues, while parametric graph partitioning performs the clustering directly on the superpixel graph by modeling affinity matrices probabilistically. In addition, the spectral clustering framework generates coarse-to-fine hierarchical segmentation results, while parametric graph partitioning outputs a single level of segmentation.
The experiments were conducted on the Segtrack v2 and Chen's Xiph.org datasets, which cover a wide range of scenarios for evaluating video segmentation algorithms. We evaluate our segmentation results using established supervoxel metrics, including 3D Accuracy (AC), 3D Under-segmentation Error (UE), 3D Boundary Recall (BR), and 3D Boundary Precision (BP). All experiments were conducted with the exact same set of initial superpixels and other parameter settings.
4.1 Video Segmentation Using Spectral Clustering
We first evaluate the performance obtained by adding our feature to the spectral clustering framework. We use the same six features as the original work: short-term temporal, long-term temporal, spatio-temporal appearance, spatio-temporal motion, across-boundary appearance, and across-boundary motion. The affinity matrix is computed by combining the six affinity matrices computed from each feature. We combined this original affinity matrix with the one derived from our geodesic histogram features, in order to preserve the algorithm settings and superpixel configurations. The similarity distances based on our features were computed using the chi-squared distance.
Fig. 4 shows the evaluation results of spectral clustering with and without our feature on the Segtrack v2 and Chen Xiph.org datasets. We tested four settings of our feature: (i) a 2D histogram using only spatial edge maps to compute geodesic distances and without the spatial distance voting weight (2D - 0); (ii) a 2D histogram using spatial edge maps with the spatial distance voting weight parameter set to 0.02 (2D - 0.02); (iii) a 2D histogram using both spatial edge and motion boundary maps (2D + 0.02); and (iv) 2D histograms with the spatial pyramid (2D + 0.02 sp). Compared to the baseline, our feature significantly improved segmentation performance. The improvement was most significant in 3D accuracy, which increased by 5% for Segtrack v2 and 10% for Chen Xiph.org. For the Segtrack v2 dataset, our feature improved the segmentation results on all four metrics. For the Chen Xiph.org dataset, the feature gave a strong boost to 3D accuracy and 3D boundary precision. Across all tested settings, we noticed that motion boundary maps did not affect performance much. Given that motion boundary map generation requires optical flow computation, which can be time consuming, omitting it may lead to faster implementations. The spatial distance voting weights had a strong impact on the results and clearly improved segmentation.
In addition to these improvements, Fig. 6 shows that the average temporal length of supervoxels consistently increased for all parameter settings of our feature, by 10% for the Segtrack v2 dataset and 5% for the Chen Xiph.org dataset, showing that the segmentation results acquired better temporal consistency. Having both longer supervoxels and improved segmentation metrics indicates that our feature provides additional information for more reliable temporal consistency. This is significant, since connecting more corresponding superpixels temporally is a crucial and challenging part of the video segmentation task.
An interesting qualitative example is shown in Fig. 7, which presents the segmentation results for the video “soldier” with only two clusters. The second row visualizes the two clusters generated by the baseline using the six predefined features with only local information, which capture only the lower leg of the moving soldier. In contrast, the segmentation results improved with the addition of our geodesic feature: the global information encoded by our feature appears to give the spectral clustering algorithm better evidence for segmenting the main object out of the background. Another qualitative example is shown in the 4th and 5th rows of Fig. 1. The baseline segmentation shown in the 4th row exhibits some under-segmentation over the main moving object. This issue, however, is less pronounced with our feature.
4.2 Video Segmentation Using Parametric Graph Partitioning.
Parametric Graph Partitioning (PGP) is a recent graph-based unsupervised method that generates a single level of video segmentation. The method models edge weights by a mixture of Weibull distributions, and requires a norm-based similarity distance to be used. Therefore, we conduct the experiments in this section using the Earth Mover's Distance, as in the original work. The baseline is the originally proposed setting, which uses four feature types: intensity, the hue channel of the HSV color space, the AB components of the LAB color space, and gradient orientation. We did not use the motion feature, since it did not contribute significantly to PGP performance, as suggested in the original paper.
Tables 1 and 2 report the quantitative evaluation of PGP with and without our feature on the two datasets. We evaluated the 1D histogram feature on the Chen Xiph.org dataset, shown in Table 2. While PGP with the 1D feature outperforms the baseline in general, the benchmarks of 3 out of 8 videos decreased. On the other hand, the 2D feature significantly improved the segmentation performance of PGP. For the Segtrack v2 dataset, quantitative results in Table 1 show clear improvements of our feature for PGP, as well as the additional benefits from the spatial pyramid configuration.
Two example cases for PGP are shown in Fig. 8, and in the 2nd and 3rd rows of Fig. 1. In the over-segmented scenario of Fig. 1, the water was unfavorably divided into many spurious segments by the PGP baseline. Adding our feature not only helped merge the background into one segment, but also enhanced temporal consistency and boundary awareness. Given the under-segmented baseline result on the lower part of the tree shown in Fig. 8, our feature helped to segment the entire tree and also reduced over-segmentation in other parts of the video.
4.3 Feature Extraction Running Time
All experiments were conducted on an Intel Core i7 CPU at 3.5 GHz with 16 GB of memory. When adding our feature into the spectral clustering framework, the average running time increased by 67 seconds on an 85-frame video using the default parameter settings, which is just a small fraction of the total running time of several hours. The additional running time for the PGP framework was on average 48 seconds, with 300 initial superpixels per frame. These results show that the computational cost of our feature is low, and adds very little overhead to existing frameworks.
5 Conclusion

In this paper, we introduced a novel feature for video segmentation based on geodesic distance histograms. The histogram is computed as a spatially-organized distribution of accumulated boundary costs between superpixels, which is a representation that includes more global information than conventional features. We validated the efficacy of our feature by adding it into two recent frameworks for video segmentation using spectral clustering and parametric graph partitioning, and showed that the proposed feature improved the performance of both frameworks in 3D video segmentation benchmarks, as well as the temporal consistency of the resulting supervoxels. We believe that the encoded global information can be further applied to other video related tasks such as moving object tracking, object proposals, and foreground background segmentation.
Acknowledgement. Partially supported by the Vietnam Education Foundation, NSF IIS-1161876, FRA DTFR5315C00011, the Stony Brook SensonCAT, the SubSample project from the DIGITEO Institute, France, and a gift from Adobe Corporation.
References

-  Taralova, E.H., De la Torre, F., Hebert, M.: Motion Words for Videos. In: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I. Springer International Publishing, Cham (2014) 725–740
-  Jain, A., Chatterjee, S., Vidal, R.: Coarse-to-fine semantic video segmentation using supervoxel trees. In: ICCV, IEEE Computer Society (2013) 1865–1872
-  Kundu, A., Li, Y., Dellaert, F., Li, F., Rehg, J.M.: Joint Semantic Segmentation and 3D Reconstruction from Monocular Video. In: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI. Springer International Publishing, Cham (2014) 703–718
-  Khoreva, A., Galasso, F., Hein, M., Schiele, B.: Classifier based graph construction for video segmentation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015) 951–960
-  Grundmann, M., Kwatra, V., Han, M., Essa, I.: Efficient hierarchical graph-based video segmentation. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. (2010) 2141–2148
-  Li, F., Kim, T., Humayun, A., Tsai, D., Rehg, J.M.: Video segmentation by tracking many figure-ground segments. In: 2013 IEEE International Conference on Computer Vision. (2013) 2192–2199
-  Yu, C.P., Le, H., Zelinsky, G., Samaras, D.: Efficient video segmentation using parametric graph partitioning. In: The IEEE International Conference on Computer Vision (ICCV). (2015)
-  Galasso, F., Cipolla, R., Schiele, B.: Video Segmentation with Superpixels. In: Computer Vision – ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part I. Springer Berlin Heidelberg, Berlin, Heidelberg (2013) 760–774
-  Krähenbühl, P., Koltun, V.: Geodesic Object Proposals. In: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V. Springer International Publishing, Cham (2014) 725–739
-  Bai, X., Sapiro, G.: A geodesic framework for fast interactive image and video segmentation and matting. In: 2007 IEEE 11th International Conference on Computer Vision. (2007) 1–8
-  Wang, W., Shen, J., Porikli, F.: Saliency-aware geodesic video object segmentation. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015) 3395–3402
-  Price, B.L., Morse, B., Cohen, S.: Geodesic graph cut for interactive image segmentation. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. (2010) 3161–3168
-  Ling, H., Jacobs, D.W.: Deformation invariant image matching. In: Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1. Volume 2. (2005) 1466–1473 Vol. 2
-  Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). Volume 2. (2006) 2169–2178
-  Cheng, H.T., Ahuja, N.: Exploiting nonlocal spatiotemporal structure for video segmentation. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. (2012) 741–748
-  Leung, T., Malik, J.: Representing and recognizing the visual appearance of materials using three-dimensional textons. Int. J. Comput. Vision 43 (2001) 29–44
-  Galasso, F., Keuper, M., Brox, T., Schiele, B.: Spectral graph reduction for efficient image and streaming video segmentation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2014)
-  Galasso, F., Iwasaki, M., Nobori, K., Cipolla, R.: Spatio-temporal clustering of probabilistic region trajectories. In Metaxas, D.N., Quan, L., Sanfeliu, A., Gool, L.J.V., eds.: ICCV, IEEE Computer Society (2011) 1738–1745
-  Tsai, Y.H., Yang, M.H., Black, M.J.: Video segmentation via object flow. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
-  Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: European Conference on Computer Vision (ECCV). Lecture Notes in Computer Science, Springer (2010)
-  Lezama, J., Alahari, K., Sivic, J., Laptev, I.: Track to the future: Spatio-temporal video segmentation with long-range motion cues. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. (2011)
-  Palou, G., Salembier, P.: Hierarchical video representation with trajectory binary partition tree. In: Computer Vision and Pattern Recognition (CVPR), Portland, Oregon (2013)
-  Brox, T., Malik, J.: Object segmentation by long term analysis of point trajectories. In: Proceedings of the 11th European Conference on Computer Vision: Part V. ECCV’10, Berlin, Heidelberg, Springer-Verlag (2010) 282–295
-  Chen, A.Y.C., Corso, J.J.: Propagating multi-class pixel labels throughout video frames. In: Image Processing Workshop (WNYIPW), 2010 Western New York. (2010) 14–17
-  Dollár, P., Zitnick, C.L.: Structured forests for fast edge detection. In: ICCV, International Conference on Computer Vision (2013)
-  Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: Learning to detect motion boundaries. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)
-  Xu, C., Corso, J.J.: Evaluation of super-voxel methods for early video processing. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. (2012) 1202–1209