Actor-Action Semantic Segmentation with Grouping Process Models

12/30/2015 ∙ by Chenliang Xu, et al. ∙ University of Michigan

Actor-action semantic segmentation marks an important step toward advanced video understanding: what action is happening, who is performing the action, and where the action is happening in space-time. Current models for this problem are local, based on layered CRFs, and are unable to capture long-range interactions among video parts. We propose a new model that combines these local labeling CRFs with a hierarchical supervoxel decomposition. The supervoxels provide cues for possible groupings of nodes, at various scales, in the CRFs to encourage adaptive, high-order groups for more effective labeling. Our model is dynamic and continuously exchanges information during inference: the local CRFs influence what supervoxels in the hierarchy are active, and these active nodes influence the connectivity in the CRF; we hence call it a grouping process model. Experimental results on a recent large-scale video dataset show that our model outperforms the state of the art by a large margin (a 60% relative improvement in per-class accuracy), demonstrating the effectiveness of the dynamic, bidirectional flow between labeling and grouping.




1 Introduction

Advances in modern high-level computer vision have helped usher in a new era of capable, perceptive physical platforms, such as automated vehicles. As the performance of these systems improves, the expectations of their capabilities and tasks will also increase, commensurately, with platforms moving from the highways into our homes, for example. The need for these platforms to understand not only

what action is happening, but also who is doing the action and where the action is happening, will be increasingly critical to extracting semantics from video and, ultimately, to interacting with humans in our complex world. For example, a home kitchen robot must distinguish and locate adult-eating, dog-eating and baby-crying in order to decide how to prepare and when to serve food.

Despite recent successes on aspects of this problem, such as action recognition [9, 30, 33, 15, 19, 37, 38], action segmentation [22, 11], and video object segmentation [8, 20, 46, 28, 21, 26], the collective problem had not been codified until [40], which posed a new actor-action semantic segmentation task on a large-scale YouTube video dataset called A2D. This dataset contains seven classes of actors, including both articulated (e.g. baby, cat and dog) and rigid (e.g. car and ball) ones, and eight classes of actions (e.g. flying, walking and running). The task is to label each pixel in a video with a pair of actor and action labels or a null actor/action; one third of A2D videos contain multiple actors and actions.

Figure 1: Grouping Process Model with Bidirectional Inference. The local CRF at the segment-level starts with a coarse video labeling to influence what supervoxels in a hierarchy are active (grouping cue). The active supervoxels, in turn, influence the connectivity in the CRF, thus refining the labels (labeling cue). This process is dynamic and continuous. The left side shows an example video with its segment-level segmentation and its iteratively refined labels. The right side shows a supervoxel hierarchy and its active nodes.

This task is challenging—their benchmarked leading method, the trilayer model, only achieves a 26.46% per-class pixel-level accuracy for joint actor-action labeling. This model builds a large three-layer CRF on video supervoxels, where random variables of actor, actor-action, and action labels are defined on each layer, respectively, and connects layers with two sets of potential functions that capture conditional probabilities (e.g. the conditional distribution of action given a specific actor class). Although their model accounts for the interplay of actors and actions, the interactions of the two sets of labels are restricted to the local CRF neighborhoods, which, based on the low absolute performance they achieve, is insufficient to solve this unique actor-action problem for three reasons.

First, we believe the pixel-level model must be married to a secondary process that captures instance-level or video-level global information, such as action recognition, in order to properly model the actions. Lessons learned from images strongly support this argument—the performance of semantic image segmentation on the MSRC dataset seemed to hit a plateau [31] until information from secondary processes, such as context [16, 25], object detectors [17] and a holistic scene model [43], was added. However, to the best of our knowledge, no method in video semantic segmentation directly leverages the recent success in action recognition.

Second, these two sets of labels, actors and actions, exist at different levels of granularities. For example, suppose we want to label adult-clapping in a video. The actor, adult, can probably be recognized and labeled by looking only at the lower body, e.g. legs. However, in order to recognize and label the clapping action, we have to either localize the acting parts of the human body or simply look at the whole actor body for recognizing the action.

Third, actors and actions have different orientations along the space and time dimensions of a video. Actors are more space-oriented—they can be fairly well labeled using only still images, as in semantic image segmentation [32, 43], whereas actions are more space-time-oriented. Although one can possibly identify actions from still images alone [42], there are strong distinctions between different actions along the time dimension. For example, running is faster and has more repeated motion patterns than walking for a given duration; and walking performed by a baby is very different from walking performed by an adult, although the two may easily confuse a spatially trained object detector, such as DPM [7], without more complex spatiotemporal modeling.

Our method overcomes the above limitations in two ways: (1) we propose a novel grouping process model (GPM) that adaptively groups segments together during inference, and (2) we incorporate video-level recognition into segment-level labeling through multi-label labeling costs and the grouping process model. The GPM is a dynamic and continuous process of information exchange between the labeling CRFs and a supervoxel hierarchy. The supervoxel hierarchy provides a rich multi-resolution decomposition of the video content, where object parts, deformations, identities and actions are retained in space-time supervoxels across multiple levels of the hierarchy [41, 11, 27]. Rather than using object and action proposals as separate processes, we directly localize the actor and action nodes in a supervoxel hierarchy via the labeling CRFs. During inference, the labeling CRFs influence what supervoxels in the hierarchy are active, and these active supervoxels, in turn, influence the connectivity in the CRF, thus refining the labels.

This bidirectional inference for GPM is dynamic and iterative, as shown in Fig. 1, and can be efficiently solved by graph cuts and binary linear programming. We show that the GPM can be effectively combined with video-level recognition signals to efficiently influence the actor-action labelings in video segmentation. Throughout the entire inference process, the actor and action labels exchange information at various levels of the supervoxel hierarchy, such that the multi-resolution and space-time orientations of the two sets of labels are explicitly explored in our model.

We conduct thorough experiments on the large-scale actor-action video dataset (A2D) [40]. We compare the proposed method to the previous benchmarked leading method, the trilayer model, as well as two leading semantic segmentation methods [16, 14] that we have extended to the actor-action problem. The experimental results show that our proposed method, which is driven by the grouping process model, outperforms the second best method by a large margin of 17% per-class accuracy (60% relative improvement) and over 10% global pixel accuracy, which demonstrates the effectiveness of our modeling.

2 Related Work

Our paper is closely related to Xu et al. [40], where the actor-action semantic segmentation problem was first proposed. Their paper demonstrates that inference jointly over actors and actions outperforms inference independently over them. They propose a trilayer model that achieves state-of-the-art performance on the actor-action semantic segmentation problem. However, their model only captures the interactions of actors and actions in a local CRF pairwise neighborhood, whereas our method considers their interplay at various levels of granularity in space and time introduced by a supervoxel hierarchy.

Supervoxels have demonstrated potential to capture object boundaries, follow object parts over time [39], and localize objects and actions [11, 27]. Supervoxels are used as higher-order potentials for human action segmentation [22] and video object segmentation [12]. Different from the above works, we use a supervoxel hierarchy to connect bottom-up pixel labeling and top-down recognition, where supervoxels contain clear actor-action semantic meaning. We also use the tree slice concept for selecting supervoxels in a hierarchy as in [41], but the difference is that our model selects the tree slices in an iterative fashion, where the tree slice also modifies the pixel-level groupings.

Our work also differs from emerging works in action localization, action detection, and video object segmentation in two ways. First, our segmentation carries clear semantic meaning in its actor and action labels, whereas most existing works in action localization and detection do not [11, 24, 36, 18]. Second, we consider multiple actors performing actions in a video and explicitly model the types of actors, whereas existing works assume a single human actor [23, 44, 34, 45] or do not model the types of actors at all [8, 20, 47, 46]. Although there have been some works on action detection [34], this remains an open challenge.

We relate our work to AHRF [16] and FCRF [14] in Section 4 after presenting the new model.

3 Grouping Process Model

Grouping Process Model (GPM) is a dynamic and continuous process of information exchange during inference: the local CRF influences what supervoxels in a hierarchy are active, and these active supervoxels, in turn, influence the connectivity in the CRF. Here, we give its general form, and Fig. 1 shows an overview. We define the detailed potentials adapted to the actor-action problem in Sec. 5.

Segment-level. Without loss of generality, we define 𝒱 as a video with n voxels, or a video segmentation with n segments. A graph 𝒢 is defined over the entire video, where the neighborhood structure of the graph is induced by the connectivities in the voxel lattice or the segmentation graph over space-time. We define a set of random variables x = {x_i}, where the subscript i corresponds to a node in 𝒢 and each x_i takes a label from a label set ℒ. The GPM is inherently a labeling CRF, but it leverages a supervoxel hierarchy to dynamically adjust its non-local grouping structure.

Supervoxel Hierarchy. Given a coarse-to-fine supervoxel hierarchy generated by a hierarchical video segmentation method, such as GBH [10], we extract a supervoxel tree, denoted 𝒯 with m total nodes (we add one virtual node as root to make it a tree if the segmentation at the coarsest level contains more than one segment), by ensuring that each supervoxel at a finer-level segmentation has one and only one parent at its coarser level (Sec. 6 details the tree extraction process in the general case). We define a set of binary random variables y = {y_j} on the supervoxel nodes in the entire tree, where y_j = 1 indicates that the jth supervoxel node is active. Being a segmentation hierarchy, each supervoxel connects to a set of segment nodes by their overlap in the voxel lattice; thus each y_j connects to a set of random variables at the segment level, denoted x_j.

Supervoxel hierarchies, such as [10, 5], are built by iteratively recomputing and merging finer supervoxels into coarser ones based on appearance and motion, where the body parts of an actor and its local action are contained at the finer levels of the hierarchy, and the identity of the actor and its long-ranging action are contained at the coarser levels. But going too coarse will cause over-merging with the background, and going too fine will lose the meaningful actions. It is therefore challenging to locate the supervoxels in a hierarchy that best describe an actor and its action. Instead, our GPM uses evidence from a second source, the segment-level CRF, to locate the supervoxels supported by the current labeling x. Once the supervoxels are selected, they provide strong labeling cues to the segment-level labeling: these segment-level nodes come from the same actor or the same action, and thus can be fully connected to refine the labeling.

The objective of GPM is to find the best labeling of x and y to minimize the following energy:

E(x, y) = E_seg(x) + E_tree(y) + E_lab(x | y) + E_grp(y | x)    (1)

where E_seg and E_tree encode the potentials at the segment level and the supervoxel hierarchy, respectively; E_lab and E_grp are conditional potential functions defined as directional edges in Fig. 1. To keep the discussion general, we do not define the specific form of E_seg here; it can be any segment-level CRF, such as [16, 14, 31]. We define the other terms next.

3.1 Labeling Cues from Supervoxel Hierarchy

Given an active node y_j in the supervoxel hierarchy, we use it as a cue to refine the segment-level labelings, and we define the energy of this process as:

E_lab(x | y) = Σ_{j=1..m} y_j · ψ(x_j)    (2)

Here, ψ has the form:

ψ(x_j) = λ Σ_{u<w : x_u, x_w ∈ x_j} 1[x_u ≠ x_w]    (3)

where λ is a constant parameter to be learned and 1[·] is an indicator. ψ penalizes any two nodes in the field x_j that take different labels. Eq. 2 changes the graph structure in 𝒢 by fully connecting the nodes inside x_j, and has a clear semantic meaning: this set of segment-level nodes is linked to the same supervoxel node and hence comes from the same object, taking evidence from the appearance and motion features used in a typical supervoxel segmentation method.
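To make the grouping penalty concrete, here is a minimal sketch in Python of a Potts-style penalty over the segment field tied to one active supervoxel, counting label-disagreeing pairs as in Eq. 3 (function and parameter names are ours, not the paper's):

```python
import numpy as np

def potts_group_penalty(labels, lam=1.0):
    """Illustrative sketch of Eq. 3: penalize every unordered pair of
    segment nodes in an active supervoxel's field that takes different
    labels. `lam` plays the role of the learned constant."""
    labels = np.asarray(labels)
    n = labels.size
    # Count unordered pairs sharing a label, subtract from all pairs.
    _, counts = np.unique(labels, return_counts=True)
    same = int((counts * (counts - 1) // 2).sum())
    total = n * (n - 1) // 2
    return lam * (total - same)
```

For example, a field with labels [1, 1, 2] has two disagreeing pairs, so the penalty is 2λ; a pure field costs nothing, which is exactly the fully-connected grouping pressure the text describes.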

3.2 Grouping Cues from Segment Labeling

If the selected supervoxels are too fine, they are subject to losing object identity and long-ranging actions; if they are too coarse, they are subject to over-merging with the background. Therefore, we set the selected supervoxels to best reflect the segment-level labelings while also respecting a selection prior. Given some video labeling x at the segment level, we select the nodes in the supervoxel hierarchy that best correspond to this current labeling:

E_grp(y | x) = Σ_{j=1..m} y_j · ( |V_j| · H(x_j) + β )    (4)

where |V_j| denotes the size of a supervoxel in terms of video voxels and β is a parameter to be learned that encodes a prior on the node selection in the hierarchy. H(x_j) is defined as the entropy of the labeling field connected to y_j:

H(x_j) = −Σ_{l∈ℒ} p_l log p_l,  with  p_l = (1/|x_j|) Σ_{x_i∈x_j} 1[x_i = l]    (5)

where 1[·] is an indicator function. Intuitively, the first term in Eq. 4 pushes the node selection down the hierarchy, such that selected nodes only include labeling fields with the same labels, and the second term pulls the node selection up, giving penalties for going down the hierarchy.
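As an illustration, a per-node selection cost can be sketched as a size-weighted label entropy plus a constant per-node prior that discourages descending the hierarchy. This is one plausible reading of the selection energy; the exact weighting and names are our assumptions:

```python
import numpy as np

def selection_cost(labels, voxel_size, beta=0.5):
    """Hypothetical sketch of a supervoxel selection cost: size-weighted
    entropy of the connected labeling field plus a constant per-node
    prior `beta` penalizing descent into the hierarchy."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    entropy = float(-(p * np.log(p)).sum())
    return voxel_size * entropy + beta
```

A pure field costs only the prior β, so coarse, label-pure supervoxels are preferred; a mixed field pays a size-weighted entropy on top, pushing the selection toward finer, purer nodes.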

3.3 Valid Active Nodes By The Tree Slice

The active nodes in y define what groups of segments the GPM will enforce during labeling; hence the name grouping process model. However, not all active node sets are permissible: since we seek a single labeling over the video, we enforce that each node in 𝒢 (each segment) is associated with one and only one active group in 𝒯. This notion was introduced in [41] by way of a tree slice: on every root-to-leaf path in the tree, one and only one node is active.

We follow [41] to define a matrix M that encodes all root-to-leaf paths in 𝒯: the ith row M_i encodes the path from the root to the ith leaf, with 1s for nodes on the path and 0s otherwise. We define the energy to regulate y as:

E_tree(y) = γ Σ_{i=1..L} 1[ M_i · y ≠ 1 ]    (6)

where L is the total number of leaves (also the number of such root-to-leaf paths), · denotes the dot product, and γ is a large constant that penalizes an invalid tree slice. The tree slice selects supervoxel nodes to form a new video representation that has a one-to-one mapping to the video 3D lattice.
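The tree-slice bookkeeping can be sketched as follows; the parent-array tree representation is our choice for illustration:

```python
import numpy as np

def root_to_leaf_matrix(parent):
    """Build the path matrix M of Sec. 3.3: one row per root-to-leaf
    path, with 1s for nodes on the path. `parent[j]` gives node j's
    parent (-1 for the root). A toy stand-in for the real hierarchy."""
    n = len(parent)
    children = [[] for _ in range(n)]
    for j, p in enumerate(parent):
        if p >= 0:
            children[p].append(j)
    leaves = [j for j in range(n) if not children[j]]
    M = np.zeros((len(leaves), n), dtype=int)
    for row, leaf in enumerate(leaves):
        j = leaf
        while j >= 0:          # walk up to the root, marking the path
            M[row, j] = 1
            j = parent[j]
    return M

def is_valid_slice(M, y):
    """A tree slice activates exactly one node on every path: M y = 1."""
    return bool(np.all(M @ np.asarray(y) == 1))
```

For a root with two leaf children, activating the root alone or both leaves is a valid slice, while activating the root and a leaf double-covers a path and is rejected.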

4 Bidirectional Inference for GPM

In this section, we show that we can use an iterative bidirectional inference scheme to efficiently solve the objective function defined in Eq. 1—given the segment-level labeling, we find the best supervoxels in the hierarchy; and given the selected supervoxels in the hierarchy, we regroup the segment-level labeling.

The Video Labeling Problem. Given a tree slice y, we would like to find the best labeling x. Formally, we have:

x* = argmin_x  E_seg(x) + E_lab(x | y)    (7)

This equation can take a standard CRF form depending on how E_seg is defined. The higher-order energy we defined in E_lab decomposes into a locally fully connected CRF, and its range is constrained by the tree slice such that inference is inexpensive even without Gaussian kernels [14].

The Tree Slice Problem. Given the current labeling x, we would like to find the best tree slice y. Formally, we have:

y* = argmin_y  E_tree(y) + E_grp(y | x)    (8)

We use binary linear programming to optimize Eq. 8, and thus we rewrite the problem in the following form:

min_{y ∈ {0,1}^m}  cᵀy    s.t.    M y = 1    (9)

where c_j = |V_j| · H(x_j) + β collects the per-node costs of Eq. 4, and the hard constraint M y = 1 enforces a valid tree slice in place of the penalty in Eq. 6. Note that this optimization is substantially simpler than that proposed by the original tree slice paper [41], which incorporated quadratic terms in a binary quadratic program. We use a standard solver (IBM CPLEX) to solve the binary linear program.
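The paper solves this binary linear program with IBM CPLEX. As a self-contained toy illustration, a brute-force enumeration over binary vectors makes both the feasible set (M y = 1) and the objective explicit; this is only viable at toy scale and stands in for a real BLP solver:

```python
from itertools import product
import numpy as np

def solve_tree_slice(M, c):
    """Brute-force stand-in for the tree-slice binary linear program:
    minimize c . y subject to M y = 1 and y binary. The paper uses a
    proper BLP solver (IBM CPLEX) instead of enumeration."""
    best, best_cost = None, float("inf")
    for bits in product([0, 1], repeat=len(c)):
        y = np.array(bits)
        if np.all(M @ y == 1):          # valid tree slice only
            cost = float(c @ y)
            if cost < best_cost:
                best, best_cost = y, cost
    return best, best_cost
```

With a root (cost 0.5) over two cheap leaves (cost 0.1 each), the optimizer descends the hierarchy and activates both leaves, mirroring the push-down/pull-up trade-off in the selection energy.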

Iterative Inference. The above two conditional inferences are carried out iteratively, as depicted in Fig. 1. To be specific, we initialize a coarse labeling by solving Eq. 7 without the second term, then we solve Eq. 8 and Eq. 7 in an iterative fashion. Each round of the tree slice problem enacts an updated set of grouped segments, which are then encouraged to take the same label during the subsequent labeling step. Although we do not include a proof of convergence in this paper, we observe that the solution converges after a few rounds.
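Schematically, the alternation reads as follows; the two solvers are caller-supplied stand-ins (the paper uses graph cuts for labeling and a binary LP for the tree slice), and the convergence test simply checks for a fixed point:

```python
def gpm_inference(solve_labeling, solve_tree_slice, init_labels, rounds=5):
    """Skeleton of the bidirectional inference of Sec. 4: alternate the
    labeling problem (Eq. 7) and the tree-slice problem (Eq. 8).
    `solve_labeling(y)` and `solve_tree_slice(x)` are hypothetical
    solver callbacks supplied by the caller."""
    x, y = init_labels, None
    for _ in range(rounds):
        y = solve_tree_slice(x)     # grouping cue: pick active supervoxels
        x_new = solve_labeling(y)   # labeling cue: regroup and relabel
        if x_new == x:              # fixed point (observed after a few rounds)
            break
        x = x_new
    return x, y
```

Any concrete solvers can be plugged in; the skeleton only fixes the order of the two conditional updates and the stopping test.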

Relation to AHRF. The associative hierarchical random field (AHRF) [16] implicitly pushes the inference nodes up towards higher levels in the segmentation tree, whereas our model (GPM) explicitly models the best set of active nodes in the segmentation tree by means of a tree slice. AHRF defines a full random field on the hierarchy; our model leverages the hierarchy to adaptively group at the pixel level, and is hence more scalable to video. GPM assumes that the best representation of the video content exists in a tree slice, rather than enforcing agreement across different levels as in AHRF. For example, a video of long jumping often contains running in the beginning. The running action exists and has a strong classifier signal at a fine level in the supervoxel hierarchy, but it quickly diminishes at higher levels of the hierarchy, where supervoxels capture a longer range of the time dimension and would then favor the jumping action.

Relation to FCRF. The fully-connected CRF (FCRF) in [14] imposes a Gaussian mixture kernel to regularize the pairwise interactions of nodes. Although our model fully connects the nodes within each active group for a given iteration of inference, we explicitly take the evidence from the supervoxel groupings rather than from a Gaussian kernel. The energy in Eq. 4 restricts the selected supervoxels to avoid over-merging. Although a more involved process, our inference is efficient in practice: it takes on the order of seconds for a typical video with a few thousand label nodes and a few hundred supervoxel nodes.

5 The Actor-Action Problem Modeling

Typical semantic segmentation methods [40, 35] train classifiers at the segment level. In our case, these segment-level classifiers capture the local appearance and motion features of the actors' body parts; they have some ability to locate the actor-action in a video, but their predictions are noisy since they do not capture the actor as a whole or leverage any context information in the video. Video-level classifiers, as a secondary process, capture the global information of actors performing actions and have good prediction performance at the video level; however, they are not able to localize where the action is happening. These two streams of information, captured at the segment level and at the video level, are complementary. In this section, we implement the two streams together in a single model, leveraging the grouping process model as a means of marrying the video-level signal to the segment-level problem.

Let us first define notation, extending that from Sec. 3 where possible. We use 𝒳 to denote the set of actor labels (e.g. adult, baby and dog) and 𝒴 to denote the set of action labels (e.g. eating, walking and running). The segment-level random fields now take two labels: for the ith segment, x_i is a label from the actor set 𝒳 and y_i is a label from the action set 𝒴. We define 𝒵 = 𝒳 × 𝒴 as the joint product space of the actor and action labels. We define a set of binary random variables v = {v_c} at the video level, where v_c = 1 denotes that the cth actor-action label is active at the video level; they represent the video-level multi-label labeling problem. Finally, let z = {z_j} be the set of binary random variables defined on the supervoxel hierarchy as in Sec. 3.

Therefore, we have the total energy function of the actor-action semantic segmentation defined as:

E(x, y, v, z) = E_seg(x, y) + E_vid(v) + E_vs(v | x, y) + E_lab(x, y | z, v) + E_grp(z | x, y) + E_tree(z)    (10)

where the term E_lab now models the joint potentials of the segment-level labeling field and the video-level labeling, which is slightly different from its form in Eq. 2. We have two new terms, E_vid and E_vs, from the video level, where v_c is the cth coordinate in v. We explain these new terms next.

5.1 Segment-Level CRF

At the segment level, we use the same bilayer actor-action CRF model from [40] to capture the local pairwise interactions of the two sets of labels:

E_seg(x, y) = Σ_i φ_i(x_i) + Σ_i φ′_i(y_i) + Σ_i θ_i(x_i, y_i) + Σ_{(i,k)} φ_{ik}(x_i, x_k) + Σ_{(i,k)} φ′_{ik}(y_i, y_k)    (11)

where φ_i and φ′_i encode separate potentials for a random variable to take the actor and action labels, respectively; θ_i is a potential measuring the compatibility of the actor-action tuple on segment i; and φ_{ik} and φ′_{ik} capture the pairwise interactions between neighboring segments, in the form of a contrast-sensitive Potts model [3, 31]. We use the publicly available code from [40] to compute these potentials.

5.2 Video-Level Potentials

Rather than a uniform penalty over all labels [6], we use the video-level recognition signals as global multi-label labeling costs that impact the segment-level labeling. We define the unary energy at the video level as:

E_vid(v) = Σ_c v_c · μ · 1[f_c ≤ τ]    (12)

where f_c is the video-level classification response for a particular actor-action label, and Sec. 6 describes its training process. Here, τ is a parameter that controls the response threshold, and μ is a large constant parameter. In other words, to minimize Eq. 12, the label v_c = 1 only when the classifier response f_c > τ.
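Under this reading, the label-cost construction is a one-liner: activating a label whose classifier response falls at or below the threshold pays a large cost, while labels scoring above the threshold are free to activate. Names and constant values are illustrative:

```python
import numpy as np

def label_costs(responses, tau=0.0, mu=1e3):
    """Sketch of the video-level unary: the cost of setting v_c = 1
    for each actor-action label, given classifier responses f_c.
    tau (threshold) and mu (large constant) are illustrative values."""
    responses = np.asarray(responses, dtype=float)
    return np.where(responses > tau, 0.0, mu)
```

Minimizing the resulting energy then activates exactly the labels whose responses clear the threshold, which is the thresholding behavior described in the text.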

We define the interactions between the video-level and the segment-level labelings, E_vs(v | x, y), through two indicator functions. The first determines whether the current labeling at the segment level contains a particular label or not; the second determines whether a particular label is supported at the video level or not, using the projections that map a label in the joint actor-action space to the actor and action spaces. A constant cost is charged for any actor label that appears in the segment-level labeling but is not supported at the video level; the costs for action labels and joint actor-action labels are defined similarly, with the relative magnitudes of the constants set so that the costs remain meaningful. In practice, we observe that these labeling costs from the video-level potentials help the segment-level labeling achieve a more parsimonious-in-labels result that enforces more global information than using local segments alone (see results in Table 1).
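One illustrative reading of these interaction costs, with a single hypothetical constant `delta` and our own function names, charges each actor-action pair that is used at the segment level but unsupported at the video level:

```python
def consistency_cost(segment_pairs, video_active, delta=10.0):
    """Hypothetical sketch of the video-segment interaction: every
    actor-action pair used in the segment-level labeling but absent
    from the video-level active set pays a constant cost delta."""
    used = set(segment_pairs)                # pairs present in the labeling
    unsupported = used - set(video_active)   # not backed by video-level signal
    return delta * len(unsupported)
```

Driving this cost to zero is what produces the parsimonious-in-labels behavior noted above: the labeling is discouraged from using actor-action pairs the video-level classifiers do not support.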

5.3 The GPM Potentials

Figure 2: Actor-action video labeling is refined by GPM. First row shows a test video car-jumping with its labelings. The second row shows the supervoxel hierarchy and the third row shows the active nodes with their dominant labels.

The energy terms E_tree and E_grp involved in the tree slice problem are defined as in Sec. 3. Now, we define the new labeling term:

E_lab(x, y | z, v) = Σ_{j=1..m} z_j · ψ′(x_j, y_j, v)    (16)

Here, ψ′ has the form:

ψ′(x_j, y_j, v) = v_{c*} · λ Σ_{u<w} ( 1[x_u ≠ x_w] + 1[y_u ≠ y_w] ),  with  c* = (d_𝒳(x_j), d_𝒴(y_j))    (17)

where d_𝒳(x_j) denotes the dominant actor label in the segment-level labeling field connected to z_j, and d_𝒴(y_j) is defined similarly for actions. This new term selectively refines the segmentation where the majority of the segment-level labelings agree with the video-level multi-label labeling.
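The dominant-label bookkeeping behind this refinement can be sketched with a hypothetical helper that extracts the most frequent actor and action labels in a supervoxel's field:

```python
from collections import Counter

def dominant_labels(actor_labels, action_labels):
    """The refinement term of Sec. 5.3 fires only when the dominant
    actor and action labels in a supervoxel's field match an active
    video-level label; this helper extracts those dominants."""
    actor = Counter(actor_labels).most_common(1)[0][0]
    action = Counter(action_labels).most_common(1)[0][0]
    return actor, action
```

The returned pair indexes the video-level variables, so the grouping penalty is only applied where the majority vote of the segments is backed by the video-level recognition.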

We show in Fig. 2 how this GPM process helps to refine the actor's shape (the car) in the segmentation labeling. The initial labelings from the segment-level CRF propose a rough region of interest, but they do not capture accurate boundaries and shape. After two iterations of inference, the tree slice selects the best set of nodes in the GBH hierarchy to represent the actor, and they regroup the segment-level labelings so that the labelings better capture the actor's shape. Notice that the car body in the third column merges with the background, but our full model (fourth column) overcomes this limitation by selecting different parts from the hierarchy to yield the final grouping segmentation.

5.4 Inference

The inference of the actor-action problem defined in Eq. 10 follows the bidirectional inference described in Sec. 4. The tree slice problem can be efficiently solved by binary linear programming. The video labeling problem could be solved using loopy belief propagation; however, given that the CRFs are defined over two sets of labels, the actors and actions, this inference would be very expensive. Here, we derive a way to solve it efficiently using graph cuts with label costs [1, 2, 6]. We show this conceptually in Fig. 3 and rewrite Eq. 11 over the joint label space 𝒵 as:

E_seg(x, y) = Σ_i φ̃_i(x_i, y_i) + Σ_{(i,k)} φ̃_{ik}((x_i, y_i), (x_k, y_k))    (18)

where we define the new unary as:

φ̃_i(x_i, y_i) = φ_i(x_i) + φ′_i(y_i) + θ_i(x_i, y_i)    (19)

and the pairwise interactions as:

φ̃_{ik}((x_i, y_i), (x_k, y_k)) = φ_{ik}(x_i, x_k) + φ′_{ik}(y_i, y_k)    (20)

where φ_i, φ′_i and θ_i are the actor, action and compatibility unaries of Eq. 11, and φ_{ik}, φ′_{ik} are its pairwise terms. We can rewrite Eq. 16 in a similar way, and the resulting potentials satisfy the submodular property according to the triangle inequality [13]. The label costs can be handled as in [6].
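The collapse onto the product label space can be sketched with numpy broadcasting; array shapes and function names are our assumptions for illustration:

```python
import numpy as np

def joint_unary(actor_unary, action_unary, compat):
    """Sketch of the graph-cuts rewrite of Sec. 5.4: fold the bilayer
    actor/action variables into one variable over the product label
    space by summing the two unaries and the compatibility term.
    Shapes: actor_unary (n, |X|), action_unary (n, |Y|), compat (|X|, |Y|)."""
    joint = (actor_unary[:, :, None]      # (n, |X|, 1)
             + action_unary[:, None, :]   # (n, 1, |Y|)
             + compat[None, :, :])        # (1, |X|, |Y|)
    return joint.reshape(joint.shape[0], -1)   # (n, |X| * |Y|)
```

The flattened (n, |X|·|Y|) table is then a standard multi-label unary that an off-the-shelf alpha-expansion solver with label costs can consume.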

Figure 3: Visualization of two nodes of the bilayer model in our efficient inference.


We manually explore the parameter space based on pixel-level accuracy in a heuristic fashion. We first tune the parameters involved in the video-level labeling, then those involved in the segment-level labeling, and finally those involved in the GPM by running the bidirectional inference as in Sec. 4.


6 Experiments

We evaluate our method on the recently released A2D dataset [40] and use their benchmark to evaluate performance; this is the only dataset we are aware of that incorporates actors and actions together. We compare with the top-performing trilayer model benchmark and two strong semantic image segmentation methods, AHRF [16] and FCRF [14]. For AHRF, we use the publicly available code from [16], as it contains a complete pipeline from training classifiers to learning and inference. For FCRF, we extend it to use the same features as our method.

Table 1: The overall performance on the A2D dataset, where the performance is calculated for all test videos, single actor-action videos and multiple actor-action videos. The top two rows are intermediate results of full model (sub-parts of the energy). The middle three rows are comparison methods. The bottom two rows are our full model with different supervoxel hierarchies for the grouping process.

Data Processing. We experiment with two distinct supervoxel trees: one is extracted from the hierarchical supervoxel segmentations generated by GBH [10], where supervoxels across multiple levels natively form a tree-structured hierarchy, and the other is extracted from multiple runs of a generic non-hierarchical supervoxel segmentation by TSP [4]. To extract a tree structure from the non-hierarchical video segmentations, we first sort the segmentations by the number of supervoxels they contain. Then we enforce that each supervoxel in the finer-level segmentation has one and only one parent supervoxel in the coarser-level segmentation, chosen such that the two supervoxels have the maximal overlap in the video pixel space. We use four levels from a GBH hierarchy, where the number of supervoxels varies from a few hundred to less than one hundred. We also use four different runs of TSP to construct another segmentation tree, where the final number of nodes contained in the tree varies from 500 to 1500 at the fine level and from 50 to 150 at the coarse level.
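The maximal-overlap parent assignment can be sketched over flat per-voxel label maps; the representation and names are our illustration, not the released code:

```python
import numpy as np

def assign_parents(fine, coarse):
    """Sketch of the tree extraction for non-hierarchical segmentations:
    each fine supervoxel gets the single coarse supervoxel with which it
    maximally overlaps in the voxel lattice. `fine` and `coarse` are
    flat per-voxel label maps of the same length."""
    fine, coarse = np.asarray(fine), np.asarray(coarse)
    parents = {}
    for f in np.unique(fine):
        mask = fine == f
        labels, counts = np.unique(coarse[mask], return_counts=True)
        parents[int(f)] = int(labels[np.argmax(counts)])  # max overlap wins
    return parents
```

Applying this level-by-level, from the finest segmentation up to the coarsest, yields the one-parent-per-node property required for a valid supervoxel tree.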

We also use TSP to generate the segments for the base labeling CRF. We extract the same set of appearance and motion features as in [40] (we use their code) and train one-versus-all linear SVM classifiers on the segments for three sets of labels: actor, action, and actor-action pair, separately. At the video level, we extract improved dense trajectories [38] and use Fisher vectors [29] to train linear SVM classifiers for the actor-action pair. We use the bidirectional inference and learning methods described in Sec. 4 and follow the train/test splits used in [40]. The output of our system is a full video pixel labeling. We evaluate the performance on sampled frames where the ground truth is labeled.

Figure 4: Example video labelings of the actor-action semantic segmentations for all methods. (a) - (c) are videos where most methods get correct labelings; (d) - (g) are videos where only our GPM models get the correct labelings; (h) - (i) are difficult videos in the dataset where the GPM models get partially correct labelings. Colors used are from the A2D benchmark [40].
Table 2: The performance on individual actor-action labels using all test videos. The leading scores for each label are in bold font.

Results and Comparisons. We follow the benchmark evaluation in [40] and evaluate performance for joint actor-action and separate individual tasks. Tab. 1 shows the overall results of all methods in three different calculations: when all test videos are used; when only videos containing single-label actor-action are used; and when only videos containing multiple actor-action labels are used. Roughly one-third of the videos in the A2D dataset have multiple actor-action labels. Overall, we observe that our methods (both GPM-TSP and GPM-GBH) outperform the next best one, the trilayer method, by a large margin of 17% average per-class accuracy and more than 10% global pixel accuracy over all test videos. The improvement of global pixel accuracy is consistent over the two sub-divisions of test videos, and the improvement of average per-class accuracy is larger on videos that only contain single-label actor-action. We suspect that videos containing multiple-label actor-action are more likely to confuse the video-level classifiers.

We also observe that the added grouping process in GPM-TSP and GPM-GBH consistently improves the average per-class accuracy over the intermediate result without the grouping process, on both single-label and multiple-label actor-action videos. There is a slight decrease in global pixel accuracy; we suspect the decrease mainly comes from the background class, which contributes a large portion of the total pixels in evaluation. To verify this, we also show the individual actor-action class performance in Tab. 2 when all test videos are used. We observe that GPM-GBH has the best performance on the majority of classes and improves on all classes except dog-crawling, which further shows the effectiveness of the grouping process. The performance of our method using the GBH hierarchy is slightly better than using the TSP hierarchy. We suspect this is because GBH's greedy merging process complements the Gaussian-process model in TSP, such that the resulting hierarchy complements the segment-level TSP segmentation we use.

Figure 4 shows the visual comparison of video labelings for all methods, where (a)-(c) show cases where methods output correct labels and (d)-(g) show cases where our proposed method outperforms other methods. We also show failure cases in (h) and (i) where videos contain complex actors and actions. For example, our method correctly labels the ball-rolling but confuses the label adult-running as adult-walking in (h); we correctly label adult-crawling but miss the label adult-none in (i).

7 Conclusion

Our thorough experiments on the A2D dataset show that when segment-level labeling is combined with secondary processes, such as our grouping process model and video-level recognition signals, semantic segmentation performance increases dramatically. For example, GPM-GBH improves on almost every actor-action class compared to the intermediate result without the supervoxel hierarchy, i.e., without the dynamic grouping of CRF labeling variables. This finding strongly supports our motivating argument that the two sets of labels, actors and actions, are best modeled at different levels of granularity and that they have different space-time orientations in a video.

In summary, our paper makes the following contributions to the actor-action semantic segmentation problem:

  1. A novel model that dynamically combines segment-level labeling with a hierarchical grouping process that influences the connectivity of the labeling variables.

  2. An efficient bidirectional inference method that iteratively solves the two conditional tasks, using graph cuts for labeling and binary linear programming for grouping, allowing for a continuous exchange of information.

  3. A new framework that uses video-level recognition signals as cues for segment-level labeling through multi-label labeling costs and the grouping process model.

  4. Our proposed method significantly improves performance (a 60% relative improvement over the next best method) on the recently released large-scale actor-action semantic video dataset [40].
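The grouping half of the bidirectional inference selects active supervoxel nodes via binary linear programming. A toy instance can illustrate the flavor of such a program (the coverage matrix, scores, and the at-most-one-cover constraint below are invented for illustration and are not the paper's actual energy): choose hierarchy nodes to activate so that total benefit is maximized while each leaf segment is covered by at most one active node.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# Hypothetical hierarchy: three candidate nodes (R, N1, N2), each
# covering a subset of three leaf segments (a, b, c).
#                  R   N1  N2
A = np.array([[1,  1,  0],   # leaf a covered by R and N1
              [1,  1,  0],   # leaf b covered by R and N1
              [1,  0,  1]])  # leaf c covered by R and N2
score = np.array([2.0, 1.5, 1.0])  # invented benefit of activating each node

# Binary LP: maximize score @ x  s.t.  A @ x <= 1 (each leaf covered
# at most once), x in {0, 1}.  milp minimizes, so negate the objective.
res = milp(c=-score,
           constraints=LinearConstraint(A, 0, 1),
           integrality=np.ones(3),
           bounds=Bounds(0, 1))
active = res.x.round().astype(int)
print(active, -res.fun)  # selects N1 and N2: [0 1 1], total score 2.5
```

Here activating the two finer nodes beats activating the coarse root (2.5 vs. 2.0), mirroring how the grouping step can prefer finer supervoxels when the labeling evidence supports them.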

Our implementations as well as the extended versions of AHRF and FCRF will be released upon publication.

Future Work. We set two directions for future work. First, although our model dramatically improves segmentation performance, the opportunity for this joint modeling to improve video-level recognition remains unexplored. Second, our grouping process does not incorporate semantics in the supervoxel hierarchy; we believe doing so would further improve results.


  • [1] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004.
  • [2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.
  • [3] Y. Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In IEEE International Conference on Computer Vision, 2001.
  • [4] J. Chang, D. Wei, and J. W. Fisher III. A video representation using temporal superpixels. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [5] J. J. Corso, E. Sharon, S. Dube, S. El-Saden, U. Sinha, and A. Yuille. Efficient multilevel brain tumor segmentation with integrated Bayesian model classification. IEEE Transactions on Medical Imaging, 27(5):629–640, 2008.
  • [6] A. Delong, A. Osokin, H. N. Isack, and Y. Boykov. Fast approximate energy minimization with label costs. International Journal of Computer Vision, 96(1):1–27, 2012.
  • [7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
  • [8] D. Giordano, F. Murabito, S. Palazzo, and C. Spampinato. Superpixel-based video object segmentation using perceptual organization and location prior. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [9] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2247–2253, 2007.
  • [10] M. Grundmann, V. Kwatra, M. Han, and I. Essa. Efficient hierarchical graph-based video segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
  • [11] M. Jain, J. Van Gemert, H. Jégou, P. Bouthemy, C. Snoek, et al. Action localization with tubelets from motion. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [12] S. D. Jain and K. Grauman. Supervoxel-consistent foreground propagation in video. In European Conference on Computer Vision, 2014.
  • [13] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.
  • [14] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, 2011.
  • [15] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In IEEE International Conference on Computer Vision, 2011.
  • [16] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Associative hierarchical random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1056–1077, 2014.
  • [17] L. Ladickỳ, P. Sturgess, K. Alahari, C. Russell, and P. H. Torr. What, where and how many? Combining object detectors and CRFs. In European Conference on Computer Vision, 2010.
  • [18] T. Lan, Y. Wang, and G. Mori. Discriminative figure-centric models for joint action localization and recognition. In IEEE International Conference on Computer Vision, 2011.
  • [19] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2):107–123, 2005.
  • [20] Y. J. Lee, J. Kim, and K. Grauman. Key-segments for video object segmentation. In IEEE International Conference on Computer Vision, 2011.
  • [21] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In IEEE International Conference on Computer Vision, 2013.
  • [22] J. Lu, R. Xu, and J. J. Corso. Human action segmentation with hierarchical supervoxel consistency. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [23] S. Ma, L. Sigal, and S. Sclaroff. Space-time tree ensemble for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [24] S. Ma, J. Zhang, N. Ikizler-Cinbis, and S. Sclaroff. Action recognition and localization by hierarchical space-time segments. In IEEE International Conference on Computer Vision, 2013.
  • [25] R. Mottaghi, X. Chen, X. Liu, N.-G. Cho, S.-W. Lee, S. Fidler, R. Urtasun, and A. Yuille. The role of context for object detection and semantic segmentation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [26] P. Ochs, J. Malik, and T. Brox. Segmentation of moving objects by long term video analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1187–1200, 2014.
  • [27] D. Oneata, J. Revaud, J. Verbeek, and C. Schmid. Spatio-temporal object detection proposals. In European Conference on Computer Vision, 2014.
  • [28] A. Papazoglou and V. Ferrari. Fast object segmentation in unconstrained video. In IEEE International Conference on Computer Vision, 2013.
  • [29] X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked fisher vectors. In European Conference on Computer Vision, 2014.
  • [30] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In IEEE International Conference on Pattern Recognition, 2004.
  • [31] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European Conference on Computer Vision, 2006.
  • [32] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81(1):2–23, 2009.
  • [33] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. Technical Report CRCV-TR-12-01, 2012.
  • [34] Y. Tian, R. Sukthankar, and M. Shah. Spatiotemporal deformable part models for action detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
  • [35] J. Tighe and S. Lazebnik. SuperParsing: Scalable nonparametric image parsing with superpixels. International Journal of Computer Vision, 2012.
  • [36] D. Tran and J. Yuan. Max-margin structured output regression for spatio-temporal action localization. In Advances in Neural Information Processing Systems, 2012.
  • [37] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 2013.
  • [38] H. Wang and C. Schmid. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision, 2013.
  • [39] C. Xu and J. J. Corso. Evaluation of super-voxel methods for early video processing. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [40] C. Xu, S.-H. Hsieh, C. Xiong, and J. J. Corso. Can humans fly? action understanding with multiple classes of actors. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [41] C. Xu, S. Whitt, and J. J. Corso. Flattening supervoxel hierarchies by the uniform entropy slice. In IEEE International Conference on Computer Vision, 2013.
  • [42] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei. Human action recognition by learning bases of action attributes and parts. In IEEE International Conference on Computer Vision, 2011.
  • [43] J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
  • [44] G. Yu and J. Yuan. Fast action proposals for human action detection and search. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [45] J. Yuan, Z. Liu, and Y. Wu. Discriminative subvolume search for efficient action detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [46] D. Zhang, O. Javed, and M. Shah. Video object co-segmentation by regulated maximum weight cliques. In European Conference on Computer Vision, 2014.
  • [47] Y. Zhang, X. Chen, J. Li, C. Wang, and C. Xia. Semantic object segmentation via detection in weakly labeled video. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.