The goal of this paper is to localize and recognize actions such as ‘kicking’, ‘hand waving’ and ‘salsa spin’ in video content. The recognition of actions has witnessed tremendous progress in recent years thanks to advanced video representations based on motion and appearance, e.g. (Laptev, 2005; Dollar et al., 2005; Wang et al., 2013, 2015a; Simonyan and Zisserman, 2014). However, determining the spatiotemporal extent of an action has proven considerably more challenging. Early success came from an exhaustive evaluation of possible action locations, e.g. (Ke et al., 2005; Lan et al., 2011; Tian et al., 2013). Such a sliding cuboid is tempting, but owing to the large number of possible locations it demands a relatively simple video representation, e.g. (Dalal and Triggs, 2005; Kläser et al., 2008). Moreover, the rigid cuboid shape does not necessarily capture the versatile nature of actions well. We propose an approach for action localization enabling flexible spatiotemporal subvolumes, while still allowing for modern video representations.
Tran and Yuan pioneered the prediction of flexible spatiotemporal boxes around actions (Tran and Yuan, 2011, 2012). They first obtain for each individual frame the most likely spatial locations containing the action, before determining the best temporal path or action proposal through the box search space (Tran and Yuan, 2011, 2012). Surprisingly, the initial spatial classification is frame-based and ignores motion characteristics for action recognition. More recently, both Gkioxari and Malik (2015) and Weinzaepfel et al. (2015) overcome this limitation by relying on a two-stream convolutional neural network based on appearance and two-frame motion flow. While proven effective, these works need to determine the locations in each frame with supervision, and for each action class separately, making them less suited for action localization challenges requiring hundreds of actions. Rather than separating the spatial from the temporal analysis and relying on region-level class-specific supervision, we prefer to analyze both spatial and temporal dimensions jointly to obtain action proposals in an unsupervised manner and avoid supervision until classification. Such an approach is easier to scale to hundreds of classes. Moreover, the same set of proposals can be used for applications requiring different encodings or classification schemes.
We are inspired by a method for object detection in static images called selective search (Uijlings et al., 2013). The algorithm generates box proposals for possible object locations by hierarchically merging adjacent super-pixels from (Felzenszwalb and Huttenlocher, 2004), based on similarity criteria for color, texture, size and fill. The approach does not require any supervision, making it suited to evaluate many object classes with the same set of proposals. The small set of object proposals is known to result in both high recall and overlap with the ground-truth (Hosang et al., 2016). Moreover, by separating the localization from the recognition, selective search facilitates modern encodings, such as Fisher vectors of (Sánchez et al., 2013) in (van de Sande et al., 2014) and convolutional neural network features in (Girshick et al., 2016). Following the example set by selective search for object detection, we introduce unsupervised spatiotemporal proposals for action localization by relying on video-specific appearance and motion properties derived from super-voxels.
Brox and Malik (2010) realized earlier that temporally consistent segmentations of moving objects in a video can be obtained without supervision. They propose to cluster long-term point trajectories and show that these lead to better segmentations than two-frame motion fields. Both Chen and Corso (2015) and van Gemert et al. (2015) build on the work of Brox and Malik (2010) and propose action proposals by cleverly clustering the improved dense trajectories of Wang and Schmid (2013). Their approaches are known to be very effective for untrimmed videos where temporal localization is essential. We adopt the use of long-term trajectories for temporal refinement and pruning of our action proposals, but we do not restrict ourselves exclusively to improved dense trajectories as the representation for action classification.
The first of our three contributions is to generalize the selective search strategy for unsupervised action proposals in videos. We adopt the general principle designed for static images and repurpose it for video. We consider super-voxels instead of super-pixels to produce spatiotemporal shapes. This directly gives us 2D+t sequences of bounding boxes, without the need to address the problem of linking boxes from one frame to another, as required in other approaches (Tran and Yuan, 2011, 2012; Gkioxari and Malik, 2015; Weinzaepfel et al., 2015). We refer to our action proposals as Tubelets in this paper, and summarize their generation in Figure 1.
Our second contribution is explicitly incorporating motion information in various stages of the analysis. We introduce independent motion evidence as a feature to characterize how the action motion deviates from the background motion. By analogy to image descriptors such as the Fisher vector (Sánchez et al., 2013), we encode the singularity of the motion in a feature vector associated with each super-voxel. We use the motion as an independent cue to produce super-voxels segmenting the video. In addition, motion is used as a merging criterion in the agglomerative grouping of super-voxels leading to better Tubelets.
A preliminary version of this article appeared as Jain et al. (2014). The current version adds, as a third contribution, the spatiotemporal refinement and pruning of Tubelets. The spatiotemporal refinement includes temporal sampling and smoothing of the irregularly shaped Tubelets. This post-processing considerably improves the performance while keeping the number of proposals manageable. Where Chen and Corso (2015) and van Gemert et al. (2015) derive their proposals directly and exclusively from the improved dense trajectories, we use the trajectories to refine our unsupervised action proposals from super-voxels. In addition to this technical novelty, the current paper adds: i) a detailed experimental evaluation of motion-based segmentation for better proposals, leading to large gains in both proposal quality and action localization, ii) experiments on the much larger UCF101 dataset, in addition to UCF Sports and MSR-II, iii) revised experiments for all three datasets considering both the quality of the proposals and their suitability for action localization using modern video representations (Sánchez et al., 2013; Szegedy et al., 2015), and iv) a new related work section, which is discussed next.
2 Related work
| Representation | 2D detect and track: human detector | 2D detect and track: generic detector | 3D spatio-temporal volume: cuboid | 3D spatio-temporal volume: trajectory | 3D spatio-temporal volume: voxels |
|---|---|---|---|---|---|
| - | | Puscas et al. (2015) | Chen et al. (2014) | | Oneata et al. (2014a) |
| Part-based | Lan et al. (2011); Wang et al. (2014) | | Tian et al. (2013) | Raptis et al. (2012) | |
| Cube | Kläser et al. (2012) | Tran and Yuan (2012) | Ke et al. (2005); Yuan et al. (2009); Cao et al. (2010); Derpanis et al. (2013) | | |
| BoW | Ma et al. (2013) | Tran and Yuan (2011) | | Mosabbeb et al. (2014); Chen and Corso (2015) | Jain et al. (2014); Soomro et al. (2015) |
| Fisher | Yu and Yuan (2015) | | | van Gemert et al. (2015) | This paper |
| CNN | | Gkioxari and Malik (2015) | | | Jain et al. (2015a) |
| CNN+Cube | | Weinzaepfel et al. (2015) | | | |
| CNN+BoW | | | | | Jain et al. (2015b) |
We discuss action recognition and action localization. In Table 1 we link action recognition representations with action localization methods and use it to structure our discussion of related work.
2.1 Action recognition
Part-based Action recognition by parts typically exploits the human actor. Correctly recognizing the human pose improves performance (Jhuang et al., 2013). A detailed pose model can make fine-grained distinctions between nearly similar actions (Cheron et al., 2015). Pose can be modeled with poselets (Maji et al., 2011) or as a flexible constellation of parts in a CRF (Wang and Mori, 2011). For action recognition in still images, where motion is not available, the human pose can play a role (Delaitre et al., 2010), as modeled in a part-based latent SVM (Felzenszwalb et al., 2010). In our work we make no explicit assumptions on the pose, and use generic local video features.
Cube Local video features are typically represented by a 3D cube. The seminal work of Laptev (2005) on Spatio-Temporal Interest Points (STIPs) detects points that are salient in appearance and motion and then uses a cube of Gaussian derivative filter responses to represent the interest points. An alternative representation is HOG3D (Kläser et al., 2008), which extends the 2D Histogram of Oriented Gradients (HOG) of Dalal and Triggs (2005) to 3D. Instead of using sparse salient points, the work of Dollar et al. (2005) shows that using denser sampling improves results. Replacing dense points with dense trajectories (Wang et al., 2015a) and flexible track-aligned feature cubes with motion boundary features yields excellent performance. The improved trajectories take camera motion compensation into account, which is shown to be critical in action recognition (Jain et al., 2016; Piriou et al., 2006; Wang and Schmid, 2013). In our work we build on these dense trajectories as well.
Bag of Words To arrive at a global representation over all local descriptors, BoW represents a cube descriptor by a prototype. The frequency of the prototypes aggregated in a histogram is a global video representation. The BoW representation is simple and offers good results (Everts et al., 2014; Wang et al., 2011). We consider BoW as one of our representations for action localization as well.
Fisher Vector Where BoW records prototype frequency counts, the Fisher vector (Sánchez et al., 2013) and the VLAD (Jégou et al., 2012) model the relation between local descriptors and prototypes in the feature space of the descriptor. This more sophisticated variant of BoW outperforms BoW (Jain et al., 2013; Oneata et al., 2013, 2014b). Because of the good performance we also consider the Fisher vector as a representation.
CNNs Deep learning on visual data with CNNs (Convolutional Neural Networks) has revolutionized static image recognition (Krizhevsky et al., 2012). For action recognition in videos, the work of Simonyan and Zisserman (2014) separates video into two channels: a network on static RGB and a network on hand-crafted optical flow. In Wang et al. (2015b) CNN features are used as a local feature in dense trajectories using a Fisher vector. Long-term motion can be modeled by recurrent networks (Ng et al., 2015). The distinction between motion and static objects is analyzed in Jain et al. (2015b) and extended by Jain et al. (2015a) for action recognition without using any video training data. Instead of separating static and motion information, 3D convolutional networks combine both (Tran et al., 2015). Due to their excellent performance we also adopt CNN features as a representation for action localization.
2.2 Action localization
2D Human detector Spatiotemporal action localization can be realized by running a human detector on each frame and tracking the detections. In Kläser et al. (2012) a sliding window upper-body HOG detector per frame is tracked by optical flow feature points for spatial localization. Temporal localization is achieved with a sliding window on track-aligned HOG3D features. HOG3D features are also used in Lan et al. (2011) albeit in BoW, where the 2D person detector is treated as a latent variable and an undirected relational graph inspired by a latent SVM is used for classification. Similarly, the human pose is used by Wang et al. (2014) in a relational dynamic poselet model using cuboids to model a mixture of parts. In Ma et al. (2013) dynamic action parts are extended by incorporating static parts using 2D segments. Segments are grouped to tracks and represented in a hierarchical variant of BoW. In our work we do not make the assumption that an action has to be performed by a human. Our method is equally applicable to actions by groups, animals, or vehicles.
2D generic detector By replacing the human detector with a generic detector the types of actions can be extended beyond a human actor. This can be done by finding the best path through fixed positions in a frame using HOG/HOF directly (Tran and Yuan, 2012) or through BoW (Tran and Yuan, 2011). Instead of fixed positions, Gkioxari and Malik (2015) classify object proposals with a two-stream CNN and track overlapping proposals with a high classification score. The work of Weinzaepfel et al. (2015) uses a similar two-stream CNN approach, adding a HOG/HOF/MBH-like cube descriptor at the track level, and adds temporal localization with a sliding window. The need for strong supervision is removed by Puscas et al. (2015), where generic CNN features are linked through dense trajectory tracks to yield action proposals that could be used for action localization. Similarly, our work requires no supervision for obtaining action proposals, and we experimentally show that these proposals give good results. In addition, we do not first treat a video as a collection of static frames where temporal relations are added as a separate second step. Instead, we respect the 3D spatiotemporal nature of video from the very beginning.
3D Trajectory The strength of 3D dense trajectories (Wang et al., 2015a) for action recognition spilled over to action localization. In Raptis et al. (2012) mid-level clusters of trajectories are grouped and matched with a graphical model. The work of Mosabbeb et al. (2014) groups trajectories to parts which are used in a BoW in an unsupervised manner using low-rank matrix completion and subspace clustering. Similarly, BoW on space-time graph clusters is used by Chen and Corso (2015) and a Fisher vector on trajectories is used on hierarchical clusters in van Gemert et al. (2015) for action localization. These methods specifically target the strength of dense trajectories. Instead, our approach does not commit itself to a single representation.
3D Cuboid The 3D nature of video is respected by building on space-time cuboids for action localization. Such cuboids are a natural extension of 2D patches to 3D. Ke et al. (2005) offer a 3D extension of the seminal face detector of Viola and Jones (2004) using 3D cuboids with optical flow features. The work of Yuan et al. (2009) and Cao et al. (2010) exploit the efficient branch and bound method (Lampert et al., 2008) in 3D. In Tian et al. (2013) the deformable part-based model (Felzenszwalb et al., 2010) is generalized to 3D, an efficient sliding window approach in 3D is proposed by Derpanis et al. (2013) and ordinal regression (Kim and Pavlovic, 2010) is extended by Chen et al. (2014). Instead of using cuboids, which are rigid in time and space, we choose a more delicate approach using 3D voxels.
3D Voxels As a 3D generalization of 2D image segmentation, the voxels from video segmentation methods (Xu and Corso, 2012) offer flexible and fine-grained tools for action proposals. In extension of Manen et al. (2013), the work of Oneata et al. (2014a) groups voxels together for action proposals using minimal training. Such action proposals could be used for action localization. This is done by Soomro et al. (2015), who use a supervised CRF to model foreground-background relationships for proposals and action localization. Instead, our proposal method is unsupervised and thus class agnostic. This is beneficial as it makes our algorithm independent of the number of action classes. This paper is an extension of Jain et al. (2014), where 3D voxels are grouped to proposals based on features such as color, texture and motion. The proposals have successfully been used for action localization using objects (Jain et al., 2015b) and in a zero-shot setting (Jain et al., 2015a). We will discuss the mechanics of our unsupervised action proposals next.
3 Unsupervised action proposals: Tubelets
In this section we present our approach to obtain action proposals from video in an unsupervised manner; we call the spatiotemporal proposals Tubelets. The three stages of the Tubelet generation process are shown in Figure 2. We first introduce in Subsection 3.1 our motion model based on evidence of independent motion. This motion cue is used in the first two stages of the process. In Subsection 3.2, we discuss the first stage, super-voxel segmentation, to generate an initial set of super-voxels from video. For this we rely on an off-the-shelf video segmentation as well as our proposed independent motion evidence. In Subsection 3.3 we detail the second stage of super-voxel grouping, where we iteratively group the two most similar super-voxels into a new one. The similarity score is computed using multiple grouping functions, each leading to a set of super-voxels. A super-voxel is tightly bounded by a rectangle in each frame in which it appears. The temporal sequence of bounding boxes forms our action proposal, a Tubelet. In Subsection 3.4, we introduce spatiotemporal refinement and pruning of Tubelets. This enhances the proposal quality, especially for temporal localization, while keeping the number of proposals small enough to permit computationally expensive features and memory-demanding encodings for action localization.
3.1 Evidence of independent motion
Since we are concerned with action localization, we need to aggregate super-voxels corresponding to the action of interest. Most of the points in such super-voxels deviate from the background motion, which is caused by the moving camera and is usually assumed to be the dominant motion. In other words, the regions corresponding to independently moving objects usually do not conform to the dominant motion in the frame. The dominant frame motion can be represented by a 2D parametric motion model. Typically, an affine motion model of parameters $\theta = (a_i)$, $i = 1, \dots, 6$, or a quadratic (perspective) model with 8 parameters can be used, depending on the type of camera motion and the scene layout likely to occur:

$$w_\theta(p) = (a_1 + a_2 x + a_3 y,\; a_4 + a_5 x + a_6 y) \quad \text{or}$$

$$w_\theta(p) = (a_1 + a_2 x + a_3 y + a_7 x^2 + a_8 x y,\; a_4 + a_5 x + a_6 y + a_7 x y + a_8 y^2),$$

where $w_\theta(p)$ is the velocity vector supplied by the motion model at point $p = (x, y)$ in the image domain $\Omega$. In this paper, we use the affine motion model for all the experiments.

We formulate the evidence that a point $p \in \Omega$ undergoes an independent motion (i.e., an action-related motion) at time step $t$. Let us introduce the displaced frame difference at point $p$ and at time step $t$ for the motion model of parameter $\theta$: $r_\theta(p, t) = I(p + w_\theta(p), t+1) - I(p, t)$, where $I$ denotes the image intensity. Here, $r_\theta(p, t)$ will be close to $0$ if point $p$ only undergoes the background motion due to camera motion. At every time step $t$, the global parametric motion model can be estimated with a robust penalty function as

$$\hat{\theta}_t = \arg\min_{\theta} \sum_{p \in \Omega} \rho\big(r_\theta(p, t)\big), \tag{1}$$

where $\rho(\cdot)$ is the robust function. To solve (1), we use the publicly available Motion2D software by Odobez and Bouthemy (1995), where $\rho(\cdot)$ is defined as the Tukey function. Minimizing $\rho(r_\theta)$ produces a maximum likelihood type estimate: the so-called M-estimate (Huber, 1981). Indeed, if we write $\rho(r_\theta) = -\log f(r_\theta)$ for a given function $f$, $\rho(r_\theta)$ supplies the usual maximum likelihood estimate. Since we are looking for action-related moving points in the image, we want to measure the deviation from the global (background) motion. This is in the spirit of the Fisher vectors of Perronnin and Dance (2007), where the deviation of local descriptors from a background Gaussian mixture model is encoded to produce an image representation.

Let us consider the derivative of the robust function $\rho(\cdot)$. It is usually denoted as $\psi(\cdot)$ and corresponds to the influence function (Huber, 1981). More precisely, the ratio $\psi(r_\theta)/r_\theta$ accounts for the influence of the residual $r_\theta$ in the robust estimation of the model parameters. The higher the influence, the more likely the point conforms to the global motion. Conversely, the lower the influence, the less likely the point conforms to the global motion. This leads us to define the independent motion evidence as:

$$\xi(p, t) = 1 - \varpi(p, t),$$

where $\varpi(p, t)$ is the ratio $\psi\big(r_{\hat{\theta}_t}(p, t)\big) / r_{\hat{\theta}_t}(p, t)$ normalized within $[0, 1]$.
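As a hedged illustration of this definition, the sketch below maps displaced frame differences to independent motion evidence, assuming the Tukey biweight as robust function. The function names and the scale constant `c` are illustrative assumptions, and the estimation of the motion model itself (performed with Motion2D in this paper) is not reproduced.

```python
def tukey_weight(residual, c=4.685):
    """Ratio influence(r) / r for the Tukey biweight.

    It equals (1 - (r/c)^2)^2 inside [-c, c] and 0 outside, so it already
    lies in [0, 1]. The scale c is an assumed tuning constant.
    """
    u = residual / c
    if abs(u) >= 1.0:
        return 0.0
    return (1.0 - u * u) ** 2


def independent_motion_evidence(residuals, c=4.685):
    """Per-pixel evidence of independent motion: 1 - normalized weight."""
    return [[1.0 - tukey_weight(r, c) for r in row] for row in residuals]


# Toy 2x3 map of displaced frame differences: small residuals follow the
# dominant (camera) motion, large ones are treated as independent motion.
residuals = [[0.0, 0.5, 10.0],
             [1.0, -6.0, 0.2]]
evidence = independent_motion_evidence(residuals)
```

Pixels whose residual exceeds the scale `c` receive full evidence, while pixels that are perfectly explained by the dominant motion receive none.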
3.2 Super-voxel segmentation
To generate an initial set of super-voxels, we rely on the third-party graph-based video segmentation of Xu and Corso (2012). We choose their graph-based segmentation over the other methods in (Xu and Corso, 2012) because it is more efficient w.r.t. time and memory. The graph-based segmentation is about 13 times faster than the slightly more accurate hierarchical version (Xu and Corso, 2012).
As an alternative to the off-the-shelf video segmentation, each video frame is represented by the corresponding map of independent motion evidence of its pixels (IME map). This encodes motion information in the segmentation. We show video frames and their IME maps in Figure 3(a) and 3(b). We post-process the IME maps by applying a morphological closing operation (dilation followed by erosion) to obtain denoised maps, which we refer to as denoised IME maps, displayed in Figure 3(c). Applying the graph-based video segmentation of Xu and Corso (2012) on sequences of these denoised maps partitions the video into super-voxels with independent motion. Three examples of results obtained this way are shown in Figure 3(d). The first column shows a frame from the action ‘Swing-Bench’, where the action of interest is highlighted by the IME map itself and then clearly delineated by segmenting the denoised maps. The second column shows an example from the action ‘Running’. Here the segmentation does not give an ideal set of initial super-voxels, but the IME map has useful information to be exploited by our motion-feature-based merging criterion (described in Subsection 3.3). An example of ‘Hand Waving’ is shown in the last column. The resulting super-voxels are more adapted and aligned to the action sequences. This alternative for initial segmentation is also more efficient: it is about 4 times faster than graph-based segmentation on the original video and produces 8 times fewer super-voxels. Unlike graph-based video segmentation on the original frames, this alternative set of initial super-voxels exploits motion information. The two are complementary and together lead to much better proposal quality, as shown later in our experiments.
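The closing step can be sketched as follows, under the assumption of a square structuring element; the size of the element actually used is not stated here, so `radius` is a hypothetical parameter. Dilation takes the maximum and erosion the minimum over each pixel's neighbourhood, with windows clipped at the image border.

```python
def _window_filter(image, reducer, radius):
    """Apply `reducer` (max or min) over a (2*radius+1)^2 window per pixel."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [image[yy][xx]
                      for yy in range(max(0, y - radius), min(height, y + radius + 1))
                      for xx in range(max(0, x - radius), min(width, x + radius + 1))]
            row.append(reducer(window))
        out.append(row)
    return out


def closing(image, radius=1):
    """Morphological closing: dilation (max) followed by erosion (min)."""
    return _window_filter(_window_filter(image, max, radius), min, radius)


# A 7x7 toy evidence map: a 3x3 ring of motion with a one-pixel hole.
noisy = [[0] * 7 for _ in range(7)]
for y in range(2, 5):
    for x in range(2, 5):
        noisy[y][x] = 1
noisy[3][3] = 0          # hole inside the moving region
closed = closing(noisy)  # the hole is filled, the background stays empty
```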
3.3 Super-voxel grouping
Having defined our ways to segment a video sequence into super-voxels, we are now ready to present our method for grouping super-voxels into Tubelets. The grouping is done in two steps. In the first step, initial super-voxels are grouped iteratively to create new super-voxels. A grouping function computes the similarity between any two super-voxels, and the successive groupings of the most similar pairs lead to a new set of super-voxels. Each grouping function leads to a set of super-voxels. In the second step, the super-voxel sets produced by multiple grouping functions are combined by taking their union. Each super-voxel in this united set is then enclosed by boxes in each frame to yield the Tubelets.
We iteratively group super-voxels in an agglomerative manner. Starting from the initial set of super-voxels, we hierarchically group them until the video becomes a single super-voxel. At each iteration, a new super-voxel is produced from two super-voxels, which are then not considered any more in subsequent iterations. This iterative merging algorithm is inspired by the selective search method proposed for localization in images by (Uijlings et al., 2013).
Formally, we produce a hierarchy of super-voxels that is represented as a tree: the leaves correspond to the initial super-voxels, while the internal nodes are produced by the merge operations. The root node is the whole video, and the corresponding super-voxel is produced in the last iteration. Since this hierarchy of super-voxels is organized as a binary tree, it is straightforward to show that, starting from $n$ initial super-voxels, $n - 1$ additional super-voxels are produced by the algorithm. Out of these super-voxels, those which are very small or contain no motion at all are discarded at this point. This usually leaves far fewer super-voxels, depending on the grouping function used.
For the selection of the two super-voxels to be grouped, we rely on similarities computed between all the neighboring super-voxels that are still active. We employ five complementary similarity measures in our grouping functions to compare super-voxels, in order to decide which should be merged. They are fast to compute. Four of these measures are adapted from selective search in images: the measures based on color, texture, size and fill were originally computed for super-pixels (Uijlings et al., 2013). We revise them for super-voxels. As our objective is not to segment the objects but to delineate the actions or actors, we additionally employ a motion-based similarity measure based on our independent motion evidence to characterize a super-voxel. The grouping function is defined as any one of the similarity measures or the sum of several of them. Next we present the five similarity measures for super-voxels: motion, color, texture, size and fill.
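The merging loop can be sketched as below. This is a minimal illustration rather than our implementation: each super-voxel carries a toy scalar descriptor, the adjacency graph is assumed to be connected, and `group_super_voxels` and its arguments are hypothetical names. It shows that a set of initial super-voxels yields one merge fewer than its size.

```python
def group_super_voxels(features, sizes, neighbors, similarity):
    """Greedily merge the most similar pair of neighbouring super-voxels.

    features:  {id: float}     toy one-dimensional descriptor per super-voxel
    sizes:     {id: int}       number of pixels per super-voxel
    neighbors: {id: set(ids)}  adjacency, assumed to form a connected graph
    Returns the merges as (id_a, id_b, new_id), in the order they happen.
    """
    features, sizes = dict(features), dict(sizes)
    neighbors = {k: set(v) for k, v in neighbors.items()}
    next_id = max(features) + 1
    merges = []
    while len(features) > 1:
        # Only pairs of active, adjacent super-voxels are candidates.
        pairs = [(a, b) for a in neighbors for b in neighbors[a] if a < b]
        a, b = max(pairs, key=lambda p: similarity(features[p[0]], features[p[1]]))
        total = sizes[a] + sizes[b]
        # Size-weighted descriptor of the new super-voxel.
        features[next_id] = (sizes[a] * features[a] + sizes[b] * features[b]) / total
        sizes[next_id] = total
        merged_neighbors = (neighbors[a] | neighbors[b]) - {a, b}
        for old in (a, b):  # the two merged super-voxels become inactive
            for n in neighbors.pop(old):
                if n in neighbors:
                    neighbors[n].discard(old)
            del features[old], sizes[old]
        neighbors[next_id] = merged_neighbors
        for n in merged_neighbors:
            neighbors[n].add(next_id)
        merges.append((a, b, next_id))
        next_id += 1
    return merges


# Four super-voxels in a chain 0-1-2-3.
merges = group_super_voxels(
    features={0: 1.0, 1: 2.0, 2: 8.0, 3: 9.5},
    sizes={0: 10, 1: 10, 2: 10, 3: 10},
    neighbors={0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}},
    similarity=lambda x, y: -abs(x - y),
)
```

With four leaves, three merges are recorded, so the hierarchy contains seven super-voxels in total, the last of which covers the whole toy video.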
Similarity by motion ($s_M$):
We define a motion representation of super-voxels from the independent motion evidence maps, capturing the relevant motion information. This motion representation is also efficient to compute. We consider the binarized version of the maps, obtained by setting all non-zero values to $1$. At every pixel $p$, we count the number of pixels (including $p$) in its neighborhood that are set to $1$ (i.e. pixels likely to be related to actions). In a subvolume of $5 \times 5 \times 3$ pixels, this count value ranges from 0 to 75. A motion histogram of these values, denoted by $h_i$, is computed over the super-voxel $r_i$. Intuitively, this histogram captures both the density and the compactness of a given region with respect to the number of points belonging to independently moving objects.
Now, two super-voxels, $r_i$ and $r_j$, represented by motion histograms are compared as follows. The motion histograms are first $\ell_1$-normalized and then compared with histogram intersection, $s_M(r_i, r_j) = \sum_k \min(h_i^k, h_j^k)$. The histograms are efficiently propagated through the hierarchy of super-voxels. Denoting with $r_t$ the super-voxel obtained by merging the super-voxels $r_i$ and $r_j$, we have:

$$h_t = \frac{\Gamma(r_i)\, h_i + \Gamma(r_j)\, h_j}{\Gamma(r_i) + \Gamma(r_j)},$$

where $\Gamma(r)$ denotes the number of pixels in super-voxel $r$. The size of the new super-voxel $r_t$ is $\Gamma(r_t) = \Gamma(r_i) + \Gamma(r_j)$.
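A minimal sketch of the comparison and propagation steps, assuming $\ell_1$ normalization and a size-weighted average for the merged histogram, as in selective search. The toy histograms have four bins for readability; the motion histogram described above would have one bin per count value from 0 to 75.

```python
def l1_normalize(hist):
    """Scale a histogram so that its bins sum to one."""
    total = float(sum(hist))
    return [v / total for v in hist] if total > 0 else list(hist)


def histogram_intersection(h_i, h_j):
    """Similarity in [0, 1] between two l1-normalized histograms."""
    return sum(min(a, b) for a, b in zip(h_i, h_j))


def merge_histograms(h_i, size_i, h_j, size_j):
    """Histogram and size of the merged super-voxel (size-weighted average)."""
    total = size_i + size_j
    merged = [(size_i * a + size_j * b) / total for a, b in zip(h_i, h_j)]
    return merged, total


h_i = l1_normalize([2, 2, 0, 0])   # super-voxel of 100 pixels
h_j = l1_normalize([0, 1, 1, 2])   # super-voxel of 300 pixels
similarity = histogram_intersection(h_i, h_j)
h_t, size_t = merge_histograms(h_i, 100, h_j, 300)
```

Because the merged histogram is computed from the two parent histograms alone, no pixel of the video has to be revisited after a merge.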
Similarity by color ($s_C$) and texture ($s_T$).
In addition to motion, we also consider similarity based on color and texture. Both histograms, $h^C$ and $h^T$, are identical to the histograms considered for selective search in images (Uijlings et al., 2013), except that we compute them on super-voxels rather than super-pixels. The histograms are computed from color and intensity gradient for each given super-voxel:
The color histogram $h^C$ captures the HSV components of the pixels included in a super-voxel;
the texture histogram $h^T$ encodes the texture or gradient information of a given super-voxel.
The similarity computation and the merging process for color and texture are the same as for motion: each super-voxel is described by a histogram, and two super-voxels are compared by histogram intersection.
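Similarity by size ($s_\Gamma$) and fill ($s_F$).
The similarity $s_\Gamma(r_i, r_j)$ aims at merging smaller super-voxels first:

$$s_\Gamma(r_i, r_j) = 1 - \frac{\Gamma(r_i) + \Gamma(r_j)}{\Gamma(\text{video})},$$

where $\Gamma(r)$ is the number of pixels in super-voxel $r$ and $\Gamma(\text{video})$ is the size of the video (in pixels). This tends to produce super-voxels, and therefore Tubelets, of varying sizes in all parts of the video (recall that we only merge contiguous super-voxels).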
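The last similarity measure, $s_F$, measures how well super-voxels $r_i$ and $r_j$ fit into each other. We define $B_{i,j}$ to be the tight bounding cuboid enveloping $r_i$ and $r_j$. With $\Gamma(\cdot)$ denoting size in pixels, the similarity is given by:

$$s_F(r_i, r_j) = \frac{\Gamma(r_i) + \Gamma(r_j)}{\Gamma(B_{i,j})}.$$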
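The two geometric measures can be sketched as follows. The size measure is one minus the fraction of the video covered by the pair; for the fill measure the sketch assumes the fraction of the tight bounding cuboid that the pair occupies. The cuboid layout `(x0, y0, t0, x1, y1, t1)` with exclusive upper bounds and all function names are assumptions made for illustration.

```python
def size_similarity(size_i, size_j, video_size):
    """Close to 1 for small pairs, so that small super-voxels merge first."""
    return 1.0 - (size_i + size_j) / video_size


def bounding_cuboid(box_i, box_j):
    """Tight cuboid (x0, y0, t0, x1, y1, t1) enveloping two cuboids."""
    return tuple(min(a, b) for a, b in zip(box_i[:3], box_j[:3])) + \
           tuple(max(a, b) for a, b in zip(box_i[3:], box_j[3:]))


def cuboid_volume(box):
    x0, y0, t0, x1, y1, t1 = box
    return (x1 - x0) * (y1 - y0) * (t1 - t0)


def fill_similarity(size_i, size_j, box_i, box_j):
    """Fraction of the joint bounding cuboid occupied by the two super-voxels."""
    return (size_i + size_j) / cuboid_volume(bounding_cuboid(box_i, box_j))


# Two adjacent super-voxels in a 16 x 16 video of 4 frames.
box_i, size_i = (0, 0, 0, 4, 4, 2), 32   # fills its own cuboid completely
box_j, size_j = (4, 0, 0, 8, 4, 2), 16   # fills only half of its cuboid
s_size = size_similarity(size_i, size_j, video_size=16 * 16 * 4)
s_fill = fill_similarity(size_i, size_j, box_i, box_j)
```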
After each merge, we compute the new similarities between the resulting super-voxel and its neighbors. The procedure is illustrated in the following two figures. Figure 4 illustrates the method on a sample video. Each color represents a super-voxel, and after every iteration a new super-voxel is added and two are removed. After iterations, observe that two Tubelets (blue and dark green) emerge around the action of interest in the beginning and the end of the video, respectively. At iteration 1,720, the two corresponding super-voxels are merged. The novel Tubelet (dark green) resembles the yellow ground-truth sequence of bounding boxes. This exhibits the ability of our method to group super-voxels both spatially and temporally. Importantly, it also shows the capability to sample an action proposal with boxes having very different aspect ratios. Sliding subvolumes, or even approaches based on efficient sub-window search, are unlikely to cope with this. Figure 5 depicts another example, with a single frame considered at different stages of the algorithm. Here the initial super-voxels (second image in first row) are spatially more decomposed because the background is cluttered both in appearance and in motion (spectators cheering). Even in such a challenging case our method is able to group the super-voxels related to the action of interest.
3.4 Pruning and spatiotemporal refinement of Tubelets
We apply two types of pruning to reduce the number of proposals, leading to a more compact set of Tubelet action proposals with minimal impact on the recall.
Motion pruning: The first type of pruning is based on the amount of motion. Long videos that have much background clutter due to unrelated actors/objects, usually result in many irrelevant Tubelet proposals. We filter them based on their motion content, which we quantify by the number of motion trajectories (Wang and Schmid, 2013). For each video, we rank the Tubelet proposals based on the number of trajectories, keep the top proposals and the top ten percent of the rest. This is to ensure that at least a minimal number of proposals are retained from each video.
Overlap pruning: The second type of pruning is based on mutual overlaps of the action proposals. Many proposals have very high alignment or overlaps between them, all practically representing the same part of the video. To eliminate such redundant proposals we keep only one in a set of many highly overlapping ones. It is particularly useful when there is a large number of action proposals per video.
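Both pruning steps can be sketched as follows. Several details are assumptions made for illustration: a Tubelet is stored as a dictionary from frame index to box `(x0, y0, x1, y1)`, the overlap of two Tubelets is taken as the per-frame intersection-over-union averaged over all frames covered by either of them, and `top_k`, `percent` and `threshold` are hypothetical parameters.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def tubelet_overlap(p, q):
    """Mean per-frame IoU over the frames covered by either Tubelet."""
    frames = set(p) | set(q)
    shared = set(p) & set(q)
    return sum(box_iou(p[f], q[f]) for f in shared) / len(frames)


def motion_prune(proposals, trajectory_counts, top_k, percent=10):
    """Keep the top_k proposals by trajectory count plus a share of the rest."""
    order = sorted(range(len(proposals)),
                   key=lambda i: trajectory_counts[i], reverse=True)
    rest = order[top_k:]
    keep = order[:top_k] + rest[:len(rest) * percent // 100]
    return [proposals[i] for i in keep]


def overlap_prune(proposals, threshold=0.8):
    """Greedily keep one proposal out of each highly overlapping group."""
    kept = []
    for p in proposals:
        if all(tubelet_overlap(p, q) < threshold for q in kept):
            kept.append(p)
    return kept


a = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)}
b = {0: (0, 0, 10, 10), 1: (1, 0, 11, 10)}      # near-duplicate of a
c = {0: (50, 50, 60, 60), 1: (50, 50, 60, 60)}  # elsewhere in the frame
distinct = overlap_prune([a, b, c])
ranked = motion_prune(list(range(12)), [12 - i for i in range(12)], top_k=2)
```

In this toy example the near-duplicate `b` is dropped, and motion pruning keeps the two proposals with the most trajectories plus one of the remaining ten.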
[Figure: example videos from the classes BaseballPitch, Billiards, HighJump, Soccer penalty and Tennis swing]
A super-voxel and therefore a Tubelet capturing an actor/object can continue to extend further even after the action is completed as shown in the top row of Figure 6. Tubelets are generated from super-voxels that generally follow an object or an actor and hence can be irregular in shape spatially, sometimes leading to sudden changes in the size of consecutive bounding boxes. We propose to handle the above two problems of weak temporal localization and non-smooth spatial localization by temporal and spatial refinement.
Temporal refinement: In order to deal with the overly long Tubelets, we propose to temporally sample or segment them. For this we devise a method that can segment each proposal into smaller sub-sequences with tighter temporal boundaries, without increasing the total number of proposals too much. This temporal refinement is applied to one proposal at a time. Consider an action proposal of $n$ boxes (i.e., extending over $n$ frames), where box $b_i$ has $\tau_i$ trajectories passing through it (with $i = 1, \dots, n$). We represent each box by two values: (a) its relative location $i/n$ and (b) its relative motion content $\tau_i/\tau_{\max}$. Here, $\tau_{\max}$ is the maximum number of trajectories passing through any of the $n$ boxes. The boxes that have similar relative location and relative motion content are grouped together by clustering, such that the initial proposal is segmented into about fifteen sub-sequences. Then, very short proposals with a temporal length of less than thirty frames are filtered out. In practice, this increases the number of proposals by a factor of ten. Therefore, we precede and follow temporal sampling by overlap pruning, to restrict the total number of proposals. The impact of temporal refinement is shown in the second row of Figure 6.
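The segmentation of a single proposal can be sketched as below. The clustering algorithm is not specified above, so this sketch assumes a plain k-means with evenly spaced initial centres; `temporal_segments` and its parameters are hypothetical names. Each box is described by its relative location and its relative motion content, and contiguous runs of boxes with the same cluster label become sub-sequences.

```python
def temporal_segments(trajectory_counts, n_clusters=15, min_length=30, n_iter=20):
    """Split one proposal into sub-sequences, returned as (start, end) with end exclusive."""
    n = len(trajectory_counts)
    t_max = max(trajectory_counts) or 1
    # One 2D point per box: (relative location, relative motion content).
    points = [((i + 1) / n, count / t_max)
              for i, count in enumerate(trajectory_counts)]
    k = min(n_clusters, n)
    centers = [points[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = [0] * n
    for _ in range(n_iter):  # assumed k-means; any clustering could be used
        labels = [min(range(k),
                      key=lambda j: (p[0] - centers[j][0]) ** 2
                                    + (p[1] - centers[j][1]) ** 2)
                  for p in points]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    # Contiguous runs of equal labels form sub-sequences; drop the short ones.
    segments, start = [], 0
    for i in range(1, n + 1):
        if i == n or labels[i] != labels[start]:
            if i - start >= min_length:
                segments.append((start, i))
            start = i
    return segments


# A proposal of 120 boxes whose middle third contains all the motion.
counts = [0] * 40 + [50] * 40 + [0] * 40
segments = temporal_segments(counts, n_clusters=3, min_length=30)
```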
Spatial refinement: We apply spatial refinement to the proposals, to steer the super-voxels closer to the shape of the action rather than that of the objects or actor, and to avoid sudden changes in the sizes of bounding boxes and thus obtain a smoother sequence of boxes. First, to align the boxes more closely with the action, we modify them such that they are not void of motion trajectories at the boundaries. In each box, the minimum and maximum of the $x$ and $y$ coordinates of intersecting trajectories are computed and the box is restricted to the extent they span. Second, we apply weighted linear regression on the width, the height, and the $x$ and $y$ coordinates of the top left corner of the boxes. This is done over a local span of a few frames, typically a fifth of the proposal length. The impact of spatial refinement after temporal refinement is shown in the last row of Figure 6.
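The two spatial refinement steps can be sketched as below. This is a hedged illustration under stated assumptions: boxes are clipped exactly to the extent of the trajectory points (no margin), the regression weights are triangular around the current frame, and all function names are hypothetical; the text specifies none of these details.

```python
def clip_to_trajectories(box, points):
    """Shrink box = (x1, y1, x2, y2) to the extent of the trajectory points inside it."""
    if not points:
        return box
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(box[0], min(xs)), max(box[1], min(ys)),
            min(box[2], max(xs)), min(box[3], max(ys)))


def weighted_linear_fit(ts, vs, ws):
    """Weighted least-squares line v = a + b * t; returns (a, b)."""
    sw = sum(ws)
    mt = sum(w * t for w, t in zip(ws, ts)) / sw
    mv = sum(w * v for w, v in zip(ws, vs)) / sw
    var = sum(w * (t - mt) ** 2 for w, t in zip(ws, ts))
    if var == 0:
        return mv, 0.0
    b = sum(w * (t - mt) * (v - mv) for w, t, v in zip(ws, ts, vs)) / var
    return mv - b * mt, b


def smooth_sequence(values, span):
    """Replace each value by a local weighted linear fit evaluated at its own frame."""
    half = max(1, span // 2)
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        ts = list(range(lo, hi))
        # Triangular weights (an assumption): nearby frames count more.
        ws = [1.0 - abs(t - i) / (half + 1) for t in ts]
        a, b = weighted_linear_fit(ts, values[lo:hi], ws)
        out.append(a + b * i)
    return out


def refine_boxes(boxes):
    """boxes: list of (x, y, width, height); the span is a fifth of the proposal length."""
    span = max(3, len(boxes) // 5)
    columns = [smooth_sequence([b[k] for b in boxes], span) for k in range(4)]
    return list(zip(*columns))
```

A sequence that already changes linearly over time is left unchanged by `smooth_sequence`, while an isolated jump in box size is pulled towards its neighbours.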
4 Datasets and Evaluation Criteria
The UCF Sports dataset consists of 150 videos of actions extracted from sports broadcasts, with realistic actions captured in dynamic and cluttered environments (Rodriguez et al., 2008). This dataset is challenging due to many actions with large displacement and intra-class variability. Ten action categories are represented, for instance ‘diving’, ‘swinging bench’ and ‘horse riding’. We use the disjoint train-test split of videos (103 for training and 47 for testing) suggested by Lan et al. (2011). The ground truth is provided as sequences of bounding boxes enclosing the actors. The area under the ROC curve (AUC) is the standard evaluation measure used, and we follow this convention.
MSR-II and KTH.
The MSR-II dataset consists of 54 videos recorded in a crowded environment with many people moving in the background. Each video contains multiple actions of three types: ‘boxing’, ‘hand clapping’ and ‘hand waving’. An actor appears, performs one of these actions, and walks away. A single video has multiple actions (5-10) of different types, making the temporal localization challenging. Bounding subvolumes or cuboids are provided as the ground truth. Since the actors do not change their location, this is equivalent to a sequence of bounding boxes. The localization criterion is subvolume-based, so we follow Cao et al. (2010) and use the tight subvolume or cuboid enveloping the Tubelet. Precision-recall curves and average precision (AP) are used for evaluation (Cao et al., 2010). As is standard practice, this dataset is used for cross-dataset experiments with KTH (Schüldt et al., 2004) as the training set.
The UCF101 dataset by Soomro et al. (2012) is a large action recognition dataset containing 101 action categories, of which 24 are provided with localization annotations, corresponding to 3,204 videos. Each video contains one or more instances of the same action class. It has large variations (camera motion, appearance, scale, etc.) and exhibits much diversity in terms of actions. Three train/test splits are provided with the dataset; we perform all evaluations on the first split, with 2,290 videos for training and 914 videos for testing. Mean average precision is used for evaluation.
Example frames of some of the action classes are shown in Figure 7 for each dataset.
4.2 Evaluation criteria for action proposals
To evaluate the quality of action proposals, we compute the upper bound on the localization accuracy, as previously done to evaluate the quality of object proposals (Uijlings et al., 2013), by the Mean Average Best Overlap (MABO) and maximum possible recall. In this subsection, we extend these measures from objects in images to actions in videos. This requires measuring the overlap between two sequences of boxes instead of two boxes.
Overlap or localization score.
In a given video of $F$ frames comprising $m$ instances of different actions, the $i$-th ground-truth sequence of bounding boxes is given by $gt_i = (B^i_1, B^i_2, \ldots, B^i_F)$. If there is no action of instance $i$ in frame $f$, then $B^i_f = \emptyset$. From the action proposals, the $j$-th proposal formed by a sequence of bounding boxes is denoted as $dt_j = (D^j_1, D^j_2, \ldots, D^j_F)$. Let $OV_{i,j}(f)$ be the overlap between the two bounding boxes in frame $f$, which is computed as intersection-over-union. The localization score between a ground-truth Tubelet $gt_i$ and a Tubelet $dt_j$ is given by:
$$S(gt_i, dt_j) = \frac{1}{|\Gamma|} \sum_{f \in \Gamma} OV_{i,j}(f), \tag{6}$$
where $\Gamma$ is the set of frames where at least one of $B^i_f$, $D^j_f$ is not empty. This criterion generalizes the one proposed by Lan et al. (2011) by taking into account the temporal axis.
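The localization score can be sketched in a few lines. The sketch assumes that the score is the per-frame intersection-over-union averaged over all frames in which at least one of the two sequences has a box, that boxes are `(x1, y1, x2, y2)` tuples, and that `None` marks an empty box; the function names are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2); None means no box."""
    if a is None or b is None:
        return 0.0
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def localization_score(gt, proposal):
    """Mean per-frame IoU over frames where either sequence has a box.

    gt and proposal are equally long lists with one entry (box or None) per frame.
    """
    frames = [f for f in range(len(gt))
              if gt[f] is not None or proposal[f] is not None]
    if not frames:
        return 0.0
    return sum(iou(gt[f], proposal[f]) for f in frames) / len(frames)
```

Frames where only one of the two sequences has a box contribute an overlap of zero, which is how a proposal that runs on after the action has ended is penalized.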
Mean Average Best Overlap (MABO).
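The Average Best Overlap (ABO) for a given class $c$ is obtained by computing, for each ground-truth annotation $gt_i \in G^c$, the best localization from the set of action proposals $P_i = \{dt_j\}$:
$$ABO(c) = \frac{1}{|G^c|} \sum_{gt_i \in G^c} \max_{dt_j \in P_i} S(gt_i, dt_j),$$
where $S(gt_i, dt_j)$ is the localization score between the annotation and a proposal.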
The mean ABO (MABO) summarizes the performance over all the classes.
Maximum possible recall (Recall).
Another measure of the quality of proposals is the maximum possible recall. It is computed as the fraction of ground-truth actions whose best overlap is greater than the overlap threshold ($\sigma$), averaged over action classes. We compute it with a very stringent localization threshold.
Note that adding more proposals can only increase the MABO and Recall (scores are maintained if added proposals are not better). So, both MABO and Recall must be considered jointly with the number of proposals.
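Both measures can be computed from precomputed localization scores, as in the sketch below. The input layout (a dictionary mapping each class to one list of proposal scores per ground-truth annotation) and the function name are assumptions made for this illustration, and the threshold of 0.5 in the usage note is only an example value.

```python
def mabo_and_recall(scores_by_class, threshold):
    """Return (MABO, Recall) over classes.

    scores_by_class maps a class name to a list with one entry per ground-truth
    annotation; each entry lists the localization scores of all proposals
    against that annotation.
    """
    abos, recalls = [], []
    for per_annotation in scores_by_class.values():
        # Best overlap achieved by any proposal for each annotation.
        best = [max(scores, default=0.0) for scores in per_annotation]
        abos.append(sum(best) / len(best))
        recalls.append(sum(b > threshold for b in best) / len(best))
    # Average the per-class values so that every class counts equally.
    return sum(abos) / len(abos), sum(recalls) / len(recalls)
```

For example, `mabo_and_recall({"diving": [[0.2, 0.8], [0.4, 0.3]], "kicking": [[0.6]]}, 0.5)` gives a MABO of 0.6 and a Recall of 0.75. Appending a proposal to any score list can only raise the per-annotation maximum, which is why both measures have to be read together with the number of proposals.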
An instance of an action, $gt_i$, is considered to be correctly localized by an action proposal, $dt_j$, if the action is correctly predicted by the classifier and the localization score $S(gt_i, dt_j)$ is greater than the overlap threshold $\sigma$, i.e., $S(gt_i, dt_j) > \sigma$.
[Table 2: segmentation methods compared by MABO, Recall, number of super-voxels and time (secs).]
5 Experiments: Quality of Tubelets
In this section, we first analyze and evaluate the three stages of Tubelet extraction on the training set of the UCF Sports dataset. The initial step, super-voxel segmentation, is discussed in Subsection 5.1. Then, we evaluate different grouping functions over the initial set of super-voxels in Subsection 5.2 and also show that segmenting iMotion maps is complementary to segmenting input video frames. In Subsection 5.3, we evaluate the impact of spatiotemporal refinement and pruning on all three datasets. Finally, in Subsection 5.4 we compare Tubelets with the state-of-the-art. We evaluate Tubelets with modern representations for action localization in Section 6.
5.1 Super-voxel segmentation
Here, we evaluate the graph-based segmentation of video and the graph-based segmentation of iMotion maps. We set the parameters as follows: $\sigma$ = 0.5, together with a merging threshold of two nodes and a minimum segment size; larger values of the latter two would mean larger (and hence fewer) segments. In Table 2, we compare the segmentation methods based on MABO, Recall, number of super-voxels and computation time. Segmentation of iMotion maps leads to better results in all respects, with higher MABO and Recall, fewer initial super-voxels and lower computation time. However, initial super-voxels from video segmentation are also important, as we will see in the next experiment.
[Tables 3 and 4: super-voxel grouping results for video and iMotion segmentation, each split into single grouping functions and multiple grouping functions.]
5.2 Super-voxel grouping
We evaluate super-voxel groupings in Table 3 and Table 4 for video and iMotion segmentations, respectively. Nine grouping functions are considered that use one or more of the five similarity measures defined in Section 3.3: Motion, Color, Texture, Size and Fill. Five of these use only one similarity measure, while the other four use multiple similarities. Here, All-but-motion is Color+Texture+Size+Fill and All is Motion+Color+Texture+Size+Fill; the rest are self-explanatory. We first evaluate these nine grouping functions in both tables. In Table 3, the best performing groupings are the ones that involve the Motion similarity measure: Motion, Motion+Size+Fill and All. Motion needs only 299 proposals per video to achieve a MABO of 56.2% and a Recall of 64.3%. Note that this is much lower than the number of initial super-voxels (862) produced by the graph-based video segmentation. This is because the Motion similarity brings most of the motion content into fewer super-voxels, and the majority of the resulting super-voxels are too small or have zero motion, and hence are discarded.
After trying several combinations on the training set of UCF Sports, we select the five best grouping functions: Motion, Fill, Motion+Size+Fill, All-but-motion and All. Grouping the super-voxels from the five selected functions into a Union set significantly increases the MABO and Recall to 62.0% and 74.7%, respectively. Considering that a common localization score threshold ($\sigma$) used in the literature is 0.2 (Lan et al., 2011; Tian et al., 2013), these MABO values and the Recall at our stricter threshold are very promising. The set of Tubelets thus obtained, with input video segmentation and the Union set of grouping functions, is used in what follows as the Tubelet set from video segmentation.
Super-voxel groupings with segmentation of iMotion maps are evaluated in Table 4. Here, the grouping functions containing the Motion similarity measure again prove to be the most successful, though not as much as in the case of video segmentation. This is because segmenting iMotion maps already utilizes motion information to some extent. Fill also leads to good MABO and Recall with just 155 proposals. The Union set achieves a good MABO of 56.8% and a Recall of 77.0%, which even outperforms the Recall obtained with video segmentation by 2.3%. Although the best MABO with segmentation of iMotion maps is lower than that for video segmentation, the number of proposals required is only 624 on average, which is lower than the 3,254 proposals from video segmentation. This is a considerable reduction, which is particularly useful for long videos where the number of proposals can be high. Moreover, segmenting iMotion maps is faster, which is again of interest when operating on longer videos. This set of Tubelets, obtained by segmenting iMotion maps with the Union set of grouping functions, is used in what follows as the Tubelet set from iMotion segmentation.
After analyzing segmentations from input video and iMotion maps separately, we now combine the Tubelets from both into a single proposal set. As reported in Table 5, the MABO increases up to 69.5% and the Recall reaches 93.6%. This is an improvement of 7% in MABO and 16% in Recall over the individual best of video and iMotion segmentations. The experiments up to this point are conducted on the training set of UCF Sports. They validate the selected set of grouping functions and show that the two Tubelet sets complement each other for localizing actions. We fix this setting for the experiments to follow.
5.3 Pruning and spatiotemporal refinement
In this section, we evaluate the impact of pruning and spatiotemporal refinement on the quality of action proposals of UCF Sports, MSR-II and UCF101. The validation for grouping functions and segmentation is already done on the training set of UCF Sports. Now, we report results when considering all the videos of these three datasets, to be comparable with the numbers reported by other methods. Before moving to results, we provide the implementation details of pruning and spatiotemporal refinement.
For motion pruning, we set its parameter so that at least fifty proposals are retained from each video. Also, motion pruning is applied only to the Tubelets from video segmentation, since proposals from iMotion segmentation are expected to have enough motion content. Overlap pruning is similar to non-maximum suppression, but it is applied without classification scores and can therefore affect the recall. To minimize its impact on Recall, we set a high overlap threshold for overlap-based pruning. For spatial refinement, we set its parameter as a fraction of the frame width.
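Overlap pruning, described above as non-maximum suppression without classification scores, can be sketched as a greedy pass over the proposals. The visiting order, the function names and the threshold value in the usage note are assumptions; `interval_iou` is only a stand-in for the localization score so that the example stays short.

```python
def overlap_prune(proposals, overlap, threshold):
    """Keep a proposal only if it overlaps every already kept one by at most threshold.

    No classification scores are used, so the visiting order alone decides
    which member of a group of near-duplicates survives.
    """
    kept = []
    for p in proposals:
        if all(overlap(p, q) <= threshold for q in kept):
            kept.append(p)
    return kept


def interval_iou(a, b):
    """Temporal IoU of two frame intervals (start, end); stand-in overlap measure."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

With a high threshold such as 0.9, `overlap_prune([(0, 100), (2, 100), (50, 150), (0, 99)], interval_iou, 0.9)` removes the two near-duplicates of the first interval and keeps `(0, 100)` and `(50, 150)`.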
|Tubelets||MABO||Recall||# Proposals|
|+ Motion pruning||69.3||93.5||884|
|+ Overlap pruning||67.5||90.5||289|
|+ Spatial refinement||67.5||91.9||289|
In Table 6, we evaluate the impact of pruning and spatial refinement on MABO, Recall and the average number of proposals per video for the UCF Sports dataset. The results for the initial set on all 150 videos of UCF Sports are similar to those on its train set. With motion pruning there is no loss of MABO and Recall, while only a fraction of the original proposals is used. With overlap pruning, the number of proposals goes down further, with a small loss in MABO and Recall. Finally, spatial refinement of Tubelets brings a small improvement in Recall. Altogether, with pruning and spatial refinement we are able to decrease the number of action proposals by a factor of 12 with only a modest loss in MABO and Recall.
|Tubelets||MABO||Recall||# Proposals|
|+ Motion pruning||36.7||5.1||6,560|
|+ Temporal refinement||46.0||35.2||7,287|
|+ Spatial refinement||48.9||47.4||7,287|
The MSR-II dataset has untrimmed videos with multiple instances of different types of actions in the same video. This poses additional challenges for temporal localization, which is experimentally illustrated in Table 7. The table reports MABO and Recall for the Tubelet set after motion pruning, for spatiotemporal localization and also for spatial-only localization. The overlap score for the spatiotemporal case is computed according to Equation 6, as done in all other results. For spatial localization, we compute the overlap only for the frames where the ground truth is present, i.e., we do not penalize the overlap score for temporal misalignment. MABO doubles and the Recall shoots from 2.2% to 81.3% for spatial-only localization, which means that our Tubelets locate the actions very well spatially but extend to frames where there is no action of interest. This is due to the tendency of super-voxels to continue to cover the actor even when the action is completed. We overcome this limitation by temporal refinement.
In Table 8, in addition to pruning and spatial refinement, we also report the effect of temporal refinement, which improves temporal localization. First, motion pruning maintains the MABO and Recall while reducing the number of proposals to only a quarter of the initial number. This pruning needs to precede temporal refinement to limit the number of proposals. Second, temporal refinement leads to a massive improvement of 30.1% in Recall and 9.3% in MABO. Note that temporal refinement also includes overlap pruning to filter out newly added, very similar proposals. Also, to limit the number of proposals temporal refinement is exclusively applied to ‘ + Motion pruning’, which means only overlap pruning is applied to ‘ + Motion pruning’. Finally, with spatial refinement another large improvement of 12.2% is achieved in Recall, along with a 2.9% improvement in MABO.
Overall, we achieve an improvement of 12% in MABO and 42.3% in Recall while decreasing the number of proposals by about 72% compared to the initial set. The gain due to temporal refinement is easy to understand for this dataset of untrimmed videos. However, we also get an impressive boost from spatial refinement, much larger than for the other two datasets. We attribute this to the exploitation of information from motion trajectories, which is paramount for MSR-II, as noted before by van Gemert et al. (2015) and Chen and Corso (2015).
In Table 9, we report the impact of pruning and spatial refinement on MABO, Recall and the average number of proposals per video for the UCF101 dataset. Motion pruning also works well on the 3,204 videos of UCF101, compressing the number of proposals by a factor of four while maintaining MABO and Recall. With overlap pruning, the number of proposals goes down further, with a small loss in MABO and Recall. After spatial refinement, the final set of Tubelets achieves the same performance as the initial set, but with about 10 times fewer proposals.
|Tubelets||MABO||Recall||# Proposals|
|+ Motion pruning||41.7||32.5||1,298|
|+ Overlap pruning||40.9||30.6||472|
|+ Spatial refinement||42.3||32.8||472|
5.4 Comparison with state-of-the-art methods
In Table 10, we compare our Tubelets with alternative unsupervised action proposals from the literature. With a relatively small set of 289 proposals we outperform all the other approaches on UCF Sports. On MSR-II, we outperform the previous best approach of van Gemert et al. (2015). It is interesting to note the improvement in MABO and Recall over the initial version of our approach (Jain et al., 2014), indicating the value of spatiotemporal refinement and pruning. On UCF101, we achieve MABO and Recall comparable to the method of van Gemert et al. (2015), while needing about five times fewer proposals. Overall, Tubelets provide state-of-the-art quality while balancing the number of proposals. Next we evaluate the action localization abilities of Tubelets when combined with modern representations.
|Method||MABO||Recall||# Proposals|
|UCF Sports|
|Jain et al. (2014)||62.7||78.7||1,642|
|Oneata et al. (2014a)||55.6||68.1||3,000|
|van Gemert et al. (2015)||64.2||89.4||1,449|
|MSR-II|
|Jain et al. (2014)||34.8||3.0||4,218|
|van Gemert et al. (2015)||47.9||44.3||6,706|
|UCF101|
|van Gemert et al. (2015)||40.0||35.5||2,299|
6 Experiments: Action localization
In this section we evaluate our approach for action localization on UCF Sports, MSR-II and UCF101. For positive training examples, we use the ground truth and those Tubelets whose localization score with the ground truth exceeds a fixed threshold. Negative samples are randomly selected from Tubelets whose overlap with the ground truth is below a second threshold. This scheme is followed for UCF Sports and UCF101. For MSR-II, cross-dataset evaluation is employed: the training samples consist of clips from the KTH dataset, while testing is performed on the Tubelets from the videos of MSR-II. We apply power normalization followed by L2 normalization before training with a linear SVM. One round of retraining on “hard negatives” was enough, as additional rounds did not improve performance further. Again, there is no retraining in the case of MSR-II; only the initial classifiers trained on videos from the KTH dataset are used.
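The normalization applied before SVM training can be sketched as follows. The sketch assumes the common signed power normalization with exponent 0.5 (a signed square root) followed by L2 normalization, as proposed for Fisher vectors by Perronnin et al. (2010); the exponent is not stated in the text and the function name is illustrative.

```python
import math


def power_l2_normalize(vector, alpha=0.5):
    """Signed power normalization followed by L2 normalization.

    alpha = 0.5 is an assumed, commonly used exponent.
    """
    powered = [math.copysign(abs(x) ** alpha, x) for x in vector]
    norm = math.sqrt(sum(x * x for x in powered))
    if norm == 0:
        return powered  # all-zero vectors are returned unchanged
    return [x / norm for x in powered]
```

The power step dampens large, bursty components, and the L2 step puts all Tubelet descriptors on the unit sphere so that a linear SVM compares them on the same scale.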
We first give details of the representations used to encode each Tubelet and show their impact on the UCF Sports dataset. Then, we compare our action localization results with the state-of-the-art methods on each of the three datasets.
6.1 Tubelet representations
We capture motion information by the four local descriptors computed along the improved trajectories (Wang and Schmid, 2013). To represent the local descriptors, we use bag-of-words or Fisher vectors. A Tubelet is assigned the trajectories that have more than half of their points inside the Tubelet. For the third representation, we use features from a convolutional neural network layer and average pool them over the frames. Below we explain these three representations.
Bag of words (BoW). The local descriptors are vector quantized and pooled into a bag-of-words histogram with a fixed vocabulary size. This is the least expensive (and least expressive) of the three representations.
Fisher vectors (FV). We first apply PCA on the local descriptors and reduce the dimensionality by a factor of two. Then 256,000 descriptors are selected at random from a training set to estimate a Gaussian Mixture Model with $K$ (= 128) Gaussians. Each video is then represented by a $2DK$-dimensional Fisher vector, where $D$ is the dimension of the descriptor after PCA. Finally, we apply power and L2 normalization to the Fisher vector, as suggested by Perronnin et al. (2010). The feature computation is reasonably efficient, but the memory requirement would be a bottleneck if the number of proposals is high. Fisher vectors have been used for temporal action localization by Oneata et al. (2014b) and for spatiotemporal action localization by van Gemert et al. (2015).
Convolutional neural network (CNN). We use an in-house implementation of GoogLeNet (Szegedy et al., 2015), trained on ImageNet over 15k object categories (Jain et al., 2015b) without fine-tuning. The features are extracted from the fully-connected layer (before softmax2) of the network, which yields one feature vector per bounding box in a frame. Since a Tubelet is a sequence of bounding boxes, its final representation is obtained by averaging the feature vectors for the sampled frames (2 frames per second). Here, the memory requirement is limited and feature computation is the costly operation, motivating the need for a compact set of action proposals.
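The pooling of per-frame CNN features into one Tubelet descriptor can be sketched as below. Feature extraction itself is left out; the helper names, the use of the video frame rate to pick two frames per second, and the representation of features as plain lists are assumptions made for illustration.

```python
def sampled_frames(num_frames, fps, samples_per_second=2):
    """Indices of the frames to describe, at roughly samples_per_second."""
    step = max(1, round(fps / samples_per_second))
    return list(range(0, num_frames, step))


def tubelet_descriptor(frame_features):
    """Average-pool per-frame feature vectors (equal-length lists) into one vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]
```

For a 60-frame Tubelet at 30 frames per second, `sampled_frames(60, 30)` selects frames 0, 15, 30 and 45, so only four boxes have to be passed through the network before averaging.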
We now analyze the impact of the above three Tubelet representations on the UCF Sports dataset, following the process described in Section 4.2. Following common practice for this dataset, we use the area under the ROC curve (AUC) as the evaluation measure. Figure 8 compares the performance of the various Tubelet representations for a varying overlap threshold. We observe a clear improvement when moving from BoW to FV, to CNN and eventually to the combination of FV and CNN, especially for higher thresholds.
6.2 Comparison with state-of-the-art methods
We now compare our approach with state-of-the-art methods on the three datasets.
UCF Sports. In Figure 9, we compare the performance of our method with the best reported results in the literature. In Jain et al. (2015b), the previous version of Tubelets was represented with FV and CNN features; hence, for comparison, we use Tubelets represented with FV+CNN. The boost over Jain et al. (2015b), which relies on segmentation of video frames only, shows the importance of segmenting iMotion maps as well. Tubelets represented with FV+CNN are competitive with the methods of Gkioxari and Malik (2015) and Weinzaepfel et al. (2015) and outperform all other approaches. Since van Gemert et al. (2015) use only the FV representation, for a fair comparison we also include Tubelets with an FV representation, which does better for most of the thresholds. Figure 11 shows some examples of action localizations from UCF Sports.
MSR-II. This dataset is designed for cross-dataset evaluation. Following standard practice, we train on the KTH dataset and test on MSR-II. While training for one class, the videos from the other classes are used as the negative set. We use the FV representation to be more comparable with the competitive work of van Gemert et al. (2015), which also generates action proposals in an unsupervised manner, like Tubelets. In Table 11, we compare with several state-of-the-art methods; mean average precision (mAP) along with the APs for the three classes are reported. We report results for the overlap threshold that is customary on this dataset. Apart from Chen and Corso (2015), our approach outperforms all other methods in mAP. Chen and Corso (2015) utilize information from motion trajectories very well and sample action proposals by clustering over a space-time trajectory graph. Motion trajectory based approaches are particularly well suited to the MSR-II dataset, as observed with our spatiotemporal refinement of Tubelets and also in van Gemert et al. (2015). Similarly, the approach of Chen and Corso (2015), which is mainly focused on trajectories, leads to excellent performance on MSR-II, but its performance on UCF Sports is modest (Figure 9). Finally, compared to the Tubelets in Jain et al. (2014), we improve mAP by 24.5%. Again, we attribute this to using both input video frames and iMotion maps for segmentation and to the spatiotemporal refinement of Tubelets. Figure 12 shows some examples of localizations for MSR-II.
|Method||Boxing||Hand clapping||Hand waving||mAP|
|Cao et al. (2010)||17.5||13.2||26.7||19.1|
|Tian et al. (2013)||38.9||23.9||44.7||35.8|
|Jain et al. (2014)||46.0||31.4||85.8||54.4|
|Yuan et al. (2011)||64.9||43.1||64.9||55.3|
|Wang et al. (2014)||41.7||50.2||80.9||57.6|
|Yu and Yuan (2015)||67.4||46.6||69.9||61.3|
|Mosabbeb et al. (2014)||72.4||56.9||81.1||70.1|
|van Gemert et al. (2015)||67.0||78.4||74.1||73.2|
|Chen and Corso (2015)||94.4||73.0||87.7||85.0|
UCF101. UCF101 is much larger than the other two datasets, with 24 action classes, and is currently the most challenging dataset for classification of proposals. Again, we represent Tubelets with FV, following van Gemert et al. (2015). In Figure 10, we report mAPs for different overlap thresholds and compare Tubelets with three other approaches that report results on the UCF101 dataset. Despite its use of human detection, the approach by Yu and Yuan (2015) falls behind our method. Weinzaepfel et al. (2015) use bounding-box level action class supervision while generating proposals. Despite their additional supervision and use of two-stream CNN features, we achieve a better mAP for 3 out of 4 overlap thresholds. The only other approach that uses proposals generated in an unsupervised manner, as we do, is APT by van Gemert et al. (2015). Tubelets outperform their approach while requiring only about a fifth of the proposals (see Table 10).
Figure 13 displays some examples of action localizations from UCF101. With 24 classes, this dataset offers a larger variety in types of actions. Poor localization (shown in red) mainly happens in the case of multiple actors, when one of the actors gets occluded during the action (see ‘Salsa Spin’). In that case, a Tubelet often encapsulates both actors together. However, the varying aspect ratios, diverse locations in the video frames, speed of action and multiple actors are well captured by our action proposal method.
We presented an unsupervised approach to generate proposals from super-voxels for action localization in videos. This is done by iterative grouping of super-voxels driven by both static features and motion features, motion being the key ingredient. We introduced independent motion evidence to characterize how the action-related motion deviates from the background. The generated iMotion maps provide a more efficient alternative for segmentation. Moreover, iMotion-based features allow for effective and efficient grouping of super-voxels. Our action proposals, Tubelets, are action class independent and implicitly cover variable aspect ratios and temporal lengths. We showed, for the first time, the effectiveness of Tubelets for action localization in Jain et al. (2014). In this paper, iMotion maps are presented with further insights, and segmenting iMotion maps is shown to be complementary to segmenting input video frames. Additionally, we introduced spatiotemporal refinement and pruning of Tubelets. Spatiotemporal refinement overcomes the tendency of super-voxels to sometimes follow the actor even after the action is completed. This led to improved MABO and Recall scores, especially on the untrimmed videos of MSR-II (Table 8), while pruning kept the number of Tubelets limited. The impact of these and the other components of Tubelet generation is extensively evaluated in our experiments.
We evaluated our method for both action proposal quality and action localization. For action proposal quality, Tubelets beat all other existing approaches on the three datasets with far fewer proposals (Table 10). For action localization, our method leads to the best performance on UCF101 and the second best on UCF Sports and MSR-II. The method of Chen and Corso (2015) obtains the best mAP for MSR-II, but its performance on UCF Sports is rather modest. Similarly, Weinzaepfel et al. (2015) do well on UCF Sports and UCF101, but their method, being supervised in generating proposals, is not easy to apply to MSR-II. Ours is the only method that delivers excellent performance on both the trimmed videos of UCF Sports and UCF101 and the untrimmed videos of MSR-II.
- Brox and Malik  Thomas Brox and Jitendra Malik. Object segmentation by long term analysis of point trajectories. In Proceedings of the European Conference on Computer Vision, Sep. 2010.
- Cao et al.  Liangliang Cao, Zicheng Liu, and Thomas S. Huang. Cross-dataset action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2010.
- Chen and Corso  Wei Chen and Jason J. Corso. Action detection by implicit intentional motion clustering. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
- Chen et al.  Wei Chen, Caiming Xiong, Ran Xu, and Jason Corso. Actionness ranking with lattice conditional ordinal random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 748–755, 2014.
- Cheron et al.  Guilhem Cheron, Ivan Laptev, and Cordelia Schmid. P-cnn: Pose-based cnn features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, December 2015.
- Dalal and Triggs  Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2005.
- Delaitre et al.  Vincent Delaitre, Ivan Laptev, and Josef Sivic. Recognizing human actions in still images: a study of bag-of-features and part-based representations. In BMVC, 2010.
- Derpanis et al.  Konstantinos G Derpanis, Mikhail Sizintsev, Kevin J Cannons, and Richard P Wildes. Action spotting and recognition based on a spatiotemporal orientation analysis. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(3):527–540, 2013.
- Dollar et al.  Piotr Dollar, Vincent Rabaud, Garrison Cottrell, and Serge Belongie. Behavior recognition via sparse spatio-temporal features. In Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Oct. 2005.
- Everts et al.  Ivo Everts, Jan C. van Gemert, and Theo Gevers. Evaluation of color spatio-temporal interest points for human action recognition. IEEE Transactions on Image Processing, 23(4):1569–1580, 2014.
- Felzenszwalb and Huttenlocher  Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181, 2004.
- Felzenszwalb et al.  Pedro F. Felzenszwalb, Ross B. Girshick, David A. McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
- Girshick et al.  Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Region-based convolutional networks for accurate object detection and semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):142–158, 2016.
- Gkioxari and Malik  Georgi Gkioxari and Jitendra Malik. Finding action tubes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
- Hosang et al.  Jan Hosang, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele. What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4):814–830, 2016.
- Huber  Peter J. Huber. Robust statistics. Wiley, New York, 1981.
- Jain et al.  Mihir Jain, Hervé Jégou, and Patrick Bouthemy. Better exploiting motion for better action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2013.
- Jain et al.  Mihir Jain, Jan C. van Gemert, Hervé Jégou, Patrick Bouthemy, and Cees G. M. Snoek. Action localization by tubelets from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2014.
- Jain et al. [2015a] Mihir Jain, Jan C van Gemert, Thomas Mensink, and Cees GM Snoek. Objects2action: Classifying and localizing actions without any video example. In Proceedings of the IEEE International Conference on Computer Vision, pages 4588–4596, 2015a.
- Jain et al. [2015b] Mihir Jain, Jan C. van Gemert, and Cees G. M. Snoek. What do 15,000 object categories tell us about classifying and localizing actions? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2015b.
- Jain et al.  Mihir Jain, Hervé Jégou, and Patrick Bouthemy. Improved motion description for action classification. Frontiers in ICT, 2:28, 2016.
- Jégou et al.  Hervé Jégou, Florent Perronnin, Matthijs Douze, Jorge Sánchez, Patrick Pérez, and Cordelia Schmid. Aggregating local descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716, 2012.
- Jhuang et al.  Hueihan Jhuang, Juergen Gall, Silvia Zuffi, Cordelia Schmid, and Michael J. Black. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, December 2013.
- Ke et al.  Yan Ke, Rahul Sukthankar, and Martial Hebert. Efficient visual event detection using volumetric features. In Proceedings of the IEEE International Conference on Computer Vision, Oct. 2005.
- Kim and Pavlovic  Minyoung Kim and Vladimir Pavlovic. Structured output ordinal regression for dynamic facial emotion intensity prediction. In European Conference on Computer Vision, pages 649–662. Springer, 2010.
- Kläser et al.  Alexander Kläser, Marcin Marszalek, and Cordelia Schmid. A spatio-temporal descriptor based on 3d-gradients. In Proceedings of the British Machine Vision Conference, Sep. 2008.
- Kläser et al.  Alexander Kläser, Marcin Marszałek, Cordelia Schmid, and Andrew Zisserman. Human focused action localization in video. In Trends and Topics in Computer Vision, pages 219–233, 2012.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
- Lampert et al.  Christoph H. Lampert, Matthew B. Blaschko, and Thomas Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2008.
- Lan et al.  Tian Lan, Yang Wang, and Greg Mori. Discriminative figure-centric models for joint action localization and recognition. In Proceedings of the IEEE International Conference on Computer Vision, Nov. 2011.
- Laptev  Ivan Laptev. On space-time interest points. International Journal of Computer Vision, 64(2):107–123, 2005.
- Ma et al.  Shugao Ma, Jianming Zhang, Nazli Ikizler-Cinbis, and Stan Sclaroff. Action recognition and localization by hierarchical space-time segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2744–2751, 2013.
- Maji et al.  Subhransu Maji, Lubomir Bourdev, and Jitendra Malik. Action recognition from a distributed representation of pose and appearance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3177–3184, 2011.
- Manen et al.  S. Manen, M. Guillaumin, and L. Van Gool. Prime Object Proposals with Randomized Prim’s Algorithm. In Proceedings of the IEEE International Conference on Computer Vision, 2013.
- Mosabbeb et al.  Ehsan Adeli Mosabbeb, Ricardo Cabral, Fernando De la Torre, and Mahmood Fathy. Multi-label discriminative weakly-supervised human activity recognition and localization. In Proceedings of the Asian Conference on Computer Vision, 2014.
- Ng et al.  Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.
- Odobez and Bouthemy  Jean-Marc Odobez and Patrick Bouthemy. Robust multiresolution estimation of parametric motion models. Journal of Visual Communication and Image Representation, 6(4):348–365, Dec. 1995.
- Oneata et al.  Dan Oneata, Jakob Verbeek, and Cordelia Schmid. Action and Event Recognition with Fisher Vectors on a Compact Feature Set. In Proceedings of the IEEE International Conference on Computer Vision, Dec. 2013.
- Oneata et al. [2014a] Dan Oneata, Jerome Revaud, Jakob Verbeek, and Cordelia Schmid. Spatio-temporal object detection proposals. In Proceedings of the European Conference on Computer Vision, 2014a.
- Oneata et al. [2014b] Dan Oneata, Jakob Verbeek, and Cordelia Schmid. Efficient Action Localization with Approximately Normalized Fisher Vectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014b.
- Perronnin and Dance  Florent Perronnin and Christopher R. Dance. Fisher kernels on visual vocabularies for image categorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
- Perronnin et al.  Florent Perronnin, Jorge Sánchez, and Thomas Mensink. Improving the fisher kernel for large-scale image classification. In Proceedings of the European Conference on Computer Vision, Sep. 2010.
- Piriou et al.  Gwenaëlle Piriou, Patrick Bouthemy, and Jian-Feng Yao. Recognition of dynamic video contents with global probabilistic models of visual motion. IEEE Transactions on Image Processing, 15(11):3417–3430, 2006.
- Puscas et al.  Mihai Puscas, Enver Sangineto, Dubravko Culibrk, and Nicu Sebe. Unsupervised tube extraction using transductive learning and dense trajectories. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
- Raptis et al.  Michalis Raptis, Iasonas Kokkinos, and Stefano Soatto. Discovering discriminative action parts from mid-level video representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2012.
- Rodriguez et al.  Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah. Action mach: a spatio-temporal maximum average correlation height filter for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2008.
- Sánchez et al.  Jorge Sánchez, Florent Perronnin, Thomas Mensink, and Jakob Verbeek. Image classification with the fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, 2013.
- Schüldt et al.  Christian Schüldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: A local SVM approach. In Proceedings of the International Conference on Pattern Recognition, 2004.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 2014.
- Soomro et al.  Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, 2012. URL http://arxiv.org/abs/1212.0402.
- Soomro et al.  Khurram Soomro, Haroon Idrees, and Mubarak Shah. Action localization in videos through context walk. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
- Szegedy et al.  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
- Tian et al.  Yicong Tian, Rahul Sukthankar, and Mubarak Shah. Spatiotemporal deformable part models for action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2013.
- Tran and Yuan  Du Tran and Junsong Yuan. Optimal spatio-temporal path discovery for video event detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2011.
- Tran and Yuan  Du Tran and Junsong Yuan. Max-margin structured output regression for spatio-temporal action localization. In Advances in Neural Information Processing Systems, Dec. 2012.
- Tran et al.  Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
- Uijlings et al.  Jasper R. R. Uijlings, Koen E. A. van de Sande, Theo Gevers, and Arnold W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
- van de Sande et al.  Koen E. A. van de Sande, Cees G. M. Snoek, and Arnold W. M. Smeulders. Fisher and VLAD with FLAIR. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- van Gemert et al.  Jan C. van Gemert, Mihir Jain, Ella Gati, and Cees G. M. Snoek. APT: Action localization proposals from dense trajectories. In Proceedings of the British Machine Vision Conference, 2015.
- Viola and Jones  Paul A. Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
- Wang and Schmid  Heng Wang and Cordelia Schmid. Action Recognition with Improved Trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Dec. 2013.
- Wang et al.  Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Action recognition by dense trajectories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2011.
- Wang et al.  Heng Wang, Alexander Kläser, Cordelia Schmid, and Cheng-Lin Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1):60–79, 2013.
- Wang et al. [2015a] Heng Wang, Dan Oneata, Jakob Verbeek, and Cordelia Schmid. A robust and efficient video representation for action recognition. International Journal of Computer Vision, pages 1–20, 2015a.
- Wang et al.  Limin Wang, Yu Qiao, and Xiaoou Tang. Video action detection with relational dynamic-poselets. In Proceedings of the European Conference on Computer Vision, 2014.
- Wang et al. [2015b] Limin Wang, Yu Qiao, and Xiaoou Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4305–4314, 2015b.
- Wang and Mori  Yang Wang and Greg Mori. Hidden part models for human action recognition: Probabilistic versus max margin. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7):1310–1323, 2011.
- Weinzaepfel et al.  Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Learning to track for spatio-temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
- Xu and Corso  Chenliang Xu and Jason J. Corso. Evaluation of super-voxel methods for early video processing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
- Yu and Yuan  Gang Yu and Junsong Yuan. Fast action proposals for human action detection and search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
- Yuan et al.  Junsong Yuan, Zicheng Liu, and Ying Wu. Discriminative subvolume search for efficient action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2009.
- Yuan et al.  Junsong Yuan, Zicheng Liu, and Ying Wu. Discriminative video pattern search for efficient action detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9):1728–1743, 2011.