Spatio-temporal Action Recognition: A Survey

by Amlaan Bhoi, et al.
University of Illinois at Chicago

The task of action recognition or action detection involves analyzing videos and determining what action or motion is being performed. The subjects of these videos are predominantly humans performing some action, though this requirement can be relaxed to generalize over other subjects such as animals or robots. The applications range from human-computer interaction to automated video editing proposals. When we consider spatiotemporal action recognition, we deal with action localization: this task involves determining not only what action is being performed but also when and where it is being performed in said video. This paper aims to survey the plethora of approaches and algorithms proposed to solve this task, give a comprehensive comparison between them, explore various datasets available for the problem, and determine the most promising approaches.





1 Introduction

Spatio-temporal action recognition, or action localization [1], is the task of classifying what action is being performed in a sequence of frames (or video) as well as localizing each detection both in space and time. The localization can be visualized using bounding boxes or masks. There has been growing interest in this task in recent years due to the increased availability of computing resources as well as new advances in convolutional neural network architectures.

There are several families of approaches to this task: discriminative parts [2, 3], figure-centric models [4, 5, 6], deformable parts [1], action proposals [7, 8, 9, 10], graph-based models [11, 12], 3D convolutional neural networks [13], and more. We examine each approach, list its advantages and disadvantages, and explore the techniques shared between many of them. We then explore the various datasets available for this task and whether they are a sufficient benchmark for the problem. Finally, we comment on which methods are promising going forward. The terms spatio-temporal action recognition and action localization will be used interchangeably in this text as they refer to the same task. Action localization should not be confused with the similarly framed problem of temporal action detection, which deals with determining only when an action occurs in a long video.

2 Problem Definition

We can broadly define the problem as: given a video V = (f_1, ..., f_T), where f_i is the i-th frame in the video, determine the action label a ∈ A, where A is the set of action labels in the dataset, for that frame, as well as the coordinates (x, y, w, h) of the bounding box of the classified action.

An alternative formulation given by Weinzaepfel et al. [10] defines action localization as: given a video of T frames and a class c ∈ C, where C is the set of classes, the task involves detecting if action c occurs in the video and, if yes, when and where. A successful algorithm should output (t_b, t_e, {R_t}) with t_b and t_e the beginning and end of the predicted temporal extent of the action and R_t the detected region in frame t.

Every paper contains a slightly different formulation of the problem depending on the approach taken. We shall briefly explore those definitions to see how this one task can be approached from different viewpoints (graphs, optical flow, etc.).

3 Challenges

Spatio-temporal action recognition faces the usual challenges of action recognition, such as tracking the action throughout the video, localizing the time frame when the action occurs, and more. However, there is an additional set of challenges, including but not limited to:

  • Background clutter or object occlusion in the video

  • Spatial complexity of the scene with respect to the number of candidate objects

  • Linking actions between frames in the presence of irregular camera motion

  • Predicting the optical flow of the action

However, there is a more fundamental problem to consider with the traditional approach to action localization. We cannot treat the problem as a linear pipeline that simply classifies an action: even object detection algorithms require region proposals to classify [14]. The problem is made worse by the introduction of the temporal dimension, which causes an exponential increase in the number of proposals and renders any such naive approach impractical.

4 Action Proposal Models

4.1 Action localization with tubelets from motion

Jain et al. [9] propose a method for spatio-temporal action recognition based on a selective-search sampling strategy for videos. Their approach uses super-voxels instead of super-pixels, directly obtaining sequences of bounding boxes, which are called tubelets. This removes the issue of linking bounding boxes between frames in a video. In addition, their method explicitly incorporates motion information by introducing independent motion evidence as a feature to characterize how the action's motion deviates from background motion.

The pipeline of this method starts with super-voxel segmentation through a graph-based method [15], followed by iterative generation of additional tubelets, descriptor generation (a bag-of-words (BOW) representation), and finally classification using BOW histograms of tubelets.

Super-voxel generation. Initial super-voxels are agglomeratively merged based on similarity measures. We can imagine the merging as a tree, with the individual super-voxels being the leaves that are merged all the way up to the root. This procedure produces additional super-voxels.

Tubelets. Each super-voxel is tightly bounded by a rectangular bounding box in every frame where it appears. The sequence of these bounding boxes produces what is known as a tubelet. The algorithm thus produces one tubelet per super-voxel.
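To make the tubelet construction concrete, here is a minimal sketch (with hypothetical helper names, not taken from the paper) that turns per-frame supervoxel pixel sets into a sequence of tight bounding boxes:

```python
def tight_bbox(points):
    """Tight axis-aligned bounding box (x_min, y_min, x_max, y_max) over pixel coords."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def tubelet_from_supervoxel(voxel_pixels_by_frame):
    """A tubelet is the per-frame sequence of tight boxes around one supervoxel.

    voxel_pixels_by_frame: dict mapping frame index -> list of (x, y) pixels
    belonging to the supervoxel in that frame.
    """
    return {t: tight_bbox(pts) for t, pts in sorted(voxel_pixels_by_frame.items())}

# A supervoxel drifting right over three frames yields a moving box sequence.
sv = {0: [(2, 3), (4, 5)], 1: [(3, 3), (6, 6)], 2: [(5, 4), (8, 7)]}
tube = tubelet_from_supervoxel(sv)
```

Because the boxes come directly from the supervoxel's spatial extent in each frame, no separate box-linking step is required, which is exactly the advantage the authors claim.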

4.1.1 Merging

Merging of super-voxels is based on five criteria that are sectioned into two parts. The first part is color, texture, motion:


where h_u is the normalized histogram for super-voxel u (similarly for v) and n is the number of super-voxels in the video. The second part is size, fill:


where size(video) is the size of the video in pixels. The merging strategies can vary with any combination of these criteria. An example of the merge operations is illustrated in Figure 1.
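As a rough illustration of such merging criteria, the color/texture/motion part can be computed as a histogram intersection and the size part as a penalty discouraging large merges. This is a simplified sketch, not the paper's exact definitions:

```python
def hist_intersection(h_u, h_v):
    """Similarity of two L1-normalized histograms: 1.0 means identical."""
    return sum(min(a, b) for a, b in zip(h_u, h_v))

def size_similarity(size_u, size_v, video_size):
    """Size criterion: close to 1 for small super-voxels, so they merge first."""
    return 1.0 - (size_u + size_v) / video_size

s_color = hist_intersection([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])  # overlapping mass
s_size = size_similarity(10, 10, 100)
```

A merging strategy is then any weighted combination of such terms; varying the combination yields the different tubelet sets the authors describe.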

Figure 1: An example of Running action class where the first two images depict a video frame and initial super-voxel segmentation. The other four images represent segmentation after various merge operations.

4.1.2 Motion features

The authors defined independent motion evidence (IME) as:


where the ratio is normalized to the range [0, 1]. More details are available in their paper.

4.1.3 Results

The authors evaluated their model with ROC-curve comparisons against other methods; the graphs can be found in their paper. They also recorded average precision on the MSR-II dataset [16], with results shown in Table 1.

Method Boxing Handclapping Handwaving
Cao et al. 17.5 13.2 26.7
SDPM 38.9 23.9 44.7
Tubelets 46.0 31.4 85.8
Table 1: Results for Jain et al. [9]. Average precisions for MSR-II.

4.2 Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos

Hou et al. [17] introduce a new architecture called tube convolutional neural networks, or T-CNN, which is a generalization of R-CNN [18] from 2D to 3D. Their approach first divides the videos into clips of 8 frames each. This allows them to use a fixed-size ConvNet architecture to process clips while mitigating GPU memory cost. As an input video is processed clip by clip, action tube proposals with various spatial and temporal sizes are generated; these then need to be linked into tube proposal sequences.

The authors introduce a new layer called the Tube-of-Interest (ToI) pooling layer. This is a 3D generalization of the Region-of-Interest (RoI) pooling layer of R-CNN. ToI layers produce fixed-length feature vectors, which solves the problem of variable-length inputs. More details about ToI layers can be found in their paper.

Figure 2: Tube proposal network with all modules.

4.2.1 Tube Proposal Network

The TPN consists of 8 3D convolutional layers, 4 max-pooling layers, 2 ToI layers, 1 point-wise convolutional layer, and 2 fully-connected layers. The authors use a pre-trained C3D model and fine-tune it for each dataset used in their experiments. When generalizing R-CNN, instead of using the 9 hand-picked anchor boxes, they use K-means clustering on the training set to select 12 anchor boxes, which are better adapted to different datasets. Each bounding box is assigned an "actionness" probability determining whether the box corresponds to an action (a binary label). A positive bounding box proposal is determined by an Intersection-over-Union (IoU) overlap of more than 0.7.
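The IoU test used to label proposals as positive can be sketched as follows (hypothetical helper names; in the actual pipeline the boxes come from the network and clustering steps above):

```python
def iou_2d(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def is_positive(proposal, ground_truth, thresh=0.7):
    """A proposal is a positive training example when IoU exceeds the threshold."""
    return iou_2d(proposal, ground_truth) > thresh
```

For example, a proposal covering (0, 0, 10, 10) against a ground truth of (0, 0, 10, 8) has IoU 0.8 and is therefore positive at the 0.7 threshold.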

4.2.2 Linking Tube Proposals

The primary problem of linking tube proposals is that not every consecutive tube proposal may capture the entire action (think of occlusion or noise in middle clips). To solve this, the authors use two metrics when linking tube proposals: actionness and overlap scores. Each video proposal (a link of tube proposals) is assigned a score defined as:

S = (1/m) Σ_{i=1..m} Actionness_i + (1/(m−1)) Σ_{j=1..m−1} Overlap_{j,j+1}

where Actionness_i denotes the actionness score of the tube proposal from the i-th clip, Overlap_{j,j+1} measures the overlap between two linked proposals from clips j and (j + 1), and m is the total number of clips.
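Assuming the sequence score is the average actionness plus the average overlap of consecutive proposals, the scoring function can be sketched as:

```python
def proposal_sequence_score(actionness, overlaps):
    """Score of a linked tube proposal sequence.

    actionness: per-clip actionness scores [A_1, ..., A_m]
    overlaps:   overlap scores between consecutive linked proposals
                [O_{1,2}, ..., O_{m-1,m}]
    """
    m = len(actionness)
    assert len(overlaps) == m - 1, "need one overlap per consecutive pair"
    return sum(actionness) / m + sum(overlaps) / (m - 1)

# Two clips with perfect actionness but a mediocre link score 1.0 + 0.5 = 1.5.
score = proposal_sequence_score([1.0, 1.0], [0.5])
```

The highest-scoring sequences survive linking, which lets a weak (occluded or noisy) middle clip be carried by strong neighbors.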

4.2.3 Action Detection

The input to the action detection module is a set of linked tube proposal sequences of varying lengths. This is where the ToI layer comes into action. The output of the ToI layer is attached to two fully-connected layers and a dropout layer. The dimension of the last fully-connected layer is N + 1 (N action classes and 1 background class).

4.2.4 Results

Hou et al. [17] evaluated and verified their approach on three trimmed video datasets (UCF-Sports [20], J-HMDB [21], UCF-101 [22]) and one untrimmed video dataset, THUMOS'14 [23]. The results for the UCF-Sports dataset are outlined in Table 2.

There are many more action proposal approaches that we do not cover in detail. These include Finding Action Tubes by Gkioxari and Malik [7], Fast action proposals for human action detection and search by Yu and Yuan [8], Learning to track for spatio-temporal action localization by Weinzaepfel et al. [10], and Human Action Localization with Sparse Spatial Supervision by Weinzaepfel et al. [24].

Diving Golf Kicking Lifting Riding Run SkateB. Swing SwingB. Walk mAP
Weinzaepfel et al. [10] 60.71 77.55 65.26 100.0 99.53 52.60 47.14 88.88 62.86 64.44 71.9
Peng and Schmid [25] 96.12 80.47 73.78 99.17 97.56 82.37 57.43 83.64 98.54 75.99 84.51
Hou et al. [17] 84.38 90.79 86.48 99.77 100.00 83.65 68.72 65.75 99.62 87.79 86.7
Table 2: Results for Hou et al. [17]. mAP for each class of UCF-Sports. The IoU threshold for frame mAP is fixed to 0.5.

5 Figure-Centric Models

5.1 Discriminative figure-centric models for joint action localization and recognition

Lan et al. [5] approach spatio-temporal action recognition by combining a bag-of-words-style statistical representation with a figure-centric structural representation that works mainly like template matching. They treat the position of the human as a latent variable in a discriminative latent variable model and infer it while simultaneously recognizing the action. In addition, instead of simple bounding boxes, they learn discriminative cells within boxes for more robust detection. Due to the latent variable model, exact learning and inference are intractable, so efficient approximate learning and inference algorithms are developed.

Method Accuracy
global bag-of-words 63.1
local bag-of-words 65.6
spatial bag-of-words with 63.1
spatial bag-of-words with 68.5
latent model with 63.7
Lan et al. [5] 73.1
Table 3: Results for Lan et al. [5]. Mean per-class action recognition accuracies.

5.1.1 Figure-Centric Video Sequence Model

The model jointly learns the relationship between the action label and the location of the person performing the action in each frame. The standard bounding box is divided into cells, where each cell is either turned "on" or "off" depending on whether it contains the action.

Each video I has an associated label y. Suppose video I contains τ frames represented as (I_1, ..., I_τ), where I_t denotes the t-th frame of said video. Furthermore, the authors define the bounding boxes for each video as L = (l_1, ..., l_τ). The t-th bounding box l_t is a 4-dimensional vector representing the (x, y) coordinates, height, and width of the bounding box. The extracted feature vector is a concatenation of three vectors: the first two represent the appearance feature, defined as the k-means-quantized HOG3D descriptors [26], and the spatial locations of interest points in bounding box l_t; the third denotes the color histogram.

The authors use a scoring function inspired by the latent SVM model [27] which is defined as:


The definition of the unary potential function, pairwise potential function, and global action potential function can be found in the paper.

5.1.2 Learning

Given N training samples of videos, bounding boxes, and labels, the authors optimize over the model parameters. They adopt the SVM framework for learning as:


5.1.3 Inference

The inference is simply solving the following optimization problem:


For a fixed action label, we can maximize over the bounding box locations and latent variables z as:


5.1.4 Results

The authors evaluate their model on the UCF-Sports dataset [20]. They achieved a 73.1% mean per-class accuracy, beating previous methods. The complete results with mean per-class action recognition accuracies can be found in Table 3.

The other two closely related Figure-Centric approaches are Human focused action localization in video by Kläser et al. [4] and Explicit modeling of human-object interactions in realistic videos by Prest et al. [6].

6 Deformable Parts Models

6.1 Action recognition and localization by hierarchical space-time segments

Figure 3: Pipeline for hierarchical video frame segment extraction. [2]

Ma et al. [2] introduce a new representation called hierarchical space-time segments, where the space-time segments of videos are organized into a two-level hierarchy. The first level comprises the root space-time segments that may contain the whole human body. The second level comprises space-time segments that contain parts of the root. They present an unsupervised algorithm designed to extract space-time segments that preserve both static and non-static relevant segments as well as their hierarchical and temporal relationships. Their algorithm consists of three major steps:

  1. Apply hierarchical segmentation on each video frame to get a set of segment trees, each of which is considered as a candidate segment tree of the human body.

  2. Prune the candidates by exploring several cues such as shape, motion, articulated objects’ structure and global foreground color.

  3. Track each segment of the remaining segment trees in time both forward and backward.

Finally, using a simple linear SVM on the bag of hierarchical space-time segments representation, they achieved better or comparable performance to previous methods.

Figure 4: Example of action localization results on UCF-Sports [20] and High-Five [28] datasets. [2]

6.1.1 Video Frame Hierarchical Segmentation

On each video frame, the authors compute the boundary map as described by [29] using three color channels and five motion channels, including optical flow, unit-normalized optical flow, and optical flow magnitude. The boundary map is then utilized to compute an Ultrametric Contour Map (UCM) as described by [30]. By traversing the UCM, certain segments are removed to reduce redundancy. The authors then remove the root of each segment tree and obtain a set of candidate segment trees per frame. Each candidate tree is considered a candidate segment tree of a human body, consisting of a root segment and its part segments.

6.1.2 Pruning Candidate Segment Trees

The pruning step should only prune irrelevant static segments. The decision to prune a candidate segment is made with information from all segments of the tree and not just local information; thus, pruning is done at the candidate level. Tree pruning is performed using shape and color cues and using a foreground map. Detailed formulations are available in the original paper.

6.1.3 Extracting Hierarchical Space-Time Segments

After pruning, we are left with a set of remaining candidate segment trees. To capture temporal information, the authors track every segment of each remaining tree to construct a space-time segment, using a proposed non-rigid region tracking method. The method predicts the region in the next frame by optical flow and computes both a flow prediction map and a color prediction map over the bounding box of the predicted region:


The combined map is then scaled and quantized to contain integer values in the range [1, 20]. By setting thresholds at integer values from 1 to 20, we get 20 binary maps. The size of every connected component is computed, and the one with the size most similar to the current region is selected as the candidate. These space-time segments may contain the same objects. To exploit this dense representation, space-time segments are grouped together if they overlap over some threshold (the authors used 0.7). Finally, for each track (group of segments), bounding boxes are calculated on all spanned frames.
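The overlap-based grouping of space-time segments into tracks can be sketched as below; `interval_iou` is a simplified temporal stand-in for the paper's segment-overlap measure, and the greedy grouping is a hypothetical simplification:

```python
def interval_iou(a, b):
    """Temporal IoU of two (start, end) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def group_tracks(segments, overlap_fn, thresh=0.7):
    """Greedily place each segment into the first group it overlaps sufficiently."""
    groups = []
    for seg in segments:
        for g in groups:
            if any(overlap_fn(seg, other) > thresh for other in g):
                g.append(seg)
                break
        else:
            groups.append([seg])
    return groups

# Two heavily overlapping segments form one track; the distant one stays alone.
tracks = group_tracks([(0, 10), (1, 10), (50, 60)], interval_iou)
```

Each resulting group corresponds to a track, and the per-frame bounding boxes of its members give the localization output.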

6.1.4 Action Recognition and Localization

For action recognition, a one-vs-all linear SVM is trained on all training videos’ BoW representations for multiclass classification resulting in the following rule:


where h_r and h_p are the BoW representations of the root and part space-time segments of the test video respectively, w_y^r and w_y^p are the entries of the trained separating hyperplane for roots and parts respectively, b_y is the bias term, and y ranges over the set of action class labels.
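A minimal sketch of this one-vs-all decision rule over root and part BoW histograms, with hypothetical class names and toy weights:

```python
def classify_bow(h_root, h_part, weights, biases):
    """One-vs-all linear scoring; the class with the highest score wins.

    weights[y] = (w_root, w_part): per-class hyperplane entries for roots/parts.
    """
    def score(y):
        w_r, w_p = weights[y]
        dot = lambda w, h: sum(a * b for a, b in zip(w, h))
        return dot(w_r, h_root) + dot(w_p, h_part) + biases[y]
    return max(weights, key=score)

label = classify_bow(
    h_root=[1.0, 0.0], h_part=[0.0, 1.0],
    weights={"run": ([2.0, 0.0], [0.0, 1.0]), "walk": ([0.0, 2.0], [1.0, 0.0])},
    biases={"run": 0.0, "walk": 0.0},
)
```

Keeping the root and part histograms as separate dot products is what later lets localization attribute a positive contribution to individual segments.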

For action localization, we find the space-time segments that make a positive contribution to the classification of the video. Given a test video with its set of root space-time segments and set of part space-time segments, denote the sets of code words corresponding to the positive entries of the root and part hyperplanes respectively. We compute the set U as:


where the similarity function measures the similarity between two space-time segments. Finally, the tracks that have at least one space-time segment in the set U are output as the action localization results.

6.1.5 Results

Ma et al. [2] experimented on the UCF-Sports [20] and High-Five [28] datasets. In action localization performance, they achieved an average increase of roughly 10% in average IOU compared to previous methods.

subset of frames all frames
[31] [32] [5] Ma [31] [32] [5] Ma
dive 16.4 36.5 43.4 46.7 22.6 37.0 - 44.3
golf - - 37.1 51.3 - - - 50.5
kick - - 36.8 50.6 - - - 48.3
lift - - 68.8 55.0 - - - 51.4
ride 62.2 68.1 21.9 29.5 63.1 64.0 - 30.6
run 50.2 61.4 20.1 34.3 48.1 61.9 - 33.1
skate - - 13.0 40.0 - - - 38.5
swing-b - - 32.7 54.8 - - - 54.3
swing-s - - 16.4 19.3 - - - 20.6
walk - - 28.3 39.5 - - - 39.0
Avg. - - 31.8 42.1 - - - 41.0
Table 4: Results for Ma et al. [2]. Action localization results measured as average IOU on UCF Sports dataset.

6.2 Spatiotemporal Deformable Part Models for Action Detection

Tian et al. [1] extend the concept of deformable part models from 2D to 3D, similar to Ma et al. [2] but with some differences. The main difference is that this approach searches for a 3D subvolume, considering parts both in space and time. SDPM also includes an explicit model to capture intra-class variation as a deformable configuration of parts. Finally, this approach shows effective results on action detection within a DPM framework without resorting to global BoW information, trajectories, or video segmentation.

The primary problem of generalizing DPM to 3D is that an action in a video may move spatially as frames progress. This is not a difficult problem in 2D, where a static bounding box covers most of the action parts; in videos, however, actions may move, and a statically learned bounding box will fail to cover the action across time. A naive approach would be to encapsulate the action within a large spatiotemporal box, but that would drastically decrease the IOU of the prediction. The secondary problem is the difference between space and time: as the authors rightly point out, if an action's size changes due to distance from the camera, that does not mean the duration of the action changes as well. Thus, their feature pyramids employ multiple levels in space but not in time. Finally, they employ HOG3D features [26] for their effectiveness. The HOG3D descriptors are based on a histogram of oriented spatiotemporal gradients, a volumetric generalization of the HOG [33] descriptor.

Figure 5: SDPM for Lifting in the UCF-Sports dataset with parts learned in each temporal stage. There are a total of 24 parts in this SDPM.

6.2.1 Root filter

Following the DPM paradigm, the authors select a single bounding box for each video enclosing one cycle of the given action. Volumes of other actions are treated as negative examples. Random volumes drawn from different scales of the video are also added to the negative samples to better discriminate the action from the background. The root filter captures the overall information of the action cycle by applying an SVM to the HOG3D features. An important aspect is deciding how to divide an action volume: too few cells decrease the overall discriminative power of the features, while too many cells prevent each cell from containing enough information to be useful. The size of the spatial dimensions in the root filter can be determined empirically; the authors used a 3x3xT size. This cannot be done for the temporal dimension, as an action may vary from 5 to 30 seconds, so the temporal size of this filter must be determined automatically depending on the distribution of the action in the training set.

6.2.2 Deformable parts

The authors observed that extracting HOG3D features for part models at twice the resolution, with more cells in space, enabled the learned parts to capture important details. A point to note is that parts selected by this model are allowed to overlap in space. After the SVM, subvolumes with higher weights (more discriminative power for a given action type) are selected as parts. The authors divided action volumes into 12x12xT cells to extract HOG3D features, and each part occupies 3x3x1 cells. They then greedily selected the N parts with the highest weights that fill 50% of the action cycle volume. Part weights are initialized to the weights of the cells contained inside them, and an anchor position for each part is also determined. To address intra-class variation, the authors use a quadratic function to allow parts to shift within a certain spatiotemporal region.

6.2.3 Action detection with SDPM

Given a test video, SDPM builds a spatiotemporal feature pyramid by computing HOG3D features at different scales. Template matching during detection is done using a sliding-window approach. Score maps for root and part filters are computed at every level of the pyramid: for each level, the score map of a filter is obtained by correlating the filter with the features of the test video volume.


At level l of the feature pyramid, the score of a detection volume centered at (x, y, t) is the sum of the score of the root filter on this volume and the scores from each part filter on its best possible subvolume:

score(l, x, y, t) = F_0 · φ(l, x, y, t) + Σ_{i=1..N} max_{(x_i, y_i, t_i) ∈ Ω} [ F_i · φ(l, x_i, y_i, t_i) − d_i(x_i, y_i, t_i) ]

where F_0 is the root filter and F_1, ..., F_N are the part filters. φ(l, x, y, t) and φ(l, x_i, y_i, t_i) are the features of a 3x3xT volume centered at (x, y, t) and of a 3x3x1 volume centered at part location (x_i, y_i, t_i), respectively, at level l of the feature pyramid. Ω is the set of all possible part locations and d_i is the corresponding deformation cost. The highest score is chosen at the end based on a threshold. A scanning search algorithm is employed instead of an exhaustive search.
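Assuming each part contributes its best filter score minus deformation cost, the detection score can be sketched as (hypothetical helper names; filter responses and costs would come from the HOG3D correlation maps):

```python
def sdpm_score(root_score, part_placements):
    """Detection score = root filter response + each part's best placement.

    part_placements: for each part, a list of (filter_score, deformation_cost)
    pairs, one per allowed displacement in the region Omega.
    """
    return root_score + sum(
        max(score - cost for score, cost in placements)
        for placements in part_placements
    )

# Part 1 prefers a near placement (0.5 - 0.1) over a far one (0.9 - 0.6).
s = sdpm_score(1.0, [[(0.5, 0.1), (0.9, 0.6)], [(0.4, 0.0)]])
```

The quadratic deformation cost is what penalizes a part drifting far from its anchor, so a strong but distant response can lose to a weaker nearby one, as in the first part above.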

6.2.4 Results

The authors present their results on Weizmann [34], UCF Sports [20], and MSR-II [16] datasets. Without much surprise, SDPM achieves 100% accuracy on the Weizmann dataset as the challenge is easy (9 actions on static background). On the UCF-Sports dataset, the authors achieved an average classification accuracy of 75.2% which is higher than Ma et al. [2] (73.1%) but lower than Raptis et al. [35] (79.4%). On the MSR-II dataset, they outperformed model without parts as well as baselines.

7 Graph-Based Models

7.1 Action localization in videos through context walk

Soomro et al. [12] take a different approach to action localization. As a brief summary, they over-segment videos into supervoxels, learn context relationships (background-background and background-foreground), estimate the probability of each supervoxel belonging to an action to create a conditional distribution of an action over all supervoxels, use a Conditional Random Field (CRF) to find action proposals in the video, and use an SVM to obtain confidence scores. This context walk eliminates the need for a sliding-window approach and an exhaustive search over the entire video, which is useful because most videos have actions in under 20% of their frames.

7.1.1 Context Graphs for Training Videos

Assume the training videos for an action are indexed 1 through M, where M is the number of training videos for that action, and that each video contains some number of supervoxels. Each supervoxel either belongs to the foreground action or the background. The authors construct a directed graph for each training video across all action classes. Nodes in the graph are represented by supervoxels, while edges emanate from all nodes belonging to the foreground.

Let each supervoxel u be represented by its spatiotemporal centroid. The features associated with u form a feature vector whose length is the total number of features. The per-video graphs are combined into a composite graph that contains all the information necessary for action localization.

7.1.2 Context Walk in Testing Video

The model obtains about 200-300 supervoxels per video. The goal is to visit each supervoxel in sequence, referred to as a context walk. The initial supervoxel is selected randomly, and similar supervoxels are found by a nearest-neighbor algorithm. The following function generates a conditional distribution over all supervoxels in the testing video given only the current supervoxel, its features, and the composite graph:


where the kernel computes the similarity between the features of the current supervoxel in the testing video and those in the composite graph. Skipping ahead to inference, the supervoxel with the highest probability is selected as the next step of the context walk:


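A toy sketch of one step of the context walk, assuming a Gaussian similarity kernel and a vote table distilled from the training graphs (both hypothetical simplifications of the paper's formulation):

```python
import math

def similarity(f_u, f_v, sigma=1.0):
    """Gaussian kernel on squared feature distance (an assumed kernel choice)."""
    d2 = sum((a - b) ** 2 for a, b in zip(f_u, f_v))
    return math.exp(-d2 / (2 * sigma ** 2))

def conditional_distribution(current_feat, train_items):
    """Training supervoxels vote for test supervoxels, weighted by how similar
    they are to the supervoxel currently being visited."""
    scores = {}
    for feat, votes in train_items:
        w = similarity(current_feat, feat)
        for sv, v in votes.items():
            scores[sv] = scores.get(sv, 0.0) + w * v
    total = sum(scores.values())
    return {sv: s / total for sv, s in scores.items()}

def next_supervoxel(dist):
    """The walk visits the highest-probability supervoxel next."""
    return max(dist, key=dist.get)

dist = conditional_distribution((0.0,), [((0.0,), {"a": 1.0, "b": 3.0})])
```

After a few such steps the distribution concentrates on action-like supervoxels, which is what makes the subsequent CRF inference cheap compared to a sliding window.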
7.1.3 Measuring Supervoxel Action Specificity

The authors quantify discriminative supervoxels with an action specificity score. Let the specificity of a cluster for an action be the ratio of the number of supervoxels from the foreground of that action in the cluster to all supervoxels from the action in the cluster. Then, given appearance/motion descriptors d, if a supervoxel belongs to the k-th cluster, its action specificity is quantified as:


where μ_k and r_k are the center and radius of the k-th cluster, respectively.

7.1.4 Inferring Action Locations using 3D-CRF

Once we have the conditional distribution, we can merge supervoxels belonging to actions to create a continuous flow of supervoxels without any gaps. The authors use a CRF for this purpose, minimizing the negative log-likelihood over all supervoxel labels a in the video:


where the unary potential depends on the conditional distribution after all steps of the context walk and on the action specificity measured above. Both are normalized between 0 and 1.

Figure 6: Pipeline for method proposed by Soomro et al. [12].

7.1.5 Results

The approach is evaluated on the UCF-Sports [20], sub-JHMDB [21], and THUMOS'13 [23] datasets. The biggest advantage of this method is its complexity: whereas SDPM by Tian et al. [1] and Tubelets by Jain et al. [9] require far more expensive searches, the cost of this work is dominated by the number of classifier evaluations, which is small.

Method UCF-Sports sub-JHMDB
Wang et al. [36] 47% 36%
Wang et al. [36] (iDTF+FV) - 34%
Jain et al. [9] 53% -
Tian et al. [1] 42% -
Lan et al. [5] 38% -
Soomro et al. [12] 55% 42%
Table 5: Soomro et al. [12]. Comparison of methods at 20% overlap.

7.2 Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition

Besides optical flow and traditional pixel level information, there is a class of representations based on human skeleton and joints. These form conceptual graphs that can be used to classify actions. This paper by Yan et al. [11] uses those features to classify and localize actions.

Graph neural networks are a recent paradigm that generalizes convolutional neural networks to graphs of arbitrary structure. They have been shown to perform well on tasks such as image classification, document classification, and semi-supervised learning.

Yan et al. [11] extend the idea of graph neural networks to Spatial-Temporal Graph Convolutional Networks (ST-GCN), which attempt to model, localize, and classify actions. The graph representation contains spatial edges that conform to the natural connectivity of joints and temporal edges that connect the same joints across consecutive time steps. Besides ST-GCN, the authors also introduce several principles for designing convolution kernels in ST-GCN. Finally, they evaluate their models on large-scale datasets to demonstrate the approach's effectiveness, as we shall see.

7.2.1 Spatial Temporal Graph ConvNet

The overall pipeline expects skeleton-based data obtained from a motion-capture device or a pose estimation algorithm run on videos. For each frame, there will be a set of joint coordinates. Given these sequences of body joints, the model constructs a spatial-temporal graph with joints as graph nodes and natural connectivities in both human body structure and time as graph edges.

7.2.2 Skeleton Graph Construction

The authors create an undirected spatial-temporal graph on a skeleton sequence with N joints and T frames, featuring both intra-body and inter-frame connections. The node set includes all joints in the skeleton sequence. As ST-GCN's input, the feature vector on a node consists of the coordinate vector as well as the estimation confidence of the i-th joint in frame t. The construction of the graph is divided into two steps: first, the joints within one frame are connected with edges according to the human body structure; then, each joint is connected to the same joint in the consecutive frame's graph. Connections are made naturally, without manual intervention, which also provides generalization capabilities across different datasets. Formally, the edge set is composed of two subsets: the intra-skeleton connections at each frame, determined by the set of naturally connected human body joints, and the inter-frame edges connecting the same joints in consecutive frames.
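The two edge subsets can be constructed mechanically from a bone list and a frame count; a minimal sketch (hypothetical function name, nodes encoded as (joint, frame) pairs):

```python
def build_skeleton_graph(num_joints, num_frames, bone_pairs):
    """E_S: intra-frame bone edges from the natural body structure;
    E_F: inter-frame edges linking each joint to itself in the next frame."""
    E_S = [((i, t), (j, t)) for t in range(num_frames) for i, j in bone_pairs]
    E_F = [((i, t), (i, t + 1))
           for t in range(num_frames - 1) for i in range(num_joints)]
    return E_S, E_F

# A toy 3-joint chain (0-1-2) over 2 frames.
E_S, E_F = build_skeleton_graph(3, 2, [(0, 1), (1, 2)])
```

Because only the bone list is dataset-specific, the same construction transfers across skeleton formats, which is the generalization property the authors highlight.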

7.2.3 Spatial Graph Convolutional Neural Network

Let us first consider the graph CNN model within one single frame. At a single frame at time t, there will be N joint nodes along with the skeleton edges. Given a convolution operator with kernel size K x K and an input feature map with c channels, the output value of a single channel at spatial location x can be written as:


where the sampling function p enumerates the neighbors of location x. The weight function w provides a weight vector in c-dimensional real space for computing the inner product with the sampled input feature vectors of dimension c. Standard convolution on the image domain is achieved by encoding a rectangular grid in p(x). Please refer to the original paper for the reformulation of the sampling and weight functions on 2D image domains. Now, we can write a graph convolution as:

    f_out(v_ti) = Σ_{v_tj ∈ B(v_ti)} (1 / Z_ti(v_tj)) · f_in(v_tj) · w(l_ti(v_tj))

where B(v_ti) is the sampling neighborhood of node v_ti, l_ti is a labeling map that partitions the neighborhood into K subsets, and the normalizing term Z_ti(v_tj) equals the cardinality of the corresponding subset. To model the temporal aspect of this graph, we simply use the same sampling function and labeling map l_ti. Because the temporal axis is well-ordered, we directly modify the label map for the spatial temporal neighborhood rooted at v_ti to be:

    l_ST(v_qj) = l_ti(v_tj) + (q − t + ⌊Γ/2⌋) × K

where l_ti(v_tj) is the label map for the single frame case at v_ti and Γ is the temporal kernel size.
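The label map arithmetic above can be sketched as a small helper (the function name `st_label` is ours, not the paper’s): each frame offset q − t shifts the spatial label by a whole block of K subsets.

```python
def st_label(l_spatial, q, t, gamma, K):
    """Spatial-temporal label l_ST for a neighbor v_qj of root node v_ti.

    l_spatial : spatial label l_ti(v_tj) in {0, ..., K-1}
    q, t      : frame index of the neighbor and of the root
    gamma     : temporal kernel size (temporal range of the neighborhood)
    K         : spatial kernel size (number of spatial subsets)
    """
    return l_spatial + (q - t + gamma // 2) * K

# A neighbor in the same frame (q == t) is offset by (gamma // 2) * K,
# so labels from different frame offsets never collide.
same_frame = st_label(1, q=5, t=5, gamma=9, K=3)
prev_frame = st_label(0, q=4, t=5, gamma=9, K=3)
```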

7.2.4 Implementing ST-GCN

The implementation of the graph convolution is the same as in Kipf and Welling [37]. The intra-body connections are represented by an adjacency matrix A and an identity matrix I. Thus, in the single frame case,

    f_out = Λ^{−1/2} (A + I) Λ^{−1/2} f_in W

where Λ^{ii} = Σ_j (A^{ij} + I^{ij}).

In the multiple subset case,

    f_out = Σ_j Λ_j^{−1/2} A_j Λ_j^{−1/2} f_in W_j

where Λ_j^{ii} = Σ_k A_j^{ik} + α. Here, the authors set α = 0.001 to avoid empty rows in A_j.
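A minimal NumPy sketch of the single-frame propagation rule may clarify the normalization (the toy chain skeleton, channel sizes, and function name are hypothetical; this is not the authors’ implementation):

```python
import numpy as np

def st_gcn_single_frame(X, A, W):
    """Single-frame propagation f_out = D^{-1/2} (A + I) D^{-1/2} X W,
    with D^{ii} = sum_j (A^{ij} + I^{ij}).

    X : (num_joints, in_channels) node features
    A : (num_joints, num_joints) intra-body adjacency
    W : (in_channels, out_channels) learned weights
    """
    A_hat = A + np.eye(A.shape[0])       # add self-connections (A + I)
    deg = A_hat.sum(axis=1)              # diagonal of the degree matrix
    D_inv_sqrt = np.diag(deg ** -0.5)    # D^{-1/2}
    return D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W

# Toy example: 3 joints in a chain, 2 input channels, 4 output channels.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.random.randn(3, 2)
W = np.random.randn(2, 4)
out = st_gcn_single_frame(X, A, W)
```

The multiple-subset case simply repeats this rule per partition A_j with its own weights W_j and sums the results.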

The input is first fed to a batch normalization layer to normalize the data. The ST-GCN model is composed of 9 layers of spatial temporal graph convolution operations. The first three layers have 64 output channels, the next three have 128, and the last three have 256. All layers have a temporal kernel size of 9. The ResNet mechanism is applied to each of these layers, and a dropout of 0.5 is applied after each layer to prevent overfitting. The 4th and 7th layers have a stride of 2 for pooling. Then, global pooling is performed to obtain a 256-dimensional feature vector for each sequence. Lastly, the feature vector is fed to a Softmax layer for classification. The model is optimized using stochastic gradient descent with a learning rate of 0.01, decayed by 0.1 every 10 epochs.

7.2.5 Results

The authors tested the model on the Kinetics human action dataset [38] and the NTU-RGB+D [39] dataset. On the Kinetics dataset, ST-GCN achieved a 10.4% and 12.8% increase in Top-1 and Top-5 accuracy, respectively, over the previous skeleton-based methods. On the NTU-RGB+D dataset, they achieved a 1.9% and 3.5% increase in X-Sub and X-View accuracy, respectively, over all previous methods.

Method                               Top-1  Top-5
RGB, Kay et al. [38]                 57.0   77.3
Optical Flow, Kay et al. [38]        49.5   71.9
Feature Enc., Fernando et al. [40]   14.9   25.8
Deep LSTM, Liu et al. [41]           16.4   35.3
Temporal Conv., Kim and Reiter [42]  20.3   40.0
ST-GCN [11]                          30.7   52.8
Table 6: Results for Yan et al. [11]. Action recognition performance of skeleton-based models on the Kinetics dataset. The first two methods are frame-based.

8 3D Convolutional Neural Networks

8.1 A Closer Look at Spatiotemporal Convolutions for Action Recognition

Let us look into an approach that uses convolutional neural networks exclusively, without any special feature representations. The paper by Tran et al. [43] introduces a more advanced approach to action recognition, demonstrating a new form of convolution. The method targets action recognition only; however, it can be supplemented with additional features to enable action localization.

The authors mainly focus on residual learning for action recognition. They explore the existing types of 3D convolutions and introduce two new variants. The first is a mixed convolution (MC), where the early layers of the model perform 3D convolutions while the later layers perform spatial (2D) convolutions over the learned features. The second is a complete decomposition of the 3D convolution into a separate 2D spatial convolution and a 1D temporal convolution, called the R(2+1)D convolution. This decomposition brings two advantages. First, it introduces an additional nonlinear rectification between the two operations, doubling the number of nonlinearities compared to a network using full 3D convolutions with the same number of parameters. Second, it facilitates optimization, leading to lower training loss and lower testing loss. Let us now explore the various types of convolutions for videos.

8.1.1 Convolutional residual blocks for video

Within the framework of residual learning, there are several spatiotemporal convolution variants available. Let x denote an input clip of size 3 × L × H × W, where L is the number of frames in the clip, H and W are the frame height and width, and 3 refers to the RGB channels. Let z_i be the tensor computed by the i-th convolutional block. Then, the output of that block is:

    z_i = z_{i−1} + F(z_{i−1}; θ_i)

where F implements the composition of two convolutions parameterized by weights θ_i and the application of ReLU functions.

R2D: 2D convolutions over the entire clip. 2D CNNs for video ignore temporal ordering and treat frames as channels. This amounts to reshaping the input 4D tensor x into a 3D tensor of size 3L × H × W. The output z_i of the i-th block is also a 3D tensor. Each filter is 3D and has size N_{i−1} × d × d, where d denotes the spatial width and height. Even though the filter is 3D, it only convolves in 2D over the spatial dimensions. All temporal information of the video is collapsed into single-channel feature maps, which prevents any sort of temporal reasoning.
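The channel-collapsing reshape that R2D performs can be illustrated in NumPy (the toy clip dimensions are hypothetical):

```python
import numpy as np

# A clip of 3 RGB channels, L=8 frames, 16x16 spatial: shape 3 x L x H x W.
clip = np.random.randn(3, 8, 16, 16)

# R2D reshapes the 4D clip into a 3D tensor of size 3L x H x W:
# every frame becomes a set of channels, so temporal order is no
# longer an explicit axis the convolution can reason over.
r2d_input = clip.reshape(3 * 8, 16, 16)
```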

f-R2D: 2D convolutions over frames. Another 2D CNN approach processes the frames independently via a series of 2D convolutional residual blocks. The same filters are applied to all frames. No temporal modeling is performed in the convolutional layers; a global spatiotemporal pooling layer at the end simply fuses the information extracted independently from the frames. This architecture variant is referred to as f-R2D (frame-based R2D).

R3D: 3D convolutions. 3D CNNs [19] preserve temporal information and propagate it through the layers of the network. The tensor z_i is 4D and has size N_i × L × H × W, where N_i is the number of filters used in the i-th block. Each filter is 4-dimensional and has size N_{i−1} × t × d × d, where t denotes the temporal extent of the filter (the authors used t = 3).

MC and rMC: mixed 3D-2D convolutions. The intuition behind MC layers is that motion modeling may be important in the early layers, while in the later layers temporal modeling is no longer necessary. In the authors’ experiments with a network of 5 residual blocks, two variants emerge: in one, the first three blocks are 3D convolutions while the last two are 2D (MC); the other variant (rMC, reversed MC) is just the opposite.

8.1.2 R(2+1)D: (2+1)D convolutions

Another hypothesis proposed by the authors is that full 3D convolutions can be more conveniently approximated by a 2D convolution followed by a 1D convolution. Thus, they design an R(2+1)D architecture where the 3D convolutional filter of size N_{i−1} × t × d × d is replaced with a (2+1)D block consisting of M_i 2D convolutional filters of size N_{i−1} × 1 × d × d and N_i temporal convolutional filters of size M_i × t × 1 × 1. The hyperparameter M_i determines the dimensionality of the intermediate subspace where the signal is projected between the spatial and temporal convolutions. The authors choose M_i = ⌊t d² N_{i−1} N_i / (d² N_{i−1} + t N_i)⌋ so that the number of parameters in the (2+1)D block is approximately equal to that of the full 3D variant. This spatiotemporal decomposition can be applied to any 3D convolutional layer.
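Under this parameter-matching constraint, the choice of M_i can be sketched as follows (the helper names are ours; the floor formula follows the matching constraint described above):

```python
def midplanes(n_in, n_out, t=3, d=3):
    """M_i chosen so the (2+1)D block has roughly the same parameter
    count as the full 3D convolution it replaces."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def param_counts(n_in, n_out, t=3, d=3):
    """Parameter counts of the full 3D filter bank vs. its (2+1)D block."""
    m = midplanes(n_in, n_out, t, d)
    p3d = n_out * n_in * t * d * d           # N_i filters of N_{i-1} x t x d x d
    p21d = m * n_in * d * d + n_out * m * t  # 2D spatial part + 1D temporal part
    return p3d, p21d

# For a 64 -> 64 block with t = d = 3, the division is exact and the
# two counts coincide.
p3d, p21d = param_counts(64, 64)
```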

Figure 7: Residual network architectures for video classification. (a) R2D are 2D ResNets; (b) MCx are ResNets with mixed convolutions; (c) rMCx use reversed mixed convolutions; (d) R3D are 3D ResNets; and (e) R(2+1)D are ResNets with (2+1)D convolutions.

8.1.3 Results

The authors evaluated their new architecture on the Kinetics [38] and Sports-1M [44] datasets. They also pre-trained the models on these two datasets and then fine-tuned them on the UCF-101 [22] and HMDB51 [45] datasets.

The networks experimented with are the ResNet-18 and ResNet-34 architectures, with frames cropped to 112 × 112. The authors use one spatial downsampling with striding 1 × 2 × 2 and three spatiotemporal downsamplings with convolutional striding 2 × 2 × 2. For training, consecutive frames are randomly sampled from each video. Batch normalization is applied to all convolutional layers, and the batch size is set to 32 per GPU. The initial learning rate is set to 0.01 and is decayed by 0.1 every 10 epochs. The R(2+1)D architecture reported an average 3% improvement over previous methods.
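The step decay schedule described above (also used for ST-GCN earlier) can be sketched as a small helper (the function name is ours):

```python
def learning_rate(epoch, base_lr=0.01, decay=0.1, step=10):
    """Step schedule: start at base_lr and multiply by `decay`
    every `step` epochs (0.01 -> 0.001 -> 0.0001 -> ...)."""
    return base_lr * decay ** (epoch // step)
```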

9 Action Localization Datasets

Let us explore common datasets used for action localization. This can help us understand what sort of features and data are available for exploring algorithms and models. We shall explore eight datasets that differ in size, domain, and available features.

9.0.1 Kinetics Dataset

The first dataset we explore is the Kinetics dataset by Kay et al. [38]. The dataset is sourced from YouTube videos to encourage variation between videos within the same action class and across different classes. There are 400 actions, a minimum of 400 video clips per action, and 306,245 videos in total. The dataset was built by first curating action classes, merging action classes from different previous datasets. Second, the videos were sourced from the YouTube corpus; to collect the best videos, relevance feedback scores were aggregated across multiple queries. Finally, human tagging was used to manually annotate the videos for accuracy and consistency.

9.0.2 Weizmann Dataset

The Weizmann dataset by Blank et al. [34] contains 90 low-resolution video sequences spanning 10 action classes. Some actions are: “running”, “walking”, “jumping-jack”, “jumping-forward-on-two-legs”, “jumping-in-place-on-two-legs”, “galloping-sideways”, “waving-two-hands”, “waving-one-hand”, and “bending”.

9.0.3 UCF-101 Dataset

The UCF-101 dataset by Soomro et al. [22] is arguably one of the most famous datasets for action recognition and action localization. As its name implies, it consists of 101 action classes, with 13,320 clips, 4-7 clips per group, a mean clip length of 7.21 seconds, a total duration of 1,600 minutes, a frame rate of 25 fps, and a resolution of 320 × 240. The videos are sourced from the YouTube corpus. It is an extension of the UCF-50 dataset.

9.0.4 UCF-Sports Dataset

The UCF-Sports dataset by Rodriguez et al. [20] is a video dataset containing 10 actions, primarily in the sports domain. There are 150 clips, with a mean clip length of 6.39 s, a frame rate of 10 fps, a total duration of 958 s, and a resolution of 720 × 480. The maximum and minimum numbers of clips per class are 22 and 6, respectively. The videos are sourced from the BBC and ESPN video corpora.

9.0.5 THUMOS’14 Dataset

Jiang et al. [23] released the THUMOS’14 dataset, which contains 101 actions, 13,000 temporally trimmed videos, over 1,000 temporally untrimmed videos, over 2,500 negative sample videos, and bounding boxes for 24 action classes.

9.0.6 HMDB Dataset

Jhuang et al. [21] introduce the HMDB dataset containing 51 actions and 6,849 clips, with each class containing at least 101 clips. Actions include laughing, talking, eating, drinking, pull-up, sit-down, ride-bike, etc.

9.0.7 ActivityNet Dataset

Caba Heilbron et al. [46] curated the ActivityNet dataset, which contains 200 action classes, 100 untrimmed videos per class, 1.54 activity instances per video on average, and a total of 38,880 minutes of video. The dataset was hosted as a challenge at CVPR 2018.

9.0.8 NTU RGB+D Dataset

Finally, we look at the NTU RGB+D dataset by Shahroudy et al. [39]. The dataset contains 56,880 action samples distributed across 60 actions, with each video containing the following data:

  1. RGB videos

  2. depth map sequences

  3. 3D skeletal data

  4. infrared videos

Video samples have a resolution of 1920 × 1080, depth map and IR videos have a resolution of 512 × 424, and the 3D skeletal data contain the three-dimensional locations of 25 major body joints.

10 Conclusion

In conclusion, we explored eight approaches to action localization. There are many more methods and techniques used to solve this problem; however, the majority of them rely on clever usage of the same types of features, including RGB pixel values, optical flow, and skeleton graphs. Action proposal networks are effective, but they are expensive and usually require an exhaustive search over the entire video. If the video is long (e.g., CCTV footage spanning several hours), these approaches become computationally infeasible. Figure-centric models try to solve this problem; however, they require too much manual feature construction to be automated at large scale. Deformable part models do solve the problem of selectively sampling and extracting segments for action localization, though there is room for improvement there too. Graph-based models also lower the number of search and classification operations through optimizations; however, they require skeletal data, which means a pose estimation algorithm is needed as a pre-processing step, and the accuracy of that algorithm can bias both training and inference. Finally, spatiotemporal convolutions provide an interesting proposition, and we can possibly extend this technique while incorporating more features to solve the problem of action localization.


  • Tian et al. [2013] Y. Tian, R. Sukthankar, and M. Shah, “Spatiotemporal deformable part models for action detection,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on.    IEEE, 2013, pp. 2642–2649.
  • Ma et al. [2013] S. Ma, J. Zhang, N. Ikizler-Cinbis, and S. Sclaroff, “Action recognition and localization by hierarchical space-time segments,” in Computer Vision (ICCV), 2013 IEEE International Conference on.    IEEE, 2013, pp. 2744–2751.
  • Ke et al. [2007] Y. Ke, R. Sukthankar, and M. Hebert, “Event detection in crowded videos,” in Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on.    IEEE, 2007, pp. 1–8.
  • Kläser et al. [2010] A. Kläser, M. Marszałek, C. Schmid, and A. Zisserman, “Human focused action localization in video,” in European Conference on Computer Vision.    Springer, 2010, pp. 219–233.
  • Lan et al. [2011] T. Lan, Y. Wang, and G. Mori, “Discriminative figure-centric models for joint action localization and recognition,” in Computer Vision (ICCV), 2011 IEEE International Conference on.    IEEE, 2011, pp. 2003–2010.
  • Prest et al. [2013] A. Prest, V. Ferrari, and C. Schmid, “Explicit modeling of human-object interactions in realistic videos,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 4, pp. 835–848, 2013.
  • Gkioxari and Malik [2015] G. Gkioxari and J. Malik, “Finding action tubes,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on.    IEEE, 2015, pp. 759–768.
  • Yu and Yuan [2015] G. Yu and J. Yuan, “Fast action proposals for human action detection and search,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1302–1311.
  • Jain et al. [2014] M. Jain, J. Van Gemert, H. Jégou, P. Bouthemy, and C. Snoek, “Action localization with tubelets from motion,” in CVPR-International Conference on Computer Vision and Pattern Recognition, 2014.
  • Weinzaepfel et al. [2015] P. Weinzaepfel, Z. Harchaoui, and C. Schmid, “Learning to track for spatio-temporal action localization,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 3164–3172.
  • Yan et al. [2018] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” arXiv preprint arXiv:1801.07455, 2018.
  • Soomro et al. [2015] K. Soomro, H. Idrees, and M. Shah, “Action localization in videos through context walk,” in Computer Vision (ICCV), 2015 IEEE International Conference on.    IEEE, 2015, pp. 3280–3288.
  • Diba et al. [2017] A. Diba, V. Sharma, and L. Van Gool, “Deep temporal linear encoding networks,” in Computer Vision and Pattern Recognition, 2017.
  • Ren et al. [2015] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • Xu and Corso [2012] C. Xu and J. J. Corso, “Evaluation of super-voxel methods for early video processing,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.    IEEE, 2012, pp. 1202–1209.
  • Zhang et al. [2016] J. Zhang, W. Li, P. Wang, P. Ogunbona, S. Liu, and C. Tang, “A large scale rgb-d dataset for action recognition,” in Proc. ICPR, 2016.
  • Hou et al. [2017] R. Hou, C. Chen, and M. Shah, “Tube convolutional neural network (t-cnn) for action detection in videos,” in IEEE International Conference on Computer Vision, 2017.
  • Girshick et al. [2014] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
  • Tran et al. [2015] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4489–4497.
  • Rodriguez et al. [2008] M. D. Rodriguez, J. Ahmed, and M. Shah, “Action mach a spatio-temporal maximum average correlation height filter for action recognition,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on.    IEEE, 2008, pp. 1–8.
  • Jhuang et al. [2013] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, “Towards understanding action recognition,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 3192–3199.
  • Soomro et al. [2012] K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.
  • Jiang et al. [2014] Y. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar, “Thumos challenge: Action recognition with a large number of classes,” 2014.
  • Weinzaepfel et al. [2017] P. Weinzaepfel, X. Martin, and C. Schmid, “Human action localization with sparse spatial supervision,” arXiv preprint arXiv:1605.05197, 2017.
  • Peng and Schmid [2016] X. Peng and C. Schmid, “Multi-region two-stream r-cnn for action detection,” in European Conference on Computer Vision.    Springer, 2016, pp. 744–759.
  • Kläser [2010] A. Kläser, “Learning human actions in video,” Ph.D. dissertation, PhD thesis, Université de Grenoble, 2010.
  • Yu and Joachims [2009] C.-N. J. Yu and T. Joachims, “Learning structural svms with latent variables,” in Proceedings of the 26th annual international conference on machine learning.    ACM, 2009, pp. 1169–1176.
  • Patron-Perez et al. [2010] A. Patron-Perez, M. Marszalek, A. Zisserman, and I. D. Reid, “High five: Recognising human interactions in tv shows.” in BMVC, vol. 1.    Citeseer, 2010, p. 2.
  • Leordeanu et al. [2012] M. Leordeanu, R. Sukthankar, and C. Sminchisescu, “Efficient closed-form solution to generalized boundary detection,” in European Conference on Computer Vision.    Springer, 2012, pp. 516–529.
  • Arbelaez et al. [2009] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, “From contours to regions: An empirical evaluation,” 2009.
  • Tran and Yuan [2011] D. Tran and J. Yuan, “Optimal spatio-temporal path discovery for video event detection,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.    IEEE, 2011, pp. 3321–3328.
  • Tran and Yuan [2012] ——, “Max-margin structured output regression for spatio-temporal action localization,” in Advances in neural information processing systems, 2012, pp. 350–358.
  • Dalal and Triggs [2005] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1.    IEEE, 2005, pp. 886–893.
  • Blank et al. [2005] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” in The Tenth IEEE International Conference on Computer Vision (ICCV’05), 2005, pp. 1395–1402.
  • Raptis et al. [2012] M. Raptis, I. Kokkinos, and S. Soatto, “Discovering discriminative action parts from mid-level video representations,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.    IEEE, 2012, pp. 1242–1249.
  • Wang et al. [2014] L. Wang, Y. Qiao, and X. Tang, “Video action detection with relational dynamic-poselets,” in European Conference on Computer Vision.    Springer, 2014, pp. 565–580.
  • Kipf and Welling [2016] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
  • Kay et al. [2017] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.
  • Shahroudy et al. [2016] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+ d: A large scale dataset for 3d human activity analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1010–1019.
  • Fernando et al. [2015] B. Fernando, E. Gavves, J. M. Oramas, A. Ghodrati, and T. Tuytelaars, “Modeling video evolution for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5378–5387.
  • Liu et al. [2016] J. Liu, A. Shahroudy, D. Xu, and G. Wang, “Spatio-temporal lstm with trust gates for 3d human action recognition,” in European Conference on Computer Vision.    Springer, 2016, pp. 816–833.
  • Kim and Reiter [2017] T. S. Kim and A. Reiter, “Interpretable 3d human action analysis with temporal convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).    IEEE, 2017, pp. 1623–1631.
  • Tran et al. [2017] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” arXiv preprint arXiv:1711.11248, 2017.
  • Karpathy et al. [2014] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732.
  • Kuehne et al. [2013] H. Kuehne, H. Jhuang, R. Stiefelhagen, and T. Serre, “Hmdb51: A large video database for human motion recognition,” in High Performance Computing in Science and Engineering ‘12.    Springer, 2013, pp. 571–582.
  • Caba Heilbron et al. [2015] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 961–970.