
DAP3D-Net: Where, What and How Actions Occur in Videos?

Action parsing in videos with complex scenes is an interesting but challenging task in computer vision. In this paper, we propose a generic 3D convolutional neural network, trained in a multi-task learning manner, for effective Deep Action Parsing (DAP3D-Net) in videos. In the training phase, action localization, classification and attributes learning are jointly optimized on our appearance-motion data via DAP3D-Net. For an upcoming test video, we can describe each individual action in the video simultaneously as: Where the action occurs, What the action is and How the action is performed. To demonstrate the effectiveness of the proposed DAP3D-Net, we also contribute a new Numerous-category Aligned Synthetic Action dataset, i.e., NASA, which consists of 200,000 action clips of more than 300 categories, with 33 pre-defined action attributes in two hierarchical levels (i.e., low-level attributes of basic body part movements and high-level attributes related to action motion). We learn DAP3D-Net using the NASA dataset and then evaluate it on our collected Human Action Understanding (HAU) dataset. Experimental results show that our approach can accurately localize, categorize and describe multiple actions in realistic videos.





1 Introduction

Human action analysis [4, 55, 29, 7, 1] is a popular research area in computer vision and has many applications such as video surveillance [54, 60, 18], robotics and multimedia search and retrieval [57, 31, 58]. Detailed description of human actions in videos requires solving three main problems: (1) Where in the video do the actions occur? (2) What categories do the actions belong to? and (3) How are these actions performed? Most previous studies, however, focus on only one or two of these problems separately (such as action categorization [49, 43, 50, 27, 37, 15, 23], localization [17, 44, 10, 30, 53, 48] or motion attributes learning [25, 36]). As a result, they generalize poorly and become overly complex when asked to describe actions integrally and in rich detail.

To jointly consider the above three problems, in this paper, we aim to develop a new approach which can automatically parse actions in videos. Specifically, we are interested in describing each individual action in a video with its corresponding location, category and motion attributes simultaneously, as shown in Fig. 1. For localization of an individual action, we aim to output the accurate bounding box in which the action occurs; for classification, we aim to categorize each action into a class; and for attributes learning, we describe how each action is performed with detailed motion information. In fact, the three problems are inter-correlated and should be tackled together. However, little effort has been devoted to action parsing in videos with complex scenes. In particular, there is no sufficiently large aligned action dataset with corresponding motion attributes for model learning. Although some recent works [61, 25] annotated some action data with bounding boxes and attributes, the size and diversity of the data in their works are limited.

Thus, in this paper, we first contribute a new Numerous-category Aligned Synthetic Action dataset, i.e., NASA. It contains 200,000 action clips with over 300 categories. Moreover, each clip is assigned 33 attributes in two hierarchical levels. All the video clips in NASA are synthesized using Poser 10 [Footnote 1: Details of Poser 10 in], which is a professional software program used to effectively generate virtual action data from motion capture of real humans. Thus, the motions of the synthetic data can be visually very close to realistic human actions. To the best of our knowledge, NASA is the largest aligned action dataset to date.

From the perspective of modeling our action parsing problem, we have been motivated by the region-based convolutional neural network (R-CNN) [9, 8, 33] for object detection, since deeply learned models have been shown to achieve better results than conventional methods [52, 47, 6, 5]. Furthermore, in [36, 35], learning attributes from complex scene images via deep nets also produces significant improvement over traditional methods. However, most current deep learning based approaches focus on 2D images. For spatio-temporal action data, some previous works [43, 46, 13, 15] have proposed 3D deep convolutional neural networks to recognize actions. Among them, [46] produces better results with 3D convolutions on both spatial and temporal dimensions. Inspired by all these works, we aim to design a multi-task 3D convolutional neural network for effective Deep Action Parsing (DAP3D-Net) in videos. Specifically, in the training phase, action localization, classification and attributes learning can be jointly optimized via DAP3D-Net. Once model training is completed, given an upcoming test video, we can describe each individual action in the video simultaneously as: where the action occurs, what the action is and how the action is performed. Unlike [46, 13, 43], which use original video data as the deep net inputs, our method adopts motion channels (calculated from optical flows) in addition to appearance (intensity information) as the input of DAP3D-Net, in order to better learn motion information for action parsing. The promising action parsing results achieved by DAP3D-Net offer potential functionality for outdoor/indoor video surveillance systems in public security and personal healthcare applications, e.g., anomalous event detection and abnormal behavior monitoring for the elderly. The main contributions of our work can be highlighted as follows:

(1) Our deep model DAP3D-Net jointly optimizes action localization, classification and attributes learning. With such a multi-task scheme, DAP3D-Net can solve the problem of where, what and how actions occur simultaneously for human action parsing in videos.

(2) In order to train DAP3D-Net, we introduce a new large-scale aligned action dataset, NASA, with 200K well labeled video clips. Additionally, to further evaluate the effectiveness of DAP3D-Net, a realistic Human Action Understanding (HAU) dataset has also been collected with the locations, categories and attributes of all actions annotated.

2 Related Work

Since little work has been done on action parsing in videos for simultaneously solving the problems of action localization, categorization and attributes learning, in this section we mainly review related work on action detection and action attributes modeling. Action detection can be regarded as a combination of action localization and categorization. In [39], a weakly supervised model with multiple instance learning was applied for action detection. In [51], a dynamic-poselets method was introduced. A branch-and-bound algorithm [58] was proposed to reduce the complexity of action detection. There also exist some sub-volume based action detection methods [22, 16, 38]. Besides, cross-dataset action detection [3] and action detection based on spatio-temporal deformable part models [45] have also been proposed in previous studies. Additionally, action detection via fast proposals was developed in [57].

For action attributes modeling, Liu et al. [25] used high-level semantic attributes to represent human actions in videos and further constructed more descriptive models for the action recognition task. A similar idea has also been applied in [59, 42] for improved action categorization. Moreover, a convolutional multi-task learning method [61] has been adopted for action recognition from low-level features with attribute regularization. In [62], a robust learning framework using relative attributes was developed for human action recognition. Additionally, action attributes and object-parts from images were also used for action recognition in [56]. However, all the above studies mainly focus on action recognition by means of attributes rather than on general action attributes learning tasks. Although the authors of [7] jointly tackled classification and attribute annotation for group activities, their method remains a pipeline of separate feature extraction and attribute learning rather than an end-to-end framework like our multi-task DAP3D-Net. Besides, DAP3D-Net focuses on simultaneous localization and parsing of multiple actions, while in [7] only global representations of group activities are considered.

3 Approach

In this section, we introduce the architecture of our proposed method, i.e., DAP3D-Net, for action parsing in videos. It is a multi-task learning scheme for jointly optimizing action localization, action categorization and action attributes learning using appearance-motion data from the NASA dataset. In the following subsections, we first describe the construction of the NASA dataset and then detail DAP3D-Net.

3.1 NASA Dataset Construction

For deep action parsing in videos, a large number of aligned action clips with motion attributes are needed to train the models. However, most previous action datasets [26, 28, 20, 40, 15] are collected in realistic scenarios, in which complex backgrounds, camera noise, shift, scaling and occlusion always exist. Such action videos are indeed useful for evaluating the robustness of action recognition systems, but are not suitable as training data for an action parsing model, since they lack well localized bounding boxes and annotations for each individual action. In fact, most previous action datasets are collected from websites with realistic scenes of activities rather than just aligned single actions. For instance, the largest action dataset so far, Sport1M [15], is collected from YouTube and contains one million clips from 487 sport activity categories, but provides no annotation for each individual action, because detailed annotation of large-scale video data is laborious and impractical. Similar circumstances also exist in other realistic action datasets [40, 28]. In contrast, there are some earlier action datasets [34, 2, 58] with bounding boxes given or easily obtainable on relatively simple backgrounds, e.g., the KTH dataset (with 6 basic action categories); however, the size and diversity of these datasets are too limited to train a sophisticated action parsing model.

To solve the shortage of aligned and annotated action data for training models with some specific usage, e.g., action parsing proposed in this paper, we contribute a new dataset, i.e., NASA, which provides 200,000 aligned action clips from over 300 action categories. For each clip, 33 action attributes are assigned in two hierarchical levels. The NASA dataset is superior to previous aligned action datasets in both scale and diversity.

3.1.1 Action Synthesis

As mentioned above, it is impractical to construct a large-scale aligned action dataset by either annotating actions from online data or newly recording action videos with actors. Naturally, computer graphics and animation techniques can be used to synthesize human actions [32] automatically. Therefore, we adopt Poser 10, which is professional software for creating 3D animation and illustration. With Poser 10, we can build high quality character-based 3D action animations and further render the animations into photo-realistic videos or images. Moreover, a truebones motion capture database [Footnote 2: Download from] is also utilized with Poser 10, in which over 1600 different high-quality, clean and consistent motion models captured from real human performers are packed as BVH files. Specifically, a 3D model in a BVH file keeps movement parameters for 19 joints, which are captured from 19 corresponding body parts of a real human with wearable devices. We further aggregated the 1600 motion models into over 300 action categories, imported these models into Poser 10 and generated the action clips, each of which includes only one aligned action. In particular, we projected each 3D motion model into 18 viewpoints: three different camera heights, with six viewpoints distributed on a circle at each camera height. To increase the diversity of actions, Poser 10 also provides ten different characters with different genders, looks, body sizes, heights and clothing to perform the actions in various backgrounds. In this way, for a certain 3D action model from truebones motion capture database, we can create action clips. The spatial size of each clip is fixed as via Poser 10, while video lengths are assigned based on action categories. Since all action motion models are captured from real human body movements, the synthetic clips in NASA are visually very close to realistic actions performed by humans.

Figure 2: Video examples in our NASA dataset. Each labeled video is assigned multiple semantic action attributes in hierarchical levels: low-level (H1) and high-level (H2).

3.1.2 Action Attribute Annotation

After data generation by Poser 10, we further annotate each clip with 33 action attributes in two hierarchical levels. Specifically, we define 19 low-level attributes (H1) to describe the basic body part movements and 14 high-level attributes (H2) of general action motion. Since the action clips generated from each truebones model are motion consistent, we regard these 33 attributes as model-level attributes in our NASA dataset. In this way, all the data generated from one model share the same 33 action attributes [Footnote 3: Since each category in NASA is aggregated from multiple models, all attributes for a category will be diverse rather than sharing the same ones.]. However, in some cases, the attributes of action clips are not dominant enough for their corresponding action categories because the downloaded models contain inaccurate motion capture. Thus, to control the data quality, we discard clips with ambiguous attributes and finally construct our NASA dataset with 200,000 action clips. Given a video clip, the semantic motion attributes allow us to parse an action by answering “How are these actions performed?” and may also benefit zero-shot learning [11, 21] for action recognition/retrieval in future work. Fig. 2 shows some video examples with two-level attributes in the NASA dataset. The full list of each action clip’s attributes will be released on our project webpage later.

3.2 Modeling DAP3D-Net

In this sub-section, we introduce our proposed deep architecture, i.e., DAP3D-Net, for effective action parsing by solving the problems of “Where, What and How Actions Occur in Videos?”. In particular, the learning phase of the proposed approach consists of three steps: (1) Data augmentation via video spatial cropping and temporal scaling; (2) Appearance-motion data composition; (3) A 3D convolutional neural network for joint optimization of action localization, classification and attributes learning.

3.2.1 Preliminary work for training data

Data augmentation: To better resist overfitting and train an effective bounding box regressor (which will be explained in Section 3.2.2) with DAP3D-Net for action localization, we crop each aligned action clip in NASA into 5 subclips. Specifically, given a clip with , where , and indicate the width, height and temporal length of the clip, respectively, we crop it spatially into five subclips, i.e., top-left (), top-right (), bottom-left (), bottom-right () and central (), with the identical width of and height of . In detail, we define the spatial center coordinate of an original clip as and according to this center position, the relative location of each subclip can be denoted as: for , for , for , for and for . Since each action is well aligned in data, the relative location of subclips can be regarded as the lengths of spatial shifts along the horizontal and vertical directions respectively from the central position of original clips in NASA. In our model, , , and for each clip in NASA we crop it into subclips by using a parameter randomly selected with . To further achieve temporally scale invariant action parsing via DAP3D-Net, we also resize these subclips along the temporal dimension with , where is a scaling factor randomly selected from for each subclip.
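The cropping and temporal scaling steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `five_crops` and `temporal_rescale`, the `(T, H, W, C)` array layout, the image-style coordinate convention (x to the right, y downward), the crop sizes and the nearest-frame resampling are all assumptions made for illustration, since the numerical settings are not preserved in this copy.

```python
import numpy as np

def five_crops(clip, crop_w, crop_h):
    """Crop a (T, H, W, C) clip into top-left, top-right, bottom-left,
    bottom-right and central subclips. Each subclip is returned together
    with the (dx, dy) shift of its centre from the centre of the original
    clip (assumed convention: x to the right, y downward)."""
    _, h, w = clip.shape[:3]
    # Top-left corner (x0, y0) of each crop window.
    corners = {
        "top_left": (0, 0),
        "top_right": (w - crop_w, 0),
        "bottom_left": (0, h - crop_h),
        "bottom_right": (w - crop_w, h - crop_h),
        "central": ((w - crop_w) // 2, (h - crop_h) // 2),
    }
    subclips = {}
    for name, (x0, y0) in corners.items():
        sub = clip[:, y0:y0 + crop_h, x0:x0 + crop_w]
        dx = (x0 + crop_w / 2.0) - w / 2.0
        dy = (y0 + crop_h / 2.0) - h / 2.0
        subclips[name] = (sub, (dx, dy))
    return subclips

def temporal_rescale(clip, factor):
    """Resample a clip along time by nearest-frame indexing
    (an assumed, simple stand-in for temporal resizing)."""
    t = clip.shape[0]
    new_t = max(1, int(round(t * factor)))
    idx = np.minimum((np.arange(new_t) / factor).astype(int), t - 1)
    return clip[idx]
```

In such a setup, the `(dx, dy)` pair attached to each subclip would serve as the regression target for the bounding-box branch, and the central crop has a zero shift.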

Figure 3: Illustration of appearance-motion data composition.

Appearance-motion data composition: Since our work focuses on action parsing rather than group activity recognition, temporal motion information appears to be more important than the spatial appearance information of an action clip. Therefore, after data augmentation, we propose to construct appearance-motion data instead of using traditional RGB videos as inputs to our DAP3D-Net. Specifically, we first extract the intensity information via gray-scale transformation as the appearance channel, and further compute the optical flows (Vx and Vy) via [24], which carry rich motion information from videos, as the motion channels. We then combine the one intensity channel with the two motion channels to construct our appearance-motion video data for DAP3D-Net training, as shown in Fig. 3. In theory, appearance-motion data can achieve effects similar to the “early fusion” [15] of RGB and motion data in other deep models. More importantly, it reduces the artifacts caused by synthetic data as far as possible, since this composition weakens appearance information and strengthens motion information.
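As a rough illustration of this composition, the sketch below stacks one gray-scale channel with two precomputed flow channels. The helper names, the `(T, H, W, 3)` layout and the BT.601 luma weights are assumptions, and the optical flow itself is taken as an input because the flow method of [24] is not reproduced here; the flow arrays are assumed to have one field per frame.

```python
import numpy as np

def to_gray(rgb_clip):
    """Convert a (T, H, W, 3) RGB clip to a (T, H, W) intensity clip
    using ITU-R BT.601 luma weights (an assumed gray-scale transform)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb_clip.astype(np.float64) @ weights

def compose_appearance_motion(rgb_clip, flow_x, flow_y):
    """Stack intensity with horizontal and vertical flow into a
    three-channel (T, H, W, 3) appearance-motion clip."""
    gray = to_gray(rgb_clip)
    if not (gray.shape == flow_x.shape == flow_y.shape):
        raise ValueError("intensity and flow channels must share one shape")
    return np.stack([gray, flow_x, flow_y], axis=-1)
```

The result keeps the three-channel shape of an RGB clip, so it can be fed to a network whose first layer expects three input channels.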

Figure 4: DAP3D-Net structure. The rectified linear unit (ReLU) is applied as the activation function after each convolution layer and fully connected layer. The detailed parameters of DAP3D-Net can be found in this figure. (Zoom in for better viewing.)

3.2.2 DAP3D-Net Structure and Model Setting

The proposed structure of DAP3D-Net contains 6 (spatio-temporal) convolution layers, 4 pooling layers and 2 fully connected (FC) layers. All the parameters are illustrated in detail in Fig. 4. As shown in Fig. 4, an input appearance-motion sequence is fed to DAP3D-Net and mapped into feature vectors by the fully connected layers (FC1 and FC2). After the FC layers, we define four multi-task loss supervision (output) layers in DAP3D-Net for our action parsing task, as shown in Fig. 5. In more detail, a two-dimensional relative location vector, i.e., , as mentioned in Section 3.2.1, is optimized via Euclidean square loss for real-valued bounding-box regression connected with FC2. Furthermore, a softmax loss layer is applied to predict probabilities, , of action categories connected with FC2. Besides, following [36], the cross entropy loss is optimized for multi-output prediction of 14 high-level action attributes (H2), , connected with FC2, and similarly the prediction of 19 low-level action attributes (H1), , is connected with FC1. We split the attributes across two layers for the following reason: FC1 is relatively closer to the low-level feature extraction layers in DAP3D-Net (i.e., Conv1, Conv2 and Conv3) and carries weaker semantic information than FC2, so FC1 is more suitable for learning attributes of basic body part movements, which are relatively lower-level and more specific. In contrast, since FC2 combines low-level information from FC1 and is directly connected with the action category prediction layer, it contains more high-level, semantic and abstract information about actions. Thus, we can learn higher-level, general motion attributes from FC2. This hierarchical learning scheme proves to be effective and accurate for revealing the motion details of actions in our experiments.

Figure 5: The multi-task loss supervision (output) of DAP3D-Net.

For training DAP3D-Net, each input subclip is labeled with a relative location, a ground-truth class label and multiple attributes in H1 and H2. Then, we adopt a multi-task loss on each training datum to jointly optimize bounding-box regression, action categorization and attributes learning as follows:


where, indicates the softmax loss of action categorization which is defined the same as other deep classification nets [19, 15, 41]. indicates the cross entropy loss for hierarchical attributes learning as:


where denotes the number of the H1 attribute outputs, are ground-truth labels and are output probability predictions. A similar equation is also used for . Furthermore, following [9], the Euclidean square loss for bounding box regression can be defined as:


where is the predicted relative location and is the ground-truth relative location. Besides, the hyper-parameters in Eq. (1) are used to control the balance of multi-task losses. They are fixed as: , and in all our experiments. Note that, to fit DAP3D-Net, we adopt affine spatio-temporal warping (similar to [9]) to resize all appearance-motion clips with the fixed-size as the inputs to the deep model. [Footnote 4: In [46], they split videos into 16-frame long clips with an 8-frame overlap between two consecutive clips as the inputs of the deep net. The reason is that [46] focuses on group activity recognition, where the information of the activity scene on the spatial dimensions of a clip is more important than information on the temporal dimension. Thus, splitting into 16-frame clips does not break the integrity of target activities in the training data. However, each training video in NASA includes only one integrated action and any split will make it incomplete. Thus, we warp all training clips to the same size as the DAP3D-Net inputs instead of brutally splitting them.] After attempting various network architectures with different parameter settings, the current 3D convolutional neural network structure shown in Fig. 4 proves to be the best option for our action parsing task.
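The multi-task objective described above can be sketched for a single training datum as follows. This is one plausible reading, not the paper's exact Eq. (1): it assumes the total loss is the softmax loss plus weighted cross-entropy terms for H1 and H2 and a weighted Euclidean square loss, that the attribute loss is averaged over the attribute outputs, and that the three weights (here the hypothetical `lambdas`) default to 1.0, since the fixed values are not preserved in this copy.

```python
import numpy as np

def softmax_loss(logits, label):
    """Negative log-likelihood of the true class under a softmax."""
    z = logits - np.max(logits)              # shift for numerical stability
    log_probs = z - np.log(np.sum(np.exp(z)))
    return float(-log_probs[label])

def attribute_cross_entropy(probs, targets, eps=1e-12):
    """Binary cross entropy averaged over the attribute outputs."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p)))

def euclidean_square_loss(pred, target):
    """Squared Euclidean distance between two relative locations."""
    d = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sum(d ** 2))

def multitask_loss(logits, label, h1_probs, h1_targets,
                   h2_probs, h2_targets, loc_pred, loc_true,
                   lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the four task losses (the weights are assumptions)."""
    lam_h1, lam_h2, lam_loc = lambdas
    return (softmax_loss(logits, label)
            + lam_h1 * attribute_cross_entropy(h1_probs, h1_targets)
            + lam_h2 * attribute_cross_entropy(h2_probs, h2_targets)
            + lam_loc * euclidean_square_loss(loc_pred, loc_true))
```

In this sketch the H1 term would take the 19 outputs attached to FC1 and the H2 term the 14 outputs attached to FC2, mirroring the hierarchical split described in Section 3.2.2.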

Figure 6: Visualization of the feature map in the Conv2 layer. (a) The three channels (from top to bottom: gray-scale, optical flow Vx and optical flow Vy) of a 20-frame appearance-motion sequence: Turnaroundkick; (b) the feature map of the Conv2 layer for the action Turnaroundkick. In particular, Conv2 feature maps are illustrated in a total of blocks and each block is the visualization of a dimensional feature. A similar visualization of feature maps for an appearance-motion sequence, bowandarrow, in the Conv2 layer can be seen in (c) and (d). (Zoom in for better viewing.)

3.3 Action Parsing via DAP3D-Net

Once DAP3D-Net is trained, for a new test video containing multiple actions in a complex scene, we first apply an action proposal method to obtain a set of candidate action chunks with corresponding coarse locations, motivated by [9]. In fact, a variety of recent papers offer methods [10, 30, 53, 57, 48] for generating action proposals to effectively reduce the search space for detection-based research. In our framework, we adopt [57] to propose the action candidates (on average 550 proposals per video) due to the impressive performance reported in their paper. We further construct the appearance-motion data and warp them into the identical size of for each obtained action proposal as in the training phase, and then feed them into the learned DAP3D-Net. Consequently, the relative location, action category, and action attributes of each proposed video chunk can be obtained through DAP3D-Net. Particularly, to further refine the localization of actions, we formulate the bounding box adjustment equation as:


where denotes the refined location of the action center, coordinate denotes the center location of the action proposal and coordinate denotes the relative location output by DAP3D-Net. Finally, non-maximum suppression is applied to reject a proposal if it has an intersection-over-union (IoU) overlap with a higher scoring proposal and the IoU is larger than a threshold, following [9]. For faster action parsing, SVD can be applied to the FC layers as in [8].
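The refinement and suppression steps can be sketched as below. This is an illustration under stated assumptions: the adjustment equation is not preserved in this copy, so `refine_center` assumes the predicted relative location is the offset of the proposal centre from the true action centre (the convention of the training crops) and exposes the sign as a parameter; boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the IoU threshold is left to the caller.

```python
import numpy as np

def refine_center(proposal_center, relative_location, sign=-1.0):
    """Shift the proposal centre by the predicted relative location.
    The sign is an assumption and depends on how the offset is defined."""
    px, py = proposal_center
    dx, dy = relative_location
    return (px + sign * dx, py + sign * dy)

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold):
    """Greedy non-maximum suppression: keep a box only if its IoU with
    every higher-scoring kept box does not exceed the threshold."""
    order = np.argsort(scores)[::-1]         # indices by descending score
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(int(i))
    return keep
```

For example, with two heavily overlapping boxes and one distant box, only the higher-scoring box of the overlapping pair and the distant box survive.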

















Method, followed by the detection AP (%) on each of the 20 action categories and the MAP in the last column
Tubelet [10] 9.8 54.0 40.4 88.3 71.6 52.7 50.0 61.7 15.8 82.8 43.5 57.1 63.4 18.5 19.4 64.1 94.8 82.3 65.5 75.8 55.6
SDPM [45] 20.2 52.9 21.3 75.0 54.4 47.2 43.1 58.2 14.9 61.1 49.3 78.1 56.1 38.5 21.8 60.2 87.6 57.3 46.8 62.8 50.3
WSAD [39] 8.2 43.7 19.3 63.8 57.6 34.7 36.9 68.8 14.0 62.7 31.6 49.5 52.6 7.9 16.4 41.7 62.1 65.5 36.0 64.2 41.8
NBMIM [58] 11.7 58.8 47.6 86.9 77.8 48.6 56.5 56.7 10.4 82.1 37.7 43.3 62.4 8.6 26.7 58.0 82.0 79.7 62.4 73.9 53.5
STIP [23]+VLAD [12]+linear SVM 27.0 45.9 12.2 38.9 52.0 57.0 42.1 23.1 42.5 42.1 56.1 52.9 20.0 24.8 50.9 29.9 23.8 48.5 14.8 33.2 37.0
DTF [49]+VLAD [12]+linear SVM 58.1 63.5 39.0 77.0 72.8 41.7 57.8 36.9 24.2 50.0 54.7 49.0 58.3 38.5 48.9 45.0 84.0 77.4 53.8 67.1 54.9
Two-stream convnet [37]+Ft+score fusion 47.6 76.0 40.3 38.9 51.2 26.1 38.0 53.3 21.3 85.2 57.8 73.3 53.4 47.0 28.5 40.7 31.9 51.6 46.8 74.4 49.2
C3D [46]+linear SVM (FC6) 12.9 59.4 49.0 88.3 79.9 50.0 59.2 49.2 11.6 83.1 27.9 46.1 60.5 19.3 27.4 57.0 85.4 81.6 74.7 74.5 55.2
C3D [46]+Ft+linear SVM (FC7) 57.8 67.8 38.5 79.3 77.0 49.0 55.9 50.1 26.6 81.6 66.0 36.0 60.9 47.3 53.3 61.3 86.0 78.6 65.6 69.2 60.4
C3D [46]+Ft+softmax 14.1 50.7 35.6 88.5 70.0 49.2 46.2 59.0 8.4 82.4 39.0 54.1 61.0 16.4 12.4 61.7 89.2 81.8 63.3 74.6 52.9
DAP3D-Net+Ft (RGB)+softmax 21.5 69.4 36.4 80.9 75.7 55.4 61.2 66.3 23.6 74.3 53.5 75.1 62.2 41.4 48.7 62.0 87.2 66.8 63.6 70.8 59.8
DAP3D-Net+Ft (AM)+softmax 26.9 71.7 41.0 73.4 81.1 61.5 67.3 71.9 31.3 79.6 58.1 82.9 67.5 47.4 50.9 66.2 88.6 72.9 66.1 73.5 64.0
DAP3D-Net+BBR+Ft (RGB)+softmax 25.2 73.4 41.8 81.5 80.0 58.8 64.8 70.2 29.3 78.5 56.7 80.4 65.9 44.0 51.7 65.6 83.4 70.7 67.4 74.9 63.2
DAP3D-Net+BBR+Ft (AM)+softmax 32.3 76.3 53.0 87.9 85.6 66.0 71.8 77.4 34.2 88.1 61.6 87.5 72.8 51.9 57.4 70.7 91.6 77.4 70.6 78.0 69.3

“Ft” indicates that the deep model is fine-tuned with training data from the HAU dataset. “Two-stream convnet” is trained with original frames and their optical flows, and score fusion is used for the final prediction. “RGB” denotes using original data clips to fine-tune our model, while “AM” denotes using appearance-motion data clips to fine-tune our model. “softmax” indicates that the probabilities of each category are output directly by the deep net instead of using FC6/FC7 features with a linear SVM as in [46]. “BBR” indicates using the bounding-box regressor to refine the location by the adjustment in Eq. (4). For the methods in the 5th to 12th rows of this table, the localization of actions directly uses the location (,,) of the proposals [57].

Table 1: Detection average precision (%) on the HAU dataset. Four previous works and some baselines have been evaluated against in this table. All the methods from the 5th row to the last row are based on action proposals in [57]. The last four rows are different versions of our proposed method.
(a) Attributes prediction via DAP3D-Net (RGB data) (b) Attributes prediction via DAP3D-Net (appearance-motion data)
Figure 7: AUC of each attribute predicted via DAP3D-Net (RGB) and DAP3D-Net (AM) on HAU, respectively. Blue and red denote the predictions of low-level attributes (H1) and high-level attributes (H2), respectively. (Zoom in for better viewing.)

4 Experiments and results

In this section, we evaluate the proposed DAP3D-Net for action parsing in videos on two datasets: NASA and HAU. Our method is implemented using the Caffe [14] deep learning framework, and we further modify various layers from [46] to fit our architecture. DAP3D-Net is trained on a workstation configured with a GTX TITAN X GPU.

4.1 Evaluation on NASA dataset

The NASA dataset contains 200,000 aligned action clips with over 300 categories. In our experiments, we only select a subset with the most frequent action categories, each of which has abundant relevant clips ranging from to , as the training set. We further randomly split the subset into training and test sets with a ratio of for evaluating the effectiveness of DAP3D-Net. Following Eq. (1), we train our deep model by stochastic gradient descent (SGD) with a batch size of 40. The initial base learning rate of DAP3D-Net is , and updated as with the step size of iterations. The optimization is terminated at iterations. Since each action location in NASA is centrally aligned, in the testing phase of this experiment the original test data are directly fed through the DAP3D-Net feedforward pass, without using action proposals, to evaluate action categorization and attributes learning.
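A step-wise learning rate schedule of this kind can be written in a few lines. The sketch below follows the usual step policy (the rate is multiplied by a fixed factor every `step_size` iterations); the function name and arguments are illustrative, and because the NASA training values are not preserved in this copy, the example numbers are the fine-tuning settings reported in Section 4.2 (base rate 0.001, decreased by a factor of 0.1 every 1000 iterations).

```python
def step_learning_rate(base_lr, gamma, step_size, iteration):
    """Learning rate under a step policy: base_lr * gamma^(iteration // step_size)."""
    return base_lr * gamma ** (iteration // step_size)

# Fine-tuning settings from Section 4.2, used here only as an example.
for it in (0, 999, 1000, 2500):
    print(it, step_learning_rate(0.001, 0.1, 1000, it))
```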

To better understand our deep model, two deep nets are trained on NASA with the same architecture as in Fig. 4: (1) the model trained with original RGB data, denoted as DAP3D-Net (RGB), and (2) the model trained with the previously mentioned appearance-motion data, denoted as DAP3D-Net (AM). The action categorization accuracies and the mean Area Under the Curve (AUC) of attribute prediction on the NASA test set are listed in Table 2. Compared with DAP3D-Net (RGB), DAP3D-Net (AM) achieves consistently better performance, since appearance-motion data contain more information on human motion, which is an important factor for action parsing in videos. Fig. 6 illustrates feature maps from the Conv2 layer of DAP3D-Net (AM), in which salient motion patterns are well learned and shared. We regard DAP3D-Net (RGB) and DAP3D-Net (AM) learned from NASA as pre-trained models. In the next sub-section, we evaluate the effectiveness of our method for multiple-action parsing in realistic videos.

Method Accuracy (%) H1 attributes mean AUC H2 attributes mean AUC
DAP3D-Net (RGB) 76.14 0.711 0.758
DAP3D-Net (AM) 80.08 0.753 0.792
Table 2: Action categorization accuracy (%) and attributes prediction AUC on NASA test data.

Figure 8: Illustrative example frames for action parsing on the HAU dataset. The ground-truth location is marked by a red bounding box and the detected bounding box is marked in light green. For attribute prediction, blue indicates a correct prediction, green indicates a missed prediction and red indicates a wrong prediction. (Better to view this figure in color print.)

4.2 Evaluation on HAU dataset

To further evaluate our DAP3D-Net on more challenging and realistic scenarios, we also collect a new Human Action Understanding (HAU) dataset specifically for multiple-action parsing with complex scenes. In particular, the HAU dataset consists of 104 long video sequences, each of which includes several actions performed by different people. The videos from HAU contain 20 categories of actions: baseballpitch, basketballcontrol, basketballshooting, boilrstok, bowandarrow, clapping, crouchwalk, golfswing, jump, karatekick, pointing, pullrope, rightknee, rounhouse, scurryjog, sitting, skiing, soccer, surfing and walk. We annotate all 1461 aligned actions in the HAU videos with bounding boxes, action categories and attributes. For this dataset, we randomly select 40 videos as the training set, which contains all action types, to fine-tune our previously learned DAP3D-Net (RGB) and DAP3D-Net (AM), and then evaluate the action parsing task on the remaining 64 long videos. In detail, we extract 514 aligned action clips from the 40 training videos based on their annotations and then construct the training data following Section 3.2.1 to fine-tune the DAP3D-Nets with iterations (the base learning rate is 0.001 and is decreased by a factor of 0.1 every 1000 iterations). Note that the number of categories used to fine-tune our DAP3D-Net is 21, since we also add background clips, randomly extracted from the non-annotated areas of the long training videos, as an extra category. In this experiment, we evaluate action parsing on HAU in two aspects: action detection and action attributes learning. Following [58, 57], for the precision score, an action detection is counted as correct if at least of its volume size overlaps a ground-truth. For the recall score, a ground-truth is counted as retrieved if at least of its volume size is covered by at least one action detection.
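The volume-overlap criteria above can be sketched as follows. This is only an illustration: detections and ground-truths are assumed to be axis-aligned spatio-temporal cuboids `(x1, y1, t1, x2, y2, t2)`, the function names are hypothetical, and the two overlap fractions are left as parameters (`det_fraction`, `gt_fraction`) because their values are not preserved in this copy.

```python
def volume(c):
    """Volume of a cuboid (x1, y1, t1, x2, y2, t2); zero if degenerate."""
    x1, y1, t1, x2, y2, t2 = c
    return max(0.0, x2 - x1) * max(0.0, y2 - y1) * max(0.0, t2 - t1)

def intersection_volume(a, b):
    """Volume of the overlap between two cuboids."""
    lo = [max(a[i], b[i]) for i in range(3)]
    hi = [min(a[i + 3], b[i + 3]) for i in range(3)]
    return volume((*lo, *hi))

def precision_recall(detections, ground_truths, det_fraction, gt_fraction):
    """Precision: share of detections whose volume overlaps some ground-truth
    by at least det_fraction. Recall: share of ground-truths whose volume is
    covered by some single detection by at least gt_fraction."""
    correct = sum(
        1 for d in detections
        if volume(d) > 0 and any(
            intersection_volume(d, g) >= det_fraction * volume(d)
            for g in ground_truths)
    )
    retrieved = sum(
        1 for g in ground_truths
        if volume(g) > 0 and any(
            intersection_volume(d, g) >= gt_fraction * volume(g)
            for d in detections)
    )
    precision = correct / len(detections) if detections else 0.0
    recall = retrieved / len(ground_truths) if ground_truths else 0.0
    return precision, recall
```

Sweeping the detection score threshold and recomputing these two quantities would yield the precision-recall behaviour summarized by the AP values in Table 1.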

Figure 9: ROC curves for action detection on the HAU dataset.
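The volume-overlap criteria used for the precision and recall scores in this section can be sketched as follows. This is a minimal illustration under several assumptions: each action is approximated by a single spatio-temporal box `(x1, y1, x2, y2, t1, t2)`, all function names are hypothetical, and the two overlap fractions are left as required parameters because their values are not given in the text here.

```python
def volume(box):
    """Volume of a spatio-temporal box (x1, y1, x2, y2, t1, t2); zero if degenerate."""
    x1, y1, x2, y2, t1, t2 = box
    return max(0, x2 - x1) * max(0, y2 - y1) * max(0, t2 - t1)


def intersection_volume(a, b):
    """Volume of the overlap between two spatio-temporal boxes."""
    inter = (max(a[0], b[0]), max(a[1], b[1]),
             min(a[2], b[2]), min(a[3], b[3]),
             max(a[4], b[4]), min(a[5], b[5]))
    return volume(inter)


def precision_recall(detections, ground_truths, det_fraction, gt_fraction):
    """Precision and recall under volume-overlap criteria.

    A detection is correct if at least det_fraction of its own volume overlaps
    some ground truth. A ground truth is retrieved if at least gt_fraction of
    its volume is covered by at least one detection. Both fractions are
    illustrative parameters, not values taken from the paper.
    """
    correct = sum(
        any(intersection_volume(d, g) >= det_fraction * volume(d) for g in ground_truths)
        for d in detections if volume(d) > 0
    )
    retrieved = sum(
        any(intersection_volume(d, g) >= gt_fraction * volume(g) for d in detections)
        for g in ground_truths if volume(g) > 0
    )
    precision = correct / len(detections) if detections else 0.0
    recall = retrieved / len(ground_truths) if ground_truths else 0.0
    return precision, recall


if __name__ == "__main__":
    gts = [(0, 0, 10, 10, 0, 10)]
    dets = [(0, 0, 10, 10, 0, 5), (20, 20, 30, 30, 0, 10)]
    # The fractions below are placeholders chosen only for this toy example.
    print(precision_recall(dets, gts, det_fraction=0.25, gt_fraction=0.25))
```

Note that the two criteria are asymmetric: precision normalizes the overlap by the detection volume, while recall normalizes it by the ground-truth volume.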

The action detection results on HAU are shown in Table 1. Specifically, we first compare our method with four state-of-the-art action detection methods: Tubelet [10], SDPM [45], WSAD [39] and NBMIM [58]. Furthermore, we extract STIP [23] features and dense trajectory features (DTF) [49] from both the annotated actions in the training data and the action proposals in the test data. All the features are then embedded into long representations via VLAD [12] and fed to an SVM. A 2D image-based two-stream deep net [37] pre-trained on UCF101 [40] is compared as well. In addition, C3D [46], a state-of-the-art 3D deep model pre-trained on the Sports-1M dataset, is used as a feature extractor combined with an SVM classifier for action detection. Besides extracting features from C3D, we also report the accuracy obtained directly from the C3D probability layer with the softmax classifier. For our method, we fine-tune DAP3D-Net on HAU with RGB and appearance-motion data, respectively.

From the results, Tubelet produces the best MAP among the non-deep-learned methods, whereas STIP+VLAD+linear SVM yields the lowest MAP of 37.0% in Table 1 due to its less effective feature extraction. For the C3D methods, combining fine-tuned FC7 features with an SVM outperforms the other C3D-related variants (by 5.2%-7.5%), and the SVM classifier leads to a better MAP than directly using softmax via C3D. Our method DAP3D-Net+BBR+Ft+softmax gives the highest MAP of 69.3% among all the compared methods. The results also illustrate that DAP3D-Net (appearance-motion) is superior to DAP3D-Net (RGB), which is also reflected in the AUC of the precision-recall curves in Fig. 9. Moreover, the bounding box regressor (BBR) in DAP3D-Net provides a good location adjustment, yielding more accurate action detection (a 3.4%-5.3% improvement) than DAP3D-Net without BBR. In addition, Table 4 shows the results of DAP3D-Net+BBR+Ft+linear SVM on HAU. Combining an SVM with the extracted FC1 and FC2 features leads to about a 1% improvement over the softmax classifier in DAP3D-Net, but costs much more detection time, since the separate pipeline of deep feature extraction followed by SVM classification is more time-consuming than directly detecting actions via DAP3D-Net with the softmax classifier (i.e., 52.34s on average in total for a 600-frame video).
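Since the detection methods above are compared by MAP, a short sketch of how average precision and its mean over categories are commonly computed may help. The exact protocol behind Table 1 is not stated in this section, so the non-interpolated form below, the function names and the toy inputs are all assumptions made for illustration.

```python
def average_precision(scored_hits, num_ground_truth):
    """Non-interpolated average precision for one action category.

    scored_hits: list of (confidence, is_correct) pairs, one per detection.
    It is assumed that each correct detection matches a distinct ground truth.
    """
    ranked = sorted(scored_hits, key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            true_positives += 1
            # Precision measured at the rank of each correct detection.
            precision_sum += true_positives / rank
    return precision_sum / num_ground_truth if num_ground_truth else 0.0


def mean_average_precision(per_class):
    """Mean of per-category AP; per_class maps name -> (scored_hits, num_ground_truth)."""
    aps = [average_precision(hits, n_gt) for hits, n_gt in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    toy = {
        "jump": ([(0.9, True), (0.8, False), (0.7, True)], 2),
        "walk": ([(0.6, False), (0.5, True)], 1),
    }
    print(mean_average_precision(toy))
```

Whether a detection counts as correct would be decided beforehand by the volume-overlap criterion used for the precision score.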

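| Methods | TrainData | Attribute | | AUC | Hit rate | |
|---|---|---|---|---|---|---|
| DAP3D-Net+Ft (ours) | Original RGB data | Low-level attributes (H1) | | 0.772 | 16/19 | 0.071 |
| DAP3D-Net+Ft (ours) | Original RGB data | High-level attributes (H2) | | 0.814 | 12/14 | 0.054 |
| DAP3D-Net+Ft (ours) | Original RGB data | All attributes (H1+H2) | | 0.726 | 26/33 | 0.084 |
| DAP3D-Net+Ft (ours) | Appearance-motion data | Low-level attributes (H1) | | 0.795 | 17/19 | 0.060 |
| DAP3D-Net+Ft (ours) | Appearance-motion data | High-level attributes (H2) | | 0.847 | 12/14 | 0.042 |
| DAP3D-Net+Ft (ours) | Appearance-motion data | All attributes (H1+H2) | | 0.759 | 28/33 | 0.075 |
| linear SVM | C3D FC6 | Low-level attributes (H1) | - | 0.678 | 14/19 | 0.195 |
| linear SVM | C3D FC6 | High-level attributes (H2) | - | 0.704 | 11/14 | 0.140 |
| linear SVM | C3D+Ft FC7 | Low-level attributes (H1) | - | 0.757 | 17/19 | 0.056 |
| linear SVM | C3D+Ft FC7 | High-level attributes (H2) | - | 0.783 | 11/14 | 0.038 |
| linear SVM | DTF+VLAD | Low-level attributes (H1) | - | 0.602 | 13/19 | 0.133 |
| linear SVM | DTF+VLAD | High-level attributes (H2) | - | 0.631 | 9/14 | 0.098 |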
Table 3: Comparison of attribute prediction results with baselines. We also illustrate that our hierarchical learning scheme outperforms learning all attributes from one layer (i.e., FC2) on the HAU dataset.

After action detection, attribute learning is used to parse how the action occurs. Table 3 compares attribute learning between DAP3D-Net and several baseline methods on the two attribute levels, H1 and H2. In particular, we extract FC6 features using non-fine-tuned C3D from both HAU ground-truth and proposals, and then train an independent linear SVM classifier on the FC6 features for each single attribute. The same procedure is applied to fine-tuned FC7 features from C3D and to DTF+VLAD. In Table 3, multi-output attribute learning via DAP3D-Net generally performs better than independent attribute learning via SVM, and the AUC on H2 is always higher than that on H1. In detail, DAP3D-Net+Ft significantly outperforms the compared methods, with maximum AUC improvements of 32.5% and 34.2% for attribute learning on H1 and H2, respectively. Out of 19 attributes on H1 and 14 attributes on H2, it successfully hits 17 and 12 attributes, respectively. Meanwhile, to evaluate the effectiveness of our two-level hierarchical learning scheme, we also learn all 33 attributes together from the FC2 layer in DAP3D-Net (similar to [36, 35]), which performs worse than learning the attributes hierarchically in terms of both AUC and hit rate. Fig. 7 shows the AUC of each attribute prediction via DAP3D-Net (RGB) and DAP3D-Net (appearance-motion). Some representative frames of action parsing via DAP3D-Net+Ft+BBR are illustrated in Fig. 8. It is observed that DAP3D-Net can accurately localize the actions and simultaneously output the action categories and their motion attributes.
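Because each attribute is treated as a binary prediction, the per-attribute AUC discussed above can be sketched with the rank-based (Mann-Whitney) formulation. This is a hedged illustration: the function names are hypothetical, and averaging the per-attribute AUCs within a level is an assumption about how a single level-wise score might be formed, not something stated in the text.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for one binary attribute.

    Computed as the probability that a randomly chosen positive sample is
    scored above a randomly chosen negative one, with ties counted as 0.5.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def mean_attribute_auc(label_columns, score_columns):
    """Average AUC over a group of attributes (assumed aggregation, e.g. one level)."""
    aucs = [roc_auc(y, s) for y, s in zip(label_columns, score_columns)]
    return sum(aucs) / len(aucs)


if __name__ == "__main__":
    # Two toy attributes evaluated on four clips each.
    labels = [[1, 0, 1, 0], [0, 1, 1, 0]]
    scores = [[0.9, 0.2, 0.4, 0.6], [0.1, 0.8, 0.7, 0.3]]
    print([roc_auc(y, s) for y, s in zip(labels, scores)])
    print(mean_attribute_auc(labels, scores))
```

The same routine applies unchanged whether the scores come from the multi-output network or from independent per-attribute SVM classifiers.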

| Classifier | DAP3D-Net (FC2) | DAP3D-Net (FC1) | DAP3D-Net (FC2) |
|---|---|---|---|
| linear SVM | - | 70.4 (144.21s) | 70.6 (158.67s) |
| softmax | 69.3 (52.34s) | - | - |

Total detection time for “linear SVM” includes action proposal generation [57], the network feedforward pass, deep feature extraction and SVM classification, whereas detection time for “softmax” includes only action proposal generation [57] and the network feedforward pass.

Table 4: MAP (%) of DAP3D-Net+Ft+BBR with softmax vs. linear SVM and the average of total detection time (s) per video on HAU.
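The accuracy versus time trade-off in Table 4 can be made explicit with a few lines of arithmetic. The sketch below only restates the MAP and timing values from the table; the dictionary keys and the function name are illustrative labels.

```python
# MAP (%) and average total detection time (s) per video, copied from Table 4.
TABLE_4 = {
    "softmax": (69.3, 52.34),
    "linear SVM (FC1)": (70.4, 144.21),
    "linear SVM (FC2)": (70.6, 158.67),
}


def tradeoff(results, baseline="softmax"):
    """MAP gain (percentage points) and slowdown factor relative to a baseline."""
    base_map, base_time = results[baseline]
    return {
        name: (map_pct - base_map, seconds / base_time)
        for name, (map_pct, seconds) in results.items()
    }


if __name__ == "__main__":
    for name, (gain, slowdown) in tradeoff(TABLE_4).items():
        print(f"{name}: MAP gain {gain:+.1f} points, {slowdown:.2f}x detection time")
```

With these numbers, the linear SVM variants gain a little over one MAP point while taking roughly 2.8 to 3.0 times as long as the softmax pipeline.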

5 Conclusion

In this paper, we have developed a multi-task 3D deep convolutional network, i.e., DAP3D-Net, to achieve action parsing by simultaneously answering where, what and how actions occur in videos. Moreover, two datasets, NASA and HAU, were contributed for learning and evaluating DAP3D-Net, respectively. Extensive experiments have demonstrated that DAP3D-Net achieves strong performance on action parsing in videos and outperforms state-of-the-art methods on action detection and motion attribute learning. In future work, the attributes learned from DAP3D-Net will be further explored with zero-shot learning for recognizing unseen actions.

References


  • [1] B. Antić and B. Ommer. Video parsing for abnormality detection. In ICCV, 2011.
  • [2] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, 2005.
  • [3] L. Cao, Z. Liu, and T. S. Huang. Cross-dataset action detection. In CVPR, 2010.
  • [4] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
  • [5] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
  • [6] S. Fidler, R. Mottaghi, R. Urtasun, et al. Bottom-up segmentation for top-down detection. In CVPR, 2013.
  • [7] Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Attribute learning for understanding unstructured social activity. In ECCV, 2012.
  • [8] R. Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.
  • [9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [10] M. Jain, J. Van Gemert, H. Jégou, P. Bouthemy, and C. G. Snoek. Action localization with tubelets from motion. In CVPR, 2014.
  • [11] D. Jayaraman and K. Grauman. Zero-shot recognition with unreliable attributes. In NIPS, 2014.
  • [12] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
  • [13] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. T-PAMI, 35(1):221–231, 2013.
  • [14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
  • [15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • [16] Y. Ke, R. Sukthankar, and M. Hebert. Event detection in crowded videos. In ICCV, 2007.
  • [17] A. Kläser, M. Marszałek, C. Schmid, and A. Zisserman. Human focused action localization in video. In Trends and Topics in Computer Vision, pages 219–233, 2012.
  • [18] T. Ko. A survey on behavior analysis in video surveillance for homeland security applications. In Applied Imagery Pattern Recognition Workshop, 2008.
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [20] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: a large video database for human motion recognition. In ICCV, 2011.
  • [21] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.
  • [22] T. Lan, Y. Wang, and G. Mori. Discriminative figure-centric models for joint action localization and recognition. In ICCV, 2011.
  • [23] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
  • [24] C. Liu. Beyond pixels: exploring new representations and applications for motion analysis. PhD thesis, Massachusetts Institute of Technology, 2009.
  • [25] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, 2011.
  • [26] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos in the wild. In CVPR, 2009.
  • [27] L. Liu, L. Shao, X. Zhen, and X. Li. Learning discriminative key poses for action recognition. IEEE Transactions on Cybernetics, 43(6):1860–1870, 2013.
  • [28] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.
  • [29] B. Ni, P. Moulin, and S. Yan. Pose adaptive motion feature pooling for human action analysis. IJCV, 111(2):229–248, 2015.
  • [30] D. Oneata, J. Revaud, J. Verbeek, and C. Schmid. Spatio-temporal object detection proposals. In ECCV, 2014.
  • [31] J. Qin, L. Liu, M. Yu, Y. Wang, and L. Shao. Fast action retrieval from videos via feature disaggregation. In BMVC, 2015.
  • [32] H. Ragheb, S. Velastin, P. Remagnino, and T. Ellis. Vihasi: virtual human action silhouette data for the performance evaluation of silhouette-based action recognition methods. In IEEE International Conference on Distributed Smart Cameras.
  • [33] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497, 2015.
  • [34] C. Schüldt, I. Laptev, and B. Caputo. Recognizing human actions: a local svm approach. In ICPR, 2004.
  • [35] S. Shankar, V. K. Garg, and R. Cipolla. Deep-carving: Discovering visual attributes by carving deep neural nets. In CVPR, 2015.
  • [36] J. Shao, K. Kang, C. C. Loy, and X. Wang. Deeply learned attributes for crowded scene understanding. In CVPR, 2015.
  • [37] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [38] P. Siva and T. Xiang. Action detection in crowd. In BMVC, 2010.
  • [39] P. Siva and T. Xiang. Weakly supervised action detection. In BMVC, 2011.
  • [40] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [41] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2014.
  • [42] D. Tahmoush. Applying action attribute class validation to improve human activity recognition. In CVPR Workshops, 2015.
  • [43] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV, 2010.
  • [44] T. H. Thi, J. Zhang, L. Cheng, L. Wang, and S. Satoh. Human action recognition and localization in video using structured learning of local space-time features. In AVSS, 2010.
  • [45] Y. Tian, R. Sukthankar, and M. Shah. Spatiotemporal deformable part models for action detection. In CVPR.
  • [46] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3d: generic features for video analysis. arXiv preprint arXiv:1412.0767, 2014.
  • [47] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 104(2):154–171, 2013.
  • [48] J. C. van Gemert, M. Jain, E. Gati, and C. G. Snoek. Apt: Action localization proposals from dense trajectories. In BMVC, 2015.
  • [49] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
  • [50] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Learning actionlet ensemble for 3d human action recognition. T-PAMI, 36(5):914–927, 2014.
  • [51] L. Wang, Y. Qiao, and X. Tang. Video action detection with relational dynamic-poselets. In ECCV, 2014.
  • [52] X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object detection. In ICCV, 2013.
  • [53] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. Learning to track for spatio-temporal action localization. arXiv preprint arXiv:1506.01929, 2015.
  • [54] D. Xu, E. Ricci, Y. Yan, J. Song, N. Sebe, and F. B. Kessler. Learning deep representations of appearance and motion for anomalous event detection. 2015.
  • [55] Y. Xu, D. Xu, S. Lin, T. X. Han, X. Cao, and X. Li. Detection of sudden pedestrian crossings for driving assistance systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 42(3):729–739, 2012.
  • [56] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei. Human action recognition by learning bases of action attributes and parts. In ICCV, 2011.
  • [57] G. Yu and J. Yuan. Fast action proposals for human action detection and search. In CVPR, 2015.
  • [58] J. Yuan, Z. Liu, and Y. Wu. Discriminative video pattern search for efficient action detection. T-PAMI, 33(9):1728–1743, 2011.
  • [59] J. Zhang, H. Liu, W. Nie, L. Chaisorn, Y. Wong, and M. S. Kankanhalli. Human action recognition bases on local action attributes. Journal of Electrical Engineering and Technology, 10(3):1264–1274, 2015.
  • [60] S. Zhang, M. Ang Jr, W. Xiao, and C. Tham. Detection of activities for daily life surveillance: Eating and drinking. In International Conference on e-health Networking, Applications and Services, 2008.
  • [61] Z. Zhang, C. Wang, B. Xiao, W. Zhou, and S. Liu. Attribute regularization based human action recognition. IEEE Transactions on Information Forensics and Security, 8(10):1600–1609, 2013.
  • [62] Z. Zhang, C. Wang, B. Xiao, W. Zhou, and S. Liu. Robust relative attributes for human action recognition. Pattern Analysis and Applications, 18(1):157–171, 2013.