Pose-conditioned Spatio-Temporal Attention for Human Action Recognition

03/29/2017 ∙ by Fabien Baradel, et al. ∙ 0

We address human action recognition from multi-modal video data involving articulated pose and RGB frames and propose a two-stream approach. The pose stream is processed with a convolutional model taking as input a 3D tensor holding data from a sub-sequence. A specific joint ordering, which respects the topology of the human body, ensures that different convolutional layers correspond to meaningful levels of abstraction. The raw RGB stream is handled by a spatio-temporal soft-attention mechanism conditioned on features from the pose network. An LSTM network receives input from a set of image locations at each instant. A trainable glimpse sensor extracts features on a set of predefined locations specified by the pose stream, namely the 4 hands of the two people involved in the activity. Appearance features give important cues on hand motion and on objects held in each hand. We show that it is of high interest to shift the attention to different hands at different time steps depending on the activity itself. Finally a temporal attention mechanism learns how to fuse LSTM features over time. We evaluate the method on 3 datasets. State-of-the-art results are achieved on the largest dataset for human activity recognition, namely NTU-RGB+D, as well as on the SBU Kinect Interaction dataset. Performance close to state-of-the-art is achieved on the smaller MSR Daily Activity 3D dataset.




1 Introduction

Figure 1: We recognize human activities by fusing a model trained on pose sub-sequences with a spatio-temporal attention model on RGB video conditioned on pose features.

Human activity recognition is a field with many applications, ranging from video surveillance, HCI and robotics to automated driving. Consumer depth cameras currently dominate indoor applications at close range, as they make it easy to estimate articulated poses. We address similar settings, namely activity recognition problems where articulated pose is available. As complementary information we also use the RGB stream, which provides rich contextual cues on human activities, for instance on the objects held or interacted with.

Recognizing human actions accurately remains a challenging task compared to other problems in computer vision and machine learning. We argue that this is in part due to the lack of large datasets. While large scale datasets have been available for a while for object recognition (ILSVRC [29]) and for general video classification (Sports-1M [16] and lately Youtube8M [1]), the more time-consuming acquisition process for videos showing close range human activities has limited datasets of this type to several hundred or a few thousand videos. As a consequence, the best performing methods on this kind of dataset are either based on handcrafted features or suspected of overfitting on the small datasets after the years the community has spent tuning methods on them. The recent introduction of datasets like NTU-RGB-D [30] (57 000 videos) will hopefully lead to better automatically learned representations.

One of the challenges is the large amount of information in videos. Downsampling is an obvious choice, but using the full resolution at certain positions may help to extract important cues on small or far away objects (or people). In this regard, models of visual attention [26, 7, 33] (see section 2 for a full discussion) have drawn considerable interest recently. Because such models can focus their attention on specific important points, parameters are not wasted on input which is considered of low relevance to the task at hand.

We propose a method for human activity recognition, which addresses this problem by fusing articulated pose and raw RGB input in a novel way. In our approach, pose has three complementary roles: i) it is used as an input stream in its own right, providing important cues for the discrimination of activity classes; ii) raw pose (joints) serves as an input for the model handling the RGB stream, selecting positions where glimpses are taken in the image; iii) features learned on pose serve as an input to the soft-attention mechanism, which weights each glimpse output according to an estimated importance w.r.t. the task at hand, in contrast to unconstrained soft-attention on RGB video [33].

The RGB stream model is recurrent (an LSTM), whereas our pose representation is learned using a convolutional neural network taking as input a sub-sequence of the video. The main benefit is that a pose representation over a large temporal range allows the attention model to assign an estimated importance to each glimpse point and each time instant, taking into account knowledge of this temporal range. As an example, the pose stream might indicate that the hand of one person moves in the direction of another person, which still leaves several possible choices for the activity class. These choices might require attention to be moved to this hand at a specific instant to verify what kind of object is held, which itself may help to discriminate activities.

The contributions of our work are as follows:

  • We propose a way to encode articulated pose data over time into 3D tensors which can be fed to CNNs as an alternative to recurrent neural networks. We propose a particular joint ordering which preserves neighborhood relationships between the joints in the body.

  • We propose a spatial attention mechanism on RGB videos which is conditioned on pose features from the full sub-sequence.

  • We propose a temporal attention mechanism which learns how to pool features output from the recurrent (LSTM) network over time in an adaptive way.

  • As an additional contribution, we experimentally show that knowledge transfer from a large activity dataset like NTU (57 000 videos) to smaller datasets like MSR Daily Activity 3D (320 videos) is possible. To our knowledge, this ImageNet-style transfer has not been attempted on human activities.

Animated videos can be found on the project page (https://fabienbaradel.github.io/pose_rgb_attention_human_action).

2 Related Work

Activities, gestures and multimodal data — Recent gesture/action recognition methods dealing with several modalities typically process 2D+T RGB and/or depth data as 3D. Sequences of frames are stacked into volumes and fed into convolutional layers at first stages [3, 15, 27, 28, 41]. When additional pose data is available, the 3D joint positions are typically fed into a separate network. Preprocessing pose is reported to improve performance in some situations, e.g. augmenting coordinates with velocities and acceleration [47]. Pose normalization (bone lengths and view point normalization) has been reported to help in certain situations [28]. Fusing pose and raw video modalities is traditionally done as late fusion [27], or early through fusion layers [41]. In [21], fusion strategies are learned together with model parameters by stochastic regularization.

Recurrent architectures for action recognition — Most recent activity recognition methods are based on recurrent neural networks in some form. In the Long Short-Term Memory (LSTM) variant [12], a gating mechanism over an internal memory cell learns long-term and short-term dependencies in the sequential input data. Part-aware LSTMs [30] separate the memory cell into part-based sub-cells and let the network learn long-term representations individually for each part, fusing the parts for output. Similarly, Du et al. [8] use bi-directional LSTM layers which fit the anatomical hierarchy. Skeletons are split into anatomically relevant parts (legs, arms, torso, etc.), so that each subnetwork in the first layers gets specialized on one part. Features are progressively merged as they pass through layers.

Multi-dimensional LSTMs [11] are models with multiple recurrences from different dimensions. Originally introduced for images, they also have been applied to activity recognition from pose sequences [23]. One dimension is time, the second is a topological traversal of the joints in a bidirectional depth-first search, which preserves the neighborhood relationships in the graph. Our solution for pose features a similar joint traversal. However, our pose network is convolutional and not recurrent (whereas our RGB network is recurrent).

Attention mechanisms — Human perception focuses selectively on parts of the scene to acquire information at specific places and times. In machine learning, this kind of process is referred to as an attention mechanism, and has drawn increasing interest when dealing with language, images and other data. Integrating attention can potentially lead to improved overall accuracy, as the system can focus on the parts of the data which are most relevant to the task.

In computer vision, visual attention mechanisms date as far back as the work of Itti et al. for object detection [14]. Early models were closely related to saliency maps, i.e. pixelwise weighting of image parts that locally stand out; no learning was involved. Larochelle and Hinton [20] pioneered the incorporation of attention into a learning architecture by coupling Restricted Boltzmann Machines with a foveal representation.

More recently, attention mechanisms were gradually categorized into two classes. Hard attention takes hard decisions when choosing parts of the input data. This leads to stochastic algorithms, which cannot be easily learned through gradient descent and back-propagation. In a seminal paper, Mnih et al. [26] proposed visual hard-attention for image classification built around a recurrent network, which implements the policy of a virtual agent. A reinforcement learning problem is thus solved during learning [40]. The model selects the next location to focus on, based on past information. Ba et al. [2] improved the approach to tackle multiple object recognition. In [19], a hard attention model generates saliency maps. Yeung et al. [44] use hard-attention for action detection with a model which decides both which frame to observe next and when to emit an action prediction.

On the other hand, soft attention takes the entire input into account, weighting each part of the observations dynamically. The objective function is usually differentiable, making gradient-based optimization possible. Soft attention was used for various applications such as neural machine translation [5, 17] or image captioning [42]. Recently, soft attention was proposed for image [7] and video understanding [33, 34, 43], with spatial, temporal and spatio-temporal variants. Sharma et al. [33] proposed a recurrent mechanism for action recognition from RGB data, which integrates convolutional features from different parts of a space-time volume. Yeung et al. report a temporal recurrent attention model for dense labelling of videos [43]. At each time step, multiple input frames are integrated and soft predictions are generated for multiple frames. Bazzani et al. [6] learn spatial saliency maps represented by mixtures of Gaussians, whose parameters are included in the internal state of an LSTM network. Saliency maps are then used to smoothly select areas with relevant human motion. Song et al. [34] propose separate spatial and temporal attention networks for action recognition from pose. At each frame, the spatial attention model gives more importance to the joints most relevant to the current action, whereas the temporal model selects frames.

To our knowledge, no attention model has yet taken advantage of both articulated pose and RGB data simultaneously. Our method has slight similarities with hard attention in that hard choices are taken on locations in each frame. However, these choices are not learned; they depend on pose. On the other hand, we learn a soft-attention mechanism, which dynamically weights features from several locations. The mechanism is conditional on pose, which allows it to steer its focus depending on motion.

Figure 2: (a) The topological ordering of joints (similar to [23]): blue arrows visit joints for the first time and orange arrows go back to the “middle spine”. (b) The ordering is reproduced in the matrix input to the pose learner.

3 Proposed Model

A single or multi-person activity is described by a sequence of two modalities: the set of RGB input images and the set of articulated human poses. We do not use raw depth data in our method, although the extension would be straightforward. Both signals are indexed by time. Poses are defined by 3D coordinates of joints, for instance delivered by the middleware of a depth camera. The sheer amount of data per input sequence makes it difficult to train a classical (convolutional or recurrent) model directly on the sequence of inputs to predict activity classes. We propose a two-stream model, which classifies activity sequences by extracting features from articulated human poses and RGB frames.

3.1 Convolutional pose features

At each time step, a subject is represented by the 3D coordinates of its body joints. In our case we restrict our application to activities involving one or two people and their interactions. The goal is to extract features which model i) the temporal behavior of the pose(s) and ii) correlations between different joints. An attention mechanism on poses could be an option, similar to [34]. We argue that the available pose information is sufficiently compact to learn a global representation and show that this is efficient. However, we also argue for the need to find a hierarchical representation which respects the spatio-temporal relationships of the data. In the particular case of pose data, joints also have strong neighborhood relationships in the human body.

Along the lines of [23], we define a topological ordering of the joints in a human body as a connected cyclic path over joints (see figure 2a). The path itself is not Hamiltonian, as each node can be visited multiple times: once during a forward pass over a limb, and once during a backward pass over the limb back to the joint it is attached to. The double entries in the path are important, since they ensure that the path preserves neighborhood relationships.
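As a minimal sketch of such a traversal, the following Python snippet walks a small tree of joints depth-first and re-enters a joint every time the walk returns from one of its children. The skeleton, the joint names and the function `topological_path` are hypothetical and chosen for illustration; the exact ordering of figure 2a may differ.

```python
def topological_path(children, root):
    """Depth-first walk that records a joint on entry and again each time
    the walk returns to it from a child (the "double entries")."""
    path = [root]
    for child in children.get(root, []):
        path.extend(topological_path(children, child))
        path.append(root)  # backward pass: return to the parent joint
    return path


# Hypothetical 6-joint skeleton (not the Kinect joint set): a spine with
# a head and two short arms.
toy_skeleton = {
    "spine": ["head", "l_shoulder", "r_shoulder"],
    "l_shoulder": ["l_hand"],
    "r_shoulder": ["r_hand"],
}

path = topological_path(toy_skeleton, "spine")
print(path)
```

In the resulting path, any two consecutive entries are connected by a bone, and the path starts and ends at the root joint, which is what makes it cyclic.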

In [23], a similar path is used to define an order in a multi-dimensional LSTM network. In contrast, we propose a convolutional model which takes three-dimensional inputs (tensors) calculated by concatenating pose vectors over time. In particular, input tensors have three dimensions: a time index, a joint and coordinate index, and a feature index (see figure 2b). Each row corresponds to a time instant; the first three columns correspond to the x, y and z coordinates of the first joint, followed by the x, y and z coordinates of the second joint, which is a neighbor of the first, etc. The first channel corresponds to raw coordinates, the second channel to first derivatives of the coordinates (velocities), and the third channel to second derivatives (accelerations). Poses of two people are stacked into a single tensor along the second dimension. This choice of tensor organization will be justified further below.
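The tensor layout can be sketched in plain Python as follows. This is an illustration under assumptions, not the original implementation: the function name `pose_tensor`, the nested-list representation, the backward finite differences and the zero padding of the first frame are all choices made here.

```python
def pose_tensor(frames, path):
    """frames: list over time of {joint_name: (x, y, z)}.
    Returns tensor[t][c][k] with k = 0 raw, 1 velocity, 2 acceleration."""
    # One row per time step: x, y, z of each joint, in path order.
    raw = [[coord for joint in path for coord in frame[joint]]
           for frame in frames]

    def diff(rows):
        # Backward finite difference; the first row is zero (assumption).
        out = [[0.0] * len(rows[0])]
        for prev, cur in zip(rows, rows[1:]):
            out.append([b - a for a, b in zip(prev, cur)])
        return out

    vel = diff(raw)   # first derivative of the coordinates
    acc = diff(vel)   # second derivative of the coordinates
    return [[[r, v, a] for r, v, a in zip(raw_t, vel_t, acc_t)]
            for raw_t, vel_t, acc_t in zip(raw, vel, acc)]


# Toy example (hypothetical): joint "a" moves along x as t^2, "b" is static.
frames = [{"a": (float(t * t), 0.0, 0.0), "b": (1.0, 2.0, 3.0)}
          for t in range(3)]
tensor = pose_tensor(frames, ["a", "b", "a"])  # "a" is a double entry
print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

For two people, the columns of the second person would simply be appended to each row, which corresponds to stacking along the second dimension.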

We learn a pose network with its own trainable parameters on this input; it maps the input tensor to a pose feature representation.


Here and in the rest of the paper, subscripts of mappings and their parameters select a specific mapping; they are not indices. Subscripts of variables and tensors are indices.

The pose network is implemented as a convolutional neural network alternating convolutions and max-pooling. Combined with the topological ordering of the columns of the input tensor, this leads to a specific hierarchical representation of the feature maps. The first layer of convolutions will extract features from the correlations between coordinates, mostly of the same joints (or neighboring joints). Subsequent convolutions will extract features between neighboring joints, and even higher layers in the network correspond to extractions of features which are further away in the human body, in the sense of path lengths in the graph. The last layers correspond to features extracted between the two different poses corresponding to two different people.

One design choice of this representation is to stack different coordinates of the same joint into subsequent columns of the tensor, as opposed to the alternative of distributing them over different channels. This ensures that the first layer calculates features on different coordinates. Experiments have confirmed the value of this choice. The double entries in the input tensor artificially increase its size, as some joints are represented multiple times. However, this cost is compensated by the fact that the early convolutional layers extract features on joint pairs which are neighbors in the graph (in the human body).

3.2 Spatial Attention on RGB videos

The sequence of RGB input images is arguably not compact enough to easily extract an efficient global representation with a feed-forward neural network. We opt for a recurrent solution, where, at each time instant, a glimpse on the seen input is selected using an attention mechanism.

Similar in some respects to [26], we define a trainable bandwidth-limited sensor. However, in contrast to [26], our attention process is conditional on the pose input and thus limited to a set of discrete attention points. In our experiments, we selected 4 attention points, namely the hand joints of the two people involved in the interaction. The goal is to extract additional information about hand shape and about manipulated objects. A large number of activities such as Reading, Writing, Eating, Drinking are similar in motion but can be highly correlated to manipulated objects. As the glimpse location is not output by the network, this results in a differentiable soft-attention mechanism, which can be trained by gradient descent.

Figure 3: The spatial attention mechanism

The glimpse representation for a given attention point is computed by a convolutional network with its own parameters, taking as input a crop of the current image at the position of one hand joint from the set of attention points.

Here, the output is a (column) feature vector for one time instant and one hand. For a given time instant, we stack these vectors into a matrix whose first index runs over hand joints and whose second index runs over features. It is a matrix (a 2D tensor), since time is fixed for a given instant.
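The cropping step that precedes feature extraction can be sketched as below. The snippet assumes that hand joints have already been projected to pixel coordinates and that windows reaching over the image border are shifted back inside; both the function `crop_glimpse` and this border policy are assumptions made for illustration. The convolutional feature extractor itself is not shown.

```python
def crop_glimpse(image, center, size):
    """Cut a size x size window around center = (row, col).
    image is a list of rows, each row a list of pixel values."""
    height, width = len(image), len(image[0])
    # Shift the window so that it stays fully inside the image (assumption).
    top = min(max(center[0] - size // 2, 0), height - size)
    left = min(max(center[1] - size // 2, 0), width - size)
    return [row[left:left + size] for row in image[top:top + size]]


# Toy 6 x 8 "image" whose pixel value encodes its position (10 * row + col).
image = [[10 * r + c for c in range(8)] for r in range(6)]

# Hypothetical pixel positions of the 4 hands of two people.
hand_positions = [(3, 3), (0, 7), (5, 0), (2, 6)]
glimpses = [crop_glimpse(image, pos, 4) for pos in hand_positions]
print(glimpses[0][0], glimpses[1][0])
```

Each of the four crops would then be resized and passed through the glimpse network to obtain one feature vector per hand.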

A recurrent model receives inputs from the glimpse sensor sequentially and models the information from the seen sequence with a componential hidden state.


We chose a fully gated LSTM model including input, forget and output gates and a cell state. To keep the notation simple, we omitted the gates and the cell state from the equations. The input to the LSTM network is the context vector, defined further below, which corresponds to an integration of the different attention points (hands).

An obvious choice of integration are simple functions like sums and concatenations. While the former tends to squash feature dynamics by pooling strong feature activations in one hand with average or low activations in other hands, the latter leads to high capacity models with low generalization. The soft-attention mechanism dynamically weighs the integration process through a distribution over hands, which determines how much attention each hand needs through a calculated weight. In contrast to unconstrained soft-attention mechanisms on RGB video [33], our attention distribution depends not only on the LSTM state, but also on the pose features extracted from the sub-sequence, through a learned mapping with its own parameters.

The attention distribution and the glimpse features are integrated through a linear combination, which is input to the LSTM network at each time instant (see eq. (3)). The conditioning on the pose features in eq. (4) is important, as it provides valuable context derived from motion. Note that the recurrent model itself (eq. (3)) is not conditional [25], as this would significantly increase the number of parameters.
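A minimal sketch of this pose-conditioned weighting is given below. It assumes a one-hidden-layer MLP with sigmoid activation followed by a softmax over the four hands, fed with the concatenation of the previous LSTM state and the pose features; all names (`spatial_attention`, `h_prev`, `s`, `V`), the toy dimensions and the random parameters are hypothetical.

```python
import math
import random


def affine(W, b, x):
    """Return W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def spatial_attention(h_prev, s, V, params):
    """h_prev: LSTM state, s: pose features, V: one feature vector per hand."""
    W1, b1, W2, b2 = params
    hidden = [1.0 / (1.0 + math.exp(-a)) for a in affine(W1, b1, h_prev + s)]
    p = softmax(affine(W2, b2, hidden))  # one weight per hand
    # Context vector: attention-weighted sum of the glimpse features.
    context = [sum(p_i * v[k] for p_i, v in zip(p, V))
               for k in range(len(V[0]))]
    return p, context


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes: state 3, pose features 2, hidden layer 5, 4 hands, 3 features.
h_prev, s = [0.1, -0.2, 0.3], [0.5, -0.5]
V = rand_matrix(4, 3)
params = (rand_matrix(5, 5), [0.0] * 5, rand_matrix(4, 5), [0.0] * 4)
p, context = spatial_attention(h_prev, s, V, params)
print(p, context)
```

Because the weights are positive and sum to one, the context vector is a convex combination of the four glimpse vectors, which avoids both the squashing effect of a plain sum and the parameter growth of a concatenation.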

Figure 4: The full recurrent model for RGB data (gates and memory cell are not shown). Pose is input to the attention mechanism. The spatial mechanism is detailed in figure 3.
Figure 5: Spatial attention over time: putting an object into the pocket of someone will make the attention shift to this hand.

3.3 Temporal Attention

Recurrent models can provide predictions for each time step. Most current work in sequence classification proceeds by temporal pooling of these predictions, e.g. through a sum or average [33]. We show that it can be important to perform this pooling in an adaptive way. In recent work on dense activity labelling, temporal attention for dynamical pooling of LSTM logits has been proposed [43]. In contrast, we perform temporal pooling directly at the feature vector level. In particular, at each instant, features are calculated by a learned mapping given the current hidden state.


The features for all instants of the sub-sequence are stacked into a matrix, with one index over time and one over the feature dimension. A temporal attention distribution is predicted through a learned mapping. To be efficient, this mapping should have seen the full sub-sequence before giving a prediction for an instant t, as giving a low weight to features at the beginning of a sequence might be caused by the need to give higher weights to features at the end. In the context of sequence-to-sequence alignment, this has been addressed with bi-directional recurrent networks [4]. To keep the model simple, we benefit from the fact that (sub) sequences are of fixed length and that spatial attention information is already available. We conjecture that (combined with pose) the spatial attention distributions over time t are a good indicator for temporal attention, and stack them into a single vector, which is input into the network predicting temporal attention.

This attention is used as the weight for adaptive temporal pooling of the features, i.e. the pooled representation is the attention-weighted sum of the per-instant features.
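The adaptive temporal pooling can be sketched as follows, assuming (as for the spatial case) a one-hidden-layer MLP with sigmoid activation and a softmax over time steps. The names `temporal_attention`, `temporal_pool`, `P`, `U` and `s`, the toy sizes and the random parameters are hypothetical.

```python
import math
import random


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def temporal_attention(P, s, params):
    """P: one spatial attention distribution per time step; s: pose features."""
    W1, b1, W2, b2 = params
    x = [p for p_t in P for p in p_t] + s  # stack all distributions, add pose
    hidden = [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + b)))
              for row, b in zip(W1, b1)]
    scores = [sum(w * v for w, v in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return softmax(scores)  # one weight per time step


def temporal_pool(U, weights):
    """Attention-weighted sum of the per-instant feature vectors in U."""
    return [sum(w * u[k] for w, u in zip(weights, U))
            for k in range(len(U[0]))]


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes: 4 time steps, 4 hands, pose features 2, hidden 6, features 3.
T, hands, pose_dim, hidden_dim, feat_dim = 4, 4, 2, 6, 3
P = [softmax(row) for row in rand_matrix(T, hands)]
s = [0.5, -0.5]
U = rand_matrix(T, feat_dim)
params = (rand_matrix(hidden_dim, T * hands + pose_dim), [0.0] * hidden_dim,
          rand_matrix(T, hidden_dim), [0.0] * T)
weights = temporal_attention(P, s, params)
pooled = temporal_pool(U, weights)
print(weights, pooled)
```

With uniform weights, `temporal_pool` reduces to the plain temporal average used in earlier work, so the learned weighting is a strict generalization of it.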

Figure 6: Spatial and temporal attention over time: giving something to another person makes the attention shift to the active hands in the action.

3.4 Stream fusion

Each stream, pose and RGB, leads to its own set of features, with the particularity that pose features are input to the attention mechanism for the RGB stream. Each representation is classified with its own set of parameters. We fuse both streams on logit level. More sophisticated techniques, which learn fusion [28], do not seem to be necessary.
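Fusion at logit level can be illustrated with a few lines of Python. The text does not state the exact fusion operator, so the element-wise sum used here is an assumption, as are the function names and the toy logits.

```python
def fuse_logits(pose_logits, rgb_logits):
    # Assumption: the two logit vectors are added element-wise.
    return [a + b for a, b in zip(pose_logits, rgb_logits)]


def predict(pose_logits, rgb_logits):
    """Return the index of the class with the highest fused logit."""
    fused = fuse_logits(pose_logits, rgb_logits)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy logits for 3 classes: the pose stream alone favors class 0,
# the RGB stream favors class 1.
pose_logits = [2.0, 1.0, 0.1]
rgb_logits = [0.2, 1.5, 0.3]
print(predict(pose_logits, rgb_logits))
```

In this toy case the fused decision differs from the decision of the pose stream alone, which is the situation where a second stream pays off.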

4 Network architectures and Training

Architectures — The pose network consists of 3 convolutional layers. Max pooling is employed after each convolutional layer; activations are ReLU.

The glimpse sensor is implemented as an Inception V3 network [35]. Each glimpse feature vector corresponds to the last layer before output and is of size 2048. The LSTM network has a single recurrent layer with 1024 units. The spatial attention network is an MLP with a single hidden layer of 256 units and sigmoid activation. The temporal attention network is an MLP with a single hidden layer of 512 units and sigmoid activation. The feature extractor is a single linear layer with ReLU activation. The output layers of both stream representations are linear layers followed by softmax activation. The full model (without the glimpse sensor) has 38 million trainable parameters.

Training — All classification outputs are softmax activated and trained with cross-entropy loss. The glimpse sensor is trained on the ILSVRC 2012 data [29]. The pose learner is trained discriminatively with an additional linear+softmax layer to predict action classes. The RGB stream model is trained with the pose parameters and the glimpse parameters frozen. End-to-end training of the model did not result in better performance.

5 Experiments

The proposed method has been evaluated on three datasets: NTU RGB+D, MSR Daily Activity 3D and SBU Kinect Interaction. We extensively tested on NTU and we show two transfer experiments on the smaller datasets SBU and MSR.

NTU RGB+D Dataset (NTU) [30] — The largest dataset for human activity recognition has been acquired with a Kinect v2 sensor and contains more than 56K videos and 4 million frames with 60 different activities, including individual activities, interactions between 2 people and health-related events. The actions have been performed by 40 subjects and recorded from 80 viewpoints. We follow the cross-subject and cross-view split protocol from [30].

MSR Daily Activity3D Dataset (MSR) [38] — This dataset is among the most challenging benchmarks due to a high level of intra-class variation. It consists of 320 videos shot with a Kinect v1 sensor. 16 daily activities are performed twice each by 10 subjects from a single viewpoint. Following [38], we use videos from subjects 1, 3, 5, 7 and 9 for training, and the remaining ones for testing.

SBU Kinect Interaction Dataset (SBU) [45] — This interaction dataset features two subjects, with a total of 282 sequences (6822 frames) and 8 mutual activity classes shot with a Kinect v1 sensor. We follow the standard experimental protocol [45], which consists of 5-fold cross validation.

The MSR and SBU datasets are extremely challenging for methods performing representation learning, as only a few videos are available for training (160 and 225, respectively).

Methods                     Pose  RGB  CS    CV    Avg
Lie Group [37]              X     -    50.1  52.8  51.5
Skeleton Quads [9]          X     -    38.6  41.4  40.0
Dynamic Skeletons [13]      X     -    60.2  65.2  62.7
HBRNN [8]                   X     -    59.1  64.0  61.6
Deep LSTM [30]              X     -    60.7  67.3  64.0
Part-aware LSTM [30]        X     -    62.9  70.3  66.6
ST-LSTM + TrustG. [23]      X     -    69.2  77.7  73.5
STA-LSTM [34]               X     -    73.4  81.2  77.2
JTM [39]                    X     -    76.3  81.1  78.7
DSSCA - SSLM [31]           X     X    74.9  -     -
Ours (pose only)            X     -    77.1  84.5  80.8
Ours (RGB only)             -     X    75.6  80.5  78.1
Ours (pose + RGB)           X     X    84.8  90.6  87.7

Table 1: Results on the NTU RGB+D dataset with Cross-Subject (CS) and Cross-View (CV) settings (accuracies in %).

Methods                     Pose  RGB  Depth  Acc.
Raw skeleton [45]           X     -    -      49.7
Joint feature [45]          X     -    -      80.3
Raw skeleton [46]           X     -    -      79.4
Joint feature [46]          X     -    -      86.9
Co-occurrence RNN [49]      X     -    -      90.4
STA-LSTM [34]               X     -    -      91.5
ST-LSTM + Trust Gate [23]   X     -    -      93.3
DSPM [22]                   -     X    X      93.4
Ours (Pose only)            X     -    -      90.5
Ours (RGB only)             -     X    -      72.0
Ours (Pose + RGB)           X     X    -      94.1

Table 2: Results on SBU Kinect Interaction dataset (accuracies in %).

Methods                     Pose  RGB  Depth  Acc.
Action Ensemble [38]        X     -    -      68.0
Efficient Pose-Based [10]   X     -    -      73.1
Moving Pose [47]            X     -    -      73.8
Moving Poselets [36]        X     -    -      74.5
Depth Fusion [48]           -     -    X      88.8
MMMP [32]                   X     -    X      91.3
DL-GSGC [24]                X     -    X      95.0
DSSCA - SSLM [31]           -     X    X      97.5
Ours (Pose only)            X     -    -      74.6
Ours (RGB only)             -     X    -      75.3
Ours (Pose + RGB)           X     X    -      90.0

Table 3: Results on MSR Daily Activity 3D dataset (accuracies in %).

Methods                                 CS    CV    Avg
Random joint order                      75.5  83.2  79.4
Topological order w/o double entries    76.2  83.9  80.0
Topological order                       77.1  84.5  80.8

Table 4: Results on NTU: pose only, effect of joint ordering (accuracies in %).

Methods       Attention conditional to pose   CS    CV    Avg
RGB only      -                               66.5  72.0  69.3
RGB only      X                               75.6  80.5  78.1
Multi-modal   -                               83.9  90.0  87.0
Multi-modal   X                               84.8  90.6  87.7

Table 5: Results on NTU: conditioning the attention mechanism on pose (RGB only, accuracies in %).

ID Methods                                       Pose  RGB  Att. spatial  Att. temporal  Att. pose  CS    CV    Avg
A  Pose only                                     X     -    -             -              -          77.1  84.5  80.8
B  RGB only, no attention (sum of features)      -     X    -             -              -          61.5  65.9  63.7
C  RGB only, no attention (concat of features)   -     X    -             -              -          63.2  67.2  65.2
E  RGB only + spatial attention                  o     X    X             -              X          67.4  71.2  69.3
G  RGB only + spatio-temporal attention          o     X    X             X              X          75.6  80.5  78.1
H  Multi-modal, no attention (A+B)               X     X    -             -              -          83.0  88.5  85.3
I  Multi-modal, spatial attention (A+E)          X     X    X             -              X          84.1  90.0  87.1
K  Multi-modal, spatio-temporal attention (A+G)  X     X    X             X              X          84.8  90.6  87.7

Table 6: Results on NTU: effect of attention (accuracies in %). The mark o means that pose is only used for the attention mechanism.

Implementation details — Following [30], we cut videos into sub-sequences of 20 frames and sample sub-sequences. During training, a single sub-sequence is sampled; during testing, 10 sub-sequences are sampled and their logits are averaged. We apply a normalization step on the joint coordinates by translating them to a body centered coordinate system with the “middle of the spine” joint as the origin (gray joint in figure 2). If only one subject is present in a frame, we set the coordinates of the second subject to zero. We crop sub-images of static size at the positions of the hand joints; the crop size used for NTU differs from the one used for MSR and SBU. Cropped images are then resized and fed into the Inception model.
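The pose normalization and the test-time averaging described above can be sketched as follows. The joint name `middle_spine`, the dictionary representation of a pose and the helper names are assumptions made for this illustration.

```python
def normalize_pose(frame, origin="middle_spine"):
    """Translate all joints so that the origin joint sits at (0, 0, 0)."""
    ox, oy, oz = frame[origin]
    return {joint: (x - ox, y - oy, z - oz)
            for joint, (x, y, z) in frame.items()}


def zero_pose(joint_names):
    """Placeholder pose for a missing second subject: all coordinates zero."""
    return {joint: (0.0, 0.0, 0.0) for joint in joint_names}


def average_logits(logits_per_subsequence):
    """Average the class logits over the sampled sub-sequences."""
    n = len(logits_per_subsequence)
    return [sum(column) / n for column in zip(*logits_per_subsequence)]


# Toy frame with two joints (hypothetical coordinates).
frame = {"middle_spine": (0.5, 1.0, 1.0), "r_hand": (1.0, 2.0, 3.0)}
person_1 = normalize_pose(frame)
person_2 = zero_pose(frame.keys())  # only one subject is visible

# Toy logits of two sub-sequences for two classes.
averaged = average_logits([[1.0, 3.0], [3.0, 1.0]])
print(person_1, person_2, averaged)
```

At test time, `average_logits` would receive the logits of the 10 sampled sub-sequences of a video.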

Training is done using the Adam optimizer [18] with an initial learning rate of 0.0001. We use minibatches of size 64 and dropout with a probability of 0.5. Following prior work, we sample 5% of the initial training set as a validation set, which is used for hyper-parameter optimization and for early stopping. All hyper-parameters have been optimized on the validation sets of the respective datasets.

When transferring knowledge from NTU to MSR and SBU, the target networks were initialized with models pre-trained on NTU. Skeleton definitions are different and were adapted. All layers were finetuned on the smaller datasets with an initial learning rate 10 times smaller than the learning rate for pre-training.

Comparisons to the state-of-the-art — We show comparisons of our models to the state-of-the-art methods in table 1, table 2 and table 3, respectively. We achieve state-of-the-art performance on the NTU dataset with the pose stream alone or with the full model fusing both streams. On the SBU dataset, we obtain state-of-the-art performance with the full model; on the MSR dataset we are close.

As mentioned, the reported performances on the SBU and MSR datasets include a knowledge transfer from the NTU dataset. Results on MSR show the difficulty of training a fully learned representation on a tiny dataset. We outperform all methods in the first group of table 3, which correspond to hand-crafted approaches.

We conducted extensive ablation studies to understand the impact of our design choices.

Joint ordering — The joint ordering in the input tensor has an effect on performance, as shown in table 4. Following the topological order described in section 3.1 gains 1.4 percentage points on average on the NTU dataset w.r.t. a random joint order (80.8 vs. 79.4), which confirms the value of a meaningful hierarchical representation. As anticipated, keeping the redundant double joint entries in the tensors gives an advantage, although it increases the number of trainable parameters.

The effect of the attention mechanism — The attention mechanism on RGB data has a significant impact in terms of performance, as shown in table 6. We compare it to baselines summing (B) or concatenating (C) features. In these cases, hyper-parameters were optimized for these meta-architectures. The performance margin is particularly high in the case of the single stream RGB model (methods E and G). In the case of the multi-modal (two-stream) models, the advantage of attention is still high but not as high as for RGB alone. A part of the gain of the attention process seems to be complementary to the information in the pose stream, and it cannot be excluded that in the one stream setting a (small) part of the pose information is translated into direct cues for discrimination through an innovative (but admittedly not originally planned) use of the attention mechanism. However, the gain is still significant, with 2.4 percentage points on average compared to the baseline (87.7 for K vs. 85.3 for H).

Figure 5 shows an example of the effect of the spatial attention process: during the activity of Putting an object into the pocket of somebody, the attention shifts to the “putting” hand at the point where the object is actually put.

Pose-conditioned attention mechanism — Making the spatial attention model conditional on the pose features is confirmed to be a key design choice, as can be seen in table 5. In the multi-modal setting, a full point is gained; in the RGB-only case, the gain is 12 points.
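The difference between the conditioned and unconditioned variants can be illustrated with a small sketch. It assumes that the attention scores come from a single linear layer applied to the recurrent hidden state, optionally concatenated with a pose feature vector. The actual scoring network, its inputs, and the name `attention_weights` are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(hidden, weight_rows, pose_feat=None):
    """Return one attention weight per glimpse location (e.g. per hand).

    hidden:      recurrent state, list of floats.
    weight_rows: one row of linear-layer weights per glimpse location
                 (hypothetical parameters, assumed already trained).
    pose_feat:   pose features; None gives the unconditioned variant.
    """
    inputs = list(hidden) + (list(pose_feat) if pose_feat is not None else [])
    for row in weight_rows:
        if len(row) != len(inputs):
            raise ValueError("each weight row must match the input length")
    scores = [sum(w * x for w, x in zip(row, inputs)) for row in weight_rows]
    return softmax(scores)
```

In this sketch the pose features shift the scores directly, so the weights can favour a given hand even when the hidden state alone carries no preference.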

Runtime — For a sub-sequence of 20 frames, we get the following runtimes for a single Titan-X (Maxwell) GPU and an i7-5930 CPU: A full prediction from features takes ms including pose feature extraction. This does not include RGB pre-processing, which takes an additional 1 sec (loading Full-HD video, cropping sub-windows and extracting Inception features). Classification can thus be done close to real-time. Fully training one model (w/o Inception) takes 4h on a Titan-X GPU. Hyper-parameters have been optimized on a computing cluster with 12 Titan-X GPUs. The proposed model has been implemented in Tensorflow.

6 Conclusion

We propose a general method for dealing with pose and RGB video data for human action recognition. A convolutional network on pose data processes specifically organized input tensors. A soft-attention mechanism over crops around hand joints allows the model to collect relevant features on hand shape and on manipulated objects. Adaptive temporal pooling further increases performance. Our method shows state-of-the-art results on several benchmarks and, to our knowledge, is the first method performing attention on pose and RGB and the first method performing knowledge transfer in human action recognition.


  • [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arxiv, 1609.08675, 2016.
  • [2] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015.
  • [3] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt.

    Sequential deep learning for human action recognition.

    In HBU, 2011.
  • [4] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
  • [5] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
  • [6] L. Bazzani, H. Larochelle, and L. Torresani. Recurrent mixture density network for spatiotemporal visual attention. In ICLR, 2017 (to appear).
  • [7] K. Cho, A. Courville, and Y. Bengio. Describing multimedia content using attention-based encoder-decoder networks. IEEE-T-Multimedia, 17:1875 – 1886, 2015.
  • [8] Y. Du, W. Wang, and L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In CVPR, June 2015.
  • [9] G. Evangelidis, G. Singh, and R. Horaud. Skeletal quads:human action recognition using joint quadruples. In ICPR, pages 4513–4518, 2014.
  • [10] A. Eweiwi, M. S. Cheema, C. Bauckhage, and J. Gall. Efficient pose-based action recognition. In ACCV, pages 428–443, 2014.
  • [11] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In NIPS, 2009.
  • [12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [13] J. Hu, W.-S. Zheng, J.-H. Lai, and J. Zhang. Jointly learning heterogeneous features for rgb-d activity recognition. In CVPR, pages 5344–5352, 2015.
  • [14] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI, 20(11):1254–1259, 1998.
  • [15] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE TPAMI, 35(1):221–231, 2013.
  • [16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • [17] Y. Kim, C. Denton, L. Hoang, and A. Rush. Structured attention networks. In ICLR, 2017 (to appear).
  • [18] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICML, 2015.
  • [19] J. Kuen, Z. Wang, and G. Wang. Recurrent attentional networks for saliency detection. In CVPR, pages 3668–3677, 2015.
  • [20] H. Larochelle and G. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. In NIPS, pages 1243–1251, 2010.
  • [21] F. Li, N. Neverova, C. Wolf, and G. Taylor. Modout: Learning to Fuse Face and Gesture Modalities with Stochastic Regularization. In FG, 2017.
  • [22] L. Lin, K. Wang, W. Zuo, M. Wang, J. Luo, and L. Zhang. A deep structured model with radius-margin bound for 3d human activity recognition. IJCV, 2015.
  • [23] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In ECCV, pages 816–833, 2016.
  • [24] J. Luo, W. Wang, and H. Qi. Group sparsity and geometry constrained dictionary learning for action recognition from depth maps. In ICCV, 2013.
  • [25] T. Mikolov and G. Zweig. Context dependent recurrent neural network language model. In Spoken Language Technology Workshop, 2016.
  • [26] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, pages 2204–2212, 2014.
  • [27] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In CVPR, pages 4207–4215, 2016.
  • [28] N. Neverova, C. Wolf, G. Taylor, and F. Nebout. Moddrop: adaptive multi-modal gesture recognition. IEEE TPAMI, 38(8):1692–1706, 2016.
  • [29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
  • [30] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In CVPR, pages 1010–1019, 2016.
  • [31] A. Shahroudy, T.-T. Ng, Y. Gong, and G. Wang. Deep multimodal feature analysis for action recognition in rgb+d videos. In arXiv, 2016.
  • [32] A. Shahroudy, T.-T. Ng, Q. Yang, and G. Wang. Deep multimodal feature analysis for action recognition in rgb+d videos. In TPAMI, 2016.
  • [33] S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. ICLR Workshop, 2016.
  • [34] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu. An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data. In AAAI Conf. on AI, 2016.
  • [35] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
  • [36] L. Tao and R. Vidal. Moving poselets: A discriminative and interpretable skeletal motion representation for action recognition. In ICCV Workshops, pages 303–311, 2015.
  • [37] R. Vemulapalli, F. Arrate, and R. Chellappa. Human action recognition by representing 3d skeletons as points in a lie group. In CVPR, pages 588–595, 2014.
  • [38] J. Wang, Z. Liu, Y. Wu, and J. Yuan. Mining actionlet ensemble for action recognition with depth cameras. In CVPR, pages 1290–1297, 2012.
  • [39] P. Wang, W. Li, C. Li, and Y. Hou. Action Recognition Based on Joint Trajectory Maps with Convolutional Neural Networks. In ACM Conference on Multimedia, 2016.
  • [40] R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 2012.
  • [41] D. Wu, L. Pigou, P.-J. Kindermans, N. D.-H. Le, L. Shao, J. Dambre, and J. Odobez. Deep dynamic neural networks for multimodal gesture segmentation and recognition. IEEE TPAMI, 38(8):1583–1597, 2016.
  • [42] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057, 2015.
  • [43] S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, and L. Fei-Fei. Every moment counts: Dense detailed labeling of actions in complex videos. arXiv preprint arXiv:1507.05738, 2015.
  • [44] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end Learning of Action Detection from Frame Glimpses in Videos. In CVPR, 2016.
  • [45] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras. Two-person interaction detection using body-pose features and multiple instance learning. In CVPR Workshop, pages 28–35, 2012.
  • [46] K. Yun, J. Honorio, D. Chattopadhyay, T. L. Berg, and D. Samaras. Interactive body part contrast mining for human inter- action recognition. In ICMEW, 2014.
  • [47] M. Zanfir, M. Leordeanu, and C. Sminchisescu. The moving pose: An efficient 3d kinematics descriptor for low-latency action recognition and detection. In ICCV, pages 2752–2759, 2013.
  • [48] W. Zhu, W. Chen, and G. Guo. Fusing multiple features fordepth-based action recognition. In ACM TIST, 2015.
  • [49] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. AAAI, abs/1603.07772, 2016.