Deep Action- and Context-Aware Sequence Learning for Activity Recognition and Anticipation

11/17/2016 ∙ by Mohammad Sadegh Aliakbarian, et al. ∙ EPFL ∙ Australian National University

Action recognition and anticipation are key to the success of many computer vision applications. Existing methods can roughly be grouped into those that extract global, context-aware representations of the entire image or sequence, and those that aim at focusing on the regions where the action occurs. While the former may suffer from the fact that context is not always reliable, the latter completely ignore this source of information, which can nonetheless be helpful in many situations. In this paper, we aim at making the best of both worlds by developing an approach that leverages both context-aware and action-aware features. At the core of our method lies a novel multi-stage recurrent architecture that allows us to effectively combine these two sources of information throughout a video. This architecture first exploits the global, context-aware features, and then merges the resulting representation with the localized, action-aware ones. Our experiments on standard datasets demonstrate the benefits of our approach over methods that use each information type separately. We outperform the state-of-the-art methods that, like ours, rely only on RGB frames as input, for both action recognition and anticipation.




1 Introduction

Activity recognition and anticipation are crucial for the success of many real-life applications, such as autonomous navigation, sports analysis and personal robotics. They have therefore become increasingly popular in the computer vision literature. Nowadays, the most popular trend to tackle these tasks consists of extracting global representations for the entire image [6, 41, 42, 7], or video sequence [35, 18]. As such, these methods do not truly focus on the actions of interest, but rather compute a context-aware representation. Unfortunately, context does not always bring reliable information about the action. For example, one can play guitar in a bedroom, a concert hall or a yard. As a consequence, the resulting representations may contain a large amount of irrelevant information.

By contrast, several methods have attempted to localize the feature extraction process to regions of interest. This, to some degree, is the case of methods exploiting dense trajectories [39, 38, 9] and optical flow [8, 41, 21]. By relying on motion, however, these methods can easily be distracted by irrelevant phenomena such as moving background or camera. Inspired by objectness, the notion of actionness [40, 4, 16, 44, 37, 25] has recently been proposed as a means to overcome this weakness by attempting to localize the regions where a generic action occurs. The resulting methods can then be thought of as extracting action-aware representations. In other words, these methods go to the other extreme and completely discard the notion of context. In many situations, however, context provides helpful information about the action class. For example, one typically plays soccer on a grass field.

In this paper, we propose to make the best of both worlds: We introduce an approach that leverages both context-aware and action-aware features for action recognition and anticipation. In particular, we make use of the output of the last layer of an image-based Convolutional Neural Network (CNN) as context-aware features. For the action-aware ones, inspired by the approach of [48] for object recognition and localization, we propose to exploit the class-specific activations of another CNN, which typically correspond to regions where the action occurs. The main challenge then consists of effectively leveraging the two types of features for recognition and anticipation. To this end, we introduce the novel multi-stage recurrent architecture depicted in Fig. 1. In a first stage, this model focuses on the global, context-aware features. In a second stage, it combines the resulting representation with the localized, action-aware features to obtain the final prediction. In short, it first extracts the contextual information, and then merges it with the localized one.

To the best of our knowledge, our work constitutes the first attempt at explicitly bringing together these two types of information for action recognition. By leveraging RGB frames and optical flow, the two-stream approach of [29] exploits context and motion. As mentioned above, however, motion does not always correlate with the action of interest. While 3D CNNs [35] can potentially implicitly capture information about both context and action, they are difficult to train and computationally expensive. By contrast, our novel multi-stage LSTM model explicitly combines these two information sources, and provides us with an effective and efficient action recognition and anticipation framework.

As a result, our approach outperforms the state-of-the-art methods that, like ours, rely only on RGB frames as input, on all the standard benchmark datasets that we experimented with, including UCF-101 [33] and JHMDB-21 [17]. Furthermore, our experiments clearly demonstrate the benefits of our multi-stage architecture over networks that exploit either context-aware or action-aware features separately, or combine them via other fusion strategies.

2 Related Work

Over the years, great progress has been made in activity recognition [20, 38, 7, 9, 31, 41, 32]. Unsurprisingly, while earlier approaches relied on handcrafted features [38, 20], recent ones have turned towards deep learning. Below, we focus on these approaches, which are most related to our work.

In this deep learning context, many methods rely on CNNs [35, 18, 8, 19, 26] to extract a global representation of images. These CNN-based methods, however, typically have small temporal support, and thus fail to capture long-range dynamics. For instance, the two-stream networks [41, 8, 29] act on single images in conjunction with optical flow information to model the temporal information. While 3D convolutional filters have been proposed [35], they are typically limited to acting on small sets of stacked video frames, 10 to 20 in practice.

By contrast, recurrent architectures, such as the popular Long Short-Term Memory (LSTM) networks [14], can, in principle, learn complex, long-range dynamics, and have therefore recently been investigated for action recognition [7, 24, 34, 32, 23, 34]. For instance, in [7], an LSTM was employed to model the dynamics of CNN activations; in [32], a bi-directional LSTM was combined with a multi-stream CNN to encode the long-term dynamics within and between activities in videos. Other works, such as [24], have proposed to incorporate additional annotations, in the form of 3D skeletons, into an LSTM-based model. Such annotations, however, are not always available in practice, thus limiting the applicability of these methods.

Beyond recurrent models, rank pooling has also proven effective to model activities in videos [10, 3, 9]. In this context, [10] computes a representation encoding the dynamics of the video, and [3] introduces the concept of Dynamic Images to summarize the gist of a sequence.

In any event, whether based on CNNs, LSTMs or rank pooling, all of the above-mentioned methods compute one holistic representation over one image, or the sequence. While this has the advantage of retaining information about the context of the action, these methods may also easily be affected by the fact that context is not always reliable. Many actions can be performed in very different environments. In these cases, focusing on the action itself would therefore seem beneficial.

This, in essence, is the goal of methods based on the notion of actionness [40, 4, 16, 44, 37, 25]. Inspired by the concept of objectness [1, 36], commonly used in object detection, actionness aims at localizing the regions in a video where an action of interest occurs. In [40], this was achieved by exploiting appearance (RGB) and motion (optical flow) in a two-stream architecture. In [40], the resulting actionness map was then employed to generate action/bounding box proposals via an action detection framework based on [12], and to classify these proposals. The ActionTube approach of [27] follows a similar framework, but relies on [11] instead of [12]. More importantly, by focusing on the actions themselves, these methods throw away all the information about context. However, in many scenarios, such as recognizing different sports, context provides helpful information about the observed actions. Note that, to extract actionness in [40, 27], bounding box annotations are used as extra supervision during the training process, while our approach requires no additional annotations.

In short, while one class of methods models images in a global manner, and may thus be sensitive to context diversity, the other focuses solely on the action, and thus cannot benefit from context. Here, we introduce a novel multi-stage recurrent architecture that explicitly and effectively combines these two complementary information sources.

3 Preliminary

Here, we briefly review the main building block of our architecture, the LSTM. An LSTM is a recurrent neural network unit that implements a memory cell which can maintain its state over time. Hence, the benefit of LSTM units is that they allow the recurrent network to remember long-term context dependencies and relations [14].

An LSTM consists of three gates, (1) an input gate $i_t$, (2) an output gate $o_t$, and (3) a forget gate $f_t$, and a memory cell $c_t$. At each time-step $t$, the LSTM first computes the activations of its gates and then updates its memory cell from $c_{t-1}$ to $c_t$. It then computes the activation of the output gate $o_t$, and finally outputs a hidden representation $h_t$. There are two inputs to the LSTM at each time-step: (1) the observations $x_t$ and (2) the hidden representation from the previous time-step $h_{t-1}$. To update, the LSTM applies the following equations, given here in their standard form:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \\
g_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

where $\sigma$ is the logistic sigmoid, $\odot$ denotes the element-wise product, $g_t$ contains the candidate values for the memory cell, and the matrices $W$ and vectors $b$ are learned weights and biases.

Updating the memory of the LSTM involves the input gate and the forget gate. In more detail, the input gate controls which new values, computed from the new observations, are written to the memory cell, while the forget gate determines which part of the memory cell is forgotten. The output gate and the memory cell are together responsible for computing the hidden representation. The gradient of an LSTM can be propagated over longer time spans before vanishing because the memory cell is updated through a summation over time, and derivatives distribute over sums.
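To make the gate mechanics concrete, here is a minimal sketch of a single LSTM time-step in plain Python. It assumes the standard LSTM formulation without peephole connections; the helper names (`affine`, `lstm_step`, `zero_params`) and the list-based parameter layout are illustrative choices, not taken from the paper.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, U, b, x, h):
    """Return W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(W[r], x))
        + sum(u * hi for u, hi in zip(U[r], h))
        + b[r]
        for r in range(len(b))
    ]

def lstm_step(params, x, h_prev, c_prev):
    """One LSTM time-step (assumed standard form, no peephole connections)."""
    i = [sigmoid(v) for v in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(v) for v in affine(*params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(v) for v in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(v) for v in affine(*params["g"], x, h_prev)]  # candidate values
    # Additive memory update: keep part of the old cell, write part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # The hidden representation is the gated, squashed memory cell.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def zero_params(n_in, n_hid):
    """All-zero weights, only useful to illustrate shapes and gate behaviour."""
    def block():
        return ([[0.0] * n_in for _ in range(n_hid)],
                [[0.0] * n_hid for _ in range(n_hid)],
                [0.0] * n_hid)
    return {name: block() for name in ("i", "f", "o", "g")}
```

With all-zero weights every gate evaluates to 0.5 and the candidate to 0, so the step simply halves the previous memory cell. This shows the additive path through which information, and gradients, flow from one time-step to the next.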

Figure 2: Our feature extraction network.

Our CNN model for feature extraction is based on the VGG-16 structure with some modifications. Up to conv5-2, the network is the same as VGG-16, pre-trained on ImageNet. The output of this layer is connected to two sub-models. The first one extracts context-aware features by providing a global image representation. The second one relies on a localization-type network to extract action-aware features.

4 Our Method

Our goal is to leverage both context-aware and action-aware features for action recognition and anticipation. To this end, we introduce a multi-stage recurrent architecture based on LSTMs, depicted by Fig. 1. In this section, we first discuss our approach to extracting both feature types, and then present our multi-stage recurrent network.

4.1 Feature Extraction

To extract context-aware and action-aware features, we introduce the two-stream architecture shown in Fig. 2. The first part of this network is shared by both streams and, up to conv5-2, corresponds to the VGG-16 network [30], pre-trained on ImageNet for object recognition. The output of this layer is connected to two sub-models: one for context-aware features and the other for action-aware ones. We then train these two sub-models for the same task of action recognition from a single image, using a cross-entropy loss function defined on the output of each stream. In practice, we found that training the entire model in an end-to-end manner did not yield a significant improvement over training only the two sub-models. In our experiments, we therefore opted for this latter strategy, which is less expensive computationally and memory-wise. Below, we first discuss the context-aware sub-network and then turn to the action-aware one.

4.1.1 Context-Aware Feature Extraction

The context-aware sub-model is similar to VGG-16 from conv5-3 up to the last fully-connected layer, with the number of units in the last fully-connected layer changed from 1000 (the original 1000-way ImageNet classification problem) to the number of activities $N$.

In essence, this sub-model focuses on extracting a deep representation of the whole scene for each activity and thus incorporates context. We then take the output of its fc7 layer as our context-aware features.

4.1.2 Action-Aware Feature Extraction

As mentioned before, the context of an action does not always correlate with the action itself. Our second sub-model therefore aims at extracting features that focus on the action itself. To this end, we draw inspiration from the object classification work of [48]. At the core of this work lies the idea of Class Activation Maps (CAMs). In our context, a CAM indicates the regions in the input image that contribute most to predicting each class label. In other words, it provides information about the location of an action. Importantly, this is achieved without requiring any additional annotations.

More specifically, CAMs are extracted from the activations in the last convolutional layer in the following manner. Let $f_l(x, y)$ represent the activation of unit $l$ in the last convolutional layer at spatial location $(x, y)$. A score $S_k$ for each class $k$ can be obtained by performing global average pooling [22] to obtain, for each unit $l$, a feature $F_l = \sum_{x,y} f_l(x, y)$, followed by a linear layer with weights $\{w_l^k\}$. That is, $S_k = \sum_l w_l^k F_l$. A CAM for class $k$ at location $(x, y)$ can then be computed as

$$
M_k(x, y) = \sum_l w_l^k f_l(x, y),
$$

which indicates the importance of the activations at location $(x, y)$ in the final score for class $k$.
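The CAM computation described above can be sketched in a few lines of plain Python. This is an illustration only: it assumes the pooled feature of each unit is the spatial sum of its activations (a true average would differ by a constant factor), and the function names and nested-list layout are hypothetical.

```python
def class_activation_map(features, weights_k):
    """CAM for one class.

    features[l][y][x] is the activation of unit l at location (x, y);
    weights_k[l] is the linear-layer weight of unit l for class k.
    """
    height, width = len(features[0]), len(features[0][0])
    return [
        [sum(w * fmap[y][x] for w, fmap in zip(weights_k, features))
         for x in range(width)]
        for y in range(height)
    ]

def class_score(features, weights_k):
    """Score of class k: spatially pooled features followed by the linear layer."""
    pooled = [sum(sum(row) for row in fmap) for fmap in features]
    return sum(w * p for w, p in zip(weights_k, pooled))
```

Under this assumption, summing the CAM over all locations recovers the class score, which is why each CAM entry can be read as the contribution of one location to the final prediction.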

Figure 3: Action-aware feature extraction. Given the fine-tuned feature extraction network, we introduce a new layer that alters the output of conv5-3. This lets us filter out the conv5-3 features that are irrelevant, to focus on the action itself. Our action-aware features are then taken as the output of the last fully-connected layer shown here.

Here, we propose to make use of the CAMs to extract action-aware features. To this end, we use the CAMs in conjunction with the output of the conv5-3 layer of the model. The intuition behind this is that conv5-3 extracts high-level features that provide a very rich representation of the image [46] and typically correspond to the most discriminative parts of the object [2, 28], or, in our case, the action. Therefore, we incorporate a new layer to our sub-model, whose output can be expressed as

$$
A_k(x, y) = \text{conv5-3}(x, y) \times \text{ReLU}(M_k(x, y)),
$$

where $\text{ReLU}(M_k(x, y)) = \max(0, M_k(x, y))$. As illustrated in Fig. 3, this new layer is then followed by fully-connected ones, and we take our action-aware features as the output of the corresponding fc7 layer.
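As a rough illustration of how a CAM can filter conv5-3 activations, the sketch below assumes that the new layer multiplies every conv5-3 channel, location by location, by the rectified CAM. Both this exact form and the name `action_aware_activations` are assumptions made for the example.

```python
def action_aware_activations(features, cam):
    """Weight every feature channel by max(0, CAM) at each spatial location.

    features[l][y][x] holds the conv5-3 activations (assumed layout);
    cam[y][x] is the class activation map of the class of interest.
    """
    return [
        [[value * max(0.0, cam[y][x]) for x, value in enumerate(row)]
         for y, row in enumerate(fmap)]
        for fmap in features
    ]
```

Locations where the CAM is negative or zero are suppressed entirely, while the remaining ones are amplified in proportion to their importance for the class.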

4.2 Sequence Learning for Action Recognition

To effectively combine the information contained in the context-aware and action-aware features described above, we design the novel multi-stage LSTM model depicted by Fig. 1. This model first focuses on the context-aware features, which encode global information about the entire image. It then combines the output of this first stage with our action-aware features to provide a refined class prediction.

To learn this model, we introduce a novel loss function motivated by the intuition that, while we would like the model to predict the correct class as early as possible in the sequence, some actions, such as running and high jump, are highly ambiguous after seeing only the first few frames. Ultimately, our network models long-range temporal information, and yields increasingly accurate predictions as it processes more frames. This therefore also provides us with an effective mechanism to forecast an action type given only limited input observations. Below, we discuss the two stages of our model.

4.2.1 Learning Context

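The first stage of our model takes as input our context-aware features, and passes them through a layer of LSTM cells followed by a fully-connected layer that, via a softmax operation, outputs a probability for each action class. Let $\hat{y}^c_t(k)$ be the probability of class $k$ at time $t$ predicted by the first stage. We then define the loss for a single training sample as

$$
\mathcal{L}_c(y, \hat{y}^c) = -\frac{1}{N} \sum_{k=1}^{N} \sum_{t=1}^{T} \left[ y_t(k) \log \hat{y}^c_t(k) + \frac{t}{T} \left(1 - y_t(k)\right) \log\left(1 - \hat{y}^c_t(k)\right) \right],
$$

where $y_t(k)$ encodes the true activity label at time $t$, i.e., $y_t(k) = 1$ if the sample belongs to class $k$ and 0 otherwise, $N$ is the number of classes, and $T$ is the length of the sequence.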

This loss function consists of two terms. The first one is standard and aims at penalizing false negatives with the same strength at any point in time. By contrast, the second term focuses on false positives, and its strength increases linearly over time, to reach the same weight as that on false negatives. The motivation behind this loss can be explained as follows. Early in the sequence, there can easily be ambiguities between several actions, such as running and high jump. Therefore, false positives are bound to happen, and should not be penalized too strongly. As we see more frames, however, these false positives should be encouraged to disappear. By contrast, we would like to have a high score for the correct class as early as possible. This is taken care of by the first term, which penalizes false negatives, and whose relative weight over the second term is larger at the beginning of the sequence.
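The behaviour of the two terms can be checked with a small sketch in plain Python. It assumes a false-positive weight of t/T and a normalization by the number of classes; the function name `plgl_loss` and the small `eps` guard are illustrative additions, not part of the paper.

```python
import math

def plgl_loss(probs, true_class):
    """Loss for one sequence; probs[t][k] is the probability of class k at frame t."""
    T, N = len(probs), len(probs[0])
    eps = 1e-12  # guards against log(0); an implementation detail of this sketch
    total = 0.0
    for t, frame in enumerate(probs, start=1):
        for k, p in enumerate(frame):
            if k == true_class:
                # False-negative term: same weight at every frame.
                total += math.log(p + eps)
            else:
                # False-positive term: weight grows linearly from 1/T to 1.
                total += (t / T) * math.log(1.0 - p + eps)
    return -total / N  # normalization by N is an assumption
```

For example, a two-frame sequence that is wrong at the first frame and right at the last one receives a lower loss than a sequence that makes the same mistake at the last frame, because only the false-positive weight depends on time.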

4.2.2 Learning Context and Action

The second stage of our model aims at combining context-aware and action-aware information. Its structure is the same as that of the first stage, i.e., a layer of LSTM cells followed by a fully-connected layer to output class probabilities via a softmax operation. However, its input merges the output of the first stage with our action-aware features. This is achieved by concatenating the hidden activations of the first-stage LSTM layer with our action-aware features. We then make use of the same loss function as before, but defined on the final prediction. This can be expressed as

$$
\mathcal{L}_{c,a}(y, \hat{y}) = -\frac{1}{N} \sum_{k=1}^{N} \sum_{t=1}^{T} \left[ y_t(k) \log \hat{y}_t(k) + \frac{t}{T} \left(1 - y_t(k)\right) \log\left(1 - \hat{y}_t(k)\right) \right],
$$

where $\hat{y}_t(k)$ is the probability for class $k$ at time $t$ predicted by the second stage.

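The overall loss of our model can then be written as

$$
\mathcal{L} = \mathcal{L}_c + \mathcal{L}_{c,a},
$$

i.e., the sum of the first-stage loss $\mathcal{L}_c$ and of the second-stage loss $\mathcal{L}_{c,a}$.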
To learn the parameters of our model, we then average this loss over the training samples in a mini-batch.
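The data flow between the two stages can be sketched as follows. This is a structural illustration only: `cell1` and `cell2` stand for any recurrent cell with the hypothetical signature `cell(x, h, c) -> (h, c)`, and the per-frame fully-connected and softmax layers are omitted.

```python
def run_stage(cell, inputs, hidden_size):
    """Run a recurrent cell over a sequence; return the hidden state at every step."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    states = []
    for x in inputs:
        h, c = cell(x, h, c)
        states.append(h)
    return states

def multi_stage_forward(context_feats, action_feats, cell1, cell2, hidden_size):
    """Two-stage forward pass over per-frame feature vectors (lists of floats)."""
    # Stage 1 sees only the context-aware features.
    h1 = run_stage(cell1, context_feats, hidden_size)
    # Stage 2 sees, at each frame, the stage-1 hidden activations
    # concatenated with the action-aware features of that frame.
    stage2_inputs = [h + a for h, a in zip(h1, action_feats)]
    h2 = run_stage(cell2, stage2_inputs, hidden_size)
    return h1, h2
```

In a full model, each list of hidden states would be passed through its own classifier, so that one loss can be attached to the first-stage prediction and another to the final prediction.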

At inference time, the input RGB frames are forward-propagated through this model. We therefore obtain a vector of class probabilities at each frame. While one could simply take the probabilities in the last frame to obtain the class label, via an argmax operation, we propose to increase robustness by leveraging the predictions of all the frames. To this end, we make use of an average pooling of these predictions over time.
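The difference between the two inference strategies can be illustrated with a short sketch in plain Python; the function names are hypothetical and `probs[t][k]` denotes the predicted probability of class k at frame t.

```python
def predict_last_frame(probs):
    """Class with the highest probability in the last frame."""
    last = probs[-1]
    return max(range(len(last)), key=last.__getitem__)

def predict_average_pooling(probs):
    """Class with the highest probability after averaging over all frames."""
    T = len(probs)
    mean = [sum(frame[k] for frame in probs) / T for k in range(len(probs[0]))]
    return max(range(len(mean)), key=mean.__getitem__)
```

If a single noisy frame at the end of the sequence favours the wrong class, the last-frame rule follows it, whereas the averaged prediction still reflects the evidence accumulated over the earlier frames.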

5 Experiments

In this section, we first compare our method with state-of-the-art techniques on the task of action recognition, and then analyze various aspects of our model. For our experiments, we make use of the standard UCF-101 [33] and JHMDB-21 [17] benchmarks. The UCF-101 dataset consists of 13,320 videos of 101 action classes including a broad set of activities such as sports, musical instruments, and human-object interaction, with an average length of 7.2 seconds. UCF-101 offers a large diversity of actions and, with its large variations in camera motion, cluttered backgrounds, illumination conditions, etc., is one of the most challenging datasets. The JHMDB-21 dataset is another challenging dataset of realistic videos from various sources, such as movies and web videos, containing 928 videos and 21 action classes.

Implementation details.

To fine-tune the network on these datasets, we used a number of data augmentation techniques, so as to reduce the effect of over-fitting. The input images were randomly flipped horizontally and rotated by a random amount in the range -8 to 8 degrees. We then extracted crops according to the following procedure:

  1. Compute the maximum cropping rectangle with given aspect ratio () that can fit within the input image.

  2. Scale the width and height of the cropping rectangle by a factor randomly selected in the range -.

  3. Select a random location for the cropping rectangle within the original input image and extract that subimage.

  4. Scale the subimage to .
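The first three steps of this cropping procedure can be sketched as below. The aspect ratio and the scaling range are left as function arguments because their values are not recoverable from the text here; the values used in the usage note are placeholders, and the function name `random_crop_box` is hypothetical. Step 4 (rescaling the crop) would be delegated to an image library.

```python
import random

def random_crop_box(img_w, img_h, aspect, scale_range, rng=random):
    """Return (left, top, width, height) of a random crop; aspect = width / height.

    scale_range is a (low, high) pair with high <= 1 so the crop fits in the image.
    """
    # Step 1: largest rectangle with the requested aspect ratio that fits.
    if img_w / img_h > aspect:
        max_h, max_w = img_h, img_h * aspect
    else:
        max_w, max_h = img_w, img_w / aspect
    # Step 2: scale width and height by the same random factor.
    s = rng.uniform(*scale_range)
    w, h = int(max_w * s), int(max_h * s)
    # Step 3: random location such that the rectangle stays inside the image.
    left = rng.randint(0, img_w - w)
    top = rng.randint(0, img_h - h)
    return left, top, w, h
```

For instance, `random_crop_box(320, 240, 1.0, (0.8, 1.0))` returns a square box that always lies within a 320×240 frame; these particular numbers are only for illustration.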

After these geometric transformations, we further applied RGB channel shifting [43], followed by randomly adjusting image brightness, contrast and saturation.

The parameters of the CNN were learned using stochastic gradient descent with a fixed learning rate of 0.001, a momentum of 0.9, a weight decay of 0.0005, and mini-batches of size 32. To train our LSTMs, we similarly used stochastic gradient descent, but with a fixed learning rate of 0.01, a momentum of 0.9, and mini-batches of size 32. To implement our method, we use Python and Keras.
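For reference, one parameter update with the CNN hyperparameters quoted above might look as follows. This sketch assumes the classical momentum formulation with weight decay added to the gradient as an L2 penalty; the exact update rule used by the authors' framework may differ, and `sgd_momentum_step` is a hypothetical name.

```python
def sgd_momentum_step(w, grad, velocity, lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One SGD update on a flat list of parameters.

    Assumed rule: v <- momentum * v - lr * (grad + weight_decay * w); w <- w + v.
    """
    new_v = [momentum * v - lr * (g + weight_decay * wi)
             for wi, g, v in zip(w, grad, velocity)]
    new_w = [wi + v for wi, v in zip(w, new_v)]
    return new_w, new_v
```

With a zero gradient, the only change to a weight comes from the decay term, which shrinks it slightly towards zero at every step.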


5.1 Comparison with the State-of-the-Art

In Tables 1 and 2, we compare the results of our approach to state-of-the-art methods on UCF-101 and JHMDB-21, respectively, by reporting the average accuracy over the given three training and testing partitions. For this comparison to be fair, we only report the results of the baselines that do not use any other information than the RGB image and the activity label. In other words, while it has been shown that additional, hand-crafted features, such as dense trajectories and optical flow, can help improve accuracy [38, 39, 21, 29, 3], our goal here is to truly evaluate the benefits of our method, not of these features. Note, however, that, as discussed in Section 5.2.5, our approach can still benefit from such features. As can be seen from the tables, our approach outperforms all these baselines on both datasets.

| Method | Accuracy |
|---|---|
| Dynamic Image Network [3] | 70.0% |
| Dynamic Image Network + Static RGB [3] | 76.9% |
| Rank Pooling [9] | 72.2% |
| Discriminative Hierarchical Ranking [9] | 78.8% |
| Realtime Action Recognition [47] | 74.4% |
| LSTM [34] | 74.5% |
| LRCN [7] | 68.8% |
| C3D [35] | 82.3% |
| Spatial Stream Network [29] | 73.0% |
| Deep Network [18] | 65.4% |
| ConvPool (Single frame) [45] | 73.3% |
| ConvPool (30 frames) [45] | 80.8% |
| ConvPool (120 frames) [45] | 82.6% |
| Ours (pLGL, AvgPool, 2048 units) | 83.3% |
| Comparison to State-of-the-Art | +0.7% |

Table 1: Comparison with state-of-the-art methods on UCF-101, three splits averaged. To provide a fair comparison, we focus on the baselines that, like ours, only use the RGB frames as input (without any other information and/or hand-crafted features).

| Method | Accuracy |
|---|---|
| Spatial-CNN [13] | 37.9% |
| Motion-CNN [13] | 45.7% |
| Full Method [13] | 53.3% |
| Actionness-Spatial [40] | 42.6% |
| Actionness-Temporal [40] | 54.8% |
| Actionness-Full Method [40] | 56.4% |
| Ours (pLGL, AvgPool, 2048 units) | 58.3% |
| Comparison to State-of-the-Art | +1.9% |

Table 2: Comparison with state-of-the-art methods on JHMDB-21, three splits averaged. Note that while we only use RGB frames as input, both baselines use motion/optical flow information.

5.2 Analysis

In this section, we analyze several aspects of our method in more detail, such as the importance of each feature type, the influence of the LSTM architecture and of the number and order of the input frames. Finally, we show the effectiveness of our approach at tackling the task of action anticipation, and study how optical flow can be employed to further improve our results. All the analytical experiments were conducted on the first split of UCF-101 dataset.

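In the following analysis, we also evaluate the effectiveness of different losses. Below, $y_t(k)$ and $\hat{y}_t(k)$ denote the ground-truth label and the predicted probability of class $k$ at time $t$, $N$ the number of classes, and $T$ the sequence length. In particular, we make use of the standard cross-entropy (CE) loss, which only accounts for one activity label for each sequence (the activity label at time $T$). This loss can be expressed as

$$
\mathcal{L}_{CE}(y, \hat{y}) = -\frac{1}{N} \sum_{k=1}^{N} y_T(k) \log \hat{y}_T(k).
$$

In [15], an exponentially growing loss (EGL) was proposed to penalize errors more strongly as more frames of the sequence are observed. This loss can be written as

$$
\mathcal{L}_{EGL}(y, \hat{y}) = -\frac{1}{N} \sum_{k=1}^{N} \sum_{t=1}^{T} e^{-(T-t)}\, y_t(k) \log \hat{y}_t(k).
$$

The main drawback of this loss comes from the fact that it does not strongly encourage the model to make correct predictions as early as possible. To address this issue, we introduce a linearly growing loss (LGL). This loss is defined as

$$
\mathcal{L}_{LGL}(y, \hat{y}) = -\frac{1}{N} \sum_{k=1}^{N} \sum_{t=1}^{T} \frac{t}{T}\, y_t(k) \log \hat{y}_t(k).
$$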
As shown in the experimental analysis, the linearity of this loss makes it more effective than the EGL. Our new loss, discussed in Section 4.2 and denoted by pLGL below, also makes use of a linearly-increasing term. This term, however, corresponds to the false positives, as opposed to the false negatives in the LGL. Since some actions are ambiguous in the first few frames, we find it more intuitive not to penalize false positives too strongly at the beginning of the sequence. Our results below seem to support this, since, for a given model, our loss typically yields higher accuracies than the LGL.
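To compare how the CE, EGL and LGL baselines weight each frame, the sketch below writes all three as a time-weighted log-loss on the true class. The weighting functions follow the verbal descriptions above (last frame only, exponential growth, linear growth) and are assumptions of this example, as are all names; constant normalization factors are omitted and predicted probabilities are assumed to be strictly positive.

```python
import math

def weighted_log_loss(probs, true_class, weight):
    """Negated sum over frames of weight(t, T) * log-probability of the true class."""
    T = len(probs)
    return -sum(weight(t, T) * math.log(frame[true_class])
                for t, frame in enumerate(probs, start=1))

def ce_weight(t, T):
    """Cross-entropy on the final prediction: only the last frame counts."""
    return 1.0 if t == T else 0.0

def egl_weight(t, T):
    """Exponentially growing weight, reaching 1 at the last frame."""
    return math.exp(-(T - t))

def lgl_weight(t, T):
    """Linearly growing weight, reaching 1 at the last frame."""
    return t / T
```

For a 10-frame sequence, the exponential weight at the first frame is about 0.0001 while the linear one is 0.1, which illustrates why an exponentially growing loss puts very little pressure on early predictions.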

| Feature Extraction | Sequence Learning | UCF-101 | JHMDB-21 |
|---|---|---|---|
| Context-Aware | LSTM (CE) | 72.38% | 43.65% |
| Action-Aware | LSTM (CE) | 44.24% | 50.06% |
| Context+Action | MS-LSTM (CE) | 78.93% | 54.30% |
| Context-Aware | LSTM (EGL) | 72.41% | 44.05% |
| Action-Aware | LSTM (EGL) | 77.20% | 50.18% |
| Context+Action | MS-LSTM (EGL) | 80.38% | 57.05% |
| Context-Aware | LSTM (LGL) | 72.58% | 44.72% |
| Action-Aware | LSTM (LGL) | 77.63% | 50.34% |
| Context+Action | MS-LSTM (LGL) | 81.27% | 57.70% |
| Context-Aware | LSTM (pLGL) | 72.71% | 44.93% |
| Action-Aware | LSTM (pLGL) | 77.86% | 51.00% |
| Context+Action | MS-LSTM (pLGL) | 83.37% | 58.41% |

Table 3: Importance of the different feature types using different losses. Note that combining both types of features consistently outperforms using a single one. Note also that, for a given model, our new pLGL loss yields higher accuracies than the other ones.

5.2.1 Importance of the Feature Types

We first evaluate the importance of the different feature types, context-aware and action-aware, on recognition accuracy. To this end, we compare models trained using each feature type individually with our model that uses them jointly. For all models, we made use of LSTMs with 1024 units. Recall that our approach relies on a multi-stage LSTM, which we denote by MS-LSTM. The results of this experiment for different losses are reported in Table 3. These results clearly evidence the importance of using both feature types, which consistently outperforms using individual ones in all settings.

5.2.2 Influence of the LSTM Architecture

Our second set of experiments studies the importance of using a multi-stage LSTM architecture and the influence of the number of units in our MS-LSTM. For the first one of these analyses, we compare our MS-LSTM with a single-stage LSTM that takes as input the concatenation of our context-aware and action-aware features. Furthermore, to study the importance of the feature order, we also compare our approach with a model that first processes the action-aware features and, in a second stage, combines them with the context-aware ones. Finally, we also evaluate the use of two parallel LSTMs whose outputs are merged by concatenation and then fed to a dense layer distributed over time. The results of this comparison are provided in Table 4. Note that both multi-stage LSTMs outperform the single-stage one and the two parallel LSTMs, thus indicating the importance of treating the two types of features sequentially. Interestingly, processing context-aware features first, as we propose, yields higher accuracy than considering the action-aware ones at the beginning. This matches our intuition that context-aware features carry global information about the image and will thus yield noisy results, which can then be refined by exploiting the action-aware features.

| Feature Order | Sequence Learning | Accuracy |
|---|---|---|
| Concatenation | LSTM (pLGL) | 77.16% |
| Swapped | LSTM (pLGL) | 78.80% |
| Parallel | 2 Parallel LSTMs (pLGL) | 78.63% |
| Ours | MS-LSTM (pLGL) | 83.37% |

Table 4: Comparison of our multi-stage LSTM model with diverse fusion strategies. We report the results of simple concatenation of the context-aware and action-aware features, their use in two parallel LSTMs with late fusion, and swapping their order in our multi-stage LSTM, i.e., action-aware first, followed by context-aware. Note that multi-stage architectures yield better results, with the best ones achieved by using context first, followed by action, as proposed in this paper.

Based on our experiments, we found that for large datasets such as UCF-101, the 512 hidden units that some baselines use (e.g. [7, 34]) do not suffice to capture the complexity of the data. Therefore, to study the influence of the number of units in the LSTM, we evaluated different versions of our model with 512, 1024, 2048, and 4096 hidden units, training each model on 80% of the training data and validating it on the remaining 20%. For a single LSTM, we found that using 1024 hidden units performs best. For our multi-stage LSTM, using 2048 hidden units yields the best results. We also evaluated the importance of relying on average pooling in the LSTM. The results of these different versions of our MS-LSTM framework are provided in Table 5. This shows that, typically, more hidden units and average pooling can improve accuracy slightly.

| Setup | Average Pooling | Hidden Units | UCF-101 | JHMDB-21 |
|---|---|---|---|---|
| Ours (CE) | wo/ | 1024 | 77.26% | 52.80% |
| Ours (CE) | wo/ | 2048 | 78.09% | 53.43% |
| Ours (CE) | w/ | 2048 | 78.93% | 54.30% |
| Ours (EGL) | wo/ | 1024 | 79.10% | 55.33% |
| Ours (EGL) | wo/ | 2048 | 79.41% | 56.12% |
| Ours (EGL) | w/ | 2048 | 80.38% | 57.05% |
| Ours (LGL) | wo/ | 1024 | 79.76% | 55.70% |
| Ours (LGL) | wo/ | 2048 | 80.10% | 56.83% |
| Ours (LGL) | w/ | 2048 | 81.27% | 57.70% |
| Ours (pLGL) | wo/ | 1024 | 81.94% | 56.24% |
| Ours (pLGL) | wo/ | 2048 | 82.16% | 57.92% |
| Ours (pLGL) | w/ | 2048 | 83.37% | 58.41% |

Table 5: Evaluating the effectiveness of applying average pooling on the softmax probabilities of all time-steps for each sample and the number of hidden units in different losses of our multi-stage LSTM model. Experiments are conducted on the first split of UCF-101 and JHMDB-21.

5.2.3 Influence of the Input Frames

As mentioned before, our model makes use of $K$ frames as input. Here, we therefore study the influence of the value of $K$ and of the choice of frames on our results. To this end, we varied $K$ in the range $\{10, 20, 30, 40, 50\}$. For each value, we then fed either the first $K$ frames of the sequence, or $K$ randomly sampled ones, to our model at both training and inference time. The results for the different values of $K$ and frame selection strategies are shown in Table 6. In essence, once $K$ is large enough (from $K = 30$ onwards in Table 6), the gap between the different results becomes very small. These results also show that, while slightly better accuracies can typically be obtained by random sampling, using the first $K$ frames remains reliable, again particularly for these larger values of $K$. This will be particularly important for action anticipation, where one only has access to the first frames.
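The two frame selection strategies can be sketched as follows. Whether the sampled frames are kept in temporal order is not stated in the text; this example assumes they are, and the function names are hypothetical.

```python
import random

def first_k(frames, k):
    """Keep the first k frames of the sequence."""
    return frames[:k]

def sample_k(frames, k, rng=random):
    """Pick k distinct frames at random, preserving temporal order (an assumption)."""
    idx = sorted(rng.sample(range(len(frames)), k))
    return [frames[i] for i in idx]
```

Only `first_k` is applicable to anticipation, since random sampling requires access to the whole sequence.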

| Setup | K | First K | Sampling K |
|---|---|---|---|
| Ours | 10 | 77.4% | 80.0% |
| Ours | 20 | 78.5% | 81.2% |
| Ours | 30 | 81.9% | 82.0% |
| Ours | 40 | 82.2% | 82.3% |
| Ours | 50 | 83.4% | 83.2% |

Table 6: Influence of the input frames on activity recognition accuracy on split 1 of the UCF-101 dataset.

5.2.4 Action Anticipation

Unlike action recognition, where the full sequence can be employed to predict the activity type, action anticipation aims at providing a class prediction as early as possible. In Fig. 4, we plot the accuracy of our model as a function of the number of observed frames for different losses and with/without average pooling over time of the softmax probabilities. In this experiment, all the models were trained from sequences of frames. These plots clearly show the importance of using average pooling, which provides more robustness to the prediction. More importantly, they also evidence the benefits of our novel loss, which was designed to encourage correct prediction as early as possible, over other losses for action anticipation.

Figure 4: Action Anticipation. Evaluating the performance of different losses on anticipating the activity given partial sequential information at time-step $t$.

5.2.5 Exploiting Optical Flow

In the past, several methods have proposed to rely on optical flow to encode temporal information [41, 29, 8, 45, 34]. Here, we show that our approach can also benefit from this additional source of information. To extract optical flow features, we made use of the pre-trained temporal network of [29]. At each time-step, we then computed the CNN features from a stack of 20 optical flow frames (10 frames in the X-direction and 10 frames in the Y-direction), i.e., from a 10-frame temporal window. As these features are potentially loosely related to action (by focusing on motion), we merge them with the input to the second stage of our multi-stage LSTM. In Table 7, we compare the results of our modified approach with state-of-the-art methods also exploiting optical flow. Note that we consistently outperform these baselines, with the exception of those that incorporate optical flow as input to a two-stream network, and thus learn how to exploit it and how to fuse it with RGB information. Designing an architecture that jointly leverages these motion-aware features with our context- and action-aware ones will be the topic of our future research.

| Method | Accuracy |
|---|---|
| Spatio-temporal ConvNet [18] | 65.4% |
| LRCN [7] | 82.9% |
| Two-Stream ConvNet [29] | 88.0% |
| VLAD3 + Optical Flow [21] | 84.1% |
| Two-Stream Conv.Pooling [45] | 88.2% |
| LSTM [34] | 84.3% |
| CNN features + Optical Flow [29] | 73.9% |
| ConvPool (30 frames) + OpticalFlow [45] | 87.6% |
| ConvPool (120 frames) + OpticalFlow [45] | 88.2% |
| Two-Stream Net Fusion [8] | 92.5% |
| TSN [41] | 93.5% |
| Ours (pLGL, AvgPool, 2048 units) + Optical Flow | 88.7% |
Table 7: Comparison with the state-of-the-art approaches that use optical flow features. To provide a fair comparison, we focus on the baselines that, like ours, only use the RGB frames and optical flow as input (without any other hand-crafted features).

6 Conclusion

In this paper, we have proposed to leverage both context- and action-aware features for action recognition and anticipation. The first type of features provides a global representation of the scene, but may suffer from the fact that some actions can occur in diverse contexts. By contrast, the second feature type focuses on the action itself, and thus cannot leverage the information provided by the context. We have therefore introduced a multi-stage LSTM architecture that effectively combines these two sources of information. Furthermore, we have designed a novel loss function that encourages our model to make correct predictions as early as possible in the input sequence, thus making it particularly well-suited to action anticipation. Our experiments on two standard benchmarks have demonstrated the importance of our feature combination scheme and of our loss function. Among the methods that only rely on RGB as input, our approach yields state-of-the-art action recognition accuracy. In the future, we intend to study new ways to incorporate additional sources of information, such as dense trajectories and human skeletons, into our framework.


References

  • [1] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012.
  • [2] G. Bertasius, J. Shi, and L. Torresani. Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4380–4389, 2015.
  • [3] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [4] W. Chen, C. Xiong, R. Xu, and J. J. Corso. Actionness ranking with lattice conditional ordinal random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 748–755, 2014.
  • [5] F. Chollet. Keras, 2015.
  • [6] A. Diba, A. M. Pazandeh, H. Pirsiavash, and L. Van Gool. Deepcamp: Deep convolutional action & attribute mid-level patterns.
  • [7] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
  • [8] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. 2016.
  • [9] B. Fernando, P. Anderson, M. Hutter, and S. Gould. Discriminative hierarchical rank pooling for activity recognition. In Proc. CVPR, 2016.
  • [10] B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars. Rank pooling for action recognition. 2016.
  • [11] R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
  • [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [13] G. Gkioxari and J. Malik. Finding action tubes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 759–768, 2015.
  • [14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • [15] A. Jain, A. Singh, H. S. Koppula, S. Soh, and A. Saxena. Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3118–3125. IEEE, 2016.
  • [16] M. Jain, J. Van Gemert, H. Jégou, P. Bouthemy, and C. G. Snoek. Action localization with tubelets from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 740–747, 2014.
  • [17] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3192–3199, 2013.
  • [18] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  • [19] I. Kviatkovsky, E. Rivlin, and I. Shimshoni. Online action recognition using covariance of shape and motion. Computer Vision and Image Understanding, 129:15–26, 2014.
  • [20] I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.
  • [21] Y. Li, W. Li, V. Mahadevan, and N. Vasconcelos. Vlad3: Encoding dynamics of deep features for action recognition.
  • [22] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.
  • [23] J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In European Conference on Computer Vision, pages 816–833. Springer, 2016.
  • [24] B. Mahasseni and S. Todorovic. Regularizing long short term memory with 3d human-skeleton sequences for action recognition.
  • [25] D. Oneata, J. Revaud, J. Verbeek, and C. Schmid. Spatio-temporal object detection proposals. In European Conference on Computer Vision, pages 737–752. Springer, 2014.
  • [26] X. Peng and C. Schmid. Multi-region two-stream r-cnn for action detection. In European Conference on Computer Vision, pages 744–759. Springer, 2016.
  • [27] S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529, 2016.
  • [28] F. Saleh, M. S. Aliakbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez. Built-in foreground/background prior for weakly-supervised semantic segmentation. In European Conference on Computer Vision, pages 413–432. Springer, 2016.
  • [29] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
  • [30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [31] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [32] B. Singh and M. Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection.
  • [33] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. 2012.
  • [34] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. CoRR, abs/1502.04681, 2015.
  • [35] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015.
  • [36] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
  • [37] M. Van den Bergh, G. Roig, X. Boix, S. Manen, and L. Van Gool. Online video seeds for temporal window objectness. In Proceedings of the IEEE International Conference on Computer Vision, pages 377–384, 2013.
  • [38] H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
  • [39] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4305–4314, 2015.
  • [40] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. Actionness estimation using hybrid fully convolutional networks. arXiv preprint arXiv:1604.07279, 2016.
  • [41] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
  • [42] X. Wang, A. Farhadi, and A. Gupta. Actions ~ transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [43] R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 2015.
  • [44] G. Yu and J. Yuan. Fast action proposals for human action detection and search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1302–1311, 2015.
  • [45] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.
  • [46] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
  • [47] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Real-time action recognition with enhanced motion vector cnns. arXiv preprint arXiv:1604.07669, 2016.
  • [48] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.