Activity recognition and anticipation are crucial to the success of many real-life applications, such as autonomous navigation, sports analysis and personal robotics. These tasks have therefore become increasingly popular in the computer vision literature. Nowadays, the dominant trend consists of extracting global representations for the entire image [6, 41, 42, 7] or video sequence [35, 18]. As such, these methods do not truly focus on the actions of interest, but rather compute a context-aware representation. Unfortunately, context does not always provide reliable information about the action. For example, one can play guitar in a bedroom, a concert hall or a yard. The resulting representations therefore encompass much irrelevant noise.
By contrast, several methods have attempted to localize the feature extraction process to regions of interest. This, to some degree, is the case of methods exploiting dense trajectories [39, 38, 9] and optical flow [8, 41, 21]. By relying on motion, however, these methods can easily be distracted by irrelevant phenomena, such as a moving background or camera. Inspired by objectness, the notion of actionness [40, 4, 16, 44, 37, 25] has recently been proposed as a means to overcome this weakness by attempting to localize the regions where a generic action occurs. The resulting methods can then be thought of as extracting action-aware representations. In other words, these methods go to the other extreme and completely discard the notion of context. In many situations, however, context provides helpful information about the action class. For example, one typically plays soccer on a grass field.
In this paper, we propose to make the best of both worlds: We introduce an approach that leverages both context-aware and action-aware features for action recognition and anticipation. In particular, we make use of the output of the last layer of an image-based Convolutional Neural Network (CNN) as context-aware features. For the action-aware ones, inspired by previous work on object recognition and localization, we propose to exploit the class-specific activations of another CNN, which typically correspond to regions where the action occurs. The main challenge then consists of effectively leveraging the two types of features for recognition and anticipation. To this end, we introduce the novel multi-stage recurrent architecture depicted by Fig. 1. In its first stage, this model focuses on the global, context-aware features; in the second, it combines the resulting representation with the localized, action-aware ones to obtain the final prediction. In short, it first extracts the contextual information, and then merges it with the localized one.
To the best of our knowledge, our work constitutes the first attempt at explicitly bringing together these two types of information for action recognition. By leveraging RGB frames and optical flow, two-stream approaches exploit context and motion. As mentioned above, however, motion does not always correlate with the action of interest. While 3D CNNs can potentially capture information about both context and action implicitly, they are difficult to train and computationally expensive. By contrast, our novel multi-stage LSTM model explicitly combines these two information sources, and provides us with an effective and efficient action recognition and anticipation framework.
As a result, our approach outperforms the state-of-the-art methods that, like ours, rely only on RGB frames as input, on all the standard benchmark datasets that we experimented with, including UCF-101 and JHMDB-21. Furthermore, our experiments clearly evidence the benefits of our multi-stage architecture over networks that exploit either context-aware or action-aware features separately, or combine them via other fusion strategies.
2 Related Work
While earlier approaches to action recognition relied on handcrafted features, recent ones have turned towards deep learning. Below, we focus on these approaches, which are most related to our work.
In this deep learning context, many methods rely on CNNs [35, 18, 8, 19, 26] to extract a global representation of images. These CNN-based methods, however, typically have small temporal support, and thus fail to capture long-range dynamics. For instance, the two-stream networks [41, 8, 29] act on single images in conjunction with optical flow information to model the temporal information. While 3D convolutional filters have been proposed, they are typically limited to acting on small sets of stacked video frames, 10 to 20 in practice.
By contrast, recurrent architectures, such as the popular Long Short-Term Memory (LSTM) networks, can, in principle, learn complex, long-range dynamics, and have therefore recently been investigated for action recognition [7, 24, 34, 32, 23, 34]. For instance, an LSTM has been employed to model the dynamics of CNN activations, and a bi-directional LSTM has been combined with a multi-stream CNN to encode the long-term dynamics within and between activities in videos. Other works have proposed to incorporate additional annotations, in the form of 3D skeletons, into an LSTM-based model. Such annotations, however, are not always available in practice, thus limiting the applicability of these methods.
Beyond recurrent models, rank pooling has also proven effective to model activities in videos [10, 3, 9]. In this context, rank pooling has been used to compute a representation encoding the dynamics of the video, and the concept of Dynamic Images has been introduced to summarize the gist of a sequence.
In any event, whether based on CNNs, LSTMs or rank pooling, all of the above-mentioned methods compute one holistic representation of an image or a sequence. While this has the advantage of retaining information about the context of the action, it also makes these methods sensitive to the fact that context is not always reliable. Many actions can be performed in very different environments. In these cases, focusing on the action itself would therefore seem beneficial.
This, in essence, is the goal of methods based on the notion of actionness [40, 4, 16, 44, 37, 25]. Inspired by the concept of objectness [1, 36], commonly used in object detection, actionness aims at localizing the regions in a video where an action of interest occurs. One approach achieves this by exploiting appearance (RGB) and motion (optical flow) in a two-stream architecture; the resulting actionness map is then employed to generate action/bounding box proposals, which are subsequently classified within an action detection framework. The ActionTube approach follows a similar pipeline, but with a different proposal mechanism. More importantly, by focusing on the actions themselves, these methods throw away all the information about context. In many scenarios, however, such as recognizing different sports, context provides helpful information about the observed actions. Note also that extracting actionness in [40, 27] requires bounding box annotations as extra supervision during training, while our approach requires no additional annotations.
In short, while one class of methods models images in a global manner, and may thus be sensitive to context diversity, the other focuses solely on the action, and thus cannot benefit from context. Here, we introduce a novel multi-stage recurrent architecture that explicitly and effectively combines these two complementary information sources.
3 Long Short-Term Memory
Here, we briefly review the main building block of our architecture, the Long Short-Term Memory (LSTM) unit. An LSTM implements a memory cell that can maintain its state over time, and thus allows a recurrent network to remember long-term context, dependencies and relations.

An LSTM unit consists of three gates, an input gate $i_t$, an output gate $o_t$ and a forget gate $f_t$, together with a memory cell $c_t$. At each time step $t$, it receives two inputs, the current observation $x_t$ and the hidden representation $h_{t-1}$ from the previous time step. It first computes the activations of its gates, then updates its memory cell from $c_{t-1}$ to $c_t$, and finally outputs a hidden representation $h_t$. This corresponds to the updates
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c),$$
$$h_t = o_t \odot \tanh(c_t),$$
where $\sigma(\cdot)$ is the sigmoid function and $\odot$ denotes element-wise multiplication. The memory is updated via the input and forget gates: the input gate determines which new values, computed from the current observation, are written to the memory cell, while the forget gate controls which parts of the memory are erased. The output gate and the memory cell together determine the hidden representation. Because the cell update involves a summation over time, the gradients are distributed over these sums during backpropagation, which allows them to propagate for much longer before vanishing.
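These gate updates can be sketched in a few lines of numpy. This is a generic single-time-step LSTM, not the paper's implementation; the stacked parameter layout (`W`, `U`, `b` holding the four blocks in order) is a convention chosen here for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W (4d, n), U (4d, d), b (4d,) hold the stacked parameters of the
    input (i), forget (f), output (o) gates and the cell candidate (g).
    """
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # pre-activations, shape (4d,)
    i = sigmoid(z[0:d])                 # input gate: what to write
    f = sigmoid(z[d:2*d])               # forget gate: what to erase
    o = sigmoid(z[2*d:3*d])             # output gate: what to expose
    g = np.tanh(z[3*d:4*d])             # candidate cell values
    c = f * c_prev + i * g              # additive memory update
    h = o * np.tanh(c)                  # hidden representation
    return h, c
```

The additive form of the cell update `c = f * c_prev + i * g` is precisely what lets gradients flow over long time spans.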
4 Our Method
Our goal is to leverage both context-aware and action-aware features for action recognition and anticipation. To this end, we introduce a multi-stage recurrent architecture based on LSTMs, depicted by Fig. 1. In this section, we first discuss our approach to extracting both feature types, and then present our multi-stage recurrent network.
4.1 Feature Extraction
To extract context-aware and action-aware features, we introduce the two-stream architecture shown in Fig. 2. The first part of this network is shared by both streams and, up to conv5-2, corresponds to the VGG-16 network, pre-trained on ImageNet for object recognition. The output of this layer is connected to two sub-models: One for context-aware features and the other for action-aware ones. We then train these two sub-models for the same task of action recognition from a single image, using a cross-entropy loss function defined on the output of each stream. In practice, we found that training the entire model in an end-to-end manner did not yield a significant improvement over training only the two sub-models. In our experiments, we therefore opted for this latter strategy, which is less expensive computationally and memory-wise. Below, we first discuss the context-aware sub-network and then turn to the action-aware one.
4.1.1 Context-Aware Feature Extraction
The context-aware sub-model is similar to VGG-16 from conv5-3 up to the last fully-connected layer, with the number of units in the last fully-connected layer changed from 1000 (the original 1000-way ImageNet classification problem) to the number of activities $N$.
In essence, this sub-model focuses on extracting a deep representation of the whole scene for each activity and thus incorporates context. We then take the output of its fc7 layer as our context-aware features.
4.1.2 Action-Aware Feature Extraction
As mentioned before, the context of an action does not always correlate with the action itself. Our second sub-model therefore aims at extracting features that focus on the action itself. To this end, we draw inspiration from recent work on object classification and localization. At the core of this work lies the idea of Class Activation Maps (CAMs). In our context, a CAM indicates the regions in the input image that contribute most to predicting each class label. In other words, it provides information about the location of an action. Importantly, this is achieved without requiring any additional annotations.
More specifically, CAMs are extracted from the activations in the last convolutional layer in the following manner. Let $f_k(x, y)$ denote the activation of unit $k$ in the last convolutional layer at spatial location $(x, y)$. A score $S_c$ for each class $c$ can be obtained by performing global average pooling to obtain, for each unit $k$, a feature $F_k = \sum_{x,y} f_k(x, y)$, followed by a linear layer with weights $\{w_k^c\}$. That is, $S_c = \sum_k w_k^c F_k$. A CAM for class $c$ at location $(x, y)$ can then be computed as
$$M_c(x, y) = \sum_k w_k^c f_k(x, y),$$
which indicates the importance of the activations at location $(x, y)$ in the final score for class $c$.
Here, we propose to make use of the CAMs to extract action-aware features. To this end, we use the CAMs in conjunction with the output of the conv5-3 layer of the model. The intuition behind this is that conv5-3 extracts high-level features that provide a very rich representation of the image and typically correspond to the most discriminative parts of the object [2, 28], or, in our case, the action. Therefore, we incorporate a new layer into our sub-model, whose output for unit $k$ at location $(x, y)$ can be expressed as
$$g_k(x, y) = f_k(x, y) \times M_{c^*}(x, y),$$
where $c^* = \operatorname{argmax}_c S_c$ is the most probable class for the input image. As illustrated in Fig. 3, this new layer is then followed by fully-connected ones, and we take our action-aware features as the output of the corresponding fc7 layer.
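The CAM computation and the subsequent re-weighting of the conv features can be sketched as follows. This is our reading of the mechanism, not the paper's code; in particular, clipping the CAM at zero and normalising it to $[0, 1]$ are assumptions made here for numerical convenience.

```python
import numpy as np

def class_activation_map(features, weights, cls):
    """CAM for class `cls`: M_c(x, y) = sum_k w_k^c f_k(x, y).

    features: conv activations, shape (H, W, K)
    weights:  linear-layer weights after global average pooling, shape (K, C)
    """
    return features @ weights[:, cls]               # shape (H, W)

def action_aware_features(features, weights):
    """Weight conv activations by the CAM of the most probable class."""
    # global average pooling followed by the linear layer gives class scores
    scores = features.mean(axis=(0, 1)) @ weights
    cls = int(np.argmax(scores))                    # most probable class c*
    cam = class_activation_map(features, weights, cls)
    cam = np.maximum(cam, 0.0)                      # keep positive evidence only (assumption)
    cam = cam / (cam.max() + 1e-8)                  # normalise to [0, 1] (assumption)
    return features * cam[:, :, None]               # re-weighted feature map
```

The re-weighted map keeps the channel structure of conv5-3, so the fully-connected layers that follow need no architectural change.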
4.2 Sequence Learning for Action Recognition
To effectively combine the information contained in the context-aware and action-aware features described above, we design the novel multi-stage LSTM model depicted by Fig. 1. This model first focuses on the context-aware features, which encode global information about the entire image. It then combines the output of this first stage with our action-aware features to provide a refined class prediction.
To learn this model, we introduce a novel loss function motivated by the intuition that, while we would like the model to predict the correct class as early as possible in the sequence, some actions, such as running and high jump, are highly ambiguous after seeing only the first few frames. Ultimately, our network models long-range temporal information, and yields increasingly accurate predictions as it processes more frames. This therefore also provides us with an effective mechanism to forecast an action type given only limited input observations. Below, we discuss the two stages of our model.
4.2.1 Learning Context
The first stage of our model takes as input our context-aware features, and passes them through a layer of LSTM cells followed by a fully-connected layer that, via a softmax operation, outputs a probability for each action class. Let $\hat{y}_t(k)$ be the probability of class $k$ at time $t$ predicted by the first stage. We then define the loss for a single training sample as
$$\mathcal{L}_c(y, \hat{y}) = -\sum_{t=1}^{T} \sum_{k=1}^{N} \left[ y_t(k) \log \hat{y}_t(k) + \frac{t}{T} \left(1 - y_t(k)\right) \log\left(1 - \hat{y}_t(k)\right) \right],$$
where $y_t(k)$ encodes the true activity label at time $t$, i.e., $y_t(k) = 1$ if the sample belongs to class $k$ and 0 otherwise, and $T$ is the sequence length.
This loss function consists of two terms. The first one is standard and aims at penalizing false negative with the same strength at any point in time. By contrast, the second term focuses on false positives, and its strength increases linearly over time, to reach the same weight as that on false negatives. The motivation behind this loss can be explained as follows. Early in the sequence, there can easily be ambiguities between several actions, such as running and high jump. Therefore, false positives are bound to happen, and should not be penalized too strongly. As we see more frames, however, these false positives should be encouraged to disappear. By contrast, we would like to have a high score for the correct class as early as possible. This is taken care of by the first term, which penalizes false negatives, and whose relative weight over the second term is larger at the beginning of the sequence.
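The two terms of this loss can be sketched in numpy as below. This is our own sketch, assuming one-hot labels repeated over time; a framework implementation would express the same computation over mini-batches.

```python
import numpy as np

def stage_loss(y_true, y_pred, eps=1e-8):
    """Single-sample loss with a constant false-negative term and a
    linearly growing false-positive term.

    y_true: one-hot labels, shape (T, N)   (same label repeated over time)
    y_pred: per-frame class probabilities, shape (T, N)
    """
    T = y_true.shape[0]
    ramp = np.arange(1, T + 1)[:, None] / T                 # t/T, shape (T, 1)
    fn = y_true * np.log(y_pred + eps)                      # penalises false negatives
    fp = ramp * (1 - y_true) * np.log(1 - y_pred + eps)     # penalises false positives
    return -(fn + fp).sum()
```

At $t = T$ the ramp reaches 1, so false positives at the end of the sequence are penalised as strongly as false negatives.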
4.2.2 Learning Context and Action
The second stage of our model aims at combining context-aware and action-aware information. Its structure is the same as that of the first stage, i.e., a layer of LSTM cells followed by a fully-connected layer to output class probabilities via a softmax operation. However, its input merges the output of the first stage with our action-aware features. This is achieved by concatenating the hidden activations of the LSTM layer with our action-aware features. We then make use of the same loss function as before, but defined on the final prediction. This can be expressed as
$$\mathcal{L}_a(y, \tilde{y}) = -\sum_{t=1}^{T} \sum_{k=1}^{N} \left[ y_t(k) \log \tilde{y}_t(k) + \frac{t}{T} \left(1 - y_t(k)\right) \log\left(1 - \tilde{y}_t(k)\right) \right],$$
where $\tilde{y}_t(k)$ is the probability of class $k$ at time $t$ predicted by the second stage.
The overall loss of our model can then be written as
$$\mathcal{L} = \mathcal{L}_c(y, \hat{y}) + \mathcal{L}_a(y, \tilde{y}).$$
To learn the parameters of our model, we then average this loss over the training samples in a mini-batch.
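The two-stage recurrence described above can be sketched as follows. For brevity we use a simple tanh recurrent cell where the paper uses LSTM units; the staging logic, stage 1 seeing only context features and stage 2 seeing the concatenation of the stage-1 hidden state with the action-aware features, is the point of the sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def ms_rnn_forward(context, action, params):
    """Two-stage recurrence over a sequence.

    context, action: feature sequences, shapes (T, d_c) and (T, d_a).
    params: dict with recurrent weights W*, U* and readout weights V*.
    """
    W1, U1, V1 = params["W1"], params["U1"], params["V1"]
    W2, U2, V2 = params["W2"], params["U2"], params["V2"]
    h1 = np.zeros(U1.shape[0]); h2 = np.zeros(U2.shape[0])
    probs1, probs2 = [], []
    for x_c, x_a in zip(context, action):
        h1 = np.tanh(W1 @ x_c + U1 @ h1)        # stage 1: context only
        probs1.append(softmax(V1 @ h1))         # stage-1 class probabilities
        x2 = np.concatenate([h1, x_a])          # merge with action-aware features
        h2 = np.tanh(W2 @ x2 + U2 @ h2)         # stage 2: refinement
        probs2.append(softmax(V2 @ h2))         # final class probabilities
    return np.array(probs1), np.array(probs2)
```

Both stages emit a probability vector at every time step, which is what the per-frame loss above operates on.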
At inference time, the input RGB frames are forward-propagated through this model. We therefore obtain a probability for each class at each frame. While one could simply take the probabilities in the last frame to obtain the class label via an arg max operation, we propose to increase robustness by leveraging the predictions of all the frames. To this end, we make use of an average pooling of these predictions over time.
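This pooling step amounts to the following one-liner, shown here as a minimal sketch:

```python
import numpy as np

def predict_label(frame_probs):
    """Average per-frame class probabilities over time, then take the
    arg max, rather than trusting the last frame alone.

    frame_probs: shape (T, N), one probability vector per frame.
    """
    return int(np.argmax(frame_probs.mean(axis=0)))
```

Averaging smooths out frames where the model momentarily favours a wrong class.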
5 Experiments
In this section, we first compare our method with state-of-the-art techniques on the task of action recognition, and then analyze various aspects of our model. For our experiments, we make use of the standard UCF-101 and JHMDB-21 benchmarks. The UCF-101 dataset consists of 13,320 videos of 101 action classes, covering a broad set of activities such as sports, musical instruments, and human-object interaction, with an average length of 7.2 seconds. UCF-101 offers large diversity in terms of actions and, with its large variations in camera motion, cluttered backgrounds, illumination conditions, etc., is one of the most challenging datasets. The JHMDB-21 dataset is another challenging dataset of realistic videos from various sources, such as movies and web videos, containing 928 videos and 21 action classes.
To fine-tune the network on these datasets, we used a number of data augmentation techniques, so as to reduce the effect of over-fitting. The input images were randomly flipped horizontally and rotated by a random amount in the range -8 to 8 degrees. We then extracted crops according to the following procedure:
Compute the maximum cropping rectangle with a given aspect ratio that fits within the input image.
Scale the width and height of the cropping rectangle by a factor randomly selected within a fixed range.
Select a random location for the cropping rectangle within the original input image and extract that subimage.
Scale the subimage to the network input size.
After these geometric transformations, we further applied RGB channel shifting, followed by randomly adjusting image brightness, contrast and saturation.
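The cropping procedure above can be sketched as below. The aspect ratio and scale range are placeholder defaults, not the paper's values, and the final resize to the network input size is omitted to keep the sketch dependency-free.

```python
import numpy as np

def random_crop(image, aspect_ratio=1.0, min_scale=0.8, max_scale=1.0, rng=None):
    """Random crop following the three geometric steps of the procedure.

    image: array of shape (H, W, 3).
    """
    if rng is None:
        rng = np.random.default_rng()
    H, W = image.shape[:2]
    # 1. largest rectangle with the given aspect ratio that fits the image
    if W / H > aspect_ratio:
        ch, cw = H, int(H * aspect_ratio)
    else:
        cw, ch = W, int(W / aspect_ratio)
    # 2. scale it by a random factor
    s = rng.uniform(min_scale, max_scale)
    ch, cw = max(1, int(ch * s)), max(1, int(cw * s))
    # 3. place it at a random location and extract the sub-image
    y = int(rng.integers(0, H - ch + 1))
    x = int(rng.integers(0, W - cw + 1))
    return image[y:y + ch, x:x + cw]
```

Horizontal flips, small rotations and the photometric jittering described above would be applied around this crop in a full pipeline.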
The parameters of the CNN were found by using stochastic gradient descent with a fixed learning rate of 0.001, a momentum of 0.9, a weight decay of 0.0005, and mini-batches of size 32. To train our LSTMs, we similarly used stochastic gradient descent, but with a fixed learning rate of 0.01 and a momentum of 0.9 with mini-batch size of 32. To implement our method, we use Python and Keras.
5.1 Comparison with the State-of-the-Art
In Tables 1 and 2, we compare the results of our approach to state-of-the-art methods on UCF-101 and JHMDB-21, respectively, by reporting the average accuracy over the given three training and testing partitions. For this comparison to be fair, we only report the results of the baselines that do not use any other information than the RGB image and the activity label. In other words, while it has been shown that additional, hand-crafted features, such as dense trajectories and optical flow, can help improve accuracy [38, 39, 21, 29, 3], our goal here is to truly evaluate the benefits of our method, not of these features. Note, however, that, as discussed in Section 5.2.5, our approach can still benefit from such features. As can be seen from the tables, our approach outperforms all these baselines on both datasets.
Table 1: Comparison on UCF-101 (RGB-only methods).

| Method | Accuracy |
|---|---|
| Dynamic Image Network | 70.0% |
| Dynamic Image Network + Static RGB | 76.9% |
| Rank Pooling | 72.2% |
| Discriminative Hierarchical Ranking | 78.8% |
| Realtime Action Recognition | 74.4% |
| Spatial Stream Network | 73.0% |
| Deep Network | 65.4% |
| ConvPool (Single frame) | 73.3% |
| ConvPool (30 frames) | 80.8% |
| ConvPool (120 frames) | 82.6% |
| Ours (pLGL, AvgPool, 2048 units) | 83.3% |
| Improvement over state of the art | +0.7% |
Table 2: Comparison on JHMDB-21 (RGB-only methods).

| Method | Accuracy |
|---|---|
| Full Method | 53.3% |
| Actionness + Full Method | 56.4% |
| Ours (pLGL, AvgPool, 2048 units) | 58.3% |
| Improvement over state of the art | +1.9% |
5.2 Analysis
In this section, we analyze several aspects of our method in more detail, such as the importance of each feature type, the influence of the LSTM architecture, and the influence of the number and order of the input frames. Finally, we show the effectiveness of our approach at tackling the task of action anticipation, and study how optical flow can be employed to further improve our results. All the analytical experiments were conducted on the first split of the UCF-101 dataset.
In the following analysis, we also evaluate the effectiveness of different losses. In particular, we make use of the standard cross-entropy (CE) loss, which only accounts for one activity label for each sequence (the activity label at time $T$). This loss can be expressed as
$$\mathcal{L}_{CE}(y, \hat{y}) = -\sum_{k=1}^{N} y_T(k) \log \hat{y}_T(k).$$
In previous work, an exponentially growing loss (EGL) was proposed to penalize errors more strongly as more frames of the sequence are observed. This loss can be written as
$$\mathcal{L}_{EGL}(y, \hat{y}) = -\sum_{t=1}^{T} e^{-(T - t)} \sum_{k=1}^{N} y_t(k) \log \hat{y}_t(k).$$
The main drawback of this loss comes from the fact that it does not strongly encourage the model to make correct predictions as early as possible. To address this issue, we introduce a linearly growing loss (LGL). This loss is defined as
$$\mathcal{L}_{LGL}(y, \hat{y}) = -\sum_{t=1}^{T} \frac{t}{T} \sum_{k=1}^{N} y_t(k) \log \hat{y}_t(k).$$
As shown in the experimental analysis, the linearity of this loss makes it more effective than the EGL. Our new loss, discussed in Section 4.2 and denoted by pLGL below, also makes use of a linearly-increasing term. This term, however, corresponds to the false positives, as opposed to the false negatives in the LGL. Since some actions are ambiguous in the first few frames, we find it more intuitive not to penalize false positives too strongly at the beginning of the sequence. Our results below seem to support this, since, for a given model, our loss typically yields higher accuracies than the LGL.
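The key difference between these losses is the weight each puts on the error at frame $t$. A small sketch of the three schedules, under the same notation as the equations above (the EGL weight $e^{-(T-t)}$ follows our reconstruction of that loss):

```python
import numpy as np

def time_weights(T, kind):
    """Per-frame weight on the classification error under each loss.

    CE  : only the final frame contributes.
    EGL : exponentially growing weight e^{-(T - t)}.
    LGL : linearly growing weight t / T.
    """
    t = np.arange(1, T + 1)
    if kind == "CE":
        return (t == T).astype(float)
    if kind == "EGL":
        return np.exp(-(T - t).astype(float))
    if kind == "LGL":
        return t / T
    raise ValueError(kind)
```

The exponential schedule puts almost no weight on early frames, which is why it does little to encourage early correct predictions, whereas the linear schedule keeps early frames relevant.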
5.2.1 Importance of the Feature Types
We first evaluate the importance of the different feature types, context-aware and action-aware, on recognition accuracy. To this end, we compare models trained using each feature type individually with our model that uses them jointly. For all models, we made use of LSTMs with 1024 units. Recall that our approach relies on a multi-stage LSTM, which we denote by MS-LSTM. The results of this experiment for different losses are reported in Table 3. These results clearly evidence the importance of using both feature types, which consistently outperforms using either one individually in all settings.
5.2.2 Influence of the LSTM Architecture
Our second set of experiments studies the importance of using a multi-stage LSTM architecture and the influence of the number of units in our MS-LSTM. For the first one of these analyses, we compare our MS-LSTM with a single-stage LSTM that takes as input the concatenation of our context-aware and action-aware features. Furthermore, to study the importance of the feature order, we also compare our approach with a model that first processes the action-aware features and, in a second stage, combines them with the context-aware ones. Finally, we also evaluate the use of two parallel LSTMs whose outputs are merged by concatenation and then fed to a dense layer distributed over time. The results of this comparison are provided in Table 4. Note that both multi-stage LSTMs outperform the single-stage one and the two parallel LSTMs, thus indicating the importance of treating the two types of features sequentially. Interestingly, processing context-aware features first, as we propose, yields higher accuracy than considering the action-aware ones at the beginning. This matches our intuition that context-aware features carry global information about the image and will thus yield noisy results, which can then be refined by exploiting the action-aware features.
Table 4 (excerpt): 2 Parallel LSTMs (pLGL): 78.63%.
Based on our experiments, we found that for large datasets such as UCF-101, the 512 hidden units that some baselines use (e.g., [7, 34]) do not suffice to capture the complexity of the data. Therefore, to study the influence of the number of units in the LSTM, we evaluated different versions of our model with 512, 1024, 2048, and 4096 hidden units, trained on 80% of the training data and validated on the remaining 20%. For a single LSTM, we found that using 1024 hidden units performs best. For our multi-stage LSTM, using 2048 hidden units yields the best results. We also evaluated the importance of relying on average pooling in the LSTM. The results of these different versions of our MS-LSTM framework are provided in Table 5. They show that, typically, more hidden units and average pooling improve accuracy slightly.
5.2.3 Influence of the Input Frames
As mentioned before, our model makes use of $K$ frames as input. Here, we therefore study the influence of the value of $K$ and of the choice of frames on our results. To this end, we varied $K$ over a range of values. For each value, we then fed either the first $K$ frames of the sequence, or $K$ randomly sampled ones, to our model at both training and inference time. The results for the different values of $K$ and frame-selection strategies are shown in Table 6. In essence, beyond a certain value of $K$, the gap between the different results becomes very small. These results also show that, while slightly better accuracies can be obtained by random sampling, using the first $K$ frames remains reliable, particularly for larger $K$. This will be particularly important for action anticipation, where one only has access to the first frames.
Table 6 columns: Setup, K, First K frames, Sampled K frames.
5.2.4 Action Anticipation
Unlike action recognition, where the full sequence can be employed to predict the activity type, action anticipation aims at providing a class prediction as early as possible. In Fig. 4, we plot the accuracy of our model as a function of the number of observed frames for different losses, and with/without average pooling over time of the softmax probabilities. In this experiment, all the models were trained from sequences of $K$ frames. These plots clearly show the importance of using average pooling, which provides more robustness to the prediction. More importantly, they also evidence the benefits of our novel loss, which was designed to encourage correct prediction as early as possible, over other losses for action anticipation.
5.2.5 Exploiting Optical Flow
In the past, several methods have proposed to rely on optical flow to encode temporal information [41, 29, 8, 45, 34]. Here, we show that our approach can also benefit from this additional source of information. To extract optical flow features, we made use of a pre-trained temporal-stream network. We then computed the CNN features from a stack of 20 optical flow frames (10 frames in the X-direction and 10 frames in the Y-direction) at each time $t$. As these features are loosely related to the action (by focusing on motion), we merge them with the input to the second stage of our multi-stage LSTM. In Table 7, we compare the results of our modified approach with state-of-the-art methods that also exploit optical flow. Note that we consistently outperform these baselines, with the exception of those incorporating optical flow as input to a two-stream network, thus learning how to exploit it and how to fuse it with RGB information. Designing an architecture that jointly leverages these motion-aware features with our context- and action-aware ones will be the topic of our future research.
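Building the 20-channel flow input can be sketched as below. That the temporal window ends at $t$ (rather than being centred on it) is an assumption of this sketch.

```python
import numpy as np

def flow_stack(flow_x, flow_y, t, L=10):
    """Stack the horizontal and vertical flow of the L most recent frames
    into a single (H, W, 2L) input for the temporal-stream network.

    flow_x, flow_y: sequences of (H, W) flow fields, one per frame.
    """
    window = list(flow_x[t - L + 1:t + 1]) + list(flow_y[t - L + 1:t + 1])
    return np.stack(window, axis=-1)    # shape (H, W, 2L)
```

The resulting features are then concatenated with the stage-2 input, in the same way as the action-aware features.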
Table 7: Comparison with methods exploiting optical flow on UCF-101.

| Method | Accuracy |
|---|---|
| Spatio-temporal ConvNet | 65.4% |
| Two-Stream ConvNet | 88.0% |
| VLAD3 + Optical Flow | 84.1% |
| Two-Stream Conv. Pooling | 88.2% |
| CNN features + Optical Flow | 73.9% |
| ConvPool (30 frames) + Optical Flow | 87.6% |
| ConvPool (120 frames) + Optical Flow | 88.2% |
| Two-Stream Net Fusion | 92.5% |
| Ours (pLGL, AvgPool, 2048 units) + Optical Flow | 88.7% |
In this paper, we have proposed to leverage both context- and action-aware features for action recognition and anticipation. The first type of features provides a global representation of the scene, but may suffer from the fact that some actions can occur in diverse contexts. By contrast, the second feature type focuses on the action itself, and thus cannot leverage the information provided by the context. We have therefore introduced a multi-stage LSTM architecture that effectively combines these two sources of information. Furthermore, we have designed a novel loss function that encourages our model to make correct predictions as early as possible in the input sequence, thus making it particularly well-suited to action anticipation. Our experiments on two standard benchmarks have evidenced the importance of our feature combination scheme and of our loss function. Among the methods that rely only on RGB as input, our approach yields state-of-the-art action recognition accuracy. In the future, we intend to study new ways to incorporate additional sources of information, such as dense trajectories and human skeletons, in our framework.
-  B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012.
-  G. Bertasius, J. Shi, and L. Torresani. DeepEdge: A multi-scale bifurcated deep network for top-down contour detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4380–4389, 2015.
-  H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould. Dynamic image networks for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  W. Chen, C. Xiong, R. Xu, and J. J. Corso. Actionness ranking with lattice conditional ordinal random fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 748–755, 2014.
-  F. Chollet. keras. https://github.com/fchollet/keras, 2015.
-  A. Diba, A. M. Pazandeh, H. Pirsiavash, and L. Van Gool. DeepCAMP: Deep convolutional action & attribute mid-level patterns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
-  C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  B. Fernando, P. Anderson, M. Hutter, and S. Gould. Discriminative hierarchical rank pooling for activity recognition. In Proc. CVPR, 2016.
-  B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuytelaars. Rank pooling for action recognition. 2016.
-  R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
-  G. Gkioxari and J. Malik. Finding action tubes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 759–768, 2015.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
-  A. Jain, A. Singh, H. S. Koppula, S. Soh, and A. Saxena. Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3118–3125. IEEE, 2016.
-  M. Jain, J. Van Gemert, H. Jégou, P. Bouthemy, and C. G. Snoek. Action localization with tubelets from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 740–747, 2014.
-  H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 3192–3199, 2013.
-  A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
-  I. Kviatkovsky, E. Rivlin, and I. Shimshoni. Online action recognition using covariance of shape and motion. Computer Vision and Image Understanding, 129:15–26, 2014.
-  I. Laptev. On space-time interest points. International Journal of Computer Vision, 64(2-3):107–123, 2005.
-  Y. Li, W. Li, V. Mahadevan, and N. Vasconcelos. VLAD3: Encoding dynamics of deep features for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
-  J. Liu, A. Shahroudy, D. Xu, and G. Wang. Spatio-temporal lstm with trust gates for 3d human action recognition. In European Conference on Computer Vision, pages 816–833. Springer, 2016.
-  B. Mahasseni and S. Todorovic. Regularizing long short term memory with 3d human-skeleton sequences for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
-  D. Oneata, J. Revaud, J. Verbeek, and C. Schmid. Spatio-temporal object detection proposals. In European Conference on Computer Vision, pages 737–752. Springer, 2014.
-  X. Peng and C. Schmid. Multi-region two-stream r-cnn for action detection. In European Conference on Computer Vision, pages 744–759. Springer, 2016.
-  S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin. Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529, 2016.
-  F. Saleh, M. S. Aliakbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez. Built-in foreground/background prior for weakly-supervised semantic segmentation. In European Conference on Computer Vision, pages 413–432. Springer, 2016.
-  K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
-  N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. arXiv preprint arXiv:1502.04681, 2015.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015.
-  J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
-  M. Van den Bergh, G. Roig, X. Boix, S. Manen, and L. Van Gool. Online video seeds for temporal window objectness. In Proceedings of the IEEE International Conference on Computer Vision, pages 377–384, 2013.
-  H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
-  L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4305–4314, 2015.
-  L. Wang, Y. Qiao, X. Tang, and L. Van Gool. Actionness estimation using hybrid fully convolutional networks. arXiv preprint arXiv:1604.07279, 2016.
-  L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
-  X. Wang, A. Farhadi, and A. Gupta. Actions ~ transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun. Deep image: Scaling up image recognition. arXiv preprint arXiv:1501.02876, 2015.
-  G. Yu and J. Yuan. Fast action proposals for human action detection and search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1302–1311, 2015.
-  J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.
-  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.
-  B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang. Real-time action recognition with enhanced motion vector cnns. arXiv preprint arXiv:1604.07669, 2016.
-  B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.